Clumpify Duplicate Removal Guide
How to remove PCR duplicates while protecting coincidental duplicates in quantitative experiments
Important: Scope of Recommendations
These recommendations apply to randomly-sheared paired-end data. For fixed-length single-end data or amplicon sequencing, different deduplication strategies may be more appropriate.
Overview
Duplicate removal is critical for accurate genomic analysis, but one size does NOT fit all. Incorrect deduplication strategies can:
- Reduce quantitative accuracy by removing valid reads in RNA-seq experiments
- Compromise variant calling by allowing PCR errors to appear as real mutations
- Inflate assembly resource usage and reduce contiguity
This guide provides evidence-based recommendations for choosing the correct Clumpify parameters based on your experimental design.
Goal
Remove true PCR duplicates (clones) while protecting coincidental duplicates (identical-but-real reads). This distinction is important for quantitative experiments like RNA-seq expression measurement, where removing the wrong reads can invalidate your results.
Key Terms
- UMI: Unique Molecular Identifiers (molecular barcodes)
- Quantitative: Experiment where read counts matter (e.g., RNA-seq expression, DAP-seq, ChIP-seq)
- PCR: PCR amplification of the library
- optical: Clumpify mode that only removes duplicates physically nearby on the flowcell
- dupedist: Maximum pixel distance for optical duplicate detection (platform-specific, see table below)
- spany: Boolean flag (spany=t) for optical duplicate detection across different tiles along the Y-axis (required for NextSeq tile-edge duplicates)
Platform-Specific Optical Duplicate Distance
| Sequencing Platform | Recommended Parameter |
|---|---|
| NextSeq | dupedist=40 spany=t |
| HiSeq 1T | dupedist=40 |
| HiSeq 2500 | dupedist=40 |
| HiSeq 3000/4000 | dupedist=2500 |
| NovaSeq 6000 | dupedist=12000 |
| NovaSeqX+ | dupedist=50 |
| Other/Unknown | dupedist=40 (conservative default) |
Note: NextSeq requires the spany=t flag to detect tile-edge duplicates along the Y-axis.
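The platform table can be expressed as a small lookup, which is handy when one pipeline handles data from several sequencers. This is only a sketch: the function name dupedist_params and the platform labels it accepts are illustrative assumptions, not part of Clumpify.

```shell
#!/bin/sh
# Map a platform label to the optical-duplicate parameters from the table above.
# The function name and the accepted labels are hypothetical conveniences.
dupedist_params() {
    case "$1" in
        NextSeq)              echo "dupedist=40 spany=t" ;;
        HiSeq1T|HiSeq2500)    echo "dupedist=40" ;;
        HiSeq3000|HiSeq4000)  echo "dupedist=2500" ;;
        NovaSeq6000)          echo "dupedist=12000" ;;
        "NovaSeqX+")          echo "dupedist=50" ;;
        *)                    echo "dupedist=40" ;;  # Other/Unknown: conservative default
    esac
}

dupedist_params NovaSeq6000   # prints: dupedist=12000
```

The result can be appended to an optical-mode command, for example: clumpify.sh in=reads.fq.gz out=deduped.fq.gz optical $(dupedist_params NextSeq)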
Recommended Parameters by Library Type
| UMI | Quantitative | PCR | Recommended Clumpify Parameters |
|---|---|---|---|
| Yes | Yes | Yes | umi umisubs=2 |
| Yes | Yes | No | umi umisubs=2 optical dupedist=<platform> |
| Yes | No | Yes | umi umisubs=3 |
| Yes | No | No | umi umisubs=3 optical dupedist=<platform> |
| No | Yes | Yes | Failed experiment design - do not do this |
| No | Yes | No | No duplicate removal |
| No | No | Yes | optical dupedist=<platform> |
| No | No | No | optical dupedist=<platform> |
Note: For quantitative experiments, umisubs=2 provides stricter UMI matching to maximize protection of coincidental duplicates. For non-quantitative experiments, umisubs=3 allows more tolerance. Platform-specific dupedist values (see table above) should be specified when using optical mode.
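The decision table above can also be written as a shell function, which makes the unsupported combination fail loudly instead of silently running a default. This is a minimal sketch: the function name recommend_params and its yes/no argument convention are assumptions made for illustration, and the fourth argument is whatever platform-specific dupedist string applies to your sequencer.

```shell
#!/bin/sh
# Usage: recommend_params UMI QUANTITATIVE PCR OPTICAL_PARAMS
# UMI, QUANTITATIVE and PCR are "yes" or "no"; OPTICAL_PARAMS is e.g. "dupedist=50".
# Prints the recommended Clumpify parameters from the table above.
recommend_params() {
    case "$1,$2,$3" in
        yes,yes,yes) echo "umi umisubs=2" ;;
        yes,yes,no)  echo "umi umisubs=2 optical $4" ;;
        yes,no,yes)  echo "umi umisubs=3" ;;
        yes,no,no)   echo "umi umisubs=3 optical $4" ;;
        no,yes,yes)  echo "failed experiment design: quantitative + PCR without UMIs" >&2
                     return 1 ;;
        no,yes,no)   : ;;  # no duplicate removal: print nothing
        no,no,yes|no,no,no) echo "optical $4" ;;
        *)           echo "arguments must be yes or no" >&2
                     return 2 ;;
    esac
}

recommend_params yes yes no "dupedist=50"   # prints: umi umisubs=2 optical dupedist=50
```

An empty result means the table recommends skipping duplicate removal entirely, so a calling script should check for that before invoking clumpify.sh.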
Why One Setting Doesn't Work for All Projects
For a randomly-fragmented, high-coverage genome assembly, the error rates from suboptimal deduplication may seem acceptable:
- 0.81% False Positive Rate (FPR): Incorrectly removing unique reads
- 3.3% False Negative Rate (FNR): Failing to remove real PCR clones
However, these error rates can significantly impact accuracy in other common experiments.
1. The False Positive Problem (Critical for Quantitative Data)
What It Is
A "false positive" duplicate is when the pipeline removes a unique, original read, misidentifying it as a clone.
Why it becomes problematic: This 0.81% error rate can increase substantially in quantitative experiments such as:
- Transcriptome expression (RNA-seq)
- Population abundance studies
- DAP-seq
- ChIP-seq
The error rate also increases in experiments that are non-randomly fragmented (viral studies, transcriptomics), because the source molecules are already short before shearing, which constrains the possible start/stop coordinates.
The Cause: In an RNA-seq library, thousands of different molecules can legitimately start at the exact same position - the start of a gene, or anywhere in highly-expressed transcripts.
The Result: A deduplicator without UMIs identifies all these valid reads as "duplicates" based on identical start/stop positions and removes them. This reduces quantitative accuracy, making highly-expressed genes appear to have lower expression. The depth reduction is difficult to correct for, as it's a complex function of insert size, gene length, transcript age, and expression level.
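To get a feel for how quickly position-based deduplication saturates, the sketch below uses the standard occupancy estimate: if N reads each land independently on one of M equally likely start/stop coordinate pairs, the expected number of distinct pairs is M * (1 - (1 - 1/M)^N). The uniform model and the example values of M and N are simplifying assumptions for illustration, not measurements from this guide.

```shell
#!/bin/sh
# Usage: expected_unique M N
# M = number of possible start/stop coordinate pairs (hypothetical)
# N = number of genuinely distinct molecules sequenced (hypothetical)
# Prints the expected number of reads that survive position-only deduplication.
expected_unique() {
    awk -v m="$1" -v n="$2" 'BEGIN { printf "%.0f\n", m * (1 - (1 - 1/m)^n) }'
}

expected_unique 1000 100     # low expression: prints 95, so about 5 real reads are lost
expected_unique 1000 10000   # high expression: prints 1000, so about 9000 real reads are lost
```

Under this toy model the lowly expressed transcript keeps about 95% of its reads while the highly expressed one keeps about 10%, which is the compression of expression differences described above.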
2. The False Negative Problem (Critical for Assembly & Variant Calling)
What It Is
A "false negative" is when the pipeline fails to identify and remove a real PCR clone, leaving it in the data.
Impact on Metagenome Assemblies:
Kmer-depth filtering separates real organisms from sequencing noise. Undetected clones (false negatives) make noise from unassembleable organisms with less than 1x depth (often the majority of the library) appear as useful signal. This:
- Increases memory usage during assembly
- Populates final output with short, low-quality contigs
- Reduces error-correction accuracy
- Creates spurious branches on the assembly graph that reduce contiguity
Impact on Variant Calling:
A single PCR error can create a spurious variant. If that read is then amplified 100 times and the clones remain undetected, the variant caller will observe strong supporting evidence and call a mutation that was never present in the original sample.
This is especially problematic when detecting:
- Somatic mutations in cancer genomics
- Rare variants in population studies
- Low-depth variants in any context
- Variants in polyploids
Note: This concern is less relevant for high-depth, haploid isolates but critical for the applications listed above.
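The variant-calling scenario above can be put in numbers. The sketch below assumes a site covered by 200 original molecules (a hypothetical depth), one of which carries a PCR error, and compares the apparent allele fraction when that molecule is represented by 100 copies versus a single copy after deduplication. The function name alt_fraction is illustrative.

```shell
#!/bin/sh
# Usage: alt_fraction DEPTH COPIES
# DEPTH  = original molecules covering the site, one of which carries the error
# COPIES = number of reads derived from the erroneous molecule (1 = fully deduplicated)
# Prints the apparent alternate-allele fraction as a percentage.
alt_fraction() {
    awk -v d="$1" -v c="$2" 'BEGIN { printf "%.1f\n", 100 * c / (d - 1 + c) }'
}

alt_fraction 200 100   # clones left in place: prints 33.4
alt_fraction 200 1     # clones removed:       prints 0.5
```

At 33.4% the error looks like a well-supported variant; at 0.5% it is indistinguishable from ordinary sequencing noise.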
Supporting Data
These recommendations are based on comprehensive testing with PCR-free, ~1500x depth, 2x151bp randomly-fragmented paired-end sequencing of a Pedobacter heparinus bacterial isolate with 9bp UMIs on NovaSeqX+. Key findings:
Why UMIs Are Vital
Full deduplication (ignoring UMIs) found 26.648% duplicates. The same scan with UMI awareness (allowing 3 mismatches) found only 25.837% duplicates.
Conclusion: 0.81% of reads were "coincidental duplicates" (identical sequences with different UMIs) that would have been incorrectly removed without UMI checking.
Why Optical-Only Isn't Always Sufficient
Optical-only mode found 22.542% duplicates, about 3.3 percentage points fewer than the full UMI-aware scan (25.837%).
Conclusion: Optical-only detection misses many true clonal duplicates that aren't physically nearby on the flowcell, making it less effective for non-quantitative PCR libraries where complete duplicate removal is desired.
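Both headline error rates quoted in this guide are simple differences between the duplicate percentages reported above. The snippet below reproduces the subtraction; the function name headline_rates is only a label for this example.

```shell
#!/bin/sh
# Recompute the two headline figures from the duplicate percentages in this section.
headline_rates() {
    awk 'BEGIN {
        no_umi  = 26.648   # full deduplication, UMIs ignored
        umi3    = 25.837   # full deduplication, UMI-aware, 3 mismatches allowed
        optical = 22.542   # optical-only mode
        printf "coincidental duplicates protected by UMIs: %.3f\n", no_umi - umi3
        printf "duplicates missed by optical-only mode: %.3f\n", umi3 - optical
    }'
}

headline_rates
# prints:
# coincidental duplicates protected by UMIs: 0.811
# duplicates missed by optical-only mode: 3.295
```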
Detailed Test Results
| Run Type | UMI Subs | Seq Subs | Optical | Duplicates % |
|---|---|---|---|---|
| No-UMI | N/A | 3 | No | 26.648% |
| No-UMI | N/A | 3 | Yes | 22.542% |
| UMI | 0 | 3 | No | 24.248% |
| UMI | 1 | 3 | No | 25.766% |
| UMI | 2 | 3 | No | 25.821% |
| UMI | 3 | 3 | No | 25.837% |
| UMI | 4 | 3 | No | 25.868% |
| UMI | 5 | 3 | No | 25.954% |
| UMI | 6 | 3 | No | 26.127% |
| UMI | 8 | 3 | No | 26.575% |
| UMI | 9 | 3 | No | 26.648% |
| UMI | 0 | 3 | Yes | 21.371% |
| UMI | 1 | 3 | Yes | 22.486% |
| UMI | 3 | 3 | Yes | 22.536% |
| UMI | 4 | 3 | Yes | 22.539% |
| UMI | 5 | 3 | Yes | 22.540% |
| UMI | 6 | 3 | Yes | 22.541% |
| UMI | 7 | 3 | Yes | 22.542% |
| UMI | 8 | 3 | Yes | 22.542% |
| UMI | 9 | 3 | Yes | 22.542% |
Note: The umisubs=3 rows correspond to the recommended setting for non-quantitative applications with 9bp UMIs. For quantitative experiments, use umisubs=1 or 2 (stricter matching) to maximize protection of coincidental duplicates.
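One way to read the non-optical rows of the table is to subtract each UMI-aware result from the No-UMI baseline (26.648%), giving the percentage of reads that UMI checking keeps at each umisubs setting. The sketch below does that arithmetic on the table values; the function name kept_by_umi is illustrative.

```shell
#!/bin/sh
# For each non-optical UMI row (umisubs, duplicates %), print how many percentage
# points of reads are kept relative to the No-UMI baseline of 26.648%.
kept_by_umi() {
    awk -v baseline=26.648 '{ printf "umisubs=%s kept=%.3f\n", $1, baseline - $2 }' <<'EOF'
0 24.248
1 25.766
2 25.821
3 25.837
4 25.868
5 25.954
6 26.127
8 26.575
9 26.648
EOF
}

kept_by_umi
```

The kept fraction drops sharply from umisubs=0 (2.400) to umisubs=1 (0.882) and then changes little through umisubs=3 (0.811). One plausible reading, offered here as an interpretation rather than a result stated by the test data, is that exact UMI matching retains true clones whose UMIs contain sequencing errors, while settings of 1 to 3 mostly retain genuine coincidental duplicates.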
Usage Examples
Note: The following examples use dupedist=50 for NovaSeqX+. This value is platform-specific - see the platform table above for appropriate values for other sequencers.
RNA-seq with UMIs (PCR-amplified)
clumpify.sh in=reads.fq.gz out=deduped.fq.gz umi umisubs=2
RNA-seq with UMIs (PCR-free)
clumpify.sh in=reads.fq.gz out=deduped.fq.gz umi umisubs=2 optical dupedist=50
Genome Assembly without UMIs
clumpify.sh in=reads.fq.gz out=deduped.fq.gz optical dupedist=50
NextSeq Data
clumpify.sh in=reads.fq.gz out=deduped.fq.gz optical dupedist=40 spany=t
Note: NextSeq requires the spany=t flag to detect tile-edge duplicates along the Y-axis.
Cancer Variant Calling with UMIs (non-quantitative)
clumpify.sh in=tumor.fq.gz out=deduped.fq.gz umi umisubs=3
Metagenome Assembly with UMIs
clumpify.sh in=meta.fq.gz out=deduped.fq.gz umi umisubs=3
Additional Resources
- Clumpify Documentation - Complete parameter reference
- All BBTools - Browse the complete BBTools suite
- GitHub Repository - Source code and issue tracker