Clumpify Duplicate Removal Guide

Overview

Duplicate removal is critical for accurate genomic analysis, but one size does NOT fit all. Incorrect deduplication strategies can:

Reduce quantitative accuracy by removing valid reads in RNA-seq experiments
Compromise variant calling by allowing PCR errors to appear as real mutations
Inflate assembly resource usage and reduce contiguity

This guide provides evidence-based recommendations for choosing the correct Clumpify parameters based on your experimental design.

Goal

Remove true PCR duplicates (clones) while protecting coincidental duplicates (identical-but-real reads). This distinction is important for quantitative experiments like RNA-seq expression measurement, where removing the wrong reads can invalidate your results.

Key Terms

UMI: Unique Molecular Identifiers (molecular barcodes)
Quantitative: Experiment where read counts matter (e.g., RNA-seq expression, DAP-seq, ChIP-seq)
PCR: Whether the library was PCR-amplified (as opposed to PCR-free)
optical: Clumpify mode that only removes duplicates physically nearby on the flowcell
dupedist: Maximum pixel distance for optical duplicate detection (platform-specific, see table below)
spany: Boolean flag (spany=t) for optical duplicate detection across different tiles along the Y-axis (required for NextSeq tile-edge duplicates)

Platform-Specific Optical Duplicate Distance

Sequencing Platform	Recommended Parameter
NextSeq	`dupedist=40 spany=t`
HiSeq 1T	`dupedist=40`
HiSeq 2500	`dupedist=40`
HiSeq 3000/4000	`dupedist=2500`
NovaSeq 6000	`dupedist=12000`
NovaSeqX+	`dupedist=50`
Other/Unknown	`dupedist=40` (conservative default)

Note: NextSeq requires the spany=t flag to detect tile-edge duplicates along the Y-axis.
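The platform table above can be captured as a small lookup so that scripts do not hard-code the distance. The following is a minimal sketch: platform_flags is a hypothetical helper name (not part of BBTools), and the platform strings it matches are assumptions chosen to mirror the table.

```shell
#!/usr/bin/env bash
# platform_flags PLATFORM
# Hypothetical helper: prints the optical-duplicate flags from the table above.
platform_flags() {
    case "$1" in
        "NextSeq")                  echo "dupedist=40 spany=t" ;;
        "HiSeq 1T"|"HiSeq 2500")    echo "dupedist=40" ;;
        "HiSeq 3000"|"HiSeq 4000")  echo "dupedist=2500" ;;
        "NovaSeq 6000")             echo "dupedist=12000" ;;
        "NovaSeqX+")                echo "dupedist=50" ;;
        *)                          echo "dupedist=40" ;;  # conservative default
    esac
}

# Print (do not run) an optical-only command for a NovaSeq 6000 library.
echo "clumpify.sh in=reads.fq.gz out=deduped.fq.gz optical $(platform_flags "NovaSeq 6000")"
```

Unrecognized names fall through to dupedist=40, matching the "Other/Unknown" row.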

Recommended Parameters by Library Type

UMI	Quantitative	PCR	Recommended Clumpify Parameters
Yes	Yes	Yes	`umi umisubs=2`
Yes	Yes	No	`umi umisubs=2 optical dupedist=<platform>`
Yes	No	Yes	`umi umisubs=3`
Yes	No	No	`umi umisubs=3 optical dupedist=<platform>`
No	Yes	Yes	Failed experiment design - do not do this
No	Yes	No	No duplicate removal
No	No	Yes	`optical dupedist=<platform>`
No	No	No	`optical dupedist=<platform>`

Note: For quantitative experiments, umisubs=2 provides stricter UMI matching to maximize protection of coincidental duplicates. For non-quantitative experiments, umisubs=3 allows more tolerance. Platform-specific dupedist values (see table above) should be specified when using optical mode.
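The decision table above can also be written as a lookup on the three yes/no answers. This is only a sketch of the table, not an official tool: clumpify_params is a hypothetical function name, and the optional fourth argument is assumed to carry the platform-specific flags (it defaults to dupedist=40).

```shell
#!/usr/bin/env bash
# clumpify_params UMI QUANTITATIVE PCR [PLATFORM_FLAGS]
# Hypothetical helper: each of the first three arguments is "yes" or "no".
# Prints the recommended Clumpify parameters from the table above.
clumpify_params() {
    dist="${4:-dupedist=40}"   # platform flags; conservative default
    case "$1,$2,$3" in
        yes,yes,yes) echo "umi umisubs=2" ;;
        yes,yes,no)  echo "umi umisubs=2 optical $dist" ;;
        yes,no,yes)  echo "umi umisubs=3" ;;
        yes,no,no)   echo "umi umisubs=3 optical $dist" ;;
        no,yes,yes)  echo "Failed experiment design: quantitative PCR library without UMIs" >&2
                     return 1 ;;
        no,yes,no)   echo "" ;;                 # no duplicate removal
        no,no,yes|no,no,no) echo "optical $dist" ;;
        *)           echo "usage: clumpify_params yes|no yes|no yes|no [flags]" >&2
                     return 2 ;;
    esac
}

# Print (do not run) the command for a PCR-free, quantitative UMI library on NovaSeqX+.
echo "clumpify.sh in=reads.fq.gz out=deduped.fq.gz $(clumpify_params yes yes no dupedist=50)"
```

The "No / Yes / Yes" row returns a non-zero status rather than printing parameters, so a calling script stops instead of deduplicating a library the table marks as a failed design.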

Why One Setting Doesn't Work for All Projects

For a randomly-fragmented, high-coverage genome assembly, the error rates from suboptimal deduplication may seem acceptable:

0.81% False Positive Rate (FPR): Incorrectly removing unique reads
3.3% False Negative Rate (FNR): Failing to remove real PCR clones

However, these error rates can significantly impact accuracy in other common experiments.

1. The False Positive Problem (Critical for Quantitative Data)

What It Is

A "false positive" duplicate is when the pipeline removes a unique, original read, misidentifying it as a clone.

Why it becomes problematic: This 0.81% error rate can increase substantially in quantitative experiments such as:

Transcriptome expression (RNA-seq)
Population abundance studies
DAP-seq
ChIP-seq

The error rate also increases in experiments that are non-randomly fragmented (viral studies, transcriptomics) due to short molecules prior to shearing, which constrain possible start/stop coordinates.

The Cause: In an RNA-seq library, thousands of different molecules can legitimately start at the exact same position - the start of a gene, or anywhere in highly-expressed transcripts.

The Result: A deduplicator without UMIs identifies all these valid reads as "duplicates" based on identical start/stop positions and removes them. This reduces quantitative accuracy, making highly-expressed genes appear to have lower expression. The depth reduction is difficult to correct for, as it's a complex function of insert size, gene length, transcript age, and expression level.
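The effect can be seen on a toy example. The sketch below is not how Clumpify works internally: it uses exact matching only, and the reads, UMIs, and helper names are invented for illustration. Three reads share one sequence; two of them also share a UMI (a likely clone), while the third carries a different UMI (a coincidental duplicate).

```shell
#!/usr/bin/env bash
# Toy input, one read per line: UMI<TAB>sequence (all values are made up).
toy_reads() {
    printf 'ACGTACGTA\tTTGACCA\n'   # original molecule
    printf 'ACGTACGTA\tTTGACCA\n'   # same UMI + same sequence: likely PCR clone
    printf 'GGCATTCAG\tTTGACCA\n'   # different UMI, same sequence: coincidental duplicate
    printf 'TTAGGCATC\tCCATGGA\n'   # unrelated read
}

# Reads kept when "duplicate" means identical sequence, ignoring the UMI.
kept_ignoring_umi() { awk -F'\t' '!seen[$2]++' | wc -l | tr -d ' '; }

# Reads kept when "duplicate" means identical sequence AND identical UMI.
kept_with_umi() { awk -F'\t' '!seen[$1 FS $2]++' | wc -l | tr -d ' '; }

echo "kept, sequence only:  $(toy_reads | kept_ignoring_umi)"   # 2
echo "kept, sequence + UMI: $(toy_reads | kept_with_umi)"       # 3
```

Ignoring the UMI discards the coincidental duplicate along with the clone; requiring a UMI match removes only the clone.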

2. The False Negative Problem (Critical for Assembly & Variant Calling)

What It Is

A "false negative" is when the pipeline fails to identify and remove a real PCR clone, leaving it in the data.

Impact on Metagenome Assemblies:

Kmer-depth filtering separates real organisms from sequencing noise. Undetected clones (false negatives) inflate the apparent depth of reads from organisms present at under 1x coverage, which cannot be assembled and often make up the majority of the library, so that this noise appears as useful signal. This:

Increases memory usage during assembly
Populates final output with short, low-quality contigs
Reduces error-correction accuracy
Creates spurious branches on the assembly graph that reduce contiguity

Impact on Variant Calling:

A single PCR error can create a spurious variant. If that read is then amplified 100 times and the clones remain undetected, the variant caller will observe strong supporting evidence and call a mutation that was never present in the original sample.

This is especially problematic when detecting:

Somatic mutations in cancer genomics
Rare variants in population studies
Low-depth variants in any context
Variants in polyploids

Note: This concern is less relevant for high-depth, haploid isolates but critical for the applications listed above.

Supporting Data

These recommendations are based on comprehensive testing with PCR-free, ~1500x depth, 2x151bp randomly-fragmented paired-end sequencing of a Pedobacter heparinus bacterial isolate with 9bp UMIs on NovaSeqX+. Key findings:

Why UMIs Are Vital

Full deduplication (ignoring UMIs) found 26.648% duplicates. The same scan with UMI awareness (allowing 3 mismatches) found only 25.837% duplicates.

Conclusion: 0.81% of reads (26.648% - 25.837%) were "coincidental duplicates" (identical sequences with different UMIs) that would have been incorrectly removed without UMI checking.

Why Optical-Only Isn't Always Sufficient

Optical-only mode found 22.542% duplicates, ~3.3 percentage points fewer than the full UMI-aware scan (25.837%).

Conclusion: Optical-only detection misses many true clonal duplicates that aren't physically nearby on the flowcell, making it less effective for non-quantitative PCR libraries where complete duplicate removal is desired.
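The two headline rates are simple differences between duplicate percentages reported in this guide. As a quick check, the arithmetic can be reproduced as below; pct_diff is a hypothetical helper name, and the inputs are copied from the text above.

```shell
#!/usr/bin/env bash
# Duplicate percentages copied from this guide.
no_umi=26.648      # full deduplication, UMIs ignored
umi_subs3=25.837   # full deduplication, UMI-aware, 3 UMI mismatches allowed
optical=22.542     # optical-only

# pct_diff A B: print A - B to three decimal places.
pct_diff() { awk -v a="$1" -v b="$2" 'BEGIN { printf "%.3f\n", a - b }'; }

# Reads that would be wrongly removed without UMI checking (false positives).
echo "false positives without UMIs:  $(pct_diff "$no_umi" "$umi_subs3") points"   # 0.811
# Clones left behind by optical-only mode (false negatives).
echo "false negatives, optical-only: $(pct_diff "$umi_subs3" "$optical") points"  # 3.295
```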

Detailed Test Results

Run Type	UMI Subs	Seq Subs	Optical	Duplicates %
No-UMI	N/A	3	No	26.648%
No-UMI	N/A	3	Yes	22.542%
UMI	0	3	No	24.248%
UMI	1	3	No	25.766%
UMI	2	3	No	25.821%
UMI	3	3	No	25.837%
UMI	4	3	No	25.868%
UMI	5	3	No	25.954%
UMI	6	3	No	26.127%
UMI	8	3	No	26.575%
UMI	9	3	No	26.648%
UMI	0	3	Yes	21.371%
UMI	1	3	Yes	22.486%
UMI	3	3	Yes	22.536%
UMI	4	3	Yes	22.539%
UMI	5	3	Yes	22.540%
UMI	6	3	Yes	22.541%
UMI	7	3	Yes	22.542%
UMI	8	3	Yes	22.542%
UMI	9	3	Yes	22.542%

Note: The umisubs=3 rows show the recommended setting for non-quantitative applications with 9bp UMIs. For quantitative experiments, use umisubs=1 or 2 (stricter matching) to maximize protection of coincidental duplicates.
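To make the umisubs tolerance concrete, the sketch below counts mismatched positions between two UMIs and compares the count to a threshold. It assumes that umisubs is the maximum number of mismatching bases allowed for two UMIs to be treated as the same, as the "allowing 3 mismatches" wording above suggests. The function names and example UMIs are hypothetical, and this is not Clumpify's implementation.

```shell
#!/usr/bin/env bash
# umi_mismatches A B: count positions at which two equal-length UMIs differ.
umi_mismatches() {
    awk -v a="$1" -v b="$2" 'BEGIN {
        if (length(a) != length(b)) { print "UMI length mismatch" > "/dev/stderr"; exit 1 }
        n = 0
        for (i = 1; i <= length(a); i++)
            if (substr(a, i, 1) != substr(b, i, 1)) n++
        print n
    }'
}

# same_umi A B MAXSUBS: succeed if the UMIs differ at no more than MAXSUBS positions.
same_umi() {
    d=$(umi_mismatches "$1" "$2") || return 2
    [ "$d" -le "$3" ]
}

# Two made-up 9bp UMIs that differ at three positions.
a=ACGTACGTA
b=TTGTACGAA
echo "mismatches: $(umi_mismatches "$a" "$b")"                 # 3
same_umi "$a" "$b" 2 && echo "umisubs=2: same UMI" || echo "umisubs=2: different UMIs"
same_umi "$a" "$b" 3 && echo "umisubs=3: same UMI" || echo "umisubs=3: different UMIs"
```

Under this assumption, a stricter threshold treats more UMI pairs as distinct, so fewer reads are removed as duplicates. That matches the trend in the table, where the duplicate percentage rises as umisubs increases.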

Usage Examples

Note: The following examples use dupedist=50 for NovaSeqX+. This value is platform-specific - see the platform table above for appropriate values for other sequencers.

RNA-seq with UMIs (PCR-amplified)

clumpify.sh in=reads.fq.gz out=deduped.fq.gz umi umisubs=2

RNA-seq with UMIs (PCR-free)

clumpify.sh in=reads.fq.gz out=deduped.fq.gz umi umisubs=2 optical dupedist=50

Genome Assembly without UMIs

clumpify.sh in=reads.fq.gz out=deduped.fq.gz optical dupedist=50

NextSeq Data

clumpify.sh in=reads.fq.gz out=deduped.fq.gz optical dupedist=40 spany=t

Note: NextSeq requires the spany=t flag to detect tile-edge duplicates along the Y-axis.

Cancer Variant Calling with UMIs (non-quantitative)

clumpify.sh in=tumor.fq.gz out=deduped.fq.gz umi umisubs=3

Metagenome Assembly with UMIs

clumpify.sh in=meta.fq.gz out=deduped.fq.gz umi umisubs=3

Additional Resources

Clumpify Documentation - Complete parameter reference
All BBTools - Browse the complete BBTools suite
GitHub Repository - Source code and issue tracker
