Clumpify Duplicate Removal Guide

How to remove PCR duplicates while protecting coincidental duplicates in quantitative experiments

Important: Scope of Recommendations

These recommendations apply to randomly-sheared paired-end data. For fixed-length single-end data or amplicon sequencing, different deduplication strategies may be more appropriate.

Overview

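Duplicate removal is critical for accurate genomic analysis, but one size does NOT fit all. An incorrect deduplication strategy can remove valid reads and distort quantitative measurements, or leave PCR clones in the data and mislead assembly and variant calling.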

This guide provides evidence-based recommendations for choosing the correct Clumpify parameters based on your experimental design.

Goal

Remove true PCR duplicates (clones) while protecting coincidental duplicates (identical-but-real reads). This distinction is important for quantitative experiments like RNA-seq expression measurement, where removing the wrong reads can invalidate your results.

Key Terms

UMI: Unique Molecular Identifiers (molecular barcodes)
Quantitative: Experiment where read counts matter (e.g., RNA-seq expression, DAP-seq, ChIP-seq)
PCR: PCR amplification of the library
optical: Clumpify mode that only removes duplicates physically nearby on the flowcell
dupedist: Maximum pixel distance for optical duplicate detection (platform-specific, see table below)
spany: Boolean flag (spany=t) for optical duplicate detection across different tiles along the Y-axis (required for NextSeq tile-edge duplicates)

Platform-Specific Optical Duplicate Distance

Sequencing Platform    Recommended Parameter
NextSeq                dupedist=40 spany=t
HiSeq 1T               dupedist=40
HiSeq 2500             dupedist=40
HiSeq 3000/4000        dupedist=2500
NovaSeq 6000           dupedist=12000
NovaSeqX+              dupedist=50
Other/Unknown          dupedist=40 (conservative default)

Note: NextSeq requires the spany=t flag to detect tile-edge duplicates along the Y-axis.
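
The platform table above can be captured as a small lookup. The following is a minimal sketch, not part of Clumpify itself: the function name dupedist_for and the lowercase platform keys are illustrative assumptions, while the returned values are copied from the table.

```shell
# Hypothetical helper: map a sequencing platform to the optical-duplicate
# parameters listed in the platform table. The keys are illustrative labels.
dupedist_for() (
    case "$1" in
        nextseq)             echo "dupedist=40 spany=t" ;;
        hiseq1t|hiseq2500)   echo "dupedist=40" ;;
        hiseq3000|hiseq4000) echo "dupedist=2500" ;;
        novaseq6000)         echo "dupedist=12000" ;;
        novaseqx)            echo "dupedist=50" ;;
        *)                   echo "dupedist=40" ;;  # conservative default for other/unknown
    esac
)

dupedist_for nextseq   # prints: dupedist=40 spany=t
```

Falling through to dupedist=40 mirrors the "Other/Unknown" row, so an unrecognized platform name never produces an empty parameter.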

Recommended Parameters by Library Type

UMI    Quantitative    PCR    Recommended Clumpify Parameters
Yes    Yes             Yes    umi umisubs=2
Yes    Yes             No     umi umisubs=2 optical dupedist=<platform>
Yes    No              Yes    umi umisubs=3
Yes    No              No     umi umisubs=3 optical dupedist=<platform>
No     Yes             Yes    Failed experiment design - do not do this
No     Yes             No     No duplicate removal
No     No              Yes    optical dupedist=<platform>
No     No              No     optical dupedist=<platform>

Note: For quantitative experiments, umisubs=2 provides stricter UMI matching to maximize protection of coincidental duplicates. For non-quantitative experiments, umisubs=3 allows more tolerance. Platform-specific dupedist values (see table above) should be specified when using optical mode.
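
The decision table above can be sketched as a shell function that returns the flags for a given library. This is an illustration only: the function name clumpify_flags, its yes/no argument convention, and the fallback of dupedist=40 when no distance is given are assumptions; the flag strings are copied from the table.

```shell
# Hypothetical helper: choose Clumpify flags from the library-type table.
# Arguments: umi (yes|no), quantitative (yes|no), pcr (yes|no), dupedist value.
clumpify_flags() (
    umi=$1; quant=$2; pcr=$3
    dist=${4:-40}   # assumed fallback: the table's conservative default
    case "$umi/$quant/$pcr" in
        yes/yes/yes) echo "umi umisubs=2" ;;
        yes/yes/no)  echo "umi umisubs=2 optical dupedist=$dist" ;;
        yes/no/yes)  echo "umi umisubs=3" ;;
        yes/no/no)   echo "umi umisubs=3 optical dupedist=$dist" ;;
        no/yes/yes)  echo "ERROR: quantitative PCR library without UMIs" >&2
                     exit 1 ;;   # exits only the function's subshell
        no/yes/no)   : ;;        # no duplicate removal: emit no flags
        no/no/*)     echo "optical dupedist=$dist" ;;
        *)           echo "ERROR: expected yes or no for each argument" >&2
                     exit 2 ;;
    esac
)

clumpify_flags yes yes no 50   # prints: umi umisubs=2 optical dupedist=50
```

The failed-design row returns a non-zero status rather than a flag string, so a calling script can stop before running any deduplication on such a library.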

Why One Setting Doesn't Work for All Projects

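For a randomly-fragmented, high-coverage genome assembly, the error rates from suboptimal deduplication may seem acceptable: in the test dataset described under Supporting Data, ignoring UMIs wrongly removed 0.81% of reads, and optical-only mode left about 3.3% of reads as undetected duplicates.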

However, these error rates can significantly impact accuracy in other common experiments.

1. The False Positive Problem (Critical for Quantitative Data)

What It Is

A "false positive" duplicate is when the pipeline removes a unique, original read, misidentifying it as a clone.

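Why it becomes problematic: The 0.81% false-positive rate measured in the test dataset (see Supporting Data) can increase substantially in quantitative experiments such as RNA-seq expression measurement, DAP-seq, and ChIP-seq.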

The error rate also increases in experiments that are non-randomly fragmented (viral studies, transcriptomics) due to short molecules prior to shearing, which constrain possible start/stop coordinates.

The Cause: In an RNA-seq library, thousands of different molecules can legitimately start at the exact same position - the start of a gene, or anywhere in highly-expressed transcripts.

The Result: A deduplicator without UMIs identifies all these valid reads as "duplicates" based on identical start/stop positions and removes them. This reduces quantitative accuracy, making highly-expressed genes appear to have lower expression. The depth reduction is difficult to correct for, as it's a complex function of insert size, gene length, transcript age, and expression level.
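
The saturation effect described above can be illustrated with a toy occupancy model. This is a deliberately simplified sketch, not a model used by Clumpify: it assumes the n reads of a transcript fall uniformly at random on m possible (start, stop) coordinate pairs, and that position-only deduplication keeps one read per occupied pair. The function name expected_kept and the numbers are hypothetical.

```shell
# Expected number of reads kept by position-only deduplication when n reads
# land uniformly on m possible (start, stop) pairs: m * (1 - (1 - 1/m)^n).
# The uniform model and the example numbers are illustrative assumptions.
expected_kept() (
    awk -v n="$1" -v m="$2" 'BEGIN { printf "%.0f\n", m * (1 - (1 - 1/m)^n) }'
)

expected_kept 1000 10000     # low expression:  prints 952
expected_kept 100000 10000   # high expression: prints 10000
```

In this toy model a transcript with 1,000 reads loses about 5% of them, while one with 100,000 reads is capped near 10,000, a loss of about 90%. Real libraries are not uniform, which is part of why the depth reduction is hard to correct afterwards.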

2. The False Negative Problem (Critical for Assembly & Variant Calling)

What It Is

A "false negative" is when the pipeline fails to identify and remove a real PCR clone, leaving it in the data.

Impact on Metagenome Assemblies:

Kmer-depth filtering separates real organisms from sequencing noise. Undetected clones (false negatives) make noise from unassembleable organisms under depth 1 (often the majority of the library) appear as useful signal. This:

Impact on Variant Calling:

A single PCR error can create a spurious variant. If that read is then amplified 100 times and the clones remain undetected, the variant caller will observe strong supporting evidence and call a mutation that was never present in the original sample.
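
A short worked calculation shows the size of the effect. The site depth of 1,000 original molecules is a hypothetical figure chosen for illustration; the 100 copies come from the scenario above. The function name allele_fraction is likewise an assumption.

```shell
# Apparent variant allele fraction (percent) at a site.
# Arguments: reads supporting the variant, total reads covering the site.
allele_fraction() (
    awk -v alt="$1" -v total="$2" 'BEGIN { printf "%.2f\n", 100 * alt / total }'
)

# Hypothetical site: 1000 original molecules, one of which carries a PCR error.
allele_fraction 100 1099   # clones kept: 100 error copies + 999 other reads, prints 9.10
allele_fraction 1 1000     # clones collapsed to a single read, prints 0.10
```

With the clones left in place, an error present in one molecule looks like a variant at roughly 9% allele fraction; after the clones are collapsed it drops back to 0.1%.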

This is especially problematic when detecting:

Note: This concern is less relevant for high-depth, haploid isolates but critical for the applications listed above.

Supporting Data

These recommendations are based on comprehensive testing with PCR-free, ~1500x depth, 2x151bp randomly-fragmented paired-end sequencing of a Pedobacter heparinus bacterial isolate with 9bp UMIs on NovaSeqX+. Key findings:

Why UMIs Are Vital

Full deduplication (ignoring UMIs) found 26.648% duplicates. The same scan with UMI awareness (allowing 3 mismatches) found only 25.837% duplicates.

Conclusion: 0.81% of reads were "coincidental duplicates" (identical sequences with different UMIs) that would have been incorrectly removed without UMI checking.
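
The 0.81% figure is the difference between the two duplicate rates reported above. A minimal sketch of the arithmetic, with coincidental_points as a hypothetical helper name:

```shell
# Percentage-point gap between the no-UMI and UMI-aware duplicate rates,
# i.e. reads that position-only matching would have removed in error.
coincidental_points() (
    awk -v no_umi="$1" -v with_umi="$2" 'BEGIN { printf "%.3f\n", no_umi - with_umi }'
)

coincidental_points 26.648 25.837   # prints 0.811
```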

Why Optical-Only Isn't Always Sufficient

Optical-only mode found 22.542% duplicates, about 3.3 percentage points fewer than the full UMI-aware scan (25.837%).

Conclusion: Optical-only detection misses many true clonal duplicates that aren't physically nearby on the flowcell, making it less effective for non-quantitative PCR libraries where complete duplicate removal is desired.
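
To put the gap in relative terms, the sketch below estimates what share of the duplicates found by the UMI-aware scan are missed by optical-only mode. It treats the UMI-aware rate (25.837%) as the reference for true duplicates, which is an approximation; missed_share is a hypothetical helper name.

```shell
# Rough share (percent) of UMI-aware duplicates that optical-only mode misses.
# Arguments: UMI-aware duplicate rate, optical-only duplicate rate (both in percent).
missed_share() (
    awk -v full="$1" -v opt="$2" 'BEGIN { printf "%.1f\n", 100 * (full - opt) / full }'
)

missed_share 25.837 22.542   # prints 12.8
```

Under that approximation, roughly one in eight duplicates in this dataset would remain after optical-only deduplication.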

Detailed Test Results

Run Type    UMI Subs    Seq Subs    Optical    Duplicates %
No-UMI      N/A         3           No         26.648%
No-UMI      N/A         3           Yes        22.542%
UMI         0           3           No         24.248%
UMI         1           3           No         25.766%
UMI         2           3           No         25.821%
UMI         3           3           No         25.837%
UMI         4           3           No         25.868%
UMI         5           3           No         25.954%
UMI         6           3           No         26.127%
UMI         8           3           No         26.575%
UMI         9           3           No         26.648%
UMI         0           3           Yes        21.371%
UMI         1           3           Yes        22.486%
UMI         3           3           Yes        22.536%
UMI         3           3           Yes        22.536%
UMI         4           3           Yes        22.539%
UMI         5           3           Yes        22.540%
UMI         6           3           Yes        22.541%
UMI         7           3           Yes        22.542%
UMI         8           3           Yes        22.542%
UMI         9           3           Yes        22.542%

Note: The umisubs=3 rows show the recommended setting for non-quantitative applications with 9bp UMIs. For quantitative experiments, use umisubs=1 or 2 (stricter matching) to maximize protection of coincidental duplicates.
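
One way to read the table is to look at how many additional duplicates each extra allowed UMI mismatch finds. The sketch below does this for the non-optical UMI rows; the values are copied from the table, and umi_gains is a hypothetical helper name. Note that the table has no non-optical row for 7 substitutions, so the row for 8 spans two steps.

```shell
# Non-optical UMI rows from the results table: "umisubs duplicate_percent".
# Prints each umisubs value with its gain over the previous listed row.
umi_gains() (
    awk 'NR > 1 { printf "%s %.3f\n", $1, $2 - prev } { prev = $2 }' <<'EOF'
0 24.248
1 25.766
2 25.821
3 25.837
4 25.868
5 25.954
6 26.127
8 26.575
9 26.648
EOF
)

umi_gains   # first line printed: 1 1.518
```

The first allowed mismatch recovers about 1.5 percentage points, and the steps to 2 and 3 add only 0.055 and 0.016. The gains grow again at higher values, and at umisubs=9 (the full UMI length) the rate equals the no-UMI result of 26.648%, which suggests that large values begin merging reads whose UMIs are genuinely different.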

Usage Examples

Note: The following examples use dupedist=50 for NovaSeqX+. This value is platform-specific - see the platform table above for appropriate values for other sequencers.

RNA-seq with UMIs (PCR-amplified)

clumpify.sh in=reads.fq.gz out=deduped.fq.gz umi umisubs=2

RNA-seq with UMIs (PCR-free)

clumpify.sh in=reads.fq.gz out=deduped.fq.gz umi umisubs=2 optical dupedist=50

Genome Assembly without UMIs

clumpify.sh in=reads.fq.gz out=deduped.fq.gz optical dupedist=50

NextSeq Data

clumpify.sh in=reads.fq.gz out=deduped.fq.gz optical dupedist=40 spany=t

Note: NextSeq requires spany=t flag to detect tile-edge duplicates along the Y-axis

Cancer Variant Calling with UMIs (non-quantitative)

clumpify.sh in=tumor.fq.gz out=deduped.fq.gz umi umisubs=3

Metagenome Assembly with UMIs

clumpify.sh in=meta.fq.gz out=deduped.fq.gz umi umisubs=3

Additional Resources