Dedupe2
Advanced sequence deduplication and clustering tool designed to eliminate redundant contigs in assemblies, find contained and overlapping sequences, and group related sequences into clusters. Identical to Dedupe except that it supports an unlimited number of kmer affixes, improving overlap detection at lower identity thresholds.
Overview
Dedupe2 was designed to eliminate duplicate contigs in assemblies and later expanded to find all contained and overlapping sequences in datasets. It excels at handling massive redundancy in merged kmer-based assemblies and cleaning poorly curated databases like nt and RefSeq.
When to Use Dedupe2 vs Dedupe
Use Dedupe2 when searching for overlaps at relatively low identity, since it supports an unlimited number of kmer prefixes and suffixes (affixes) for seeding overlap detection; regular Dedupe is limited to 2 affixes per sequence tip. More affixes increase memory usage and processing time but improve sensitivity for detecting distant relationships.
Performance Characteristics
- Memory Usage: ~500 bytes per unique sequence plus sequence data (1 byte per base)
- Threading: Fully multithreaded with near-linear scaling across cores
- I/O Throughput: Up to 500 Mbp/s per input stream when I/O bound
- Sequence Limit: Maximum 2 billion unique sequences regardless of available memory
- Multiple Files: Using comma-separated inputs allows parallel stream processing
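As a rough sizing example under the assumptions above, deduplicating 10 million unique contigs averaging 2 kbp would need on the order of 10,000,000 × (500 + 2,000) bytes ≈ 25 GB of RAM for exact-match processing, before any additional affix maps or overlap structures are allocated.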
Usage
dedupe2.sh in=<file or stdin> out=<file or stdout>
Basic Usage Patterns
# Simple deduplication (exact matches only)
dedupe2.sh in=contigs.fa out=deduped.fa ac=f
# Deduplication with containment removal (default)
dedupe2.sh in=contigs.fa out=deduped.fa
# Find and output duplicate sequences separately
dedupe2.sh in=contigs.fa out=unique.fa outd=duplicates.fa
# Multiple input files for parallel processing
dedupe2.sh in=file1.fa,file2.fa,file3.fa out=merged_deduped.fa
Six-Phase Processing Pipeline
Dedupe2 employs a systematic six-phase approach, with phases executed or skipped based on processing mode:
Phase 1: Exact Matches (Required)
Sequences are loaded into memory and exact duplicates (including reverse-complements) are detected using hash tables. Kmer hash tables are populated for subsequent phases if containment or overlap detection is enabled.
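As an illustration of this phase, the sketch below (not the actual Dedupe2 source; all names are invented for this example) reduces each sequence to a canonical form, the lexicographically smaller of the sequence and its reverse complement, so that a sequence and its reverse complement collide on the same hash key:

import java.util.*;

/** Illustrative sketch of exact-duplicate detection with reverse complements; not Dedupe2's code. */
public class ExactDupSketch {

    // Reverse complement of a DNA string (ACGTN alphabet assumed).
    static String reverseComplement(String s) {
        StringBuilder sb = new StringBuilder(s.length());
        for (int i = s.length() - 1; i >= 0; i--) {
            switch (s.charAt(i)) {
                case 'A': sb.append('T'); break;
                case 'T': sb.append('A'); break;
                case 'C': sb.append('G'); break;
                case 'G': sb.append('C'); break;
                default:  sb.append('N'); break;
            }
        }
        return sb.toString();
    }

    // Canonical form: the lexicographically smaller of a sequence and its reverse complement,
    // so a sequence and its reverse complement map to the same hash-table key.
    static String canonical(String s) {
        String rc = reverseComplement(s);
        return s.compareTo(rc) <= 0 ? s : rc;
    }

    public static void main(String[] args) {
        List<String> seqs = Arrays.asList("AAACCC", "GGGTTT", "AAACCC", "ACGTTT");
        Set<String> seen = new HashSet<>();
        List<String> unique = new ArrayList<>();
        for (String s : seqs) {
            // Only the first copy of each canonical sequence is kept; later copies are absorbed.
            if (seen.add(canonical(s))) unique.add(s);
        }
        System.out.println(unique);  // [AAACCC, ACGTTT]: GGGTTT is the reverse complement of AAACCC
    }
}

The real tool stores sequence hash codes and Unit objects rather than raw strings (see Algorithm Details), but the canonical-form idea is the same.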
Phase 2: Absorb Containments (Default: Enabled)
When absorbcontainment=t, sequences are scanned for kmers that occur in other sequences. Candidates are aligned to determine whether one sequence fully contains another within the specified substitution/edit thresholds.
Phase 3: Find Overlaps (Default: Disabled)
When findoverlap=t, overlaps are detected using the same kmer-seeded approach as containments, but sequences need only overlap by at least minoverlap bases (default 200bp).
Phase 4: Make Clusters (Default: Disabled)
When cluster=t, clusters are formed by traversing the overlap graph. All sequences reachable via transitive overlaps form a cluster (if X overlaps Y and Y overlaps Z, then X, Y, and Z form one cluster).
Phase 5: Process Clusters (Default: Disabled)
When processclusters=t, clusters undergo graph simplification: redundant edges are removed, cycles are broken, and orientations are canonicalized to create consistent cluster structures.
Phase 6: Output Generation
All output files are generated: deduplicated sequences, cluster files, statistics, and graph visualizations based on enabled options.
Parameters
I/O Parameters
- in=<file,file>
- A single file or a comma-delimited list of files.
- out=<file>
- Destination for all output contigs.
- pattern=<file>
- Clusters will be written to individual files, where the '%' symbol in the pattern is replaced by cluster number.
- outd=<file>
- Optional; removed duplicates will go here.
- csf=<file>
- (clusterstatsfile) Write a list of cluster names and sizes.
- dot=<file>
- (graph) Write a graph in dot format. Requires 'fo' and 'pc' flags.
- threads=auto
- (t) Set number of threads to use; default is number of logical processors.
- overwrite=t
- (ow) Set to false to force the program to abort rather than overwrite an existing file.
- showspeed=t
- (ss) Set to 'f' to suppress display of processing speed.
- minscaf=0
- (ms) Ignore contigs/scaffolds shorter than this.
- interleaved=auto
- If true, forces fastq input to be paired and interleaved.
- ziplevel=2
- Set to 1 (lowest) through 9 (max) to change compression level; lower compression is faster.
Output Format Parameters
- storename=t
- (sn) Store scaffold names (set false to save memory).
- storequality=t
- (sq) Store quality values for fastq assemblies (set false to save memory).
- uniquenames=t
- (un) Ensure all output scaffolds have unique names. Uses more memory.
- mergenames=f
- When a sequence absorbs another, concatenate their headers.
- mergedelimiter=>
- Delimiter between merged headers. Can be a symbol name like greaterthan.
- numbergraphnodes=t
- (ngn) Label dot graph nodes with read numbers rather than read names.
- sort=f
- Sort output (otherwise it will be random). Options:
- length: Sort by length
- quality: Sort by quality
- name: Sort by name
- id: Sort by input order
- ascending=f
- Sort in ascending order.
- ordered=f
- Output sequences in input order. Equivalent to sort=id ascending.
- renameclusters=f
- (rnc) Rename contigs to indicate which cluster they are in.
- printlengthinedges=f
- (ple) Print the length of contigs in edges.
Processing Parameters
- absorbrc=t
- (arc) Absorb reverse-complements as well as normal orientation.
- absorbmatch=t
- (am) Absorb exact matches of contigs.
- absorbcontainment=t
- (ac) Absorb full containments of contigs.
- findoverlap=f
- (fo) Find overlaps between contigs (containments and non-containments). Necessary for clustering.
- uniqueonly=f
- (uo) If true, all copies of duplicate reads will be discarded, rather than keeping 1.
- rmn=f
- (requirematchingnames) If true, both names and sequence must match.
- usejni=f
- (jni) Do alignments in C code, which is faster, if an edit distance is allowed. Requires compiling C code; see /jni/README.txt.
Subset Parameters
- subsetcount=1
- (sstc) Number of subsets used to process the data; higher uses less memory.
- subset=0
- (sst) Only process reads whose ((ID%subsetcount)==subset).
Clustering Parameters
- cluster=f
- (c) Group overlapping contigs into clusters.
- pto=f
- (preventtransitiveoverlaps) Do not look for new edges between nodes in the same cluster.
- minclustersize=1
- (mcs) Do not output clusters smaller than this.
- pbr=f
- (pickbestrepresentative) Only output the single highest-quality read per cluster.
Cluster Postprocessing Parameters
- processclusters=f
- (pc) Run the cluster processing phase, which performs the selected operations in this category. For example, pc AND cc must be enabled to perform cc.
- fixmultijoins=t
- (fmj) Remove redundant overlaps between the same two contigs.
- removecycles=t
- (rc) Remove all cycles so clusters form trees.
- cc=t
- (canonicizeclusters) Flip contigs so clusters have a single orientation.
- fcc=f
- (fixcanoncontradictions) Truncate graph at nodes with canonization disputes.
- foc=f
- (fixoffsetcontradictions) Truncate graph at nodes with offset disputes.
- mst=f
- (maxspanningtree) Remove cyclic edges, leaving only the longest edges that form a tree.
Overlap Detection Parameters
- exact=t
- (ex) Only allow exact symbol matches. When false, an 'N' will match any symbol.
- touppercase=t
- (tuc) Convert input bases to upper-case; otherwise, lower-case will not match.
- maxsubs=0
- (s) Allow up to this many mismatches (substitutions only, no indels). May be set higher than maxedits.
- maxedits=0
- (e) Allow up to this many edits (subs or indels). Higher is slower.
- minidentity=100
- (mid) Absorb contained sequences with percent identity of at least this (includes indels).
- minlengthpercent=0
- (mlp) Smaller contig must be at least this percent of larger contig's length to be absorbed.
- minoverlappercent=0
- (mop) Overlap must be at least this percent of smaller contig's length to cluster and merge.
- minoverlap=200
- (mo) Overlap must be at least this long to cluster and merge.
- depthratio=0
- (dr) When non-zero, overlaps will only be formed between reads with a depth ratio of at most this. Should be above 1. Depth is determined by parsing read names; can be added by running KmerNormalize with the 'rename' flag.
- k=31
- Seed length used for finding containments and overlaps. Anything shorter than k will not be found.
- numaffixmaps=1
- (nam) KEY FEATURE: Number of prefixes/suffixes to index per contig. Dedupe2 supports unlimited values (regular Dedupe limited to 2). Higher values increase sensitivity for low-identity overlaps but use more memory.
- hashns=f
- Set to true to search for matches using kmers containing Ns. Can lead to extreme slowdown in some cases.
Other Parameters
- qtrim=f
- Set to qtrim=rl to trim leading and trailing Ns.
- trimq=6
- Quality trim level.
- forcetrimleft=-1
- (ftl) If positive, trim bases to the left of this position (exclusive, 0-based).
- forcetrimright=-1
- (ftr) If positive, trim bases to the right of this position (exclusive, 0-based).
Amino Acid Support
- amino
- Enable amino acid processing mode. Changes default kmer length to 10. Disables canonicity and reverse-complement flags. Limits numaffixmaps to 2 per tip.
Java Parameters
- -Xmx
- Set Java's memory usage, overriding autodetection. -Xmx20g specifies 20 GB RAM, -Xmx200m specifies 200 MB. Maximum is typically 85% of physical memory.
- -eoom
- Exit if an out-of-memory exception occurs. Requires Java 8u92+.
- -da
- Disable assertions.
Examples
Exact Duplicate Removal Only
dedupe2.sh in=contigs.fa out=deduped.fa ac=f
Remove only exact duplicates. The ac=f flag disables containment removal.
Exact Duplicate and Contained Sequence Removal
dedupe2.sh in=contigs.fa out=deduped.fa
Default behavior removes exact matches and sequences fully contained within others.
Deduplication with Mismatches
dedupe2.sh in=contigs.fa out=deduped.fa s=5 e=2
Allow up to 5 substitutions or 2 edits. This means 5 substitutions + 0 indels is OK, 2 insertions + 0 substitutions is OK, 2 insertions + 3 substitutions is OK, but 5 insertions would exceed the edit distance bandwidth of 2.
Deduplication with Minimum Identity
dedupe2.sh in=contigs.fa out=deduped.fa minidentity=99
Consider sequences duplicates if identity is ≥99%. For 1000bp sequences, this allows up to 10 substitutions. Add e=5 to allow up to 5 indels within the 10 total mutations allowed.
Advanced Clustering for Short Reads
dedupe2.sh in=reads.fq pattern=cluster%.fq ac=f am=f s=1 mo=200 fo c pc csf=stats.txt outbest=best.fq mcs=3 cc dot=graph.dot
Complex clustering: find overlaps (fo) using minimum overlap of 200bp (mo) with 1 substitution allowed (s), cluster sequences (c), process clusters (pc), output only clusters of size ≥3 (mcs), canonicalize clusters (cc), and create dot graph visualization.
Enhanced Affix Mapping (Dedupe2 Feature)
dedupe2.sh in=sequences.fa out=deduped.fa nam=4 e=10 k=27 mo=500
Use 4 nonoverlapping 27-mers as seeds from each sequence end (nam=4 k=27), allowing up to 10 edits (e=10). This configuration is useful for detecting low-identity overlaps that would be missed with fewer affixes.
PacBio 16S Amplicon Clustering
# First, filter by length and quality
reformat.sh in=reads_of_insert.fastq out=filtered.fq minlen=1420 maxlen=1640 maq=20 qin=33
# Then cluster with optimized parameters
dedupe2.sh in=filtered.fq csf=stats_e26.txt outbest=best_e26.fq qin=33 usejni=t am=f ac=f fo c rnc=f mcs=3 k=27 mo=1420 ow cc pto nam=4 e=26 pattern=cluster_%.fq dot=graph.dot
Specialized workflow for clustering full-length PacBio 16S reads. The first command filters by length (1420-1640bp) and quality (maq=20) to retain full-length, high-quality reads; the second clusters them allowing 26 edits (roughly 98% identity over a ~1540bp amplicon) with a minimum overlap of 1420bp and 4 affix maps for sensitivity.
Set Operations
# Create sets (deduplicate first)
dedupe2.sh in=setA.fa out=setA_unique.fa ac=f
dedupe2.sh in=setB.fa out=setB_unique.fa ac=f
# Union
dedupe2.sh in=setA_unique.fa,setB_unique.fa out=union.fa ac=f
# Subtraction (setB - setA)
dedupe2.sh in=setA_unique.fa,union.fa out=setB_minus_setA.fa uniqueonly ac=f
# Intersection (setB minus (setB - setA))
dedupe2.sh in=setB_unique.fa,setB_minus_setA.fa out=intersection.fa uniqueonly ac=f
Perform set operations using the uniqueonly flag, which discards all copies of sequences that have duplicates rather than keeping one copy.
Amino Acid Processing
dedupe2.sh in=proteins.faa out=deduped_proteins.faa amino k=10
Process protein sequences in amino acid space with default kmer length of 10. Canonicity and reverse-complement flags are automatically disabled.
Algorithm Details
Unlimited Affix Map Architecture
Dedupe2's key innovation is support for unlimited kmer prefixes and suffixes per sequence through a scalable HashMap array structure:
- Array Structure: HashMap<LongM, ArrayList<Unit>>[] affixMaps with configurable size
- Kmer Seeding: Each affix map indexes different kmer positions from sequence termini
- Collision Management: maxAffixCopies parameter controls maximum entries per kmer to prevent memory explosion
- Sensitivity Trade-off: More affixes improve detection of distant relationships but increase memory usage and processing time
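A simplified sketch of the affix-seeding idea follows; it is an illustration rather than the actual implementation, and the offset scheme, suffix handling, and reverse-complement handling are assumptions made for brevity. Each map in the array indexes one terminal kmer per sequence, and every kmer of a query is looked up across all maps, so an error within the first k bases of a tip does not prevent seeding:

import java.util.*;

/** Illustrative affix-index sketch; names and offset scheme are assumptions, not Dedupe2's code. */
public class AffixIndexSketch {
    final int k;                                             // seed (kmer) length
    final int numAffixMaps;                                  // analogous to nam
    final List<Map<String, List<Integer>>> affixMaps = new ArrayList<>();

    AffixIndexSketch(int k, int numAffixMaps) {
        this.k = k;
        this.numAffixMaps = numAffixMaps;
        for (int i = 0; i < numAffixMaps; i++) affixMaps.add(new HashMap<>());
    }

    // Store several prefix kmers per sequence: map i holds the kmer starting at offset i*k
    // from the 5' end. (The real tool also indexes suffixes and reverse complements.)
    void index(int seqId, String seq) {
        for (int i = 0; i < numAffixMaps; i++) {
            int start = i * k;
            if (start + k > seq.length()) break;
            String affix = seq.substring(start, start + k);
            affixMaps.get(i).computeIfAbsent(affix, x -> new ArrayList<>()).add(seqId);
        }
    }

    // Scan every kmer of a query and report sequences whose stored affix kmers occur in it;
    // these candidates would then be verified by alignment.
    Set<Integer> candidates(String query) {
        Set<Integer> hits = new HashSet<>();
        for (int pos = 0; pos + k <= query.length(); pos++) {
            String kmer = query.substring(pos, pos + k);
            for (Map<String, List<Integer>> map : affixMaps) {
                List<Integer> ids = map.get(kmer);
                if (ids != null) hits.addAll(ids);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        AffixIndexSketch idx = new AffixIndexSketch(5, 2);
        idx.index(1, "ACGTAGGGCCATT");   // stores affixes ACGTA (offset 0) and GGGCC (offset 5)
        // The query differs near the tip, so the first affix misses, but the second still seeds it.
        System.out.println(idx.candidates("TTTTTACCTAGGGCCATT"));  // prints [1]
    }
}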
Multi-threaded Hash Table Processing
Dedupe2 employs specialized HashThread workers for parallel processing with synchronized merge operations:
- Hash Code Mapping: Primary HashMap<Long, ArrayList<Unit>> using sequence hash codes
- Thread-local Processing: Each worker maintains private hash tables to minimize synchronization
- Merge Coordination: mergeMaps() operations combine thread results with atomic collision counting
- Load Balancing: Unit distribution across threads based on hash code modulo operations
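The thread-local-then-merge pattern can be sketched generically as below (illustrative only; not the actual HashThread or mergeMaps() code). Each worker hashes its share of the input into a private map with no locking, and the per-thread maps are combined in a final merge step:

import java.util.*;
import java.util.concurrent.*;

/** Illustrative thread-local hashing sketch; not the actual Dedupe2 worker code. */
public class ThreadLocalHashSketch {
    public static void main(String[] args) throws Exception {
        List<String> seqs = Arrays.asList("ACGT", "ACGT", "GGGG", "TTTT", "GGGG", "ACGT");
        int threads = 2;
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        List<Future<Map<String, Integer>>> futures = new ArrayList<>();

        for (int t = 0; t < threads; t++) {
            final int id = t;
            // Each worker counts its share of the sequences in a private (thread-local) map,
            // so no synchronization is needed during the hashing phase.
            Callable<Map<String, Integer>> worker = () -> {
                Map<String, Integer> local = new HashMap<>();
                for (int i = id; i < seqs.size(); i += threads) {
                    local.merge(seqs.get(i), 1, Integer::sum);
                }
                return local;
            };
            futures.add(pool.submit(worker));
        }

        // Merge step: combine the per-thread maps into one result, analogous in spirit
        // to a mergeMaps() pass after the parallel phase.
        Map<String, Integer> merged = new HashMap<>();
        for (Future<Map<String, Integer>> f : futures) {
            for (Map.Entry<String, Integer> e : f.get().entrySet()) {
                merged.merge(e.getKey(), e.getValue(), Integer::sum);
            }
        }
        pool.shutdown();
        System.out.println(merged);  // copy counts, e.g. {TTTT=1, GGGG=2, ACGT=3}
    }
}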
Banded Alignment for Inexact Matching
When edit distance is allowed, Dedupe2 uses BandedAligner for verification with optimized parameters:
- Bandwidth Calculation: 2×maxEdits+1 with customizable override via bandwidth parameter
- Alignment Scoring: Configurable substitution and indel penalties with identity threshold enforcement
- JNI Acceleration: Optional C implementation for alignment-intensive operations (requires compilation)
- Bi-directional Processing: Canonical and reverse-complement orientations when absorbrc=t
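The bandwidth idea can be illustrated with a small banded edit-distance routine that only fills dynamic-programming cells within maxEdits of the diagonal, i.e. a band of width 2×maxEdits+1. This is a sketch for illustration, not BBTools' BandedAligner:

import java.util.*;

/** Illustrative banded edit-distance sketch; not BBTools' BandedAligner. */
public class BandedAlignSketch {

    // Returns the edit distance between a and b if it is <= maxEdits, otherwise -1.
    // Only cells with |i - j| <= maxEdits are computed (band width 2*maxEdits+1).
    static int bandedEditDistance(String a, String b, int maxEdits) {
        if (Math.abs(a.length() - b.length()) > maxEdits) return -1;   // lengths differ too much
        int n = a.length(), m = b.length();
        final int INF = maxEdits + 1;                 // any value > maxEdits means "reject"
        int[] prev = new int[m + 1];
        int[] cur = new int[m + 1];
        Arrays.fill(prev, INF);
        for (int j = 0; j <= Math.min(m, maxEdits); j++) prev[j] = j;  // row 0 inside the band
        for (int i = 1; i <= n; i++) {
            Arrays.fill(cur, INF);
            if (i <= maxEdits) cur[0] = i;            // left edge of the band
            int lo = Math.max(1, i - maxEdits);
            int hi = Math.min(m, i + maxEdits);
            for (int j = lo; j <= hi; j++) {
                int sub = prev[j - 1] + (a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1);
                int del = prev[j] + 1;                // base deleted from a
                int ins = cur[j - 1] + 1;             // base inserted into a
                cur[j] = Math.min(sub, Math.min(del, ins));
            }
            int[] tmp = prev; prev = cur; cur = tmp;
        }
        return prev[m] <= maxEdits ? prev[m] : -1;
    }

    public static void main(String[] args) {
        // One insertion: accepted with maxEdits=2.
        System.out.println(bandedEditDistance("ACGTACGTAC", "ACGTTACGTAC", 2));      // 1
        // Five insertions: the length difference alone exceeds the band, so it is rejected.
        System.out.println(bandedEditDistance("ACGTACGTAC", "ACGTTTTTTACGTAC", 2));  // -1
    }
}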
Graph-based Clustering Algorithm
For clustering operations, Dedupe2 constructs and processes overlap graphs with sophisticated post-processing:
- Connected Components: Transitive closure algorithm to form clusters from overlap edges
- Cycle Removal: Graph simplification to create tree structures when removecycles=t
- Canonicalization: Orientation consistency across cluster members with conflict resolution
- Maximum Spanning Tree: Edge weight-based tree construction when mst=t
- Quality-based Representative Selection: Length and quality score evaluation for cluster representatives
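A minimal union-find sketch of the transitive clustering step is shown below (if X overlaps Y and Y overlaps Z, all three join one cluster). The edges and identifiers are invented for the example, and the post-processing steps above (cycle removal, canonicalization, spanning trees) are not shown:

import java.util.*;

/** Illustrative connected-components sketch over verified overlap edges; not Dedupe2's code. */
public class OverlapClusterSketch {
    // Simple union-find (disjoint set) over sequence ids, with path compression.
    static int[] parent;
    static int find(int x) { return parent[x] == x ? x : (parent[x] = find(parent[x])); }
    static void union(int a, int b) { parent[find(a)] = find(b); }

    public static void main(String[] args) {
        int n = 6;                                    // sequences 0..5
        int[][] overlaps = {{0, 1}, {1, 2}, {3, 4}};  // alignment-verified overlap edges
        parent = new int[n];
        for (int i = 0; i < n; i++) parent[i] = i;
        for (int[] e : overlaps) union(e[0], e[1]);   // transitive closure via union-find

        // Group sequences by their cluster representative.
        Map<Integer, List<Integer>> clusters = new TreeMap<>();
        for (int i = 0; i < n; i++) clusters.computeIfAbsent(find(i), x -> new ArrayList<>()).add(i);
        System.out.println(clusters.values());        // [[0, 1, 2], [3, 4], [5]]
    }
}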
Memory Management Strategy
Dedupe2 implements several memory optimization techniques for processing large datasets:
- Subset Processing: Modulo-based data partitioning (subsetCount parameter) for memory-constrained environments
- Explicit Cleanup: killAffixMaps() method systematically clears HashMap references between phases
- Invalid Entry Removal: Regular purging of absorbed sequences to prevent memory accumulation
- Optional Data Retention: storeName, storeQuality, and storeSuffix flags control memory usage vs functionality
- Streaming I/O: ConcurrentReadInputStream with compressed file support and buffered processing
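The subset partitioning mentioned above (and exposed via the subsetcount/subset parameters) amounts to a modulo filter over read IDs, so each pass only holds a fraction of the data in memory, typically trading extra passes over the input for lower peak memory. The sketch below is illustrative pseudologic, not the tool's code:

import java.util.*;
import java.util.stream.*;

/** Illustrative subset-partitioning sketch; not the actual Dedupe2 implementation. */
public class SubsetSketch {
    public static void main(String[] args) {
        int subsetCount = 3;                        // like subsetcount=3
        List<Integer> readIds = IntStream.range(0, 10).boxed().collect(Collectors.toList());
        for (int subset = 0; subset < subsetCount; subset++) {
            final int s = subset;
            // Only reads with (ID % subsetCount) == subset are held in memory in this pass.
            List<Integer> inMemory = readIds.stream()
                    .filter(id -> id % subsetCount == s)
                    .collect(Collectors.toList());
            System.out.println("pass " + s + " loads " + inMemory);
        }
    }
}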
Performance Characteristics
- Time Complexity: O(n × a × c), where n = number of sequences, a = number of affix maps, and c = average collisions per kmer
- Memory Scaling: Linear with sequence count and affix map number (~500 bytes + sequence data per unique sequence)
- I/O Optimization: Parallel file stream processing when comma-separated inputs are used
- Threading Efficiency: Near-linear scaling up to I/O saturation (typically 8-16 cores)
- Bottleneck Analysis: Usually I/O bound at ~500 Mbp/s per stream for exact matching operations
Memory and Performance Guidelines
Memory Requirements
- Base Memory: ~500 bytes per unique sequence plus sequence storage (1 byte per base)
- Affix Maps: Each additional affix map increases memory usage proportionally
- Large Datasets: Use subset processing for datasets exceeding available memory
- Sequence Limit: Hard limit of 2 billion unique sequences regardless of memory
Performance Optimization
- Threading: Use all available cores for CPU-intensive operations
- Multiple Files: Comma-separated inputs enable parallel I/O streams
- JNI Acceleration: Compile and enable JNI for edit distance operations
- Parameter Tuning: Balance sensitivity (nam) vs memory usage and processing time
Paired Read Limitations
When processing paired reads, some performance restrictions apply:
- Single Threading: Some operations restricted to single thread to ensure deterministic output
- Feature Limitations: Pair support limited to exact matches and overlaps, not containments
- Processing Speed: Paired read processing slower than unpaired reads
Support
For questions and support:
- Email: bbushnell@lbl.gov
- Documentation: bbmap.org
- Comprehensive Guide: Please read bbmap/docs/guides/DedupeGuide.txt for detailed information