Dedupe

Script: dedupe.sh Package: jgi Class: Dedupe.java

Eliminates duplicate contigs in assemblies and finds contained and overlapping sequences. Supports clustering sequences based on similarity, printing dot-formatted overlap graphs, and performing set operations on sequence collections.

Overview

Dedupe was originally designed to eliminate duplicate contigs in assemblies, particularly for overlap-based assemblers and merged kmer-based assemblies that contain massive redundancy. It has since expanded to handle all contained and overlapping sequences in datasets, with support for configurable substitutions, edit distance, clustering, and comprehensive set operations.

While kmer-based assemblers typically do not create redundant contigs when working correctly, overlap-based assemblers often do, and merged assemblies (such as 5 assemblies of the same reads with different kmer lengths) usually contain massive redundancy. Public databases like nt and RefSeq also contain hundreds of thousands of duplicate sequences due to poor curation.

Basic Usage

dedupe.sh in=<file or stdin> out=<file or stdout>

Input may be fasta or fastq, compressed or uncompressed. Output may be stdout or a file. With no output parameter, data will be written to stdout. If 'out=null', there will be no output, but statistics will still be printed. You can also use 'dedupe <infile> <outfile>' without the 'in=' and 'out='.

Clustering Example

dedupe.sh in=x.fq am=f ac=f fo c pc rnc=f mcs=4 mo=100 s=1 pto cc qin=33 csf=stats.txt pattern=cluster_%.fq dot=graph.dot

This example finds overlaps allowing 1 substitution, clusters sequences with minimum overlap of 100bp, writes clusters of size ≥4 to individual files, and generates an overlap graph for visualization with graphviz.

Processing Architecture

Dedupe uses a 6-phase processing architecture. The phases are always executed (or skipped) in the same order, with most being optional depending on the processing mode:

1. Exact Matches (Required)

Sequences are loaded into memory and exact duplicates (including reverse-complements) are detected using hash codes. Hashtables are filled with sequence hash codes. If containments or overlaps will be examined later, kmers from sequence ends are added to hash tables. Input files are not used again after this phase.

2. Absorb Containments (Default: On)

When absorbcontainments=t, each sequence X is scanned for kmers. Each kmer is looked up in a hashtable to find other sequences Y. If Y is contained in X (equal/shorter length with substitutions/edits within specified limits), Y is discarded.

3. Find Overlaps (Default: Off)

When findoverlaps=t, overlaps are sought using the same kmer process as containment, but sequences need only overlap by minoverlap length (default 200bp) rather than full containment. Overlaps are recorded but sequences are not merged.

4. Make Clusters (Default: Off)

When cluster=t, clusters are created by searching the overlap graph. Each cluster contains all sequences reachable via transitive overlaps. If X overlaps Y and Y overlaps Z, then X, Y, and Z form one cluster even if X doesn't directly overlap Z.

5. Process Clusters (Default: Off)

When processclusters=t, clusters undergo postprocessing for graph simplification: removing redundant edges, flipping sequences for consistent orientation, and other configurable operations.

6. Output

All output files are generated, including deduplicated sequences, cluster files, statistics, and dot graphs.

Parameters

Parameters are organized by their function in the deduplication process, exactly as they appear in the shell script usage function.

I/O parameters

in=<file,file>
A single file or a comma-delimited list of files.
out=<file>
Destination for all output contigs.
pattern=<file>
Clusters will be written to individual files, where the '%' symbol in the pattern is replaced by cluster number.
outd=<file>
Optional; removed duplicates will go here.
csf=<file>
(clusterstatsfile) Write a list of cluster names and sizes.
dot=<file>
(graph) Write a graph in dot format. Requires 'fo' and 'pc' flags.
threads=auto
(t) Set number of threads to use; default is number of logical processors.
overwrite=t
(ow) Set to false to force the program to abort rather than overwrite an existing file.
showspeed=t
(ss) Set to 'f' to suppress display of processing speed.
minscaf=0
(ms) Ignore contigs/scaffolds shorter than this.
interleaved=auto
If true, forces fastq input to be paired and interleaved.
ziplevel=2
Set to 1 (lowest) through 9 (max) to change compression level; lower compression is faster.

Output format parameters

storename=t
(sn) Store scaffold names (set false to save memory).
storequality=t
(sq) Store quality values for fastq assemblies (set false to save memory).
uniquenames=t
(un) Ensure all output scaffolds have unique names. Uses more memory.
mergenames=f
When a sequence absorbs another, concatenate their headers.
mergedelimiter=>
Delimiter between merged headers. Can be a symbol name like greaterthan.
numbergraphnodes=t
(ngn) Label dot graph nodes with read numbers rather than read names.
sort=f
Sort output (otherwise it will be random). Options:
• length: Sort by length
• quality: Sort by quality
• name: Sort by name
• id: Sort by input order
ascending=f
Sort in ascending order.
ordered=f
Output sequences in input order. Equivalent to sort=id ascending.
renameclusters=f
(rnc) Rename contigs to indicate which cluster they are in.
printlengthinedges=f
(ple) Print the length of contigs in edges.

Processing parameters

absorbrc=t
(arc) Absorb reverse-complements as well as normal orientation.
absorbmatch=t
(am) Absorb exact matches of contigs.
absorbcontainment=t
(ac) Absorb full containments of contigs.
findoverlap=f
(fo) Find overlaps between contigs (containments and non-containments). Necessary for clustering.
uniqueonly=f
(uo) If true, all copies of duplicate reads will be discarded, rather than keeping 1.
rmn=f
(requirematchingnames) If true, both names and sequence must match.
usejni=f
(jni) Do alignments in C code, which is faster, if an edit distance is allowed. This will require compiling the C code; details are in /jni/README.txt.

Subset parameters

subsetcount=1
(sstc) Number of subsets used to process the data; higher uses less memory.
subset=0
(sst) Only process reads whose ((ID%subsetcount)==subset).

Clustering parameters

cluster=f
(c) Group overlapping contigs into clusters.
pto=f
(preventtransitiveoverlaps) Do not look for new edges between nodes in the same cluster.
minclustersize=1
(mcs) Do not output clusters smaller than this.
pbr=f
(pickbestrepresentative) Only output the single highest-quality read per cluster.

Cluster postprocessing parameters

processclusters=f
(pc) Run the cluster processing phase, which performs the selected operations in this category. For example, pc AND cc must be enabled to perform cc.
fixmultijoins=t
(fmj) Remove redundant overlaps between the same two contigs.
removecycles=t
(rc) Remove all cycles so clusters form trees.
cc=t
(canonicizeclusters) Flip contigs so clusters have a single orientation.
fcc=f
(fixcanoncontradictions) Truncate graph at nodes with canonization disputes.
foc=f
(fixoffsetcontradictions) Truncate graph at nodes with offset disputes.
mst=f
(maxspanningtree) Remove cyclic edges, leaving only the longest edges that form a tree.

Overlap Detection Parameters

exact=t
(ex) Only allow exact symbol matches. When false, an 'N' will match any symbol.
touppercase=t
(tuc) Convert input bases to upper-case; otherwise, lower-case will not match.
maxsubs=0
(s) Allow up to this many mismatches (substitutions only, no indels). May be set higher than maxedits.
maxedits=0
(e) Allow up to this many edits (subs or indels). Higher is slower.
minidentity=100
(mid) Absorb contained sequences with percent identity of at least this (includes indels).
minlengthpercent=0
(mlp) Smaller contig must be at least this percent of larger contig's length to be absorbed.
minoverlappercent=0
(mop) Overlap must be at least this percent of smaller contig's length to cluster and merge.
minoverlap=200
(mo) Overlap must be at least this long to cluster and merge.
depthratio=0
(dr) When non-zero, overlaps will only be formed between reads with a depth ratio of at most this. Should be above 1. Depth is determined by parsing the read names; this information can be added by running KmerNormalize (khist.sh, bbnorm.sh, or ecc.sh) with the flag 'rename'
k=31
Seed length used for finding containments and overlaps. Anything shorter than k will not be found.
numaffixmaps=1
(nam) Number of prefixes/suffixes to index per contig. Higher is more sensitive, if edits are allowed.
hashns=f
Set to true to search for matches using kmers containing Ns. Can lead to extreme slowdown in some cases.

Other Parameters

qtrim=f
Set to qtrim=rl to trim leading and trailing Ns.
trimq=6
Quality trim level.
forcetrimleft=-1
(ftl) If positive, trim bases to the left of this position (exclusive, 0-based).
forcetrimright=-1
(ftr) If positive, trim bases to the right of this position (exclusive, 0-based).

Java Parameters

-Xmx
This will set Java's memory usage, overriding autodetection. -Xmx20g will specify 20 gigs of RAM, and -Xmx200m will specify 200 megs. The max is typically 85% of physical memory.
-eoom
This flag will cause the process to exit if an out-of-memory exception occurs. Requires Java 8u92+.
-da
This flag will disable assertions.

Common Use Cases

Exact Duplicate Removal Only

dedupe.sh in=assembly.fa out=unique.fa ac=f

The "ac=f" flag disables containment removal, keeping only exact duplicate removal.

Finding Duplicate Sequences

dedupe.sh in=sequences.fa out=unique.fa outd=duplicates.fa

All removed sequences will end up in "duplicates.fa" for analysis.

Deduplication of Multiple Files

dedupe.sh in=file1.fa,file2.fa,file3.fa out=merged_unique.fa

Using comma-separated input allows each file to be read in different streams, increasing throughput compared to concatenation.

Allowing Mismatches

dedupe.sh in=data.fa out=unique.fa s=5 e=2

This allows up to 5 substitutions or 2 edits. The edit distance specifies the banded aligner bandwidth, so more than 2 insertions or deletions would go out of bounds.

Identity-Based Deduplication

dedupe.sh in=sequences.fa out=unique.fa minidentity=99

Considers sequences duplicates if their identity is at least 99%. For 1000bp sequences, this allows up to 10 substitutions. Combine with "e=5" to allow up to 5 indels.

Clustering by Overlap

dedupe.sh in=reads.fq pattern=cluster%.fq ac=f am=f s=1 mo=200 c pc csf=stats.txt outbest=best.fq fo c mcs=3 cc dot=graph.dot

Finds overlaps using minimum overlap length of 200bp and allowing 1 substitution, then clusters reads and writes clusters of size ≥3 to individual files. The best representative of each cluster goes to outbest.fq.

PacBio 16S Amplicon Clustering

# First, filter by length and quality
reformat.sh in=reads_of_insert.fastq out=filtered.fq minlen=1420 maxlen=1640 maq=20 qin=33

# Then cluster allowing up to 26 edits (for ~99% accurate 1540bp amplicons)
dedupe.sh in=filtered.fq csf=stats_e26.txt outbest=best_e26.fq qin=33 usejni=t am=f ac=f fo c rnc=f mcs=3 k=27 mo=1420 ow cc pto nam=4 e=26 pattern=cluster_%.fq dot=graph.dot

This workflow first filters probable chimeras by length, then uses 4 nonoverlapping 27-mers as seeds from each sequence end, with minimum overlap of 1420bp to ensure high-quality clustering.

Set Operations

Dedupe can perform arbitrary set operations (intersection, union, subtraction) using the "uniqueonly" flag, which discards all copies of sequences that have duplicates rather than retaining exactly one copy.

Set Creation

dedupe.sh in=X.fa out=X2.fa ac=f
dedupe.sh in=Y.fa out=Y2.fa ac=f

This ensures X2 and Y2 are proper sets with no duplicates.

Set Union

dedupe.sh in=X.fa,Y.fa out=union.fa ac=f

Set Subtraction

dedupe.sh in=X2.fa,union.fa out=Y2_minus_X2.fa uniqueonly ac=f

Set Intersection

dedupe.sh in=X2.fa,symmetric_difference.fa out=intersection.fa uniqueonly ac=f

Algorithm Details

Memory Usage

Dedupe stores all unique sequences in memory. The cost is approximately 500 bytes per unique sequence, plus the sequences themselves (1 byte per base). The current implementation is strictly limited to 2 billion unique sequences regardless of available memory.

Performance Characteristics

JNI Acceleration

Optional C component (by Jonathan Rood) accelerates overlap and containment detection by at least double when edit distance is allowed. Activate with the "jni" flag after compiling (see /bbmap/jni/README.txt). Highly recommended with "pto" flag for amplicon clustering with edit distance.

Data Structures

The algorithm uses LinkedHashMap<Long, ArrayList<Unit>> for sequence code mapping (capacity 4,000,000) and HashMap<LongM, ArrayList<Unit>> for affix map storage. Hash tables are filled with sequence hash codes and k-mer affixes from sequence ends when containment or overlap detection is enabled.

Edit Distance and Identity

The relationship between edit distance (e), substitutions (s), and identity is nuanced:

Amino Acid Support

Dedupe supports amino acid sequences via the 'amino' flag, which automatically:

Limitations and Considerations

Read Deduplication

While Dedupe can deduplicate read sets (including paired reads), it's not particularly memory-efficient for this role given modern sequencing volumes. A 10M read MiSeq run is easily handled, but a HiSeq lane with 300M reads might require hundreds of GB of RAM. For such cases, sorting-based deduplication methods (like mapping and deduplicating by position) are more efficient.

Paired Read Limitations

When processing paired reads, some operations are restricted to single-threaded processing to avoid non-deterministic output. Pair support is limited to exact matches and overlaps, not containments.

Dedupe vs Dedupe2

Dedupe and Dedupe2 are identical except Dedupe2 allows unlimited affixes (k-mer prefixes/suffixes for seeding overlap detection). This is useful for overlaps with low identity since k-mers require exact matches. More affixes use more memory and slow processing, so use judiciously. The system automatically chooses the appropriate version based on the "nam" parameter.

Support

For questions and support: