CountDuplicates
Counts duplicate sequences probabilistically, using around 20 bytes per unique read. Read pairs are treated as a single read. Reads are converted to a hashcode and only the hashcode is stored when tracking duplicates, so (rare) hash collisions will result in false positive duplicate detection. Optionally outputs the deduplicated and/or duplicate reads.
Basic Usage
countduplicates.sh in=<input file>
Input may be fasta, fastq, or sam, compressed or uncompressed. The in2, out2, and outd2 parameters are accepted for paired files.
Parameters
Parameters are grouped below by their function in the duplicate-detection process. The tool identifies duplicates by probabilistic hashing, which keeps memory usage low.
Standard parameters
- in=<file>
- Primary input, or read 1 input. Accepts fasta, fastq, or sam formats, compressed or uncompressed.
- out=<file>
- Optional output for deduplicated reads. Only reads determined to be unique (not duplicates) will be written to this file.
- outd=<file>
- Optional output for reads identified as duplicates. An extension like .fq will output full reads; .txt will output headers only.
- stats=stdout
- Destination for statistics output. May be replaced by a filename to write stats to a file instead of stdout. Statistics include duplicate fraction, duplication rate, and copy count distribution.
- showspeed=t
- (ss) Set to 'f' to suppress display of processing speed during execution.
Processing parameters (these are NOT mutually exclusive)
- bases=t
- Include bases when generating hashcodes. When true, sequence content is used to identify duplicates. This is the primary method for duplicate detection.
- names=f
- Include names (headers) when generating hashcodes. When true, read names are incorporated into the hash function for duplicate detection.
- qualities=f
- Include qualities when generating hashcodes. When true, quality scores are used in addition to sequence content for duplicate identification.
- maxfraction=-1.0
- Set to a number between 0 and 1 to FAIL input in which more than this fraction of reads have duplicates. The tool will exit with the failcode value if the duplicate fraction exceeds this threshold. Negative values disable this check.
- maxrate=-1.0
- Set to a number >=1 to FAIL input that exceeds this average duplication rate (the average number of copies per unique sequence). For example, if every sequence appears exactly twice, the duplicate fraction is 0.5 and the duplication rate is 2.0. Negative values disable this check.
- failcode=0
- Set to some other number like 1 to produce a non-zero exit code for failed input. Used when maxfraction or maxrate thresholds are exceeded.
- samplerate=1.0
- Fraction of reads to subsample, to conserve memory. Sampling is deterministic: if a read is sampled, its copies will be too. Unsampled reads are not sent to any output stream or counted in statistics. Values between 0 and 1 reduce memory usage proportionally.
Java Parameters
- -Xmx
- This will set Java's memory usage, overriding autodetection. -Xmx20g will specify 20 gigs of RAM, and -Xmx200m will specify 200 megs. The max is typically 85% of physical memory. Default is 4GB for CountDuplicates.
- -eoom
- This flag will cause the process to exit if an out-of-memory exception occurs. Requires Java 8u92+.
- -da
- Disable assertions. Can provide minor performance improvement in production environments.
Examples
Basic Duplicate Counting
countduplicates.sh in=reads.fq stats=dup_stats.txt
Count duplicates in reads.fq and write statistics to dup_stats.txt. No read output files are generated.
Deduplication with Output
countduplicates.sh in=reads.fq out=unique.fq outd=duplicates.fq
Remove duplicates, writing unique reads to unique.fq and duplicate reads to duplicates.fq.
Quality-Based Duplicate Detection
countduplicates.sh in=reads.fq bases=t qualities=t out=unique.fq
Use both sequence content and quality scores for duplicate detection. More stringent than bases-only.
Memory Conservation with Sampling
countduplicates.sh in=large_dataset.fq samplerate=0.1 stats=sample_stats.txt
Process only 10% of reads to reduce memory usage while still getting duplicate statistics.
Quality Control with Thresholds
countduplicates.sh in=reads.fq maxfraction=0.8 failcode=1
Fail with exit code 1 if more than 80% of reads are duplicates, indicating potential library preparation issues.
Algorithm Details
CountDuplicates implements probabilistic duplicate detection using KmerTableSet hash tables, configured for 12-mers, to count duplicates within a constrained memory footprint.
Hash-Based Duplicate Detection
The algorithm converts each read into a hash code using the hash() method with multi-component XOR operations:
- Base hashing: Processes sequence bytes through the hash() function, using a pre-computed hashcodes array with 32 modes, salt initialization, and Long.rotateLeft(code, 1) bit rotation
- Name hashing: Applies hash() to the read identifier bytes with offset parameter 7
- Quality hashing: Processes quality scores through the same hash() function, with paired-read offset handling
- Pair handling: Combines paired reads using XOR and Long.rotateRight(hash, 3) with offset parameters (0 for R1, 1 for R2)
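As an illustration, the scheme above might look like the following minimal Java sketch. The constants (32 modes, the rotation amounts, the Random seed) and the final masking follow the description, but the class itself, the way the offset parameter enters the table lookup, and the per-byte mixing order are assumptions, not the BBTools source.

import java.util.Random;

// Illustrative sketch of the multi-component XOR hashing described above.
// Not the BBTools source; see the lead-in for which details are assumed.
public class ReadHashSketch {

    static final int MODES = 32;
    static final long[][] hashcodes = new long[MODES][256]; // pre-computed random codes
    static final long salt;

    static {
        Random randy = new Random(173);
        salt = randy.nextLong(); // salt initialization, as described above
        for (long[] row : hashcodes) {
            for (int b = 0; b < 256; b++) {row[b] = randy.nextLong();}
        }
    }

    // Hash a byte array (bases, name, or qualities); offset selects the
    // starting mode (e.g., 7 for names, per the description).
    static long hash(byte[] data, int offset) {
        long code = salt;
        for (int i = 0; i < data.length; i++) {
            code = Long.rotateLeft(code, 1); // rotation-based mixing
            code ^= hashcodes[(i + offset) % MODES][data[i] & 0xFF];
        }
        return code;
    }

    // Pair handling: offset 0 for R1, 1 for R2, combined with XOR and a
    // right-rotation, then masked to a nonnegative 63-bit code.
    static long hashPair(byte[] r1, byte[] r2) {
        long h1 = hash(r1, 0);
        long h2 = (r2 == null ? 0 : Long.rotateRight(hash(r2, 1), 3));
        return (h1 ^ h2) & Long.MAX_VALUE;
    }

    public static void main(String[] args) {
        System.out.println(hashPair("ACGTACGT".getBytes(), "TTGGCCAA".getBytes()));
    }
}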
Memory Implementation
Achieves ~20 bytes per unique read through the KmerTableSet architecture:
- KmerTableSet with 12-mer precision: Uses tables.allocateTables() to create HashArray1D structures for distributed hash storage
- Hash code storage only: Stores only Long.MAX_VALUE-masked hash codes, discarding sequence content after hashing
- HashArray1D increment operations: Uses table.increment(code, 1) for atomic count updates without storing duplicates
- Bit-mask sampling: Implements (code&sampleMask)<sampleThresh with sampleMask=1023 for deterministic subsampling
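A minimal sketch of that deterministic subsampling test follows. sampleMask=1023 comes from the description above; the mapping from samplerate to sampleThresh is an assumed detail, not the BBTools source. Because the decision depends only on the hash code, every copy of a duplicate receives the same keep/skip decision.

// Sketch of deterministic, hash-based subsampling; assumptions noted above.
public class SampleSketch {

    static final long sampleMask = 1023;

    // Convert a sampling fraction into a threshold over the masked hash range.
    static long toThresh(double samplerate) {
        return (long) Math.ceil(samplerate * (sampleMask + 1));
    }

    // A read is kept iff its masked hash code falls below the threshold.
    static boolean sampled(long code, long sampleThresh) {
        return (code & sampleMask) < sampleThresh;
    }

    public static void main(String[] args) {
        long thresh = toThresh(0.1); // samplerate=0.1 keeps ~1 in 10 codes
        long kept = 0;
        for (long code = 0; code < 100000; code++) {
            if (sampled(code, thresh)) {kept++;}
        }
        System.out.println(kept + " of 100000 codes kept"); // ~10000
    }
}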
Collision Resistance
Hash collision mitigation through multiple algorithmic features:
- 64-bit hash space using Long data type with MAX_VALUE masking
- Random salt initialization (salt=new Random(173).nextLong()) prevents systematic patterns
- Rotation-based mixing: Long.rotateLeft(code, 1) and mode-dependent hashcode lookup prevent clustering
- Collision probability: ~1/(2^63) per hash comparison given uniform distribution
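As a rough sanity check on that figure, a standard birthday-bound estimate (not from the BBTools source) puts the expected number of colliding pairs among n unique reads in a 63-bit space at about n*(n-1)/2^64:

// Back-of-envelope birthday bound for false-positive duplicates.
public class CollisionEstimate {
    public static void main(String[] args) {
        double space = Math.pow(2, 63);  // 63-bit space after Long.MAX_VALUE masking
        long n = 1_000_000_000L;         // one billion unique reads
        double expected = (double) n * (n - 1) / 2.0 / space;
        System.out.printf("Expected colliding pairs: %.4f%n", expected); // ~0.054
    }
}

Even at a billion unique reads, fewer than 0.06 colliding pairs are expected, consistent with the overview's note that collisions are rare.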
Threading Architecture
- Thread capping: Shared.capThreads(20) caps the worker thread count at 20; fewer threads are used on systems with fewer available cores
- Critical section synchronization: Uses synchronized(tables) for thread-safe HashArray1D.increment() operations
- Thread-local processing: Each ProcessThread maintains separate LongList codes buffer (Shared.bufferLen() size)
- Lock-free hashing: Hash computation via hashList() occurs outside synchronized blocks for parallelization
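The buffered counting pattern described above can be sketched as follows. A HashMap stands in for KmerTableSet/HashArray1D, and the buffer length and class shapes are illustrative assumptions, not the BBTools source.

import java.util.ArrayList;
import java.util.HashMap;
import java.util.Map;

// Simplified sketch of thread-local buffering plus synchronized increments.
public class CountingSketch {

    static final int BUFFER_LEN = 16384; // stand-in for Shared.bufferLen()
    static final Map<Long, Integer> tables = new HashMap<>();

    static class ProcessThread extends Thread {
        // Thread-local buffer of hash codes, filled outside any lock.
        final ArrayList<Long> codes = new ArrayList<>(BUFFER_LEN);

        void add(long code) {
            codes.add(code); // lock-free: hashing and buffering need no lock
            if (codes.size() >= BUFFER_LEN) {flush();}
        }

        void flush() {
            synchronized (tables) { // critical section: batch the increments
                for (long code : codes) {tables.merge(code, 1, Integer::sum);}
            }
            codes.clear();
        }
    }

    public static void main(String[] args) {
        ProcessThread t = new ProcessThread();
        for (long i = 0; i < 100; i++) {t.add(i % 10);} // fake hash codes
        t.flush();
        System.out.println(tables); // each code 0..9 counted 10 times
    }
}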
Statistical Analysis
Generates quantitative metrics through tables.fillHistogram() analysis:
- Duplication fraction: Calculated as (sumTotal-counts[1])/sumTotal, where counts[1] is the number of reads seen exactly once
- Average duplication rate: Computed as sumTotal/sumUnique ratio from histogram analysis
- Copy distribution: Uses modulo arithmetic (i%j==0) to calculate multiplicity arrays for 2x, 3x, etc. duplicates
- Throughput reporting: Reports time, reads, and bases processed through Tools.timeReadsBasesProcessed()
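The first two formulas can be illustrated with a short sketch over a copy-count histogram; the counts values below are invented for demonstration.

// Deriving the reported metrics from a copy-count histogram, per the
// formulas above. counts[i] = number of distinct hash codes seen exactly
// i times.
public class StatsSketch {
    public static void main(String[] args) {
        long[] counts = {0, 800, 75, 10, 5};

        long sumUnique = 0, sumTotal = 0;
        for (int i = 1; i < counts.length; i++) {
            sumUnique += counts[i];            // distinct sequences
            sumTotal += (long) i * counts[i];  // total reads
        }

        double dupFraction = (sumTotal - counts[1]) / (double) sumTotal; // reads with a duplicate
        double dupRate = sumTotal / (double) sumUnique;                  // copies per unique sequence

        // 1000 reads, 890 unique: fraction=0.2000, rate=1.1236
        System.out.printf("fraction=%.4f rate=%.4f%n", dupFraction, dupRate);
    }
}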
Support
For questions and support:
- Email: bbushnell@lbl.gov
- Documentation: bbmap.org