CountDuplicates

Script: countduplicates.sh
Package: jgi
Class: CountDuplicates.java

Counts duplicate sequences probabilistically, using around 20 bytes per unique read. Read pairs are treated as a single read. Reads are converted to a hashcode and only the hashcode is stored when tracking duplicates, so (rare) hash collisions will result in false positive duplicate detection. Optionally outputs the deduplicated and/or duplicate reads.
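The core idea can be illustrated with a short Python sketch (illustrative only, not the Java implementation): only a hashcode is retained per read, so memory scales with the number of unique reads rather than total bases.

```python
from collections import defaultdict

def count_duplicates(reads):
    """Count duplicates while storing only one hashcode per unique read.

    Sketch of the approach: hash() stands in for the tool's real read
    hash function, and a read pair would be hashed as one combined record.
    """
    counts = defaultdict(int)
    for seq in reads:
        counts[hash(seq)] += 1  # only the hashcode is kept, not the sequence
    total = sum(counts.values())
    duplicates = total - len(counts)  # reads beyond the first copy of each hash
    return duplicates, total

dups, total = count_duplicates(["ACGT", "ACGT", "TTTT", "ACGT", "GGGG"])
# 5 reads, 3 unique hashcodes -> 2 duplicate reads
```

Because distinct reads can, rarely, share a hashcode, this scheme overcounts duplicates slightly; that is the false-positive behavior noted above.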

Basic Usage

countduplicates.sh in=<input file>

Input may be fasta, fastq, or sam, compressed or uncompressed. The in2, out2, and outd2 parameters are accepted for paired files.

Parameters

Parameters are grouped by function; the value shown after each parameter name is its default.

Standard parameters

in=<file>
Primary input, or read 1 input. Accepts fasta, fastq, or sam formats, compressed or uncompressed.
out=<file>
Optional output for deduplicated reads. Only reads determined to be unique (not duplicates) will be written to this file.
outd=<file>
Optional output for duplicate reads. An extension like .fq will output full reads; .txt will output headers only. Contains reads identified as duplicates.
stats=stdout
Destination for statistics output. May be replaced by a filename to write stats to a file instead of stdout. Statistics include duplicate fraction, duplication rate, and copy count distribution.
showspeed=t
(ss) Set to 'f' to suppress display of processing speed during execution.

Processing parameters (these are NOT mutually exclusive)

bases=t
Include bases when generating hashcodes. When true, sequence content is used to identify duplicates. This is the primary method for duplicate detection.
names=f
Include names (headers) when generating hashcodes. When true, read names are incorporated into the hash function for duplicate detection.
qualities=f
Include qualities when generating hashcodes. When true, quality scores are used in addition to sequence content for duplicate identification.
maxfraction=-1.0
Set to a positive number 0-1 to FAIL input that exceeds this fraction of reads with duplicates. The tool will exit with the specified fail code if the duplicate fraction exceeds this threshold. Negative values disable this check.
maxrate=-1.0
Set to a positive number >=1 to FAIL input that exceeds this average duplication rate (the number of copies per read). Values >=1 indicate the average number of copies per unique sequence. Negative values disable this check.
failcode=0
Set to some other number like 1 to produce a non-zero exit code for failed input. Used when maxfraction or maxrate thresholds are exceeded.
samplerate=1.0
Fraction of reads to subsample, to conserve memory. Sampling is deterministic - if a read is sampled, copies will be too. Unsampled reads are not sent to any output stream or counted in statistics. Values between 0 and 1 reduce memory usage proportionally.
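The deterministic property of samplerate can be sketched as follows (a hypothetical Python illustration, not the tool's actual sampling function): the keep/drop decision depends only on the read's content, so every copy of a duplicate read receives the same verdict.

```python
import hashlib

def sampled(read_key: str, samplerate: float) -> bool:
    """Decide deterministically whether a read is sampled.

    Illustrative: hashing the read content to a value in [0, 1) and
    comparing against samplerate means identical reads always agree,
    so if one copy is sampled, all its copies are too.
    """
    h = int.from_bytes(hashlib.sha1(read_key.encode()).digest()[:8], "big")
    return (h / 2**64) < samplerate

# Identical reads always receive the same decision:
assert sampled("ACGTACGT", 0.1) == sampled("ACGTACGT", 0.1)
```

This is why duplicate statistics remain meaningful under subsampling: a duplicate family is either sampled in full or skipped in full, rather than being split.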

Java Parameters

-Xmx
This will set Java's memory usage, overriding autodetection. -Xmx20g will specify 20 gigs of RAM, and -Xmx200m will specify 200 megs. The max is typically 85% of physical memory. Default is 4GB for CountDuplicates.
-eoom
This flag will cause the process to exit if an out-of-memory exception occurs. Requires Java 8u92+.
-da
Disable assertions. Can provide minor performance improvement in production environments.

Examples

Basic Duplicate Counting

countduplicates.sh in=reads.fq stats=dup_stats.txt

Count duplicates in reads.fq and write statistics to dup_stats.txt. No output files are generated, only statistics.

Deduplication with Output

countduplicates.sh in=reads.fq out=unique.fq outd=duplicates.fq

Remove duplicates, writing unique reads to unique.fq and duplicate reads to duplicates.fq.

Quality-Based Duplicate Detection

countduplicates.sh in=reads.fq bases=t qualities=t out=unique.fq

Use both sequence content and quality scores for duplicate detection. More stringent than bases-only.

Memory Conservation with Sampling

countduplicates.sh in=large_dataset.fq samplerate=0.1 stats=sample_stats.txt

Process only 10% of reads to reduce memory usage while still getting duplicate statistics.

Quality Control with Thresholds

countduplicates.sh in=reads.fq maxfraction=0.8 failcode=1

Fail with exit code 1 if more than 80% of reads are duplicates, indicating potential library preparation issues.
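The threshold logic behind maxfraction, maxrate, and failcode amounts to a simple check over the copy counts. A minimal sketch (hypothetical Python, assuming per-hashcode copy counts are available):

```python
def check_thresholds(copy_counts, maxfraction=-1.0, maxrate=-1.0, failcode=1):
    """Sketch of the maxfraction/maxrate QC checks.

    copy_counts: number of copies seen for each unique hashcode.
    Returns the exit code: failcode on a threshold violation, else 0.
    Negative thresholds disable the corresponding check.
    """
    total = sum(copy_counts)
    unique = len(copy_counts)
    dup_fraction = (total - unique) / total  # fraction of reads that are copies
    dup_rate = total / unique                # average copies per unique read
    if 0 <= maxfraction < dup_fraction:
        return failcode
    if 1 <= maxrate < dup_rate:
        return failcode
    return 0

# 10 reads, 6 unique -> duplicate fraction 0.4, duplication rate ~1.67
assert check_thresholds([5, 1, 1, 1, 1, 1], maxfraction=0.8) == 0
```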

Algorithm Details

CountDuplicates implements probabilistic duplicate detection by storing read hashcodes and their copy counts in KmerTableSet hash table structures, enabling memory-efficient duplicate counting.

Hash-Based Duplicate Detection

The algorithm converts each read (or read pair) into a single hashcode via the hash() method, XORing together components derived from the bases and, when enabled, the names and qualities.
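A minimal sketch of this XOR-combination idea (illustrative Python, not the Java hash() implementation; component tags and the SHA-1 helper are assumptions):

```python
import hashlib

def _component_hash(tag: str, data: str) -> int:
    """Hash one component (bases, name, or qualities) to a 64-bit value."""
    digest = hashlib.sha1(tag.encode() + b"\x00" + data.encode()).digest()
    return int.from_bytes(digest[:8], "big")

def read_hash(bases, name=None, quals=None):
    """Combine the enabled components into one hashcode via XOR.

    Mirrors the bases=/names=/qualities= flags: omitted components
    simply do not contribute to the final hashcode.
    """
    h = 0
    if bases is not None:
        h ^= _component_hash("bases", bases)
    if name is not None:
        h ^= _component_hash("name", name)
    if quals is not None:
        h ^= _component_hash("quals", quals)
    return h

# With names=f and qualities=f, only sequence content matters:
assert read_hash("ACGT") == read_hash("ACGT")
```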

Memory Implementation

Only the hashcode and a copy count are stored per unique read, so memory usage is approximately 20 bytes per unique read within the KmerTableSet architecture.
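That figure gives a quick way to estimate capacity for a given heap size (a back-of-the-envelope sketch; actual usage varies with hash table load factor and JVM overhead):

```python
def max_unique_reads(heap_bytes: int, bytes_per_read: int = 20) -> int:
    """Rough capacity estimate from the ~20 bytes/unique-read figure.

    bytes_per_read is the approximate per-entry cost quoted in this
    documentation, not a measured constant.
    """
    return heap_bytes // bytes_per_read

# A 4 GB heap (-Xmx4g) holds on the order of 200 million unique reads:
assert max_unique_reads(4 * 1024**3) > 200_000_000
```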

Collision Resistance

Because only hashcodes are stored, two distinct reads that hash to the same value are indistinguishable and will be counted as duplicates (a false positive). With 64-bit hashcodes, such collisions are rare even for very large inputs.
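The expected number of false-positive collisions follows the standard birthday bound; this small calculation (assuming 64-bit hashcodes, as stated above) shows why collisions are negligible in practice:

```python
def expected_collisions(n_reads: int, hash_bits: int = 64) -> float:
    """Birthday-bound estimate of false-positive duplicate pairs.

    Among n distinct reads hashed uniformly into 2**bits values, the
    expected number of colliding pairs is about n*(n-1) / 2**(bits+1).
    """
    return n_reads * (n_reads - 1) / 2 ** (hash_bits + 1)

# Even a billion unique reads yield well under one expected collision:
assert expected_collisions(10**9) < 0.05
```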

Statistical Analysis

The copy-count histogram is collected via tables.fillHistogram(); from it, the duplicate fraction, duplication rate, and copy count distribution reported in the stats output are derived.
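The derivation of these metrics from per-hashcode copy counts can be sketched as follows (hypothetical Python, not the tables.fillHistogram() implementation):

```python
from collections import Counter

def duplicate_stats(copy_counts):
    """Compute the reported metrics from per-hashcode copy counts.

    Returns the duplicate fraction, the duplication rate (average
    copies per unique read), and the copy-count histogram (how many
    hashcodes were seen once, twice, three times, ...).
    """
    total = sum(copy_counts)
    unique = len(copy_counts)
    return {
        "duplicate_fraction": (total - unique) / total,
        "duplication_rate": total / unique,
        "copy_histogram": dict(Counter(copy_counts)),
    }

stats = duplicate_stats([1, 1, 1, 2, 5])
# 10 reads, 5 unique: fraction 0.5, rate 2.0, histogram {1: 3, 2: 1, 5: 1}
```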

Support

For questions and support, please contact Brian Bushnell at bbushnell@lbl.gov.