CountDuplicates

Script: countduplicates.sh
Package: jgi
Class: CountDuplicates.java

Counts duplicate sequences probabilistically, using around 20 bytes per unique read. Read pairs are treated as a single read. Reads are converted to a hashcode and only the hashcode is stored when tracking duplicates, so (rare) hash collisions will result in false positive duplicate detection. Optionally outputs the deduplicated and/or duplicate reads.
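The core idea can be illustrated with a short Python sketch (illustrative only, not the Java implementation): only a hashcode is retained per read, so memory scales with the number of unique reads rather than total bases.

```python
from collections import defaultdict

def count_duplicates(reads):
    """Count duplicates while storing only one hashcode per unique read.

    Sketch of the approach: hash() stands in for the tool's real read
    hash function, and a read pair would be hashed as one combined record.
    """
    counts = defaultdict(int)
    for seq in reads:
        counts[hash(seq)] += 1  # only the hashcode is kept, not the sequence
    total = sum(counts.values())
    duplicates = total - len(counts)  # reads beyond the first copy of each hash
    return duplicates, total

dups, total = count_duplicates(["ACGT", "ACGT", "TTTT", "ACGT", "GGGG"])
# 5 reads, 3 unique hashcodes -> 2 duplicate reads
```

Because distinct reads can, rarely, share a hashcode, this scheme overcounts duplicates slightly; that is the false-positive behavior noted above.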

Basic Usage

countduplicates.sh in=<input file>

Input may be fasta, fastq, or sam, compressed or uncompressed. The in2, out2, and outd2 parameters are accepted for paired files.

Parameters

Parameters are grouped by function; the value shown after each parameter name is its default.

Standard parameters

in=<file>
Primary input, or read 1 input. Accepts fasta, fastq, or sam formats, compressed or uncompressed.
out=<file>
Optional output for deduplicated reads. Only reads determined to be unique (not duplicates) will be written to this file.
outd=<file>
Optional output for duplicate reads. An extension like .fq will output full reads; .txt will output headers only. Contains reads identified as duplicates.
stats=stdout
Destination for statistics output. May be replaced by a filename to write stats to a file instead of stdout. Statistics include duplicate fraction, duplication rate, and copy count distribution.
showspeed=t
(ss) Set to 'f' to suppress display of processing speed during execution.

Processing parameters (these are NOT mutually exclusive)

bases=t
Include bases when generating hashcodes. When true, sequence content is used to identify duplicates. This is the primary method for duplicate detection.
names=f
Include names (headers) when generating hashcodes. When true, read names are incorporated into the hash function for duplicate detection.
qualities=f
Include qualities when generating hashcodes. When true, quality scores are used in addition to sequence content for duplicate identification.
maxfraction=-1.0
Set to a positive number 0-1 to FAIL input that exceeds this fraction of reads with duplicates. The tool will exit with the specified fail code if the duplicate fraction exceeds this threshold. Negative values disable this check.
maxrate=-1.0
Set to a positive number >=1 to FAIL input that exceeds this average duplication rate (the number of copies per read). Values >=1 indicate the average number of copies per unique sequence. Negative values disable this check.
failcode=0
Set to some other number like 1 to produce a non-zero exit code for failed input. Used when maxfraction or maxrate thresholds are exceeded.
samplerate=1.0
Fraction of reads to subsample, to conserve memory. Sampling is deterministic - if a read is sampled, copies will be too. Unsampled reads are not sent to any output stream or counted in statistics. Values between 0 and 1 reduce memory usage proportionally.
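The deterministic property of samplerate can be sketched as follows (a hypothetical Python illustration, not the tool's actual sampling function): the keep/drop decision depends only on the read's content, so every copy of a duplicate read receives the same verdict.

```python
import hashlib

def sampled(read_key: str, samplerate: float) -> bool:
    """Decide deterministically whether a read is sampled.

    Illustrative: hashing the read content to a value in [0, 1) and
    comparing against samplerate means identical reads always agree,
    so if one copy is sampled, all its copies are too.
    """
    h = int.from_bytes(hashlib.sha1(read_key.encode()).digest()[:8], "big")
    return (h / 2**64) < samplerate

# Identical reads always receive the same decision:
assert sampled("ACGTACGT", 0.1) == sampled("ACGTACGT", 0.1)
```

This is why duplicate statistics remain meaningful under subsampling: a duplicate family is either sampled in full or skipped in full, rather than being split.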

Java Parameters

-Xmx
This will set Java's memory usage, overriding autodetection. -Xmx20g will specify 20 gigs of RAM, and -Xmx200m will specify 200 megs. The max is typically 85% of physical memory. Default is 4GB for CountDuplicates.
-eoom
This flag will cause the process to exit if an out-of-memory exception occurs. Requires Java 8u92+.
-da
Disable assertions. Can provide minor performance improvement in production environments.

Examples

Basic Duplicate Counting

countduplicates.sh in=reads.fq stats=dup_stats.txt

Count duplicates in reads.fq and write statistics to dup_stats.txt. No output files are generated, only statistics.

Deduplication with Output

countduplicates.sh in=reads.fq out=unique.fq outd=duplicates.fq

Remove duplicates, writing unique reads to unique.fq and duplicate reads to duplicates.fq.

Quality-Based Duplicate Detection

countduplicates.sh in=reads.fq bases=t qualities=t out=unique.fq

Use both sequence content and quality scores for duplicate detection. More stringent than bases-only.

Memory Conservation with Sampling

countduplicates.sh in=large_dataset.fq samplerate=0.1 stats=sample_stats.txt

Process only 10% of reads to reduce memory usage while still getting duplicate statistics.

Quality Control with Thresholds

countduplicates.sh in=reads.fq maxfraction=0.8 failcode=1

Fail with exit code 1 if more than 80% of reads are duplicates, indicating potential library preparation issues.
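The threshold logic behind maxfraction, maxrate, and failcode amounts to a simple check over the copy counts. A minimal sketch (hypothetical Python, assuming per-hashcode copy counts are available):

```python
def check_thresholds(copy_counts, maxfraction=-1.0, maxrate=-1.0, failcode=1):
    """Sketch of the maxfraction/maxrate QC checks.

    copy_counts: number of copies seen for each unique hashcode.
    Returns the exit code: failcode on a threshold violation, else 0.
    Negative thresholds disable the corresponding check.
    """
    total = sum(copy_counts)
    unique = len(copy_counts)
    dup_fraction = (total - unique) / total  # fraction of reads that are copies
    dup_rate = total / unique                # average copies per unique read
    if 0 <= maxfraction < dup_fraction:
        return failcode
    if 1 <= maxrate < dup_rate:
        return failcode
    return 0

# 10 reads, 6 unique -> duplicate fraction 0.4, duplication rate ~1.67
assert check_thresholds([5, 1, 1, 1, 1, 1], maxfraction=0.8) == 0
```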

Algorithm Details

CountDuplicates implements probabilistic duplicate detection by storing read hashcodes and their copy counts in KmerTableSet hash table structures, enabling memory-efficient duplicate counting.

Hash-Based Duplicate Detection

The algorithm converts each read (or read pair) into a single hashcode via the hash() method, XORing together components derived from the bases and, when enabled, the names and qualities.
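A minimal sketch of this XOR-combination idea (illustrative Python, not the Java hash() implementation; component tags and the SHA-1 helper are assumptions):

```python
import hashlib

def _component_hash(tag: str, data: str) -> int:
    """Hash one component (bases, name, or qualities) to a 64-bit value."""
    digest = hashlib.sha1(tag.encode() + b"\x00" + data.encode()).digest()
    return int.from_bytes(digest[:8], "big")

def read_hash(bases, name=None, quals=None):
    """Combine the enabled components into one hashcode via XOR.

    Mirrors the bases=/names=/qualities= flags: omitted components
    simply do not contribute to the final hashcode.
    """
    h = 0
    if bases is not None:
        h ^= _component_hash("bases", bases)
    if name is not None:
        h ^= _component_hash("name", name)
    if quals is not None:
        h ^= _component_hash("quals", quals)
    return h

# With names=f and qualities=f, only sequence content matters:
assert read_hash("ACGT") == read_hash("ACGT")
```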

Memory Implementation

Only the hashcode and a copy count are stored per unique read, so memory usage is approximately 20 bytes per unique read within the KmerTableSet architecture.
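That figure gives a quick way to estimate capacity for a given heap size (a back-of-the-envelope sketch; actual usage varies with hash table load factor and JVM overhead):

```python
def max_unique_reads(heap_bytes: int, bytes_per_read: int = 20) -> int:
    """Rough capacity estimate from the ~20 bytes/unique-read figure.

    bytes_per_read is the approximate per-entry cost quoted in this
    documentation, not a measured constant.
    """
    return heap_bytes // bytes_per_read

# A 4 GB heap (-Xmx4g) holds on the order of 200 million unique reads:
assert max_unique_reads(4 * 1024**3) > 200_000_000
```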

Collision Resistance

Because only hashcodes are stored, two distinct reads that hash to the same value are indistinguishable and will be counted as duplicates (a false positive). With 64-bit hashcodes, such collisions are rare even for very large inputs.
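The expected number of false-positive collisions follows the standard birthday bound; this small calculation (assuming 64-bit hashcodes, as stated above) shows why collisions are negligible in practice:

```python
def expected_collisions(n_reads: int, hash_bits: int = 64) -> float:
    """Birthday-bound estimate of false-positive duplicate pairs.

    Among n distinct reads hashed uniformly into 2**bits values, the
    expected number of colliding pairs is about n*(n-1) / 2**(bits+1).
    """
    return n_reads * (n_reads - 1) / 2 ** (hash_bits + 1)

# Even a billion unique reads yield well under one expected collision:
assert expected_collisions(10**9) < 0.05
```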

Statistical Analysis

The copy-count histogram is collected via tables.fillHistogram(); from it, the duplicate fraction, duplication rate, and copy count distribution reported in the stats output are derived.
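The derivation of these metrics from per-hashcode copy counts can be sketched as follows (hypothetical Python, not the tables.fillHistogram() implementation):

```python
from collections import Counter

def duplicate_stats(copy_counts):
    """Compute the reported metrics from per-hashcode copy counts.

    Returns the duplicate fraction, the duplication rate (average
    copies per unique read), and the copy-count histogram (how many
    hashcodes were seen once, twice, three times, ...).
    """
    total = sum(copy_counts)
    unique = len(copy_counts)
    return {
        "duplicate_fraction": (total - unique) / total,
        "duplication_rate": total / unique,
        "copy_histogram": dict(Counter(copy_counts)),
    }

stats = duplicate_stats([1, 1, 1, 2, 5])
# 10 reads, 5 unique: fraction 0.5, rate 2.0, histogram {1: 3, 2: 1, 5: 1}
```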

Support

For questions and support, please contact Brian Bushnell at bbushnell@lbl.gov.