KmerCountMulti

Script: kmercountmulti.sh Package: jgi Class: KmerCountMulti.java

Estimates cardinality of unique kmers in sequence data. Processes multiple kmer lengths simultaneously using the LogLog probabilistic counting algorithm with 64-bit hash functions and configurable bucket arrays.

Basic Usage

kmercountmulti.sh in=<file> sweep=<20,100,8> out=<histogram output>

KmerCountMulti counts unique kmers across multiple kmer lengths simultaneously, providing estimates of sequence complexity and diversity. The tool uses the LogLog probabilistic data structure with configurable accuracy vs. memory tradeoffs.

Parameters

Parameters are organized by function within the kmer counting workflow. All parameters from the shell script are documented below.

Input/Output Parameters

in=<file>
(in1) Input file, or comma-delimited list of files. Supports FASTA/FASTQ format, gzipped or uncompressed. Can use stdin with in=stdin.
in2=<file>
Optional second file for paired reads. When specified, both files are processed together for kmer counting.
out=<file>
Histogram output destination. Default is stdout. Output format is tab-delimited with columns for kmer length and estimated count.

Kmer Configuration

k=
Comma-delimited list of kmer lengths to use. Example: k=21,25,31. Each kmer length is processed independently and results are reported separately.
sweep=min,max,incr
Use incremented kmer values from min to max with specified increment. For example, sweep=20,26,2 is equivalent to k=20,22,24,26. This is convenient for testing multiple kmer lengths systematically.

Counting Parameters

buckets=2048
Use this many buckets for counting; higher values decrease variance for large datasets. Must be a power of 2. More buckets use more memory but provide better accuracy.
seed=-1
Use this seed for hash functions. A negative number forces a random seed. Setting a specific positive seed ensures reproducible results across runs.
minprob=0
Set to a value between 0 and 1 to exclude kmers with a lower probability of being correct. This filters out likely erroneous kmers based on quality scores.
hashes=1
(ways) Use this many hash functions. More hashes yield greater accuracy, but H hashes takes H times as long to compute. Each hash function provides an independent estimate.

Output Control

stdev=f
(showstdev, showstddev, stddev) Print standard deviations in the output when using multiple hash functions. Provides error estimates for the cardinality calculations.

Java Parameters

-Xmx
This will set Java's memory usage, overriding autodetection. -Xmx20g will specify 20 gigs of RAM, and -Xmx200m will specify 200 megs. The max is typically 85% of physical memory.
-eoom
This flag will cause the process to exit if an out-of-memory exception occurs. Requires Java 8u92+.
-da
Disable assertions.

File Name Shortcuts

# symbol substitution
The # symbol will be substituted for 1 and 2 in file names. For example: kmercountmulti.sh in=read#.fq is equivalent to kmercountmulti.sh in1=read1.fq in2=read2.fq

Examples

Basic Kmer Counting

kmercountmulti.sh in=reads.fq sweep=15,35,2 out=kmer_histogram.txt

Count kmers of lengths 15, 17, 19, 21, 23, 25, 27, 29, 31, 33, 35 and output the histogram to a file.

Paired-End Reads

kmercountmulti.sh in1=reads_R1.fq in2=reads_R2.fq k=21,25,31 out=paired_kmers.txt

Process paired-end reads for specific kmer lengths 21, 25, and 31.

High Accuracy Counting

kmercountmulti.sh in=large_dataset.fq k=21 buckets=8192 hashes=3 stdev=t

Use more buckets and multiple hash functions for higher accuracy on large datasets, with standard deviation reporting.

Quality Filtering

kmercountmulti.sh in=reads.fq k=21 minprob=0.8 out=filtered_kmers.txt

Only count kmers with at least 80% probability of being correct based on quality scores.

Using File Name Shortcuts

kmercountmulti.sh in=sample#.fastq sweep=20,30,5

Process sample1.fastq and sample2.fastq for kmer lengths 20, 25, 30.

Algorithm Details

LogLog Probabilistic Counting Implementation

KmerCountMulti uses the LogLog cardinality estimation algorithm implemented in the CardinalityTracker class hierarchy. The core implementation details include:

Cardinality Calculation Method

The cardinality() method in LogLog.java implements the core estimation algorithm:

MultiLogLog Architecture

The MultiLogLog class coordinates multiple CardinalityTracker instances:

Threading and Concurrency

The ProcessThread implementation provides concurrent read processing:

Output Generation

The writeOutput() method produces tab-delimited results:

Quality-Based Filtering Implementation

When minprob is configured, the tool filters kmers through quality score analysis in the CardinalityTracker base class, excluding kmers below the probability threshold before hash calculation.

Memory Usage Characteristics

Memory consumption scales predictably with configuration parameters:

Support

For questions and support: