KmerCountMulti
Estimates the number of unique kmers (cardinality) in sequence data. Processes multiple kmer lengths simultaneously using the LogLog probabilistic counting algorithm with 64-bit hash functions and configurable bucket arrays.
Basic Usage
kmercountmulti.sh in=<file> sweep=<20,100,8> out=<histogram output>
KmerCountMulti counts unique kmers across multiple kmer lengths simultaneously, providing estimates of sequence complexity and diversity. The tool uses the LogLog probabilistic data structure with configurable accuracy vs. memory tradeoffs.
Parameters
Parameters are organized by function within the kmer counting workflow. All parameters from the shell script are documented below.
Input/Output Parameters
- in=<file>
- (in1) Input file, or comma-delimited list of files. Supports FASTA/FASTQ format, gzipped or uncompressed. Can use stdin with in=stdin.
- in2=<file>
- Optional second file for paired reads. When specified, both files are processed together for kmer counting.
- out=<file>
- Histogram output destination. Default is stdout. Output format is tab-delimited with columns for kmer length and estimated count.
Kmer Configuration
- k=
- Comma-delimited list of kmer lengths to use. Example: k=21,25,31. Each kmer length is processed independently and results are reported separately.
- sweep=min,max,incr
- Use kmer lengths from min to max, stepping by the given increment. For example, sweep=20,26,2 is equivalent to k=20,22,24,26. This is convenient for systematically testing multiple kmer lengths.
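The sweep expansion can be sketched as a simple loop (a hypothetical illustration of the parameter's behavior, not BBTools code):

```java
import java.util.ArrayList;
import java.util.List;

public class SweepDemo {
    // Expand sweep=min,max,incr into the explicit kmer-length list,
    // mirroring how sweep=20,26,2 becomes k=20,22,24,26.
    public static List<Integer> expandSweep(int min, int max, int incr) {
        List<Integer> ks = new ArrayList<>();
        for (int k = min; k <= max; k += incr) {
            ks.add(k); // each value is processed as an independent kmer length
        }
        return ks;
    }

    public static void main(String[] args) {
        System.out.println(expandSweep(20, 26, 2)); // [20, 22, 24, 26]
    }
}
```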
Counting Parameters
- buckets=2048
- Use this many buckets for counting; higher values decrease variance for large datasets. Must be a power of 2. More buckets use more memory but provide better accuracy.
- seed=-1
- Use this seed for hash functions. A negative number forces a random seed. Setting a specific positive seed ensures reproducible results across runs.
- minprob=0
- Set to a value between 0 and 1 to exclude kmers with a lower probability of being correct. This filters out likely erroneous kmers based on quality scores.
- hashes=1
- (ways) Use this many hash functions. More hashes yield greater accuracy, but H hashes take H times as long to compute. Each hash function provides an independent estimate.
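As a back-of-envelope model (an assumption based on the classic LogLog analysis, not a measured property of this tool), the relative standard error of a LogLog estimate scales roughly as 1.30/sqrt(buckets), and averaging H independent hash functions reduces it by a further factor of sqrt(H):

```java
public class AccuracyModel {
    // Approximate relative standard error of a LogLog estimate with the given
    // bucket count (Durand-Flajolet constant ~1.30), reduced by averaging
    // over independent hash functions.
    public static double relativeError(int buckets, int hashes) {
        return 1.30 / Math.sqrt((double) buckets * hashes);
    }

    public static void main(String[] args) {
        System.out.printf("2048 buckets, 1 hash:   ~%.2f%%%n", 100 * relativeError(2048, 1));
        System.out.printf("8192 buckets, 3 hashes: ~%.2f%%%n", 100 * relativeError(8192, 3));
    }
}
```

This is why the High Accuracy Counting example below raises both buckets and hashes for large datasets.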
Output Control
- stdev=f
- (showstdev, showstddev, stddev) Print standard deviations in the output when using multiple hash functions. Provides error estimates for the cardinality calculations.
Java Parameters
- -Xmx
- This will set Java's memory usage, overriding autodetection. -Xmx20g will specify 20 gigs of RAM, and -Xmx200m will specify 200 megs. The max is typically 85% of physical memory.
- -eoom
- This flag will cause the process to exit if an out-of-memory exception occurs. Requires Java 8u92+.
- -da
- Disable assertions.
File Name Shortcuts
- # symbol substitution
- The # symbol will be substituted for 1 and 2 in file names. For example: kmercountmulti.sh in=read#.fq is equivalent to kmercountmulti.sh in1=read1.fq in2=read2.fq
Examples
Basic Kmer Counting
kmercountmulti.sh in=reads.fq sweep=15,35,2 out=kmer_histogram.txt
Count kmers of lengths 15, 17, 19, 21, 23, 25, 27, 29, 31, 33, 35 and output the histogram to a file.
Paired-End Reads
kmercountmulti.sh in1=reads_R1.fq in2=reads_R2.fq k=21,25,31 out=paired_kmers.txt
Process paired-end reads for specific kmer lengths 21, 25, and 31.
High Accuracy Counting
kmercountmulti.sh in=large_dataset.fq k=21 buckets=8192 hashes=3 stdev=t
Use more buckets and multiple hash functions for higher accuracy on large datasets, with standard deviation reporting.
Quality Filtering
kmercountmulti.sh in=reads.fq k=21 minprob=0.8 out=filtered_kmers.txt
Only count kmers with at least 80% probability of being correct based on quality scores.
Using File Name Shortcuts
kmercountmulti.sh in=sample#.fastq sweep=20,30,5
Process sample1.fastq and sample2.fastq for kmer lengths 20, 25, 30.
Algorithm Details
LogLog Probabilistic Counting Implementation
KmerCountMulti uses the LogLog cardinality estimation algorithm implemented in the CardinalityTracker class hierarchy. The core implementation details include:
- Hash Function: Uses Tools.hash64shift() for 64-bit hash value generation from kmer sequences
- Leading Zero Counting: Applies Long.numberOfLeadingZeros() to hash values for cardinality estimation
- Bucket Distribution: Hash values are distributed across buckets using bitwise masking with bucketMask
- Atomic Operations: Thread-safe updates using AtomicIntegerArray.compareAndSet() for concurrent processing
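The update path above can be sketched as follows. This is an illustrative sketch, not the CardinalityTracker source: hash64shift here is Thomas Wang's well-known 64-bit mix (BBTools' Tools.hash64shift() may differ in detail), and the register layout is simplified.

```java
import java.util.concurrent.atomic.AtomicIntegerArray;

public class LogLogUpdate {
    public static final int BUCKETS = 2048;              // must be a power of 2
    public static final long BUCKET_MASK = BUCKETS - 1;  // mask for bucket selection
    public static final AtomicIntegerArray registers = new AtomicIntegerArray(BUCKETS);

    // Thomas Wang's 64-bit mix, standing in for Tools.hash64shift().
    public static long hash64shift(long key) {
        key = (~key) + (key << 21);
        key = key ^ (key >>> 24);
        key = (key + (key << 3)) + (key << 8);
        key = key ^ (key >>> 14);
        key = (key + (key << 2)) + (key << 4);
        key = key ^ (key >>> 28);
        key = key + (key << 31);
        return key;
    }

    // Record one kmer: pick a bucket from the low hash bits, count leading
    // zeros of the remaining bits, and keep the per-bucket maximum atomically.
    public static void add(long kmer) {
        long hash = hash64shift(kmer);
        int bucket = (int) (hash & BUCKET_MASK);
        // OR-ing in the mask caps the count so bucket bits never affect it.
        int lz = Long.numberOfLeadingZeros(hash | BUCKET_MASK);
        int old;
        while ((old = registers.get(bucket)) < lz) {
            if (registers.compareAndSet(bucket, old, lz)) break; // lock-free max
        }
    }
}
```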
Cardinality Calculation Method
The cardinality() method in LogLog.java implements the core estimation algorithm:
- Restoration Function: Converts leading zero counts back to approximate original values using bit shifting
- Harmonic Mean: Applies compensation factor 0.7213428177 for LogLog algorithm accuracy correction
- Multiple Estimates: Combines results from multiple hash functions using weighted averaging
- Variance Calculation: Provides standard deviation estimates through Tools.standardDeviation()
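The harmonic-mean estimate can be illustrated with a textbook HyperLogLog-style calculation, which uses the same ~0.7213 compensation constant mentioned above. This is a self-contained sketch under that assumption, not the LogLog.java source:

```java
public class CardinalitySketch {
    static final int P = 11;            // 2^11 = 2048 buckets
    static final int M = 1 << P;
    final int[] registers = new int[M];

    // Thomas Wang's 64-bit mix, standing in for Tools.hash64shift().
    public static long hash64shift(long key) {
        key = (~key) + (key << 21);
        key = key ^ (key >>> 24);
        key = (key + (key << 3)) + (key << 8);
        key = key ^ (key >>> 14);
        key = (key + (key << 2)) + (key << 4);
        key = key ^ (key >>> 28);
        key = key + (key << 31);
        return key;
    }

    public void add(long kmer) {
        long hash = hash64shift(kmer);
        int bucket = (int) (hash >>> (64 - P));       // high bits select the bucket
        long w = hash << P;                           // remaining bits
        int rank = Long.numberOfLeadingZeros(w) + 1;  // leftmost 1-bit position
        if (rank > registers[bucket]) registers[bucket] = rank;
    }

    // Harmonic mean of 2^register values, scaled by the compensation factor.
    public double cardinality() {
        double alpha = 0.7213 / (1.0 + 1.079 / M);
        double sum = 0;
        for (int r : registers) sum += Math.pow(2, -r);
        return alpha * M * (double) M / sum;
    }

    public static void main(String[] args) {
        CardinalitySketch sketch = new CardinalitySketch();
        for (long i = 0; i < 100000; i++) sketch.add(i);
        System.out.printf("estimate: %.0f (true count: 100000)%n", sketch.cardinality());
    }
}
```

With 2048 buckets the estimate typically lands within a few percent of the true count.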
MultiLogLog Architecture
The MultiLogLog class coordinates multiple CardinalityTracker instances:
- Independent Counters: Each kmer length maintains its own array of LogLog counters
- Parallel Processing: Multiple hash functions processed simultaneously via mlogArray
- Kmer Validation: Uses Kmer.getKbig() to validate and normalize kmer lengths
- Memory Layout: Contiguous allocation of counter arrays for cache locality
Threading and Concurrency
The ProcessThread implementation provides concurrent read processing:
- Read Distribution: ConcurrentReadInputStream distributes reads across worker threads
- Per-Thread Counters: Each thread maintains independent readsProcessedT and basesProcessedT
- Lock-Free Updates: AtomicIntegerArray operations eliminate synchronization overhead
- Thread Coordination: Uses Thread.join() pattern for completion synchronization
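The per-thread accumulation pattern can be sketched like this (simplified: the real ProcessThread also updates the LogLog registers, and reads come from a ConcurrentReadInputStream rather than a plain queue):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ConcurrentLinkedQueue;

public class ThreadedCount {
    // Each worker keeps private tallies (readsProcessedT / basesProcessedT in
    // the real ProcessThread); the main thread sums them after join().
    public static class Worker extends Thread {
        final ConcurrentLinkedQueue<String> queue;
        public long readsProcessedT = 0, basesProcessedT = 0;
        public Worker(ConcurrentLinkedQueue<String> queue) { this.queue = queue; }
        @Override public void run() {
            String read;
            while ((read = queue.poll()) != null) {  // lock-free work distribution
                readsProcessedT++;
                basesProcessedT += read.length();
            }
        }
    }

    public static void main(String[] args) throws InterruptedException {
        ConcurrentLinkedQueue<String> queue = new ConcurrentLinkedQueue<>();
        for (int i = 0; i < 1000; i++) queue.add("ACGTACGT");
        List<Worker> workers = new ArrayList<>();
        for (int t = 0; t < 4; t++) { Worker w = new Worker(queue); w.start(); workers.add(w); }
        long reads = 0, bases = 0;
        for (Worker w : workers) { w.join(); reads += w.readsProcessedT; bases += w.basesProcessedT; }
        System.out.println(reads + " reads, " + bases + " bases"); // 1000 reads, 8000 bases
    }
}
```

Because Thread.join() establishes a happens-before relationship, the main thread reads each worker's final tallies safely without extra locking.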
Output Generation
The writeOutput() method produces tab-delimited results:
- Multi-Way Averaging: Combines cardinality estimates from multiple hash functions
- Weighted vs Arithmetic: Supports both Tools.weightedAverage() and simple averaging
- Standard Deviation: Calculates variance as percentage of mean when multiple hashes used
- Sorted Results: Arrays.sort() applied before statistical calculations
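The statistics step can be sketched as follows, using a simple arithmetic mean and population standard deviation; the exact weighting inside Tools.weightedAverage() may differ:

```java
import java.util.Arrays;

public class OutputStats {
    // Combine per-hash cardinality estimates into a mean.
    public static double mean(double[] a) {
        double s = 0;
        for (double v : a) s += v;
        return s / a.length;
    }

    // Standard deviation reported as a percentage of the mean,
    // matching the stdev=t output description.
    public static double stdevPercent(double[] a) {
        double m = mean(a), ss = 0;
        for (double v : a) ss += (v - m) * (v - m);
        return 100.0 * Math.sqrt(ss / a.length) / m;
    }

    public static void main(String[] args) {
        double[] estimates = {10100, 9900, 10050};  // hypothetical per-hash estimates
        Arrays.sort(estimates);                      // sorted before statistics
        System.out.printf("mean=%.1f stdev=%.2f%%%n", mean(estimates), stdevPercent(estimates));
    }
}
```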
Quality-Based Filtering Implementation
When minprob is configured, the tool filters kmers through quality score analysis in the CardinalityTracker base class, excluding kmers below the probability threshold before hash calculation.
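A kmer's probability of being correct is the product of its bases' per-base correctness probabilities derived from Phred quality scores. A sketch of that filter (a hypothetical illustration, not the CardinalityTracker code):

```java
public class MinProb {
    // P(base correct) = 1 - 10^(-Q/10) for Phred quality Q.
    public static double baseProb(int q) {
        return 1.0 - Math.pow(10, -q / 10.0);
    }

    // P(kmer correct) = product over the kmer's base qualities.
    public static double kmerProb(int[] quals) {
        double p = 1.0;
        for (int q : quals) p *= baseProb(q);
        return p;
    }

    // Count the kmer only if it clears the minprob threshold.
    public static boolean passes(int[] quals, double minProb) {
        return kmerProb(quals) >= minProb;
    }

    public static void main(String[] args) {
        int[] quals = new int[21];
        java.util.Arrays.fill(quals, 20);           // 21-mer, all Q20 bases
        System.out.printf("%.4f%n", kmerProb(quals)); // 0.99^21, roughly 0.81
    }
}
```

This shows why longer kmers need higher base qualities to pass the same minprob threshold: the per-base probabilities compound.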
Memory Usage Characteristics
Memory consumption scales predictably with configuration parameters:
- Bucket Arrays: buckets * sizeof(int) * ways * kmer_count bytes for counter storage
- Read Buffers: Fixed overhead from ConcurrentReadInputStream buffering
- Thread Overhead: Shared.threads() * thread_stack_size for concurrent processing
- Atomic Arrays: Additional AtomicIntegerArray overhead when atomic=true
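The dominant bucket-array term can be computed directly from the formula above; for instance, sweep=15,35,2 (11 kmer lengths) with the defaults buckets=2048 and hashes=1 (a back-of-envelope sketch, ignoring the fixed overheads):

```java
public class MemoryEstimate {
    // Counter storage: buckets * sizeof(int) * ways * number of kmer lengths.
    public static long bucketBytes(int buckets, int ways, int kmerCount) {
        return (long) buckets * Integer.BYTES * ways * kmerCount;
    }

    public static void main(String[] args) {
        // 2048 buckets * 4 bytes * 1 hash * 11 kmer lengths = 90112 bytes (~88 KiB)
        System.out.println(bucketBytes(2048, 1, 11) + " bytes");
    }
}
```

Even aggressive settings keep counter storage small; read buffering and thread stacks usually dominate actual memory use.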
Support
For questions and support:
- Email: bbushnell@lbl.gov
- Documentation: bbmap.org