KmerCountMulti
Estimates the number of unique kmers (cardinality) in sequence data. Processes multiple kmer lengths simultaneously using the LogLog probabilistic counting algorithm with 64-bit hash functions and configurable bucket arrays.
Basic Usage
kmercountmulti.sh in=<file> sweep=<20,100,8> out=<histogram output>
KmerCountMulti counts unique kmers across multiple kmer lengths simultaneously, providing estimates of sequence complexity and diversity. The tool uses the LogLog probabilistic data structure with configurable accuracy vs. memory tradeoffs.
Parameters
Parameters are organized by function within the kmer counting workflow. All parameters from the shell script are documented below.
Input/Output Parameters
- in=<file>
- (in1) Input file, or comma-delimited list of files. Supports FASTA/FASTQ format, gzipped or uncompressed. Can use stdin with in=stdin.
- in2=<file>
- Optional second file for paired reads. When specified, both files are processed together for kmer counting.
- out=<file>
- Histogram output destination. Default is stdout. Output format is tab-delimited with columns for kmer length and estimated count.
Kmer Configuration
- k=
- Comma-delimited list of kmer lengths to use. Example: k=21,25,31. Each kmer length is processed independently and results are reported separately.
- sweep=min,max,incr
- Use kmer lengths from min to max, stepping by the given increment. For example, sweep=20,26,2 is equivalent to k=20,22,24,26. This is convenient for systematically testing multiple kmer lengths.
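The sweep expansion can be sketched as a simple loop (a hypothetical illustration of the parameter's behavior, not BBTools code):

```java
import java.util.ArrayList;
import java.util.List;

public class SweepDemo {
    // Expand sweep=min,max,incr into the explicit kmer-length list,
    // mirroring how sweep=20,26,2 becomes k=20,22,24,26.
    public static List<Integer> expandSweep(int min, int max, int incr) {
        List<Integer> ks = new ArrayList<>();
        for (int k = min; k <= max; k += incr) {
            ks.add(k); // each value is processed as an independent kmer length
        }
        return ks;
    }

    public static void main(String[] args) {
        System.out.println(expandSweep(20, 26, 2)); // [20, 22, 24, 26]
    }
}
```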
Counting Parameters
- buckets=2048
- Use this many buckets for counting; higher values decrease variance for large datasets. Must be a power of 2. More buckets use more memory but provide better accuracy.
- seed=-1
- Use this seed for hash functions. A negative number forces a random seed. Setting a specific positive seed ensures reproducible results across runs.
- minprob=0
- Set to a value between 0 and 1 to exclude kmers with a lower probability of being correct. This filters out likely erroneous kmers based on quality scores.
- hashes=1
- (ways) Use this many hash functions. More hashes yield greater accuracy, but H hashes take H times as long to compute. Each hash function provides an independent estimate.
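As a back-of-envelope model (an assumption based on the classic LogLog analysis, not a measured property of this tool), the relative standard error of a LogLog estimate scales roughly as 1.30/sqrt(buckets), and averaging H independent hash functions reduces it by a further factor of sqrt(H):

```java
public class AccuracyModel {
    // Approximate relative standard error of a LogLog estimate with the given
    // bucket count (Durand-Flajolet constant ~1.30), reduced by averaging
    // over independent hash functions.
    public static double relativeError(int buckets, int hashes) {
        return 1.30 / Math.sqrt((double) buckets * hashes);
    }

    public static void main(String[] args) {
        System.out.printf("2048 buckets, 1 hash:   ~%.2f%%%n", 100 * relativeError(2048, 1));
        System.out.printf("8192 buckets, 3 hashes: ~%.2f%%%n", 100 * relativeError(8192, 3));
    }
}
```

This is why the High Accuracy Counting example below raises both buckets and hashes for large datasets.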
Output Control
- stdev=f
- (showstdev, showstddev, stddev) Print standard deviations in the output when using multiple hash functions. Provides error estimates for the cardinality calculations.
Java Parameters
- -Xmx
- This will set Java's memory usage, overriding autodetection. -Xmx20g will specify 20 gigs of RAM, and -Xmx200m will specify 200 megs. The max is typically 85% of physical memory.
- -eoom
- This flag will cause the process to exit if an out-of-memory exception occurs. Requires Java 8u92+.
- -da
- Disable assertions.
File Name Shortcuts
- # symbol substitution
- The # symbol will be substituted for 1 and 2 in file names. For example: kmercountmulti.sh in=read#.fq is equivalent to kmercountmulti.sh in1=read1.fq in2=read2.fq
Examples
Basic Kmer Counting
kmercountmulti.sh in=reads.fq sweep=15,35,2 out=kmer_histogram.txt
Count kmers of lengths 15, 17, 19, 21, 23, 25, 27, 29, 31, 33, 35 and output the histogram to a file.
Paired-End Reads
kmercountmulti.sh in1=reads_R1.fq in2=reads_R2.fq k=21,25,31 out=paired_kmers.txt
Process paired-end reads for specific kmer lengths 21, 25, and 31.
High Accuracy Counting
kmercountmulti.sh in=large_dataset.fq k=21 buckets=8192 hashes=3 stdev=t
Use more buckets and multiple hash functions for higher accuracy on large datasets, with standard deviation reporting.
Quality Filtering
kmercountmulti.sh in=reads.fq k=21 minprob=0.8 out=filtered_kmers.txt
Only count kmers with at least 80% probability of being correct based on quality scores.
Using File Name Shortcuts
kmercountmulti.sh in=sample#.fastq sweep=20,30,5
Process sample1.fastq and sample2.fastq for kmer lengths 20, 25, 30.
Algorithm Details
LogLog Probabilistic Counting Implementation
KmerCountMulti uses the LogLog cardinality estimation algorithm implemented in the CardinalityTracker class hierarchy. The core implementation details include:
- Hash Function: Uses Tools.hash64shift() for 64-bit hash value generation from kmer sequences
- Leading Zero Counting: Applies Long.numberOfLeadingZeros() to hash values for cardinality estimation
- Bucket Distribution: Hash values are distributed across buckets using bitwise masking with bucketMask
- Atomic Operations: Thread-safe updates using AtomicIntegerArray.compareAndSet() for concurrent processing
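The update path above can be sketched as follows. This is an illustrative sketch, not the CardinalityTracker source: hash64shift here is Thomas Wang's well-known 64-bit mix (BBTools' Tools.hash64shift() may differ in detail), and the register layout is simplified.

```java
import java.util.concurrent.atomic.AtomicIntegerArray;

public class LogLogUpdate {
    public static final int BUCKETS = 2048;              // must be a power of 2
    public static final long BUCKET_MASK = BUCKETS - 1;  // mask for bucket selection
    public static final AtomicIntegerArray registers = new AtomicIntegerArray(BUCKETS);

    // Thomas Wang's 64-bit mix, standing in for Tools.hash64shift().
    public static long hash64shift(long key) {
        key = (~key) + (key << 21);
        key = key ^ (key >>> 24);
        key = (key + (key << 3)) + (key << 8);
        key = key ^ (key >>> 14);
        key = (key + (key << 2)) + (key << 4);
        key = key ^ (key >>> 28);
        key = key + (key << 31);
        return key;
    }

    // Record one kmer: pick a bucket from the low hash bits, count leading
    // zeros of the remaining bits, and keep the per-bucket maximum atomically.
    public static void add(long kmer) {
        long hash = hash64shift(kmer);
        int bucket = (int) (hash & BUCKET_MASK);
        // OR-ing in the mask caps the count so bucket bits never affect it.
        int lz = Long.numberOfLeadingZeros(hash | BUCKET_MASK);
        int old;
        while ((old = registers.get(bucket)) < lz) {
            if (registers.compareAndSet(bucket, old, lz)) break; // lock-free max
        }
    }
}
```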
Cardinality Calculation Method
The cardinality() method in LogLog.java implements the core estimation algorithm:
- Restoration Function: Converts leading zero counts back to approximate original values using bit shifting
- Harmonic Mean: Applies compensation factor 0.7213428177 for LogLog algorithm accuracy correction
- Multiple Estimates: Combines results from multiple hash functions using weighted averaging
- Variance Calculation: Provides standard deviation estimates through Tools.standardDeviation()
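The harmonic-mean estimate can be illustrated with a textbook HyperLogLog-style calculation, which uses the same ~0.7213 compensation constant mentioned above. This is a self-contained sketch under that assumption, not the LogLog.java source:

```java
public class CardinalitySketch {
    static final int P = 11;            // 2^11 = 2048 buckets
    static final int M = 1 << P;
    final int[] registers = new int[M];

    // Thomas Wang's 64-bit mix, standing in for Tools.hash64shift().
    public static long hash64shift(long key) {
        key = (~key) + (key << 21);
        key = key ^ (key >>> 24);
        key = (key + (key << 3)) + (key << 8);
        key = key ^ (key >>> 14);
        key = (key + (key << 2)) + (key << 4);
        key = key ^ (key >>> 28);
        key = key + (key << 31);
        return key;
    }

    public void add(long kmer) {
        long hash = hash64shift(kmer);
        int bucket = (int) (hash >>> (64 - P));       // high bits select the bucket
        long w = hash << P;                           // remaining bits
        int rank = Long.numberOfLeadingZeros(w) + 1;  // leftmost 1-bit position
        if (rank > registers[bucket]) registers[bucket] = rank;
    }

    // Harmonic mean of 2^register values, scaled by the compensation factor.
    public double cardinality() {
        double alpha = 0.7213 / (1.0 + 1.079 / M);
        double sum = 0;
        for (int r : registers) sum += Math.pow(2, -r);
        return alpha * M * (double) M / sum;
    }

    public static void main(String[] args) {
        CardinalitySketch sketch = new CardinalitySketch();
        for (long i = 0; i < 100000; i++) sketch.add(i);
        System.out.printf("estimate: %.0f (true count: 100000)%n", sketch.cardinality());
    }
}
```

With 2048 buckets the estimate typically lands within a few percent of the true count.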
MultiLogLog Architecture
The MultiLogLog class coordinates multiple CardinalityTracker instances:
- Independent Counters: Each kmer length maintains its own array of LogLog counters
- Parallel Processing: Multiple hash functions processed simultaneously via mlogArray
- Kmer Validation: Uses Kmer.getKbig() to validate and normalize kmer lengths
- Memory Layout: Contiguous allocation of counter arrays for cache locality
Threading and Concurrency
The ProcessThread implementation provides concurrent read processing:
- Read Distribution: ConcurrentReadInputStream distributes reads across worker threads
- Per-Thread Counters: Each thread maintains independent readsProcessedT and basesProcessedT
- Lock-Free Updates: AtomicIntegerArray operations eliminate synchronization overhead
- Thread Coordination: Uses Thread.join() pattern for completion synchronization
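The per-thread accumulation pattern can be sketched like this (simplified: the real ProcessThread also updates the LogLog registers, and reads come from a ConcurrentReadInputStream rather than a plain queue):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ConcurrentLinkedQueue;

public class ThreadedCount {
    // Each worker keeps private tallies (readsProcessedT / basesProcessedT in
    // the real ProcessThread); the main thread sums them after join().
    public static class Worker extends Thread {
        final ConcurrentLinkedQueue<String> queue;
        public long readsProcessedT = 0, basesProcessedT = 0;
        public Worker(ConcurrentLinkedQueue<String> queue) { this.queue = queue; }
        @Override public void run() {
            String read;
            while ((read = queue.poll()) != null) {  // lock-free work distribution
                readsProcessedT++;
                basesProcessedT += read.length();
            }
        }
    }

    public static void main(String[] args) throws InterruptedException {
        ConcurrentLinkedQueue<String> queue = new ConcurrentLinkedQueue<>();
        for (int i = 0; i < 1000; i++) queue.add("ACGTACGT");
        List<Worker> workers = new ArrayList<>();
        for (int t = 0; t < 4; t++) { Worker w = new Worker(queue); w.start(); workers.add(w); }
        long reads = 0, bases = 0;
        for (Worker w : workers) { w.join(); reads += w.readsProcessedT; bases += w.basesProcessedT; }
        System.out.println(reads + " reads, " + bases + " bases"); // 1000 reads, 8000 bases
    }
}
```

Because Thread.join() establishes a happens-before relationship, the main thread reads each worker's final tallies safely without extra locking.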
Output Generation
The writeOutput() method produces tab-delimited results:
- Multi-Way Averaging: Combines cardinality estimates from multiple hash functions
- Weighted vs Arithmetic: Supports both Tools.weightedAverage() and simple averaging
- Standard Deviation: Calculates variance as percentage of mean when multiple hashes used
- Sorted Results: Arrays.sort() applied before statistical calculations
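The statistics step can be sketched as follows, using a simple arithmetic mean and population standard deviation; the exact weighting inside Tools.weightedAverage() may differ:

```java
import java.util.Arrays;

public class OutputStats {
    // Combine per-hash cardinality estimates into a mean.
    public static double mean(double[] a) {
        double s = 0;
        for (double v : a) s += v;
        return s / a.length;
    }

    // Standard deviation reported as a percentage of the mean,
    // matching the stdev=t output description.
    public static double stdevPercent(double[] a) {
        double m = mean(a), ss = 0;
        for (double v : a) ss += (v - m) * (v - m);
        return 100.0 * Math.sqrt(ss / a.length) / m;
    }

    public static void main(String[] args) {
        double[] estimates = {10100, 9900, 10050};  // hypothetical per-hash estimates
        Arrays.sort(estimates);                      // sorted before statistics
        System.out.printf("mean=%.1f stdev=%.2f%%%n", mean(estimates), stdevPercent(estimates));
    }
}
```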
Quality-Based Filtering Implementation
When minprob is configured, the tool filters kmers through quality score analysis in the CardinalityTracker base class, excluding kmers below the probability threshold before hash calculation.
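A kmer's probability of being correct is the product of its bases' per-base correctness probabilities derived from Phred quality scores. A sketch of that filter (a hypothetical illustration, not the CardinalityTracker code):

```java
public class MinProb {
    // P(base correct) = 1 - 10^(-Q/10) for Phred quality Q.
    public static double baseProb(int q) {
        return 1.0 - Math.pow(10, -q / 10.0);
    }

    // P(kmer correct) = product over the kmer's base qualities.
    public static double kmerProb(int[] quals) {
        double p = 1.0;
        for (int q : quals) p *= baseProb(q);
        return p;
    }

    // Count the kmer only if it clears the minprob threshold.
    public static boolean passes(int[] quals, double minProb) {
        return kmerProb(quals) >= minProb;
    }

    public static void main(String[] args) {
        int[] quals = new int[21];
        java.util.Arrays.fill(quals, 20);           // 21-mer, all Q20 bases
        System.out.printf("%.4f%n", kmerProb(quals)); // 0.99^21, roughly 0.81
    }
}
```

This shows why longer kmers need higher base qualities to pass the same minprob threshold: the per-base probabilities compound.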
Memory Usage Characteristics
Memory consumption scales predictably with configuration parameters:
- Bucket Arrays: buckets * sizeof(int) * ways * kmer_count bytes for counter storage
- Read Buffers: Fixed overhead from ConcurrentReadInputStream buffering
- Thread Overhead: Shared.threads() * thread_stack_size for concurrent processing
- Atomic Arrays: Additional AtomicIntegerArray overhead when atomic=true
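The dominant bucket-array term can be computed directly from the formula above; for instance, sweep=15,35,2 (11 kmer lengths) with the defaults buckets=2048 and hashes=1 (a back-of-envelope sketch, ignoring the fixed overheads):

```java
public class MemoryEstimate {
    // Counter storage: buckets * sizeof(int) * ways * number of kmer lengths.
    public static long bucketBytes(int buckets, int ways, int kmerCount) {
        return (long) buckets * Integer.BYTES * ways * kmerCount;
    }

    public static void main(String[] args) {
        // 2048 buckets * 4 bytes * 1 hash * 11 kmer lengths = 90112 bytes (~88 KiB)
        System.out.println(bucketBytes(2048, 1, 11) + " bytes");
    }
}
```

Even aggressive settings keep counter storage small; read buffering and thread stacks usually dominate actual memory use.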
Support
For questions and support:
- Email: bbushnell@lbl.gov
- Documentation: bbmap.org