KmerCountExact

Script: kmercountexact.sh Package: jgi Class: KmerCountExact.java

Counts the number of unique kmers in a file. Generates a kmer frequency histogram and genome size estimate (in peaks output), and prints a file containing all kmers and their counts. Supports K=1 to infinity, though not all values are allowed. SEE ALSO: bbnorm.sh/khist.sh, loglog.sh, and kmercountmulti.sh.

Basic Usage

kmercountexact.sh in=<file> khist=<file> peaks=<file>

Input may be fasta or fastq, compressed or uncompressed. Output may be stdout or a file. out, khist, and peaks are optional.

Parameters

Parameters are organized by their function in the kmer counting process.

Input parameters

in=<file>: Primary input file.
in2=<file>: Second input file for paired reads.
amino=f: Run in amino acid mode.

Output parameters

out=<file>: Print kmers and their counts. This produces a huge file, so skip it if you only need the histogram.
fastadump=t: Print kmers and counts as fasta versus 2-column tsv.
mincount=1: Only print kmers with at least this depth.
reads=-1: Only process this number of reads, then quit (-1 means all).
dumpthreads=-1: Use this number of threads for dumping kmers (-1 means auto).

Hashing parameters

k=31: Kmer length (1-31 is fastest).
prealloc=t: Pre-allocate memory rather than dynamically growing using KmerTableSet.prealloc. A float fraction (0-1) may be specified, default 1.
prefilter=0: If set to a positive integer, use a countmin sketch to ignore kmers with depth of that value or lower.
prehashes=2: Number of hashes for prefilter.
prefiltersize=0.2: Fraction of memory to use for prefilter.
minq=6: Ignore kmers containing bases with quality below this. (TODO)
minprob=0.0: Ignore kmers with overall probability of correctness below this.
threads=X: Spawn X hashing threads (default is number of logical processors).
onepass=f: If true, prefilter will be generated in same pass as kmer counts. Much faster but counts will be lower, by up to prefilter's depth limit.
rcomp=t: Store and count each kmer together and its reverse-complement.

Histogram parameters

khist=<file>: Print kmer frequency histogram.
histcolumns=2: 2 columns: (depth, count). 3 columns: (depth, rawCount, count).
histmax=100000: Maximum depth to print in histogram output.
histheader=t: Set true to print a header line.
nzo=t: (nonzeroonly) Only print lines for depths with a nonzero kmer count.
gchist=f: Add an extra histogram column with the average GC%.

Intersection parameters

ref=<file>: An input assembly of the input reads.
intersection=<file>: Output file for a 2-D histogram of read and ref kmer counts.
refmax=6: Maximum reference kmer depth to track; read depth is controlled by 'histmax'.

Smoothing parameters

smoothkhist=f: Smooth the output kmer histogram.
smoothpeaks=t: Smooth the kmer histogram for peak-calling, but does not affect the output histogram.
smoothradius=1: Initial radius of progressive smoothing function.
maxradius=10: Maximum radius of progressive smoothing function.
progressivemult=2: Increment radius each time depth increases by this factor.
logscale=t: Transform to log-scale prior to peak-calling.
logwidth=0.1: The larger the number, the smoother.

Peak calling parameters

peaks=<file>: Write the peaks to this file. Default is stdout. Also contains the genome size estimate in bp.
minHeight=2: (h) Ignore peaks shorter than this.
minVolume=5: (v) Ignore peaks with less area than this.
minWidth=3: (w) Ignore peaks narrower than this.
minPeak=2: (minp) Ignore peaks with an X-value below this.
maxPeak=BIG: (maxp) Ignore peaks with an X-value above this.
maxPeakCount=12: (maxpc) Print up to this many peaks (prioritizing height).
ploidy=-1: Specify ploidy; otherwise it will be autodetected.

Sketch parameters (for making a MinHashSketch)

sketch=<file>: Write a minhash sketch to this file.
sketchlen=10000: Output the top 10000 kmers. Only kmers with at least mincount are included.
sketchname=: Name of output sketch.
sketchid=: taxID of output sketch.

Quality parameters

qtrim=f: Trim read ends to remove bases with quality below minq. Values: t (trim both ends), f (neither end), r (right end only), l (left end only).
trimq=4: Trim quality threshold.
minavgquality=0: (maq) Reads with average quality (before trimming) below this will be discarded.

Overlap parameters (for overlapping paired-end reads only)

merge=f: Attempt to merge reads before counting kmers.
ecco=f: Error correct via overlap, but do not merge reads.

Java Parameters

-Xmx: This will set Java's memory usage, overriding autodetection. -Xmx20g will specify 20 gigs of RAM, and -Xmx200m will specify 200 megs. The max is typically 85% of physical memory.
-eoom: This flag will cause the process to exit if an out-of-memory exception occurs. Requires Java 8u92+.
-da: Disable assertions.

Examples

Basic Kmer Counting

kmercountexact.sh in=reads.fq khist=histogram.txt peaks=peaks.txt

Count kmers in reads.fq, generate a frequency histogram and identify peaks for genome size estimation.

Kmer Counting with Output

kmercountexact.sh in=reads.fq out=kmers.txt khist=histogram.txt mincount=5

Count kmers and output all kmers with count ≥5, plus generate frequency histogram.

High-Memory Kmer Counting

kmercountexact.sh in=reads.fq khist=histogram.txt k=25 prealloc=t -Xmx50g

Use k=25 kmers with pre-allocated memory using up to 50GB RAM.

Kmer Counting with Reference Comparison

kmercountexact.sh in=reads.fq ref=assembly.fa intersection=compare.txt khist=histogram.txt

Count kmers in reads and compare with reference assembly, generating 2D intersection histogram.

MinHash Sketch Generation

kmercountexact.sh in=reads.fq sketch=sketch.txt sketchlen=20000 mincount=3

Generate a MinHash sketch containing the top 20,000 kmers with count ≥3.

Algorithm Details

Exact Kmer Counting Strategy

KmerCountExact implements hash-based data structures for precise kmer counting without probabilistic approximations. The implementation uses two main storage strategies based on kmer length:

K ≤ 31: Uses KmerTableSet with compact long-based storage via HashArray1D
K > 31: Uses KmerTableSetU (unlimited) with byte array storage for arbitrary-length kmers

Hash Table Implementation

The core data structure employs HashArray1D with overflow handling via HashForest for collision resolution. This hybrid approach provides:

O(1) average access time for most kmers
Hash collision handling through HashForest structures
Storage with pre-allocation options via KmerTableSet.prealloc

Memory Management

The tool implements memory management through several mechanisms:

Pre-allocation (prealloc=t): Allocates full hash table size upfront using KmerTableSet.prealloc parameter
Dynamic growth: HashArray1D tables grow incrementally when pre-allocation is disabled
Prefilter system: Uses CountMin sketch implementation to exclude low-abundance kmers before main counting

Peak Calling Algorithm

The peak detection system implements CallPeaks class processing:

Progressive smoothing: CallPeaks.progressiveMult factor increases smoothing radius by specified multiplier
Log-scale transformation: doLogScale parameter with logWidth value converts histograms for peak detection
Multi-criteria filtering: minHeight, minVolume, minWidth, minPeak, and maxPeak thresholds filter significant peaks
Genome size estimation: CallPeaks.printPeaks method calculates genome size based on peak positions and kmer counts

Performance Optimizations

Implementation includes several optimizations:

Multithreaded processing: THREADS parameter controls parallel kmer extraction using multiple processing threads
ByteFile I/O: ByteFile framework with FORCE_MODE_BF2 for compressed input handling
Walker pattern: Walker and WalkerU classes iterate over hash tables without data copying
Reverse complement handling: KmerTableSet.rcomp parameter enables canonical kmer representation

Quality and Filtering Integration

The tool implements quality-aware processing:

Base quality filtering: minq parameter excludes kmers containing bases below quality threshold
Probabilistic quality assessment: minprob parameter sets kmer correctness probability threshold
Parser integration: Parser.parseQuality() and Parser.parseTrim() handle quality and trimming parameters

Notes

For very large datasets, consider using kmercountmulti.sh for approximate counting with lower memory usage
The histogram output provides the foundation for many downstream analyses including genome size estimation and heterozygosity assessment
Peak calling works best with high-coverage datasets where coverage peaks are clearly separated from error kmers
Memory usage scales roughly with the number of unique kmers times the kmer length
For k>31, memory usage increases significantly due to variable-length storage requirements
The intersection feature is particularly useful for assembly validation and contamination detection

Support

For questions and support:

Email: bbushnell@lbl.gov
Documentation: bbmap.org