KmerCountExact
Counts the number of unique kmers in a file. Generates a kmer frequency histogram and genome size estimate (in peaks output), and prints a file containing all kmers and their counts. Supports K=1 to infinity, though not all values are allowed. SEE ALSO: bbnorm.sh/khist.sh, loglog.sh, and kmercountmulti.sh.
Basic Usage
kmercountexact.sh in=<file> khist=<file> peaks=<file>
Input may be fasta or fastq, compressed or uncompressed. Output may be stdout or a file. out, khist, and peaks are optional.
Parameters
Parameters are organized by their function in the kmer counting process.
Input parameters
- in=<file>
- Primary input file.
- in2=<file>
- Second input file for paired reads.
- amino=f
- Run in amino acid mode.
Output parameters
- out=<file>
- Print kmers and their counts. This produces a huge file, so skip it if you only need the histogram.
- fastadump=t
- Print kmers and counts as fasta versus 2-column tsv.
- mincount=1
- Only print kmers with at least this depth.
- reads=-1
- Only process this number of reads, then quit (-1 means all).
- dumpthreads=-1
- Use this number of threads for dumping kmers (-1 means auto).
Hashing parameters
- k=31
- Kmer length (1-31 is fastest).
- prealloc=t
- Pre-allocate memory rather than dynamically growing using KmerTableSet.prealloc. A float fraction (0-1) may be specified, default 1.
- prefilter=0
- If set to a positive integer, use a countmin sketch to ignore kmers with depth of that value or lower.
- prehashes=2
- Number of hashes for prefilter.
- prefiltersize=0.2
- Fraction of memory to use for prefilter.
- minq=6
- Ignore kmers containing bases with quality below this. (TODO)
- minprob=0.0
- Ignore kmers with overall probability of correctness below this.
- threads=X
- Spawn X hashing threads (default is number of logical processors).
- onepass=f
- If true, prefilter will be generated in same pass as kmer counts. Much faster but counts will be lower, by up to prefilter's depth limit.
- rcomp=t
- Store and count each kmer together and its reverse-complement.
Histogram parameters
- khist=<file>
- Print kmer frequency histogram.
- histcolumns=2
- 2 columns: (depth, count). 3 columns: (depth, rawCount, count).
- histmax=100000
- Maximum depth to print in histogram output.
- histheader=t
- Set true to print a header line.
- nzo=t
- (nonzeroonly) Only print lines for depths with a nonzero kmer count.
- gchist=f
- Add an extra histogram column with the average GC%.
Intersection parameters
- ref=<file>
- An input assembly of the input reads.
- intersection=<file>
- Output file for a 2-D histogram of read and ref kmer counts.
- refmax=6
- Maximum reference kmer depth to track; read depth is controlled by 'histmax'.
Smoothing parameters
- smoothkhist=f
- Smooth the output kmer histogram.
- smoothpeaks=t
- Smooth the kmer histogram for peak-calling, but does not affect the output histogram.
- smoothradius=1
- Initial radius of progressive smoothing function.
- maxradius=10
- Maximum radius of progressive smoothing function.
- progressivemult=2
- Increment radius each time depth increases by this factor.
- logscale=t
- Transform to log-scale prior to peak-calling.
- logwidth=0.1
- The larger the number, the smoother.
Peak calling parameters
- peaks=<file>
- Write the peaks to this file. Default is stdout. Also contains the genome size estimate in bp.
- minHeight=2
- (h) Ignore peaks shorter than this.
- minVolume=5
- (v) Ignore peaks with less area than this.
- minWidth=3
- (w) Ignore peaks narrower than this.
- minPeak=2
- (minp) Ignore peaks with an X-value below this.
- maxPeak=BIG
- (maxp) Ignore peaks with an X-value above this.
- maxPeakCount=12
- (maxpc) Print up to this many peaks (prioritizing height).
- ploidy=-1
- Specify ploidy; otherwise it will be autodetected.
Sketch parameters (for making a MinHashSketch)
- sketch=<file>
- Write a minhash sketch to this file.
- sketchlen=10000
- Output the top 10000 kmers. Only kmers with at least mincount are included.
- sketchname=
- Name of output sketch.
- sketchid=
- taxID of output sketch.
Quality parameters
- qtrim=f
- Trim read ends to remove bases with quality below minq. Values: t (trim both ends), f (neither end), r (right end only), l (left end only).
- trimq=4
- Trim quality threshold.
- minavgquality=0
- (maq) Reads with average quality (before trimming) below this will be discarded.
Overlap parameters (for overlapping paired-end reads only)
- merge=f
- Attempt to merge reads before counting kmers.
- ecco=f
- Error correct via overlap, but do not merge reads.
Java Parameters
- -Xmx
- This will set Java's memory usage, overriding autodetection. -Xmx20g will specify 20 gigs of RAM, and -Xmx200m will specify 200 megs. The max is typically 85% of physical memory.
- -eoom
- This flag will cause the process to exit if an out-of-memory exception occurs. Requires Java 8u92+.
- -da
- Disable assertions.
Examples
Basic Kmer Counting
kmercountexact.sh in=reads.fq khist=histogram.txt peaks=peaks.txt
Count kmers in reads.fq, generate a frequency histogram and identify peaks for genome size estimation.
Kmer Counting with Output
kmercountexact.sh in=reads.fq out=kmers.txt khist=histogram.txt mincount=5
Count kmers and output all kmers with count ≥5, plus generate frequency histogram.
High-Memory Kmer Counting
kmercountexact.sh in=reads.fq khist=histogram.txt k=25 prealloc=t -Xmx50g
Use k=25 kmers with pre-allocated memory using up to 50GB RAM.
Kmer Counting with Reference Comparison
kmercountexact.sh in=reads.fq ref=assembly.fa intersection=compare.txt khist=histogram.txt
Count kmers in reads and compare with reference assembly, generating 2D intersection histogram.
MinHash Sketch Generation
kmercountexact.sh in=reads.fq sketch=sketch.txt sketchlen=20000 mincount=3
Generate a MinHash sketch containing the top 20,000 kmers with count ≥3.
Algorithm Details
Exact Kmer Counting Strategy
KmerCountExact implements hash-based data structures for precise kmer counting without probabilistic approximations. The implementation uses two main storage strategies based on kmer length:
- K ≤ 31: Uses KmerTableSet with compact long-based storage via HashArray1D
- K > 31: Uses KmerTableSetU (unlimited) with byte array storage for arbitrary-length kmers
Hash Table Implementation
The core data structure employs HashArray1D with overflow handling via HashForest for collision resolution. This hybrid approach provides:
- O(1) average access time for most kmers
- Hash collision handling through HashForest structures
- Storage with pre-allocation options via KmerTableSet.prealloc
Memory Management
The tool implements memory management through several mechanisms:
- Pre-allocation (prealloc=t): Allocates full hash table size upfront using KmerTableSet.prealloc parameter
- Dynamic growth: HashArray1D tables grow incrementally when pre-allocation is disabled
- Prefilter system: Uses CountMin sketch implementation to exclude low-abundance kmers before main counting
Peak Calling Algorithm
The peak detection system implements CallPeaks class processing:
- Progressive smoothing: CallPeaks.progressiveMult factor increases smoothing radius by specified multiplier
- Log-scale transformation: doLogScale parameter with logWidth value converts histograms for peak detection
- Multi-criteria filtering: minHeight, minVolume, minWidth, minPeak, and maxPeak thresholds filter significant peaks
- Genome size estimation: CallPeaks.printPeaks method calculates genome size based on peak positions and kmer counts
Performance Optimizations
Implementation includes several optimizations:
- Multithreaded processing: THREADS parameter controls parallel kmer extraction using multiple processing threads
- ByteFile I/O: ByteFile framework with FORCE_MODE_BF2 for compressed input handling
- Walker pattern: Walker and WalkerU classes iterate over hash tables without data copying
- Reverse complement handling: KmerTableSet.rcomp parameter enables canonical kmer representation
Quality and Filtering Integration
The tool implements quality-aware processing:
- Base quality filtering: minq parameter excludes kmers containing bases below quality threshold
- Probabilistic quality assessment: minprob parameter sets kmer correctness probability threshold
- Parser integration: Parser.parseQuality() and Parser.parseTrim() handle quality and trimming parameters
Notes
- For very large datasets, consider using kmercountmulti.sh for approximate counting with lower memory usage
- The histogram output provides the foundation for many downstream analyses including genome size estimation and heterozygosity assessment
- Peak calling works best with high-coverage datasets where coverage peaks are clearly separated from error kmers
- Memory usage scales roughly with the number of unique kmers times the kmer length
- For k>31, memory usage increases significantly due to variable-length storage requirements
- The intersection feature is particularly useful for assembly validation and contamination detection
Support
For questions and support:
- Email: bbushnell@lbl.gov
- Documentation: bbmap.org