KmerCoverage

⚠️ DEPRECATED TOOL

This tool is deprecated and no longer maintained. It should still work but may not receive updates or bug fixes. Consider using alternative tools for kmer analysis.

Basic Usage

kmercoverage.sh in=<input> out=<read output> hist=<histogram output>

Processes reads to determine the depth (frequency) of each kmer within the reads. Can output annotated reads and/or generate depth histograms for analysis.

Parameters

Parameters are organized by their function in the kmer coverage analysis process.

Input Parameters

in=file: Primary input file containing reads to analyze.
in2=null: Second input file for paired reads.
extra=null: Additional files to use for input (generating hash table) but not for output. Can be a comma-separated list.
fastareadlen=2^31: Break up FASTA reads longer than this. Can be useful when processing scaffolded genomes.
tablereads=-1: Use at most this many reads when building the hashtable (-1 means all).
kmersample=1: Process every nth kmer, and skip the rest. Reduces memory usage and processing time.
readsample=1: Process every nth read, and skip the rest. Reduces memory usage and processing time.

Output Parameters

out=file: Output file for processed reads with coverage annotations.
hist=null: Specify a file to output the depth histogram showing kmer frequency distribution.
histlen=10000: Max depth displayed on histogram. Values beyond this are binned together.
reads=-1: Only process this number of reads, then quit (-1 means all).
sampleoutput=t: Use sampling on output as well as input (not used if sample rates are 1).
printcoverage=f: Only print coverage information instead of reads. Outputs coverage data only.
useheader=f: Append coverage info to the read's header instead of as separate attachment.
minmedian=0: Don't output reads with median coverage below this threshold.
minaverage=0: Don't output reads with average coverage below this threshold.
zerobin=f: Set to true if you want kmers with a count of 0 to go in the 0 bin instead of the 1 bin in histograms. Default is false, to prevent confusion about how there can be 0-count kmers. The reason is that based on the 'minq' and 'minprob' settings, some kmers may be excluded from the bloom filter.

Hashing Parameters

k=31: Kmer length (values under 32 are most efficient, but arbitrarily high values are supported). Longer kmers are more specific but require more memory.
cbits=8: Bits per cell in bloom filter; must be 2, 4, 8, 16, or 32. Maximum kmer depth recorded is 2^cbits. Large values decrease accuracy for a fixed amount of memory.
hashes=4: Number of times a kmer is hashed. Higher is slower. Higher is MORE accurate if there is enough memory, and LESS accurate if there is not enough memory.
prefilter=f: True is slower, but generally more accurate; filters out low-depth kmers from the main hashtable.
prehashes=2: Number of hashes for prefilter. Used when prefilter=true.
passes=1: More passes can sometimes increase accuracy by iteratively removing low-depth kmers.
minq=7: Ignore kmers containing bases with quality below this threshold. Helps reduce noise from sequencing errors.
minprob=0.5: Ignore kmers with overall probability of correctness below this threshold. Calculated from base qualities.
threads=X: Spawn exactly X hashing threads (default is number of logical processors). Total active threads may exceed X by up to 4.

Java Parameters

-Xmx: This will set Java's memory usage, overriding autodetection. -Xmx20g will specify 20 gigs of RAM, and -Xmx200m will specify 200 megs. The max is typically 85% of physical memory.
-eoom: This flag will cause the process to exit if an out-of-memory exception occurs. Requires Java 8u92+.
-da: Disable assertions.

Examples

Basic Kmer Coverage Analysis

kmercoverage.sh in=reads.fastq out=annotated.fastq hist=coverage.hist

Analyzes kmer coverage in reads.fastq, outputs annotated reads to annotated.fastq, and generates a coverage histogram in coverage.hist.

High Memory Efficiency Mode

kmercoverage.sh in=reads.fastq hist=coverage.hist kmersample=3 cbits=4 k=25

Runs in memory-efficient mode by sampling every 3rd kmer, using 4 bits per cell, and shorter 25-mer length for reduced memory usage.

Quality Filtering

kmercoverage.sh in=reads.fastq out=filtered.fastq minq=10 minprob=0.8 minmedian=5

Filters out low-quality kmers (quality < 10, probability < 0.8) and reads with median coverage below 5.

Using Extra Input Files

kmercoverage.sh in=target.fastq extra=ref1.fasta,ref2.fasta out=annotated.fastq hist=coverage.hist

Uses ref1.fasta and ref2.fasta to build the kmer hash table but only processes target.fastq for output. Useful for analyzing coverage relative to reference sequences.

Algorithm Details

Hash Table Construction

KmerCoverage implements a two-stage hash table construction process using the ReadCounter.makeKca() method:

KCountArray Data Structure: Uses bloom.KCountArray with configurable cell bit allocation (2, 4, 8, 16, or 32 bits per cell)
Memory Allocation Formula: Calculates cells = ((memory-96000000)×0.73)×8/cbits, reserving 96MB overhead
Multi-threaded Construction: Spawns THREADS hash table building processes with buildStepsize=4 for parallel construction
Optional Prefilter: When enabled, builds a 2-bit prefilterArray consuming 35% of available bits before main table construction

Kmer Processing Implementation

The generateCoverage() method implements distinct algorithms based on k-mer length:

K ≤ 31: Uses bit-shifted encoding with mask = ~((-1L)<<(2*k)) for 64-bit kmer representation
K > 31: Uses Long.rotateLeft() with XOR operations for extended kmer handling, disabling canonical mode
Canonical Processing: For k ≤ 31, compares forward and reverse complement using lexicographic ordering
Quality Filtering: Excludes kmers below minq=7 quality threshold and minprob=0.5 correctness probability

Coverage Annotation Pipeline

ProcessThread.countInThread() executes the annotation workflow:

Kmer Extraction: Sliding window generates overlapping kmers using AminoAcid.baseToNumber[] lookup table
Hash Lookup: Queries KCountArray using kca.read(kmer, k, CANONICAL) method
Statistical Computation: Calculates median using Arrays.sort() and average using Tools.averageInt()
Histogram Update: Thread-local hist[] arrays track coverage distribution with bounds checking
Quality Threshold Filtering: Discards reads where median < MIN_MEDIAN or average < MIN_AVERAGE

Memory Management Strategy

Memory allocation follows specific calculation patterns from the source code:

Available Memory: Uses Tools.max(((memory-96000000)×0.73), memory×0.45) to determine usable heap space
Histogram Memory: Reserves HIST_LEN×8×(THREADS+1) bytes for parallel histogram tracking
Multi-pass Adjustment: When buildpasses > 1, halves available memory to accommodate temporary structures
Cell Calculation: Determines cells = (available_memory×8)/cbits for optimal hash table sizing

Spike Detection Algorithm

The fixSpikes() method implements false positive correction using adjacent kmer comparison:

Peak Detection: Identifies kmers where count[i] > count[i-1] AND count[i] > count[i+1]
Spike Criteria: Flags peaks where count ≥ 2×adjacent OR count > adjacent+2 on both sides
Correction Method: Replaces spike values with Tools.max(left_neighbor, right_neighbor)
Precise Lookup: Uses kca.readPreciseMin() for boundary verification when FIX_SPIKES=true

Topology Classification System

The analyzeSpikes() method categorizes coverage patterns using atomic counters:

Spikes: AtomicLong counter tracks peaks meeting spike criteria (isolated high values)
Peaks: Local maxima where coverage[i] > coverage[i±1] but not classified as spikes
Valleys: Local minima where coverage[i] < coverage[i±1] on both sides
Slopes: Positions where coverage differs from neighbors but not peaks/valleys
Flats: Regions where coverage[i] == coverage[i-1] == coverage[i+1]

Canonical Kmer Implementation

Canonical mode handling uses specific bitwise operations:

K ≤ 31 Mode: Compares kmer and reverse complement using lexicographic ordering in CANONICAL=true mode
K > 31 Limitation: Line 209: if(k>31){CANONICAL=KmerCountAbstract.CANONICAL=false;} disables canonical processing
Memory Benefit: Canonical mode approximately halves unique kmer count for storage efficiency
Accuracy Impact: Non-canonical mode processes forward and reverse kmers separately, doubling hash table entries

Performance Notes

Threading Architecture: Spawns ProcessThread[THREADS] array with default THREADS=Shared.LOGICAL_PROCESSORS, allowing up to 4 additional threads beyond user specification
Memory Scaling: Supports kmersamplerate and readsamplerate parameters to process every nth kmer/read, reducing memory footprint proportionally
I/O Implementation: Uses ConcurrentReadInputStream and ConcurrentReadOutputStream with configurable buffer sizes (8 unordered, max(16, 2×THREADS) ordered)
Canonical Mode Performance: For k > 31, disables canonical processing to avoid Long.rotateLeft() computational overhead
Hash Table Limits: When prefilter=false and k < 32, constrains cells ≤ (1L<<(2×k)) to prevent over-allocation
DEPRECATED Status: Last modified May 23, 2014 - no longer maintained, consider alternatives for production use

Support

For questions and support:

Email: bbushnell@lbl.gov
Documentation: bbmap.org

Note: As this is a deprecated tool, support may be limited. Please mention the deprecated status when seeking help.