KmerCoverage
Annotates reads with their kmer depth. DEPRECATED: This should still work but is no longer maintained.
⚠️ DEPRECATED TOOL
This tool is deprecated and no longer maintained. It should still work but may not receive updates or bug fixes. Consider using alternative tools for kmer analysis.
Basic Usage
kmercoverage.sh in=<input> out=<read output> hist=<histogram output>
Processes reads to determine the depth (frequency) of each kmer within the reads. Can output annotated reads and/or generate depth histograms for analysis.
Parameters
Parameters are organized by their function in the kmer coverage analysis process.
Input Parameters
- in=file
- Primary input file containing reads to analyze.
- in2=null
- Second input file for paired reads.
- extra=null
- Additional files to use for input (generating hash table) but not for output. Can be a comma-separated list.
- fastareadlen=2^31
- Break up FASTA reads longer than this. Can be useful when processing scaffolded genomes.
- tablereads=-1
- Use at most this many reads when building the hashtable (-1 means all).
- kmersample=1
- Process every nth kmer, and skip the rest. Reduces memory usage and processing time.
- readsample=1
- Process every nth read, and skip the rest. Reduces memory usage and processing time.
Output Parameters
- out=file
- Output file for processed reads with coverage annotations.
- hist=null
- Specify a file to output the depth histogram showing kmer frequency distribution.
- histlen=10000
- Max depth displayed on histogram. Values beyond this are binned together.
- reads=-1
- Only process this number of reads, then quit (-1 means all).
- sampleoutput=t
- Use sampling on output as well as input (not used if sample rates are 1).
- printcoverage=f
- Only print coverage information instead of reads. Outputs coverage data only.
- useheader=f
- Append coverage info to the read's header instead of as separate attachment.
- minmedian=0
- Don't output reads with median coverage below this threshold.
- minaverage=0
- Don't output reads with average coverage below this threshold.
- zerobin=f
- Set to true if you want kmers with a count of 0 to go in the 0 bin instead of the 1 bin in histograms. Default is false, to prevent confusion about how there can be 0-count kmers. The reason is that based on the 'minq' and 'minprob' settings, some kmers may be excluded from the bloom filter.
Hashing Parameters
- k=31
- Kmer length (values under 32 are most efficient, but arbitrarily high values are supported). Longer kmers are more specific but require more memory.
- cbits=8
- Bits per cell in bloom filter; must be 2, 4, 8, 16, or 32. Maximum kmer depth recorded is 2^cbits. Large values decrease accuracy for a fixed amount of memory.
- hashes=4
- Number of times a kmer is hashed. Higher is slower. Higher is MORE accurate if there is enough memory, and LESS accurate if there is not enough memory.
- prefilter=f
- True is slower, but generally more accurate; filters out low-depth kmers from the main hashtable.
- prehashes=2
- Number of hashes for prefilter. Used when prefilter=true.
- passes=1
- More passes can sometimes increase accuracy by iteratively removing low-depth kmers.
- minq=7
- Ignore kmers containing bases with quality below this threshold. Helps reduce noise from sequencing errors.
- minprob=0.5
- Ignore kmers with overall probability of correctness below this threshold. Calculated from base qualities.
- threads=X
- Spawn exactly X hashing threads (default is number of logical processors). Total active threads may exceed X by up to 4.
Java Parameters
- -Xmx
- This will set Java's memory usage, overriding autodetection. -Xmx20g will specify 20 gigs of RAM, and -Xmx200m will specify 200 megs. The max is typically 85% of physical memory.
- -eoom
- This flag will cause the process to exit if an out-of-memory exception occurs. Requires Java 8u92+.
- -da
- Disable assertions.
Examples
Basic Kmer Coverage Analysis
kmercoverage.sh in=reads.fastq out=annotated.fastq hist=coverage.hist
Analyzes kmer coverage in reads.fastq, outputs annotated reads to annotated.fastq, and generates a coverage histogram in coverage.hist.
High Memory Efficiency Mode
kmercoverage.sh in=reads.fastq hist=coverage.hist kmersample=3 cbits=4 k=25
Runs in memory-efficient mode by sampling every 3rd kmer, using 4 bits per cell, and shorter 25-mer length for reduced memory usage.
Quality Filtering
kmercoverage.sh in=reads.fastq out=filtered.fastq minq=10 minprob=0.8 minmedian=5
Filters out low-quality kmers (quality < 10, probability < 0.8) and reads with median coverage below 5.
Using Extra Input Files
kmercoverage.sh in=target.fastq extra=ref1.fasta,ref2.fasta out=annotated.fastq hist=coverage.hist
Uses ref1.fasta and ref2.fasta to build the kmer hash table but only processes target.fastq for output. Useful for analyzing coverage relative to reference sequences.
Algorithm Details
Hash Table Construction
KmerCoverage implements a two-stage hash table construction process using the ReadCounter.makeKca() method:
- KCountArray Data Structure: Uses bloom.KCountArray with configurable cell bit allocation (2, 4, 8, 16, or 32 bits per cell)
- Memory Allocation Formula: Calculates cells = ((memory-96000000)×0.73)×8/cbits, reserving 96MB overhead
- Multi-threaded Construction: Spawns THREADS hash table building processes with buildStepsize=4 for parallel construction
- Optional Prefilter: When enabled, builds a 2-bit prefilterArray consuming 35% of available bits before main table construction
Kmer Processing Implementation
The generateCoverage() method implements distinct algorithms based on k-mer length:
- K ≤ 31: Uses bit-shifted encoding with mask = ~((-1L)<<(2*k)) for 64-bit kmer representation
- K > 31: Uses Long.rotateLeft() with XOR operations for extended kmer handling, disabling canonical mode
- Canonical Processing: For k ≤ 31, compares forward and reverse complement using lexicographic ordering
- Quality Filtering: Excludes kmers below minq=7 quality threshold and minprob=0.5 correctness probability
Coverage Annotation Pipeline
ProcessThread.countInThread() executes the annotation workflow:
- Kmer Extraction: Sliding window generates overlapping kmers using AminoAcid.baseToNumber[] lookup table
- Hash Lookup: Queries KCountArray using kca.read(kmer, k, CANONICAL) method
- Statistical Computation: Calculates median using Arrays.sort() and average using Tools.averageInt()
- Histogram Update: Thread-local hist[] arrays track coverage distribution with bounds checking
- Quality Threshold Filtering: Discards reads where median < MIN_MEDIAN or average < MIN_AVERAGE
Memory Management Strategy
Memory allocation follows specific calculation patterns from the source code:
- Available Memory: Uses Tools.max(((memory-96000000)×0.73), memory×0.45) to determine usable heap space
- Histogram Memory: Reserves HIST_LEN×8×(THREADS+1) bytes for parallel histogram tracking
- Multi-pass Adjustment: When buildpasses > 1, halves available memory to accommodate temporary structures
- Cell Calculation: Determines cells = (available_memory×8)/cbits for optimal hash table sizing
Spike Detection Algorithm
The fixSpikes() method implements false positive correction using adjacent kmer comparison:
- Peak Detection: Identifies kmers where count[i] > count[i-1] AND count[i] > count[i+1]
- Spike Criteria: Flags peaks where count ≥ 2×adjacent OR count > adjacent+2 on both sides
- Correction Method: Replaces spike values with Tools.max(left_neighbor, right_neighbor)
- Precise Lookup: Uses kca.readPreciseMin() for boundary verification when FIX_SPIKES=true
Topology Classification System
The analyzeSpikes() method categorizes coverage patterns using atomic counters:
- Spikes: AtomicLong counter tracks peaks meeting spike criteria (isolated high values)
- Peaks: Local maxima where coverage[i] > coverage[i±1] but not classified as spikes
- Valleys: Local minima where coverage[i] < coverage[i±1] on both sides
- Slopes: Positions where coverage differs from neighbors but not peaks/valleys
- Flats: Regions where coverage[i] == coverage[i-1] == coverage[i+1]
Canonical Kmer Implementation
Canonical mode handling uses specific bitwise operations:
- K ≤ 31 Mode: Compares kmer and reverse complement using lexicographic ordering in CANONICAL=true mode
- K > 31 Limitation: Line 209: if(k>31){CANONICAL=KmerCountAbstract.CANONICAL=false;} disables canonical processing
- Memory Benefit: Canonical mode approximately halves unique kmer count for storage efficiency
- Accuracy Impact: Non-canonical mode processes forward and reverse kmers separately, doubling hash table entries
Performance Notes
- Threading Architecture: Spawns ProcessThread[THREADS] array with default THREADS=Shared.LOGICAL_PROCESSORS, allowing up to 4 additional threads beyond user specification
- Memory Scaling: Supports kmersamplerate and readsamplerate parameters to process every nth kmer/read, reducing memory footprint proportionally
- I/O Implementation: Uses ConcurrentReadInputStream and ConcurrentReadOutputStream with configurable buffer sizes (8 unordered, max(16, 2×THREADS) ordered)
- Canonical Mode Performance: For k > 31, disables canonical processing to avoid Long.rotateLeft() computational overhead
- Hash Table Limits: When prefilter=false and k < 32, constrains cells ≤ (1L<<(2×k)) to prevent over-allocation
- DEPRECATED Status: Last modified May 23, 2014 - no longer maintained, consider alternatives for production use
Support
For questions and support:
- Email: bbushnell@lbl.gov
- Documentation: bbmap.org
Note: As this is a deprecated tool, support may be limited. Please mention the deprecated status when seeking help.