KmerCoverage

Script: kmercoverage.sh
Package: jgi
Class: KmerCoverage.java

Annotates reads with their kmer depth. DEPRECATED: This should still work but is no longer maintained.

⚠️ DEPRECATED TOOL

This tool is deprecated and no longer maintained. It should still work but may not receive updates or bug fixes. Consider using alternative tools for kmer analysis.

Basic Usage

kmercoverage.sh in=<input> out=<read output> hist=<histogram output>

Processes reads to determine the depth (frequency) of each kmer within the reads. Can output annotated reads and/or generate depth histograms for analysis.

Parameters

Parameters are organized by their function in the kmer coverage analysis process.

Input Parameters

in=file
Primary input file containing reads to analyze.
in2=null
Second input file for paired reads.
extra=null
Additional files to use for input (generating hash table) but not for output. Can be a comma-separated list.
fastareadlen=2^31
Break up FASTA reads longer than this. Can be useful when processing scaffolded genomes.
tablereads=-1
Use at most this many reads when building the hashtable (-1 means all).
kmersample=1
Process every nth kmer, and skip the rest. Reduces memory usage and processing time.
readsample=1
Process every nth read, and skip the rest. Reduces memory usage and processing time.

Output Parameters

out=file
Output file for processed reads with coverage annotations.
hist=null
Specify a file to output the depth histogram showing kmer frequency distribution.
histlen=10000
Max depth displayed on histogram. Values beyond this are binned together.
reads=-1
Only process this number of reads, then quit (-1 means all).
sampleoutput=t
Use sampling on output as well as input (not used if sample rates are 1).
printcoverage=f
Print only coverage information instead of reads.
useheader=f
Append coverage info to the read's header instead of storing it as a separate attachment.
minmedian=0
Don't output reads with median coverage below this threshold.
minaverage=0
Don't output reads with average coverage below this threshold.
zerobin=f
Set to true to place kmers with a count of 0 in the 0 bin of the histogram instead of the 1 bin. Kmers can have a count of 0 because the 'minq' and 'minprob' settings may exclude some kmers from the bloom filter. The default is false, to avoid confusion about how 0-count kmers can exist.
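
The interaction between histlen and zerobin can be illustrated with a small sketch. It is not taken from KmerCoverage.java: the class name DepthHistogram and its methods are hypothetical, and it assumes the histogram has histlen bins with the last bin collecting every larger depth.

```java
/** Depth histogram with an overflow bin and an optional zero bin (illustrative only). */
final class DepthHistogram {
    private final long[] bins;
    private final boolean zeroBin;

    DepthHistogram(int histLen, boolean zeroBin) {
        if (histLen < 2) {
            throw new IllegalArgumentException("histLen must be at least 2");
        }
        this.bins = new long[histLen];   // assumption: histlen bins, indices 0..histLen-1
        this.zeroBin = zeroBin;
    }

    /** Records the depth of one kmer. */
    void add(int depth) {
        int bin = Math.max(0, depth);
        if (bin == 0 && !zeroBin) {
            bin = 1;                                 // zerobin=f: fold 0-count kmers into bin 1
        }
        bin = Math.min(bin, bins.length - 1);        // larger depths share the last bin
        bins[bin]++;
    }

    /** Returns how many kmers were recorded in the given bin. */
    long count(int bin) {
        return bins[bin];
    }
}
```

With histLen=5 and zeroBin=false, depths 0, 3, 4 and 100 land in bins 1, 3, 4 and 4 respectively.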

Hashing Parameters

k=31
Kmer length (values under 32 are most efficient, but arbitrarily high values are supported). Longer kmers are more specific but require more memory.
cbits=8
Bits per cell in bloom filter; must be 2, 4, 8, 16, or 32. Maximum kmer depth recorded is 2^cbits - 1. Large values decrease accuracy for a fixed amount of memory.
hashes=4
Number of times a kmer is hashed. Higher is slower. Higher is MORE accurate if there is enough memory, and LESS accurate if there is not enough memory.
prefilter=f
True is slower, but generally more accurate; filters out low-depth kmers from the main hashtable.
prehashes=2
Number of hashes for prefilter. Used when prefilter=true.
passes=1
More passes can sometimes increase accuracy by iteratively removing low-depth kmers.
minq=7
Ignore kmers containing bases with quality below this threshold. Helps reduce noise from sequencing errors.
minprob=0.5
Ignore kmers with overall probability of correctness below this threshold. Calculated from base qualities.
threads=X
Spawn exactly X hashing threads (default is number of logical processors). Total active threads may exceed X by up to 4.
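
To show how cbits and hashes interact, here is a minimal sketch of a counting bloom filter (count-min sketch) with saturating cells. It is an illustration under stated assumptions, not the KCountArray implementation: the class CountMinSketch, its hash mixing, and the conservative-update rule are all hypothetical choices, and cells are stored one per int rather than bit-packed.

```java
/** Count-min sketch with saturating cells and conservative update (illustrative only). */
final class CountMinSketch {
    private final int[] cells;    // one int per cell for clarity; a real table would pack cbits bits
    private final int hashes;
    private final int maxValue;   // largest count a cbits-wide cell can hold

    CountMinSketch(int numCells, int cbits, int hashes) {
        if (numCells < 1 || cbits < 1 || cbits > 32 || hashes < 1) {
            throw new IllegalArgumentException("invalid sketch dimensions");
        }
        this.cells = new int[numCells];
        this.hashes = hashes;
        // Capped at the int range because this sketch stores cells as ints.
        this.maxValue = (int) Math.min(Integer.MAX_VALUE, (1L << cbits) - 1);
    }

    /** Maps a kmer to the cell used by the i-th hash (arbitrary mixing constants). */
    private int cellIndex(long kmer, int i) {
        long h = kmer ^ (i * 0xD6E8FEB86659FD93L);
        h *= 0x9E3779B97F4A7C15L;
        h ^= h >>> 29;
        h *= 0xBF58476D1CE4E5B9L;
        h ^= h >>> 32;
        return (int) Long.remainderUnsigned(h, cells.length);
    }

    /** Estimated depth: the minimum over all cells the kmer hashes to. */
    int read(long kmer) {
        int min = Integer.MAX_VALUE;
        for (int i = 0; i < hashes; i++) {
            min = Math.min(min, cells[cellIndex(kmer, i)]);
        }
        return min;
    }

    /** Conservative update: only cells currently at the minimum are raised. */
    void increment(long kmer) {
        int min = read(kmer);
        if (min >= maxValue) {
            return;               // saturated; the recorded depth stops growing
        }
        for (int i = 0; i < hashes; i++) {
            int idx = cellIndex(kmer, i);
            if (cells[idx] == min) {
                cells[idx] = min + 1;
            }
        }
    }
}
```

More hashes make a false overestimate less likely only while most cells are still empty; once the table is crowded, every extra hash touches more occupied cells, which matches the accuracy trade-off described for hashes above.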

Java Parameters

-Xmx
This will set Java's memory usage, overriding autodetection. -Xmx20g will specify 20 gigs of RAM, and -Xmx200m will specify 200 megs. The max is typically 85% of physical memory.
-eoom
This flag will cause the process to exit if an out-of-memory exception occurs. Requires Java 8u92+.
-da
Disable assertions.

Examples

Basic Kmer Coverage Analysis

kmercoverage.sh in=reads.fastq out=annotated.fastq hist=coverage.hist

Analyzes kmer coverage in reads.fastq, outputs annotated reads to annotated.fastq, and generates a coverage histogram in coverage.hist.

High Memory Efficiency Mode

kmercoverage.sh in=reads.fastq hist=coverage.hist kmersample=3 cbits=4 k=25

Reduces memory use by sampling every 3rd kmer, using 4 bits per cell, and using a shorter kmer length of 25. No out file is given, so only the histogram is written.
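
The effect of cbits on table size is simple arithmetic, sketched below. The class CellBudget and its methods are hypothetical, and the sketch assumes the whole memory budget is spent on cells with no overhead.

```java
/** Back-of-the-envelope cell budget for a fixed amount of memory (illustrative only). */
final class CellBudget {
    /** Number of cells that fit in memoryBytes when each cell uses cbits bits. */
    static long cells(long memoryBytes, int cbits) {
        return memoryBytes * 8 / cbits;
    }

    /** Largest count a cbits-wide cell can hold. */
    static long maxDepth(int cbits) {
        return (1L << cbits) - 1;
    }
}
```

For 1 GiB of table memory, cbits=8 gives about 1.07 billion cells and cbits=4 gives twice that, at the cost of a much lower maximum recordable depth per cell.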

Quality Filtering

kmercoverage.sh in=reads.fastq out=filtered.fastq minq=10 minprob=0.8 minmedian=5

Ignores kmers that contain a base with quality below 10 or that have an overall probability of correctness below 0.8, and discards reads with median coverage below 5.
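
The two kmer-level filters can be sketched as follows. This assumes Phred-scaled qualities, where a base with quality q has error probability 10^(-q/10), and assumes the kmer's probability of correctness is the product of its per-base correctness probabilities. The class KmerQuality is hypothetical and is not code from the tool.

```java
/** Quality-based kmer filter in the spirit of minq and minprob (illustrative only). */
final class KmerQuality {
    /** Error probability of a base with Phred quality q. */
    static double errorProbability(int q) {
        return Math.pow(10.0, -q / 10.0);
    }

    /**
     * Returns true if the kmer starting at 'start' passes both filters:
     * no base below minq, and joint correctness probability at least minprob.
     */
    static boolean passes(int[] quals, int start, int k, int minq, double minprob) {
        double prob = 1.0;
        for (int i = start; i < start + k; i++) {
            if (quals[i] < minq) {
                return false;                         // minq filter
            }
            prob *= 1.0 - errorProbability(quals[i]); // assumes independent base errors
        }
        return prob >= minprob;                       // minprob filter
    }
}
```

Under these assumptions, a 5-mer whose bases all have quality 10 has a correctness probability of 0.9^5, roughly 0.59, so it passes minprob=0.5 but fails minprob=0.8.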

Using Extra Input Files

kmercoverage.sh in=target.fastq extra=ref1.fasta,ref2.fasta out=annotated.fastq hist=coverage.hist

Uses ref1.fasta and ref2.fasta to build the kmer hash table but only processes target.fastq for output. Useful for analyzing coverage relative to reference sequences.

Algorithm Details

Hash Table Construction

KmerCoverage implements a two-stage hash table construction process using the ReadCounter.makeKca() method:

Kmer Processing Implementation

The generateCoverage() method implements distinct algorithms based on k-mer length:

Coverage Annotation Pipeline

ProcessThread.countInThread() executes the annotation workflow:

  1. Kmer Extraction: Sliding window generates overlapping kmers using AminoAcid.baseToNumber[] lookup table
  2. Hash Lookup: Queries KCountArray using kca.read(kmer, k, CANONICAL) method
  3. Statistical Computation: Calculates median using Arrays.sort() and average using Tools.averageInt()
  4. Histogram Update: Thread-local hist[] arrays track coverage distribution with bounds checking
  5. Quality Threshold Filtering: Discards reads where median < MIN_MEDIAN or average < MIN_AVERAGE
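
The steps above can be sketched end to end in a few lines. This is a simplified illustration, not the code in ProcessThread: the class CoverageSketch is hypothetical, an exact HashMap stands in for KCountArray, only forward kmers are used (no canonical mode), k is limited to 31, and the median is taken as the middle element of the sorted depths.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

/** Simplified per-read coverage annotation (illustrative only). */
final class CoverageSketch {
    /** 2-bit base code, or -1 for anything other than A, C, G, T. */
    static int baseToNumber(char b) {
        switch (Character.toUpperCase(b)) {
            case 'A': return 0;
            case 'C': return 1;
            case 'G': return 2;
            case 'T': return 3;
            default:  return -1;
        }
    }

    /** Packed forward kmers, one per window; -1 marks a window containing an invalid base. */
    static long[] kmers(String read, int k) {
        if (k < 1 || k > 31) { throw new IllegalArgumentException("k must be 1..31"); }
        int n = read.length() - k + 1;
        if (n <= 0) { return new long[0]; }
        long[] out = new long[n];
        Arrays.fill(out, -1L);
        long mask = (1L << (2 * k)) - 1;
        long kmer = 0;
        int valid = 0;                       // consecutive valid bases seen so far
        for (int i = 0; i < read.length(); i++) {
            int x = baseToNumber(read.charAt(i));
            if (x < 0) { valid = 0; kmer = 0; continue; }
            kmer = ((kmer << 2) | x) & mask; // slide the window by one base
            if (++valid >= k) { out[i - k + 1] = kmer; }
        }
        return out;
    }

    /** Adds every valid kmer of the read to the count table. */
    static void addRead(String read, int k, Map<Long, Integer> table) {
        for (long km : kmers(read, k)) {
            if (km >= 0) { table.merge(km, 1, Integer::sum); }
        }
    }

    /** Depth of each window in the read; invalid windows get depth 0. */
    static int[] depths(String read, int k, Map<Long, Integer> table) {
        long[] kms = kmers(read, k);
        int[] d = new int[kms.length];
        for (int i = 0; i < kms.length; i++) {
            d[i] = kms[i] < 0 ? 0 : table.getOrDefault(kms[i], 0);
        }
        return d;
    }

    static int median(int[] d) {
        if (d.length == 0) { return 0; }
        int[] sorted = d.clone();
        Arrays.sort(sorted);
        return sorted[sorted.length / 2];
    }

    static double average(int[] d) {
        if (d.length == 0) { return 0; }
        long sum = 0;
        for (int x : d) { sum += x; }
        return sum / (double) d.length;
    }

    /** Mirrors the minmedian and minaverage thresholds. */
    static boolean keep(int[] d, int minMedian, double minAverage) {
        return median(d) >= minMedian && average(d) >= minAverage;
    }
}
```

A typical use is to call addRead for every input read, then compute depths for each read and pass the result to keep to decide whether the read is written.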

Memory Management Strategy

Memory allocation follows specific calculation patterns from the source code:

Spike Detection Algorithm

The fixSpikes() method implements false positive correction using adjacent kmer comparison:
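
One plausible form of adjacent-kmer comparison is sketched here. A bloom-filter false positive tends to inflate a single kmer's count while its neighbors, which overlap it by k-1 bases, stay low, so an isolated peak can be clamped to the larger neighbor. The class SpikeFixer and the thresholds (at least twice the neighbor, or more than 2 above it) are assumptions for illustration and should be checked against the source before relying on them.

```java
/** Clamps isolated single-position spikes in a depth array (illustrative only). */
final class SpikeFixer {
    static void fixSpikes(int[] depths) {
        for (int i = 1; i < depths.length - 1; i++) {
            long left = Math.max(1, depths[i - 1]);
            int mid = depths[i];
            long right = Math.max(1, depths[i + 1]);
            boolean isPeak = mid > 1 && mid > left && mid > right;
            if (!isPeak) {
                continue;
            }
            // Assumed spike rule: far above both neighbors, not just slightly higher.
            boolean aboveLeft = mid >= 2 * left || mid > left + 2;
            boolean aboveRight = mid >= 2 * right || mid > right + 2;
            if (aboveLeft && aboveRight) {
                depths[i] = (int) Math.max(left, right);
            }
        }
    }
}
```

For example, the depths 5, 5, 40, 5, 5 become 5, 5, 5, 5, 5, while 5, 6, 5 is left unchanged because the middle value is only slightly above its neighbors.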

Topology Classification System

The analyzeSpikes() method categorizes coverage patterns using atomic counters:
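
As a rough illustration of this kind of classification, the sketch below labels each interior position of a depth array by comparing it with its two neighbors and accumulates the totals in atomic counters so that several threads can report safely. The category names (peak, spike, valley, flat, slope), the spike rule, and the class TopologyCounter are assumptions made for this example, not a description of the tool's actual categories.

```java
import java.util.concurrent.atomic.AtomicLong;

/** Classifies depth-array positions by local shape (illustrative only). */
final class TopologyCounter {
    final AtomicLong peaks = new AtomicLong();
    final AtomicLong spikes = new AtomicLong();
    final AtomicLong valleys = new AtomicLong();
    final AtomicLong flats = new AtomicLong();
    final AtomicLong slopes = new AtomicLong();

    /** Tallies one read locally, then publishes the totals with one atomic add per counter. */
    void analyze(int[] depths) {
        long p = 0, sp = 0, v = 0, f = 0, s = 0;
        for (int i = 1; i < depths.length - 1; i++) {
            int a = depths[i - 1];
            int b = depths[i];
            int c = depths[i + 1];
            if (b > a && b > c) {
                p++;
                // Assumed spike rule: a peak far above both neighbors.
                if ((b >= 2L * a || b > a + 2L) && (b >= 2L * c || b > c + 2L)) {
                    sp++;
                }
            } else if (b < a && b < c) {
                v++;
            } else if (b == a && b == c) {
                f++;
            } else {
                s++;
            }
        }
        peaks.addAndGet(p);
        spikes.addAndGet(sp);
        valleys.addAndGet(v);
        flats.addAndGet(f);
        slopes.addAndGet(s);
    }
}
```

Tallying in local variables and adding once per read keeps contention on the shared counters low when many threads call analyze concurrently.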

Canonical Kmer Implementation

Canonical mode handling uses specific bitwise operations:
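
A common way to implement canonical kmers with bitwise operations is sketched below: bases are packed 2 bits each (A=0, C=1, G=2, T=3), so the complement of a base code x is 3 - x, and the canonical form is whichever of the kmer and its reverse complement has the larger packed value. Choosing the larger rather than the smaller value is an assumption here, and the class CanonicalKmer is hypothetical.

```java
/** Canonical kmer computation on 2-bit packed kmers (illustrative only). */
final class CanonicalKmer {
    /** Packs a string of A/C/G/T (at most 31 bases) into a long, 2 bits per base. */
    static long encode(String seq) {
        if (seq.length() > 31) { throw new IllegalArgumentException("at most 31 bases"); }
        long kmer = 0;
        for (int i = 0; i < seq.length(); i++) {
            int x = "ACGT".indexOf(Character.toUpperCase(seq.charAt(i)));
            if (x < 0) { throw new IllegalArgumentException("invalid base: " + seq.charAt(i)); }
            kmer = (kmer << 2) | x;
        }
        return kmer;
    }

    /** Reverse complement of a packed kmer of length k. */
    static long reverseComplement(long kmer, int k) {
        long rc = 0;
        for (int i = 0; i < k; i++) {
            rc = (rc << 2) | (3 - (kmer & 3)); // complement the lowest base, append to rc
            kmer >>>= 2;                        // move to the next base
        }
        return rc;
    }

    /** Canonical form: the larger of the kmer and its reverse complement (assumed convention). */
    static long canonical(long kmer, int k) {
        return Math.max(kmer, reverseComplement(kmer, k));
    }
}
```

Because a kmer and its reverse complement map to the same canonical value, reads from either strand contribute to the same count.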

Performance Notes

Support

For questions and support:

Note: As this is a deprecated tool, support may be limited. Please mention the deprecated status when seeking help.