KmerLimit

Basic Usage

kmerlimit.sh in=<input file> out=<output file> limit=<number>

KmerLimit processes reads sequentially and outputs all reads until the unique kmer limit is reached. Input reads should be in random order with respect to sequence for accurate sampling.

Important Notes

Version Differences:

kmerlimit.sh uses 1 pass and outputs all reads until a limit is hit, meaning the input reads should be in random order with respect to sequence.
kmerlimit2.sh uses 2 passes and randomly subsamples from the file, so it works with reads in any order.

Parameters

Parameters are organized by their function in the kmer limiting process.

Standard Parameters

in=file: Primary input, or read 1 input.
in2=file: Read 2 input if reads are in two files.
out=file: Primary output, or read 1 output.
out2=file: Read 2 output if reads are in two files.
overwrite=t: (ow) Set to false to force the program to abort rather than overwrite an existing file.
ziplevel=2: (zl) Set to 1 (lowest) through 9 (max) to change compression level; lower compression is faster.

Processing Parameters

k=31: Kmer length, 1-32. Controls the size of kmers used for tracking unique sequences.
limit=: The number of unique kmers to produce. This is the target limit that determines when to stop processing reads.
mincount=1: Ignore kmers seen fewer than this many times. Helps filter out sequencing errors by requiring kmers to appear multiple times.
minqual=0: Ignore bases with quality below this. Quality filtering helps improve kmer accuracy by excluding low-quality bases.
minprob=0.2: Ignore kmers with correctness probability below this. Uses quality scores to calculate the probability that a kmer is correct.

Java Parameters

-Xmx: This will set Java's memory usage, overriding autodetection. -Xmx20g will specify 20 gigs of RAM, and -Xmx200m will specify 200 megs. The max is typically 85% of physical memory.
-eoom: This flag will cause the process to exit if an out-of-memory exception occurs. Requires Java 8u92+.
-da: Disable assertions.

Examples

Basic Kmer Limiting

kmerlimit.sh in=reads.fq out=limited.fq limit=1000000

Process reads until 1 million unique kmers are encountered, then stop. All reads contributing to those unique kmers will be included in the output.

Quality-Filtered Kmer Limiting

kmerlimit.sh in=reads.fq out=limited.fq limit=500000 minqual=10 minprob=0.5

Limit to 500k unique kmers while requiring minimum quality 10 and probability 0.5 for kmer correctness, reducing noise from sequencing errors.

Paired-End Processing

kmerlimit.sh in1=reads_R1.fq in2=reads_R2.fq out1=limited_R1.fq out2=limited_R2.fq limit=2000000

Process paired-end reads until 2 million unique kmers are reached, maintaining proper pairing in the output files.

Error-Resilient Limiting

kmerlimit.sh in=reads.fq out=limited.fq limit=1000000 mincount=3 k=21

Use shorter kmers (k=21) and require each kmer to appear at least 3 times to filter out most sequencing errors before counting toward the limit.

Algorithm Details

Kmer-Based Limiting Strategy

KmerLimit implements a SketchObject-based approach to track unique kmers with approximate limiting using multi-threaded ProcessThread instances:

Core Algorithm

Kmer Extraction: Slides a window of length k across each read using bit-shift operations (kmer<<2|x)&mask for forward, ((rkmer>>>2)|(x2<<shift2))&mask for reverse complement
Canonical Kmers: Calls hash(kmer, rkmer) method from SketchObject to select lexicographically smaller representation
SketchHeap Storage: Uses SketchHeap.add(hashcode) for kmer storage, with configurable heap sizes (4095 default, 32000 when mincount>1)
Multi-threaded Processing: ProcessThread.dumpHeap() synchronizes local heaps with shared heap via synchronized(sharedHeap) blocks

Quality Integration

When quality scores are available and minprob > 0, the algorithm calculates kmer correctness probability:

Base Quality Assessment: Uses align2.QualityTools.PROB_CORRECT[q] lookup table for per-base error probability conversion
Kmer Probability: Multiplies base probabilities (prob=prob*align2.QualityTools.PROB_CORRECT[q]) across k-mer window
Sliding Window Update: Updates probability by multiplying out old base with align2.QualityTools.PROB_CORRECT_INVERSE[oldq] when len>k
Threshold Filtering: Only kmers with prob>=minProb pass localHeap.checkAndAdd(hashcode) and contribute to probSum tracking

Memory Management

Adaptive Heap Size: heapSize_=4095 default, automatically set to 32000 when minCount_>1 (line 125 in constructor)
Local Threading: Each ProcessThread gets tSize=heapSize/2 for local SketchHeap initialization to minimize contention
Periodic Synchronization: dumpHeap() calls sharedHeap.genomeSizeEstimate(minCount) to check progress, then sharedHeap.add(localHeap) when count<targetKmers
Early Termination: ProcessThread.processInner() breaks loop when dumpHeap() returns count>=targetKmers

Threading Architecture

Thread Limitation: Tools.min(8, Shared.threads()) caps maximum thread usage at 8 threads regardless of system capacity
Buffer Length Optimization: Shared.setBufferLen(800) when minCount>1 to handle increased I/O demands
Concurrent Stream Processing: ConcurrentReadInputStream.getReadInputStream() with paired-end awareness
Approximation Behavior: The limiting is approximate due to ProcessThread batch processing and synchronized heap dumps

Input Order Requirements

For accurate results, input reads should be randomized with respect to sequence similarity:

Clumpified Data: If input has been processed with Clumpify, use shuffle2.sh or sortbyname.sh with flowcell flag first
Random Sampling: Random order ensures the output represents a diverse subset of the input
Alternative: Use kmerlimit2.sh if input order cannot be randomized, as it uses 2-pass random subsampling

Technical Notes

Kmer Length Considerations

The choice of k affects sensitivity and specificity:

Shorter kmers (k=15-21): More sensitive to sequence similarity but higher chance of random matches
Medium kmers (k=21-31): Good balance for most applications, default k=31
Maximum k=32: Highest specificity but requires exact matches, enforced by assert(k>0 && k<=32)

Memory Usage

Memory requirements scale with heap size and thread count:

Base Memory: Each SketchHeap entry stores long values (8 bytes) plus metadata
Thread Overhead: Each ProcessThread maintains localHeap of size tSize=heapSize/2
Genome Tracking: localHeap.genomeSizeBases and localHeap.genomeSequences track input statistics

Output Characteristics

The output has specific properties due to the limiting algorithm:

Approximate Limit: Final kmer count may slightly exceed target due to threading and batch synchronization
Read Completeness: All reads processed before dumpHeap() triggers termination are included
Pair Maintenance: Paired-end reads are kept together through r1.mate relationships in ProcessThread

Support

For questions and support:

Email: bbushnell@lbl.gov
Documentation: bbmap.org