KmerLimit
Stops producing reads when the unique kmer limit is reached. This is approximate. If the input has been Clumpified, the order should be randomized first with shuffle2.sh or sortbyname.sh with the flowcell flag.
Basic Usage
kmerlimit.sh in=<input file> out=<output file> limit=<number>
KmerLimit processes reads sequentially and outputs all reads until the unique kmer limit is reached. Input reads should be in random order with respect to sequence for accurate sampling.
Important Notes
Version Differences:
- kmerlimit.sh uses 1 pass and outputs all reads until a limit is hit, meaning the input reads should be in random order with respect to sequence.
- kmerlimit2.sh uses 2 passes and randomly subsamples from the file, so it works with reads in any order.
Parameters
Parameters are organized by their function in the kmer limiting process.
Standard Parameters
- in=file
- Primary input, or read 1 input.
- in2=file
- Read 2 input if reads are in two files.
- out=file
- Primary output, or read 1 output.
- out2=file
- Read 2 output if reads are in two files.
- overwrite=t
- (ow) Set to false to force the program to abort rather than overwrite an existing file.
- ziplevel=2
- (zl) Set to 1 (lowest) through 9 (max) to change compression level; lower compression is faster.
Processing Parameters
- k=31
- Kmer length, 1-32. Controls the size of kmers used for tracking unique sequences.
- limit=
- The number of unique kmers to produce. This is the target limit that determines when to stop processing reads.
- mincount=1
- Ignore kmers seen fewer than this many times. Helps filter out sequencing errors by requiring kmers to appear multiple times.
- minqual=0
- Ignore bases with quality below this. Quality filtering helps improve kmer accuracy by excluding low-quality bases.
- minprob=0.2
- Ignore kmers with correctness probability below this. Uses quality scores to calculate the probability that a kmer is correct.
Java Parameters
- -Xmx
- This will set Java's memory usage, overriding autodetection. -Xmx20g will specify 20 gigs of RAM, and -Xmx200m will specify 200 megs. The max is typically 85% of physical memory.
- -eoom
- This flag will cause the process to exit if an out-of-memory exception occurs. Requires Java 8u92+.
- -da
- Disable assertions.
Examples
Basic Kmer Limiting
kmerlimit.sh in=reads.fq out=limited.fq limit=1000000
Process reads until 1 million unique kmers are encountered, then stop. All reads contributing to those unique kmers will be included in the output.
Quality-Filtered Kmer Limiting
kmerlimit.sh in=reads.fq out=limited.fq limit=500000 minqual=10 minprob=0.5
Limit to 500k unique kmers while requiring minimum quality 10 and probability 0.5 for kmer correctness, reducing noise from sequencing errors.
Paired-End Processing
kmerlimit.sh in1=reads_R1.fq in2=reads_R2.fq out1=limited_R1.fq out2=limited_R2.fq limit=2000000
Process paired-end reads until 2 million unique kmers are reached, maintaining proper pairing in the output files.
Error-Resilient Limiting
kmerlimit.sh in=reads.fq out=limited.fq limit=1000000 mincount=3 k=21
Use shorter kmers (k=21) and require each kmer to appear at least 3 times to filter out most sequencing errors before counting toward the limit.
Algorithm Details
Kmer-Based Limiting Strategy
KmerLimit implements a SketchObject-based approach to track unique kmers with approximate limiting using multi-threaded ProcessThread instances:
Core Algorithm
- Kmer Extraction: Slides a window of length k across each read using bit-shift operations (kmer<<2|x)&mask for forward, ((rkmer>>>2)|(x2<<shift2))&mask for reverse complement
- Canonical Kmers: Calls hash(kmer, rkmer) method from SketchObject to select lexicographically smaller representation
- SketchHeap Storage: Uses SketchHeap.add(hashcode) for kmer storage, with configurable heap sizes (4095 default, 32000 when mincount>1)
- Multi-threaded Processing: ProcessThread.dumpHeap() synchronizes local heaps with shared heap via synchronized(sharedHeap) blocks
Quality Integration
When quality scores are available and minprob > 0, the algorithm calculates kmer correctness probability:
- Base Quality Assessment: Uses align2.QualityTools.PROB_CORRECT[q] lookup table for per-base error probability conversion
- Kmer Probability: Multiplies base probabilities (prob=prob*align2.QualityTools.PROB_CORRECT[q]) across k-mer window
- Sliding Window Update: Updates probability by multiplying out old base with align2.QualityTools.PROB_CORRECT_INVERSE[oldq] when len>k
- Threshold Filtering: Only kmers with prob>=minProb pass localHeap.checkAndAdd(hashcode) and contribute to probSum tracking
Memory Management
- Adaptive Heap Size: heapSize_=4095 default, automatically set to 32000 when minCount_>1 (line 125 in constructor)
- Local Threading: Each ProcessThread gets tSize=heapSize/2 for local SketchHeap initialization to minimize contention
- Periodic Synchronization: dumpHeap() calls sharedHeap.genomeSizeEstimate(minCount) to check progress, then sharedHeap.add(localHeap) when count<targetKmers
- Early Termination: ProcessThread.processInner() breaks loop when dumpHeap() returns count>=targetKmers
Threading Architecture
- Thread Limitation: Tools.min(8, Shared.threads()) caps maximum thread usage at 8 threads regardless of system capacity
- Buffer Length Optimization: Shared.setBufferLen(800) when minCount>1 to handle increased I/O demands
- Concurrent Stream Processing: ConcurrentReadInputStream.getReadInputStream() with paired-end awareness
- Approximation Behavior: The limiting is approximate due to ProcessThread batch processing and synchronized heap dumps
Input Order Requirements
For accurate results, input reads should be randomized with respect to sequence similarity:
- Clumpified Data: If input has been processed with Clumpify, use shuffle2.sh or sortbyname.sh with flowcell flag first
- Random Sampling: Random order ensures the output represents a diverse subset of the input
- Alternative: Use kmerlimit2.sh if input order cannot be randomized, as it uses 2-pass random subsampling
Technical Notes
Kmer Length Considerations
The choice of k affects sensitivity and specificity:
- Shorter kmers (k=15-21): More sensitive to sequence similarity but higher chance of random matches
- Medium kmers (k=21-31): Good balance for most applications, default k=31
- Maximum k=32: Highest specificity but requires exact matches, enforced by assert(k>0 && k<=32)
Memory Usage
Memory requirements scale with heap size and thread count:
- Base Memory: Each SketchHeap entry stores long values (8 bytes) plus metadata
- Thread Overhead: Each ProcessThread maintains localHeap of size tSize=heapSize/2
- Genome Tracking: localHeap.genomeSizeBases and localHeap.genomeSequences track input statistics
Output Characteristics
The output has specific properties due to the limiting algorithm:
- Approximate Limit: Final kmer count may slightly exceed target due to threading and batch synchronization
- Read Completeness: All reads processed before dumpHeap() triggers termination are included
- Pair Maintenance: Paired-end reads are kept together through r1.mate relationships in ProcessThread
Support
For questions and support:
- Email: bbushnell@lbl.gov
- Documentation: bbmap.org