KmerLimit

Script: kmerlimit.sh Package: sketch Class: KmerLimit.java

Stops producing reads when the unique kmer limit is reached. This is approximate. If the input has been Clumpified, the order should be randomized first with shuffle2.sh or sortbyname.sh with the flowcell flag.

Basic Usage

kmerlimit.sh in=<input file> out=<output file> limit=<number>

KmerLimit processes reads sequentially and outputs all reads until the unique kmer limit is reached. Input reads should be in random order with respect to sequence for accurate sampling.

Important Notes

Version Differences:

  • kmerlimit.sh uses 1 pass and outputs all reads until a limit is hit, meaning the input reads should be in random order with respect to sequence.
  • kmerlimit2.sh uses 2 passes and randomly subsamples from the file, so it works with reads in any order.

Parameters

Parameters are organized by their function in the kmer limiting process.

Standard Parameters

in=file
Primary input, or read 1 input.
in2=file
Read 2 input if reads are in two files.
out=file
Primary output, or read 1 output.
out2=file
Read 2 output if reads are in two files.
overwrite=t
(ow) Set to false to force the program to abort rather than overwrite an existing file.
ziplevel=2
(zl) Set to 1 (lowest) through 9 (max) to change compression level; lower compression is faster.

Processing Parameters

k=31
Kmer length, 1-32. Controls the size of kmers used for tracking unique sequences.
limit=
The number of unique kmers to produce. This is the target limit that determines when to stop processing reads.
mincount=1
Ignore kmers seen fewer than this many times. Helps filter out sequencing errors by requiring kmers to appear multiple times.
minqual=0
Ignore bases with quality below this. Quality filtering helps improve kmer accuracy by excluding low-quality bases.
minprob=0.2
Ignore kmers with correctness probability below this. Uses quality scores to calculate the probability that a kmer is correct.

Java Parameters

-Xmx
This will set Java's memory usage, overriding autodetection. -Xmx20g will specify 20 gigs of RAM, and -Xmx200m will specify 200 megs. The max is typically 85% of physical memory.
-eoom
This flag will cause the process to exit if an out-of-memory exception occurs. Requires Java 8u92+.
-da
Disable assertions.

Examples

Basic Kmer Limiting

kmerlimit.sh in=reads.fq out=limited.fq limit=1000000

Process reads until 1 million unique kmers are encountered, then stop. All reads contributing to those unique kmers will be included in the output.

Quality-Filtered Kmer Limiting

kmerlimit.sh in=reads.fq out=limited.fq limit=500000 minqual=10 minprob=0.5

Limit to 500k unique kmers while requiring minimum quality 10 and probability 0.5 for kmer correctness, reducing noise from sequencing errors.

Paired-End Processing

kmerlimit.sh in1=reads_R1.fq in2=reads_R2.fq out1=limited_R1.fq out2=limited_R2.fq limit=2000000

Process paired-end reads until 2 million unique kmers are reached, maintaining proper pairing in the output files.

Error-Resilient Limiting

kmerlimit.sh in=reads.fq out=limited.fq limit=1000000 mincount=3 k=21

Use shorter kmers (k=21) and require each kmer to appear at least 3 times to filter out most sequencing errors before counting toward the limit.

Algorithm Details

Kmer-Based Limiting Strategy

KmerLimit implements a SketchObject-based approach to track unique kmers with approximate limiting using multi-threaded ProcessThread instances:

Core Algorithm

Quality Integration

When quality scores are available and minprob > 0, the algorithm calculates kmer correctness probability:

Memory Management

Threading Architecture

Input Order Requirements

For accurate results, input reads should be randomized with respect to sequence similarity:

Technical Notes

Kmer Length Considerations

The choice of k affects sensitivity and specificity:

  • Shorter kmers (k=15-21): More sensitive to sequence similarity but higher chance of random matches
  • Medium kmers (k=21-31): Good balance for most applications, default k=31
  • Maximum k=32: Highest specificity but requires exact matches, enforced by assert(k>0 && k<=32)

Memory Usage

Memory requirements scale with heap size and thread count:

  • Base Memory: Each SketchHeap entry stores long values (8 bytes) plus metadata
  • Thread Overhead: Each ProcessThread maintains localHeap of size tSize=heapSize/2
  • Genome Tracking: localHeap.genomeSizeBases and localHeap.genomeSequences track input statistics

Output Characteristics

The output has specific properties due to the limiting algorithm:

  • Approximate Limit: Final kmer count may slightly exceed target due to threading and batch synchronization
  • Read Completeness: All reads processed before dumpHeap() triggers termination are included
  • Pair Maintenance: Paired-end reads are kept together through r1.mate relationships in ProcessThread

Support

For questions and support: