KmerLimit2
Subsamples reads to reach a target unique kmer limit using a two-pass approach that works with reads in any order.
Basic Usage
kmerlimit2.sh in=<input file> out=<output file> limit=<number>
KmerLimit2 subsamples reads to reach a target number of unique k-mers. Unlike kmerlimit.sh, which uses 1 pass and requires reads to be in random order, kmerlimit2.sh uses 2 passes and randomly subsamples from the file, so it works with reads in any order.
Key Differences from KmerLimit
- kmerlimit.sh: Uses 1 pass and outputs all reads until a limit is hit, meaning the input reads should be in random order with respect to sequence.
- kmerlimit2.sh: Uses 2 passes and randomly subsamples from the file, so it works with reads in any order.
Parameters
Parameters are organized by their function in the kmer limiting process.
Standard parameters
- in=<file>
- Primary input, or read 1 input.
- in2=<file>
- Read 2 input if reads are in two files.
- out=<file>
- Primary output, or read 1 output.
- out2=<file>
- Read 2 output if reads are in two files.
- overwrite=t
- (ow) Set to false to force the program to abort rather than overwrite an existing file.
- ziplevel=2
- (zl) Set to 1 (lowest) through 9 (max) to change compression level; lower compression is faster.
Processing parameters
- k=31
- Kmer length, 1-32. Default is 31.
- limit=
- The number of unique kmers to produce. This is a required parameter.
- mincount=1
- Ignore kmers seen fewer than this many times. Default is 1.
- minqual=0
- Ignore bases with quality below this. Default is 0 (no quality filtering).
- minprob=0.2
- Ignore kmers with correctness probability below this. Default is 0.2.
- trials=25
- Number of simulation trials used to estimate the target read subsample rate. More trials give more accurate estimates but take longer. Default is 25.
- seed=-1
- Random seed for deterministic output. Set to a positive number for reproducible results. Default is -1 (random seed).
- maxlen=50m
- Maximum length of a temporary array used in simulation. Limits memory usage during the simulation phase. Default is 50 million.
Java Parameters
- -Xmx
- This will set Java's memory usage, overriding autodetection. -Xmx20g will specify 20 gigs of RAM, and -Xmx200m will specify 200 megs. The max is typically 85% of physical memory.
- -eoom
- This flag will cause the process to exit if an out-of-memory exception occurs. Requires Java 8u92+.
- -da
- Disable assertions.
Examples
Basic Usage
kmerlimit2.sh in=reads.fq out=subsampled.fq limit=1000000
Subsample reads to achieve approximately 1 million unique k-mers (k=31).
Paired-end Reads
kmerlimit2.sh in1=reads_R1.fq in2=reads_R2.fq out1=sub_R1.fq out2=sub_R2.fq limit=500000
Subsample paired-end reads to achieve approximately 500,000 unique k-mers.
Custom K-mer Length and Quality Filtering
kmerlimit2.sh in=reads.fq out=filtered.fq limit=2000000 k=21 minqual=10 minprob=0.5
Use k-mers of length 21, filter out bases with quality below 10, and ignore k-mers with correctness probability below 0.5.
Deterministic Output
kmerlimit2.sh in=reads.fq out=subsampled.fq limit=1000000 seed=12345 trials=50
Use a fixed seed for reproducible results and increase trials for more accurate subsampling rate estimation.
Algorithm Details
Two-Pass Approach
KmerLimit2 uses a two-pass algorithm that allows it to work with reads in any order, unlike the single-pass approach of kmerlimit.sh:
Pass 1: Kmer Counting and Rate Estimation
In the first pass, KmerLimit2:
- Processes all input reads to build a sketch of unique k-mers using a SketchHeap data structure
- Counts k-mer frequencies while respecting quality and probability filters
- Estimates the total number of unique k-mers in the dataset
- Runs multiple simulation trials to determine the optimal subsampling rate needed to achieve the target k-mer count
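The pass-1 logic can be illustrated with a minimal, self-contained sketch. This is not the BBTools implementation: a plain HashMap stands in for the bounded SketchHeap, kmers are kept as strings rather than hashed 2-bit encodings, and the quality/probability filters described later are omitted.

import java.util.HashMap;
import java.util.Map;

// Illustrative pass-1 sketch: tally kmer occurrences across all reads.
public class Pass1Sketch {
    static Map<String, Integer> countKmers(Iterable<String> reads, int k) {
        Map<String, Integer> counts = new HashMap<>();
        for (String seq : reads) {
            for (int i = 0; i + k <= seq.length(); i++) {
                String kmer = seq.substring(i, i + k);
                if (kmer.indexOf('N') >= 0) continue; // skip kmers containing ambiguous bases
                counts.merge(kmer, 1, Integer::sum);
            }
        }
        return counts; // counts.size() estimates the number of unique kmers
    }
}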
Pass 2: Read Subsampling
In the second pass, KmerLimit2:
- Re-reads the input file and applies the calculated subsampling rate
- Uses the ConcurrentReadInputStream's built-in sampling capability for uniform random sampling
- Outputs the subsampled reads while tracking actual k-mer diversity achieved
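Conceptually, the second pass reduces to keeping each read (or read pair) with probability equal to the estimated rate. Below is a minimal sketch under that assumption; the real tool delegates the sampling to ConcurrentReadInputStream rather than looping explicitly, and the class and method names here are invented.

import java.util.Random;
import java.util.function.Consumer;

// Illustrative pass-2 sketch: uniform random subsampling at a fixed rate.
public class Pass2Sketch {
    static void subsample(Iterable<String[]> readPairs, double rate, long seed,
                          Consumer<String[]> writer) {
        Random rng = (seed < 0 ? new Random() : new Random(seed)); // seed=-1 -> nondeterministic
        for (String[] pair : readPairs) {
            if (rng.nextDouble() < rate) { // each read/pair decided independently, so order does not matter
                writer.accept(pair);       // both mates are kept or dropped together
            }
        }
    }
}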
Monte Carlo Simulation Algorithm
The simulation phase uses a Monte Carlo approach to estimate the subsampling rate needed to reach the target unique k-mer count (a sketch follows this list):
- Expanded Array Method: Creates a temporary integer array where each k-mer is represented once for each time it was observed, with array length equal to the sum of all k-mer counts
- IntMap-based Tracking: Uses an IntMap data structure to efficiently track current k-mer counts during simulation, avoiding array copying overhead
- Random Reduction Simulation: Performs multiple trials (default 25) where k-mer instances are randomly removed until target k-mer diversity is reached
- Statistical Averaging: Calculates the fraction of reads to retain as: (1 - averageRoundsRemoved/totalKmerCounts) where averageRoundsRemoved is computed across all simulation trials
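The simulation can be sketched as follows. This is illustrative rather than the actual BBTools code: cloned arrays stand in for the IntMap, the maxlen cap on the expanded array is noted but not applied, and all names are invented.

import java.util.Random;

// Illustrative Monte Carlo sketch: estimate the read fraction to retain.
public class RateSimSketch {
    // counts[i] = observed count of unique kmer i (from pass 1, each >= 1);
    // target = desired number of unique kmers; trials = simulation trials (default 25).
    static double estimateRate(int[] counts, long target, int trials, long seed) {
        long sum = 0;
        for (int c : counts) sum += c;
        int total = (int) sum; // the real tool caps this length at maxlen
        // Expanded array: each kmer id appears once per observation.
        int[] expanded = new int[total];
        for (int id = 0, pos = 0; id < counts.length; id++) {
            for (int j = 0; j < counts[id]; j++) expanded[pos++] = id;
        }
        Random rng = new Random(seed);
        double removedSum = 0;
        for (int t = 0; t < trials; t++) {
            int[] current = counts.clone(); // stands in for the IntMap of live counts
            int[] pool = expanded.clone();
            int remaining = total;
            long distinct = counts.length;
            while (distinct > target && remaining > 0) {
                int idx = rng.nextInt(remaining);   // pick a random kmer instance
                int id = pool[idx];
                pool[idx] = pool[--remaining];      // swap-remove it from the pool
                if (--current[id] == 0) distinct--; // that kmer is no longer observed
            }
            removedSum += (total - remaining);      // rounds removed this trial
        }
        double avgRemoved = removedSum / trials;
        return 1.0 - avgRemoved / total;            // fraction of reads to retain
    }
}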
Memory Management
KmerLimit2 implements several memory optimization strategies:
- SketchHeap: Uses a fixed-size heap (default 8091 entries, or 32000 when mincount > 1) to store the most frequent k-mers, preventing memory exhaustion on large datasets
- Adaptive Heap Sizing: Automatically adjusts heap size based on the mincount parameter, increasing to 32000 entries when mincount > 1 to accommodate the reduced k-mer diversity
- Simulation Array Capping: The maxlen parameter (default 50M) prevents the expanded simulation array from exceeding available memory, using capLengthAtCountSum() to truncate if necessary
- Thread-Local Heaps: Each processing thread (up to 10 threads) maintains its own SketchHeap to minimize synchronization overhead, with results merged into the shared heap at completion
- Hash-based K-mer Storage: Uses 64-bit hash codes rather than storing actual k-mer sequences, reducing memory footprint by ~4x for k=31
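The hash-based storage idea in the last bullet can be illustrated with the hypothetical sketch below: a kmer of length up to 32 packs into a single 64-bit long at 2 bits per base, and only a mixed 64-bit hash of that value is kept. Neither the packing nor the mix function shown here is BBTools' actual code.

// Illustrative sketch of 2-bit kmer packing and 64-bit hashing.
public class KmerEncodeSketch {
    // Pack k bases (k <= 32) into a long; returns -1 if an ambiguous base is seen.
    static long kmerToLong(byte[] bases, int start, int k) {
        long kmer = 0;
        for (int i = 0; i < k; i++) {
            int code;
            switch (bases[start + i]) {
                case 'A': code = 0; break;
                case 'C': code = 1; break;
                case 'G': code = 2; break;
                case 'T': code = 3; break;
                default: return -1;
            }
            kmer = (kmer << 2) | code;
        }
        return kmer;
    }

    // Generic 64-bit mix (a murmur-style finalizer), not the hash BBTools uses.
    static long hash64(long x) {
        x ^= (x >>> 33); x *= 0xff51afd7ed558ccdL;
        x ^= (x >>> 33); x *= 0xc4ceb9fe1a85ec53L;
        x ^= (x >>> 33);
        return x;
    }
}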
Quality-Aware K-mer Processing
When quality filtering is enabled (minqual > 1 or minprob > 0), KmerLimit2 uses probability-based k-mer validation:
- Probability Calculation: Calculates the probability that each k-mer is correct by multiplying base quality probabilities: prob = ∏(PROB_CORRECT[quality_i]) for all bases in the k-mer
- Sliding Window Efficiency: Uses a sliding window approach where probabilities are updated incrementally by multiplying by the new base probability and dividing by the probability of the base that left the window
- Quality Filtering: Rejects k-mers containing any base with quality below minqual (default 0) or with overall correctness probability below minprob (default 0.2)
- Probability Accumulation: Maintains a running sum of k-mer probabilities (probSum) in addition to counts for downstream quality assessment
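A minimal sketch of the sliding-window update described above, assuming quality scores are already numeric Phred values; the PROB_CORRECT table here is rebuilt from the standard Phred formula and clamped at 0.25 for quality 0, so it may differ slightly from BBTools' internal table.

// Illustrative sliding-window kmer correctness probabilities.
public class KmerProbSketch {
    static final double[] PROB_CORRECT = new double[64];
    static {
        for (int q = 0; q < PROB_CORRECT.length; q++) {
            // Phred: P(error) = 10^(-q/10); clamp so quality 0 never yields exactly 0.
            PROB_CORRECT[q] = Math.max(0.25, 1.0 - Math.pow(10, -q / 10.0));
        }
    }

    // probs[i] = probability that the kmer starting at position i is entirely correct.
    static double[] kmerProbs(byte[] quals, int k) {
        double[] probs = new double[Math.max(0, quals.length - k + 1)];
        double prob = 1.0;
        for (int i = 0; i < quals.length; i++) {
            prob *= PROB_CORRECT[quals[i]];                 // base entering the window
            if (i >= k) prob /= PROB_CORRECT[quals[i - k]]; // base leaving the window
            if (i >= k - 1) probs[i - k + 1] = prob;        // kmer ending at position i
        }
        return probs;
    }
}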
Performance Characteristics
- Time Complexity: Roughly 2n + s·t work, where n is total bases, s is the number of simulation trials (default 25), and t is the unique k-mer count; the factor of two reflects the two complete passes through the input data
- Memory Usage: Peak memory = max(heap_size × 16 bytes, simulation_array_length × 4 bytes) where simulation array is capped at maxlen parameter (50M default)
- Threading Model: Uses up to 10 processing threads per pass with thread-local SketchHeaps synchronized only during final merge operations
- Accuracy Advantage: Monte Carlo simulation with multiple trials provides statistically robust subsampling rate estimates, unlike single-pass heuristic methods
- I/O Efficiency: Second pass leverages ConcurrentReadInputStream's built-in sampling capability (setSampleRate) for uniform random read selection
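As a worked example of the Memory Usage formula above with default settings: the simulation array at its 50M cap occupies roughly 50,000,000 × 4 bytes ≈ 200 MB, while the default 8091-entry heap needs only about 8091 × 16 bytes ≈ 130 KB, so the simulation array typically dominates peak memory.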
Support
For questions and support:
- Email: bbushnell@lbl.gov
- Documentation: bbmap.org