KmerLimit2
Subsamples reads to reach a target unique kmer limit using a two-pass approach that works with reads in any order.
Basic Usage
kmerlimit2.sh in=<input file> out=<output file> limit=<number>
KmerLimit2 subsamples reads to reach a target number of unique k-mers. Unlike kmerlimit.sh, which uses 1 pass and requires reads to be in random order, kmerlimit2.sh uses 2 passes and randomly subsamples from the file, so it works with reads in any order.
Key Differences from KmerLimit
- kmerlimit.sh: Uses 1 pass and outputs all reads until a limit is hit, meaning the input reads should be in random order with respect to sequence.
- kmerlimit2.sh: Uses 2 passes and randomly subsamples from the file, so it works with reads in any order.
Parameters
Parameters are organized by their function in the kmer limiting process.
Standard parameters
- in=<file>
- Primary input, or read 1 input.
- in2=<file>
- Read 2 input if reads are in two files.
- out=<file>
- Primary output, or read 1 output.
- out2=<file>
- Read 2 output if reads are in two files.
- overwrite=t
- (ow) Set to false to force the program to abort rather than overwrite an existing file.
- ziplevel=2
- (zl) Set to 1 (lowest) through 9 (max) to change compression level; lower compression is faster.
Processing parameters
- k=31
- Kmer length, 1-32. Default is 31.
- limit=
- The number of unique kmers to produce. This is a required parameter.
- mincount=1
- Ignore kmers seen fewer than this many times. Default is 1.
- minqual=0
- Ignore bases with quality below this. Default is 0 (no quality filtering).
- minprob=0.2
- Ignore kmers with correctness probability below this. Default is 0.2.
- trials=25
- Number of simulation trials used to estimate the target read subsample rate. More trials give more accurate estimates but take longer. Default is 25.
- seed=-1
- Random seed for deterministic output. Set to a positive number for reproducible results. Default is -1 (random seed).
- maxlen=50m
- Maximum length of a temporary array used in simulation. Limits memory usage during the simulation phase. Default is 50 million.
Java Parameters
- -Xmx
- This will set Java's memory usage, overriding autodetection. -Xmx20g will specify 20 gigs of RAM, and -Xmx200m will specify 200 megs. The max is typically 85% of physical memory.
- -eoom
- This flag will cause the process to exit if an out-of-memory exception occurs. Requires Java 8u92+.
- -da
- Disable assertions.
Examples
Basic Usage
kmerlimit2.sh in=reads.fq out=subsampled.fq limit=1000000
Subsample reads to achieve approximately 1 million unique k-mers (k=31).
Paired-end Reads
kmerlimit2.sh in1=reads_R1.fq in2=reads_R2.fq out1=sub_R1.fq out2=sub_R2.fq limit=500000
Subsample paired-end reads to achieve approximately 500,000 unique k-mers.
Custom K-mer Length and Quality Filtering
kmerlimit2.sh in=reads.fq out=filtered.fq limit=2000000 k=21 minqual=10 minprob=0.5
Use k-mers of length 21, filter out bases with quality below 10, and ignore k-mers with correctness probability below 0.5.
Deterministic Output
kmerlimit2.sh in=reads.fq out=subsampled.fq limit=1000000 seed=12345 trials=50
Use a fixed seed for reproducible results and increase trials for more accurate subsampling rate estimation.
Algorithm Details
Two-Pass Approach
KmerLimit2 uses a two-pass algorithm that allows it to work with reads in any order, unlike the single-pass approach of kmerlimit.sh:
Pass 1: Kmer Counting and Rate Estimation
In the first pass, KmerLimit2:
- Processes all input reads to build a sketch of unique k-mers using a SketchHeap data structure
- Counts k-mer frequencies while respecting quality and probability filters
- Estimates the total number of unique k-mers in the dataset
- Runs multiple simulation trials to determine the optimal subsampling rate needed to achieve the target k-mer count
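The pass-1 logic can be illustrated with a minimal, self-contained sketch. This is not the BBTools implementation: a plain HashMap stands in for the bounded SketchHeap, kmers are kept as strings rather than hashed 2-bit encodings, and the quality/probability filters described later are omitted.

import java.util.HashMap;
import java.util.Map;

// Illustrative pass-1 sketch: tally kmer occurrences across all reads.
public class Pass1Sketch {
    static Map<String, Integer> countKmers(Iterable<String> reads, int k) {
        Map<String, Integer> counts = new HashMap<>();
        for (String seq : reads) {
            for (int i = 0; i + k <= seq.length(); i++) {
                String kmer = seq.substring(i, i + k);
                if (kmer.indexOf('N') >= 0) continue; // skip kmers containing ambiguous bases
                counts.merge(kmer, 1, Integer::sum);
            }
        }
        return counts; // counts.size() estimates the number of unique kmers
    }
}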
Pass 2: Read Subsampling
In the second pass, KmerLimit2:
- Re-reads the input file and applies the calculated subsampling rate
- Uses the ConcurrentReadInputStream's built-in sampling capability for uniform random sampling
- Outputs the subsampled reads while tracking actual k-mer diversity achieved
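Conceptually, the second pass reduces to keeping each read (or read pair) with probability equal to the estimated rate. Below is a minimal sketch under that assumption; the real tool delegates the sampling to ConcurrentReadInputStream rather than looping explicitly, and the class and method names here are invented.

import java.util.Random;
import java.util.function.Consumer;

// Illustrative pass-2 sketch: uniform random subsampling at a fixed rate.
public class Pass2Sketch {
    static void subsample(Iterable<String[]> readPairs, double rate, long seed,
                          Consumer<String[]> writer) {
        Random rng = (seed < 0 ? new Random() : new Random(seed)); // seed=-1 -> nondeterministic
        for (String[] pair : readPairs) {
            if (rng.nextDouble() < rate) { // each read/pair decided independently, so order does not matter
                writer.accept(pair);       // both mates are kept or dropped together
            }
        }
    }
}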
Monte Carlo Simulation Algorithm
The simulation phase uses a Monte Carlo approach to estimate the subsampling rate needed to reach the target unique k-mer count (a sketch follows this list):
- Expanded Array Method: Creates a temporary integer array where each k-mer is represented once for each time it was observed, with array length equal to the sum of all k-mer counts
- IntMap-based Tracking: Uses an IntMap data structure to efficiently track current k-mer counts during simulation, avoiding array copying overhead
- Random Reduction Simulation: Performs multiple trials (default 25) where k-mer instances are randomly removed until target k-mer diversity is reached
- Statistical Averaging: Calculates the fraction of reads to retain as: (1 - averageRoundsRemoved/totalKmerCounts) where averageRoundsRemoved is computed across all simulation trials
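The simulation can be sketched as follows. This is illustrative rather than the actual BBTools code: cloned arrays stand in for the IntMap, the maxlen cap on the expanded array is noted but not applied, and all names are invented.

import java.util.Random;

// Illustrative Monte Carlo sketch: estimate the read fraction to retain.
public class RateSimSketch {
    // counts[i] = observed count of unique kmer i (from pass 1, each >= 1);
    // target = desired number of unique kmers; trials = simulation trials (default 25).
    static double estimateRate(int[] counts, long target, int trials, long seed) {
        long sum = 0;
        for (int c : counts) sum += c;
        int total = (int) sum; // the real tool caps this length at maxlen
        // Expanded array: each kmer id appears once per observation.
        int[] expanded = new int[total];
        for (int id = 0, pos = 0; id < counts.length; id++) {
            for (int j = 0; j < counts[id]; j++) expanded[pos++] = id;
        }
        Random rng = new Random(seed);
        double removedSum = 0;
        for (int t = 0; t < trials; t++) {
            int[] current = counts.clone(); // stands in for the IntMap of live counts
            int[] pool = expanded.clone();
            int remaining = total;
            long distinct = counts.length;
            while (distinct > target && remaining > 0) {
                int idx = rng.nextInt(remaining);   // pick a random kmer instance
                int id = pool[idx];
                pool[idx] = pool[--remaining];      // swap-remove it from the pool
                if (--current[id] == 0) distinct--; // that kmer is no longer observed
            }
            removedSum += (total - remaining);      // rounds removed this trial
        }
        double avgRemoved = removedSum / trials;
        return 1.0 - avgRemoved / total;            // fraction of reads to retain
    }
}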
Memory Management
KmerLimit2 implements several memory optimization strategies:
- SketchHeap: Uses a fixed-size heap (default 8091 entries, or 32000 when mincount > 1) to store the most frequent k-mers, preventing memory exhaustion on large datasets
- Adaptive Heap Sizing: Automatically adjusts heap size based on the mincount parameter, increasing to 32000 entries when mincount > 1 to accommodate the reduced k-mer diversity
- Simulation Array Capping: The maxlen parameter (default 50M) prevents the expanded simulation array from exceeding available memory, using capLengthAtCountSum() to truncate if necessary
- Thread-Local Heaps: Each processing thread (up to 10 threads) maintains its own SketchHeap to minimize synchronization overhead, with results merged into the shared heap at completion
- Hash-based K-mer Storage: Uses 64-bit hash codes rather than storing actual k-mer sequences, reducing memory footprint by ~4x for k=31
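The hash-based storage idea in the last bullet can be illustrated with the hypothetical sketch below: a kmer of length up to 32 packs into a single 64-bit long at 2 bits per base, and only a mixed 64-bit hash of that value is kept. Neither the packing nor the mix function shown here is BBTools' actual code.

// Illustrative sketch of 2-bit kmer packing and 64-bit hashing.
public class KmerEncodeSketch {
    // Pack k bases (k <= 32) into a long; returns -1 if an ambiguous base is seen.
    static long kmerToLong(byte[] bases, int start, int k) {
        long kmer = 0;
        for (int i = 0; i < k; i++) {
            int code;
            switch (bases[start + i]) {
                case 'A': code = 0; break;
                case 'C': code = 1; break;
                case 'G': code = 2; break;
                case 'T': code = 3; break;
                default: return -1;
            }
            kmer = (kmer << 2) | code;
        }
        return kmer;
    }

    // Generic 64-bit mix (a murmur-style finalizer), not the hash BBTools uses.
    static long hash64(long x) {
        x ^= (x >>> 33); x *= 0xff51afd7ed558ccdL;
        x ^= (x >>> 33); x *= 0xc4ceb9fe1a85ec53L;
        x ^= (x >>> 33);
        return x;
    }
}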
Quality-Aware K-mer Processing
When quality filtering is enabled (minqual > 1 or minprob > 0), KmerLimit2 uses probability-based k-mer validation:
- Probability Calculation: Calculates the probability that each k-mer is correct by multiplying base quality probabilities: prob = ∏(PROB_CORRECT[quality_i]) for all bases in the k-mer
- Sliding Window Efficiency: Uses a sliding window approach where probabilities are updated incrementally by multiplying by the new base probability and dividing by the probability of the base that left the window
- Quality Filtering: Rejects k-mers containing any base with quality below minqual (default 0) or with overall correctness probability below minprob (default 0.2)
- Probability Accumulation: Maintains a running sum of k-mer probabilities (probSum) in addition to counts for downstream quality assessment
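A minimal sketch of the sliding-window update described above, assuming quality scores are already numeric Phred values; the PROB_CORRECT table here is rebuilt from the standard Phred formula and clamped at 0.25 for quality 0, so it may differ slightly from BBTools' internal table.

// Illustrative sliding-window kmer correctness probabilities.
public class KmerProbSketch {
    static final double[] PROB_CORRECT = new double[64];
    static {
        for (int q = 0; q < PROB_CORRECT.length; q++) {
            // Phred: P(error) = 10^(-q/10); clamp so quality 0 never yields exactly 0.
            PROB_CORRECT[q] = Math.max(0.25, 1.0 - Math.pow(10, -q / 10.0));
        }
    }

    // probs[i] = probability that the kmer starting at position i is entirely correct.
    static double[] kmerProbs(byte[] quals, int k) {
        double[] probs = new double[Math.max(0, quals.length - k + 1)];
        double prob = 1.0;
        for (int i = 0; i < quals.length; i++) {
            prob *= PROB_CORRECT[quals[i]];                 // base entering the window
            if (i >= k) prob /= PROB_CORRECT[quals[i - k]]; // base leaving the window
            if (i >= k - 1) probs[i - k + 1] = prob;        // kmer ending at position i
        }
        return probs;
    }
}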
Performance Characteristics
- Time Complexity: Roughly 2n + s·t work, where n is total bases, s is the number of simulation trials (default 25), and t is the unique k-mer count; the factor of two reflects the two complete passes through the input data
- Memory Usage: Peak memory = max(heap_size × 16 bytes, simulation_array_length × 4 bytes) where simulation array is capped at maxlen parameter (50M default)
- Threading Model: Uses up to 10 processing threads per pass with thread-local SketchHeaps synchronized only during final merge operations
- Accuracy Advantage: Monte Carlo simulation with multiple trials provides statistically robust subsampling rate estimates, unlike single-pass heuristic methods
- I/O Efficiency: Second pass leverages ConcurrentReadInputStream's built-in sampling capability (setSampleRate) for uniform random read selection
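As a worked example of the Memory Usage formula above with default settings: the simulation array at its 50M cap occupies roughly 50,000,000 × 4 bytes ≈ 200 MB, while the default 8091-entry heap needs only about 8091 × 16 bytes ≈ 130 KB, so the simulation array typically dominates peak memory.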
Support
For questions and support:
- Email: bbushnell@lbl.gov
- Documentation: bbmap.org