KmerLimit2

Script: kmerlimit2.sh
Package: sketch
Class: KmerLimit2.java

Subsamples reads to reach a target unique kmer limit using a two-pass approach that works with reads in any order.

Basic Usage

kmerlimit2.sh in=<input file> out=<output file> limit=<number>

KmerLimit2 subsamples reads to reach a target number of unique k-mers. Unlike kmerlimit.sh, which makes one pass and requires reads to be in random order, kmerlimit2.sh makes two passes and randomly subsamples from the file, so it works with reads in any order.

Key Differences from KmerLimit

Parameters

Parameters are organized by their function in the kmer limiting process.

Standard parameters

in=<file>
Primary input, or read 1 input.
in2=<file>
Read 2 input if reads are in two files.
out=<file>
Primary output, or read 1 output.
out2=<file>
Read 2 output if reads are in two files.
overwrite=t
(ow) Set to false to force the program to abort rather than overwrite an existing file.
ziplevel=2
(zl) Set to 1 (lowest) through 9 (max) to change compression level; lower compression is faster.

Processing parameters

k=31
Kmer length, 1-32. Default is 31.
limit=
The number of unique kmers to produce. This is a required parameter.
mincount=1
Ignore kmers seen fewer than this many times. Default is 1.
minqual=0
Ignore bases with quality below this. Default is 0 (no quality filtering).
minprob=0.2
Ignore kmers with correctness probability below this. Default is 0.2.
trials=25
Number of simulation trials used to estimate the target read subsample rate. More trials give more accurate estimates but take longer. Default is 25.
seed=-1
Random seed for deterministic output. Set to a positive number for reproducible results. Default is -1 (random seed).
maxlen=50m
Maximum length of a temporary array used in simulation. Limits memory usage during the simulation phase. Default is 50 million.
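
To make k, mincount and limit concrete, the sketch below counts unique k-mers in a few short reads with plain Python. It is an illustration only: the function name count_unique_kmers is hypothetical, k-mers are taken from the forward strand as plain strings, and the counting method used inside KmerLimit2 is not described here and may differ (for example by merging a k-mer with its reverse complement, or by estimating rather than counting exactly).

```python
from collections import Counter

def count_unique_kmers(reads, k=31, mincount=1):
    """Count distinct k-mers seen at least `mincount` times across `reads`."""
    counts = Counter()
    for seq in reads:
        seq = seq.upper()
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            # Skip k-mers containing N or any other non-ACGT symbol.
            if set(kmer) <= set("ACGT"):
                counts[kmer] += 1
    return sum(1 for c in counts.values() if c >= mincount)

reads = ["ACGTACGT", "ACGTTTTT", "NNACGTAC"]
print(count_unique_kmers(reads, k=4))              # 7
print(count_unique_kmers(reads, k=4, mincount=2))  # 4
```

Raising mincount drops k-mers that appear only once, which are often the product of sequencing errors, so the same reads contribute fewer unique k-mers toward limit.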

Java Parameters

-Xmx
This will set Java's memory usage, overriding autodetection. -Xmx20g will specify 20 gigs of RAM, and -Xmx200m will specify 200 megs. The max is typically 85% of physical memory.
-eoom
This flag will cause the process to exit if an out-of-memory exception occurs. Requires Java 8u92+.
-da
Disable assertions.

Examples

Basic Usage

kmerlimit2.sh in=reads.fq out=subsampled.fq limit=1000000

Subsample reads to achieve approximately 1 million unique k-mers (k=31).

Paired-end Reads

kmerlimit2.sh in1=reads_R1.fq in2=reads_R2.fq out1=sub_R1.fq out2=sub_R2.fq limit=500000

Subsample paired-end reads to achieve approximately 500,000 unique k-mers.

Custom K-mer Length and Quality Filtering

kmerlimit2.sh in=reads.fq out=filtered.fq limit=2000000 k=21 minqual=10 minprob=0.5

Use k-mers of length 21, filter out bases with quality below 10, and ignore k-mers with correctness probability below 0.5.

Deterministic Output

kmerlimit2.sh in=reads.fq out=subsampled.fq limit=1000000 seed=12345 trials=50

Use a fixed seed for reproducible results and increase trials for more accurate subsampling rate estimation.
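
When kmerlimit2.sh is called from a pipeline rather than typed by hand, it can help to assemble the key=value arguments in one place. The helper below is a hypothetical convenience wrapper, not part of the tool; it only uses parameter names documented above and assumes kmerlimit2.sh is on the PATH.

```python
import shlex

def build_kmerlimit2_command(in1, out1, limit, in2=None, out2=None, **options):
    """Assemble a kmerlimit2.sh argument list from the documented parameters."""
    if limit <= 0:
        raise ValueError("limit must be a positive number of unique k-mers")
    args = ["kmerlimit2.sh", f"in={in1}", f"out={out1}", f"limit={limit}"]
    if in2:
        args.append(f"in2={in2}")
    if out2:
        args.append(f"out2={out2}")
    # Extra key=value options such as k, minqual, seed or trials pass through unchanged.
    args.extend(f"{key}={value}" for key, value in options.items())
    return args

cmd = build_kmerlimit2_command("reads.fq", "subsampled.fq", 1000000, seed=12345, trials=50)
print(shlex.join(cmd))
```

The resulting list can be handed to subprocess.run(cmd, check=True), which avoids shell quoting problems with unusual file names.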

Algorithm Details

Two-Pass Approach

KmerLimit2 uses a two-pass algorithm that allows it to work with reads in any order, unlike the single-pass approach of kmerlimit.sh:

Pass 1: Kmer Counting and Rate Estimation

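In the first pass, KmerLimit2 counts k-mers in the input and runs simulation trials to estimate the read subsampling rate needed to reach the limit.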

Pass 2: Read Subsampling

In the second pass, KmerLimit2 randomly subsamples reads from the file at the estimated rate and writes them to the output.
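
The two passes can be sketched in a few lines of Python. This is a simplified model, not the KmerLimit2 implementation: the names kmers and two_pass_subsample are hypothetical, reads are held in a list instead of being streamed from a file, and the rate is derived with a crude linear assumption where the tool uses simulation trials.

```python
import random

def kmers(seq, k):
    """Return the set of k-mers in one read."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def two_pass_subsample(reads, limit, k=31, seed=-1):
    """Keep a random subset of reads holding roughly `limit` unique k-mers."""
    # Pass 1: count unique k-mers in the whole input and derive a keep rate.
    total = set()
    for seq in reads:
        total |= kmers(seq, k)
    if not total:
        return []
    # Crude assumption: unique k-mers grow linearly with the number of reads kept.
    rate = min(1.0, limit / len(total))
    # Pass 2: go through the reads again, keeping each with probability `rate`.
    # A negative seed means "pick a random seed", mirroring seed=-1 above.
    rng = random.Random(seed if seed >= 0 else None)
    return [seq for seq in reads if rng.random() < rate]

gen = random.Random(0)
reads = ["".join(gen.choice("ACGT") for _ in range(50)) for _ in range(200)]
a = two_pass_subsample(reads, limit=1000, k=21, seed=7)
b = two_pass_subsample(reads, limit=1000, k=21, seed=7)
print(len(a), a == b)
```

Because each read is kept or dropped independently of its position, the result does not depend on the order of the input, and a fixed seed gives the same subset on every run.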

Monte Carlo Simulation Algorithm

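The simulation phase uses Monte Carlo trials to estimate the read subsampling rate needed to reach the k-mer limit. The trials parameter sets how many simulation runs are performed (default 25); more trials give a more accurate estimate but take longer.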
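
One way to picture a simulation-based rate estimate is a binary search over candidate rates, where each candidate is scored by averaging the unique k-mer count of several random subsamples. The sketch below follows that idea under stated assumptions: estimate_rate, unique_kmers and the steps argument are hypothetical names, and the actual simulation in KmerLimit2 may be organized quite differently.

```python
import random

def unique_kmers(reads, k):
    """Count distinct k-mers across a list of reads."""
    seen = set()
    for seq in reads:
        for i in range(len(seq) - k + 1):
            seen.add(seq[i:i + k])
    return len(seen)

def estimate_rate(reads, limit, k=31, trials=25, seed=0, steps=12):
    """Binary-search the keep rate whose mean simulated unique count is near `limit`."""
    rng = random.Random(seed)
    lo, hi = 0.0, 1.0
    for _ in range(steps):
        rate = (lo + hi) / 2
        total = 0
        for _ in range(trials):
            # One trial: keep each read with probability `rate`, then count uniques.
            sample = [seq for seq in reads if rng.random() < rate]
            total += unique_kmers(sample, k)
        if total / trials > limit:
            hi = rate   # too many unique k-mers on average: lower the rate
        else:
            lo = rate   # too few: raise the rate
    return (lo + hi) / 2

gen = random.Random(1)
reads = ["".join(gen.choice("ACGT") for _ in range(50)) for _ in range(200)]
print(round(estimate_rate(reads, limit=3000, k=21), 2))
```

Averaging over more trials reduces the noise in each comparison, which is the trade-off behind the trials parameter: a steadier estimate in exchange for more computation.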

Memory Management

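Memory use in the simulation phase is bounded by maxlen, which caps the length of the temporary array (default 50 million). The Java heap size can be set explicitly with -Xmx.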

Quality-Aware K-mer Processing

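When quality filtering is enabled (minqual > 1 or minprob > 0), KmerLimit2 uses probability-based k-mer validation: bases with quality below minqual are ignored, and k-mers whose correctness probability is below minprob are skipped.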
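
As an illustration of what a correctness probability can mean, the sketch below uses the standard Phred relationship, where a base with quality Q has error probability 10^(-Q/10), and treats base errors as independent so that a k-mer's probability is the product over its bases. The function names are hypothetical, and the way a low-quality base is handled here (it invalidates every k-mer that covers it) is an assumption, not a description of the tool's internals.

```python
def kmer_correct_probability(quals):
    """Probability that every base in a k-mer is correct, given Phred scores."""
    p = 1.0
    for q in quals:
        p *= 1.0 - 10 ** (-q / 10)
    return p

def passing_kmers(seq, quals, k=31, minqual=0, minprob=0.2):
    """Yield k-mers whose bases all reach `minqual` and whose probability reaches `minprob`."""
    for i in range(len(seq) - k + 1):
        window = quals[i:i + k]
        if min(window) < minqual:
            continue  # assumption: a low-quality base invalidates every k-mer covering it
        if kmer_correct_probability(window) < minprob:
            continue
        yield seq[i:i + k]

seq = "ACGTACGT"
quals = [30, 30, 30, 30, 2, 2, 2, 2]
print(list(passing_kmers(seq, quals, k=4)))              # ['ACGT', 'CGTA']
print(list(passing_kmers(seq, quals, k=4, minqual=10)))  # ['ACGT']
```

With the default minprob=0.2, a 4-mer containing a single Q2 base still passes (probability about 0.37), while one containing two such bases does not (about 0.14).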

Performance Characteristics
