KmerFilterSet

Script: kmerfilterset.sh Package: jgi Class: KmerFilterSetMaker.java

Generates a set of kmers such that every input sequence will contain at least one kmer in the set. This is a greedy algorithm that retains the top X most common kmers each pass and removes the sequences matching those kmers, so subsequent passes are faster. The output is not guaranteed to be the smallest possible set, but it is quite compact. The file size may be further decreased with kcompress.sh.

Basic Usage

kmerfilterset.sh in=<input file> out=<output file> k=<integer>

Creates a compact kmer filter set that ensures every input sequence contains at least one kmer from the set. Useful for creating filter files for downstream tools.
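The guarantee stated above (every input sequence contains at least one kmer from the set) can be checked independently of the tool. The following is a minimal Python sketch, not part of BBTools; the function names are hypothetical, it compares forward-strand kmers only, and it assumes the sequences and kmers are already loaded as plain uppercase strings.

```python
def kmers(seq, k):
    """Yield every k-length substring of seq."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def uncovered(sequences, filter_set, k):
    """Return the sequences that share no kmer with filter_set."""
    return [s for s in sequences
            if not any(km in filter_set for km in kmers(s, k))]


seqs = ["ACGTAC", "GTACGG", "TTTTTT"]
print(uncovered(seqs, {"GTA", "TTT"}, 3))  # []
print(uncovered(seqs, {"GTA"}, 3))         # ['TTTTTT']
```

An empty result means the set covers every sequence. With rcomp=t, a real check would also need to compare reverse complements.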

Parameters

Parameters are organized into functional groups. The tool uses a greedy multi-pass algorithm to identify a compact (not necessarily minimal) kmer set that matches all input sequences.

File parameters

in=<file>
Primary input file. Can be FASTA or FASTQ format, compressed or uncompressed.
out=<file>
Primary output file containing the kmer filter set in FASTA format.
temp=<file>
Temporary file pattern (optional). Must contain # symbol which will be replaced with pass number. If not specified, automatically generated temporary files are used. Example: temp=tempfile_#.fq
initial=<file>
Initial kmer set (optional). Pre-existing kmer file that can be used to accelerate the process by providing starting kmers. New kmers are appended to this set rather than replacing it.
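To illustrate the temp= pattern described above, here is a small Python sketch of how a # placeholder could be expanded into one file name per pass. The helper name is hypothetical, and whether the tool numbers passes from 0 or 1 is not stated here, so the pass number used is only an example.

```python
def temp_name(pattern, pass_number):
    """Substitute the pass number for the '#' placeholder in a temp pattern."""
    if "#" not in pattern:
        raise ValueError("temp pattern must contain '#'")
    return pattern.replace("#", str(pass_number))


print(temp_name("tempfile_#.fq", 3))  # tempfile_3.fq
```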

Processing parameters

k=31
Kmer length. Typical values are 21-31. Longer kmers provide higher specificity but may require more passes. Must be ≤31 for standard kmer tables or >31 for extended tables.
rcomp=t
Consider forward and reverse-complement kmers identical. Setting to false will distinguish strand-specific kmers, typically doubling the required kmer set size.
minkpp=1
(minkmersperpass) Retain at least this many kmers per pass. Higher values accelerate processing but result in larger final sets. Must be ≥1.
maxkpp=2
(maxkmersperpass) Retain at most this many kmers per pass. Set to 0 for unlimited. Higher values add more kmers per pass, which reduces the number of passes needed but tends to enlarge the final set.
mincount=1
Ignore kmers seen fewer than this many times in the current pass. Higher values focus on more abundant kmers, but sequences whose kmers all fall below the threshold may be left without a matching kmer.
maxpasses=3000
Maximum number of passes to run. Each pass identifies high-frequency kmers and removes matching sequences. More passes create smaller sets but take longer.
maxns=BIG
Ignore sequences with more than this many Ns (ambiguous nucleotides). Set to a specific integer to filter heavily ambiguous sequences.
minlen=0
Ignore sequences shorter than this length. Useful for filtering very short sequences that may not contain meaningful kmers.
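The effect of rcomp=t can be pictured as mapping each kmer and its reverse complement to a single representative before counting. The Python sketch below is an illustration under assumptions: it picks the lexicographically smaller string as the representative, which may differ from the rule BBTools applies internally, and it handles only the bases A, C, G and T.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(kmer):
    """Complement each base, then reverse the string."""
    return kmer.translate(_COMPLEMENT)[::-1]


def canonical(kmer):
    """Pick one representative for a kmer and its reverse complement."""
    return min(kmer, reverse_complement(kmer))


print(canonical("AAC"), canonical("GTT"))  # AAC AAC
```

Because both strands collapse to one entry, a strand-specific run (rcomp=f) has to store the two orientations separately, which is why the set tends to grow.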

Java Parameters

-Xmx
This will set Java's memory usage, overriding autodetection. -Xmx20g will specify 20 gigs of RAM, and -Xmx200m will specify 200 megs. The max is typically 85% of physical memory. Large datasets may require substantial memory.
-eoom
This flag will cause the process to exit if an out-of-memory exception occurs. Requires Java 8u92+. Recommended for batch processing to avoid hanging processes.
-da
Disable assertions. May provide minor performance improvement in production use.

Examples

Basic Filter Set Generation

kmerfilterset.sh in=reads.fq out=filter_kmers.fa k=25

Creates a kmer filter set from input reads using 25-mers. The output can be used with other BBTools for sequence filtering.

Optimized for Speed

kmerfilterset.sh in=large_dataset.fq out=quick_filter.fa k=31 maxkpp=10 minkpp=5

Generates a filter set more quickly by retaining 5-10 kmers per pass, resulting in a slightly larger but faster-to-generate set.

High-Quality Compact Set

kmerfilterset.sh in=sequences.fa out=compact_filter.fa k=27 mincount=3 maxpasses=5000

Creates a very compact filter set by requiring kmers to appear at least 3 times and allowing up to 5000 passes for maximum compression.

Using Initial Kmer Set

kmerfilterset.sh in=new_reads.fq out=combined_filter.fa initial=existing_filter.fa k=31

Extends an existing filter set with new sequences, useful for incremental updates to filter databases.

Algorithm Details

Greedy Multi-Pass Strategy

KmerFilterSetMaker implements a greedy multi-pass algorithm using fillHistogram() for frequency analysis and dumpKmersAsBytes_MT() for kmer retention to create compact kmer sets for sequence filtering:

Pass-Based Processing

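The algorithm operates in multiple passes. As described in the overview, each pass counts the kmers in the remaining sequences, retains the most common ones (bounded by minkpp and maxkpp, and ignoring kmers below mincount), and removes the sequences that contain a retained kmer, so later passes process less data. This repeats for at most maxpasses passes.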
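The greedy loop described in the overview (count kmers, keep the most common, drop the sequences they match) can be sketched in Python as follows. This is a simplified in-memory illustration, not the KmerFilterSetMaker implementation: the function names are hypothetical, each kmer is counted once per sequence (an assumption), minkpp and reverse complements are ignored, and ties between equally common kmers are broken arbitrarily.

```python
from collections import Counter


def kmers_of(seq, k):
    """Return the set of distinct kmers in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def greedy_filter_set(sequences, k, maxkpp=2, mincount=1, maxpasses=3000):
    """Keep the most common kmers each pass and drop the sequences they match."""
    remaining = [s for s in sequences if len(s) >= k]
    chosen = []
    limit = maxkpp if maxkpp > 0 else None  # 0 means unlimited
    for _ in range(maxpasses):
        if not remaining:
            break
        # Count how many remaining sequences contain each kmer.
        counts = Counter()
        for seq in remaining:
            counts.update(kmers_of(seq, k))
        top = [km for km, c in counts.most_common(limit) if c >= mincount]
        if not top:
            break
        chosen.extend(top)
        picked = set(top)
        # Sequences matched by a retained kmer are not revisited.
        remaining = [s for s in remaining if not (kmers_of(s, k) & picked)]
    return chosen


seqs = ["ACGTAC", "GTACGG", "TTTTTT", "CCTTTA"]
print(greedy_filter_set(seqs, 3, maxkpp=1))  # two kmers suffice for these four sequences
```

With maxkpp=1 the example needs two passes: one kmer shared by the first two sequences and TTT for the last two. Raising maxkpp finishes in fewer passes but can retain kmers that a later pass would have made unnecessary.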

Adaptive Threshold Selection

Each pass dynamically determines the minimum count threshold to retain the target number of kmers:
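One way to picture this threshold selection is a walk down a count histogram from the most frequent bucket, accumulating kmers until the per-pass budget is used up. The Python sketch below is an assumption about how such a step could work, not the tool's actual logic: the function name and the histogram layout (hist[c] is the number of distinct kmers seen exactly c times) are hypothetical.

```python
def pick_threshold(hist, minkpp=1, maxkpp=2, mincount=1):
    """Return (threshold, retained) for one pass.

    hist[c] is the number of distinct kmers seen exactly c times.
    Kmers with count >= threshold would be retained.
    """
    retained = 0
    threshold = len(hist)  # above every real count, so nothing is retained yet
    for count in range(len(hist) - 1, mincount - 1, -1):
        bucket = hist[count]
        if bucket == 0:
            continue
        # Stop once the minimum is met and this bucket would overshoot the maximum.
        if retained >= minkpp and maxkpp > 0 and retained + bucket > maxkpp:
            break
        retained += bucket
        threshold = count
    return threshold, retained


# One kmer seen 5 times, one seen 3 times, six seen twice, forty seen once.
hist = [0, 40, 6, 1, 0, 1]
print(pick_threshold(hist))                       # (3, 2)
print(pick_threshold(hist, minkpp=5, maxkpp=10))  # (2, 8)
```

In this sketch a whole bucket is taken or skipped, so when many kmers share a count the retained number can exceed maxkpp while minkpp is still unmet; a real implementation would need some rule for trimming such ties.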

Memory Management

The implementation includes several memory optimization strategies:

Output Characteristics

The resulting filter set has specific properties that make it suitable for downstream applications:

Performance Considerations

Processing time and memory usage scale with:

Integration with BBTools Ecosystem

The generated filter sets are designed for seamless integration with other BBTools:

Support

For questions and support: