KmerFilterSet

Script: kmerfilterset.sh
Package: jgi
Class: KmerFilterSetMaker.java

Generates a set of kmers such that every input sequence contains at least one kmer in the set. The algorithm is greedy: each pass retains the most common kmers (between minkpp and maxkpp of them) and removes the sequences matching those kmers, so subsequent passes are faster. The output is not guaranteed to be optimally small, but it is quite compact. The file size may be further decreased with kcompress.sh.

Basic Usage

kmerfilterset.sh in=<input file> out=<output file> k=<integer>

Creates a compact kmer filter set that ensures every input sequence contains at least one kmer from the set. Useful for creating filter files for downstream tools.

Parameters

Parameters are organized into functional groups. The tool uses a greedy multi-pass algorithm to build a compact (not necessarily minimal) kmer set that matches all input sequences.

File parameters

in=<file>: Primary input file. Can be FASTA or FASTQ format, compressed or uncompressed.
out=<file>: Primary output file containing the kmer filter set in FASTA format.
temp=<file>: Temporary file pattern (optional). Must contain # symbol which will be replaced with pass number. If not specified, automatically generated temporary files are used. Example: temp=tempfile_#.fq
initial=<file>: Initial kmer set (optional). Pre-existing kmer file that can be used to accelerate the process by providing starting kmers. Newly selected kmers are appended to this set rather than replacing it.
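
The temp= pattern can be pictured with a small sketch. This is only an illustration of the substitution described above, not the tool's code: the function name temp_path is hypothetical, and whether passes are numbered from 0 or 1 is not stated here, so the pass number is simply passed in.

```python
def temp_path(pattern, pass_number):
    """Expand a temp= style pattern by substituting the pass number for '#'.

    Hypothetical helper for illustration; only the first '#' is replaced,
    which is an assumption about how the pattern is interpreted.
    """
    if "#" not in pattern:
        raise ValueError("temp pattern must contain the '#' symbol")
    return pattern.replace("#", str(pass_number), 1)


# One file name per pass, following the example pattern from the text.
names = [temp_path("tempfile_#.fq", p) for p in range(3)]
```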

Processing parameters

k=31: Kmer length. Typical values are 21-31. Longer kmers provide higher specificity but may require more passes. Values of k≤31 use the standard kmer tables; k>31 uses the extended tables.
rcomp=t: Consider forward and reverse-complement kmers identical. Setting to false will distinguish strand-specific kmers, typically doubling the required kmer set size.
minkpp=1: (minkmersperpass) Retain at least this many kmers per pass. Higher values accelerate processing but result in larger final sets. Must be ≥1.
maxkpp=2: (maxkmersperpass) Retain at most this many kmers per pass. Set to 0 for unlimited. Higher values add more kmers in each pass, which reduces the number of passes needed but tends to enlarge the final set.
mincount=1: Ignore kmers seen fewer than this many times in the current pass. Higher values focus on more abundant kmers but may miss important low-frequency sequences.
maxpasses=3000: Maximum number of passes to run. Each pass identifies high-frequency kmers and removes matching sequences. More passes create smaller sets but take longer.
maxns=BIG: Ignore sequences with more than this many Ns (ambiguous nucleotides). Set to a specific integer to filter heavily ambiguous sequences.
minlen=0: Ignore sequences shorter than this length. Useful for filtering very short sequences that may not contain meaningful kmers.
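
To make rcomp, maxns and minlen more concrete, here is a minimal Python sketch of what they describe. It is not the BBTools implementation: the function names are hypothetical, and two details are assumptions made for the sketch (the lexicographically smaller of a kmer and its reverse complement stands in for both strands, and kmer windows containing a non-ACGT base are skipped).

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an ACGT string."""
    return seq.translate(COMPLEMENT)[::-1]


def kmers(seq, k, rcomp=True):
    """Yield the kmers of seq in order.

    With rcomp=True, a kmer and its reverse complement map to the same
    representative (assumed here to be the lexicographically smaller one).
    Windows containing anything other than A, C, G, T are skipped.
    """
    seq = seq.upper()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if set(kmer) - set("ACGT"):
            continue
        if rcomp:
            kmer = min(kmer, reverse_complement(kmer))
        yield kmer


def passes_filters(seq, minlen=0, maxns=None):
    """Mimic minlen and maxns: maxns=None means no limit on Ns."""
    if len(seq) < minlen:
        return False
    if maxns is not None and seq.upper().count("N") > maxns:
        return False
    return True
```

With rcomp=True the three windows of ACGTT collapse to two distinct kmers, while rcomp=False keeps all three distinct, which is why strand-specific sets tend to be larger.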

Java Parameters

-Xmx: This will set Java's memory usage, overriding autodetection. -Xmx20g will specify 20 gigs of RAM, and -Xmx200m will specify 200 megs. The max is typically 85% of physical memory. Large datasets may require substantial memory.
-eoom: This flag will cause the process to exit if an out-of-memory exception occurs. Requires Java 8u92+. Recommended for batch processing to avoid hanging processes.
-da: Disable assertions. May provide minor performance improvement in production use.

Examples

Basic Filter Set Generation

kmerfilterset.sh in=reads.fq out=filter_kmers.fa k=25

Creates a kmer filter set from input reads using 25-mers. The output can be used with other BBTools for sequence filtering.

Optimized for Speed

kmerfilterset.sh in=large_dataset.fq out=quick_filter.fa k=31 maxkpp=10 minkpp=5

Generates a filter set more quickly by retaining 5-10 kmers per pass, resulting in a slightly larger but faster-to-generate set.

High-Quality Compact Set

kmerfilterset.sh in=sequences.fa out=compact_filter.fa k=27 mincount=3 maxpasses=5000

Creates a very compact filter set by requiring kmers to appear at least 3 times and allowing up to 5000 passes. Because kmers below mincount are ignored, sequences made up only of rarer kmers may be missed.

Using Initial Kmer Set

kmerfilterset.sh in=new_reads.fq out=combined_filter.fa initial=existing_filter.fa k=31

Extends an existing filter set with new sequences, useful for incremental updates to filter databases.

Algorithm Details

Greedy Multi-Pass Strategy

KmerFilterSetMaker implements a greedy multi-pass algorithm that creates compact kmer sets for sequence filtering. It uses fillHistogram() for frequency analysis and dumpKmersAsBytes_MT() for kmer retention.

Pass-Based Processing

The algorithm operates in multiple passes, each consisting of:

Kmer Counting: Uses KmerTableSet (k≤31) or KmerTableSetU (k>31) for kmer enumeration and counting
Frequency Analysis: Builds histograms to identify the most abundant kmers in the current pass
Selective Retention: Keeps only kmers above the minimum count threshold, retaining between minkpp and maxkpp kmers per pass
Sequence Filtering: Uses BBDuk.main() internally to remove sequences containing the selected kmers, reducing the dataset for subsequent passes
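
The four steps above can be sketched in a few lines of Python. This is a simplified illustration of the greedy idea only, not the KmerFilterSetMaker code: everything is held in memory, strands are treated separately (as with rcomp=f), minkpp is omitted, a kmer's count is the number of remaining sequences containing it, and ties are broken alphabetically. All function names are hypothetical.

```python
from collections import Counter


def seq_kmers(seq, k):
    """Return the set of distinct kmers in seq (forward strand only)."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def greedy_filter_set(seqs, k, maxkpp=2, mincount=1, maxpasses=3000):
    """Greedy multi-pass selection of kmers covering the input sequences."""
    # Sequences shorter than k contain no kmers and can never be covered.
    remaining = [ks for ks in (seq_kmers(s, k) for s in seqs) if ks]
    chosen = []
    for _ in range(maxpasses):
        if not remaining:
            break
        # Kmer counting over the sequences still unmatched.
        counts = Counter()
        for ks in remaining:
            counts.update(ks)
        # Frequency analysis: most common first, ties broken alphabetically.
        candidates = sorted(
            (km for km, c in counts.items() if c >= mincount),
            key=lambda km: (-counts[km], km),
        )
        if not candidates:
            break
        # Selective retention: keep at most maxkpp kmers (0 means unlimited).
        limit = maxkpp if maxkpp > 0 else len(candidates)
        picked = set(candidates[:limit])
        chosen.extend(candidates[:limit])
        # Sequence filtering: drop every sequence containing a picked kmer.
        remaining = [ks for ks in remaining if not (ks & picked)]
    return chosen


seqs = ["ACGTA", "ACGTC", "TTTGG"]
result = greedy_filter_set(seqs, k=4, maxkpp=1)
```

In this toy input the first pass picks ACGT, which is shared by two sequences, so only one sequence is left for the second pass. That shrinking working set is what makes later passes cheaper.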

Adaptive Threshold Selection

Each pass dynamically determines the minimum count threshold to retain the target number of kmers:

Analyzes kmer frequency histogram to find optimal cutoff points
Ensures at least minkpp kmers are retained (unless fewer exist)
Caps retention at maxkpp kmers to prevent excessive growth
Stops processing when too few kmers meeting the count threshold remain
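
One way to picture the threshold selection is the sketch below. It assumes the histogram maps each count to the number of distinct kmers with that count, and it walks down from the highest count until at least minkpp kmers are included. The names and the exact tie handling are assumptions for illustration, not the tool's logic.

```python
from collections import Counter


def choose_threshold(hist, minkpp=1, mincount=1):
    """Return the highest count cutoff that still admits minkpp kmers.

    hist maps count -> number of distinct kmers seen that many times.
    If fewer than minkpp kmers reach mincount, the lowest usable count is
    returned; if none reach it, the result is None.
    """
    cumulative = 0
    threshold = None
    for count in sorted(hist, reverse=True):
        if count < mincount:
            break
        cumulative += hist[count]
        threshold = count
        if cumulative >= minkpp:
            break
    return threshold


def select_kmers(counts, minkpp=1, maxkpp=2, mincount=1):
    """Pick the kmers for one pass from a kmer -> count mapping."""
    threshold = choose_threshold(Counter(counts.values()), minkpp, mincount)
    if threshold is None:
        return []  # nothing usable remains, so processing would stop
    eligible = sorted(
        (km for km, c in counts.items() if c >= threshold),
        key=lambda km: (-counts[km], km),
    )
    limit = maxkpp if maxkpp > 0 else len(eligible)  # 0 means unlimited
    return eligible[:limit]


counts = {"AAA": 5, "CCC": 5, "GGG": 3, "TTT": 1}
```

With these counts, asking for at least one kmer gives a cutoff of 5, while asking for at least three lowers the cutoff to 3.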

Memory Management

The implementation includes several memory optimization strategies:

Automatic Table Selection: Chooses between standard (≤31-mer) and extended (>31-mer) kmer tables based on k value
Progressive Filtering: Each pass reduces the working dataset size, decreasing memory requirements for subsequent passes
Temporary File Management: Uses disk-based temporary files to handle datasets larger than available RAM
Multi-threading: Leverages available CPU cores for kmer counting and sequence processing

Output Characteristics

The resulting filter set has specific properties that make it suitable for downstream applications:

Coverage Guarantee: Every input sequence contains at least one kmer from the final set
Compact Size: While not mathematically optimal, the greedy approach produces significantly smaller sets than naive approaches
High-Frequency Bias: Preferentially includes kmers that appear in multiple sequences, improving filter performance
Strand Consistency: When rcomp=true, maintains consistent representation of reverse-complement kmer pairs
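
The coverage guarantee can be checked independently of the tool. The sketch below assumes the sequences and the filter kmers have already been loaded into plain Python strings (FASTA/FASTQ parsing is left out), and the function name uncovered is hypothetical. It returns every sequence that contains none of the kmers, so an empty result means full coverage.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an ACGT string."""
    return seq.translate(COMPLEMENT)[::-1]


def uncovered(seqs, kmer_set, k, rcomp=True):
    """Return the sequences that contain no kmer from kmer_set.

    With rcomp=True a window also counts as a match when its reverse
    complement is in the set. Sequences shorter than k have no windows
    and are therefore always reported.
    """
    missing = []
    for seq in seqs:
        s = seq.upper()
        hit = False
        for i in range(len(s) - k + 1):
            window = s[i:i + k]
            if window in kmer_set or (
                rcomp and reverse_complement(window) in kmer_set
            ):
                hit = True
                break
        if not hit:
            missing.append(seq)
    return missing
```

For example, a set containing only CCAA covers TTTGG when rcomp is true, because the window TTGG is the reverse complement of CCAA, but not when rcomp is false.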

Performance Considerations

Processing time and memory usage scale with:

Input Size: Larger inputs require more passes and memory
Sequence Diversity: Highly diverse sequences need more passes to achieve full coverage
Kmer Length: Longer kmers generally require more passes but produce more specific filters
Parameter Settings: Lower minkpp values produce smaller sets but require more passes

Integration with BBTools Ecosystem

The generated filter sets are designed for seamless integration with other BBTools:

BBDuk: Primary use case for sequence filtering and contamination removal
BBSplit: Can use filter sets for taxonomic or functional sequence binning
KCompress: Further reduces filter set size while maintaining effectiveness
Seal: Uses filter sets for rapid sequence matching and classification

Support

For questions and support:

Email: bbushnell@lbl.gov
Documentation: bbmap.org