KmerFilterSet
Generates a set of kmers such that every input sequence contains at least one kmer in the set. The algorithm is greedy: each pass retains the top X most common kmers (X is bounded by minkpp and maxkpp) and removes the sequences matching those kmers, so subsequent passes are faster. The output is not optimally small, but it is quite compact. The file size may be further decreased with kcompress.sh.
Basic Usage
kmerfilterset.sh in=<input file> out=<output file> k=<integer>
Creates a compact kmer filter set that ensures every input sequence contains at least one kmer from the set. Useful for creating filter files for downstream tools.
Parameters
Parameters are organized into functional groups. The tool uses a greedy multi-pass algorithm to identify a compact (not necessarily minimal) kmer set that matches all input sequences.
File parameters
- in=<file>
- Primary input file. Can be FASTA or FASTQ format, compressed or uncompressed.
- out=<file>
- Primary output file containing the kmer filter set in FASTA format.
- temp=<file>
- Temporary file pattern (optional). Must contain # symbol which will be replaced with pass number. If not specified, automatically generated temporary files are used. Example: temp=tempfile_#.fq
- initial=<file>
- Initial kmer set (optional). A pre-existing kmer file that provides starting kmers and can accelerate the process. Newly selected kmers are appended to this set rather than replacing it.
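The temp= pattern described above can be illustrated with a minimal Python sketch. The function name temp_path is hypothetical, and whether the tool numbers passes from 0 or 1, or replaces more than one # symbol, is not stated here, so treat those details as assumptions.

```python
def temp_path(pattern: str, pass_number: int) -> str:
    """Expand a temp= style pattern by substituting the pass number for '#'."""
    if "#" not in pattern:
        raise ValueError("temp pattern must contain the # symbol")
    # Assumption: every '#' in the pattern is replaced.
    return pattern.replace("#", str(pass_number))

print(temp_path("tempfile_#.fq", 3))  # tempfile_3.fq
```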
Processing parameters
- k=31
- Kmer length. Typical values are 21-31. Longer kmers provide higher specificity but may require more passes. Values of k up to 31 use the standard kmer tables; values above 31 use the extended tables.
- rcomp=t
- Consider forward and reverse-complement kmers identical. Setting to false will distinguish strand-specific kmers, typically doubling the required kmer set size.
- minkpp=1
- (minkmersperpass) Retain at least this many kmers per pass. Higher values accelerate processing but result in larger final sets. Must be ≥1.
- maxkpp=2
- (maxkmersperpass) Retain at most this many kmers per pass. Set to 0 for unlimited. Higher values add more kmers per pass, which reduces the number of passes but enlarges the final set.
- mincount=1
- Ignore kmers seen fewer than this many times in the current pass. Higher values focus on more abundant kmers, but sequences that contain only rarer kmers may go unmatched.
- maxpasses=3000
- Maximum number of passes to run. Each pass identifies high-frequency kmers and removes matching sequences. This is an upper bound: a limit that is too low may stop the run before every sequence is matched.
- maxns=BIG
- Ignore sequences with more than this many Ns (ambiguous nucleotides). Set to a specific integer to filter heavily ambiguous sequences.
- minlen=0
- Ignore sequences shorter than this length. Useful for filtering very short sequences that may not contain meaningful kmers.
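The effect of rcomp, maxns, and minlen can be sketched in a few lines of Python. This is an illustration only: the helper names are hypothetical, and using the lexicographically smaller string as the shared representative of a kmer and its reverse complement is an assumption, not necessarily the convention the tool uses internally.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(kmer: str) -> str:
    """Complement each base, then reverse the string."""
    return kmer.translate(_COMPLEMENT)[::-1]

def canonical(kmer: str, rcomp: bool = True) -> str:
    """With rcomp on, a kmer and its reverse complement share one key."""
    if not rcomp:
        return kmer
    # Assumption: the lexicographic minimum serves as the representative.
    return min(kmer, reverse_complement(kmer))

def keep_sequence(seq: str, minlen: int = 0, maxns=None) -> bool:
    """Mimic minlen and maxns: drop short or heavily ambiguous sequences."""
    if len(seq) < minlen:
        return False
    if maxns is not None and seq.upper().count("N") > maxns:
        return False
    return True

print(canonical("GTTT"), canonical("AAAC"))  # AAAC AAAC
```

With rcomp=False the two kmers stay distinct, which is why a strand-specific set tends to need more kmers.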
Java Parameters
- -Xmx
- This will set Java's memory usage, overriding autodetection. -Xmx20g will specify 20 gigs of RAM, and -Xmx200m will specify 200 megs. The max is typically 85% of physical memory. Large datasets may require substantial memory.
- -eoom
- This flag will cause the process to exit if an out-of-memory exception occurs. Requires Java 8u92+. Recommended for batch processing to avoid hanging processes.
- -da
- Disable assertions. May provide minor performance improvement in production use.
Examples
Basic Filter Set Generation
kmerfilterset.sh in=reads.fq out=filter_kmers.fa k=25
Creates a kmer filter set from input reads using 25-mers. The output can be used with other BBTools for sequence filtering.
Optimized for Speed
kmerfilterset.sh in=large_dataset.fq out=quick_filter.fa k=31 maxkpp=10 minkpp=5
Generates a filter set more quickly by retaining 5-10 kmers per pass, resulting in a slightly larger but faster-to-generate set.
High-Quality Compact Set
kmerfilterset.sh in=sequences.fa out=compact_filter.fa k=27 mincount=3 maxpasses=5000
Restricts selection to kmers that appear at least 3 times in a pass and raises the pass limit to 5000. Favoring abundant kmers keeps the set small, but sequences whose kmers are all rarer than mincount may be left unmatched.
Using Initial Kmer Set
kmerfilterset.sh in=new_reads.fq out=combined_filter.fa initial=existing_filter.fa k=31
Extends an existing filter set with new sequences, useful for incremental updates to filter databases.
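For scripted pipelines, the command lines above can be assembled programmatically. The following Python sketch only builds the argument list from the key=value parameters documented here; build_command is a hypothetical helper, not part of BBTools.

```python
import shlex

def build_command(infile, outfile, k=31, **options):
    """Assemble a kmerfilterset.sh argument list from key=value parameters."""
    args = ["kmerfilterset.sh", f"in={infile}", f"out={outfile}", f"k={k}"]
    # Extra options (e.g. minkpp, maxkpp, initial) keep their given order.
    args += [f"{key}={value}" for key, value in options.items()]
    return args

cmd = build_command("reads.fq", "filter_kmers.fa", k=25, maxkpp=10, minkpp=5)
print(shlex.join(cmd))
```

The resulting list can be passed to subprocess.run(cmd, check=True), assuming kmerfilterset.sh is on the PATH.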
Algorithm Details
Greedy Multi-Pass Strategy
KmerFilterSetMaker implements a greedy multi-pass algorithm that creates compact kmer sets for sequence filtering. It uses fillHistogram() for frequency analysis and dumpKmersAsBytes_MT() for kmer retention.
Pass-Based Processing
The algorithm operates in multiple passes, each consisting of:
- Kmer Counting: Uses KmerTableSet (k≤31) or KmerTableSetU (k>31) for kmer enumeration and counting
- Frequency Analysis: Builds histograms to identify the most abundant kmers in the current pass
- Selective Retention: Keeps only kmers at or above the minimum count threshold, retaining between minkpp and maxkpp kmers per pass
- Sequence Filtering: Uses BBDuk.main() internally to remove sequences containing the selected kmers, reducing the dataset for subsequent passes
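The pass structure above can be sketched as a small in-memory loop. This is a simplified illustration, not the KmerFilterSetMaker code: it works on Python strings instead of kmer tables and BBDuk, ignores reverse complements, counts each kmer once per sequence (an assumption), and simply takes the top maxkpp kmers each pass without modeling minkpp.

```python
from collections import Counter

def kmers_of(seq, k):
    """Distinct forward-strand kmers of seq, skipping windows that contain N."""
    windows = (seq[i:i + k] for i in range(len(seq) - k + 1))
    return {w for w in windows if "N" not in w}

def greedy_filter_set(seqs, k, maxkpp=2, mincount=1, maxpasses=3000):
    """Return (chosen kmers, sequences still unmatched when the loop ends)."""
    remaining = list(seqs)
    chosen = []
    for _ in range(maxpasses):
        if not remaining:
            break
        # Kmer counting: number of remaining sequences containing each kmer.
        counts = Counter()
        for s in remaining:
            counts.update(kmers_of(s, k))
        candidates = [(c, km) for km, c in counts.items() if c >= mincount]
        if not candidates:
            break
        # Selective retention: most common first, ties broken alphabetically.
        candidates.sort(key=lambda t: (-t[0], t[1]))
        limit = maxkpp if maxkpp > 0 else len(candidates)  # 0 means unlimited
        picked = [km for _, km in candidates[:limit]]
        chosen.extend(picked)
        # Sequence filtering: drop every sequence hit by a picked kmer.
        hit = set(picked)
        remaining = [s for s in remaining if not (kmers_of(s, k) & hit)]
    return chosen, remaining

chosen, remaining = greedy_filter_set(["AAAC", "AAAG", "CCCT"], k=3, maxkpp=1)
print(chosen, remaining)  # ['AAA', 'CCC'] []
```

In this toy input, AAA covers two sequences in the first pass, so the second pass only has to deal with the one sequence that is left.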
Adaptive Threshold Selection
Each pass dynamically determines the minimum count threshold to retain the target number of kmers:
- Analyzes kmer frequency histogram to find optimal cutoff points
- Ensures at least minkpp kmers are retained (unless fewer exist)
- Caps retention at maxkpp kmers to prevent excessive growth
- Stops processing when too few sufficiently abundant kmers remain
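One way to picture the threshold selection is as a walk down the count histogram from the most abundant kmers. The sketch below is an assumption about how such a cutoff could be chosen, not a transcription of fillHistogram(); choose_threshold and its return shape are hypothetical.

```python
def choose_threshold(hist, minkpp=1, maxkpp=2, mincount=1):
    """hist maps a count to the number of distinct kmers seen that many times.

    Returns (count threshold, number of kmers to keep), or None when no kmer
    reaches mincount.
    """
    available = 0
    threshold = None
    for count in sorted(hist, reverse=True):
        if count < mincount:
            break
        threshold = count
        available += hist[count]
        if available >= minkpp:  # enough kmers at or above this count
            break
    if threshold is None:
        return None
    # Cap at maxkpp unless it is 0 (unlimited).
    keep = min(available, maxkpp) if maxkpp > 0 else available
    return threshold, keep

print(choose_threshold({10: 1, 7: 3, 2: 50}, minkpp=2, maxkpp=3))  # (7, 3)
```

Here one kmer has count 10, which is fewer than minkpp=2, so the cutoff drops to 7; four kmers then qualify and maxkpp=3 caps how many are kept.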
Memory Management
The implementation includes several memory optimization strategies:
- Automatic Table Selection: Chooses between standard (≤31-mer) and extended (>31-mer) kmer tables based on k value
- Progressive Filtering: Each pass reduces the working dataset size, decreasing memory requirements for subsequent passes
- Temporary File Management: Writes the sequences remaining after each pass to disk-based temporary files (see the temp parameter), so the shrinking dataset is re-read from disk on the next pass
- Multi-threading: Leverages available CPU cores for kmer counting and sequence processing
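A plausible reason for the boundary at k=31, stated here as an assumption rather than something this document confirms, is 2-bit base packing: at 2 bits per base, a 31-mer needs 62 bits and fits in a signed 64-bit integer, while a 32-mer would need all 64 bits. A minimal Python sketch of that arithmetic:

```python
_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}

def encode(kmer: str) -> int:
    """Pack a kmer into an integer using 2 bits per base."""
    value = 0
    for base in kmer:
        value = (value << 2) | _CODE[base]
    return value

def fits_in_signed_64(k: int) -> bool:
    """A signed 64-bit integer has 63 usable value bits."""
    return 2 * k <= 63

print(encode("ACGT"))  # 27
print(fits_in_signed_64(31), fits_in_signed_64(32))  # True False
```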
Output Characteristics
The resulting filter set has specific properties that make it suitable for downstream applications:
- Coverage Guarantee: Every input sequence contains at least one kmer from the final set
- Compact Size: While not mathematically optimal, the greedy approach produces significantly smaller sets than naive approaches
- High-Frequency Bias: Preferentially includes kmers that appear in multiple sequences, improving filter performance
- Strand Consistency: When rcomp=true, maintains consistent representation of reverse-complement kmer pairs
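The coverage guarantee can be checked independently of the tool. The following Python sketch lists the sequences that contain no kmer from a given set; it holds everything in memory, the function names are hypothetical, and for real data a kmer-matching tool such as BBDuk would be the practical choice.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")

def revcomp(s: str) -> str:
    return s.translate(_COMPLEMENT)[::-1]

def is_covered(seq, kmer_set, k, rcomp=True):
    """True if any window of seq (or its reverse complement) is in kmer_set."""
    for i in range(len(seq) - k + 1):
        window = seq[i:i + k]
        if window in kmer_set:
            return True
        if rcomp and revcomp(window) in kmer_set:
            return True
    return False

def uncovered(seqs, kmer_set, k, rcomp=True):
    """Sequences that share no kmer with the set."""
    return [s for s in seqs if not is_covered(s, kmer_set, k, rcomp)]

print(uncovered(["AAAC", "GTTT", "CCCT"], {"AAA"}, k=3))  # ['CCCT']
```

An empty result means the set covers every sequence. Sequences shorter than k have no kmers at all, so this check always reports them as uncovered.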
Performance Considerations
Processing time and memory usage scale with:
- Input Size: Larger inputs require more passes and memory
- Sequence Diversity: Highly diverse sequences need more passes to achieve full coverage
- Kmer Length: Longer kmers generally require more passes but produce more specific filters
- Parameter Settings: Lower minkpp values produce smaller sets but require more passes
Integration with BBTools Ecosystem
The generated filter sets are designed for seamless integration with other BBTools:
- BBDuk: Primary use case for sequence filtering and contamination removal
- BBSplit: Can use filter sets for taxonomic or functional sequence binning
- KCompress: Further reduces filter set size while maintaining effectiveness
- Seal: Uses filter sets for rapid sequence matching and classification
Support
For questions and support:
- Email: bbushnell@lbl.gov
- Documentation: bbmap.org