BloomFilter
Filters reads potentially sharing a kmer with a reference. The more memory, the higher the accuracy. Reads going to outu are guaranteed to not match the reference, but reads going to outm might may or may not match the reference. Can be used for contamination removal, depth filtering, or error correction combined with filtering.
Basic Usage
bloomfilter.sh in=<input file> out=<nonmatches> outm=<matches> ref=<reference>
The tool creates a Bloom filter from the reference sequences and filters input reads based on kmer matches. Reads that potentially match the reference go to outm, while guaranteed non-matches go to out.
Parameters
Parameters are organized by their function in the bloom filtering process. The tool uses probabilistic data structures for memory-efficient kmer matching.
File parameters
- in=<file>
- Primary input, or read 1 input.
- in2=<file>
- Read 2 input if reads are in two files.
- outm=<file>
- (out) Primary matched read output. Reads that potentially match the reference.
- outm2=<file>
- (out2) Matched read 2 output if reads are in two files.
- outu=<file>
- Primary unmatched read output. Reads guaranteed not to match the reference.
- outu2=<file>
- Unmatched read 2 output if reads are in two files.
- outc=<file>
- Optional output stream for kmer counts. Outputs kmers from input reads along with their counts from the reference.
- ref=<file>
- Reference sequence file, or a comma-delimited list. For depth-based filtering, set this to the same as the input. The Bloom filter is built from kmers in this reference.
- overwrite=t
- (ow) Set to false to force the program to abort rather than overwrite an existing file.
Hashing parameters
- k=31
- Kmer length. Longer kmers are more specific but require more memory and may reduce sensitivity for short matches.
- hashes=2
- Number of hash functions per kmer. Higher values generally reduce false positives at the expense of speed and memory usage.
- sw=t
- (symmetricwrite) Increases accuracy when bits>1 and hashes>1. Uses symmetric hash functions for better distribution when multiple hash functions and multiple bits per cell are employed together.
- minprob=0.5
- Ignore reference kmers with probability of being correct below this threshold (affects fastq references only). Quality-based filtering of reference kmers.
- memmult=1.0
- Fraction of free memory to use for Bloom filter. 1.0 should generally work; if the program crashes with an out of memory error, set this lower. Higher values increase specificity by reducing false positives.
- cells=
- Option to set the number of cells manually. By default this will be autoset to use all available memory. The only reason to set this is to ensure deterministic output across runs.
- seed=0
- This will change the hash function used. Different seeds produce different hash functions for the same input, allowing reproducible but varied results.
- bits=
- Bits per cell; it is set automatically from mincount. Higher bit counts allow tracking higher kmer frequencies but use more memory.
Reference-matching parameters
- minhits=3
- Consecutive kmer hits for a read to be considered matched. Higher values reduce false positives at the expense of sensitivity. Requires this many consecutive kmers to match the reference.
- mincount=1
- Minimum number of times a read kmer must occur in the reference to be considered a match (or printed to outc). Filters out rare kmers that may be errors.
- requireboth=f
- Require both reads in a pair to match the reference in order to go to outm. By default, pairs go to outm if either read matches.
Processing parameters
- ecco=f
- Error correction and overlap detection. Attempts to correct errors in overlapping read pairs before filtering.
- merge=f
- Merge overlapping read pairs into single reads before filtering. Can improve accuracy for overlapping paired-end reads.
- rcomp=t
- Consider reverse complement of kmers. Should normally be true for DNA sequences to handle both strands.
- ordered=f
- Output reads in input order. Setting to true uses more memory but preserves read order.
- verbose=f
- Print verbose status messages during processing. Shows keys counted and increment statistics when enabled.
Serialization parameters
- serialin=
- Load a pre-built Bloom filter from this file instead of building from reference. Saves time when reusing the same reference.
- serialout=
- Save the built Bloom filter to this file for later reuse. Allows building the filter once and reusing it multiple times.
Java Parameters
- -Xmx
- This will set Java's memory usage, overriding autodetection. -Xmx20g will specify 20 gigs of RAM, and -Xmx200m will specify 200 megs. The max is typically 85% of physical memory.
- -eoom
- This flag will cause the process to exit if an out-of-memory exception occurs. Requires Java 8u92+.
- -da
- Disable assertions.
Examples
Basic contamination removal
bloomfilter.sh in=reads.fq outm=nonhuman.fq out=human.fq k=31 minhits=3 ref=human.fa
Filters human contamination from metagenomic reads. Reads matching human reference go to human.fq, while non-matching reads go to nonhuman.fq.
Depth-based filtering
bloomfilter.sh in=reads.fq out=low_depth.fq outm=high_depth.fq ref=reads.fq mincount=3
Separates high and low depth reads by using the input as its own reference. Kmers occurring 3+ times are considered high-depth.
Error correction with filtering
bloomfilter.sh in=reads.fq in2=reads2.fq out=filtered.fq ref=reference.fa ecco=t merge=t
Performs error correction on overlapping paired reads before filtering against reference. Helps improve accuracy.
Memory-limited filtering
bloomfilter.sh in=reads.fq out=nonmatches.fq outm=matches.fq ref=large_ref.fa memmult=0.5
Uses only 50% of available memory for the Bloom filter. Useful when running alongside other memory-intensive processes.
Serialized filter reuse
# Build and save filter
bloomfilter.sh in=batch1.fq out=clean1.fq outm=contam1.fq ref=contaminants.fa serialout=contam.bloom
# Reuse saved filter
bloomfilter.sh in=batch2.fq out=clean2.fq outm=contam2.fq serialin=contam.bloom
Builds a Bloom filter once and reuses it for multiple datasets. Saves time when processing many files with the same reference.
Algorithm Details
Bloom Filter Implementation
BloomFilterWrapper uses a KCountArray-based probabilistic Bloom filter for memory-efficient k-mer matching. The implementation wraps KCountArray7MTA with atomic operations for thread-safe counting. Hash functions are configurable through KCountArray7MTA.setSeed() method, allowing reproducible but varied hash behavior.
K-mer Encoding Strategy
K-mers are encoded using bit-shifting operations with three key variables:
- shift: bitsPerBase * k (total bits for k-mer)
- shift2: shift - bitsPerBase (for reverse complement calculation)
- mask: Bit mask for k-mer extraction: ~((-1L)<<shift)
This encoding enables direct k-mer comparison using Tools.max(kmer, rkmer) for canonical k-mer representation.
Consecutive Hit Filtering Strategy
The matchesFilter() method implements a consecutive k-mer hit requirement:
- Reads are processed with overlapping k-mers using bit-shift encoding
- Each k-mer queries the BloomFilter.getCount() method
- Requires minConsecutiveMatches consecutive k-mers above minCount threshold
- Uses LongList buffer for tracking consecutive hits when minhits>1
Producer-Consumer Threading Architecture
Multi-threading implementation uses concurrent stream processing:
- ConcurrentReadInputStream: Parallel parsing of input sequences
- ProcessThread workers: Independent read filtering in parallel
- ConcurrentReadOutputStream: Thread-safe output writing with ordering
- ListNum<Read>: Batch processing of reads to reduce synchronization overhead
Memory Management and Sizing
Filter memory allocation uses setMemory() method with several factors:
- BloomFilter.OVERRIDE_CELLS allows manual cell count specification
- Memory fraction (memmult) controls percentage of available RAM usage
- Automatic bit-width adjustment: while minCount>0 && (1L<<bits)-1<minCount, bits*=2
- False positive rate depends on cells, hashes, and reference k-mer density
Error Correction Integration
Error correction uses BBMerge.findOverlapStrict() for paired-read processing:
- Merge mode: Joins overlapping reads using joinRead(insert) method
- ECCO mode: Corrects errors in overlap regions without merging
- Insert size detection: Overlap detection with strict alignment requirements
- Reverse complement handling: r2.reverseComplement() before joining
Count Output Generation
The addCounts() method generates k-mer frequency output using canonical representation:
- Processes reads using AminoAcid.baseToNumber and baseToComplementNumber arrays
- Calculates canonical k-mer: Tools.max(kmer, rkmer) when rcomp=true
- Queries filter.getCount(key) for each k-mer above minCount threshold
- Converts k-mers back to strings using AminoAcid.kmerToString(key, k)
Quality-based Reference Filtering
Reference quality filtering uses ReadCounter.minProb threshold:
- Applies only to FASTQ reference files with quality scores
- Calculates per-base probability of correctness from quality values
- Excludes k-mers with cumulative probability below minprob threshold
- Reduces false positives from low-quality reference sequences
Support
For questions and support:
- Email: bbushnell@lbl.gov
- Documentation: bbmap.org