BloomFilter

Script: bloomfilter.sh Package: bloom Class: BloomFilterWrapper.java

Filters reads potentially sharing a kmer with a reference. The more memory, the higher the accuracy. Reads going to outu are guaranteed to not match the reference, but reads going to outm might may or may not match the reference. Can be used for contamination removal, depth filtering, or error correction combined with filtering.

Basic Usage

bloomfilter.sh in=<input file> out=<nonmatches> outm=<matches> ref=<reference>

The tool creates a Bloom filter from the reference sequences and filters input reads based on kmer matches. Reads that potentially match the reference go to outm, while guaranteed non-matches go to out.

Parameters

Parameters are organized by their function in the bloom filtering process. The tool uses probabilistic data structures for memory-efficient kmer matching.

File parameters

in=<file>: Primary input, or read 1 input.
in2=<file>: Read 2 input if reads are in two files.
outm=<file>: (out) Primary matched read output. Reads that potentially match the reference.
outm2=<file>: (out2) Matched read 2 output if reads are in two files.
outu=<file>: Primary unmatched read output. Reads guaranteed not to match the reference.
outu2=<file>: Unmatched read 2 output if reads are in two files.
outc=<file>: Optional output stream for kmer counts. Outputs kmers from input reads along with their counts from the reference.
ref=<file>: Reference sequence file, or a comma-delimited list. For depth-based filtering, set this to the same as the input. The Bloom filter is built from kmers in this reference.
overwrite=t: (ow) Set to false to force the program to abort rather than overwrite an existing file.

Hashing parameters

k=31: Kmer length. Longer kmers are more specific but require more memory and may reduce sensitivity for short matches.
hashes=2: Number of hash functions per kmer. Higher values generally reduce false positives at the expense of speed and memory usage.
sw=t: (symmetricwrite) Increases accuracy when bits>1 and hashes>1. Uses symmetric hash functions for better distribution when multiple hash functions and multiple bits per cell are employed together.
minprob=0.5: Ignore reference kmers with probability of being correct below this threshold (affects fastq references only). Quality-based filtering of reference kmers.
memmult=1.0: Fraction of free memory to use for Bloom filter. 1.0 should generally work; if the program crashes with an out of memory error, set this lower. Higher values increase specificity by reducing false positives.
cells=: Option to set the number of cells manually. By default this will be autoset to use all available memory. The only reason to set this is to ensure deterministic output across runs.
seed=0: This will change the hash function used. Different seeds produce different hash functions for the same input, allowing reproducible but varied results.
bits=: Bits per cell; it is set automatically from mincount. Higher bit counts allow tracking higher kmer frequencies but use more memory.

Reference-matching parameters

minhits=3: Consecutive kmer hits for a read to be considered matched. Higher values reduce false positives at the expense of sensitivity. Requires this many consecutive kmers to match the reference.
mincount=1: Minimum number of times a read kmer must occur in the reference to be considered a match (or printed to outc). Filters out rare kmers that may be errors.
requireboth=f: Require both reads in a pair to match the reference in order to go to outm. By default, pairs go to outm if either read matches.

Processing parameters

ecco=f: Error correction and overlap detection. Attempts to correct errors in overlapping read pairs before filtering.
merge=f: Merge overlapping read pairs into single reads before filtering. Can improve accuracy for overlapping paired-end reads.
rcomp=t: Consider reverse complement of kmers. Should normally be true for DNA sequences to handle both strands.
ordered=f: Output reads in input order. Setting to true uses more memory but preserves read order.
verbose=f: Print verbose status messages during processing. Shows keys counted and increment statistics when enabled.

Serialization parameters

serialin=: Load a pre-built Bloom filter from this file instead of building from reference. Saves time when reusing the same reference.
serialout=: Save the built Bloom filter to this file for later reuse. Allows building the filter once and reusing it multiple times.

Java Parameters

-Xmx: This will set Java's memory usage, overriding autodetection. -Xmx20g will specify 20 gigs of RAM, and -Xmx200m will specify 200 megs. The max is typically 85% of physical memory.
-eoom: This flag will cause the process to exit if an out-of-memory exception occurs. Requires Java 8u92+.
-da: Disable assertions.

Examples

Basic contamination removal

bloomfilter.sh in=reads.fq outm=nonhuman.fq out=human.fq k=31 minhits=3 ref=human.fa

Filters human contamination from metagenomic reads. Reads matching human reference go to human.fq, while non-matching reads go to nonhuman.fq.

Depth-based filtering

bloomfilter.sh in=reads.fq out=low_depth.fq outm=high_depth.fq ref=reads.fq mincount=3

Separates high and low depth reads by using the input as its own reference. Kmers occurring 3+ times are considered high-depth.

Error correction with filtering

bloomfilter.sh in=reads.fq in2=reads2.fq out=filtered.fq ref=reference.fa ecco=t merge=t

Performs error correction on overlapping paired reads before filtering against reference. Helps improve accuracy.

Memory-limited filtering

bloomfilter.sh in=reads.fq out=nonmatches.fq outm=matches.fq ref=large_ref.fa memmult=0.5

Uses only 50% of available memory for the Bloom filter. Useful when running alongside other memory-intensive processes.

Serialized filter reuse

# Build and save filter
bloomfilter.sh in=batch1.fq out=clean1.fq outm=contam1.fq ref=contaminants.fa serialout=contam.bloom

# Reuse saved filter
bloomfilter.sh in=batch2.fq out=clean2.fq outm=contam2.fq serialin=contam.bloom

Builds a Bloom filter once and reuses it for multiple datasets. Saves time when processing many files with the same reference.

Algorithm Details

Bloom Filter Implementation

BloomFilterWrapper uses a KCountArray-based probabilistic Bloom filter for memory-efficient k-mer matching. The implementation wraps KCountArray7MTA with atomic operations for thread-safe counting. Hash functions are configurable through KCountArray7MTA.setSeed() method, allowing reproducible but varied hash behavior.

K-mer Encoding Strategy

K-mers are encoded using bit-shifting operations with three key variables:

shift: bitsPerBase * k (total bits for k-mer)
shift2: shift - bitsPerBase (for reverse complement calculation)
mask: Bit mask for k-mer extraction: ~((-1L)<<shift)

This encoding enables direct k-mer comparison using Tools.max(kmer, rkmer) for canonical k-mer representation.

Consecutive Hit Filtering Strategy

The matchesFilter() method implements a consecutive k-mer hit requirement:

Reads are processed with overlapping k-mers using bit-shift encoding
Each k-mer queries the BloomFilter.getCount() method
Requires minConsecutiveMatches consecutive k-mers above minCount threshold
Uses LongList buffer for tracking consecutive hits when minhits>1

Producer-Consumer Threading Architecture

Multi-threading implementation uses concurrent stream processing:

ConcurrentReadInputStream: Parallel parsing of input sequences
ProcessThread workers: Independent read filtering in parallel
ConcurrentReadOutputStream: Thread-safe output writing with ordering
ListNum<Read>: Batch processing of reads to reduce synchronization overhead

Memory Management and Sizing

Filter memory allocation uses setMemory() method with several factors:

BloomFilter.OVERRIDE_CELLS allows manual cell count specification
Memory fraction (memmult) controls percentage of available RAM usage
Automatic bit-width adjustment: while minCount>0 && (1L<<bits)-1<minCount, bits*=2
False positive rate depends on cells, hashes, and reference k-mer density

Error Correction Integration

Error correction uses BBMerge.findOverlapStrict() for paired-read processing:

Merge mode: Joins overlapping reads using joinRead(insert) method
ECCO mode: Corrects errors in overlap regions without merging
Insert size detection: Overlap detection with strict alignment requirements
Reverse complement handling: r2.reverseComplement() before joining

Count Output Generation

The addCounts() method generates k-mer frequency output using canonical representation:

Processes reads using AminoAcid.baseToNumber and baseToComplementNumber arrays
Calculates canonical k-mer: Tools.max(kmer, rkmer) when rcomp=true
Queries filter.getCount(key) for each k-mer above minCount threshold
Converts k-mers back to strings using AminoAcid.kmerToString(key, k)

Quality-based Reference Filtering

Reference quality filtering uses ReadCounter.minProb threshold:

Applies only to FASTQ reference files with quality scores
Calculates per-base probability of correctness from quality values
Excludes k-mers with cumulative probability below minprob threshold
Reduces false positives from low-quality reference sequences

Support

For questions and support:

Email: bbushnell@lbl.gov
Documentation: bbmap.org