BBCMS
Error-corrects reads and/or filters them by depth, storing kmer counts in a count-min sketch (a Bloom filter variant), which uses a fixed amount of memory. The error-correction algorithm is taken from Tadpole; with plenty of memory, the behavior is almost identical to Tadpole's. As the number of unique kmers in a dataset grows, accuracy decreases and fewer corrections are made. Even so, BBCMS can make useful corrections far past the point where Tadpole would crash by running out of memory, even with the prefilter flag. If there is sufficient memory to use Tadpole, however, Tadpole is preferable.
Basic Usage
bbcms.sh in=<input file> out=<output> outb=<reads failing filters>
BBCMS (Bloom filter Count-Min Sketch) is a memory-efficient tool for error correction and depth filtering of sequencing reads. It uses a count-min sketch data structure to track k-mer frequencies with fixed memory usage, making it suitable for very large datasets that would exceed memory limits with traditional approaches.
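The counting structure described above can be sketched in a few lines of Python. This is a toy model for illustration only, not BBTools code; the real implementation uses bit-packed atomic arrays, and the cell count and hashing scheme here are arbitrary:

```python
# Toy count-min sketch: each kmer is hashed into one cell per row, and
# the reported count is the minimum across rows, which upper-bounds the
# true count (collisions can only inflate individual cells).
import hashlib

class CountMinSketch:
    def __init__(self, cells=1024, hashes=3):
        self.cells = cells
        self.hashes = hashes
        self.table = [[0] * cells for _ in range(hashes)]

    def _positions(self, kmer):
        for h in range(self.hashes):
            digest = hashlib.sha256(f"{h}:{kmer}".encode()).hexdigest()
            yield h, int(digest, 16) % self.cells

    def increment(self, kmer):
        for row, pos in self._positions(kmer):
            self.table[row][pos] += 1

    def count(self, kmer):
        return min(self.table[row][pos] for row, pos in self._positions(kmer))

cms = CountMinSketch()
for _ in range(5):
    cms.increment("ACGTACGT")
print(cms.count("ACGTACGT"))  # 5
```

The fixed table size is what gives BBCMS its constant memory footprint: counts degrade gracefully (overestimates from collisions) rather than the program running out of memory.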
Parameters
Parameters are organized by their function in the BBCMS process. The tool provides comprehensive control over bloom filter construction, depth filtering, and error correction strategies.
File parameters
- in=<file>
- Primary input, or read 1 input.
- in2=<file>
- Read 2 input if reads are in two files.
- out=<file>
- Primary read output.
- out2=<file>
- Read 2 output if reads are in two files.
- outb=<file>
- (outbad/outlow) Output for reads failing mincount.
- outb2=<file>
- (outbad2/outlow2) Read 2 output if reads are in two files.
- extra=<file>
- Additional comma-delimited files for generating kmer counts.
- ref=<file>
- If ref is set, then only files in the ref list will be used for kmer counts, and the input files will NOT be used for counts; they will just be filtered or corrected.
- overwrite=t
- (ow) Set to false to force the program to abort rather than overwrite an existing file.
Hashing parameters
- k=31
- Kmer length, currently 1-31.
- hashes=3
- Number of hashes per kmer. Higher generally reduces false positives at the expense of speed; rapidly diminishing returns above 4.
- ksmall=
- Optional sub-kmer length; setting to slightly lower than k can improve memory efficiency by reducing the number of hashes needed. e.g. 'k=31 ksmall=29 hashes=2' has better speed and accuracy than 'k=31 hashes=3' when the filter is very full.
- minprob=0.5
- Ignore kmers with probability of being correct below this.
- memmult=1.0
- Fraction of free memory to use for Bloom filter. 1.0 should generally work; if the program crashes with an out of memory error, set this lower. You may be able to increase accuracy by setting it slightly higher.
- cells=
- Option to set the number of cells manually. By default this will be autoset to use all available memory. The only reason to set this is to ensure deterministic output.
- seed=0
- This will change the hash function used. Useful if running iteratively with a very full table. -1 uses a random seed.
- symmetricwrite=t
- (sw) Increases counting accuracy for a slight speed penalty. Could be slow on very low-complexity sequence.
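The diminishing returns of the hashes parameter can be seen with the standard Bloom-filter false-positive approximation. This is textbook math, not BBCMS's exact accounting; cell and item counts are illustrative:

```python
# fp ≈ (1 - exp(-hashes * items / cells)) ** hashes
import math

def false_positive_rate(cells, items, hashes):
    return (1 - math.exp(-hashes * items / cells)) ** hashes

cells, items = 1_000_000, 100_000  # 10 cells per unique item
for h in (1, 2, 3, 4, 6, 8):
    print(h, round(false_positive_rate(cells, items, h), 4))
```

At 10 cells per item, going from 1 to 3 hashes cuts the false-positive rate by roughly 5x, while going from 4 to 8 barely helps; and when the filter is very full, extra hashes fill it faster, which is why fewer hashes with a ksmall sub-kmer can win.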
Depth filtering parameters
- mincount=0
- If positive, reads with kmer counts below mincount will be discarded (sent to outb).
- hcf=1.0
- (highcountfraction) Fraction of kmers that must be at least mincount to pass.
- requireboth=t
- Require both reads in a pair to pass in order to go to out. When true, if either read has a count below mincount, both reads in the pair go to outb. When false, the pair goes to outb only if both reads fail.
- tossjunk=f
- Remove reads or pairs with outermost kmer depth below 2.
Error correction parameters
- ecc=t
- Perform error correction.
- bits=
- Bits used to store kmer counts; max count is 2^bits-1. Supports 2, 4, 8, 16, or 32. 16 is best for high-depth data; 2 or 4 are for huge, low-depth metagenomes that saturate the bloom filter otherwise. Generally 4 bits is recommended for error-correction and 2 bits is recommended for filtering only.
- ecco=f
- Error-correct paired reads by overlap prior to kmer-counting.
- merge=t
- Merge paired reads by overlap prior to kmer-counting, and again prior to correction. Output will still be unmerged.
- smooth=3
- Remove spikes from kmer counts due to hash collisions. The number is the max width of peaks to be smoothed; range is 0-3 (3 is most aggressive; 0 disables smoothing). This also affects tossjunk.
- vstrict=t
- Enable very strict overlap detection for read merging.
- ustrict=f
- Enable ultra-strict overlap detection for read merging.
- testmerge=t
- Test merged reads for quality before accepting the merge.
- testmergewidth=4
- Width parameter for merge testing.
- testmergethresh=3
- Threshold parameter for merge testing.
- testmergemult=80
- Multiplier parameter for merge testing.
- pincer=t
- Enable pincer error correction mode.
- tail=t
- Enable tail error correction mode.
- reassemble=t
- Enable reassembly error correction mode.
- smoothwidth=3
- Explicit setting for smoothing width (overrides smooth parameter).
- maxload=1.0
- Maximum load factor for the Bloom filter before terminating.
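The bits parameter trades count resolution for capacity: each cell saturates at 2^bits-1, as this small sketch shows (illustrative, not BBTools code):

```python
# Counts stored in 'bits' bits saturate at 2**bits - 1, so a 2-bit
# filter can only distinguish depths 0-3 (enough for mincount filtering),
# while a 4-bit filter distinguishes depths up to 15 (better for ecc).
def saturating_increment(count, bits):
    return min(count + 1, 2**bits - 1)

count = 0
for _ in range(20):
    count = saturating_increment(count, bits=4)
print(count)  # 15: a 4-bit cell cannot represent depths above 15
```

This is why 2 bits suffices for filtering-only runs (the filter only needs to see whether a count reaches mincount), while error correction benefits from the finer depth resolution of 4 bits.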
Java Parameters
- -Xmx
- This will set Java's memory usage, overriding autodetection. -Xmx20g will specify 20 gigs of RAM, and -Xmx200m will specify 200 megs. The max is typically 85% of physical memory.
- -eoom
- This flag will cause the process to exit if an out-of-memory exception occurs. Requires Java 8u92+.
- -da
- Disable assertions.
Examples
Error Correction Example
bbcms.sh in=reads.fq out=ecc.fq bits=4 hashes=3 k=31 merge
Performs error correction on paired-end reads using a 4-bit Bloom filter with 31-mer length. Reads are merged during processing but output as separate files.
Depth Filtering Example
bbcms.sh in=reads.fq out=high.fq outb=low.fq k=31 mincount=2 ecc=f hcf=0.4
Filters reads by k-mer depth without error correction. Reads with at least 40% of k-mers having depth ≥2 go to high.fq, others to low.fq.
Two-Pass Processing for Very Large Datasets
# First pass: filtering only
bbcms.sh in=reads.fq out=filtered.fq bits=2 tossjunk=t ecc=f mincount=2 hcf=0.4
# Second pass: error correction
bbcms.sh in=filtered.fq out=corrected.fq bits=4 hashes=3 k=31 merge
For extremely large datasets, use a two-pass approach: first filter with a 2-bit filter to remove low-quality data, then error-correct the remaining reads with a 4-bit filter.
Algorithm Details
Count-Min Sketch Architecture: BBCMS implements a count-min sketch using KCountArray7MTA with atomic integer arrays for thread-safe operations. The data structure uses configurable bit-packing (2-32 bits per cell) and multiple hash functions, with KCountArray7MTA.setSeed() providing hash variation. Memory allocation is fixed at initialization; buffer sizing uses Tools.mid(16, 128, (Shared.threads()*2)/3).
Memory Management: Memory usage is calculated by freeRam 4000m 84 in the shell script, allocating 84% of free RAM above a 4GB baseline. The BloomFilter.OVERRIDE_CELLS parameter allows manual cell-count specification. Memory remains constant throughout processing because KCountArray uses a fixed-size array.
Error Correction Strategy: The BloomFilterCorrector class implements three correction modes, each controlled by a boolean flag:
- Pincer mode (ECC_PINCER): bidirectional k-mer analysis working inward from both read ends; corrections are tracked by basesCorrectedPincer.
- Tail mode (ECC_TAIL): end-focused correction where error rates are highest; corrections are tracked by basesCorrectedTail.
- Reassemble mode (ECC_REASSEMBLE): reconstruction using high-confidence k-mer paths; corrections are tracked by basesCorrectedReassemble.
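The common idea behind these modes, replacing a low-count base only when a substitution makes every covering k-mer high-count, can be illustrated with a toy model. This is not the BloomFilterCorrector logic (pincer mode additionally anchors from both read ends, and real counts come from the sketch); k, threshold, and data here are made up:

```python
# Toy count-based correction: a position whose covering k-mers are all
# low-count is replaced by the base that maximizes the minimum covering
# k-mer count, if that minimum then clears the threshold.
K = 3

def kmer_counts(reads, k=K):
    counts = {}
    for r in reads:
        for i in range(len(r) - k + 1):
            counts[r[i:i+k]] = counts.get(r[i:i+k], 0) + 1
    return counts

def min_cover_count(read, pos, counts, k=K):
    lo = max(0, pos - k + 1)
    hi = min(pos, len(read) - k)
    return min(counts.get(read[i:i+k], 0) for i in range(lo, hi + 1))

def correct(read, counts, thresh=3, k=K):
    out = list(read)
    for pos in range(len(read)):
        best, best_count = out[pos], min_cover_count("".join(out), pos, counts, k)
        if best_count >= thresh:
            continue  # position already well supported
        for base in "ACGT":
            trial = out[:]
            trial[pos] = base
            c = min_cover_count("".join(trial), pos, counts, k)
            if c > best_count:
                best, best_count = base, c
        if best_count >= thresh:
            out[pos] = best
    return "".join(out)

true_seq = "ACGTACGTAC"
reads = [true_seq] * 10 + ["ACGTACCTAC"]  # one read with an error at position 6
counts = kmer_counts(reads)
print(correct("ACGTACCTAC", counts))  # ACGTACGTAC
```

The key property, shared with the real modes, is that a correction is only accepted when it is consistent with high-count k-mers on both sides of the suspect base.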
Bloom Filter Implementation: Uses the BloomFilter wrapper around KCountArray implementations, with the number of hash functions set by the hashes parameter. Hash seed variation through KCountArray7MTA.setSeed() enables multiple iterations. The minprob parameter (default 0.5) filters k-mers below the probability threshold via ReadCounter.minProb.
Quality Control Integration: Smoothing removes hash-collision artifacts using corrector.smoothWidth (range 0-3) to detect and correct spurious count spikes. The testmerge functionality validates merged reads using corrector.mergeOK() with configurable width (testMergeWidth=4), threshold (testMergeThresh=3), and multiplier (testMergeMult=80) parameters.
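Spike smoothing exploits the fact that a real coverage increase spans many overlapping k-mers, while a hash collision inflates only a few adjacent counts. The sketch below is an illustrative model of that idea, not BBTools' exact smoothing:

```python
# Clamp isolated spikes of at most `width` positions down to the higher
# of their two flanking counts; wider elevations are left alone, since
# they look like genuine coverage rather than collisions.
def smooth(counts, width=3):
    out = list(counts)
    i = 1
    while i < len(out) - 1:
        if out[i] > out[i - 1]:
            j = i
            while j + 1 < len(out) and out[j + 1] > out[i - 1]:
                j += 1
            if j + 1 < len(out) and j - i + 1 <= width:
                flank = max(out[i - 1], out[j + 1])
                if all(c > flank for c in out[i:j + 1]):
                    out[i:j + 1] = [flank] * (j - i + 1)
            i = j + 1
        else:
            i += 1
    return out

print(smooth([5, 5, 40, 5, 5]))           # [5, 5, 5, 5, 5]
print(smooth([5, 5, 40, 41, 39, 5, 5]))   # width-3 spike flattened
print(smooth([5, 5, 40, 40, 40, 40, 5]))  # width-4 run left intact
```

The width parameter corresponds to the smooth/smoothwidth setting: higher values flatten wider spikes, at the risk of erasing short genuine repeats.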
Overlap Detection: Read merging uses BBMerge.findOverlapStrict(), findOverlapVStrict(), or findOverlapUStrict(), depending on the vstrict/ustrict parameters. Error correction by overlap (ecco) applies BBMerge.errorCorrectWithInsert(), while merge mode uses r1.joinRead(insert) for sequence concatenation.
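The essence of strict overlap detection can be sketched as sliding one read across the other and accepting the longest low-mismatch overlap. This is a simplification, not BBMerge's algorithm (which also weighs base qualities), and it assumes read 2 has already been reverse-complemented into the same orientation as read 1:

```python
# Return the longest overlap (in bases) between the tail of r1 and the
# head of r2 with at most max_mismatches mismatches, or 0 if none.
def find_overlap(r1, r2, min_overlap=8, max_mismatches=1):
    for olap in range(min(len(r1), len(r2)), min_overlap - 1, -1):
        mismatches = sum(a != b for a, b in zip(r1[-olap:], r2[:olap]))
        if mismatches <= max_mismatches:
            return olap
    return 0

r1 = "AAAACCCCGGGGTTTT"
r2 = "GGGGTTTTACGTACGT"
print(find_overlap(r1, r2))  # 8: r1's last 8 bases match r2's first 8
```

Stricter modes (vstrict, ustrict) correspond to tightening the acceptance criteria, trading merge rate for fewer false merges.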
Performance Characteristics
Memory Usage: Memory allocation follows the calcXmx function using the freeRam 4000m 84 formula (84% of free RAM above a 4GB baseline). Thread buffer sizing uses Tools.mid(16, 128, (Shared.threads()*2)/3) for output streams. Cells are allocated via BloomFilter.OVERRIDE_CELLS or sized automatically based on the memmult (memFraction) parameter.
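As a rough model of the documented heuristic (a sketch of the formula as stated above, not the actual calcXmx shell code):

```python
# Approximate -Xmx choice: 84% of free RAM above a 4GB baseline,
# falling back to the baseline on small machines.
def xmx_megabytes(free_ram_mb, baseline_mb=4000, percent=84):
    if free_ram_mb <= baseline_mb:
        return baseline_mb
    return baseline_mb + (free_ram_mb - baseline_mb) * percent // 100

print(xmx_megabytes(128_000))  # 108160 MB on a machine with 128 GB free
```

Passing -Xmx explicitly overrides this autodetection entirely.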
Thread Safety: Uses KCountArray.LOCKED_INCREMENT=true for atomic operations. Processing threads are spawned via an ArrayList<ProcessThread> with thread-local storage (corrector.localTracker.get(), corrector.localLongList.get(), and corrector.localIntList.get()) for k-mer tracking.
Load Monitoring: The filter's load is monitored in real time via filter.filter.usedFraction(), with automatic termination when usedFraction2 > maxLoad. Unique k-mer counts are estimated with filter.filter.estimateUniqueKmers(hashes) and estimateUniqueKmersFromUsedFraction() for depth analysis.
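Estimating unique k-mers from the filter's fill level follows standard Bloom-filter math; the method names above suggest BBCMS does something similar, but the formula below is the textbook version, not the tool's exact code:

```python
# With m cells and h hash functions, the expected used fraction after
# inserting n distinct items is f ≈ 1 - exp(-h*n/m), which inverts to
# n ≈ -(m/h) * ln(1 - f).
import math

def estimate_unique_kmers(cells, hashes, used_fraction):
    return -(cells / hashes) * math.log(1 - used_fraction)

m, h, n_true = 10_000_000, 3, 1_000_000
fill = 1 - math.exp(-h * n_true / m)             # expected fill ≈ 0.259
print(round(estimate_unique_kmers(m, h, fill)))  # 1000000
```

The estimate degrades as the fill fraction approaches 1.0, which is one reason maxload terminates processing before the filter saturates.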
Collision Handling: The symmetricwrite=t parameter increases counting accuracy in the presence of hash collisions. Seed variation via KCountArray7MTA.setSeed() enables iterative processing with different hash functions, and the smoothing width (corrector.smoothWidth) removes collision artifacts.
Support
For questions and support:
- Email: bbushnell@lbl.gov
- Documentation: bbmap.org