BBNorm
Normalizes coverage by down-sampling reads over high-depth areas to achieve a flat coverage distribution. Accelerates assembly and often improves assembly quality. Uses a Count-Min Sketch probabilistic data structure for memory-bounded operation on datasets of unlimited size.
Overview
BBNorm is designed to normalize coverage by down-sampling reads over high-depth areas of a genome, resulting in a flat coverage distribution. This process can dramatically accelerate assembly and render intractable datasets tractable, while often improving assembly quality.
Key Capabilities
- Coverage Normalization: Flatten uneven coverage distributions for optimal assembly
- Error Correction: Kmer-based error correction (though Tadpole is preferred when memory allows)
- Depth Binning: Separate reads into low, medium, and high depth categories
- Histogram Generation: Create kmer frequency histograms for genome size estimation
- Quality Filtering: Remove error-containing reads based on kmer depth profiles
Notable Features
- Never Runs Out of Memory: Uses Count-Min Sketch probabilistic data structure that gracefully degrades accuracy rather than failing
- Multipass Processing: Reduces the average error rate in normalized output (unlike standard single-pass normalization, which enriches for errors)
- Unlimited Kmer Lengths: Supports arbitrarily high k-mer sizes beyond the typical 32-mer limit
- Multiple Shell Scripts: Three convenience scripts (bbnorm.sh, ecc.sh, khist.sh) with optimized defaults for different use cases
When to Use BBNorm
Recommended Use Cases
- Assembly preprocessing when you have too much data (e.g., 600x coverage when you only want 100x)
- Uneven coverage datasets such as amplified single-cell, RNA-seq, viruses, or metagenomes
- Memory-limited error correction when Tadpole cannot fit in available memory
- Genome size estimation via kmer frequency histogram analysis
- Quality filtering to remove reads with aberrant kmer depth patterns
When NOT to Use BBNorm
Important: Never normalize data for these applications:
- Quantitative analysis: ChIP-seq, RNA-seq expression profiling, or any coverage-dependent analysis
- Variant discovery: Mapping for SNP/indel calling (will introduce bias)
- High-error platforms: PacBio or Nanopore data (BBNorm is designed for low-error, fixed-length reads)
- Rare variant detection: May correct minority alleles into majority alleles
- Already-optimal coverage: Data whose coverage is already smooth and at an appropriate depth (BBNorm can only reduce coverage, not inflate it)
Alternative: Use random subsampling instead of normalization when you need to reduce data volume for these applications.
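For example, random subsampling can be done with BBTools' reformat.sh (the rate shown is illustrative):
reformat.sh in=reads.fq out=subsampled.fq samplerate=0.1
This keeps a uniform random 10% of reads, preserving the relative abundances that normalization would distort.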
Basic Usage
bbnorm.sh in=<input> out=<reads to keep> outt=<reads to toss> hist=<histogram output>
Related Scripts
- bbnorm.sh: General normalization with default 2-pass processing
- ecc.sh: Error correction mode with optimized defaults (equivalent to bbnorm.sh ecc=t keepall passes=1 bits=16 prefilter)
- khist.sh: Histogram generation mode (equivalent to bbnorm.sh passes=1 prefilter minprob=0 minqual=0 mindepth=0)
Parameters
Input Parameters
- in=null
- Primary input. Use in2 for paired reads in a second file
- in2=null
- Second input file for paired reads in two files
- extra=null
- Additional files to use for input (generating hash table) but not for output
- fastareadlen=2^31
- Break up FASTA reads longer than this. Can be useful when processing scaffolded genomes
- tablereads=-1
- Use at most this many reads when building the hashtable (-1 means all)
- kmersample=1
- Process every nth kmer, and skip the rest
- readsample=1
- Process every nth read, and skip the rest
- interleaved=auto
- May be set to true or false to override autodetection of whether the input file contains interleaved paired reads
- qin=auto
- ASCII offset for input quality. May be 33 (Sanger), 64 (Illumina), or auto
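For example, a minimal run with paired reads split across two files, using the in2/out2 parameters above (file names hypothetical):
bbnorm.sh in=reads_1.fq in2=reads_2.fq out=normalized_1.fq out2=normalized_2.fq target=100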
Output Parameters
- out=<file>
- File for normalized or corrected reads. Use out2 for paired reads in a second file
- outt=<file>
- (outtoss) File for reads that were excluded from the primary output
- reads=-1
- Only process this number of reads, then quit (-1 means all)
- sampleoutput=t
- Use sampling on output as well as input (not used if sample rates are 1)
- keepall=f
- Set to true to keep all reads (e.g. if you just want error correction)
- zerobin=f
- Set to true if you want kmers with a count of 0 to go into the 0 bin of histograms instead of the 1 bin. Default is false, to prevent confusion about how there can be 0-count kmers; the reason is that, depending on the 'minq' and 'minprob' settings, some kmers may be excluded from the bloom filter
- tmpdir=$TMPDIR
- This will specify a directory for temp files (only needed for multipass runs). If null, they will be written to the output directory
- usetempdir=t
- Allows enabling/disabling of temporary directory; if disabled, temp files will be written to the output directory
- qout=auto
- ASCII offset for output quality. May be 33 (Sanger), 64 (Illumina), or auto (same as input)
- rename=f
- Rename reads based on their kmer depth
Hashing Parameters
- k=31
- Kmer length (values under 32 are most efficient, but arbitrarily high values are supported)
- bits=32
- Bits per cell in bloom filter; must be 2, 4, 8, 16, or 32. Maximum kmer count recorded is 2^bits-1 (counts saturate there). Automatically reduced to 16 in 2-pass mode. Large values decrease accuracy for a fixed amount of memory, so use the lowest number that will still capture the highest-depth kmers
- hashes=3
- Number of times each kmer is hashed and stored. Higher is slower. Higher is MORE accurate if there is enough memory, and LESS accurate if there is not enough memory
- prefilter=f
- True is slower, but generally more accurate; filters out low-depth kmers from the main hashtable. The prefilter is more memory-efficient because it uses 2-bit cells
- prehashes=2
- Number of hashes for prefilter
- prefilterbits=2
- (pbits) Bits per cell in prefilter
- prefiltersize=0.35
- Fraction of memory to allocate to prefilter
- buildpasses=1
- More passes can sometimes increase accuracy by iteratively removing low-depth kmers
- minq=6
- Ignore kmers containing bases with quality below this
- minprob=0.5
- Ignore kmers with overall probability of correctness below this
- threads=auto
- (t) Spawn exactly this many hashing threads (default is the number of logical processors). Total active threads may exceed this due to I/O threads
- rdk=t
- (removeduplicatekmers) When true, a kmer's count will only be incremented once per read pair, even if that kmer occurs more than once
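As a sketch of combining the hashing options above for a memory-limited, error-heavy dataset (values illustrative, using the aliases documented in this list):
bbnorm.sh in=reads.fq out=normalized.fq k=31 bits=16 hashes=3 prefilter=t prehashes=2 pbits=2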
Normalization Parameters
- fixspikes=f
- (fs) Do a slower, high-precision bloom filter lookup of kmers that appear to have an abnormally high depth due to collisions
- target=100
- (tgt) Target normalization depth. NOTE: All depth parameters control kmer depth, not read depth. For kmer depth Dk, read depth Dr, read length R, and kmer size K: Dr=Dk*(R/(R-K+1)). A worked example follows this parameter list
- maxdepth=-1
- (max) Reads will not be downsampled when below this depth, even if they are above the target depth
- mindepth=5
- (min) Kmers with depth below this number will not be included when calculating the depth of a read
- minkmers=15
- (mgkpr) Reads must have at least this many kmers over min depth to be retained. Aka 'mingoodkmersperread'
- percentile=54.0
- (dp) Read depth is by default inferred from the 54th percentile of kmer depth, but this may be changed to any number 1-100
- uselowerdepth=t
- (uld) For pairs, use the depth of the lower read as the depth proxy
- deterministic=t
- (dr) Generate random numbers deterministically to ensure identical output between multiple runs. May decrease speed with a huge number of threads
- passes=2
- (p) 1 pass is the basic mode. 2 passes (the default) allow greater accuracy, error detection, and better control of output depth
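As a worked example of the kmer-to-read depth formula above: for 150bp reads (R=150), k=31 (K=31), and kmer depth Dk=100, the read depth is Dr = 100*(150/(150-31+1)) = 100*(150/120) = 125. So target=100 with these read and kmer lengths corresponds to roughly 125x read coverage.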
Error Detection Parameters
- hdp=90.0
- (highdepthpercentile) Position in the sorted kmer depth array used as a proxy for a read's high kmer depth
- ldp=25.0
- (lowdepthpercentile) Position in the sorted kmer depth array used as a proxy for a read's low kmer depth
- tossbadreads=f
- (tbr) Throw away reads detected as containing errors
- requirebothbad=f
- (rbb) Only toss bad pairs if both reads are bad
- errordetectratio=125
- (edr) Reads with a ratio of at least this much between their high and low depth kmers will be classified as error reads
- highthresh=12
- (ht) Threshold for high kmer: kmers at or above this depth are considered non-error
- lowthresh=3
- (lt) Threshold for low kmer: kmers at or below this depth are always considered errors
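A sketch of filtering error reads with these parameters, using an extremely high target so no downsampling occurs (the thresholds shown are the defaults):
bbnorm.sh in=reads.fq out=clean.fq outt=suspect.fq tbr=t passes=1 target=999999999 edr=125 ht=12 lt=3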
Error Correction Parameters
- ecc=f
- Set to true to correct errors. NOTE: Tadpole is now preferred for ecc as it does a better job
- ecclimit=3
- Correct up to this many errors per read. If more are detected, the read will remain unchanged
- errorcorrectratio=140
- (ecr) Adjacent kmers with a depth ratio of at least this will be classified as containing an error
- echighthresh=22
- (echt) Threshold for high kmer: kmers at or above this depth may be considered non-error
- eclowthresh=2
- (eclt) Threshold for low kmer: kmers at or below this depth are considered errors
- eccmaxqual=127
- Do not correct bases with quality above this value
- aec=f
- (aggressiveErrorCorrection) Sets more aggressive values of ecr=100, ecclimit=7, echt=16, eclt=3
- cec=f
- (conservativeErrorCorrection) Sets more conservative values of ecr=180, ecclimit=2, echt=30, eclt=1, sl=4, pl=4
- meo=f
- (markErrorsOnly) Marks errors by reducing quality value of suspected errors; does not correct anything
- mue=t
- (markUncorrectableErrors) Marks errors only on uncorrectable reads; requires 'ecc=t'
- overlap=f
- (ecco) Error correct by read overlap
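A minimal correction-only sketch using the aggressive preset above (Tadpole remains the preferred corrector when memory allows):
bbnorm.sh in=reads.fq out=corrected.fq ecc=t keepall passes=1 prefilter aec=t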
Depth Binning Parameters
- lowbindepth=10
- (lbd) Cutoff for low depth bin
- highbindepth=80
- (hbd) Cutoff for high depth bin
- outlow=<file>
- Pairs in which both reads have a median below lbd go into this file
- outhigh=<file>
- Pairs in which both reads have a median above hbd go into this file
- outmid=<file>
- All other pairs go into this file
Histogram Parameters
- hist=<file>
- Specify a file to write the input kmer depth histogram
- histout=<file>
- Specify a file to write the output kmer depth histogram
- histcol=3
- (histogramcolumns) Number of histogram columns, 2 or 3
- pzc=f
- (printzerocoverage) Print lines in the histogram with zero coverage
- histlen=1048576
- Max kmer depth displayed in histogram. Also affects statistics displayed, but does not affect normalization
Peak Calling Parameters
- peaks=<file>
- Write the peaks to this file. Default is stdout
- minHeight=2
- (h) Ignore peaks shorter than this
- minVolume=5
- (v) Ignore peaks with less area than this
- minWidth=3
- (w) Ignore peaks narrower than this
- minPeak=2
- (minp) Ignore peaks with an X-value below this
- maxPeak=BIG
- (maxp) Ignore peaks with an X-value above this
- maxPeakCount=8
- (maxpc) Print up to this many peaks (prioritizing height)
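For example, restricting peak calling to substantial peaks (values illustrative):
khist.sh in=reads.fq khist=khist.txt peaks=peaks.txt minHeight=2 minWidth=3 minVolume=5 maxPeakCount=4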
Java Parameters
- -Xmx
- This will set Java's memory usage, overriding autodetection. -Xmx20g will specify 20 GB of RAM, and -Xmx200m will specify 200 MB. The maximum is typically 85% of physical memory
- -eoom
- This flag will cause the process to exit if an out-of-memory exception occurs. Requires Java 8u92+
- -da
- Disable assertions
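For example, capping the heap at 20 GB and exiting on an out-of-memory condition rather than stalling:
bbnorm.sh in=reads.fq out=normalized.fq target=100 -Xmx20g -eoom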
Usage Examples
Estimating Memory Requirements
loglog.sh in=reads.fq
Estimates the number of unique kmers in a dataset to determine memory requirements. For 1 billion kmers using 16-bit cells and 3 hashes: ~12 GB needed for 50% table occupancy.
Basic Coverage Normalization
bbnorm.sh in=reads.fq out=normalized.fq target=100 min=5
Runs 2-pass normalization to produce reads with average depth of 100x. Reads with apparent depth under 5x are presumed to be errors and discarded.
Error Correction Only
ecc.sh in=reads.fq out=corrected.fq
Performs error correction without discarding reads. Equivalent to bbnorm.sh ecc=t keepall passes=1 bits=16 prefilter.
Kmer Frequency Histogram
khist.sh in=reads.fq khist=khist.txt peaks=peaks.txt
Generates kmer frequency histogram for genome size estimation. The peaks file contains estimates of genome size and ploidy for randomly-sheared genomic DNA.
High-Pass/Low-Pass Filtering
bbnorm.sh in=reads.fq out=highpass.fq outt=lowpass.fq passes=1 target=999999999 min=10
Separates reads: high-depth reads (≥10x) to "out", low-depth reads to "outt".
Three-Bin Depth Sorting
bbnorm.sh in=reads.fq outlow=low.fq outmid=mid.fq outhigh=high.fq passes=1 lowbindepth=10 highbindepth=80
Splits read pairs into low (median kmer depth below 10x), high (above 80x), and mid (all others) coverage bins.
Memory-Efficient Processing
bbnorm.sh in=reads.fq out=normalized.fq target=100 prefilter bits=16 -Xmx8g
Uses prefilter to maximize accuracy when memory is limited. Prefilter uses 2-bit cells for low-depth kmers, main table uses 16-bit cells.
Combined Operations
bbnorm.sh in=reads.fq out=normalized.fq target=100 min=5 prefilter ecc hist=before.txt histout=after.txt
Normalizes, error-corrects, and generates before/after kmer depth histograms in a single command.
Algorithm Details
Count-Min Sketch Data Structure
BBNorm uses a Count-Min Sketch (CMS), also called a "counting Bloom filter". This probabilistic data structure stores only values, not keys, and ignores hash collisions. To prevent negative effects of collisions, each kmer's count is stored in multiple hash table locations, and the minimum value across all locations is used when reading counts.
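For example, with 3 hashes, a kmer whose three cells hold counts 7, 5, and 9 is reported as having depth 5: collisions can only inflate individual cell counts, never deflate them, so the minimum is the least-biased estimate.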
Memory Configuration
- Cell Sizes: 1, 2, 4, 8, 16, or 32 bits per cell
- Capacity Trade-offs: 1GB RAM accommodates 4 billion 2-bit cells (max count 3) or 500 million 16-bit cells (max count 65535)
- Hash Functions: Default 3 hashes per kmer; more hashes improve accuracy until tables become full
- Prefilter Architecture: Two-stage design uses 2-bit prefilter for low-depth kmers, larger cells for high-depth kmers
Memory Management and Scaling
BBNorm automatically uses all available memory for optimal accuracy. As unique kmer count increases beyond memory capacity, accuracy gradually declines but processing never fails. This allows processing arbitrarily large datasets with fixed memory.
Memory Optimization Guidelines
- Ideal Load: Keep hash tables under 50-60% full for optimal accuracy
- Prefilter Flag: Enable when tables exceed 50% capacity (automatic warning at 90%+)
- Cell Size Selection: Use minimum bits needed for expected coverage (e.g., 8-bit cells for 200x coverage)
- Memory Estimation: 3 hashes × 16 bits/kmer × kmer_count / (8 bits/byte × 0.5 load) = RAM needed
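As a worked example of the estimation formula above: for 1 billion unique kmers, 3 × 16 × 10^9 / (8 × 0.5) = 12 × 10^9 bytes, matching the ~12 GB figure quoted for loglog.sh earlier.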
Normalization Algorithm
The multi-pass normalization process reduces error enrichment compared to single-pass approaches:
- Kmer Counting Pass: Build hash table from input reads, applying quality and probability filters
- Depth Assessment: For each read, calculate kmer depth distribution and extract percentile-based depth proxy
- Normalization Decision: Compare read depth to target; probabilistically retain reads to achieve flat coverage
- Error Detection: Identify reads with aberrant depth patterns indicating sequencing errors
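For example, in the normalization decision, a read with an inferred depth of 400x against a 100x target is retained with probability of roughly 100/400 = 0.25, flattening coverage toward the target.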
Comparison to Related Tools
| Tool | Memory Requirement | Kmer Storage | Error Correction Quality | Best Use Case |
|---|---|---|---|---|
| BBNorm | Bounded (CMS) | Counts only | Good | Large dataset normalization |
| Tadpole | Proportional to data | Exact kmers + counts | Better | Error correction with sufficient memory |
| KmerCountExact | Proportional to data | Exact kmers + counts | N/A | Kmer analysis and exact histograms |
Quality Control Features
- Deterministic Mode: Reproducible results across runs using fixed random seeds
- Spike Detection: Secondary hash lookups to identify collision-induced false high coverage
- Load Monitoring: Automatic warnings when hash tables approach capacity limits
- Multi-threading: Lock-free atomic operations for efficient parallel processing
Temp File Management
Multi-pass processing requires temporary files between passes. BBNorm automatically manages temp file creation, cleanup, and location (configurable via tmpdir parameter). Files are cleaned up upon completion.
Performance and Memory Guidelines
Optimal Performance Settings
- Use all available memory: BBNorm defaults to maximum memory for best accuracy
- Enable prefilter for large datasets: Especially important when hash tables exceed 50% capacity
- Choose appropriate cell sizes: Match bits per cell to expected maximum coverage depth
- Consider single-pass for speed: When memory is abundant and accuracy requirements are lower
Troubleshooting Common Issues
Hash Table Full Warning
If you see warnings about tables being extremely full (>90% used):
- Add prefilter=t to use memory more efficiently
- Reduce cell size with bits=16 or bits=8
- Increase minprob to filter more spurious kmers
- Quality-trim input reads to reduce error-derived kmers
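A sketch combining these remedies (values illustrative):
bbnorm.sh in=reads.fq out=normalized.fq target=100 prefilter=t bits=16 minprob=0.6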
Support
For questions and support:
- Email: bbushnell@lbl.gov
- Documentation: bbmap.org
- Guide: BBNorm Guide