BBNorm

Script: bbnorm.sh Package: jgi Class: KmerNormalize.java

Normalizes coverage by down-sampling reads over high-depth areas to achieve flat coverage distribution. Accelerates assembly and often improves assembly quality. Uses Count-Min Sketch probabilistic data structure for memory-bounded operation on unlimited dataset sizes.

Overview

BBNorm is designed to normalize coverage by down-sampling reads over high-depth areas of a genome, resulting in a flat coverage distribution. This process can dramatically accelerate assembly and render intractable datasets tractable, while often improving assembly quality.

Key Capabilities

  • Coverage Normalization: Flatten uneven coverage distributions for optimal assembly
  • Error Correction: Kmer-based error correction (though Tadpole is preferred when memory allows)
  • Depth Binning: Separate reads into low, medium, and high depth categories
  • Histogram Generation: Create kmer frequency histograms for genome size estimation
  • Quality Filtering: Remove error-containing reads based on kmer depth profiles

Notable Features

  • Never Runs Out of Memory: Uses Count-Min Sketch probabilistic data structure that gracefully degrades accuracy rather than failing
  • Multipass Processing: Reduces average error rate in normalized output (unlike standard normalization which enriches for errors)
  • Unlimited Kmer Lengths: Supports arbitrarily high k-mer sizes beyond the typical 32-mer limit
  • Multiple Shell Scripts: Three convenience scripts (bbnorm.sh, ecc.sh, khist.sh) with optimized defaults for different use cases

When to Use BBNorm

Recommended Use Cases

  • Flattening uneven coverage prior to assembly, to accelerate assembly and often improve its quality
  • Making intractably large datasets tractable by reducing data volume
  • Kmer-based error correction when available memory is insufficient for Tadpole
  • Generating kmer frequency histograms for genome size estimation
  • Separating reads into low, medium, and high depth bins

When NOT to Use BBNorm

Important: Never normalize data for these applications:

  • Quantitative analysis: ChIP-seq, RNA-seq expression profiling, or any coverage-dependent analysis
  • Variant discovery: Mapping for SNP/indel calling (will introduce bias)
  • High-error platforms: PacBio or Nanopore data (BBNorm is designed for low-error, fixed-length reads)
  • Rare variant detection: May correct minority alleles into majority alleles
  • Already-optimal coverage: Data that already has smooth coverage at an appropriate depth (BBNorm can only reduce coverage, not inflate it)

Alternative: Use random subsampling instead of normalization when you need to reduce data volume for these applications.

Basic Usage

bbnorm.sh in=<input> out=<reads to keep> outt=<reads to toss> hist=<histogram output>
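
For paired reads in two files, the in2 and out2 parameters described under Parameters below can be added; the file names here are placeholders:

bbnorm.sh in=reads_R1.fq in2=reads_R2.fq out=normalized_R1.fq out2=normalized_R2.fq target=100 min=5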

Related Scripts

  • bbnorm.sh - coverage normalization, with optional error correction and depth binning
  • ecc.sh - error correction without discarding reads
  • khist.sh - kmer frequency histogram generation
  • loglog.sh - estimates the number of unique kmers in a dataset, useful for sizing memory

Parameters

Input Parameters

in=null
Primary input. Use in2 for paired reads in a second file
in2=null
Second input file for paired reads in two files
extra=null
Additional files to use for input (generating hash table) but not for output
fastareadlen=2^31
Break up FASTA reads longer than this. Can be useful when processing scaffolded genomes
tablereads=-1
Use at most this many reads when building the hashtable (-1 means all)
kmersample=1
Process every nth kmer, and skip the rest
readsample=1
Process every nth read, and skip the rest
interleaved=auto
May be set to true or false to override autodetection of the input file as paired interleaved
qin=auto
ASCII offset for input quality. May be 33 (Sanger), 64 (Illumina), or auto

Output Parameters

out=<file>
File for normalized or corrected reads. Use out2 for paired reads in a second file
outt=<file>
(outtoss) File for reads that were excluded from primary output
reads=-1
Only process this number of reads, then quit (-1 means all)
sampleoutput=t
Use sampling on output as well as input (not used if sample rates are 1)
keepall=f
Set to true to keep all reads (e.g. if you just want error correction)
zerobin=f
Set to true if you want kmers with a count of 0 to go in the 0 bin instead of the 1 bin in histograms. Default is false, to prevent confusion about how there can be 0-count kmers. The reason is that based on the 'minq' and 'minprob' settings, some kmers may be excluded from the bloom filter
tmpdir=$TMPDIR
This will specify a directory for temp files (only needed for multipass runs). If null, they will be written to the output directory
usetempdir=t
Allows enabling/disabling of temporary directory; if disabled, temp files will be written to the output directory
qout=auto
ASCII offset for output quality. May be 33 (Sanger), 64 (Illumina), or auto (same as input)
rename=f
Rename reads based on their kmer depth

Hashing Parameters

k=31
Kmer length (values under 32 are most efficient, but arbitrarily high values are supported)
bits=32
Bits per cell in bloom filter; must be 2, 4, 8, 16, or 32. Maximum kmer depth recorded is 2^cbits. Automatically reduced to 16 in 2-pass. Large values decrease accuracy for a fixed amount of memory, so use the lowest number you can that will still capture highest-depth kmers
hashes=3
Number of times each kmer is hashed and stored. Higher is slower. Higher is MORE accurate if there is enough memory, and LESS accurate if there is not enough memory
prefilter=f
True is slower, but generally more accurate; filters out low-depth kmers from the main hashtable. The prefilter is more memory-efficient because it uses 2-bit cells
prehashes=2
Number of hashes for prefilter
prefilterbits=2
(pbits) Bits per cell in prefilter
prefiltersize=0.35
Fraction of memory to allocate to prefilter
buildpasses=1
More passes can sometimes increase accuracy by iteratively removing low-depth kmers
minq=6
Ignore kmers containing bases with quality below this
minprob=0.5
Ignore kmers with overall probability of correctness below this
threads=auto
(t) Spawn exactly X hashing threads (default is number of logical processors). Total active threads may exceed X due to I/O threads
rdk=t
(removeduplicatekmers) When true, a kmer's count will only be incremented once per read pair, even if that kmer occurs more than once

Normalization Parameters

fixspikes=f
(fs) Do a slower, high-precision bloom filter lookup of kmers that appear to have an abnormally high depth due to collisions
target=100
(tgt) Target normalization depth. NOTE: All depth parameters control kmer depth, not read depth. For kmer depth Dk, read depth Dr, read length R, and kmer size K: Dr=Dk*(R/(R-K+1)); a worked example follows this parameter list
maxdepth=-1
(max) Reads will not be downsampled when below this depth, even if they are above the target depth
mindepth=5
(min) Kmers with depth below this number will not be included when calculating the depth of a read
minkmers=15
(mgkpr) Reads must have at least this many kmers over min depth to be retained. Aka 'mingoodkmersperread'
percentile=54.0
(dp) Read depth is by default inferred from the 54th percentile of kmer depth, but this may be changed to any number 1-100
uselowerdepth=t
(uld) For pairs, use the depth of the lower read as the depth proxy
deterministic=t
(dr) Generate random numbers deterministically to ensure identical output between multiple runs. May decrease speed with a huge number of threads
passes=2
(p) 1 pass is the basic mode. 2 passes (default) allow greater accuracy, error detection, and better control of output depth
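
As a worked example of the kmer-to-read depth conversion given for target= above (read length chosen for illustration): with read length R=150 and k=31, each read spans R-K+1 = 120 kmers, so a target kmer depth of Dk=100 corresponds to a read depth of Dr = 100 × (150/120) = 125x.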

Error Detection Parameters

hdp=90.0
(highdepthpercentile) Position in sorted kmer depth array used as proxy of a read's high kmer depth
ldp=25.0
(lowdepthpercentile) Position in sorted kmer depth array used as proxy of a read's low kmer depth
tossbadreads=f
(tbr) Throw away reads detected as containing errors
requirebothbad=f
(rbb) Only toss bad pairs if both reads are bad
errordetectratio=125
(edr) Reads with a ratio of at least this much between their high and low depth kmers will be classified as error reads
highthresh=12
(ht) Threshold for high kmer. A kmer at this depth or above is considered non-error
lowthresh=3
(lt) Threshold for low kmer. Kmers at this and below are always considered errors

Error Correction Parameters

ecc=f
Set to true to correct errors. NOTE: Tadpole is now preferred for ecc as it does a better job
ecclimit=3
Correct up to this many errors per read. If more are detected, the read will remain unchanged
errorcorrectratio=140
(ecr) Adjacent kmers with a depth ratio of at least this much between them will be classified as an error
echighthresh=22
(echt) Threshold for high kmer. A kmer at this or above may be considered non-error
eclowthresh=2
(eclt) Threshold for low kmer. Kmers at this and below are considered errors
eccmaxqual=127
Do not correct bases with quality above this value
aec=f
(aggressiveErrorCorrection) Sets more aggressive values of ecr=100, ecclimit=7, echt=16, eclt=3
cec=f
(conservativeErrorCorrection) Sets more conservative values of ecr=180, ecclimit=2, echt=30, eclt=1, sl=4, pl=4
meo=f
(markErrorsOnly) Marks errors by reducing quality value of suspected errors; does not correct anything
mue=t
(markUncorrectableErrors) Marks errors only on uncorrectable reads; requires 'ecc=t'
overlap=f
(ecco) Error correct by read overlap
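
As an illustration of how these flags combine for a correction-only run (file names are placeholders; see also ecc.sh under Usage Examples):

bbnorm.sh in=reads.fq out=corrected.fq ecc=t aec=t keepall=t passes=1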

Depth Binning Parameters

lowbindepth=10
(lbd) Cutoff for low depth bin
highbindepth=80
(hbd) Cutoff for high depth bin
outlow=<file>
Pairs in which both reads have a median below lbd go into this file
outhigh=<file>
Pairs in which both reads have a median above hbd go into this file
outmid=<file>
All other pairs go into this file

Histogram Parameters

hist=<file>
Specify a file to write the input kmer depth histogram
histout=<file>
Specify a file to write the output kmer depth histogram
histcol=3
(histogramcolumns) Number of histogram columns, 2 or 3
pzc=f
(printzerocoverage) Print lines in the histogram with zero coverage
histlen=1048576
Max kmer depth displayed in histogram. Also affects statistics displayed, but does not affect normalization

Peak Calling Parameters

peaks=<file>
Write the peaks to this file. Default is stdout
minHeight=2
(h) Ignore peaks shorter than this
minVolume=5
(v) Ignore peaks with less area than this
minWidth=3
(w) Ignore peaks narrower than this
minPeak=2
(minp) Ignore peaks with an X-value below this
maxPeak=BIG
(maxp) Ignore peaks with an X-value above this
maxPeakCount=8
(maxpc) Print up to this many peaks (prioritizing height)
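
As an illustration, the peak-calling thresholds can be adjusted on the khist.sh command line; the threshold values here are arbitrary examples, not recommendations:

khist.sh in=reads.fq khist=khist.txt peaks=peaks.txt minHeight=4 minWidth=3 maxPeakCount=5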

Java Parameters

-Xmx
This will set Java's memory usage, overriding autodetection. -Xmx20g will specify 20 gigs of RAM, and -Xmx200m will specify 200 megs. The max is typically 85% of physical memory
-eoom
This flag will cause the process to exit if an out-of-memory exception occurs. Requires Java 8u92+
-da
Disable assertions

Usage Examples

Estimating Memory Requirements

loglog.sh in=reads.fq

Estimates the number of unique kmers in a dataset to determine memory requirements. For 1 billion kmers using 16-bit cells and 3 hashes: ~12 GB needed for 50% table occupancy.

Basic Coverage Normalization

bbnorm.sh in=reads.fq out=normalized.fq target=100 min=5

Runs 2-pass normalization to produce reads with average depth of 100x. Reads with apparent depth under 5x are presumed to be errors and discarded.

Error Correction Only

ecc.sh in=reads.fq out=corrected.fq

Performs error correction without discarding reads. Equivalent to bbnorm.sh ecc=t keepall passes=1 bits=16 prefilter.

Kmer Frequency Histogram

khist.sh in=reads.fq khist=khist.txt peaks=peaks.txt

Generates kmer frequency histogram for genome size estimation. The peaks file contains estimates of genome size and ploidy for randomly-sheared genomic DNA.

High-Pass/Low-Pass Filtering

bbnorm.sh in=reads.fq out=highpass.fq outt=lowpass.fq passes=1 target=999999999 min=10

Separates reads: high-depth reads (≥10x) to "out", low-depth reads to "outt".

Three-Bin Depth Sorting

bbnorm.sh in=reads.fq outlow=low.fq outmid=mid.fq outhigh=high.fq passes=1 lowbindepth=10 highbindepth=80

Splits read pairs into low (median depth <10x), medium (10-80x), and high (>80x) coverage bins.

Memory-Efficient Processing

bbnorm.sh in=reads.fq out=normalized.fq target=100 prefilter bits=16 -Xmx8g

Uses prefilter to maximize accuracy when memory is limited. Prefilter uses 2-bit cells for low-depth kmers, main table uses 16-bit cells.

Combined Operations

bbnorm.sh in=reads.fq out=normalized.fq target=100 min=5 prefilter ecc hist=before.txt histout=after.txt

Normalizes, error-corrects, and generates before/after kmer depth histograms in a single run.

Algorithm Details

Count-Min Sketch Data Structure

BBNorm uses a Count-Min Sketch (CMS), sometimes loosely called a "counting Bloom filter". This probabilistic data structure stores only counts, not keys, so hash collisions are never detected or resolved. To mitigate the effect of collisions, each kmer's count is stored in multiple hash table locations, and the minimum value across those locations is used when the count is read.
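
A minimal, self-contained sketch of the idea follows. It is illustrative only and is not the implementation in KmerNormalize.java; the class name, hash mixing, and table sizes are invented for the example.

public class CountMinSketchDemo {

    // One row of cells per hash function; only counts are stored, never kmers.
    private final int[][] cells;
    private final int rows;
    private final int width;

    public CountMinSketchDemo(int rows, int width) {
        this.rows = rows;
        this.width = width;
        this.cells = new int[rows][width];
    }

    // Illustrative per-row mixing; BBNorm's real hash functions differ.
    private int index(long kmer, int row) {
        long h = kmer * 0x9E3779B97F4A7C15L + (row + 1) * 0xC2B2AE3D27D4EB4FL;
        h ^= (h >>> 31);
        return (int) Math.floorMod(h, (long) width);
    }

    // Increment the kmer's count in every row; collisions can only inflate cells.
    public void increment(long kmer) {
        for (int r = 0; r < rows; r++) {
            cells[r][index(kmer, r)]++;
        }
    }

    // The reported count is the minimum over all rows, which mitigates collisions.
    public int count(long kmer) {
        int min = Integer.MAX_VALUE;
        for (int r = 0; r < rows; r++) {
            min = Math.min(min, cells[r][index(kmer, r)]);
        }
        return min;
    }

    public static void main(String[] args) {
        CountMinSketchDemo cms = new CountMinSketchDemo(3, 1 << 20); // 3 hashes, ~1M cells per row
        cms.increment(12345L);
        cms.increment(12345L);
        System.out.println(cms.count(12345L)); // prints 2, or more if a collision occurred
    }
}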

Memory Configuration

Memory Management and Scaling

BBNorm automatically uses all available memory for optimal accuracy. As unique kmer count increases beyond memory capacity, accuracy gradually declines but processing never fails. This allows processing arbitrarily large datasets with fixed memory.

Memory Optimization Guidelines

  • Ideal Load: Keep hash tables under 50-60% full for optimal accuracy
  • Prefilter Flag: Enable when tables exceed 50% capacity (automatic warning at 90%+)
  • Cell Size Selection: Use minimum bits needed for expected coverage (e.g., 8-bit cells for 200x coverage)
  • Memory Estimation: 3 hashes × 16 bits per cell × unique kmer count / (8 bits/byte × 0.5 load) = bytes of RAM needed
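
Plugging the figures from the loglog.sh example (Usage Examples above) into this formula: 3 hashes × 16 bits × 10^9 unique kmers = 4.8×10^10 bits, or 6 GB of cells; dividing by the 0.5 target load gives the ~12 GB quoted earlier.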

Normalization Algorithm

The multi-pass normalization process reduces error enrichment compared to single-pass approaches:

  1. Kmer Counting Pass: Build hash table from input reads, applying quality and probability filters
  2. Depth Assessment: For each read, calculate kmer depth distribution and extract percentile-based depth proxy
  3. Normalization Decision: Compare read depth to target; probabilistically retain reads to achieve flat coverage
  4. Error Detection: Identify reads with aberrant depth patterns indicating sequencing errors
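
The retention decision in step 3 can be sketched as follows; this is not code from KmerNormalize.java, and the class and method names are invented for the example.

import java.util.Random;

public class NormalizationDecisionDemo {

    // Decide whether to keep a read so that the expected retained depth is about the target.
    // readDepth is the percentile-based kmer-depth proxy computed in step 2.
    static boolean keepRead(double readDepth, double target, Random rng) {
        if (readDepth <= target) {
            return true;                                  // at or below target: always keep
        }
        double keepProbability = target / readDepth;      // e.g. depth 400, target 100 -> 0.25
        return rng.nextDouble() < keepProbability;
    }

    public static void main(String[] args) {
        Random rng = new Random(12345);                   // fixed seed, in the spirit of deterministic=t
        System.out.println(keepRead(400.0, 100.0, rng));  // kept roughly 25% of the time
    }
}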

Comparison to Related Tools

Tool             Memory requirement     Kmer storage           Error correction quality   Best use case
BBNorm           Bounded (CMS)          Counts only            Good                       Large dataset normalization
Tadpole          Proportional to data   Exact kmers + counts   Better                     Error correction with sufficient memory
KmerCountExact   Proportional to data   Exact kmers + counts   N/A                        Kmer analysis and exact histograms

Quality Control Features

Temp File Management

Multi-pass processing requires temporary files between passes. BBNorm automatically manages temp file creation, cleanup, and location (configurable via tmpdir parameter). Files are cleaned up upon completion.

Performance and Memory Guidelines

Optimal Performance Settings

Troubleshooting Common Issues

Hash Table Full Warning

If you see warnings about tables being extremely full (>90% used):

  • Add prefilter=t to use memory more efficiently
  • Reduce cell size with bits=16 or bits=8
  • Increase minprob to filter more spurious kmers
  • Quality-trim input reads to reduce error-derived kmers
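
These remedies can be combined in a single run; the minprob value shown is purely illustrative:

bbnorm.sh in=reads.fq out=normalized.fq target=100 prefilter=t bits=16 minprob=0.6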

Support

For questions and support: