BBMerge

Script: bbmerge.sh / bbmerge-auto.sh
Package: jgi
Class: BBMerge.java

Merges paired reads into single reads by overlap detection. The fastest and most accurate overlap-based read merger currently in existence. With sufficient coverage, can merge nonoverlapping reads by kmer extension through Tadpole integration.

Key Features

Dual Memory Modes: bbmerge.sh (1GB fixed) for overlap-based merging vs bbmerge-auto.sh (auto-memory) for kmer operations
Multiple Merging Strategies: Overlap-based, neural network, ratio-based scoring, and kmer extension
Predefined Strictness Levels: Nine preset configurations from xstrict to xloose for different accuracy/throughput trade-offs
Error Correction: Overlap-based correction with optional Tadpole integration for kmer-based error correction
Adapter Detection: Automatic discovery and validation of adapter sequences
Comprehensive Statistics: Insert size histograms, merge rates, false positive/negative rates

Script Selection Guide

bbmerge.sh (Standard Script)

Memory Usage: Fixed 1GB RAM allocation
Primary Use: Overlap-based merging only
Best For: Most standard read merging tasks, amplicon studies
Limitations: Cannot perform kmer extension or error correction with Tadpole

bbmerge-auto.sh (High-Memory Script)

Memory Usage: Attempts to grab all available physical memory
Primary Use: Kmer-based operations using Tadpole or Bloom filters
Best For: Shotgun libraries, merging non-overlapping reads, kmer-based error correction
Requirements: Sufficient coverage (≥5x). Never use with amplicon libraries.
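The choice between the two scripts can be sketched as a small helper. This is only an illustration of the guide above: the function name choose_script and the KMER_FLAGS set are assumptions made for this example, and the set is a hand-picked selection of the Tadpole and Bloom filter flags documented below, not a list taken from BBMerge itself.

```python
# Flags documented below as needing Tadpole or a Bloom filter.
# Hand-picked for illustration; not an exhaustive list from BBMerge.
KMER_FLAGS = {"extend", "extend2", "rem", "rsem", "ecct", "ecctadpole",
              "eccb", "eccbloom", "testmerge"}

def choose_script(args):
    """Return the launcher name for a list of 'key=value' or bare-flag arguments."""
    for arg in args:
        key, _, value = arg.partition("=")
        # A bare flag (empty value) counts as enabled; 0/f/false count as disabled.
        if key in KMER_FLAGS and value not in ("0", "f", "false"):
            return "bbmerge-auto.sh"
    return "bbmerge.sh"

print(choose_script(["in=reads.fq", "out=merged.fq", "vstrict"]))
print(choose_script(["in=reads.fq", "out=merged.fq", "rem", "extend2=50", "k=62"]))
```

The first call selects bbmerge.sh because no kmer flag is present; the second selects bbmerge-auto.sh because rem and extend2 are set.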

When NOT to Use BBMerge

Important: If under 15% of reads merge even at very loose stringency, merging is probably not worthwhile and may introduce false positives.

Single-ended libraries
Long mate-pair libraries not in "innie" orientation
Libraries with very large insert sizes relative to read length
When merge rate is consistently below 15%
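The 15% rule of thumb above amounts to a simple ratio check. A minimal sketch, assuming the pair counts have already been read from a run's statistics; the function name and arguments are hypothetical.

```python
def merging_worthwhile(pairs_total, pairs_merged, min_rate=0.15):
    """Return True if the observed merge rate reaches min_rate (15% by default)."""
    if pairs_total <= 0:
        raise ValueError("pairs_total must be positive")
    return pairs_merged / pairs_total >= min_rate

# 120 of 1000 pairs merged is a 12% merge rate, below the 15% guideline.
print(merging_worthwhile(1000, 120))
```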

Basic Usage

# Interleaved input (overlap-based merging)
bbmerge.sh in=reads.fq out=merged.fq outu=unmerged.fq ihist=ihist.txt

# Twin files (overlap-based merging) 
bbmerge.sh in1=read1.fq in2=read2.fq out=merged.fq outu1=unmerged_R1.fq outu2=unmerged_R2.fq

# Kmer-based operations (requires bbmerge-auto.sh)
bbmerge-auto.sh in=reads.fq out=merged.fq outu=unmerged.fq extend2=50 k=62

Input may be stdin or a file, fasta or fastq, raw or gzipped.

Predefined Strictness Levels

BBMerge provides nine predefined strictness levels that adjust multiple parameters simultaneously based on extensive benchmarking. Stricter settings have lower merge rates and fewer false positives; looser settings have higher merge rates and more false positives.

Strictness Hierarchy (Strictest to Loosest)

xstrict (maxstrict): Maximally strict - lowest false positive rate, lowest merge rate
ustrict (ultrastrict): Ultra strict - very low false positive rate
vstrict (verystrict): Very strict - recommended for assembly preparation
strict: Strict - good balance of accuracy and merge rate
default: Default settings - balanced for general use
loose: Loose - higher merge rate, recommended for insert size distribution
vloose (veryloose): Very loose - much higher merge rate
uloose (ultraloose): Ultra loose - very high merge rate
xloose (maxloose): Maximally loose - highest merge rate, only for low-quality data

Using Strictness Levels

# Very strict for assembly (low false positives)
bbmerge.sh in=reads.fq out=merged.fq vstrict

# Loose for insert size estimation (high merge rate)
bbmerge.sh in=reads.fq out=merged.fq ihist=insert_dist.txt loose

Parameters

Input Parameters

in=null: Primary input. 'in2' will specify a second file.
interleaved=auto: May be set to true or false to override autodetection of whether the input file is interleaved.
reads=-1: Quit after this many read pairs (-1 means all).

Output Parameters

out=<file>: File for merged reads. 'out2' will specify a second file.
outu=<file>: File for unmerged reads. 'outu2' will specify a second file.
outinsert=<file>: (outi) File to write read names and insert sizes.
outadapter=<file>: (outa) File to write consensus adapter sequences.
outc=<file>: File to write input read kmer cardinality estimate.
ihist=<file>: (hist) Insert length histogram output file.
nzo=t: Only print histogram bins with nonzero values.
showhiststats=t: Print extra header lines with statistical information.
ziplevel=2: Set to 1 (lowest) through 9 (max) to change compression level; lower compression is faster.
ordered=f: Output reads in same order as input.
mix=f: Output both the merged (or mergeable) and unmerged reads in the same file (out=). Useful for ecco mode.

Trimming/Filtering Parameters

qtrim=f: Trim read ends to remove bases with quality below trimq. Trims BEFORE merging. Values: t (trim both ends), f (neither end), r (right end only), l (left end only).
qtrim2=f: May be specified instead of qtrim to perform trimming only if merging is unsuccessful, then retry merging.
trimq=10: Trim quality threshold. This may be a comma-delimited list (ascending) to try multiple values.
minlength=1: (ml) Reads shorter than this after trimming, but before merging, will be discarded. Pairs will be discarded only if both are shorter.
maxlength=-1: Reads with longer insert sizes will be discarded.
tbo=f: (trimbyoverlap) Trim overlapping reads to remove rightmost (3') non-overlapping portion, instead of joining.
minavgquality=0: (maq) No merge will be attempted for reads with average quality below this, after trimming.
maxexpectederrors=0: (mee) If positive, no merge will be attempted for reads with more combined expected errors than this.
forcetrimleft=0: (ftl) If nonzero, trim left bases of the read to this position (exclusive, 0-based).
forcetrimright=0: (ftr) If nonzero, trim right bases of the read after this position (exclusive, 0-based).
forcetrimright2=0: (ftr2) If positive, trim this many bases on the right end.
forcetrimmod=5: (ftm) If positive, trim length to be equal to zero modulo this number.
ooi=f: Output only incorrectly merged reads, for testing.
trimpolya=t: Trim trailing poly-A tail from adapter output. Only affects outadapter. This also trims poly-A followed by poly-G, which occurs on NextSeq.
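The length arithmetic behind forcetrimmod and forcetrimright2 can be sketched as follows. This is an illustration rather than BBMerge's code: the function names are hypothetical, and trimming the right end for forcetrimmod is an assumption, since the description above does not say which end is trimmed.

```python
def force_trim_mod(bases, modulo):
    """ftm-style trim: shorten the read so that its length is 0 modulo `modulo`.

    Assumption: the surplus bases are removed from the right end.
    """
    if modulo <= 0:
        return bases
    return bases[:len(bases) - len(bases) % modulo]

def force_trim_right2(bases, count):
    """ftr2-style trim: remove `count` bases from the right end."""
    if count <= 0:
        return bases
    return bases[:max(0, len(bases) - count)]

# A 151 bp read with modulo 5 is shortened to 150 bp.
print(len(force_trim_mod("A" * 151, 5)))
```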

Processing Parameters

usejni=f: (jni) Do overlapping in C code, which is faster. Requires compiling the C code; details are in /jni/README.txt. However, the jni path is currently disabled.
merge=t: Create merged reads. If set to false, you can still generate an insert histogram.
ecco=f: Error-correct the overlapping part, but don't merge.
trimnonoverlapping=f: (tno) Trim all non-overlapping portions, leaving only consensus sequence. By default, only sequence to the right of the overlap (adapter sequence) is trimmed.
useoverlap=t: Attempt to find the insert size using read overlap.
mininsert=15: Minimum insert size to merge reads.
mininsert0=12: Insert sizes less than this will not be considered. Must be less than or equal to mininsert.
minoverlap=12: Minimum number of overlapping bases to allow merging.
minoverlap0=8: Overlaps shorter than this will not be considered. Must be less than or equal to minoverlap.
minq=9: Ignore bases with quality below this.
maxq=41: Cap output quality scores at this.
quantize=1: Set to a higher number to eliminate some quality scores for a lower output filesize.
entropy=t: Increase the minimum overlap requirement for low-complexity reads.
efilter=6: Ban overlaps with over this many times the expected number of errors. Lower is more strict. -1 disables.
pfilter=0.00004: Ban improbable overlaps. Higher is more strict. 0 will disable the filter; 1 will allow only perfect overlaps.
kfilter=0: Ban overlaps that create kmers with count below this value (0 disables). If this is used, minprob should probably be set to 0. Requires good coverage.
ouq=f: Calculate best overlap using quality values.
owq=t: Calculate best overlap without using quality values.
usequality=t: If disabled, quality values are completely ignored, both for overlap detection and filtering. May be useful for data with inaccurate quality values.
iupacton=f: (itn) Change ambiguous IUPAC symbols to N.
adapter=: Specify the adapter sequences used for these reads, if known; this can be a fasta file or a literal sequence. Read 1 and 2 can have adapters specified independently with the adapter1 and adapter2 flags. adapter=default will use a list of common adapter sequences.

Neural Network Mode Parameters

nn=t: Use a neural network for increased merging accuracy. This is highly recommended, but will conflict with strictness and ratiomode flags. Stringency in nn mode should be adjusted via the cutoff flag instead.
cutoff=0.872857: Merge reads with nn score above this value. Lower will increase the merge rate at the cost of false positives.
net=<file>: Optional network to specify (for developer use); the default is bbmap/resources/bbmerge.bbnet.

Ratio Mode Parameters

ratiomode=t: Score overlaps based on the ratio of matching to mismatching bases.
maxratio=0.09: Max error rate; higher increases merge rate.
ratiomargin=5.5: Lower increases merge rate; min is 1.
ratiooffset=0.55: Lower increases merge rate; min is 0.
maxmismatches=20: Maximum mismatches allowed in overlapping region.
ratiominoverlapreduction=3: This is the difference between minoverlap in flat mode and minoverlap in ratio mode; generally, minoverlap should be lower in ratio mode.
minsecondratio=0.1: Cutoff for second-best overlap ratio.
forcemerge=f: Disable all filters and just merge everything (not recommended).

Strictness Parameters

These are mutually exclusive macros that set other parameters.

strict=f: Decrease false positive rate and merging rate.
verystrict=f: (vstrict) Greatly decrease FP and merging rate.
ultrastrict=f: (ustrict) Decrease FP and merging rate even more.
maxstrict=f: (xstrict) Maximally decrease FP and merging rate.
loose=f: Increase false positive rate and merging rate.
veryloose=f: (vloose) Greatly increase FP and merging rate.
ultraloose=f: (uloose) Increase FP and merging rate even more.
maxloose=f: (xloose) Maximally increase FP and merging rate.
fast=f: Fastest possible mode; less accurate.

Tadpole Parameters (for read extension and error-correction)

Note: These require more memory and should be run with bbmerge-auto.sh.

k=31: Kmer length. 31 (or less) is fastest and uses the least memory, but higher values may be more accurate. 60 tends to work well for 150bp reads.
extend=0: Extend reads to the right this much before merging. Requires sufficient (>5x) kmer coverage.
extend2=0: Extend reads this much only after a failed merge attempt, or in rem/rsem mode.
iterations=1: (ei) Iteratively attempt to extend by extend2 distance and merge up to this many times.
rem=f: (requireextensionmatch) Do not merge if the predicted insert size differs before and after extension. However, if only the extended reads overlap, then that insert will be used. Requires setting extend2.
rsem=f: (requirestrictextensionmatch) Similar to rem but stricter. Reads will only merge if the predicted insert size before and after extension match. Requires setting extend2. Enables the lowest possible false-positive rate.
ecctadpole=f: (ecct) If reads fail to merge, error-correct with Tadpole and try again. This happens prior to extend2.
reassemble=t: If ecct is enabled, use Tadpole's reassemble mode for error correction. Alternatives are pincer and tail.
removedeadends: (shave) Remove kmers leading to dead ends.
removebubbles: (rinse) Remove kmers in error bubbles.
mindepthseed=3: (mds) Minimum kmer depth to begin extension.
mindepthextend=2: (mde) Minimum kmer depth to continue extension.
branchmult1=20: Min ratio of 1st to 2nd-greatest path depth at high depth.
branchmult2=3: Min ratio of 1st to 2nd-greatest path depth at low depth.
branchlower=3: Max value of 2nd-greatest path depth to be considered low.
ibb=t: Ignore backward branches when extending.
extra=<file>: A file or comma-delimited list of files of reads to use for kmer counting, but not for merging or output.
prealloc=f: Pre-allocate memory rather than dynamically growing; faster and more memory-efficient for large datasets. A float fraction (0-1) may be specified, default 1.
prefilter=0: If set to a positive integer, use a countmin sketch to ignore kmers with depth of that value or lower, to reduce memory usage.
filtermem=0: Allows manually specifying prefilter memory in bytes, for deterministic runs. 0 will set it automatically.
minprob=0.5: Ignore kmers with overall probability of correctness below this, to reduce memory usage.
minapproxoverlap=26: For rem mode, do not merge reads if the extended reads indicate that the raw reads should have overlapped by at least this much, but no overlap was found.

Bloom Filter Parameters (for kmer operations with less memory than Tadpole)

Note: These require more memory than overlap-only merging and should be run with bbmerge-auto.sh.

eccbloom=f: (eccb) If reads fail to merge, error-correct with bbcms and try again.
testmerge=f: Test kmer counts around the read merge junctions. If it appears that the merge created new errors, undo it. This reduces the false-positive rate, but not as much as rem or rsem.

Java Parameters

-Xmx: This will set Java's memory usage, overriding autodetection. For example, -Xmx400m will specify 400 MB RAM.
-eoom: This flag will cause the process to exit if an out-of-memory exception occurs. Requires Java 8u92+.
-da: Disable assertions.

Usage Examples

Basic Merging

# Basic overlap-based merging with insert size histogram
bbmerge.sh in=reads.fq out=merged.fq outu=unmerged.fq ihist=ihist.txt

# Twin file format
bbmerge.sh in1=reads_R1.fq in2=reads_R2.fq out=merged.fq outu1=unmerged_R1.fq outu2=unmerged_R2.fq

Basic merging by overlap detection. Insert size histogram can be produced even without specifying output files.

Overlap-based Error Correction

# Error-correct overlapping bases without merging
bbmerge.sh in=reads.fq out=corrected.fq ecco mix

Corrects reads that overlap rather than merging them. Quality scores are increased where reads agree, reduced where they disagree.
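How agreement and disagreement might change quality scores at a single overlapping position is sketched below. The rule used here (add qualities on agreement, capped at maxq; keep the higher-quality base with the quality difference on disagreement) is a common convention chosen for illustration and is an assumption, not BBMerge's documented formula. The function name is hypothetical.

```python
def correct_column(base1, qual1, base2, qual2, maxq=41):
    """Return (base, quality) for one overlapping position.

    Assumed rule, not taken from BBMerge: agreement adds the two qualities,
    capped at maxq; disagreement keeps the higher-quality base and assigns
    it the difference of the two qualities.
    """
    if base1 == base2:
        return base1, min(maxq, qual1 + qual2)
    if qual1 >= qual2:
        return base1, qual1 - qual2
    return base2, qual2 - qual1

print(correct_column("A", 30, "A", 30))  # agreement: quality rises to the cap
print(correct_column("A", 35, "C", 10))  # disagreement: quality drops
```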

Kmer-based Merging of Non-overlapping Reads

# Merge non-overlapping reads using kmer extension (requires bbmerge-auto.sh)
bbmerge-auto.sh in=reads.fq out=merged.fq outu=unmerged.fq ihist=ihist.txt ecct extend2=20 iterations=5

Attempts overlap merging first, then error-corrects with Tadpole, then extends reads by up to 20bp per iteration for up to 5 iterations. Can increase mergeable insert size by up to 200bp.
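The 200bp figure follows from both reads of a pair being extended toward each other in every iteration. A minimal sketch of that arithmetic, with a hypothetical function name:

```python
def max_insert_gain(extend2, iterations, reads_per_pair=2):
    """Upper bound on the extra insert length reachable by iterative extension."""
    return reads_per_pair * extend2 * iterations

# extend2=20 and iterations=5: 2 reads x 20 bp x 5 iterations
print(max_insert_gain(20, 5))  # 200
```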

Conservative Kmer-based Merging

# Reduce false positives in repetitive regions
bbmerge-auto.sh in=reads.fq out=merged.fq outu=unmerged.fq rem extend2=50 k=62

# Even stricter validation
bbmerge-auto.sh in=reads.fq out=merged.fq outu=unmerged.fq rsem extend2=50 k=62

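REM mode refuses to merge when the insert size predicted from the raw reads differs from the one predicted after extension; if only the extended reads overlap, that insert size is used. RSEM mode is stricter: reads merge only when the predictions before and after extension match.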
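The difference between the two modes can be written as a small decision function. This is a sketch based on the rem and rsem parameter descriptions, not BBMerge's code: the function name is hypothetical, None stands for "no overlap found", the minapproxoverlap check is left out, and returning None when only the raw reads overlap is an assumption.

```python
def decide_insert(raw_insert, extended_insert, mode):
    """Return the insert size to merge at, or None to leave the pair unmerged.

    raw_insert / extended_insert: insert size predicted before / after
    extension, or None if no overlap was found at that stage.
    """
    if mode not in ("rem", "rsem"):
        raise ValueError("mode must be 'rem' or 'rsem'")
    if raw_insert is not None and extended_insert is not None:
        # Both modes require the two predictions to agree.
        return raw_insert if raw_insert == extended_insert else None
    if mode == "rem" and raw_insert is None and extended_insert is not None:
        # rem only: accept an overlap that exists solely after extension.
        return extended_insert
    return None

print(decide_insert(None, 320, "rem"))   # 320
print(decide_insert(None, 320, "rsem"))  # None
```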

Adapter Discovery and Validation

# Discover adapter sequences from short-insert reads
bbmerge.sh in=reads.fq outa=adapters.fa

# Use known adapter sequences to improve accuracy
bbmerge.sh in=reads.fq out=merged.fq adapter1=AGATCGGAAGAGC adapter2=AGATCGGAAGAGC

# Use discovered adapters
bbmerge.sh in=reads.fq out=merged.fq adapter=adapters.fa

Adapters can be discovered from short-insert pairs; the resulting sequences are then used to validate merges, which substantially increases accuracy.

Optimal Accuracy Command

# Brian's recommended command for maximum accuracy (requires bbmerge-auto.sh)
bbmerge-auto.sh in=reads.fq out=merged.fq adapter1=SEQUENCE1 adapter2=SEQUENCE2 rem k=62 extend2=50 ecct

For shotgun libraries with known adapters and sufficient depth (≥5x), this maximizes correct merges and minimizes incorrect merges.

Strictness Level Examples

# Very strict for assembly (minimize false positives)
bbmerge.sh in=reads.fq out=merged.fq vstrict

# Loose for insert size distribution (maximize merge rate)
bbmerge.sh in=reads.fq ihist=insert_dist.txt loose

# Perfect overlaps only
bbmerge.sh in=reads.fq out=merged.fq pfilter=1

Different strictness levels for different applications. Use strict modes for downstream assembly, loose modes for insert size estimation.

Algorithm Details

Memory Architecture

BBMerge implements two distinct memory management strategies:

Overlap Mode (bbmerge.sh): Fixed 1GB allocation optimized for overlap detection algorithms; throughput scales linearly with thread count
Kmer Mode (bbmerge-auto.sh): Dynamic memory allocation attempting to use all available RAM for kmer table storage and extension operations

Overlap Detection Strategy

BBMerge employs a multi-tiered overlap validation system:

Primary Detection Methods

Ratio Mode: Calculates ratio of matching to mismatching bases with configurable maxratio threshold (default 0.09)
Flat Mode: Uses fixed mismatch counts with quality-weighted scoring
Neural Network Mode: Trained neural network (bbmerge.bbnet) for high-accuracy overlap validation
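The idea behind ratio-based overlap scoring can be sketched with a brute-force scan. This is a deliberately simplified illustration, not BBMerge's algorithm: it ignores quality scores, ratiomargin, ratiooffset and the second-best-overlap check, and it only considers inserts at least as long as a read. The function names are hypothetical; minoverlap and maxratio reuse the documented defaults.

```python
def reverse_complement(seq):
    return seq.translate(str.maketrans("ACGTN", "TGCAN"))[::-1]

def best_overlap(read1, read2, minoverlap=12, maxratio=0.09):
    """Return the insert size of the best overlap, or None if none passes.

    Tries every overlap length between the 3' end of read1 and the reverse
    complement of read2, and keeps the one with the lowest mismatch ratio
    (ties go to the longer overlap).
    """
    r2 = reverse_complement(read2)
    best = None  # (mismatch ratio, -overlap length, insert size)
    for overlap in range(min(len(read1), len(r2)), max(1, minoverlap) - 1, -1):
        allowed = int(maxratio * overlap)
        mismatches = 0
        for a, b in zip(read1[-overlap:], r2[:overlap]):
            if a != b:
                mismatches += 1
                if mismatches > allowed:
                    break  # early termination: too many mismatches
        else:
            candidate = (mismatches / overlap, -overlap,
                         len(read1) + len(r2) - overlap)
            if best is None or candidate < best:
                best = candidate
    return None if best is None else best[2]

fragment = ("ACGTTGCAAGCTTAGGCTAA" "TTGACCGATCGGATACCATG" "ACTACTTCAACTCATTACCA")
read1 = fragment[:40]                       # forward read
read2 = reverse_complement(fragment[20:])   # reverse read
print(best_overlap(read1, read2))           # 60, the fragment length
```

The nested loop makes the cost quadratic in read length in the worst case; the early break keeps most non-matching offsets cheap.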

Quality Integration

Quality scores are integrated through probabilistic error models:

Expected Error Calculation: Computes expected errors as Σ(10^(-Q/10)) across overlap region
Probabilistic Filtering: Removes overlaps with probability below pfilter threshold using quality-based likelihood
Error Rate Validation: Applies efilter to reject overlaps exceeding expected error rate by specified ratio
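The expected-error sum and the efilter comparison can be sketched as below. The formula is the one given above; how BBMerge gathers the quality values for an overlap is not specified here, so the functions simply take a list of Phred scores. The function names are hypothetical.

```python
def expected_errors(quals):
    """Sum of per-base error probabilities, 10^(-Q/10), over Phred scores."""
    return sum(10 ** (-q / 10) for q in quals)

def passes_efilter(mismatches, quals, efilter=6):
    """Reject overlaps with more than efilter times the expected errors.

    A negative efilter disables the check, as with efilter=-1.
    """
    if efilter < 0:
        return True
    return mismatches <= efilter * expected_errors(quals)

# Twenty Q30 bases carry 0.02 expected errors, so 5 mismatches is far too many.
print(passes_efilter(5, [30] * 20))
```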

Entropy-based Low-Complexity Detection

For repetitive sequences, BBMerge dynamically adjusts overlap requirements:

Calculates sequence entropy using k-mer frequency analysis (typically 3-mer)
Increases minimum overlap length for low-entropy sequences based on minentropy threshold
Prevents false merging of homopolymer runs and repetitive elements
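One way to express this is Shannon entropy over a read's 3-mers. The sketch below is an assumption-laden illustration: the entropy measure, the function names, and the max_bonus scaling rule are all made up for this example and are not BBMerge's actual formula.

```python
import math
from collections import Counter

def kmer_entropy(seq, k=3):
    """Shannon entropy (bits) of the k-mer distribution in seq."""
    n = len(seq) - k + 1
    if n <= 0:
        return 0.0
    counts = Counter(seq[i:i + k] for i in range(n))
    return -sum(c / n * math.log2(c / n) for c in counts.values())

def adjusted_minoverlap(seq, minoverlap=12, k=3, max_bonus=12):
    """Hypothetical rule: require up to max_bonus extra bases as entropy falls."""
    n = len(seq) - k + 1
    if n <= 1:
        return minoverlap + max_bonus
    max_entropy = math.log2(min(n, 4 ** k))  # entropy if all k-mers were distinct
    deficit = 1 - kmer_entropy(seq, k) / max_entropy
    return minoverlap + round(max_bonus * deficit)

print(adjusted_minoverlap("A" * 30))                          # homopolymer: maximum requirement
print(adjusted_minoverlap("ACGTTGCAAGCTTAGGCTAATTGACCGATC"))  # complex read: close to minoverlap
```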

Kmer Extension Algorithm

Integration with Tadpole enables extension of non-overlapping reads:

Extension Strategy

Kmer Table Construction: Creates hash tables from input reads using configurable k-mer length (default k=31)
Coverage Thresholds: Uses minimum depth requirements (mindepthseed=3, mindepthextend=2) to ensure reliable extension
Branch Resolution: Applies ratio-based thresholds (branchmult1=20, branchmult2=3) to navigate through genomic repeats
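The extension strategy above can be sketched with a plain kmer counter. This is a simplified illustration, not Tadpole's implementation: it counts forward-strand kmers only, keeps everything in a Python Counter, and uses hypothetical function names. The depth and branch parameters reuse the documented defaults.

```python
from collections import Counter

def count_kmers(reads, k):
    """Count every forward-strand kmer in the reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts

def extend_right(read, counts, k, max_extend, mindepthseed=3, mindepthextend=2,
                 branchmult1=20, branchmult2=3, branchlower=3):
    """Extend read to the right by up to max_extend bases (requires k >= 2)."""
    if len(read) < k or counts[read[-k:]] < mindepthseed:
        return read  # seed kmer too shallow to start extending
    extended = read
    for _ in range(max_extend):
        suffix = extended[-(k - 1):]
        depths = sorted(((counts[suffix + b], b) for b in "ACGT"), reverse=True)
        (best, base), (second, _) = depths[0], depths[1]
        # Low-depth competitors use the lenient ratio, high-depth ones the strict ratio.
        mult = branchmult2 if second <= branchlower else branchmult1
        if best < mindepthextend or best < mult * second:
            break  # too shallow, or an unresolved branch
        extended += base
    return extended

genome = "ACGTTGCAAGCTTAGGCTAATTGACCGATCGGATACCATGACTACTTCAACTCATTACCA"
counts = count_kmers([genome] * 3, k=8)  # 3x coverage of a toy sequence
print(extend_right(genome[:30], counts, k=8, max_extend=10) == genome[:40])
```

With a single copy of the toy sequence the seed depth is 1, below mindepthseed, so the read is returned unchanged; this is the reason kmer extension needs sufficient coverage.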

Error Correction Integration

Tadpole Error Correction: Applies reassemble, pincer, or tail modes for sequence correction before extension
Bubble Removal: Eliminates error bubbles through rinse algorithm
Dead-end Pruning: Removes kmers leading to graph termination through shave algorithm

Performance Characteristics

Computational Complexity

Overlap Detection: O(L²) where L is read length, with early termination when mismatch thresholds exceeded
Kmer Operations: O(K×D) where K is unique kmer count and D is average coverage depth
Threading: Linear scaling with processor cores, optimized for systems with 20+ cores when using separate input files

Memory Usage Patterns

Overlap-only: ~1GB base allocation plus read buffer space
Kmer extension: 10-50GB typical, scales with genome size and k-mer length
Neural network: Additional ~100MB for model storage and inference

Accuracy Control

BBMerge provides comprehensive accuracy tuning through preset parameter combinations:

Strictness Presets: Nine predefined levels automatically adjust maxratio, ratiomargin, pfilter, and overlap thresholds
False Positive Control: Configurable through multiple independent filters (ratio, probability, neural network, adapter validation)
Extension Validation: REM and RSEM modes compare insert size predictions before/after extension to minimize false merges

Best Practices

Data Preparation

Adapter Trimming: Recommended prior to merging, particularly for kmer-based operations
Quality Trimming: Generally not recommended unless there are severe quality issues at read ends; if needed, use weak trimming only (qtrim=r trimq=8)
Library Assessment: Check merge rate with default settings; if under 15%, consider whether merging is worthwhile

Parameter Selection

Shotgun Libraries: Can use all modes; kmer-based extension requires ≥5x coverage
Amplicon Libraries: Never use kmer-based operations; stick to overlap-based merging only
Assembly Preparation: Use strict modes (vstrict recommended) to minimize false positives
Insert Size Analysis: Use loose modes to maximize merge rate for accurate distributions

Performance Optimization

Threading: Allow automatic detection of all cores for optimal performance
Input Format: Use separate files (in1/in2) rather than interleaved for systems with 20+ cores
Memory Management: Use bbmerge-auto.sh only when kmer operations are needed
JNI Acceleration: An optional C component (usejni) can provide ~20% speed improvement if compiled; note that the jni path is currently disabled (see usejni under Processing Parameters)

Support

For questions and support:

Email: bbushnell@lbl.gov
Documentation: bbmap.org
Detailed Guide: Read bbmap/docs/guides/BBMergeGuide.txt