BBMap

Script: bbmap.sh
Package: align2
Class: BBMap.java

Splice-aware global aligner for DNA and RNA sequencing reads. Fast and accurate mapping for all major platforms including Illumina, 454, Sanger, Ion Torrent, PacBio, and Nanopore using a multi-kmer-seed-and-extend algorithm.

Overview

BBMap is designed for accurate mapping of sequencing reads to reference genomes and is particularly effective with highly mutated genomes or reads containing long indels (including whole-gene deletions over 100 kbp). It has no upper limit on genome size or contig number and has successfully mapped to genomes as large as 85 gigabases with over 200 million contigs.

The aligner can output comprehensive statistics including empirical read quality histograms, insert-size distributions, and genome coverage statistics, making it valuable for quality control of libraries and sequencing runs, or evaluating new sequencing platforms.

Basic Usage

bbmap.sh ref=<fasta> in=<reads> out=<sam>

Index only: bbmap.sh ref=<fasta>

Map to existing index: bbmap.sh in=<reads> out=<sam>

Map without writing index: bbmap.sh ref=<fasta> in=<reads> out=<sam> nodisk

Standard input/output: in=stdin will accept reads from standard input, and out=stdout will write to standard output. File extensions are still needed to specify the format, e.g., in=stdin.fa.gz will read gzipped fasta from standard input.

Index Management

BBMap requires an indexed reference before mapping. Index location and management follow specific patterns:

Index Location and Build System

By default, BBMap writes indices to a /ref/ subdirectory of your current working directory. If you run BBMap from /home/user/work/, the index will be written to /home/user/work/ref/. Multiple references can be indexed in the same directory using build numbers.

Index Workflow Examples

# Build index for reference A
bbmap.sh ref=A.fa
# Index written to ./ref/ as build 1

# Map to existing index
bbmap.sh in=reads.fq out=mapped.sam
# Loads existing index from ./ref/

# Build second reference with specific build number
bbmap.sh ref=B.fa build=2
# Index written to ./ref/ as build 2

# Build index in custom location
bbmap.sh ref=C.fa path=/custom/location/
# Index written to /custom/location/ref/

Important: Do not have multiple processes write an index to the same location simultaneously, as this will create a corrupt index that must be deleted and regenerated. Use the nodisk flag for parallel mapping jobs, or build the index once and then map multiple files simultaneously.
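One way to follow this advice is to build the index in a single process and only then start the mapping jobs. The sketch below is a dry run that prints the commands instead of executing them. The print_map_jobs helper and all file names are hypothetical placeholders.

```shell
#!/bin/sh
# Dry-run sketch: print an "index once, then map many" command plan.
# print_map_jobs and the file names are hypothetical, not part of BBTools.
print_map_jobs() {
    ref="$1"; shift
    # Step 1: a single process writes the index to ./ref/
    echo "bbmap.sh ref=$ref"
    # Step 2: the mapping jobs only read the existing index, so they may overlap
    for reads in "$@"; do
        echo "bbmap.sh in=$reads out=${reads%.fq}.sam &"
    done
    # Step 3: block until every background job has finished
    echo "wait"
}

print_map_jobs genome.fa sampleA.fq sampleB.fq
```

Piping the printed plan to sh would run it. Each mapping job is a separate process, so memory and the threads flag should be budgeted per job.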

Memory Management

BBMap memory usage is predictable and scales with reference genome size:

Normal mode: ~6 bytes per reference base
Low-memory mode: ~3 bytes per reference base (use the usemodulo flag)
Human genome: ~24GB RAM normally, ~12GB with usemodulo
Thread memory: Additional memory per thread for alignment matrices

Memory usage increases with kmer length. To estimate memory requirements for a specific kmer length, run stats.sh in=reference.fa k=13 before indexing.
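The bytes-per-base figures above allow a quick back-of-the-envelope estimate before indexing. The sketch below is only that arithmetic. estimate_index_gb is a hypothetical helper, and the 3.1 Gbp genome size is an assumed example value.

```shell
#!/bin/sh
# Rough index-size estimate from the bytes-per-base figures above.
# estimate_index_gb is a hypothetical helper, not part of BBTools.
estimate_index_gb() {
    bases="$1"            # number of reference bases
    bytes_per_base="$2"   # ~6 in normal mode, ~3 with usemodulo
    awk -v b="$bases" -v p="$bytes_per_base" \
        'BEGIN { printf "%.1f\n", b * p / 1e9 }'
}

estimate_index_gb 3100000000 6   # assumed ~3.1 Gbp genome, normal mode: prints 18.6
estimate_index_gb 3100000000 3   # same genome with usemodulo: prints 9.3
```

These numbers cover the index alone and come out below the ~24GB and ~12GB quoted for the human genome, so they are best read as a lower bound that leaves out per-thread memory and other overhead.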

Low-Memory Mode

The usemodulo flag discards ~80% of kmers, selected by their remainder modulo a number. This reduces memory usage by ~50% at a slight cost in sensitivity. The flag must be set both when building the index and when mapping:

# Build low-memory index
bbmap.sh ref=reference.fa usemodulo

# Map using low-memory index
bbmap.sh in=reads.fq out=mapped.sam usemodulo

Parameters

BBMap supports over 140 parameters organized by function. Parameters are grouped to match their role in the mapping process.

Indexing Parameters (required when building the index)

nodisk=f: Set to true to build index in memory and write nothing to disk except output.
ref=<file>: Specify the reference sequence. Only do this ONCE, when building the index (unless using 'nodisk').
build=1: If multiple references are indexed in the same directory, each needs a unique numeric ID (unless using 'nodisk'). Later, this flag can be used to select an index.
k=13: Kmer length, range 8-15. Longer is faster but uses more memory. Shorter is more sensitive. If indexing and mapping are done in two steps, K should be specified each time.
path=<.>: Specify the location to write the index, if you don't want it in the current working directory.
usemodulo=f: Throw away ~80% of kmers based on remainder modulo a number (reduces RAM by 50% and sensitivity slightly). Should be enabled both when building the index AND when mapping.
rebuild=f: Force a rebuild of the index (ref= should be set).

Input Parameters

in=<file>: Primary reads input; required parameter.
in2=<file>: For paired reads in two files.
interleaved=auto: True forces paired/interleaved input; false forces single-ended mapping. If not specified, interleaved status will be autodetected from read names.
fastareadlen=500: Break up FASTA reads longer than this. The default and maximum is 500 for bbmap.sh and 6000 for mapPacBio.sh. Only works for FASTA input (use 'maxlen' for FASTQ input).
unpigz=f: Spawn a pigz (parallel gzip) process for faster decompression than using Java. Requires pigz to be installed.
touppercase=t: (tuc) Convert lowercase letters in reads to upper case (otherwise they will not match the reference).

Sampling Parameters

reads=-1: Set to a positive number N to only process the first N reads (or pairs), then quit. -1 means use all reads.
samplerate=1: Set to a number from 0 to 1 to randomly select that fraction of reads for mapping. 1 uses all reads.
skipreads=0: Set to a number N to skip the first N reads (or pairs), then map the rest.

Mapping Parameters

fast=f: This flag is a macro which sets other parameters to run faster, at reduced sensitivity. Bad for RNA-seq. Sets tipsearch=20, maxindel=80, minhits=2, bwr=0.18, bw=40, minratio=0.65, midpad=150, minscaf=50, quickmatch=t, rescuemismatches=15, rescuedist=800, maxsites=3, maxsites2=100.
slow=f: This flag is a macro which sets other parameters to run slower, at greater sensitivity. Sets tipsearch=150, minhits=1, minratio=0.45. 'vslow' is even slower.
maxindel=16000: Don't look for indels longer than this. Lower is faster. Set to >=100k for RNAseq with long introns like mammals.
strictmaxindel=f: When enabled, do not allow indels longer than 'maxindel'. By default these are not sought, but may be found anyway.
tipsearch=100: Look this far for read-end deletions with anchors shorter than K, using brute force.
minid=0.76: Approximate minimum alignment identity to look for. Higher is faster and less sensitive.
minhits=1: Minimum number of seed hits required for candidate sites. Higher is faster.
local=f: Set to true to use local, rather than global, alignments. This will soft-clip ugly ends of poor alignments.
perfectmode=f: Allow only perfect mappings when set to true (very fast).
semiperfectmode=f: Allow only perfect and semiperfect (perfect except for N's in the reference) mappings.
threads=auto: (t) Set to number of threads desired. By default, uses all cores available.
ambiguous=best: (ambig) Set behavior on ambiguously-mapped reads (with multiple top-scoring mapping locations).
  best - use the first best site
  toss - consider unmapped
  random - select one top-scoring site randomly
  all - retain all top-scoring sites
samestrandpairs=f: (ssp) Specify whether paired reads should map to the same strand or opposite strands.
requirecorrectstrand=t: (rcs) Forbid pairing of reads without correct strand orientation. Set to false for long-mate-pair libraries.
killbadpairs=f: (kbp) If a read pair is mapped with an inappropriate insert size or orientation, the read with the lower mapping quality is marked unmapped.
pairedonly=f: (po) Treat unpaired reads as unmapped. Thus they will be sent to 'outu' but not 'outm'.
rcomp=f: Reverse complement both reads prior to mapping (for LMP outward-facing libraries).
rcompmate=f: Reverse complement read2 prior to mapping.
pairlen=32000: Set max allowed distance between paired reads. (insert size)=(pairlen)+(read1 length)+(read2 length)
rescuedist=1200: Don't try to rescue paired reads if avg. insert size greater than this. Lower is faster.
rescuemismatches=32: Maximum mismatches allowed in a rescued read. Lower is faster.
averagepairdist=100: (apd) Initial average distance between paired reads. Varies dynamically; does not need to be specified.
deterministic=f: Run in deterministic mode. In this case it is good to set averagepairdist. BBMap is deterministic without this flag when using single-ended reads or when run single-threaded.
bandwidthratio=0: (bwr) If above zero, restrict alignment band to this fraction of read length. Faster but less accurate.
bandwidth=0: (bw) Set the bandwidth directly. Faster but less accurate.
usejni=f: (jni) Do alignments faster, in C code. Requires compiling the C code; details are in /jni/README.txt.
maxsites2=800: Don't analyze (or print) more than this many alignments per read.
ignorefrequentkmers=t: (ifk) Discard low-information kmers that occur often.
excludefraction=0.03: (ef) Fraction of kmers to ignore. For example, 0.03 will ignore the most common 3% of kmers.
greedy=t: Use a greedy algorithm to discard the least-useful kmers on a per-read basis.
kfilter=0: If positive, potential mapping sites must have at least this many consecutive exact matches.
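The relation given for pairlen above, (insert size)=(pairlen)+(read1 length)+(read2 length), can be checked with a line of arithmetic. The sketch below assumes 2x150 bp reads as an example, and max_insert_size is a hypothetical helper.

```shell
#!/bin/sh
# Largest insert size allowed by a given pairlen, per the relation above.
# max_insert_size is a hypothetical helper, not part of BBTools.
max_insert_size() {
    pairlen="$1"; read1_len="$2"; read2_len="$3"
    echo $(( pairlen + read1_len + read2_len ))
}

max_insert_size 32000 150 150   # default pairlen with 2x150 bp reads: prints 32300
```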

Quality and Trimming Parameters

qin=auto: Set to 33 or 64 to specify input quality value ASCII offset. 33 is Sanger, 64 is old Solexa.
qout=auto: Set to 33 or 64 to specify output quality value ASCII offset (only if output format is fastq).
qtrim=f: Quality-trim ends before mapping. Options are: 'f' (false), 'l' (left), 'r' (right), and 'lr' (both).
untrim=f: Undo trimming after mapping. Untrimmed bases will be soft-clipped in cigar strings.
trimq=6: Trim regions with average quality below this (phred algorithm).
mintrimlength=60: (mintl) Don't trim reads to be shorter than this.
fakefastaquality=-1: (ffq) Set to a positive number 1-50 to generate fake quality strings for fasta input reads.
ignorebadquality=f: (ibq) Keep going, rather than crashing, if a read has out-of-range quality values.
usequality=t: Use quality scores when determining which read kmers to use as seeds.
minaveragequality=0: (maq) Do not map reads with average quality below this.
maqb=0: If positive, calculate maq from this many initial bases.

Output Parameters

out=<file>: Write all reads to this file.
outu=<file>: Write only unmapped reads to this file. Does not include unmapped paired reads with a mapped mate.
outm=<file>: Write only mapped reads to this file. Includes unmapped paired reads with a mapped mate.
mappedonly=f: If true, treats 'out' like 'outm'.
bamscript=<file>: (bs) Write a shell script to <file> that will turn the sam output into a sorted, indexed bam file.
ordered=f: Set to true to output reads in same order as input. Slower and uses more memory.
overwrite=f: (ow) Allow process to overwrite existing files.
secondary=f: Print secondary alignments.
sssr=0.95: (secondarysitescoreratio) Print only secondary alignments with score of at least this fraction of primary.
ssao=f: (secondarysiteasambiguousonly) Only print secondary alignments for ambiguously-mapped reads.
maxsites=5: Maximum number of total alignments to print per read. Only relevant when secondary=t.
quickmatch=f: Generate cigar strings more quickly.
trimreaddescriptions=f: (trd) Truncate read and ref names at the first whitespace, assuming that the remainder is a comment or description.
ziplevel=2: (zl) Compression level for zip or gzip output.
pigz=f: Spawn a pigz (parallel gzip) process for faster compression than Java. Requires pigz to be installed.
machineout=f: Set to true to output statistics in machine-friendly 'key=value' format.
printunmappedcount=f: Print the total number of unmapped reads and bases. If input is paired, the number will be of pairs for which both reads are unmapped.
showprogress=0: If positive, print a '.' every X reads.
showprogress2=0: If positive, print the number of seconds since the last progress update (instead of a '.').
renamebyinsert=f: Renames reads based on their mapped insert size.

Bloom-Filtering Parameters (bloomfilter.sh is the standalone version)

bloom=f: Use a Bloom filter to ignore reads not sharing kmers with the reference. This uses more memory, but speeds mapping when most reads don't match the reference.
bloomhashes=2: Number of hash functions.
bloomminhits=3: Number of consecutive hits to be considered matched.
bloomk=31: Bloom filter kmer length.
bloomserial=t: Use the serialized Bloom filter for greater loading speed, if available. If not, generate and write one.

Post-Filtering Parameters

idfilter=0: Independent of minid; sets exact minimum identity allowed for alignments to be printed. Range 0 to 1.
subfilter=-1: Ban alignments with more than this many substitutions.
insfilter=-1: Ban alignments with more than this many insertions.
delfilter=-1: Ban alignments with more than this many deletions.
indelfilter=-1: Ban alignments with more than this many indels.
editfilter=-1: Ban alignments with more than this many edits.
inslenfilter=-1: Ban alignments with an insertion longer than this.
dellenfilter=-1: Ban alignments with a deletion longer than this.
nfilter=-1: Ban alignments with more than this many Ns. This includes nocall, noref, and off scaffold ends.

SAM Flags and Settings

noheader=f: Disable generation of header lines.
sam=1.4: Set to 1.4 to write SAM version 1.4 cigar strings, with = and X, or 1.3 to use M.
saa=t: (secondaryalignmentasterisks) Use asterisks instead of bases for sam secondary alignments.
cigar=t: Set to 'f' to skip generation of cigar strings (faster).
keepnames=f: Keep original names of paired reads, rather than ensuring both reads have the same name.
intronlen=999999999: Set to a lower number like 10 to change 'D' to 'N' in cigar strings for deletions of at least that length.
rgid=: Set readgroup ID. All other readgroup fields can be set similarly, with the flag rgXX=. If you set a readgroup flag to the word 'filename', e.g. rgid=filename, the input file name will be used.
mdtag=f: Write MD tags.
nhtag=f: Write NH tags.
xmtag=f: Write XM tags (may only work correctly with ambig=all).
amtag=f: Write AM tags.
nmtag=f: Write NM tags.
xstag=f: Set to 'xs=fs', 'xs=ss', or 'xs=us' to write XS tags for RNAseq using firststrand, secondstrand, or unstranded libraries. Needed by Cufflinks. JGI mainly uses 'firststrand'.
stoptag=f: Write a tag indicating read stop location, prefixed by YS:i:
lengthtag=f: Write a tag indicating (query,ref) alignment lengths, prefixed by YL:Z:
idtag=f: Write a tag indicating percent identity, prefixed by YI:f:
inserttag=f: Write a tag indicating insert size, prefixed by X8:Z:
scoretag=f: Write a tag indicating BBMap's raw score, prefixed by YR:i:
timetag=f: Write a tag indicating this read's mapping time, prefixed by X0:i:
boundstag=f: Write a tag indicating whether either read in the pair goes off the end of the reference, prefixed by XB:Z:
notags=f: Turn off all optional tags.

Histogram and Statistics Output Parameters

scafstats=<file>: Statistics on how many reads mapped to which scaffold.
refstats=<file>: Statistics on how many reads mapped to which reference file; only for BBSplit.
sortscafs=t: Sort scaffolds or references by read count.
bhist=<file>: Base composition histogram by position.
qhist=<file>: Quality histogram by position.
aqhist=<file>: Histogram of average read quality.
bqhist=<file>: Quality histogram designed for box plots.
lhist=<file>: Read length histogram.
ihist=<file>: Write histogram of insert sizes (for paired reads).
ehist=<file>: Errors-per-read histogram.
qahist=<file>: Quality accuracy histogram of error rates versus quality score.
indelhist=<file>: Indel length histogram.
mhist=<file>: Histogram of match, sub, del, and ins rates by read location.
gchist=<file>: Read GC content histogram.
gcbins=100: Number of gchist bins. Set to 'auto' to use read length.
gcpairs=t: Use average GC of paired reads.
idhist=<file>: Histogram of read count versus percent identity.
idbins=100: Number of idhist bins. Set to 'auto' to use read length.
statsfile=stderr: Mapping statistics are printed here.

Coverage Output Parameters (these may reduce speed and use more RAM)

covstats=<file>: Per-scaffold coverage info.
rpkm=<file>: Per-scaffold RPKM/FPKM counts.
covhist=<file>: Histogram of # occurrences of each depth level.
basecov=<file>: Coverage per base location.
bincov=<file>: Print binned coverage per location (one line per X bases).
covbinsize=1000: Set the binsize for binned coverage output.
nzo=t: Only print scaffolds with nonzero coverage.
twocolumn=f: Change to true to print only ID and Avg_fold instead of all 6 columns to the 'out=' file.
32bit=f: Set to true if you need per-base coverage over 64k.
strandedcov=f: Track coverage for plus and minus strand independently.
startcov=f: Only track start positions of reads.
secondarycov=t: Include coverage of secondary alignments.
physcov=f: Calculate physical coverage for paired reads. This includes the unsequenced bases.
delcoverage=t: (delcov) Count bases covered by deletions as covered. True is faster than false.
covk=0: If positive, calculate kmer coverage statistics.

Java Parameters

-Xmx: This will set Java's memory usage, overriding autodetection. -Xmx20g will specify 20 gigs of RAM, and -Xmx800m will specify 800 megs. The max is typically 85% of physical memory. The human genome requires around 24g, or 12g with the 'usemodulo' flag. The index uses roughly 6 bytes per reference base.
-eoom: This flag will cause the process to exit if an out-of-memory exception occurs. Requires Java 8u92+.
-da: Disable assertions.

Common Usage Scenarios

Basic Genome Mapping

bbmap.sh ref=reference.fa in=reads.fq out=mapped.sam

Standard mapping to reference genome with default sensitivity settings.

RNA-seq Mapping (Vertebrates)

bbmap.sh ref=genome.fa in=reads.fq out=mapped.sam maxindel=200k ambig=random intronlen=20 xstag=us

Maps RNA-seq reads allowing for long introns typical of vertebrate genomes. Sets intronlen=20 to convert deletions ≥20bp to introns (N) in CIGAR strings.

High-Sensitivity Mapping

bbmap.sh ref=reference.fa in=reads.fq out=mapped.sam slow k=12

Uses slower, more sensitive settings suitable for divergent sequences or low-quality data.

Fast Mapping for Large Datasets

bbmap.sh ref=reference.fa in=reads.fq out=mapped.sam fast

Rapid alignment prioritizing speed over sensitivity, suitable when most reads are expected to map well.

Contaminant Removal (High Precision)

bbmap.sh ref=contaminant.fa in=reads.fq outm=contaminated.fq outu=clean.fq \
  minratio=0.9 maxindel=3 bwr=0.16 bw=12 fast minhits=2 qtrim=r trimq=10 \
  untrim idtag printunmappedcount kfilter=25 maxsites=1 k=14

Removes reads matching contaminant references with high precision, minimizing false positives.

Coverage Analysis

bbmap.sh ref=reference.fa in=reads.fq \
  covstats=coverage_stats.txt covhist=coverage_histogram.txt \
  basecov=per_base_coverage.txt bincov=binned_coverage.txt

Generates comprehensive coverage statistics including per-scaffold stats, coverage histograms, and per-base coverage depth.

Low-Memory Mapping

# First, build low-memory index
bbmap.sh ref=reference.fa usemodulo

# Then map with same setting
bbmap.sh in=reads.fq out=mapped.sam usemodulo

Reduces memory usage by ~50% with slight sensitivity reduction, useful for large genomes on memory-limited systems.

Multiple Index Management

# Build multiple indices in same directory
bbmap.sh ref=human.fa build=1
bbmap.sh ref=mouse.fa build=2
bbmap.sh ref=virus.fa build=3

# Map to specific index
bbmap.sh in=reads.fq out=mapped.sam build=2

Manages multiple reference indices using build numbers for easy switching between references.

BAM Output with Sorting

bbmap.sh ref=reference.fa in=reads.fq out=mapped.sam bamscript=sort_bam.sh
sh sort_bam.sh

Creates a shell script for converting SAM to sorted, indexed BAM format using samtools.

Algorithm Details

Multi-Kmer-Seed-and-Extend Algorithm

BBMap uses a multi-kmer-seed-and-extend approach, described by the algorithm's creator as analogous to growing polycrystalline silicon. This method provides superior accuracy compared to single-seed approaches, particularly for highly mutated genomes and reads with complex indel patterns.

Global vs Local Alignment

BBMap is fundamentally a global aligner, meaning it seeks the highest-scoring alignment considering all bases in a sequence. This approach is essential for detecting long indels that local aligners might miss by clipping sequence ends. When the local flag is used, BBMap still performs global alignment internally but converts results to local alignments by clipping low-scoring ends.

Scoring System and Identity Calculation

BBMap uses a custom affine-transform scoring matrix rather than simple percent identity. The mapping decision depends on whether the ratio between the actual alignment score and the maximum possible score exceeds the minratio threshold:

Match: +100 points
First mismatch: -127 points
Consecutive mismatch: -51 points
Variable penalties: Based on mutation event length and type (substitution vs indel)

The minid parameter provides user-friendly control by automatically calculating the appropriate minratio value. For example, minid=0.9 sets minratio=0.816 based on the expected score ratio for 90% identity with noncontiguous substitutions.
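The ratio test described in this section can be illustrated with a simplified calculation. The sketch below uses only the three fixed point values listed above and ignores the variable penalties. It is therefore not BBMap's actual scoring and is not expected to reproduce the minid-to-minratio conversion. score_ratio is a hypothetical helper.

```shell
#!/bin/sh
# Simplified score ratio: actual score divided by the maximum possible score.
# Uses only the fixed values above (+100, -127, -51); variable penalties are ignored.
# score_ratio is a hypothetical helper, not part of BBTools.
score_ratio() {
    matches="$1"     # matching bases
    first_mm="$2"    # mismatches that start a run
    consec_mm="$3"   # mismatches that continue a run
    awk -v m="$matches" -v f="$first_mm" -v c="$consec_mm" 'BEGIN {
        len   = m + f + c                      # aligned read length
        score = 100 * m - 127 * f - 51 * c     # simplified alignment score
        max   = 100 * len                      # score of a perfect alignment
        printf "%.4f\n", score / max
    }'
}

score_ratio 97 3 0   # 100 bp read with three isolated mismatches: prints 0.9319
```

Under this simplified model, a read would be kept only if the printed ratio exceeds the minratio threshold.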

Performance Modes

BBMap implements three performance presets that adjust multiple parameters simultaneously:

Fast Mode

Optimized for speed at reduced sensitivity (bad for RNA-seq):

Reduces tip search distance (tipsearch=20)
Limits indel detection (maxindel=80)
Increases seed requirements (minhits=2)
Restricts alignment bandwidth (bwr=0.18)
Excludes more frequent kmers (excludefraction*1.25)
Reduces key density (keyDensity*0.9)

Slow Mode

Enhanced sensitivity for challenging data:

Increases tip search distance (tipsearch=150)
Lowers seed requirements (minhits=1)
Reduces minimum alignment ratio (minratio=0.45)
Includes more kmers (excludefraction*0.4)
Increases key density (keyDensity*1.2)

VSlow Mode

Maximum sensitivity for divergent sequences:

Disables quality filtering (usequality=false)
Very low alignment threshold (minratio=0.22)
No kmer exclusion (excludefraction=0)
Maximum key density (keyDensity*2.5)
Extended rescue distance (rescuedist=2500)
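The excludefraction adjustments listed for the three modes are relative to the current setting. As a rough illustration, the sketch below applies them to the documented default of 0.03. It assumes no other automatic adjustment is in effect, and effective_ef is a hypothetical helper.

```shell
#!/bin/sh
# Effective excludefraction per preset, assuming the default ef=0.03
# and the multipliers listed above.
# effective_ef is a hypothetical helper, not part of BBTools.
effective_ef() {
    awk -v base="$1" -v mult="$2" 'BEGIN { printf "%.4f\n", base * mult }'
}

effective_ef 0.03 1.25   # fast:  prints 0.0375
effective_ef 0.03 0.4    # slow:  prints 0.0120
effective_ef 0.03 0      # vslow: prints 0.0000 (no kmer exclusion)
```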

Memory Architecture

BBMap's memory usage follows predictable patterns based on reference genome size and kmer length:

Index storage: ~6 bytes per reference base (normal mode)
Low-memory mode: ~3 bytes per reference base with usemodulo
Thread overhead: Additional memory per thread for alignment matrices
Kmer length scaling: Memory increases with longer kmers

The system automatically adjusts parameters based on genome size: genomes under 30MB get enhanced hit reduction and exclusion fraction adjustments, while larger genomes use standard parameters.

Index Management System

BBMap's index management system handles multiple references:

Automatic detection: Checks for existing compatible indices before building
Build numbering: Supports multiple indices in the same directory using build IDs
Path customization: Allows custom index locations via the path parameter
Disk vs memory: Can build indices in memory only (nodisk) or persist to disk

RNA-seq Specific Features

BBMap includes specific optimizations for RNA-seq data:

Splice junction detection: Uses maxindel parameter to detect long introns
Strand-specific mapping: XS tag generation for firststrand/secondstrand/unstranded protocols
Intron annotation: Converts long deletions to introns (N) in CIGAR strings based on intronlen threshold
Rescue operations: Performs brute-force search for poorly mapping paired reads

File Format Support

Input formats: FASTA, FASTQ (compressed or uncompressed). Paired reads can be in two files or interleaved in a single file.

Output formats: FASTA, FASTQ, SAM, BAM (if samtools is installed). Alignment information is preserved only in SAM/BAM formats.

Compression: Supports gzip, bgzip, and can utilize pigz for parallel compression/decompression.

SAM Format Compatibility

BBMap supports both modern SAM 1.4 specification (default) and legacy SAM 1.3 format:

SAM 1.4 (default): Uses X for substitutions, = for matches
SAM 1.3 (legacy): Uses M for both matches and mismatches (use sam=1.3 flag)

Some older programs require the sam=1.3 flag and trd (trimreaddescriptions) for compatibility.
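To make the difference between the two notations concrete, the sketch below collapses 1.4-style = and X operations into 1.3-style M operations. It illustrates the notation only and is not how BBMap generates CIGAR strings. cigar_to_m is a hypothetical helper and assumes awk is available.

```shell
#!/bin/sh
# Illustration only: merge adjacent "=" and "X" runs into a single "M" run.
# cigar_to_m is a hypothetical helper, not a BBMap feature.
cigar_to_m() {
    printf '%s\n' "$1" | awk '{
        out = ""; run = 0
        # Consume one <length><operation> pair from the front per iteration
        while (match($0, /^[0-9]+[A-Z=]/)) {
            n  = substr($0, 1, RLENGTH - 1) + 0
            op = substr($0, RLENGTH, 1)
            $0 = substr($0, RLENGTH + 1)
            if (op == "=" || op == "X") {
                run += n                       # extend the current M run
            } else {
                if (run > 0) { out = out run "M"; run = 0 }
                out = out n op                 # other operations pass through
            }
        }
        if (run > 0) out = out run "M"
        print out
    }'
}

cigar_to_m "50=1X49="       # prints 100M
cigar_to_m "30=2D10=1X9="   # prints 30M2D20M
```

The conversion is lossy: once = and X are merged into M, the positions of substitutions can no longer be read from the CIGAR string alone.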

Performance Considerations

BBMap scales near-linearly with processor cores and is optimized for both indexing and mapping phases. Performance can be tuned based on specific requirements:

Speed Optimization

Increase minhits and kmer length
Use fast flag for macro optimization
Reduce maxindel for faster alignment
Adjust sensitivity flags like minratio and bandwidth

Sensitivity Optimization

Use slow or vslow flags
Decrease kmer length (k=12 or lower)
Lower minratio threshold
Increase maxindel for long indel detection

Memory Optimization

Use usemodulo flag to reduce index size by ~50%
Adjust thread count based on available memory
Consider nodisk for temporary mappings

Related Tools

BBSplit: Maps reads to multiple genomes simultaneously, determining best matches
BBWrap: Wrapper for running BBMap multiple times without reloading the index
mapPacBio.sh: Variant optimized for long reads (PacBio/Nanopore) up to 6kbp
BBMapSkimmer: Finds all alignments above a threshold rather than single best alignment

Support

For questions and support:

Author: Brian Bushnell
Email: bbushnell@lbl.gov
Forum: SEQanswers
Documentation: bbmap.org