BBSplit

Overview

BBSplit internally uses the BBMap alignment engine to map reads to multiple genomes simultaneously, determining which genome each read matches best. This approach is fundamentally different from ordinary single-reference mapping because it focuses on resolving ambiguity between references rather than within a single reference.

For example, if a read maps ambiguously to multiple locations within the human genome, that internal ambiguity is irrelevant when the goal is distinguishing human reads from mouse reads. BBSplit tracks this additional cross-reference ambiguity information and provides specialized tools for handling it.

Primary Applications

Metagenomics: Binning and refining metagenomic reads by taxonomic origin
Contamination Detection: Separating host reads from pathogen sequences
Multi-species Studies: Parsing reads from mixed-species samples
Quality Control: Quantifying cross-contamination between samples

Basic Usage

bbsplit.sh ref=x.fa,y.fa in=reads.fq basename=o%.fq

This is equivalent to:

bbsplit.sh build=1 in=reads.fq ref_x=x.fa ref_y=y.fa out_x=ox.fq out_y=oy.fq

Two-step process:

To index:

bbsplit.sh build=1 ref_x=reference1.fa ref_y=reference2.fa

To map:

bbsplit.sh build=1 in=reads.fq out_x=output1.fq out_y=output2.fq
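
When many read files are split against the same references, the two-step process above can be wrapped in small shell functions so the index is built only once. The following is a minimal sketch, not part of BBTools: the function names, the BBSPLIT override variable, and the output naming convention are all assumptions made for illustration.

```shell
# Sketch only: index once, then split several read files against that index.
# BBSPLIT is a hypothetical override; setting it to "echo" gives a dry run.
BBSPLIT="${BBSPLIT:-bbsplit.sh}"

build_index() {
    # $1, $2: reference fasta files, registered under the names x and y
    "$BBSPLIT" build=1 "ref_x=$1" "ref_y=$2"
}

split_reads() {
    # $1: reads file; $2: output prefix (a naming choice made for this example)
    "$BBSPLIT" build=1 "in=$1" "out_x=${2}_x.fq" "out_y=${2}_y.fq"
}
```

A typical session would call build_index reference1.fa reference2.fa once, then split_reads sampleA.fq sampleA for each sample. Running with BBSPLIT=echo prints the commands instead of executing them.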

Parameters

BBSplit uses the BBMap engine internally and supports almost all BBMap parameters, with the addition of specialized multi-reference functionality. Parameters are organized by function:

Indexing Parameters

Required when building the index. BBSplit creates specialized indexes that track which sequences came from which reference file.

ref=<file,file>: A comma-delimited list of reference files, or directories containing fasta files.
ref_<name>=<ref.fa>: Alternate, longer way to specify references. e.g., ref_ecoli=ecoli.fa. These can also be comma-delimited lists of files; e.g., ref_bacteria=ecoli.fa,salmonella.fa,klebsiella.fa
build=<1>: Designate index to use. Corresponds to the number specified when building the index.
path=<.>: Specify the location to write the index, if you don't want it in the current working directory.

Input Parameters

in=<reads.fq>: Primary reads input; required parameter.
in2=<reads2.fq>: For paired reads in two files.
qin=<auto>: Set to 33 or 64 to specify input quality value ASCII offset.
interleaved=<auto>: True forces paired/interleaved input; false forces single-ended mapping. If not specified, interleaved status will be autodetected from read names.

Mapping Parameters

maxindel=<20>

Don't look for indels longer than this. Lower is faster. Set to >=100k for RNA-seq, so that alignments can span introns.

minratio=<0.56>

Fraction of max alignment score required to keep a site. Higher is faster.

minhits=<1>

Minimum number of seed hits required for candidate sites. Higher is faster.

ambiguous=<best>

Set behavior on ambiguously-mapped reads (with multiple top-scoring mapping locations).

best - use the first best site
toss - consider unmapped
random - select one top-scoring site randomly
all - retain all top-scoring sites (does not work yet with SAM output)

ambiguous2=<best>

Set behavior only for reads that map ambiguously to more than one reference. This is BBSplit's key distinguishing parameter: it handles cross-reference ambiguity separately from within-reference ambiguity. The normal 'ambiguous=' flag controls behavior on all ambiguous reads, whereas 'ambiguous2=' applies only to cross-reference ambiguity and ignores reads whose ambiguity is confined to a single reference.

best - use the first best site
toss - consider unmapped
all - write a copy to the output for each reference to which it maps
split - write a copy to the AMBIGUOUS_ output for each reference to which it maps
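
The four ambiguous2 modes can be summarized as a routing decision. The sketch below is illustrative only and is not BBSplit code: route_ambiguous2 is a hypothetical helper that takes a mode plus the names of the references a read maps to equally well, and prints the labels of the outputs that would receive a copy.

```shell
# Hypothetical helper: show which outputs receive a cross-reference ambiguous read.
# $1: ambiguous2 mode; remaining arguments: reference names, best site first.
route_ambiguous2() {
    mode=$1
    shift
    case $mode in
        best)  printf '%s\n' "$1" ;;          # only the first best reference
        toss)  : ;;                           # treated as unmapped, no output
        all)   printf '%s\n' "$@" ;;          # one copy per reference
        split) for ref in "$@"; do            # one copy per AMBIGUOUS_ output
                   printf 'AMBIGUOUS_%s\n' "$ref"
               done ;;
        *)     echo "unknown mode: $mode" >&2
               return 1 ;;
    esac
}
```

For example, route_ambiguous2 split human mouse lists an AMBIGUOUS_ destination for each of the two references, while route_ambiguous2 toss human mouse prints nothing.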

qtrim=<true>

Quality-trim ends to Q5 before mapping. Options are 'l' (left), 'r' (right), and 'lr' (both).

untrim=<true>

Undo trimming after mapping. Untrimmed bases will be soft-clipped in cigar strings.

Output Parameters

out_<name>=<file>: Output reads that map to the reference <name> to <file>.
basename=prefix%suffix: Equivalent to multiple out_%=prefix%suffix expressions, in which each % is replaced by the name of a reference file. By default paired reads will yield interleaved output, but you can use the # symbol to produce twin output files. For example, basename=o%_#.fq will produce ox_1.fq, ox_2.fq, oy_1.fq, and oy_2.fq.
bs=<file>: Write a shell script to 'file' that will turn the sam output into a sorted, indexed bam file.
scafstats=<file>: Write statistics on how many reads mapped to which scaffold to this file.
refstats=<file>: Write statistics on how many reads were assigned to which reference to this file. Unmapped reads whose mate mapped to a reference are considered assigned and will be counted.
nzo=t: Only print lines with nonzero coverage.
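
To make the basename= expansion concrete, the following sketch mimics the substitution of % and # described above. expand_basename is a hypothetical helper written for this guide, not something BBSplit provides, and it assumes the pattern contains a single % and at most one #.

```shell
# Hypothetical helper: print the file names a basename pattern would yield.
# $1: pattern such as o%_#.fq; remaining arguments: reference names.
expand_basename() {
    pattern=$1
    shift
    prefix=${pattern%%\%*}    # text before the first %
    suffix=${pattern#*\%}     # text after the first %
    for ref in "$@"; do
        name=$prefix$ref$suffix
        case $name in
            *'#'*)
                # Twin files: replace # with 1 and 2.
                printf '%s\n' "${name%%\#*}1${name#*\#}" "${name%%\#*}2${name#*\#}"
                ;;
            *)
                printf '%s\n' "$name"
                ;;
        esac
    done
}
```

Calling expand_basename 'o%_#.fq' x y lists ox_1.fq, ox_2.fq, oy_1.fq, and oy_2.fq, matching the example given for basename=.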

Java Parameters

-Xmx: This will set Java's memory usage, overriding autodetection. -Xmx20g will specify 20 gigs of RAM, and -Xmx200m will specify 200 megs. The max is typically 85% of physical memory.
-eoom: This flag will cause the process to exit if an out-of-memory exception occurs. Requires Java 8u92+.
-da: Disable assertions.
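
If autodetection is overridden, the -Xmx value has to be chosen by hand. As a rough sketch of the arithmetic, the hypothetical helper below converts total physical memory in kilobytes into an -Xmx flag at 85%, the typical ceiling mentioned above. The helper name and the choice of megabyte units are assumptions.

```shell
# Hypothetical helper: turn total physical memory (in kB) into an -Xmx flag.
# Uses 85% of the total, expressed in megabytes with integer arithmetic.
xmx_from_kb() {
    total_kb=$1
    echo "-Xmx$(( total_kb * 85 / 100 / 1024 ))m"
}
```

On Linux, the total could be read from the MemTotal line of /proc/meminfo (a platform-specific assumption), and the resulting flag passed to bbsplit.sh along with the other parameters.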

Examples

Metagenomic binning

bbsplit.sh ref_bacteria=ecoli.fa,salmonella.fa ref_human=human_chr.fa \
    in=metagenome.fq basename=binned_%.fq refstats=taxonomy_stats.txt

Separates metagenomic reads into bacterial and human components, generating binned_bacteria.fq and binned_human.fq, and writes per-reference read counts to taxonomy_stats.txt.

Contamination detection in RNA-seq

bbsplit.sh ref_mouse=mouse_genome.fa ref_human=human_genome.fa \
    in=rnaseq.fq basename=species_%.fq maxindel=200k ambiguous2=toss

Identifies cross-species contamination in RNA-seq data, discarding reads that map ambiguously to both species.

Multi-pathogen detection

# First, build pathogen database
bbsplit.sh build=1 ref_covid=sars_cov2.fa ref_flu=influenza_h1n1.fa ref_rsv=rsv.fa

# Then screen clinical samples
bbsplit.sh build=1 in=clinical_sample.fq out_covid=covid_reads.fq \
    out_flu=flu_reads.fq out_rsv=rsv_reads.fq refstats=pathogen_counts.txt

Screens clinical samples for multiple respiratory pathogens simultaneously.

Handling cross-reference ambiguity

bbsplit.sh ref=closely_related1.fa,closely_related2.fa in=reads.fq \
    basename=out_%.fq ambiguous2=split

Reads that map ambiguously to both references are written to separate AMBIGUOUS_-prefixed output files, one per reference, for further analysis.

Paired-end output with statistics

bbsplit.sh ref=ref1.fa,ref2.fa,ref3.fa in=reads_1.fq in2=reads_2.fq \
    basename=sorted_%_#.fq refstats=mapping_stats.txt scafstats=scaffold_stats.txt

Creates separate mate files (sorted_ref1_1.fq, sorted_ref1_2.fq, etc.) and writes per-reference statistics to mapping_stats.txt and per-scaffold statistics to scaffold_stats.txt.

Algorithm Details

Cross-Reference Ambiguity Resolution

BBSplit's primary innovation is its handling of cross-reference ambiguity. In standard mapping, an ambiguous read is one with multiple top-scoring locations within a genome. BBSplit additionally distinguishes within-reference ambiguity (for example, multiple locations in the human genome) from cross-reference ambiguity (for example, mapping equally well to both the human and mouse genomes).

The algorithm uses the getSets() method to extract reference names from scaffold prefixes, comparing HashSet collections (p1, p2, s1, s2) representing primary and secondary reference mappings for read pairs. Reads are marked as cross-reference ambiguous when p1≠p2 or when primary sets don't contain all secondary mappings, enabling precise control over multi-species read assignment.
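
The distinction between the two kinds of ambiguity can be illustrated with a much simpler rule than the paired-set comparison described above. The sketch below is a deliberate simplification and does not reproduce getSets(): classify_read is a hypothetical helper that receives the reference name of each top-scoring site for a single read and reports which kind of ambiguity, if any, applies.

```shell
# Simplified sketch, not BBSplit's getSets() logic: classify a read from the
# reference names of its top-scoring sites (one argument per site).
classify_read() {
    if [ "$#" -eq 0 ]; then
        echo "unmapped"
        return
    fi
    # Count distinct reference names among the sites.
    distinct=$(( $(printf '%s\n' "$@" | sort -u | wc -l) ))
    if [ "$distinct" -gt 1 ]; then
        echo "cross-reference ambiguous"
    elif [ "$#" -gt 1 ]; then
        echo "within-reference ambiguous"
    else
        echo "unambiguous"
    fi
}
```

Under this rule, classify_read human human reports within-reference ambiguity, which ambiguous2= ignores, while classify_read human mouse reports cross-reference ambiguity, which ambiguous2= controls.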

Reference Merging and Indexing

During indexing, BBSplit merges multiple reference files into a unified structure while preserving reference identity. The mergeReferences() method processes reference files and generates scaffold names with reference prefixes using '$' delimiter separation (e.g., "bacteria$ecoli_chromosome"). This encoding allows reads to maintain reference identity throughout the alignment pipeline while enabling efficient single-pass mapping.
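
Because the reference name is stored as a prefix before the '$' delimiter, it can be recovered from a merged scaffold name with ordinary shell tools. The helpers below are hypothetical and shown only to illustrate the encoding; they assume the first '$' in the name is the delimiter.

```shell
# Hypothetical helpers for names such as "bacteria$ecoli_chromosome".
scaffold_ref() {
    printf '%s\n' "${1%%\$*}"    # text before the first '$': the reference name
}

scaffold_name() {
    printf '%s\n' "${1#*\$}"     # text after the first '$': the original scaffold
}

# Tally merged scaffold names (one per line on stdin) by reference name.
count_by_ref() {
    awk -F'[$]' '{ n[$1]++ } END { for (r in n) print r, n[r] }' | sort
}
```

For example, scaffold_ref 'bacteria$ecoli_chromosome' yields the reference name and scaffold_name yields the remainder. Single quotes keep the shell from expanding the '$'.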

The merged reference file uses a hash-based naming scheme (merged_ref_[key].fa.gz) derived from the reference set composition, enabling automatic index reuse when the same reference combination is encountered again.

Multi-Engine Mapping Support

BBSplit supports multiple alignment engines optimized for different data types:

MAP_NORMAL: Standard BBMap for Illumina reads
MAP_ACC: BBMapAcc with minratio=0.4 for increased sensitivity
MAP_PACBIO: BBMapPacBio for long reads with high indel rates
MAP_PACBIOSKIMMER: BBMapPacBioSkimmer for comprehensive alignment discovery

Statistics and Counting

BBSplit maintains detailed statistics through LinkedHashMap collections tracking both reference-level and scaffold-level metrics. The SetCount class maintains six synchronized counters: mappedReads, ambiguousReads, assignedReads, mappedBases, ambiguousBases, and assignedBases.

Statistics are updated during read processing through the addToScafCounts() method, which handles both unambiguous assignments and proportional counting for cross-reference ambiguous reads. This supports quantification of taxonomic composition in metagenomic samples.

Memory Optimization

BBSplit uses automatic memory detection through calcXmx() with platform-specific optimizations. Memory allocation is based on reference size and thread count, with compression level 2 for intermediate files. The tool employs backpressure mechanisms to prevent memory exhaustion during large metagenomic analyses.

When to Use BBSplit vs Alternatives

BBSplit is ideal for:

References longer than read length (typically >150bp for Illumina)
Distinguishing between different species or strains
Metagenomic read binning and taxonomic assignment
Contamination detection and removal
Multi-reference quantification

Use Seal instead when:

Reference sequences are shorter than read length
Searching for specific short sequences or adapters
Exact kmer matching is sufficient

Use standard BBMap when:

Mapping to a single reference genome
Standard genomic or transcriptomic analysis
Variant calling or SNP detection
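
The choice between BBSplit and Seal hinges on whether the references are longer than the reads. One way to check is to find the shortest sequence in a reference file and compare it with the read length. The sketch below assumes plain, uncompressed fasta input; min_seq_len is a hypothetical helper, not a BBTools command.

```shell
# Hypothetical helper: print the length of the shortest sequence in a fasta
# file (or on stdin if no file is given). Assumes uncompressed fasta.
min_seq_len() {
    awk '
        /^>/ {
            # A new header closes the previous record, if there was one.
            if (seen && (min == "" || len < min)) min = len
            seen = 1
            len = 0
            next
        }
        { len += length($0) }
        END {
            if (seen) {
                if (min == "" || len < min) min = len
                print min
            }
        }
    ' "$@"
}
```

If min_seq_len x.fa reports a value below the read length, at least one reference sequence is shorter than the reads, which is the situation where Seal is the suggested tool.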

Important Notes

BBMap compatibility: Almost all BBMap parameters can be used with BBSplit; run bbmap.sh to see the complete parameter list
Index requirement: BBSplit can only be run using references indexed with BBSplit, as they contain essential reference tracking information
Disk usage: The 'nodisk' flag is not supported by BBSplit due to reference merging requirements
Output formats: BBSplit is recommended for fastq and fasta output, not for sam/bam output
Paired reads: By default, paired reads yield interleaved output unless the # symbol is used in basename
Reference length: When reference sequences are shorter than read length, use Seal instead of BBSplit

Support

For questions and support:

Email: bbushnell@lbl.gov
Documentation: bbmap.org