BBSplit

Script: bbsplit.sh
Package: align2
Class: BBSplitter.java

Multi-reference read mapping tool for metagenomics and contamination detection. It maps reads to multiple references simultaneously and determines the best-matching genome for each read, with dedicated handling of reads that match more than one reference.

Overview

BBSplit internally uses the BBMap alignment engine to map reads to multiple genomes simultaneously, determining which genome each read matches best. This approach is fundamentally different from ordinary single-reference mapping because it focuses on resolving ambiguity between references rather than within a single reference.

For example, if a read maps ambiguously to multiple locations within the human genome, that internal ambiguity is irrelevant when the goal is distinguishing human reads from mouse reads. BBSplit tracks this additional cross-reference ambiguity information and provides specialized tools for handling it.

Primary Applications

Basic Usage

bbsplit.sh ref=x.fa,y.fa in=reads.fq basename=o%.fq

This is equivalent to:

bbsplit.sh build=1 in=reads.fq ref_x=x.fa ref_y=y.fa out_x=ox.fq out_y=oy.fq

Two-step process:

To index:

bbsplit.sh build=1 ref_x=reference1.fa ref_y=reference2.fa

To map:

bbsplit.sh build=1 in=reads.fq out_x=output1.fq out_y=output2.fq

Parameters

BBSplit uses the BBMap engine internally and supports almost all BBMap parameters, with the addition of specialized multi-reference functionality. Parameters are organized by function:

Indexing Parameters

Required when building the index. BBSplit creates specialized indexes that track which sequences came from which reference file.

ref=<file,file>
A list of references, or directories containing fasta files.
ref_<name>=<ref.fa>
Alternate, longer way to specify references. e.g., ref_ecoli=ecoli.fa. These can also be comma-delimited lists of files; e.g., ref_bacteria=ecoli.fa,salmonella.fa,klebsiella.fa
build=<1>
Designate index to use. Corresponds to the number specified when building the index.
path=<.>
Specify the location to write the index, if you don't want it in the current working directory.

Input Parameters

in=<reads.fq>
Primary reads input; required parameter.
in2=<reads2.fq>
For paired reads in two files.
qin=<auto>
Set to 33 or 64 to specify input quality value ASCII offset.
interleaved=<auto>
True forces paired/interleaved input; false forces single-ended mapping. If not specified, interleaved status will be autodetected from read names.

Mapping Parameters

maxindel=<20>
Don't look for indels longer than this. Lower is faster. Set to >=100k for RNA-seq.
minratio=<0.56>
Fraction of max alignment score required to keep a site. Higher is faster.
minhits=<1>
Minimum number of seed hits required for candidate sites. Higher is faster.
ambiguous=<best>
Set behavior on ambiguously-mapped reads (with multiple top-scoring mapping locations).
  • best - use the first best site
  • toss - consider unmapped
  • random - select one top-scoring site randomly
  • all - retain all top-scoring sites (does not work yet with SAM output)
ambiguous2=<best>
Set behavior only for reads that map ambiguously to multiple different references. This is BBSplit's key distinguishing parameter: it handles cross-reference ambiguity separately from within-reference ambiguity. The normal 'ambiguous=' setting controls behavior for all ambiguous reads, whereas 'ambiguous2=' applies only to reads whose top-scoring sites lie in different references; reads that are ambiguous only within a single reference are not affected by it.
  • best - use the first best site
  • toss - consider unmapped
  • all - write a copy to the output for each reference to which it maps
  • split - write a copy to the AMBIGUOUS_ output for each reference to which it maps
qtrim=<true>
Quality-trim ends to Q5 before mapping. Options are 'l' (left), 'r' (right), and 'lr' (both).
untrim=<true>
Undo trimming after mapping. Untrimmed bases will be soft-clipped in cigar strings.
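
How the ambiguous2= modes route a read can be sketched in a few lines of Python. This is only an illustration of the behavior described above, not BBSplit's actual code: the function name route_read and the representation of a read as a list of reference names (one per top-scoring site) are assumptions of the sketch, and within-reference ambiguity (governed by ambiguous=) is deliberately left out.

```python
def route_read(top_site_refs, ambiguous2="best"):
    """Return the output labels a read would be written to.

    top_site_refs: reference names of the read's top-scoring sites,
    in the order they were found (hypothetical representation).
    """
    # Unique reference names, keeping first-seen order.
    refs = list(dict.fromkeys(top_site_refs))
    if not refs:
        return []            # unmapped read
    if len(refs) == 1:
        return [refs[0]]     # no cross-reference ambiguity; ambiguous2 does not apply
    if ambiguous2 == "best":
        return [refs[0]]     # use the first best site
    if ambiguous2 == "toss":
        return []            # consider the read unmapped
    if ambiguous2 == "all":
        return refs          # one copy per reference
    if ambiguous2 == "split":
        return ["AMBIGUOUS_" + r for r in refs]  # one copy per AMBIGUOUS_ output
    raise ValueError("unknown ambiguous2 mode: " + ambiguous2)
```

For instance, a read whose top-scoring sites fall in both human and mouse is dropped under ambiguous2=toss but copied to two AMBIGUOUS_ outputs under ambiguous2=split, while a read with two equally good sites inside human alone is unaffected by either setting.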

Output Parameters

out_<name>=<file>
Output reads that map to the reference <name> to <file>.
basename=prefix%suffix
Equivalent to multiple out_%=prefix%suffix expressions, in which each % is replaced by the name of a reference file. By default paired reads will yield interleaved output, but you can use the # symbol to produce twin output files. For example, basename=o%_#.fq will produce ox_1.fq, ox_2.fq, oy_1.fq, and oy_2.fq.
bs=<file>
Write a shell script to 'file' that will turn the sam output into a sorted, indexed bam file.
scafstats=<file>
Write statistics on how many reads mapped to which scaffold to this file.
refstats=<file>
Write statistics on how many reads were assigned to which reference to this file. Unmapped reads whose mate mapped to a reference are considered assigned and will be counted.
nzo=t
Only print lines with nonzero coverage.
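
The basename= expansion described in this section can be sketched as a small Python helper. It is a sketch under stated assumptions rather than BBSplit's implementation: the helper names are hypothetical, and the rule that a reference's name is its file name up to the first dot (x.fa becomes x) is inferred from the usage examples.

```python
from pathlib import Path


def reference_name(ref_path):
    """Assumed naming rule: file name up to the first dot (x.fa -> x)."""
    return Path(ref_path).name.split(".")[0]


def expand_basename(pattern, refs, paired=False):
    """Map each reference name to the output file(s) implied by basename=pattern."""
    outputs = {}
    for ref in refs:
        name = reference_name(ref)
        per_ref = pattern.replace("%", name)  # % -> reference name
        if paired and "#" in per_ref:
            # '#' -> 1 and 2, giving twin files for the two mates
            outputs[name] = [per_ref.replace("#", "1"), per_ref.replace("#", "2")]
        else:
            outputs[name] = [per_ref]
    return outputs
```

With the pattern o%_#.fq, references x.fa and y.fa, and paired input, the helper yields ox_1.fq, ox_2.fq, oy_1.fq, and oy_2.fq, matching the file names listed for basename= above.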

Java Parameters

-Xmx
This will set Java's memory usage, overriding autodetection. -Xmx20g will specify 20 gigs of RAM, and -Xmx200m will specify 200 megs. The max is typically 85% of physical memory.
-eoom
This flag will cause the process to exit if an out-of-memory exception occurs. Requires Java 8u92+.
-da
Disable assertions.
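
As a rough illustration of the arithmetic behind an explicit -Xmx value, the sketch below derives a flag from a machine's physical memory using the 85% figure mentioned above. The helper name suggest_xmx and its rounding rules are assumptions of this sketch; BBSplit's own autodetection may choose a different value.

```python
def suggest_xmx(physical_bytes, percent=85):
    """Return a -Xmx flag for roughly `percent` of physical memory.

    Uses binary units (1g = 1024 MiB) and rounds down.
    """
    usable_mib = physical_bytes * percent // 100 // (1024 ** 2)
    if usable_mib >= 1024:
        return "-Xmx{}g".format(usable_mib // 1024)
    return "-Xmx{}m".format(usable_mib)
```

On a machine with 64 GiB of RAM this gives -Xmx54g, which could then be appended to the bbsplit.sh command line.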

Examples

Metagenomic binning

bbsplit.sh ref_bacteria=ecoli.fa,salmonella.fa ref_human=human_chr.fa \
    in=metagenome.fq basename=binned_%.fq refstats=taxonomy_stats.txt

Separates metagenomic reads into bacterial and human components, generating binned_bacteria.fq and binned_human.fq and writing per-reference assignment statistics to taxonomy_stats.txt.

Contamination detection in RNA-seq

bbsplit.sh ref_mouse=mouse_genome.fa ref_human=human_genome.fa \
    in=rnaseq.fq basename=species_%.fq maxindel=200k ambiguous2=toss

Identifies cross-species contamination in RNA-seq data, discarding reads that map ambiguously to both species.

Multi-pathogen detection

# First, build pathogen database
bbsplit.sh build=1 ref_covid=sars_cov2.fa ref_flu=influenza_h1n1.fa ref_rsv=rsv.fa

# Then screen clinical samples
bbsplit.sh build=1 in=clinical_sample.fq out_covid=covid_reads.fq \
    out_flu=flu_reads.fq out_rsv=rsv_reads.fq refstats=pathogen_counts.txt

Screens clinical samples for multiple respiratory pathogens simultaneously.
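
When many samples are screened against the same index, the two commands above can be assembled programmatically. The following Python sketch only builds the argument lists; the function names are hypothetical, and it uses no parameters beyond those shown in this document (build=, ref_<name>=, in=, out_<name>=, refstats=).

```python
def index_command(refs, build=1):
    """refs: dict mapping a reference name to its fasta path."""
    cmd = ["bbsplit.sh", "build={}".format(build)]
    cmd += ["ref_{}={}".format(name, path) for name, path in refs.items()]
    return cmd


def screen_command(reads, outputs, build=1, refstats=None):
    """outputs: dict mapping a reference name to its output file."""
    cmd = ["bbsplit.sh", "build={}".format(build), "in={}".format(reads)]
    cmd += ["out_{}={}".format(name, path) for name, path in outputs.items()]
    if refstats:
        cmd.append("refstats={}".format(refstats))
    return cmd
```

Each returned list can be passed to subprocess.run(cmd, check=True) on a system where bbsplit.sh is on the PATH; the index command is run once, and the screening command once per sample.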

Handling cross-reference ambiguity

bbsplit.sh ref=closely_related1.fa,closely_related2.fa in=reads.fq \
    basename=out_%.fq ambiguous2=split

Reads mapping to multiple references are written to separate AMBIGUOUS_-prefixed output files, one per reference, for further analysis.

Paired-end output with statistics

bbsplit.sh ref=ref1.fa,ref2.fa,ref3.fa in=reads_1.fq in2=reads_2.fq \
    basename=sorted_%_#.fq refstats=mapping_stats.txt scafstats=scaffold_stats.txt

Creates separate mate files (sorted_ref1_1.fq, sorted_ref1_2.fq, etc.) and writes per-reference statistics to mapping_stats.txt and per-scaffold statistics to scaffold_stats.txt.

Algorithm Details

Cross-Reference Ambiguity Resolution

BBSplit's primary innovation is its handling of cross-reference ambiguity. In standard mapping, an ambiguous read is one that maps to multiple locations within a genome. BBSplit instead distinguishes between within-reference ambiguity (multiple locations in the human genome) and cross-reference ambiguity (mapping to both the human and mouse genomes).

The algorithm uses the getSets() method to extract reference names from scaffold prefixes, comparing HashSet collections (p1, p2, s1, s2) representing primary and secondary reference mappings for read pairs. Reads are marked as cross-reference ambiguous when p1≠p2 or when primary sets don't contain all secondary mappings, enabling precise control over multi-species read assignment.
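
The set comparison described above can be sketched in Python as follows. This is an illustrative reading of the rule, not a port of BBSplitter.java: the function name is hypothetical, and the sketch assumes that an unmapped mate (an empty primary set) does not by itself count as disagreement.

```python
def is_cross_reference_ambiguous(p1, p2, s1=(), s2=()):
    """p1/p2: references of the primary sites of read 1 and read 2.
    s1/s2: references of their secondary sites."""
    p1, p2, s1, s2 = set(p1), set(p2), set(s1), set(s2)
    # Assumption: only compare the primary sets when both mates mapped.
    mates_disagree = bool(p1) and bool(p2) and p1 != p2
    # Ambiguous if any secondary site lies in a reference not covered by the primaries.
    secondaries_elsewhere = not ((s1 | s2) <= (p1 | p2))
    return mates_disagree or secondaries_elsewhere
```

A pair whose mates both map only to human is unambiguous; it becomes cross-reference ambiguous if one mate's primary site is in mouse, or if either mate has a secondary site in mouse.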

Reference Merging and Indexing

During indexing, BBSplit merges multiple reference files into a unified structure while preserving reference identity. The mergeReferences() method processes reference files and generates scaffold names with reference prefixes using '$' delimiter separation (e.g., "bacteria$ecoli_chromosome"). This encoding allows reads to maintain reference identity throughout the alignment pipeline while enabling efficient single-pass mapping.
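
The '$'-delimited naming scheme is simple to illustrate. The Python helpers below are hypothetical names used only for this sketch; they show how a reference name could be attached to a scaffold name and recovered again after mapping.

```python
def prefixed_scaffold(reference, scaffold):
    """Build a merged-index scaffold name such as 'bacteria$ecoli_chromosome'."""
    return "{}${}".format(reference, scaffold)


def split_scaffold(name):
    """Recover (reference, scaffold) by splitting at the first '$'."""
    reference, sep, scaffold = name.partition("$")
    if not sep:
        raise ValueError("no reference prefix in scaffold name: " + name)
    return reference, scaffold
```

Splitting at the first '$' means the reference name itself must not contain that character, while the original scaffold name may.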

The merged reference file uses a hash-based naming scheme (merged_ref_[key].fa.gz) derived from the reference set composition, enabling automatic index reuse when the same reference combination is encountered again.
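
To show why a key derived from the reference set allows index reuse, here is a minimal Python sketch. How BBSplit actually computes the key is not described here, so the use of SHA-1, the 16-character truncation, and the function name are all assumptions made for illustration.

```python
import hashlib


def merged_reference_name(refs):
    """refs: dict mapping a reference name to a list of fasta paths.

    Returns a deterministic file name for this reference combination.
    """
    # Sort names and paths so the key does not depend on input order (an assumption).
    parts = sorted(
        "{}={}".format(name, ",".join(sorted(paths))) for name, paths in refs.items()
    )
    key = hashlib.sha1("\n".join(parts).encode("utf-8")).hexdigest()[:16]
    return "merged_ref_{}.fa.gz".format(key)
```

Because the same combination of references always produces the same file name, a later run can check whether that file already exists and skip rebuilding the merged reference.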

Multi-Engine Mapping Support

BBSplit supports multiple alignment engines optimized for different data types:

Statistics and Counting

BBSplit maintains detailed statistics through LinkedHashMap collections tracking both reference-level and scaffold-level metrics. The SetCount class maintains six synchronized counters: mappedReads, ambiguousReads, assignedReads, mappedBases, ambiguousBases, and assignedBases.

Statistics are updated during read processing through the addToScafCounts() method, which handles both unambiguous assignments and proportional counting for cross-reference ambiguous reads. This provides accurate quantification of taxonomic composition in metagenomic samples.
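
The six counters can be pictured with a small Python sketch. The field names follow the counters listed above, but the update rules, the record helper, and the lack of thread safety are simplifications assumed for illustration; proportional counting for ambiguous reads is not modeled.

```python
from dataclasses import dataclass


@dataclass
class SetCount:
    name: str
    mapped_reads: int = 0
    ambiguous_reads: int = 0
    assigned_reads: int = 0
    mapped_bases: int = 0
    ambiguous_bases: int = 0
    assigned_bases: int = 0

    def add(self, read_length, ambiguous=False, assigned=True):
        # Every call represents one read that mapped to this reference.
        self.mapped_reads += 1
        self.mapped_bases += read_length
        if ambiguous:
            self.ambiguous_reads += 1
            self.ambiguous_bases += read_length
        if assigned:
            self.assigned_reads += 1
            self.assigned_bases += read_length


def record(counts, reference, read_length, ambiguous=False, assigned=True):
    """counts: dict of reference name -> SetCount (insertion-ordered, like a LinkedHashMap)."""
    counts.setdefault(reference, SetCount(reference)).add(read_length, ambiguous, assigned)
```

Keeping read and base counters side by side lets a report express each reference's share either by read count or by sequenced bases.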

Memory Optimization

BBSplit uses automatic memory detection through calcXmx() with platform-specific optimizations. Memory allocation is based on reference size and thread count, with compression level 2 for intermediate files. The tool employs backpressure mechanisms to prevent memory exhaustion during large metagenomic analyses.

When to Use BBSplit vs Alternatives

BBSplit is ideal for:

Use Seal instead when:

Use standard BBMap when:

Important Notes
