BBSplit
Multi-reference read mapping tool for metagenomics and contamination detection. Maps reads to multiple references simultaneously and determines the best-matching genome for each read, with separate handling for reads that are ambiguous between references.
Overview
BBSplit internally uses the BBMap alignment engine to map reads to multiple genomes simultaneously, determining which genome each read matches best. This approach is fundamentally different from ordinary single-reference mapping because it focuses on resolving ambiguity between references rather than within a single reference.
For example, if a read maps ambiguously to multiple locations within the human genome, that internal ambiguity is irrelevant when the goal is distinguishing human reads from mouse reads. BBSplit tracks this additional cross-reference ambiguity information and provides specialized tools for handling it.
Primary Applications
- Metagenomics: Binning and refining metagenomic reads by taxonomic origin
- Contamination Detection: Separating host reads from pathogen sequences
- Multi-species Studies: Parsing reads from mixed-species samples
- Quality Control: Quantifying cross-contamination between samples
Basic Usage
bbsplit.sh ref=x.fa,y.fa in=reads.fq basename=o%.fq
This is equivalent to:
bbsplit.sh build=1 in=reads.fq ref_x=x.fa ref_y=y.fa out_x=ox.fq out_y=oy.fq
Two-step process:
To index:
bbsplit.sh build=1 ref_x=reference1.fa ref_y=reference2.fa
To map:
bbsplit.sh build=1 in=reads.fq out_x=output1.fq out_y=output2.fq
Parameters
BBSplit supports almost all BBMap parameters and adds parameters for multi-reference mapping. Parameters are grouped by function:
Indexing Parameters
Required when building the index. BBSplit creates specialized indexes that track which sequences came from which reference file.
- ref=<file,file>
- A list of references, or directories containing fasta files.
- ref_<name>=<ref.fa>
- An alternate, longer way to specify references, e.g., ref_ecoli=ecoli.fa. Each value can also be a comma-delimited list of files, e.g., ref_bacteria=ecoli.fa,salmonella.fa,klebsiella.fa
- build=<1>
- Designate index to use. Corresponds to the number specified when building the index.
- path=<.>
- Specify the location to write the index, if you don't want it in the current working directory.
Input Parameters
- in=<reads.fq>
- Primary reads input; required parameter.
- in2=<reads2.fq>
- For paired reads in two files.
- qin=<auto>
- Set to 33 or 64 to specify input quality value ASCII offset.
- interleaved=<auto>
- True forces paired/interleaved input; false forces single-ended mapping. If not specified, interleaved status will be autodetected from read names.
Mapping Parameters
- maxindel=<20>
- Don't look for indels longer than this. Lower is faster. Set to >=100k for RNA-seq.
- minratio=<0.56>
- Fraction of max alignment score required to keep a site. Higher is faster.
- minhits=<1>
- Minimum number of seed hits required for candidate sites. Higher is faster.
- ambiguous=<best>
- Set behavior on ambiguously-mapped reads (with multiple top-scoring mapping locations).
- best - use the first best site
- toss - consider unmapped
- random - select one top-scoring site randomly
- all - retain all top-scoring sites (does not work yet with SAM output)
- ambiguous2=<best>
- Set behavior only for reads that map ambiguously to multiple different references. This is BBSplit's key distinguishing parameter: it handles cross-reference ambiguity separately from within-reference ambiguity. The normal ambiguous= flag controls behavior for all ambiguous reads; ambiguous2= applies only to reads that are ambiguous between references and ignores reads that are ambiguous within a single reference.
- best - use the first best site
- toss - consider unmapped
- all - write a copy to the output for each reference to which it maps
- split - write a copy to the AMBIGUOUS_ output for each reference to which it maps
- qtrim=<true>
- Quality-trim ends to Q5 before mapping. Options are 'l' (left), 'r' (right), and 'lr' (both).
- untrim=<true>
- Undo trimming after mapping. Untrimmed bases will be soft-clipped in cigar strings.
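The ambiguous2 options listed above can be pictured as a routing decision for each cross-reference ambiguous read. The following is a minimal shell sketch, not BBSplit's implementation: route_ambiguous2 is a hypothetical helper, and the printed labels (out_<name>, AMBIGUOUS_<name>, unmapped) are illustrative stand-ins for the corresponding outputs.

```sh
# Hypothetical helper (not part of BBTools): print where a read that maps
# ambiguously to several references would go under each ambiguous2 mode.
# Usage: route_ambiguous2 MODE REF [REF...]   (REFs listed best-first)
route_ambiguous2() {
    mode=$1
    shift
    case "$mode" in
        best)  echo "out_$1" ;;                                    # first best site only
        toss)  echo "unmapped" ;;                                  # treated as unmapped
        all)   for ref in "$@"; do echo "out_$ref"; done ;;        # one copy per reference
        split) for ref in "$@"; do echo "AMBIGUOUS_$ref"; done ;;  # one copy per AMBIGUOUS_ output
        *)     echo "unknown mode: $mode" >&2; return 1 ;;
    esac
}

# A read that matches human and mouse equally well, routed with ambiguous2=split
route_ambiguous2 split human mouse
```

With split, the read stays out of the main per-reference outputs, which keeps those files limited to confidently assigned reads.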
Output Parameters
- out_<name>=<file>
- Output reads that map to the reference <name> to <file>.
- basename=prefix%suffix
- Equivalent to multiple out_%=prefix%suffix expressions, in which each % is replaced by the name of a reference file. By default paired reads will yield interleaved output, but you can use the # symbol to produce twin output files. For example, basename=o%_#.fq will produce ox_1.fq, ox_2.fq, oy_1.fq, and oy_2.fq.
- bs=<file>
- Write a shell script to 'file' that will turn the sam output into a sorted, indexed bam file.
- scafstats=<file>
- Write statistics on how many reads mapped to which scaffold to this file.
- refstats=<file>
- Write statistics on how many reads were assigned to which reference to this file. Unmapped reads whose mate mapped to a reference are considered assigned and will be counted.
- nzo=t
- Only print lines with nonzero coverage.
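The % and # substitutions described for basename can be previewed before a run. This is a rough sketch using sed; expand_basename is a hypothetical helper and only mimics the naming rule as described above, assuming simple reference names without characters that are special to sed.

```sh
# Hypothetical helper (not part of BBTools): expand a basename pattern.
# % is replaced by the reference name, # by the mate number (1 or 2).
# Usage: expand_basename PATTERN REF [MATE]
expand_basename() {
    pattern=$1
    ref=$2
    mate=${3:-}
    printf '%s\n' "$pattern" | sed -e "s|%|$ref|g" -e "s|#|$mate|g"
}

# Preview the files implied by basename=o%_#.fq with ref=x.fa,y.fa
for ref in x y; do
    for mate in 1 2; do
        expand_basename 'o%_#.fq' "$ref" "$mate"
    done
done
```

The loop prints ox_1.fq, ox_2.fq, oy_1.fq, and oy_2.fq, matching the example given for basename.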
Java Parameters
- -Xmx
- This will set Java's memory usage, overriding autodetection. -Xmx20g will specify 20 gigs of RAM, and -Xmx200m will specify 200 megs. The max is typically 85% of physical memory.
- -eoom
- This flag will cause the process to exit if an out-of-memory exception occurs. Requires Java 8u92+.
- -da
- Disable assertions.
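To make the 85% figure concrete, the sketch below derives an -Xmx value from a machine's total memory. xmx_flag is a hypothetical helper, not something BBTools provides, and BBSplit's own autodetection may choose a different value.

```sh
# Hypothetical helper: build an -Xmx flag from total physical memory in KB,
# using the 85% figure quoted above (integer arithmetic, result in megabytes).
# On Linux the input could come from: awk '/^MemTotal:/ {print $2}' /proc/meminfo
xmx_flag() {
    total_kb=$1
    echo "-Xmx$(( total_kb * 85 / 100 / 1024 ))m"
}

# 16 GiB of RAM is 16777216 KB
xmx_flag 16777216    # prints -Xmx13926m
```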
Examples
Metagenomic binning
bbsplit.sh ref_bacteria=ecoli.fa,salmonella.fa ref_human=human_chr.fa \
in=metagenome.fq basename=binned_%.fq refstats=taxonomy_stats.txt
Separates metagenomic reads into bacterial and human components, generating binned_bacteria.fq and binned_human.fq, and writes per-reference assignment counts to taxonomy_stats.txt.
Contamination detection in RNA-seq
bbsplit.sh ref_mouse=mouse_genome.fa ref_human=human_genome.fa \
in=rnaseq.fq basename=species_%.fq maxindel=200k ambiguous2=toss
Identifies cross-species contamination in RNA-seq data, discarding reads that map ambiguously to both species.
Multi-pathogen detection
# First, build pathogen database
bbsplit.sh build=1 ref_covid=sars_cov2.fa ref_flu=influenza_h1n1.fa ref_rsv=rsv.fa
# Then screen clinical samples
bbsplit.sh build=1 in=clinical_sample.fq out_covid=covid_reads.fq \
out_flu=flu_reads.fq out_rsv=rsv_reads.fq refstats=pathogen_counts.txt
Screens clinical samples for multiple respiratory pathogens simultaneously.
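Once the index exists, the mapping step can be repeated for each sample. The sketch below only prints the commands it would run, so it can be inspected first; screen_cmd and the per-sample output names are hypothetical, while the bbsplit.sh flags are the ones used in the example above.

```sh
# Hypothetical helper: print the screening command for one sample.
# Output names are derived from the sample file name by dropping ".fq".
screen_cmd() {
    sample=$1
    stem=${sample%.fq}
    echo "bbsplit.sh build=1 in=$sample out_covid=${stem}_covid.fq out_flu=${stem}_flu.fq out_rsv=${stem}_rsv.fq refstats=${stem}_counts.txt"
}

# Dry run: print one command per sample instead of executing it
for sample in sample_A.fq sample_B.fq; do
    screen_cmd "$sample"
done
```

Piping the printed lines to sh would execute them, assuming bbsplit.sh is on the PATH and the index from the build step is in the working directory.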
Handling cross-reference ambiguity
bbsplit.sh ref=closely_related1.fa,closely_related2.fa in=reads.fq \
basename=out_%.fq ambiguous2=split
Reads that map to both references are written to separate AMBIGUOUS_-prefixed output files, one per reference, for further analysis.
Paired-end output with statistics
bbsplit.sh ref=ref1.fa,ref2.fa,ref3.fa in=reads_1.fq in2=reads_2.fq \
basename=sorted_%_#.fq refstats=mapping_stats.txt scafstats=scaffold_stats.txt
Creates separate mate files (sorted_ref1_1.fq, sorted_ref1_2.fq, etc.), and writes per-reference statistics to mapping_stats.txt and per-scaffold statistics to scaffold_stats.txt.
Algorithm Details
Cross-Reference Ambiguity Resolution
BBSplit's main distinguishing feature is its handling of cross-reference ambiguity. In standard mapping, an ambiguous read is one that maps to multiple locations within a genome. BBSplit separates within-reference ambiguity (for example, multiple locations in the human genome) from cross-reference ambiguity (for example, mapping to both the human and mouse genomes).
The algorithm uses the getSets() method to extract reference names from scaffold prefixes, comparing HashSet collections (p1, p2, s1, s2) representing primary and secondary reference mappings for read pairs. Reads are marked as cross-reference ambiguous when p1≠p2 or when primary sets don't contain all secondary mappings, enabling precise control over multi-species read assignment.
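The set comparison described above can be paraphrased in a few lines of shell. This is a simplified sketch, not the Java implementation: sets are written as space-separated reference names, both mates are assumed to be mapped, and is_subset and is_cross_ambiguous are hypothetical helper names.

```sh
# is_subset "A" "B": succeed if every name in A also appears in B
is_subset() {
    for item in $1; do
        case " $2 " in
            *" $item "*) ;;
            *) return 1 ;;
        esac
    done
    return 0
}

# is_cross_ambiguous P1 P2 S1 S2: succeed if the pair is cross-reference ambiguous.
# P1/P2 are the primary reference sets of each mate, S1/S2 the secondary sets.
is_cross_ambiguous() {
    p1=$1
    p2=$2
    s1=$3
    s2=$4
    # The mates' primary reference sets differ (p1 != p2)
    if ! is_subset "$p1" "$p2" || ! is_subset "$p2" "$p1"; then return 0; fi
    # A secondary mapping hits a reference that is missing from the primary set
    if ! is_subset "$s1" "$p1" || ! is_subset "$s2" "$p2"; then return 0; fi
    return 1
}

# Both mates map primarily to human, but mate 1 also has a secondary hit in mouse
if is_cross_ambiguous "human" "human" "mouse" ""; then
    echo "cross-reference ambiguous"
fi
```

A pair whose secondary hits all fall inside the same reference as its primary hits is not flagged, which is the within-reference case that ambiguous2 ignores.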
Reference Merging and Indexing
During indexing, BBSplit merges multiple reference files into a unified structure while preserving reference identity. The mergeReferences() method processes reference files and generates scaffold names with reference prefixes using '$' delimiter separation (e.g., "bacteria$ecoli_chromosome"). This encoding allows reads to maintain reference identity throughout the alignment pipeline while enabling efficient single-pass mapping.
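Given the '$' delimiter described above, the reference name can be recovered from a merged scaffold name with plain parameter expansion. A small sketch, with ref_of_scaffold and scaffold_of as hypothetical helper names:

```sh
# Hypothetical helpers: split a merged scaffold name at the first '$'.
ref_of_scaffold() { printf '%s\n' "${1%%\$*}"; }   # text before the first '$'
scaffold_of()     { printf '%s\n' "${1#*\$}"; }    # text after the first '$'

name='bacteria$ecoli_chromosome'
ref_of_scaffold "$name"    # prints bacteria
scaffold_of "$name"        # prints ecoli_chromosome
```

Single quotes around the scaffold name keep the shell from treating the '$' as the start of a variable.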
The merged reference file uses a hash-based naming scheme (merged_ref_[key].fa.gz) derived from the reference set composition, enabling automatic index reuse when the same reference combination is encountered again.
Multi-Engine Mapping Support
BBSplit supports multiple alignment engines optimized for different data types:
- MAP_NORMAL: Standard BBMap for Illumina reads
- MAP_ACC: BBMapAcc with minratio=0.4 for increased sensitivity
- MAP_PACBIO: BBMapPacBio for long reads with high indel rates
- MAP_PACBIOSKIMMER: BBMapPacBioSkimmer for comprehensive alignment discovery
Statistics and Counting
BBSplit maintains detailed statistics through LinkedHashMap collections tracking both reference-level and scaffold-level metrics. The SetCount class maintains six synchronized counters: mappedReads, ambiguousReads, assignedReads, mappedBases, ambiguousBases, and assignedBases.
Statistics are updated during read processing through the addToScafCounts() method, which handles both unambiguous assignments and proportional counting for cross-reference ambiguous reads. This provides accurate quantification of taxonomic composition in metagenomic samples.
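The reference-level tally can be illustrated with a short awk sketch. It covers only the unambiguous case: the input is assumed to be one merged scaffold name per assigned read, in the "reference$scaffold" form described earlier, and count_by_reference is a hypothetical helper rather than anything BBSplit provides. It does not attempt the proportional counting used for cross-reference ambiguous reads.

```sh
# Hypothetical helper: count assigned reads per reference.
# Reads scaffold names from stdin, one per read, e.g. bacteria$ecoli_chr
count_by_reference() {
    awk -F'[$]' '{ n[$1]++ } END { for (r in n) print r, n[r] }' | sort
}

# Three reads: two assigned to the bacteria reference, one to human
printf '%s\n' 'bacteria$ecoli_chr' 'human$chr1' 'bacteria$salmonella_chr' |
    count_by_reference
```

For this input the output is "bacteria 2" followed by "human 1".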
Memory Optimization
BBSplit uses automatic memory detection through calcXmx() with platform-specific optimizations. Memory allocation is based on reference size and thread count, with compression level 2 for intermediate files. The tool employs backpressure mechanisms to prevent memory exhaustion during large metagenomic analyses.
When to Use BBSplit vs Alternatives
BBSplit is ideal for:
- References longer than read length (typically >150bp for Illumina)
- Distinguishing between different species or strains
- Metagenomic read binning and taxonomic assignment
- Contamination detection and removal
- Multi-reference quantification
Use Seal instead when:
- Reference sequences are shorter than read length
- Searching for specific short sequences or adapters
- Exact kmer matching is sufficient
Use standard BBMap when:
- Mapping to a single reference genome
- Standard genomic or transcriptomic analysis
- Variant calling or SNP detection
Important Notes
- BBMap compatibility: Almost all BBMap parameters can be used with BBSplit; run bbmap.sh for the complete parameter list
- Index requirement: BBSplit can only be run using references indexed with BBSplit, as they contain essential reference tracking information
- Disk usage: The 'nodisk' flag is not supported by BBSplit due to reference merging requirements
- Output formats: BBSplit is recommended for fastq and fasta output, not for sam/bam output
- Paired reads: By default, paired reads yield interleaved output unless the # symbol is used in basename
- Reference length: When reference sequences are shorter than read length, use Seal instead of BBSplit
Support
For questions and support:
- Email: bbushnell@lbl.gov
- Documentation: bbmap.org