BBMapSkimmer

Core Concept: Multi-Site vs Single-Best Alignment

BBMapSkimmer fundamentally differs from standard BBMap in its alignment strategy:

Standard BBMap: Attempts to find the single best alignment site, plus some nearly-as-good sites to quantify ambiguity
BBMapSkimmer: Finds all alignment sites above a minimum score threshold, expecting many valid alignments with different scores

This approach is essential when each read is expected to have multiple correct alignment locations, such as when mapping short reads to long reads containing repetitive sequences or when analyzing multi-copy genes.

Primary Use Case: Illumina-to-PacBio Error Correction

BBMapSkimmer was originally designed for mapping Illumina reads to PacBio reads for error correction. In this workflow:

PacBio reads serve as the "reference" (long but error-prone)
Illumina reads provide high-accuracy short sequences
Each Illumina read may correctly align to multiple locations on a PacBio read
All alignment sites are needed to build consensus and correct PacBio errors

The multi-site approach is crucial because dismissing "ambiguous" alignments would lose valuable error correction information from repetitive regions.

Basic Usage

bbmapskimmer.sh ref=<reference> in=<reads> out=<sam>

BBMapSkimmer uses identical syntax to BBMap but applies different default parameters optimized for multi-site detection and PacBio processing. All BBMap parameters are supported.

Key Differences from Standard BBMap

Default Parameter Changes

fastareadlen=6000: Increased from 500 to handle PacBio read lengths up to 6kb
minratio=0.40: Reduced from 0.76 to capture more alignment sites above threshold
ambig=all: Changed from 'best' to retain all top-scoring alignment sites
secondary=true: Secondary alignments enabled by default for multi-site output
overwrite=true: Allow overwriting output files by default
build=1: Default build number for index management

Algorithm Specializations

MultiStateAligner9PacBio: Uses PacBio-optimized alignment algorithm designed for long reads with high indel rates
Adjusted key density: keyDensity=3.3, maxKeyDensity=4.3, minKeyDensity=1.8 for better long-read sensitivity
Enhanced site reporting: MAX_SITESCORES_TO_PRINT=500 allows up to 500 alignment sites per read
Extended search parameters: TIP_SEARCH_DIST=15, SLOW_ALIGN_PADDING=8 for better indel detection at read ends

Parameters

BBMapSkimmer accepts all BBMap parameters. The most relevant parameters for multi-site alignment are highlighted:

Multi-Site Alignment Parameters

ambiguous=all

(ambig) Critical parameter controlling multi-site behavior:

all - Retain all top-scoring sites (default for skimmer)
best - Use first best site only
random - Select one top-scoring site randomly
toss - Consider ambiguous reads unmapped

secondary=t

Print secondary alignments. Essential for multi-site analysis (enabled by default).

maxsites=500

Maximum number of alignment sites to print per read. BBMapSkimmer defaults to higher values.

sssr=0.95

(secondarysitescoreratio) Print secondary alignments with score ≥ this fraction of primary alignment score.

minratio=0.40

Minimum alignment score ratio. Lowered in skimmer to capture more sites (BBMap default: 0.76).

minid=0.40

Approximate minimum percent identity. Corresponds to the minratio setting.

PacBio-Optimized Parameters

fastareadlen=6000: Break FASTA reads longer than this length. Maximum for PacBio mode is 6000bp vs 500bp for standard BBMap.
maxindel=16000: Maximum indel length to search for. PacBio reads often contain large indels.
tipsearch=15: Distance to search for read-end deletions. Optimized for PacBio error profiles.
rescuedist=1200: Distance for paired read rescue operations, adjusted for longer PacBio inserts.
rescuemismatches=32: Maximum mismatches allowed in rescued reads, higher tolerance for PacBio errors.

Indexing Parameters

ref=<file>: Reference sequence file. Required for initial index building.
build=1: Build number for index management. Default is 1 (BBMap requires manual specification).
k=12: Kmer length for indexing. BBMapSkimmer defaults to 12 vs 13 for BBMap.
usemodulo=f: Reduce memory usage by ~50% at slight sensitivity cost.
nodisk=f: Build index in memory only, don't write to disk.

Input/Output Parameters

in=<file>: Input reads file (FASTA/FASTQ, optionally compressed).
in2=<file>: Second input file for paired reads.
out=<file>: Output SAM/BAM file with all alignment sites.
outu=<file>: File for unmapped reads only.
outm=<file>: File for mapped reads only (includes pairs with one mapped read).
overwrite=t: Allow overwriting existing files (enabled by default in skimmer).

Performance Control Parameters

fast=f: Enable fast mode with reduced sensitivity but higher speed.
slow=f: Enable slow mode with increased sensitivity. 'vslow' provides maximum sensitivity.
threads=auto: Number of processing threads (default: all available cores).
minApproxHits=2: Minimum approximate hits required to keep candidate sites.

Quality and Filtering Parameters

qtrim=f: Quality trimming mode: f (none), l (left), r (right), lr (both).
trimq=6: Quality threshold for trimming.
untrim=f: Restore trimmed bases as soft-clipped in output.
minaveragequality=0: Skip reads with average quality below this value.

SAM Output Parameters

sam=1.4: SAM format version. 1.4 uses = and X in CIGAR, 1.3 uses M only.
cigar=t: Generate CIGAR strings (disable with f for faster processing).
mdtag=f: Include MD tags in SAM output.
nhtag=f: Include NH (number of hits) tags.
xmtag=f: Include XM (number of mismatches) tags.

Statistics and Analysis Parameters

scafstats=<file>: Output per-scaffold mapping statistics.
ihist=<file>: Insert size histogram for paired reads.
ehist=<file>: Error rate histogram.
idhist=<file>: Identity distribution histogram.
covstats=<file>: Coverage statistics per scaffold.

Memory Management Parameters

-Xmx<size>: Java maximum heap size (e.g., -Xmx32g for 32GB). Index requires ~6 bytes per reference base.
-Xms<size>: Java initial heap size. BBMapSkimmer automatically sets this equal to -Xmx.
-eoom: Exit on out-of-memory error rather than crashing (requires Java 8u92+).

Algorithm Details

Multi-Site Detection Strategy

BBMapSkimmer implements a fundamentally different alignment philosophy:

Exhaustive site enumeration: Rather than stopping after finding the best alignment, continues searching for all sites above the minimum score threshold
Score-based filtering: Uses MINIMUM_ALIGNMENT_SCORE_RATIO=0.45 (vs 0.76 in BBMap) to cast a wider net for potential alignment sites
Enhanced site storage: Can retain up to MAX_SITESCORES_TO_PRINT=500 sites per read compared to typical 3-5 in standard alignment
Ambiguity preservation: Default ambig=all ensures no potentially valid alignment sites are discarded

PacBio-Optimized Alignment Engine

Uses MultiStateAligner9PacBio with specialized parameters for long read processing:

Extended alignment matrices: Accommodates reads up to 6000bp vs 500bp in standard BBMap
PacBio error profile: Optimized scoring for high indel rates typical of PacBio sequencing
Enhanced indel detection: TIP_SEARCH_DIST=15 for better detection of insertions/deletions at read termini
Adaptive key density: keyDensity=3.3, maxKeyDensity=4.3 provides better sensitivity for long reads with variable quality

Index Utilization Strategy

Optimized index usage for multi-site discovery:

Reduced exclusion fraction: Keeps more frequent kmers that might be excluded in single-best alignment
Genome size scaling: Automatically adjusts parameters based on reference genome size (<30M, 30-100M, 100-300M, >300M)
Extended hit retention: MIN_APPROX_HITS_TO_KEEP=2 ensures potential sites aren't prematurely discarded

Performance Characteristics

Memory usage: Similar base requirements to BBMap (~6 bytes per reference base) but may use additional RAM for storing multiple sites per read
Speed: Slower than BBMap due to exhaustive site enumeration, but optimized for the specific use cases where multiple sites are essential
Sensitivity vs specificity: Higher sensitivity for multi-mapping detection but may report more sites than needed for single-best alignment applications

Examples

Error Correction Workflow (Primary Use Case)

# Map Illumina reads to PacBio reads for error correction
# Step 1: Use PacBio reads as reference
bbmapskimmer.sh ref=pacbio_reads.fasta build=1

# Step 2: Map Illumina reads, retain all sites
bbmapskimmer.sh in=illumina_reads.fq out=error_correction.sam \
    ambig=all secondary=t maxsites=100

Basic Multi-Site Analysis

# Find all alignment sites above threshold
bbmapskimmer.sh ref=genome.fasta in=reads.fq out=multisites.sam \
    minratio=0.3 secondary=t ambig=all

Repetitive Element Mapping

# Analyze repetitive sequences with comprehensive site detection
bbmapskimmer.sh ref=genome.fasta in=reads.fq out=repeats.sam \
    minid=0.4 maxindel=5000 secondary=t maxsites=200 \
    ambig=all sssr=0.7

High-Sensitivity PacBio Processing

# Maximum sensitivity for difficult PacBio sequences
bbmapskimmer.sh ref=genome.fasta in=pacbio.fq out=sensitive.sam \
    vslow=t secondary=t ambig=all maxsites=500 \
    minratio=0.2 maxindel=20000

Fast Mode for Large Datasets

# Faster processing with reduced sensitivity
bbmapskimmer.sh ref=genome.fasta in=reads.fq out=fast.sam \
    fast=t secondary=t ambig=all maxsites=50

Memory-Optimized Processing

# Process large genomes with memory constraints
bbmapskimmer.sh ref=large_genome.fasta usemodulo=t build=1
bbmapskimmer.sh in=reads.fq out=mapped.sam usemodulo=t \
    ambig=all secondary=t -Xmx16g

Use Cases and Applications

Primary Applications

Illumina-to-PacBio Error Correction: Original design purpose: mapping accurate short reads to error-prone long reads to build consensus and correct sequencing errors. Essential for PacBio polishing workflows.
Multi-Copy Gene Analysis: Analyzing reads that should legitimately map to multiple locations, such as ribosomal RNA genes, histone clusters, or other multi-copy loci.
Repetitive Element Studies: Comprehensive mapping of reads to all instances of repetitive elements, transposons, or tandem repeats where standard "unique mapping" would miss important sites.
Structural Variation Detection: Finding reads with multiple alignment sites that may indicate duplications, inversions, or complex rearrangements requiring multi-site analysis.
Paralog Identification: Mapping reads to all members of gene families or pseudogenes to understand sequence relationships and evolution.

Specialized Workflows

Chimeric read detection: Identifying reads with multiple non-overlapping alignment regions
Assembly validation: Verifying that reads map consistently across duplicated or repetitive assembly regions
Metagenomics: Mapping reads that may originate from multiple related organisms or strains
Ancient DNA analysis: Processing degraded DNA that may produce fragmented alignments requiring multiple sites

When NOT to Use BBMapSkimmer

Standard genomic analysis: When only the best alignment per read is needed
RNA-seq quantification: When unique mappings are required for accurate expression estimates
Variant calling: When single-best alignments provide cleaner variant calls
Limited computational resources: When speed is more important than comprehensive site detection
Short read processing: When standard BBMap parameters are sufficient

Considerations for Use

Computational cost: Slower than standard BBMap due to exhaustive site search
Output size: Larger SAM files due to multiple alignments per read
Downstream analysis: Ensure downstream tools can handle multiple alignments per read
Memory requirements: May need more RAM when processing highly repetitive sequences

Technical Notes

Relationship to Other BBMap Variants

BBMap: Standard mapper focused on finding single best alignment
BBMapSkimmer: Multi-site mapper for finding all alignments above threshold
mapPacBio.sh: Single-best PacBio mapper with long-read optimizations
BBSplit: Multi-reference mapper for taxonomic binning

Integration with BBTools Ecosystem

BBMapSkimmer integrates with other BBTools for complete workflows:

Reformat: Process input/output files and convert between formats
FilterByTile: Pre-filter reads by sequencing tile quality
Pileup: Generate coverage statistics from multi-site alignments
Stats: Analyze reference sequences before indexing

Performance Optimization Tips

Use usemodulo=t to reduce memory usage by ~50%
Adjust maxsites based on expected repeat content
Use fast=t for initial exploration, vslow=t for comprehensive analysis
Build index once, then map multiple datasets without ref= parameter

Output Interpretation

Primary alignments: Best scoring alignment per read
Secondary alignments: Additional sites above score threshold
NH tag: Number of alignment sites reported for each read
Score tags: BBMap's internal alignment scores (if enabled)

Support and Resources

Getting Help

Primary contact: bbushnell@lbl.gov
Documentation: bbmap.org

Citation

When using BBMapSkimmer in publications, please cite BBMap:

Bushnell, B. (2014). BBMap: A Fast, Accurate, Splice-Aware Aligner. Lawrence Berkeley National Laboratory. LBNL-7065E.