BBMapSkimmer

Script: bbmapskimmer.sh Package: align2 Class: BBMapPacBioSkimmer.java

Multi-site alignment variant of BBMap designed to find all alignment sites above a threshold rather than just the best site. Originally developed for mapping Illumina reads to PacBio reads in error correction workflows, with specialized parameters for long read processing.

Core Concept: Multi-Site vs Single-Best Alignment

BBMapSkimmer fundamentally differs from standard BBMap in its alignment strategy:

  • Standard BBMap: Attempts to find the single best alignment site, plus some nearly-as-good sites to quantify ambiguity
  • BBMapSkimmer: Finds all alignment sites above a minimum score threshold, expecting many valid alignments with different scores

This approach is essential when each read is expected to have multiple correct alignment locations, such as when mapping short reads to long reads containing repetitive sequences or when analyzing multi-copy genes.

Primary Use Case: Illumina-to-PacBio Error Correction

BBMapSkimmer was originally designed for mapping Illumina reads to PacBio reads for error correction. In this workflow:

  • PacBio reads serve as the "reference" (long but error-prone)
  • Illumina reads provide high-accuracy short sequences
  • Each Illumina read may correctly align to multiple locations on a PacBio read
  • All alignment sites are needed to build consensus and correct PacBio errors

The multi-site approach is crucial because dismissing "ambiguous" alignments would lose valuable error correction information from repetitive regions.

Basic Usage

bbmapskimmer.sh ref=<reference> in=<reads> out=<sam>

BBMapSkimmer uses identical syntax to BBMap but applies different default parameters optimized for multi-site detection and PacBio processing. All BBMap parameters are supported.

Key Differences from Standard BBMap

Default Parameter Changes

fastareadlen=6000
Increased from 500 to handle PacBio read lengths up to 6kb
minratio=0.40
Reduced from 0.76 to capture more alignment sites above threshold
ambig=all
Changed from 'best' to retain all top-scoring alignment sites
secondary=true
Secondary alignments enabled by default for multi-site output
overwrite=true
Allow overwriting output files by default
build=1
Default build number for index management

Algorithm Specializations

  • MultiStateAligner9PacBio: Uses PacBio-optimized alignment algorithm designed for long reads with high indel rates
  • Adjusted key density: keyDensity=3.3, maxKeyDensity=4.3, minKeyDensity=1.8 for better long-read sensitivity
  • Enhanced site reporting: MAX_SITESCORES_TO_PRINT=500 allows up to 500 alignment sites per read
  • Extended search parameters: TIP_SEARCH_DIST=15, SLOW_ALIGN_PADDING=8 for better indel detection at read ends

Parameters

BBMapSkimmer accepts all BBMap parameters. The most relevant parameters for multi-site alignment are highlighted:

Multi-Site Alignment Parameters

ambiguous=all
(ambig) Critical parameter controlling multi-site behavior:
  • all - Retain all top-scoring sites (default for skimmer)
  • best - Use first best site only
  • random - Select one top-scoring site randomly
  • toss - Consider ambiguous reads unmapped
secondary=t
Print secondary alignments. Essential for multi-site analysis (enabled by default).
maxsites=500
Maximum number of alignment sites to print per read. BBMapSkimmer defaults to higher values.
sssr=0.95
(secondarysitescoreratio) Print secondary alignments with score ≥ this fraction of primary alignment score.
minratio=0.40
Minimum alignment score ratio. Lowered in skimmer to capture more sites (BBMap default: 0.76).
minid=0.40
Approximate minimum percent identity. Corresponds to the minratio setting.

PacBio-Optimized Parameters

fastareadlen=6000
Break FASTA reads longer than this length. Maximum for PacBio mode is 6000bp vs 500bp for standard BBMap.
maxindel=16000
Maximum indel length to search for. PacBio reads often contain large indels.
tipsearch=15
Distance to search for read-end deletions. Optimized for PacBio error profiles.
rescuedist=1200
Distance for paired read rescue operations, adjusted for longer PacBio inserts.
rescuemismatches=32
Maximum mismatches allowed in rescued reads, higher tolerance for PacBio errors.

Indexing Parameters

ref=<file>
Reference sequence file. Required for initial index building.
build=1
Build number for index management. Default is 1 (BBMap requires manual specification).
k=12
Kmer length for indexing. BBMapSkimmer defaults to 12 vs 13 for BBMap.
usemodulo=f
Reduce memory usage by ~50% at slight sensitivity cost.
nodisk=f
Build index in memory only, don't write to disk.

Input/Output Parameters

in=<file>
Input reads file (FASTA/FASTQ, optionally compressed).
in2=<file>
Second input file for paired reads.
out=<file>
Output SAM/BAM file with all alignment sites.
outu=<file>
File for unmapped reads only.
outm=<file>
File for mapped reads only (includes pairs with one mapped read).
overwrite=t
Allow overwriting existing files (enabled by default in skimmer).

Performance Control Parameters

fast=f
Enable fast mode with reduced sensitivity but higher speed.
slow=f
Enable slow mode with increased sensitivity. 'vslow' provides maximum sensitivity.
threads=auto
Number of processing threads (default: all available cores).
minApproxHits=2
Minimum approximate hits required to keep candidate sites.

Quality and Filtering Parameters

qtrim=f
Quality trimming mode: f (none), l (left), r (right), lr (both).
trimq=6
Quality threshold for trimming.
untrim=f
Restore trimmed bases as soft-clipped in output.
minaveragequality=0
Skip reads with average quality below this value.

SAM Output Parameters

sam=1.4
SAM format version. 1.4 uses = and X in CIGAR, 1.3 uses M only.
cigar=t
Generate CIGAR strings (disable with f for faster processing).
mdtag=f
Include MD tags in SAM output.
nhtag=f
Include NH (number of hits) tags.
xmtag=f
Include XM (number of mismatches) tags.

Statistics and Analysis Parameters

scafstats=<file>
Output per-scaffold mapping statistics.
ihist=<file>
Insert size histogram for paired reads.
ehist=<file>
Error rate histogram.
idhist=<file>
Identity distribution histogram.
covstats=<file>
Coverage statistics per scaffold.

Memory Management Parameters

-Xmx<size>
Java maximum heap size (e.g., -Xmx32g for 32GB). Index requires ~6 bytes per reference base.
-Xms<size>
Java initial heap size. BBMapSkimmer automatically sets this equal to -Xmx.
-eoom
Exit on out-of-memory error rather than crashing (requires Java 8u92+).

Algorithm Details

Multi-Site Detection Strategy

BBMapSkimmer implements a fundamentally different alignment philosophy:

  • Exhaustive site enumeration: Rather than stopping after finding the best alignment, continues searching for all sites above the minimum score threshold
  • Score-based filtering: Uses MINIMUM_ALIGNMENT_SCORE_RATIO=0.45 (vs 0.76 in BBMap) to cast a wider net for potential alignment sites
  • Enhanced site storage: Can retain up to MAX_SITESCORES_TO_PRINT=500 sites per read compared to typical 3-5 in standard alignment
  • Ambiguity preservation: Default ambig=all ensures no potentially valid alignment sites are discarded

PacBio-Optimized Alignment Engine

Uses MultiStateAligner9PacBio with specialized parameters for long read processing:

  • Extended alignment matrices: Accommodates reads up to 6000bp vs 500bp in standard BBMap
  • PacBio error profile: Optimized scoring for high indel rates typical of PacBio sequencing
  • Enhanced indel detection: TIP_SEARCH_DIST=15 for better detection of insertions/deletions at read termini
  • Adaptive key density: keyDensity=3.3, maxKeyDensity=4.3 provides better sensitivity for long reads with variable quality

Index Utilization Strategy

Optimized index usage for multi-site discovery:

  • Reduced exclusion fraction: Keeps more frequent kmers that might be excluded in single-best alignment
  • Genome size scaling: Automatically adjusts parameters based on reference genome size (<30M, 30-100M, 100-300M, >300M)
  • Extended hit retention: MIN_APPROX_HITS_TO_KEEP=2 ensures potential sites aren't prematurely discarded

Performance Characteristics

  • Memory usage: Similar base requirements to BBMap (~6 bytes per reference base) but may use additional RAM for storing multiple sites per read
  • Speed: Slower than BBMap due to exhaustive site enumeration, but optimized for the specific use cases where multiple sites are essential
  • Sensitivity vs specificity: Higher sensitivity for multi-mapping detection but may report more sites than needed for single-best alignment applications

Examples

Error Correction Workflow (Primary Use Case)

# Map Illumina reads to PacBio reads for error correction
# Step 1: Use PacBio reads as reference
bbmapskimmer.sh ref=pacbio_reads.fasta build=1

# Step 2: Map Illumina reads, retain all sites
bbmapskimmer.sh in=illumina_reads.fq out=error_correction.sam \
    ambig=all secondary=t maxsites=100

Basic Multi-Site Analysis

# Find all alignment sites above threshold
bbmapskimmer.sh ref=genome.fasta in=reads.fq out=multisites.sam \
    minratio=0.3 secondary=t ambig=all

Repetitive Element Mapping

# Analyze repetitive sequences with comprehensive site detection
bbmapskimmer.sh ref=genome.fasta in=reads.fq out=repeats.sam \
    minid=0.4 maxindel=5000 secondary=t maxsites=200 \
    ambig=all sssr=0.7

High-Sensitivity PacBio Processing

# Maximum sensitivity for difficult PacBio sequences
bbmapskimmer.sh ref=genome.fasta in=pacbio.fq out=sensitive.sam \
    vslow=t secondary=t ambig=all maxsites=500 \
    minratio=0.2 maxindel=20000

Fast Mode for Large Datasets

# Faster processing with reduced sensitivity
bbmapskimmer.sh ref=genome.fasta in=reads.fq out=fast.sam \
    fast=t secondary=t ambig=all maxsites=50

Memory-Optimized Processing

# Process large genomes with memory constraints
bbmapskimmer.sh ref=large_genome.fasta usemodulo=t build=1
bbmapskimmer.sh in=reads.fq out=mapped.sam usemodulo=t \
    ambig=all secondary=t -Xmx16g

Use Cases and Applications

Primary Applications

Illumina-to-PacBio Error Correction
Original design purpose: mapping accurate short reads to error-prone long reads to build consensus and correct sequencing errors. Essential for PacBio polishing workflows.
Multi-Copy Gene Analysis
Analyzing reads that should legitimately map to multiple locations, such as ribosomal RNA genes, histone clusters, or other multi-copy loci.
Repetitive Element Studies
Comprehensive mapping of reads to all instances of repetitive elements, transposons, or tandem repeats where standard "unique mapping" would miss important sites.
Structural Variation Detection
Finding reads with multiple alignment sites that may indicate duplications, inversions, or complex rearrangements requiring multi-site analysis.
Paralog Identification
Mapping reads to all members of gene families or pseudogenes to understand sequence relationships and evolution.

Specialized Workflows

  • Chimeric read detection: Identifying reads with multiple non-overlapping alignment regions
  • Assembly validation: Verifying that reads map consistently across duplicated or repetitive assembly regions
  • Metagenomics: Mapping reads that may originate from multiple related organisms or strains
  • Ancient DNA analysis: Processing degraded DNA that may produce fragmented alignments requiring multiple sites

When NOT to Use BBMapSkimmer

  • Standard genomic analysis: When only the best alignment per read is needed
  • RNA-seq quantification: When unique mappings are required for accurate expression estimates
  • Variant calling: When single-best alignments provide cleaner variant calls
  • Limited computational resources: When speed is more important than comprehensive site detection
  • Short read processing: When standard BBMap parameters are sufficient

Considerations for Use

  • Computational cost: Slower than standard BBMap due to exhaustive site search
  • Output size: Larger SAM files due to multiple alignments per read
  • Downstream analysis: Ensure downstream tools can handle multiple alignments per read
  • Memory requirements: May need more RAM when processing highly repetitive sequences

Technical Notes

Relationship to Other BBMap Variants

  • BBMap: Standard mapper focused on finding single best alignment
  • BBMapSkimmer: Multi-site mapper for finding all alignments above threshold
  • mapPacBio.sh: Single-best PacBio mapper with long-read optimizations
  • BBSplit: Multi-reference mapper for taxonomic binning

Integration with BBTools Ecosystem

BBMapSkimmer integrates with other BBTools for complete workflows:

  • Reformat: Process input/output files and convert between formats
  • FilterByTile: Pre-filter reads by sequencing tile quality
  • Pileup: Generate coverage statistics from multi-site alignments
  • Stats: Analyze reference sequences before indexing

Performance Optimization Tips

  • Use usemodulo=t to reduce memory usage by ~50%
  • Adjust maxsites based on expected repeat content
  • Use fast=t for initial exploration, vslow=t for comprehensive analysis
  • Build index once, then map multiple datasets without ref= parameter

Output Interpretation

  • Primary alignments: Best scoring alignment per read
  • Secondary alignments: Additional sites above score threshold
  • NH tag: Number of alignment sites reported for each read
  • Score tags: BBMap's internal alignment scores (if enabled)

Support and Resources

Getting Help

Related Documentation

  • BBMapGuide.txt: Comprehensive mapping strategies and use cases
  • UsageGuide.txt: General BBTools parameter syntax
  • readme.txt: Installation and basic setup instructions

Citation

When using BBMapSkimmer in publications, please cite BBMap:

Bushnell, B. (2014). BBMap: A Fast, Accurate, Splice-Aware Aligner. Lawrence Berkeley National Laboratory. LBNL-7065E.