BBMapSkimmer
Multi-site alignment variant of BBMap designed to find all alignment sites above a threshold rather than just the best site. Originally developed for mapping Illumina reads to PacBio reads in error correction workflows, with specialized parameters for long read processing.
Core Concept: Multi-Site vs Single-Best Alignment
BBMapSkimmer fundamentally differs from standard BBMap in its alignment strategy:
- Standard BBMap: Attempts to find the single best alignment site, plus some nearly-as-good sites to quantify ambiguity
- BBMapSkimmer: Finds all alignment sites above a minimum score threshold, expecting many valid alignments with different scores
This approach is essential when each read is expected to have multiple correct alignment locations, such as when mapping short reads to long reads containing repetitive sequences or when analyzing multi-copy genes.
Primary Use Case: Illumina-to-PacBio Error Correction
BBMapSkimmer was originally designed for mapping Illumina reads to PacBio reads for error correction. In this workflow:
- PacBio reads serve as the "reference" (long but error-prone)
- Illumina reads provide high-accuracy short sequences
- Each Illumina read may correctly align to multiple locations on a PacBio read
- All alignment sites are needed to build consensus and correct PacBio errors
The multi-site approach is crucial because dismissing "ambiguous" alignments would lose valuable error correction information from repetitive regions.
Basic Usage
bbmapskimmer.sh ref=<reference> in=<reads> out=<sam>
BBMapSkimmer uses identical syntax to BBMap but applies different default parameters optimized for multi-site detection and PacBio processing. All BBMap parameters are supported.
Key Differences from Standard BBMap
Default Parameter Changes
- fastareadlen=6000
- Increased from 500 to handle PacBio read lengths up to 6kb
- minratio=0.40
- Reduced from 0.76 to capture more alignment sites above threshold
- ambig=all
- Changed from 'best' to retain all top-scoring alignment sites
- secondary=true
- Secondary alignments enabled by default for multi-site output
- overwrite=true
- Allow overwriting output files by default
- build=1
- Default build number for index management
Algorithm Specializations
- MultiStateAligner9PacBio: Uses PacBio-optimized alignment algorithm designed for long reads with high indel rates
- Adjusted key density: keyDensity=3.3, maxKeyDensity=4.3, minKeyDensity=1.8 for better long-read sensitivity
- Enhanced site reporting: MAX_SITESCORES_TO_PRINT=500 allows up to 500 alignment sites per read
- Extended search parameters: TIP_SEARCH_DIST=15, SLOW_ALIGN_PADDING=8 for better indel detection at read ends
Parameters
BBMapSkimmer accepts all BBMap parameters. The most relevant parameters for multi-site alignment are highlighted:
Multi-Site Alignment Parameters
- ambiguous=all
- (ambig) Critical parameter controlling multi-site behavior:
- all - Retain all top-scoring sites (default for skimmer)
- best - Use first best site only
- random - Select one top-scoring site randomly
- toss - Consider ambiguous reads unmapped
- secondary=t
- Print secondary alignments. Essential for multi-site analysis (enabled by default).
- maxsites=500
- Maximum number of alignment sites to print per read. BBMapSkimmer defaults to higher values.
- sssr=0.95
- (secondarysitescoreratio) Print secondary alignments with score ≥ this fraction of primary alignment score.
- minratio=0.40
- Minimum alignment score ratio. Lowered in skimmer to capture more sites (BBMap default: 0.76).
- minid=0.40
- Approximate minimum percent identity. Corresponds to the minratio setting.
PacBio-Optimized Parameters
- fastareadlen=6000
- Break FASTA reads longer than this length. Maximum for PacBio mode is 6000bp vs 500bp for standard BBMap.
- maxindel=16000
- Maximum indel length to search for. PacBio reads often contain large indels.
- tipsearch=15
- Distance to search for read-end deletions. Optimized for PacBio error profiles.
- rescuedist=1200
- Distance for paired read rescue operations, adjusted for longer PacBio inserts.
- rescuemismatches=32
- Maximum mismatches allowed in rescued reads, higher tolerance for PacBio errors.
Indexing Parameters
- ref=<file>
- Reference sequence file. Required for initial index building.
- build=1
- Build number for index management. Default is 1 (BBMap requires manual specification).
- k=12
- Kmer length for indexing. BBMapSkimmer defaults to 12 vs 13 for BBMap.
- usemodulo=f
- Reduce memory usage by ~50% at slight sensitivity cost.
- nodisk=f
- Build index in memory only, don't write to disk.
Input/Output Parameters
- in=<file>
- Input reads file (FASTA/FASTQ, optionally compressed).
- in2=<file>
- Second input file for paired reads.
- out=<file>
- Output SAM/BAM file with all alignment sites.
- outu=<file>
- File for unmapped reads only.
- outm=<file>
- File for mapped reads only (includes pairs with one mapped read).
- overwrite=t
- Allow overwriting existing files (enabled by default in skimmer).
Performance Control Parameters
- fast=f
- Enable fast mode with reduced sensitivity but higher speed.
- slow=f
- Enable slow mode with increased sensitivity. 'vslow' provides maximum sensitivity.
- threads=auto
- Number of processing threads (default: all available cores).
- minApproxHits=2
- Minimum approximate hits required to keep candidate sites.
Quality and Filtering Parameters
- qtrim=f
- Quality trimming mode: f (none), l (left), r (right), lr (both).
- trimq=6
- Quality threshold for trimming.
- untrim=f
- Restore trimmed bases as soft-clipped in output.
- minaveragequality=0
- Skip reads with average quality below this value.
SAM Output Parameters
- sam=1.4
- SAM format version. 1.4 uses = and X in CIGAR, 1.3 uses M only.
- cigar=t
- Generate CIGAR strings (disable with f for faster processing).
- mdtag=f
- Include MD tags in SAM output.
- nhtag=f
- Include NH (number of hits) tags.
- xmtag=f
- Include XM (number of mismatches) tags.
Statistics and Analysis Parameters
- scafstats=<file>
- Output per-scaffold mapping statistics.
- ihist=<file>
- Insert size histogram for paired reads.
- ehist=<file>
- Error rate histogram.
- idhist=<file>
- Identity distribution histogram.
- covstats=<file>
- Coverage statistics per scaffold.
Memory Management Parameters
- -Xmx<size>
- Java maximum heap size (e.g., -Xmx32g for 32GB). Index requires ~6 bytes per reference base.
- -Xms<size>
- Java initial heap size. BBMapSkimmer automatically sets this equal to -Xmx.
- -eoom
- Exit on out-of-memory error rather than crashing (requires Java 8u92+).
Algorithm Details
Multi-Site Detection Strategy
BBMapSkimmer implements a fundamentally different alignment philosophy:
- Exhaustive site enumeration: Rather than stopping after finding the best alignment, continues searching for all sites above the minimum score threshold
- Score-based filtering: Uses MINIMUM_ALIGNMENT_SCORE_RATIO=0.45 (vs 0.76 in BBMap) to cast a wider net for potential alignment sites
- Enhanced site storage: Can retain up to MAX_SITESCORES_TO_PRINT=500 sites per read compared to typical 3-5 in standard alignment
- Ambiguity preservation: Default ambig=all ensures no potentially valid alignment sites are discarded
PacBio-Optimized Alignment Engine
Uses MultiStateAligner9PacBio with specialized parameters for long read processing:
- Extended alignment matrices: Accommodates reads up to 6000bp vs 500bp in standard BBMap
- PacBio error profile: Optimized scoring for high indel rates typical of PacBio sequencing
- Enhanced indel detection: TIP_SEARCH_DIST=15 for better detection of insertions/deletions at read termini
- Adaptive key density: keyDensity=3.3, maxKeyDensity=4.3 provides better sensitivity for long reads with variable quality
Index Utilization Strategy
Optimized index usage for multi-site discovery:
- Reduced exclusion fraction: Keeps more frequent kmers that might be excluded in single-best alignment
- Genome size scaling: Automatically adjusts parameters based on reference genome size (<30M, 30-100M, 100-300M, >300M)
- Extended hit retention: MIN_APPROX_HITS_TO_KEEP=2 ensures potential sites aren't prematurely discarded
Performance Characteristics
- Memory usage: Similar base requirements to BBMap (~6 bytes per reference base) but may use additional RAM for storing multiple sites per read
- Speed: Slower than BBMap due to exhaustive site enumeration, but optimized for the specific use cases where multiple sites are essential
- Sensitivity vs specificity: Higher sensitivity for multi-mapping detection but may report more sites than needed for single-best alignment applications
Examples
Error Correction Workflow (Primary Use Case)
# Map Illumina reads to PacBio reads for error correction
# Step 1: Use PacBio reads as reference
bbmapskimmer.sh ref=pacbio_reads.fasta build=1
# Step 2: Map Illumina reads, retain all sites
bbmapskimmer.sh in=illumina_reads.fq out=error_correction.sam \
ambig=all secondary=t maxsites=100
Basic Multi-Site Analysis
# Find all alignment sites above threshold
bbmapskimmer.sh ref=genome.fasta in=reads.fq out=multisites.sam \
minratio=0.3 secondary=t ambig=all
Repetitive Element Mapping
# Analyze repetitive sequences with comprehensive site detection
bbmapskimmer.sh ref=genome.fasta in=reads.fq out=repeats.sam \
minid=0.4 maxindel=5000 secondary=t maxsites=200 \
ambig=all sssr=0.7
High-Sensitivity PacBio Processing
# Maximum sensitivity for difficult PacBio sequences
bbmapskimmer.sh ref=genome.fasta in=pacbio.fq out=sensitive.sam \
vslow=t secondary=t ambig=all maxsites=500 \
minratio=0.2 maxindel=20000
Fast Mode for Large Datasets
# Faster processing with reduced sensitivity
bbmapskimmer.sh ref=genome.fasta in=reads.fq out=fast.sam \
fast=t secondary=t ambig=all maxsites=50
Memory-Optimized Processing
# Process large genomes with memory constraints
bbmapskimmer.sh ref=large_genome.fasta usemodulo=t build=1
bbmapskimmer.sh in=reads.fq out=mapped.sam usemodulo=t \
ambig=all secondary=t -Xmx16g
Use Cases and Applications
Primary Applications
- Illumina-to-PacBio Error Correction
- Original design purpose: mapping accurate short reads to error-prone long reads to build consensus and correct sequencing errors. Essential for PacBio polishing workflows.
- Multi-Copy Gene Analysis
- Analyzing reads that should legitimately map to multiple locations, such as ribosomal RNA genes, histone clusters, or other multi-copy loci.
- Repetitive Element Studies
- Comprehensive mapping of reads to all instances of repetitive elements, transposons, or tandem repeats where standard "unique mapping" would miss important sites.
- Structural Variation Detection
- Finding reads with multiple alignment sites that may indicate duplications, inversions, or complex rearrangements requiring multi-site analysis.
- Paralog Identification
- Mapping reads to all members of gene families or pseudogenes to understand sequence relationships and evolution.
Specialized Workflows
- Chimeric read detection: Identifying reads with multiple non-overlapping alignment regions
- Assembly validation: Verifying that reads map consistently across duplicated or repetitive assembly regions
- Metagenomics: Mapping reads that may originate from multiple related organisms or strains
- Ancient DNA analysis: Processing degraded DNA that may produce fragmented alignments requiring multiple sites
When NOT to Use BBMapSkimmer
- Standard genomic analysis: When only the best alignment per read is needed
- RNA-seq quantification: When unique mappings are required for accurate expression estimates
- Variant calling: When single-best alignments provide cleaner variant calls
- Limited computational resources: When speed is more important than comprehensive site detection
- Short read processing: When standard BBMap parameters are sufficient
Considerations for Use
- Computational cost: Slower than standard BBMap due to exhaustive site search
- Output size: Larger SAM files due to multiple alignments per read
- Downstream analysis: Ensure downstream tools can handle multiple alignments per read
- Memory requirements: May need more RAM when processing highly repetitive sequences
Technical Notes
Relationship to Other BBMap Variants
- BBMap: Standard mapper focused on finding single best alignment
- BBMapSkimmer: Multi-site mapper for finding all alignments above threshold
- mapPacBio.sh: Single-best PacBio mapper with long-read optimizations
- BBSplit: Multi-reference mapper for taxonomic binning
Integration with BBTools Ecosystem
BBMapSkimmer integrates with other BBTools for complete workflows:
- Reformat: Process input/output files and convert between formats
- FilterByTile: Pre-filter reads by sequencing tile quality
- Pileup: Generate coverage statistics from multi-site alignments
- Stats: Analyze reference sequences before indexing
Performance Optimization Tips
- Use
usemodulo=t
to reduce memory usage by ~50% - Adjust
maxsites
based on expected repeat content - Use
fast=t
for initial exploration,vslow=t
for comprehensive analysis - Build index once, then map multiple datasets without
ref=
parameter
Output Interpretation
- Primary alignments: Best scoring alignment per read
- Secondary alignments: Additional sites above score threshold
- NH tag: Number of alignment sites reported for each read
- Score tags: BBMap's internal alignment scores (if enabled)
Support and Resources
Getting Help
- Primary contact: bbushnell@lbl.gov
- Documentation: bbmap.org
Related Documentation
- BBMapGuide.txt: Comprehensive mapping strategies and use cases
- UsageGuide.txt: General BBTools parameter syntax
- readme.txt: Installation and basic setup instructions
Citation
When using BBMapSkimmer in publications, please cite BBMap:
Bushnell, B. (2014). BBMap: A Fast, Accurate, Splice-Aware Aligner. Lawrence Berkeley National Laboratory. LBNL-7065E.