MapPacBio
BBMap variant optimized for PacBio long reads and Nanopore sequencing data, featuring specialized error weight profiles for high error rates dominated by short indels
Description
MapPacBio is a BBMap variant specifically optimized for long-read sequencing technologies. While standard BBMap supports reads up to 700bp, MapPacBio extends support to reads up to 6kbp and employs a specialized error weight profile designed for long reads with high error rates (10-15%) dominated by short indels. This variant is the recommended version for both PacBio and Nanopore data.
The tool uses the MultiStateAligner9PacBio alignment engine with PacBio-tuned scoring matrices and implements a minimum alignment score ratio of 0.46 (allowing ~46% identity alignments) to accommodate the characteristic error profiles of long-read technologies. Note that globally optimal alignments may occasionally differ between MapPacBio and standard BBMap when processing Illumina data due to these specialized optimizations.
MapPacBio includes automatic HPC environment configuration for NERSC systems and uses memory-optimized threading with 680MB per-thread baseline allocation to handle the increased computational requirements of longer alignment matrices.
Technology Support
Primary Target Platforms
- PacBio CLR/HiFi: Optimized for characteristic short indel error patterns and variable quality profiles
- Oxford Nanopore: Recommended variant for Nanopore R9/R10 chemistry with appropriate error tolerance
- Legacy Long Reads: Supports 454, Sanger, and Ion Torrent long-read chemistries
Error Profile Optimization
MapPacBio's error weight profile is specifically calibrated for sequencing technologies with:
- High overall error rates (10-15% vs 2-3% for Illumina)
- Error distributions dominated by short insertions and deletions
- Variable quality along read length with systematic base-calling biases
- Context-dependent error patterns requiring specialized gap penalty structures
Basic Usage
mapPacBio.sh ref=reference.fa in=pacbio_reads.fq out=mapped.sam
mapPacBio.sh build=1 in=nanopore_reads.fq out=mapped.sam
mapPacBio.sh ref=genome.fa in=long_reads.fq out=mapped.sam nodisk
PacBio-Specific Parameter Configuration
MapPacBio applies several default parameter modifications optimized for long-read characteristics:
Read Length and Processing
- fastareadlen=6000: Extended from 500bp to support PacBio/Nanopore read lengths up to 6kbp
- minscaf=100: Minimum scaffold length for alignment consideration
- overwrite=true: Default file overwriting enabled for iterative analysis workflows
Error Tolerance Adjustments
- minratio=0.40: Reduced from 0.76 to accommodate 10-15% error rates typical of long-read sequencing
- MINIMUM_ALIGNMENT_SCORE_RATIO=0.46f: Internal Java setting allowing ~46% identity alignments
- ambiguous=best: Optimized handling of ambiguously-mapped reads in high-error contexts
K-mer and Indexing Optimizations
- keylen=12: Reduced from 13 to balance sensitivity and specificity for noisy sequences
- keyDensity=3.5: Increased from 2.0 to improve seed coverage in error-prone regions
- maxKeyDensity=4.5: Upper bound allowing adaptive density scaling
- minKeyDensity×2.5: Ensures minimum seed coverage even in highly variable regions
- maxDesiredKeys=63: Increased from 40 to accommodate sparser reliable seed distribution
Memory and Threading Architecture
- startpad=10000, stoppad=10000, midpad=6000: Sequence padding optimizations for long-read processing
- FastaToChromArrays2.MID_PADDING=2000: Reference handling optimization
- adjustThreadsforMemory(680): 680MB per-thread allocation for extended alignment matrices
- Shared.capBufferLen(20): Buffer management to prevent memory fragmentation
Parameters
MapPacBio accepts all standard BBMap parameters with PacBio-optimized defaults. Key parameters for long-read alignment are listed below. For comprehensive parameter documentation, see the BBMap documentation.
Indexing Parameters
- ref=<file>
- Reference sequence file for index construction. Required when building new indices.
- build=1
- Build identifier for reference index. Each reference requires a unique build number.
- k=13
- Kmer length for indexing (default 13 for BBMap, optimized to 12 for PacBio/Nanopore).
- nodisk=f
- Build index in memory without disk storage. Useful for temporary analyses or limited disk space.
Input Parameters
- in=<file>
- Primary reads input file (FASTA/FASTQ, compressed or uncompressed). Required parameter.
- in2=<file>
- Second file for paired reads in separate files (uncommon for PacBio/Nanopore).
- interleaved=auto
- Force paired/interleaved input processing (true) or single-ended mapping (false).
- fastareadlen=6000
- Fragment FASTA reads longer than this value. Default 6000bp for long-read support (vs 500bp standard).
Mapping Parameters
- minratio=0.40
- Minimum alignment score ratio for read acceptance. Reduced from 0.76 to accommodate long-read error rates.
- maxindel=16000
- Maximum indel length for explicit search. PacBio/Nanopore can have structural variants exceeding read length.
- tipsearch=15
- Search distance for terminal deletions with short anchors. Optimized for long-read end effects.
- minhits=1
- Minimum seed hits required for candidate site consideration. Reduced for error-prone sequences.
- ambiguous=best
- Handling of ambiguously-mapped reads. Options: best (first), toss, random, all.
- fast=f
- Enable faster, less sensitive alignment settings with reduced computational overhead.
- slow=f
- Enable slower, more sensitive settings. Use 'vslow' for maximum sensitivity.
- vslow=f
- Very slow mode with maximum sensitivity (minratio=0.22, extended rescue parameters).
Output Parameters
- out=<file>
- Primary output file for all processed reads (SAM/BAM format recommended).
- outm=<file>
- Output file for successfully mapped reads only.
- outu=<file>
- Output file for unmapped reads only.
- overwrite=true
- Allow overwriting existing output files. Default enabled for MapPacBio workflows.
- secondary=f
- Include secondary alignments in output (increases file size significantly).
Quality and Trimming Parameters
- qtrim=f
- Quality-based read trimming before mapping. Options: f (none), l (left), r (right), lr (both).
- trimq=6
- Quality threshold for trimming operations (Phred scale).
- mintrimlength=60
- Minimum read length after trimming to prevent over-trimming artifacts.
- minaveragequality=0
- Minimum average read quality for mapping consideration (0 disables filter).
Performance Parameters
- threads=auto
- Threading level (auto-detected based on available cores and memory constraints).
- -Xmx<size>
- Java heap memory allocation. Long-read alignment typically requires increased memory allocation.
- startpad=10000
- Padding at reference sequence starts (optimized for long-read boundary effects).
- stoppad=10000
- Padding at reference sequence ends (optimized for long-read boundary effects).
- midpad=6000
- Inter-sequence padding in concatenated references (long-read specific optimization).
Examples
Basic PacBio/Nanopore Alignment
mapPacBio.sh ref=genome.fa in=long_reads.fq out=aligned.sam
Standard long-read alignment with PacBio-optimized defaults for error tolerance and read length support.
Using Pre-built Index
mapPacBio.sh build=1 in=pacbio_reads.fq out=aligned.sam
Map to previously indexed reference (build=1) to avoid re-indexing overhead.
Maximum Sensitivity Alignment
mapPacBio.sh in=reads.fq out=mapped.sam vslow k=8 maxindel=200 minratio=0.1
Super-high sensitivity mode for difficult alignments, highly divergent sequences, or contamination screening. Uses 8-mer seeds and accepts alignments with only 10% identity.
Fast Mode for Large Datasets
mapPacBio.sh ref=genome.fa in=reads.fq out=aligned.sam fast
Accelerated processing for large datasets with acceptable sensitivity trade-offs (5× reduction in tip search).
Memory-Constrained Environment
mapPacBio.sh ref=genome.fa in=reads.fq out=aligned.sam -Xmx16g nodisk
Limited memory allocation with in-memory indexing when disk space is constrained.
High-Quality Nanopore Reads
mapPacBio.sh ref=genome.fa in=nanopore_hq.fq out=aligned.sam minratio=0.60
Higher stringency alignment for high-quality Nanopore data (R10+ chemistry) with reduced error rates.
Algorithm Details
MapPacBio implements specialized algorithmic adaptations for long-read sequencing with elevated error rates dominated by short indels:
MultiStateAligner9PacBio Engine
Uses MSA_TYPE="MultiStateAligner9PacBio" with PacBio-calibrated scoring matrices optimized for short indel error patterns. Implements reduced deletion penalties (POINTS_DEL=-292 vs -472 standard) and progressive gap penalty scaling across five cost levels (LIMIT_FOR_COST_3=5, LIMIT_FOR_COST_4=20, LIMIT_FOR_COST_5=80) to accommodate the characteristic indel distributions of long-read technologies. Matrix allocation uses KillSwitch.allocInt3D(3, maxRows+1, maxColumns+1) for contiguous memory layout optimizing cache performance.
Error Weight Profile Calibration
The specialized error weight profile accounts for long-read sequencing characteristics:
- Short Indel Dominance: Error patterns dominated by 1-3bp insertions and deletions rather than substitutions
- Context Dependency: Systematic biases in homopolymer regions and repetitive sequences
- Quality Variation: Variable base-calling confidence along read length requiring adaptive scoring
- Structural Complexity: Capability to detect large structural variants through extended maxindel settings
K-mer Seeding Strategy
Implements adaptive k-mer density scaling with 12-mer seeds (keylen=12) and enhanced density parameters: keyDensity=3.5, maxKeyDensity=4.5, minKeyDensity increased 2.5-fold. The maxDesiredKeys=63 accommodates sparse reliable seed distribution in error-prone regions while maintaining computational efficiency through AbstractIndex.MIN_APPROX_HITS_TO_KEEP=1.
Sensitivity Mode Implementations
Fast Mode (fast=t)
- Reduces tip search distance by 5-fold (typically 15→3) for terminal indel detection
- Sets bandwidth ratio (bwr=0.16) for alignment matrix optimization
- Enables quickmatch acceleration for exact sequence matching
- Configures rescue parameters: maxsites=5 primary, maxsites2=400 secondary
- Increases genome exclusion fraction by 1.25× via BBIndexPacBio exclusion settings
- Reduces k-mer densities by 0.9× factor for computational efficiency
Very Slow Mode (vslow=t)
- Extends tip search distance to (15×3)/2 = 22 for comprehensive terminal analysis
- Sets minratio=0.22 for ultra-high sensitivity (22% identity acceptance)
- Disables quality filtering (usequality=f) to prevent quality-based alignment rejection
- Extended rescue parameters: mismatches=50, distance=2500 for distant mate recovery
- Eliminates genome exclusion (0%) via BBIndexPacBio.setFractionToExclude(0)
- Doubles alignment padding with additional margin: SLOW_ALIGN_PADDING×2+8, SLOW_RESCUE_PADDING×2+2
- Increases k-mer densities by 2.5× for maximum seed coverage
Genome Size Optimization
Implements dynamic parameter scaling based on reference genome characteristics:
- Bacterial Genomes (<30Mb): Exclusion fraction reduced to 50%, increased hit reduction parameters, optimized HIT_REDUCTION_DIV settings
- Fungal Genomes (30-100Mb): Exclusion fraction reduced to 60% for balanced sensitivity/performance
- Plant Genomes (100-300Mb): Exclusion fraction reduced to 75% maintaining alignment quality
- Mammalian Genomes (>300Mb): Standard exclusion parameters for computational tractability
Memory Architecture
Long-read alignment requires specialized memory management due to extended alignment matrices:
- Thread Memory: 680MB baseline per thread via adjustThreadsforMemory(680) for alignment matrix storage
- Compression Optimization: ReadWrite.ZIPLEVEL=2 with bgzip preference balancing throughput and storage
- Buffer Management: Shared.capBufferLen(20) prevents memory fragmentation in long-read processing
- Reference Handling: FastaToChromArrays2.MID_PADDING=2000 for efficient chromosome array layout
- Output Control: MAX_SITESCORES_TO_PRINT=100 limits memory usage during alignment scoring
HPC Environment Support
MapPacBio includes automatic environment configuration for NERSC high-performance computing systems:
NERSC System Detection and Module Loading
- Genepool: Automatic loading of oracle-jdk/1.8_144_64bit, samtools/1.4, and pigz modules
- Denovo: Java 1.8.0_144, PrgEnv-gnu/7.1, samtools/1.4, and pigz with environment optimization
- Cori: Specialized module path configuration (/global/common/software/m342/nersc-builds/denovo/) with Java 1.8.0_144 and PrgEnv-gnu/7.1
- Shifter Container: Runtime detection via SHIFTER_RUNTIME=1 with container-appropriate settings
Manual HPC Configuration
For non-NERSC systems, manually configure environment variables:
export JAVA_HOME=/path/to/java8
export PATH=$JAVA_HOME/bin:$PATH
module load samtools pigz # if available
mapPacBio.sh ref=genome.fa in=reads.fq out=aligned.sam
Performance Characteristics
Computational Requirements
- Memory Allocation: ~6 bytes per reference base plus 680MB per thread for alignment matrices (vs 500MB for standard BBMap)
- Threading Scalability: Near-linear scaling with processor cores, memory bandwidth often becomes limiting factor
- Alignment Sensitivity: Configured for 46-99% identity alignments through MINIMUM_ALIGNMENT_SCORE_RATIO=0.46f
- Index Memory: Supports modulo reduction for 50% memory savings with minimal sensitivity impact
- I/O Optimization: Bgzip compression (ZIPLEVEL=2) with buffer capping for balanced throughput/memory usage
Performance Scaling
- Bacterial Genomes (5Mb): ~2-5 minutes on 8-core system with 16GB RAM
- Fungal Genomes (50Mb): ~10-30 minutes on 16-core system with 32GB RAM
- Plant Genomes (500Mb): ~1-3 hours on 32-core system with 128GB RAM
- Mammalian Genomes (3Gb): ~3-8 hours on 64-core system with 256GB RAM
Memory vs Sensitivity Trade-offs
MapPacBio provides several options for managing memory usage:
- Standard Mode: Full sensitivity with ~6 bytes per reference base index
- Modulo Mode (usemodulo=t): 50% memory reduction with <1% sensitivity loss
- Fast Mode (fast=t): 25-40% speed increase with 10-15% sensitivity reduction
- In-Memory (nodisk=t): No disk I/O overhead when sufficient RAM available
Comparison with Standard BBMap
Feature | BBMap (Standard) | MapPacBio | Technical Rationale |
---|---|---|---|
Maximum Read Length | 700bp | 6000bp (6kbp) | Accommodates PacBio CLR and Nanopore read distributions |
Minimum Score Ratio | 0.76 (76% identity) | 0.46 (46% identity) | Accommodates 10-15% error rates typical of long-read sequencing |
K-mer Length | 13bp | 12bp | Balances sensitivity and specificity for noisy sequences |
Key Density | 2.0 | 3.5 | Increased seed density for error-prone region coverage |
Max Key Density | 3.0 | 4.5 | Allows adaptive density scaling in variable-quality regions |
Min Key Density | 1.0 | 2.5× baseline | Ensures minimum seed coverage in high-error regions |
Max Desired Keys | 40 | 63 | Accommodates sparser reliable seed distribution |
Tip Search Distance | 100bp | 15bp | Optimized for short indel error patterns vs structural variants |
Alignment Engine | MultiStateAligner11ts | MultiStateAligner9PacBio | PacBio-specific gap penalty optimization and scoring matrices |
Memory per Thread | ~500MB | 680MB | Increased allocation for larger alignment matrices and long-read processing |
Error Profile Target | Illumina (2-3% error) | PacBio/Nanopore (10-15% error) | Specialized weight profiles for respective sequencing technologies |
Primary Error Type | Substitutions | Short indels (1-3bp) | Reflects dominant error mechanisms of long-read sequencing |
When to Use MapPacBio vs Standard BBMap
Use MapPacBio For:
- PacBio Data: All PacBio CLR and HiFi sequencing data regardless of quality
- Nanopore Data: Oxford Nanopore R9, R10, and newer chemistry data
- Long Reads (>1kb): Any sequencing technology producing reads longer than 1kbp
- High Error Rate Data: Sequencing with >5% error rates dominated by indels
- Structural Variant Detection: When detecting large indels or structural rearrangements
- Legacy Long-Read Technologies: 454, Sanger, or Ion Torrent long reads
Use Standard BBMap For:
- Illumina Data: All Illumina sequencing platforms (HiSeq, NovaSeq, MiSeq)
- Short Reads (<500bp): Read lengths under 500bp regardless of technology
- Low Error Rate Data: High-quality data with <5% error rates
- RNA-seq Applications: Splice-aware alignment of short RNA-seq reads
- Metagenomics: Short-read metagenomic analysis and binning
Performance Considerations
Note that globally optimal alignments may occasionally differ between MapPacBio and standard BBMap when processing the same Illumina data due to the specialized scoring matrices and error models. For consistency in comparative analyses, use the same aligner variant across all samples in a study.
Troubleshooting
Common Issues and Solutions
- Out of Memory Errors
- MapPacBio requires 680MB per thread plus index memory (~6 bytes per reference base). Increase heap size (-Xmx32g), reduce thread count, or enable modulo mode (usemodulo=t) for 50% memory reduction. For very large genomes, use nodisk=t if sufficient RAM available.
- Low Mapping Rates (<70%)
- Enable vslow mode for maximum sensitivity (minratio=0.22). For highly divergent sequences, manually reduce minratio (e.g., minratio=0.20). Verify input data quality - excessive adapter content or contamination can reduce mapping rates. Check if reads are PacBio CLR vs HiFi; high-quality HiFi reads may benefit from higher minratio settings.
- Excessive Runtime (>2 hours for bacterial genomes)
- Enable fast mode (fast=t) to reduce tip search and limit candidate sites. Increase minhits=2-3 to reduce alignment candidates. Set maxindel=1000-5000 if long structural variants aren't expected. Ensure adequate RAM to prevent memory swapping which severely impacts performance.
- High Memory Usage During Alignment
- Long reads require extended alignment matrices consuming 680MB per thread. Reduce thread count if approaching system memory limits. Enable usemodulo during both indexing and mapping for 50% memory reduction. Monitor swap usage - any swapping will degrade performance substantially.
- NERSC Module Loading Failures
- HPC environments may have different module names or unavailable versions. Override automatic detection by setting NERSC_HOST="" to disable module loading. Manually configure JAVA_HOME and PATH variables. Verify Java 1.8 compatibility for optimal performance.
- Index Building Failures
- Ensure adequate temporary disk space (3-5× genome size) for index construction. Check chrombits parameter for references with very long scaffolds (>100Mb). Use nodisk=t if sufficient RAM available (>10× genome size) but limited temporary storage.
- Poor Performance on Nanopore Data
- Ensure using mapPacBio.sh rather than standard BBMap. For R10+ high-accuracy Nanopore data, consider increasing minratio=0.55-0.65 for improved specificity. Very long reads (>20kb) may benefit from increased maxindel settings and reduced minhits values.
- Threading Performance Issues
- MapPacBio auto-adjusts threads based on memory availability (680MB per thread). Override with threads=N if automatic detection is suboptimal. Single-thread mode (threads=1) useful for debugging alignment issues. Memory bandwidth often limits scaling beyond 16-32 threads for large genomes.
Related Tools
Support
For questions and support:
- Email: bbushnell@lbl.gov
- Documentation: bbmap.org
- Complete BBMap Guide: BBMapGuide.txt