MapPacBio

Description

MapPacBio is a BBMap variant specifically optimized for long-read sequencing technologies. While standard BBMap supports reads up to 700bp, MapPacBio extends support to reads up to 6kbp and employs a specialized error weight profile designed for long reads with high error rates (10-15%) dominated by short indels. This variant is the recommended version for both PacBio and Nanopore data.

The tool uses the MultiStateAligner9PacBio alignment engine with PacBio-tuned scoring matrices and implements a minimum alignment score ratio of 0.46 (allowing ~46% identity alignments) to accommodate the characteristic error profiles of long-read technologies. Note that globally optimal alignments may occasionally differ between MapPacBio and standard BBMap when processing Illumina data due to these specialized optimizations.

MapPacBio includes automatic HPC environment configuration for NERSC systems and uses memory-optimized threading with 680MB per-thread baseline allocation to handle the increased computational requirements of longer alignment matrices.

Technology Support

Primary Target Platforms

PacBio CLR/HiFi: Optimized for characteristic short indel error patterns and variable quality profiles
Oxford Nanopore: Recommended variant for Nanopore R9/R10 chemistry with appropriate error tolerance
Legacy Long Reads: Supports 454, Sanger, and Ion Torrent long-read chemistries

Error Profile Optimization

MapPacBio's error weight profile is specifically calibrated for sequencing technologies with:

High overall error rates (10-15% vs 2-3% for Illumina)
Error distributions dominated by short insertions and deletions
Variable quality along read length with systematic base-calling biases
Context-dependent error patterns requiring specialized gap penalty structures

Basic Usage

mapPacBio.sh ref=reference.fa in=pacbio_reads.fq out=mapped.sam

mapPacBio.sh build=1 in=nanopore_reads.fq out=mapped.sam

mapPacBio.sh ref=genome.fa in=long_reads.fq out=mapped.sam nodisk

PacBio-Specific Parameter Configuration

MapPacBio applies several default parameter modifications optimized for long-read characteristics:

Read Length and Processing

fastareadlen=6000: Extended from 500bp to support PacBio/Nanopore read lengths up to 6kbp
minscaf=100: Minimum scaffold length for alignment consideration
overwrite=true: Default file overwriting enabled for iterative analysis workflows

Error Tolerance Adjustments

minratio=0.40: Reduced from 0.76 to accommodate 10-15% error rates typical of long-read sequencing
MINIMUM_ALIGNMENT_SCORE_RATIO=0.46f: Internal Java setting allowing ~46% identity alignments
ambiguous=best: Optimized handling of ambiguously-mapped reads in high-error contexts

K-mer and Indexing Optimizations

keylen=12: Reduced from 13 to balance sensitivity and specificity for noisy sequences
keyDensity=3.5: Increased from 2.0 to improve seed coverage in error-prone regions
maxKeyDensity=4.5: Upper bound allowing adaptive density scaling
minKeyDensity×2.5: Ensures minimum seed coverage even in highly variable regions
maxDesiredKeys=63: Increased from 40 to accommodate sparser reliable seed distribution

Memory and Threading Architecture

startpad=10000, stoppad=10000, midpad=6000: Sequence padding optimizations for long-read processing
FastaToChromArrays2.MID_PADDING=2000: Reference handling optimization
adjustThreadsforMemory(680): 680MB per-thread allocation for extended alignment matrices
Shared.capBufferLen(20): Buffer management to prevent memory fragmentation

Parameters

MapPacBio accepts all standard BBMap parameters with PacBio-optimized defaults. Key parameters for long-read alignment are listed below. For comprehensive parameter documentation, see the BBMap documentation.

Indexing Parameters

ref=<file>: Reference sequence file for index construction. Required when building new indices.
build=1: Build identifier for reference index. Each reference requires a unique build number.
k=13: Kmer length for indexing (default 13 for BBMap, optimized to 12 for PacBio/Nanopore).
nodisk=f: Build index in memory without disk storage. Useful for temporary analyses or limited disk space.

Input Parameters

in=<file>: Primary reads input file (FASTA/FASTQ, compressed or uncompressed). Required parameter.
in2=<file>: Second file for paired reads in separate files (uncommon for PacBio/Nanopore).
interleaved=auto: Force paired/interleaved input processing (true) or single-ended mapping (false).
fastareadlen=6000: Fragment FASTA reads longer than this value. Default 6000bp for long-read support (vs 500bp standard).

Mapping Parameters

minratio=0.40: Minimum alignment score ratio for read acceptance. Reduced from 0.76 to accommodate long-read error rates.
maxindel=16000: Maximum indel length for explicit search. PacBio/Nanopore can have structural variants exceeding read length.
tipsearch=15: Search distance for terminal deletions with short anchors. Optimized for long-read end effects.
minhits=1: Minimum seed hits required for candidate site consideration. Reduced for error-prone sequences.
ambiguous=best: Handling of ambiguously-mapped reads. Options: best (first), toss, random, all.
fast=f: Enable faster, less sensitive alignment settings with reduced computational overhead.
slow=f: Enable slower, more sensitive settings. Use 'vslow' for maximum sensitivity.
vslow=f: Very slow mode with maximum sensitivity (minratio=0.22, extended rescue parameters).

Output Parameters

out=<file>: Primary output file for all processed reads (SAM/BAM format recommended).
outm=<file>: Output file for successfully mapped reads only.
outu=<file>: Output file for unmapped reads only.
overwrite=true: Allow overwriting existing output files. Default enabled for MapPacBio workflows.
secondary=f: Include secondary alignments in output (increases file size significantly).

Quality and Trimming Parameters

qtrim=f: Quality-based read trimming before mapping. Options: f (none), l (left), r (right), lr (both).
trimq=6: Quality threshold for trimming operations (Phred scale).
mintrimlength=60: Minimum read length after trimming to prevent over-trimming artifacts.
minaveragequality=0: Minimum average read quality for mapping consideration (0 disables filter).

Performance Parameters

threads=auto: Threading level (auto-detected based on available cores and memory constraints).
-Xmx<size>: Java heap memory allocation. Long-read alignment typically requires increased memory allocation.
startpad=10000: Padding at reference sequence starts (optimized for long-read boundary effects).
stoppad=10000: Padding at reference sequence ends (optimized for long-read boundary effects).
midpad=6000: Inter-sequence padding in concatenated references (long-read specific optimization).

Examples

Basic PacBio/Nanopore Alignment

mapPacBio.sh ref=genome.fa in=long_reads.fq out=aligned.sam

Standard long-read alignment with PacBio-optimized defaults for error tolerance and read length support.

Using Pre-built Index

mapPacBio.sh build=1 in=pacbio_reads.fq out=aligned.sam

Map to previously indexed reference (build=1) to avoid re-indexing overhead.

Maximum Sensitivity Alignment

mapPacBio.sh in=reads.fq out=mapped.sam vslow k=8 maxindel=200 minratio=0.1

Super-high sensitivity mode for difficult alignments, highly divergent sequences, or contamination screening. Uses 8-mer seeds and accepts alignments with only 10% identity.

Fast Mode for Large Datasets

mapPacBio.sh ref=genome.fa in=reads.fq out=aligned.sam fast

Accelerated processing for large datasets with acceptable sensitivity trade-offs (5× reduction in tip search).

Memory-Constrained Environment

mapPacBio.sh ref=genome.fa in=reads.fq out=aligned.sam -Xmx16g nodisk

Limited memory allocation with in-memory indexing when disk space is constrained.

High-Quality Nanopore Reads

mapPacBio.sh ref=genome.fa in=nanopore_hq.fq out=aligned.sam minratio=0.60

Higher stringency alignment for high-quality Nanopore data (R10+ chemistry) with reduced error rates.

Algorithm Details

MapPacBio implements specialized algorithmic adaptations for long-read sequencing with elevated error rates dominated by short indels:

MultiStateAligner9PacBio Engine

Uses MSA_TYPE="MultiStateAligner9PacBio" with PacBio-calibrated scoring matrices optimized for short indel error patterns. Implements reduced deletion penalties (POINTS_DEL=-292 vs -472 standard) and progressive gap penalty scaling across five cost levels (LIMIT_FOR_COST_3=5, LIMIT_FOR_COST_4=20, LIMIT_FOR_COST_5=80) to accommodate the characteristic indel distributions of long-read technologies. Matrix allocation uses KillSwitch.allocInt3D(3, maxRows+1, maxColumns+1) for contiguous memory layout optimizing cache performance.

Error Weight Profile Calibration

The specialized error weight profile accounts for long-read sequencing characteristics:

Short Indel Dominance: Error patterns dominated by 1-3bp insertions and deletions rather than substitutions
Context Dependency: Systematic biases in homopolymer regions and repetitive sequences
Quality Variation: Variable base-calling confidence along read length requiring adaptive scoring
Structural Complexity: Capability to detect large structural variants through extended maxindel settings

K-mer Seeding Strategy

Implements adaptive k-mer density scaling with 12-mer seeds (keylen=12) and enhanced density parameters: keyDensity=3.5, maxKeyDensity=4.5, minKeyDensity increased 2.5-fold. The maxDesiredKeys=63 accommodates sparse reliable seed distribution in error-prone regions while maintaining computational efficiency through AbstractIndex.MIN_APPROX_HITS_TO_KEEP=1.

Sensitivity Mode Implementations

Fast Mode (fast=t)

Reduces tip search distance by 5-fold (typically 15→3) for terminal indel detection
Sets bandwidth ratio (bwr=0.16) for alignment matrix optimization
Enables quickmatch acceleration for exact sequence matching
Configures rescue parameters: maxsites=5 primary, maxsites2=400 secondary
Increases genome exclusion fraction by 1.25× via BBIndexPacBio exclusion settings
Reduces k-mer densities by 0.9× factor for computational efficiency

Very Slow Mode (vslow=t)

Extends tip search distance to (15×3)/2 = 22 for comprehensive terminal analysis
Sets minratio=0.22 for ultra-high sensitivity (22% identity acceptance)
Disables quality filtering (usequality=f) to prevent quality-based alignment rejection
Extended rescue parameters: mismatches=50, distance=2500 for distant mate recovery
Eliminates genome exclusion (0%) via BBIndexPacBio.setFractionToExclude(0)
Doubles alignment padding with additional margin: SLOW_ALIGN_PADDING×2+8, SLOW_RESCUE_PADDING×2+2
Increases k-mer densities by 2.5× for maximum seed coverage

Genome Size Optimization

Implements dynamic parameter scaling based on reference genome characteristics:

Bacterial Genomes (<30Mb): Exclusion fraction reduced to 50%, increased hit reduction parameters, optimized HIT_REDUCTION_DIV settings
Fungal Genomes (30-100Mb): Exclusion fraction reduced to 60% for balanced sensitivity/performance
Plant Genomes (100-300Mb): Exclusion fraction reduced to 75% maintaining alignment quality
Mammalian Genomes (>300Mb): Standard exclusion parameters for computational tractability

Memory Architecture

Long-read alignment requires specialized memory management due to extended alignment matrices:

Thread Memory: 680MB baseline per thread via adjustThreadsforMemory(680) for alignment matrix storage
Compression Optimization: ReadWrite.ZIPLEVEL=2 with bgzip preference balancing throughput and storage
Buffer Management: Shared.capBufferLen(20) prevents memory fragmentation in long-read processing
Reference Handling: FastaToChromArrays2.MID_PADDING=2000 for efficient chromosome array layout
Output Control: MAX_SITESCORES_TO_PRINT=100 limits memory usage during alignment scoring

HPC Environment Support

MapPacBio includes automatic environment configuration for NERSC high-performance computing systems:

NERSC System Detection and Module Loading

Genepool: Automatic loading of oracle-jdk/1.8_144_64bit, samtools/1.4, and pigz modules
Denovo: Java 1.8.0_144, PrgEnv-gnu/7.1, samtools/1.4, and pigz with environment optimization
Cori: Specialized module path configuration (/global/common/software/m342/nersc-builds/denovo/) with Java 1.8.0_144 and PrgEnv-gnu/7.1
Shifter Container: Runtime detection via SHIFTER_RUNTIME=1 with container-appropriate settings

Manual HPC Configuration

For non-NERSC systems, manually configure environment variables:

export JAVA_HOME=/path/to/java8
export PATH=$JAVA_HOME/bin:$PATH
module load samtools pigz  # if available
mapPacBio.sh ref=genome.fa in=reads.fq out=aligned.sam

Performance Characteristics

Computational Requirements

Memory Allocation: ~6 bytes per reference base plus 680MB per thread for alignment matrices (vs 500MB for standard BBMap)
Threading Scalability: Near-linear scaling with processor cores, memory bandwidth often becomes limiting factor
Alignment Sensitivity: Configured for 46-99% identity alignments through MINIMUM_ALIGNMENT_SCORE_RATIO=0.46f
Index Memory: Supports modulo reduction for 50% memory savings with minimal sensitivity impact
I/O Optimization: Bgzip compression (ZIPLEVEL=2) with buffer capping for balanced throughput/memory usage

Performance Scaling

Bacterial Genomes (5Mb): ~2-5 minutes on 8-core system with 16GB RAM
Fungal Genomes (50Mb): ~10-30 minutes on 16-core system with 32GB RAM
Plant Genomes (500Mb): ~1-3 hours on 32-core system with 128GB RAM
Mammalian Genomes (3Gb): ~3-8 hours on 64-core system with 256GB RAM

Memory vs Sensitivity Trade-offs

MapPacBio provides several options for managing memory usage:

Standard Mode: Full sensitivity with ~6 bytes per reference base index
Modulo Mode (usemodulo=t): 50% memory reduction with <1% sensitivity loss
Fast Mode (fast=t): 25-40% speed increase with 10-15% sensitivity reduction
In-Memory (nodisk=t): No disk I/O overhead when sufficient RAM available

Comparison with Standard BBMap

Feature	BBMap (Standard)	MapPacBio	Technical Rationale
Maximum Read Length	700bp	6000bp (6kbp)	Accommodates PacBio CLR and Nanopore read distributions
Minimum Score Ratio	0.76 (76% identity)	0.46 (46% identity)	Accommodates 10-15% error rates typical of long-read sequencing
K-mer Length	13bp	12bp	Balances sensitivity and specificity for noisy sequences
Key Density	2.0	3.5	Increased seed density for error-prone region coverage
Max Key Density	3.0	4.5	Allows adaptive density scaling in variable-quality regions
Min Key Density	1.0	2.5× baseline	Ensures minimum seed coverage in high-error regions
Max Desired Keys	40	63	Accommodates sparser reliable seed distribution
Tip Search Distance	100bp	15bp	Optimized for short indel error patterns vs structural variants
Alignment Engine	MultiStateAligner11ts	MultiStateAligner9PacBio	PacBio-specific gap penalty optimization and scoring matrices
Memory per Thread	~500MB	680MB	Increased allocation for larger alignment matrices and long-read processing
Error Profile Target	Illumina (2-3% error)	PacBio/Nanopore (10-15% error)	Specialized weight profiles for respective sequencing technologies
Primary Error Type	Substitutions	Short indels (1-3bp)	Reflects dominant error mechanisms of long-read sequencing

When to Use MapPacBio vs Standard BBMap

Use MapPacBio For:

PacBio Data: All PacBio CLR and HiFi sequencing data regardless of quality
Nanopore Data: Oxford Nanopore R9, R10, and newer chemistry data
Long Reads (>1kb): Any sequencing technology producing reads longer than 1kbp
High Error Rate Data: Sequencing with >5% error rates dominated by indels
Structural Variant Detection: When detecting large indels or structural rearrangements
Legacy Long-Read Technologies: 454, Sanger, or Ion Torrent long reads

Use Standard BBMap For:

Illumina Data: All Illumina sequencing platforms (HiSeq, NovaSeq, MiSeq)
Short Reads (<500bp): Read lengths under 500bp regardless of technology
Low Error Rate Data: High-quality data with <5% error rates
RNA-seq Applications: Splice-aware alignment of short RNA-seq reads
Metagenomics: Short-read metagenomic analysis and binning

Performance Considerations

Note that globally optimal alignments may occasionally differ between MapPacBio and standard BBMap when processing the same Illumina data due to the specialized scoring matrices and error models. For consistency in comparative analyses, use the same aligner variant across all samples in a study.

Troubleshooting

Common Issues and Solutions

Out of Memory Errors: MapPacBio requires 680MB per thread plus index memory (~6 bytes per reference base). Increase heap size (-Xmx32g), reduce thread count, or enable modulo mode (usemodulo=t) for 50% memory reduction. For very large genomes, use nodisk=t if sufficient RAM available.
Low Mapping Rates (<70%): Enable vslow mode for maximum sensitivity (minratio=0.22). For highly divergent sequences, manually reduce minratio (e.g., minratio=0.20). Verify input data quality - excessive adapter content or contamination can reduce mapping rates. Check if reads are PacBio CLR vs HiFi; high-quality HiFi reads may benefit from higher minratio settings.
Excessive Runtime (>2 hours for bacterial genomes): Enable fast mode (fast=t) to reduce tip search and limit candidate sites. Increase minhits=2-3 to reduce alignment candidates. Set maxindel=1000-5000 if long structural variants aren't expected. Ensure adequate RAM to prevent memory swapping which severely impacts performance.
High Memory Usage During Alignment: Long reads require extended alignment matrices consuming 680MB per thread. Reduce thread count if approaching system memory limits. Enable usemodulo during both indexing and mapping for 50% memory reduction. Monitor swap usage - any swapping will degrade performance substantially.
NERSC Module Loading Failures: HPC environments may have different module names or unavailable versions. Override automatic detection by setting NERSC_HOST="" to disable module loading. Manually configure JAVA_HOME and PATH variables. Verify Java 1.8 compatibility for optimal performance.
Index Building Failures: Ensure adequate temporary disk space (3-5× genome size) for index construction. Check chrombits parameter for references with very long scaffolds (>100Mb). Use nodisk=t if sufficient RAM available (>10× genome size) but limited temporary storage.
Poor Performance on Nanopore Data: Ensure using mapPacBio.sh rather than standard BBMap. For R10+ high-accuracy Nanopore data, consider increasing minratio=0.55-0.65 for improved specificity. Very long reads (>20kb) may benefit from increased maxindel settings and reduced minhits values.
Threading Performance Issues: MapPacBio auto-adjusts threads based on memory availability (680MB per thread). Override with threads=N if automatic detection is suboptimal. Single-thread mode (threads=1) useful for debugging alignment issues. Memory bandwidth often limits scaling beyond 16-32 threads for large genomes.

Related Tools

BBMap: Standard version optimized for Illumina short reads
BBSplit: Multi-reference mapping and contamination detection
BBWrap: Wrapper for batch processing multiple input files
Pileup: Coverage analysis and variant calling from MapPacBio output
Reformat: SAM/BAM processing and format conversion

Support

For questions and support:

Email: bbushnell@lbl.gov
Documentation: bbmap.org
Complete BBMap Guide: BBMapGuide.txt