MapPacBio

Script: mapPacBio.sh Package: align2 Class: BBMapPacBio.java

BBMap variant optimized for PacBio long reads and Nanopore sequencing data, featuring specialized error weight profiles for high error rates dominated by short indels

Description

MapPacBio is a BBMap variant specifically optimized for long-read sequencing technologies. While standard BBMap supports reads up to 700bp, MapPacBio extends support to reads up to 6kbp and employs a specialized error weight profile designed for long reads with high error rates (10-15%) dominated by short indels. This variant is the recommended version for both PacBio and Nanopore data.

The tool uses the MultiStateAligner9PacBio alignment engine with PacBio-tuned scoring matrices and implements a minimum alignment score ratio of 0.46 (allowing ~46% identity alignments) to accommodate the characteristic error profiles of long-read technologies. Note that globally optimal alignments may occasionally differ between MapPacBio and standard BBMap when processing Illumina data due to these specialized optimizations.

MapPacBio includes automatic HPC environment configuration for NERSC systems and uses memory-optimized threading with 680MB per-thread baseline allocation to handle the increased computational requirements of longer alignment matrices.

Technology Support

Primary Target Platforms

  • PacBio CLR/HiFi: Optimized for characteristic short indel error patterns and variable quality profiles
  • Oxford Nanopore: Recommended variant for Nanopore R9/R10 chemistry with appropriate error tolerance
  • Legacy Long Reads: Supports 454, Sanger, and Ion Torrent long-read chemistries

Error Profile Optimization

MapPacBio's error weight profile is specifically calibrated for sequencing technologies with:

  • High overall error rates (10-15% vs 2-3% for Illumina)
  • Error distributions dominated by short insertions and deletions
  • Variable quality along read length with systematic base-calling biases
  • Context-dependent error patterns requiring specialized gap penalty structures

Basic Usage

mapPacBio.sh ref=reference.fa in=pacbio_reads.fq out=mapped.sam
mapPacBio.sh build=1 in=nanopore_reads.fq out=mapped.sam
mapPacBio.sh ref=genome.fa in=long_reads.fq out=mapped.sam nodisk

PacBio-Specific Parameter Configuration

MapPacBio applies several default parameter modifications optimized for long-read characteristics:

Read Length and Processing

  • fastareadlen=6000: Extended from 500bp to support PacBio/Nanopore read lengths up to 6kbp
  • minscaf=100: Minimum scaffold length for alignment consideration
  • overwrite=true: Default file overwriting enabled for iterative analysis workflows

Error Tolerance Adjustments

  • minratio=0.40: Reduced from 0.76 to accommodate 10-15% error rates typical of long-read sequencing
  • MINIMUM_ALIGNMENT_SCORE_RATIO=0.46f: Internal Java setting allowing ~46% identity alignments
  • ambiguous=best: Optimized handling of ambiguously-mapped reads in high-error contexts

K-mer and Indexing Optimizations

  • keylen=12: Reduced from 13 to balance sensitivity and specificity for noisy sequences
  • keyDensity=3.5: Increased from 2.0 to improve seed coverage in error-prone regions
  • maxKeyDensity=4.5: Upper bound allowing adaptive density scaling
  • minKeyDensity×2.5: Ensures minimum seed coverage even in highly variable regions
  • maxDesiredKeys=63: Increased from 40 to accommodate sparser reliable seed distribution

Memory and Threading Architecture

  • startpad=10000, stoppad=10000, midpad=6000: Sequence padding optimizations for long-read processing
  • FastaToChromArrays2.MID_PADDING=2000: Reference handling optimization
  • adjustThreadsforMemory(680): 680MB per-thread allocation for extended alignment matrices
  • Shared.capBufferLen(20): Buffer management to prevent memory fragmentation

Parameters

MapPacBio accepts all standard BBMap parameters with PacBio-optimized defaults. Key parameters for long-read alignment are listed below. For comprehensive parameter documentation, see the BBMap documentation.

Indexing Parameters

ref=<file>
Reference sequence file for index construction. Required when building new indices.
build=1
Build identifier for reference index. Each reference requires a unique build number.
k=13
Kmer length for indexing (default 13 for BBMap, optimized to 12 for PacBio/Nanopore).
nodisk=f
Build index in memory without disk storage. Useful for temporary analyses or limited disk space.

Input Parameters

in=<file>
Primary reads input file (FASTA/FASTQ, compressed or uncompressed). Required parameter.
in2=<file>
Second file for paired reads in separate files (uncommon for PacBio/Nanopore).
interleaved=auto
Force paired/interleaved input processing (true) or single-ended mapping (false).
fastareadlen=6000
Fragment FASTA reads longer than this value. Default 6000bp for long-read support (vs 500bp standard).

Mapping Parameters

minratio=0.40
Minimum alignment score ratio for read acceptance. Reduced from 0.76 to accommodate long-read error rates.
maxindel=16000
Maximum indel length for explicit search. PacBio/Nanopore can have structural variants exceeding read length.
tipsearch=15
Search distance for terminal deletions with short anchors. Optimized for long-read end effects.
minhits=1
Minimum seed hits required for candidate site consideration. Reduced for error-prone sequences.
ambiguous=best
Handling of ambiguously-mapped reads. Options: best (first), toss, random, all.
fast=f
Enable faster, less sensitive alignment settings with reduced computational overhead.
slow=f
Enable slower, more sensitive settings. Use 'vslow' for maximum sensitivity.
vslow=f
Very slow mode with maximum sensitivity (minratio=0.22, extended rescue parameters).

Output Parameters

out=<file>
Primary output file for all processed reads (SAM/BAM format recommended).
outm=<file>
Output file for successfully mapped reads only.
outu=<file>
Output file for unmapped reads only.
overwrite=true
Allow overwriting existing output files. Default enabled for MapPacBio workflows.
secondary=f
Include secondary alignments in output (increases file size significantly).

Quality and Trimming Parameters

qtrim=f
Quality-based read trimming before mapping. Options: f (none), l (left), r (right), lr (both).
trimq=6
Quality threshold for trimming operations (Phred scale).
mintrimlength=60
Minimum read length after trimming to prevent over-trimming artifacts.
minaveragequality=0
Minimum average read quality for mapping consideration (0 disables filter).

Performance Parameters

threads=auto
Threading level (auto-detected based on available cores and memory constraints).
-Xmx<size>
Java heap memory allocation. Long-read alignment typically requires increased memory allocation.
startpad=10000
Padding at reference sequence starts (optimized for long-read boundary effects).
stoppad=10000
Padding at reference sequence ends (optimized for long-read boundary effects).
midpad=6000
Inter-sequence padding in concatenated references (long-read specific optimization).

Examples

Basic PacBio/Nanopore Alignment

mapPacBio.sh ref=genome.fa in=long_reads.fq out=aligned.sam

Standard long-read alignment with PacBio-optimized defaults for error tolerance and read length support.

Using Pre-built Index

mapPacBio.sh build=1 in=pacbio_reads.fq out=aligned.sam

Map to previously indexed reference (build=1) to avoid re-indexing overhead.

Maximum Sensitivity Alignment

mapPacBio.sh in=reads.fq out=mapped.sam vslow k=8 maxindel=200 minratio=0.1

Super-high sensitivity mode for difficult alignments, highly divergent sequences, or contamination screening. Uses 8-mer seeds and accepts alignments with only 10% identity.

Fast Mode for Large Datasets

mapPacBio.sh ref=genome.fa in=reads.fq out=aligned.sam fast

Accelerated processing for large datasets with acceptable sensitivity trade-offs (5× reduction in tip search).

Memory-Constrained Environment

mapPacBio.sh ref=genome.fa in=reads.fq out=aligned.sam -Xmx16g nodisk

Limited memory allocation with in-memory indexing when disk space is constrained.

High-Quality Nanopore Reads

mapPacBio.sh ref=genome.fa in=nanopore_hq.fq out=aligned.sam minratio=0.60

Higher stringency alignment for high-quality Nanopore data (R10+ chemistry) with reduced error rates.

Algorithm Details

MapPacBio implements specialized algorithmic adaptations for long-read sequencing with elevated error rates dominated by short indels:

MultiStateAligner9PacBio Engine

Uses MSA_TYPE="MultiStateAligner9PacBio" with PacBio-calibrated scoring matrices optimized for short indel error patterns. Implements reduced deletion penalties (POINTS_DEL=-292 vs -472 standard) and progressive gap penalty scaling across five cost levels (LIMIT_FOR_COST_3=5, LIMIT_FOR_COST_4=20, LIMIT_FOR_COST_5=80) to accommodate the characteristic indel distributions of long-read technologies. Matrix allocation uses KillSwitch.allocInt3D(3, maxRows+1, maxColumns+1) for contiguous memory layout optimizing cache performance.

Error Weight Profile Calibration

The specialized error weight profile accounts for long-read sequencing characteristics:

K-mer Seeding Strategy

Implements adaptive k-mer density scaling with 12-mer seeds (keylen=12) and enhanced density parameters: keyDensity=3.5, maxKeyDensity=4.5, minKeyDensity increased 2.5-fold. The maxDesiredKeys=63 accommodates sparse reliable seed distribution in error-prone regions while maintaining computational efficiency through AbstractIndex.MIN_APPROX_HITS_TO_KEEP=1.

Sensitivity Mode Implementations

Fast Mode (fast=t)

  • Reduces tip search distance by 5-fold (typically 15→3) for terminal indel detection
  • Sets bandwidth ratio (bwr=0.16) for alignment matrix optimization
  • Enables quickmatch acceleration for exact sequence matching
  • Configures rescue parameters: maxsites=5 primary, maxsites2=400 secondary
  • Increases genome exclusion fraction by 1.25× via BBIndexPacBio exclusion settings
  • Reduces k-mer densities by 0.9× factor for computational efficiency

Very Slow Mode (vslow=t)

  • Extends tip search distance to (15×3)/2 = 22 for comprehensive terminal analysis
  • Sets minratio=0.22 for ultra-high sensitivity (22% identity acceptance)
  • Disables quality filtering (usequality=f) to prevent quality-based alignment rejection
  • Extended rescue parameters: mismatches=50, distance=2500 for distant mate recovery
  • Eliminates genome exclusion (0%) via BBIndexPacBio.setFractionToExclude(0)
  • Doubles alignment padding with additional margin: SLOW_ALIGN_PADDING×2+8, SLOW_RESCUE_PADDING×2+2
  • Increases k-mer densities by 2.5× for maximum seed coverage

Genome Size Optimization

Implements dynamic parameter scaling based on reference genome characteristics:

Memory Architecture

Long-read alignment requires specialized memory management due to extended alignment matrices:

HPC Environment Support

MapPacBio includes automatic environment configuration for NERSC high-performance computing systems:

NERSC System Detection and Module Loading

  • Genepool: Automatic loading of oracle-jdk/1.8_144_64bit, samtools/1.4, and pigz modules
  • Denovo: Java 1.8.0_144, PrgEnv-gnu/7.1, samtools/1.4, and pigz with environment optimization
  • Cori: Specialized module path configuration (/global/common/software/m342/nersc-builds/denovo/) with Java 1.8.0_144 and PrgEnv-gnu/7.1
  • Shifter Container: Runtime detection via SHIFTER_RUNTIME=1 with container-appropriate settings

Manual HPC Configuration

For non-NERSC systems, manually configure environment variables:

export JAVA_HOME=/path/to/java8
export PATH=$JAVA_HOME/bin:$PATH
module load samtools pigz  # if available
mapPacBio.sh ref=genome.fa in=reads.fq out=aligned.sam

Performance Characteristics

Computational Requirements

Performance Scaling

Memory vs Sensitivity Trade-offs

MapPacBio provides several options for managing memory usage:

Comparison with Standard BBMap

FeatureBBMap (Standard)MapPacBioTechnical Rationale
Maximum Read Length700bp6000bp (6kbp)Accommodates PacBio CLR and Nanopore read distributions
Minimum Score Ratio0.76 (76% identity)0.46 (46% identity)Accommodates 10-15% error rates typical of long-read sequencing
K-mer Length13bp12bpBalances sensitivity and specificity for noisy sequences
Key Density2.03.5Increased seed density for error-prone region coverage
Max Key Density3.04.5Allows adaptive density scaling in variable-quality regions
Min Key Density1.02.5× baselineEnsures minimum seed coverage in high-error regions
Max Desired Keys4063Accommodates sparser reliable seed distribution
Tip Search Distance100bp15bpOptimized for short indel error patterns vs structural variants
Alignment EngineMultiStateAligner11tsMultiStateAligner9PacBioPacBio-specific gap penalty optimization and scoring matrices
Memory per Thread~500MB680MBIncreased allocation for larger alignment matrices and long-read processing
Error Profile TargetIllumina (2-3% error)PacBio/Nanopore (10-15% error)Specialized weight profiles for respective sequencing technologies
Primary Error TypeSubstitutionsShort indels (1-3bp)Reflects dominant error mechanisms of long-read sequencing

When to Use MapPacBio vs Standard BBMap

Use MapPacBio For:

Use Standard BBMap For:

Performance Considerations

Note that globally optimal alignments may occasionally differ between MapPacBio and standard BBMap when processing the same Illumina data due to the specialized scoring matrices and error models. For consistency in comparative analyses, use the same aligner variant across all samples in a study.

Troubleshooting

Common Issues and Solutions

Out of Memory Errors
MapPacBio requires 680MB per thread plus index memory (~6 bytes per reference base). Increase heap size (-Xmx32g), reduce thread count, or enable modulo mode (usemodulo=t) for 50% memory reduction. For very large genomes, use nodisk=t if sufficient RAM available.
Low Mapping Rates (<70%)
Enable vslow mode for maximum sensitivity (minratio=0.22). For highly divergent sequences, manually reduce minratio (e.g., minratio=0.20). Verify input data quality - excessive adapter content or contamination can reduce mapping rates. Check if reads are PacBio CLR vs HiFi; high-quality HiFi reads may benefit from higher minratio settings.
Excessive Runtime (>2 hours for bacterial genomes)
Enable fast mode (fast=t) to reduce tip search and limit candidate sites. Increase minhits=2-3 to reduce alignment candidates. Set maxindel=1000-5000 if long structural variants aren't expected. Ensure adequate RAM to prevent memory swapping which severely impacts performance.
High Memory Usage During Alignment
Long reads require extended alignment matrices consuming 680MB per thread. Reduce thread count if approaching system memory limits. Enable usemodulo during both indexing and mapping for 50% memory reduction. Monitor swap usage - any swapping will degrade performance substantially.
NERSC Module Loading Failures
HPC environments may have different module names or unavailable versions. Override automatic detection by setting NERSC_HOST="" to disable module loading. Manually configure JAVA_HOME and PATH variables. Verify Java 1.8 compatibility for optimal performance.
Index Building Failures
Ensure adequate temporary disk space (3-5× genome size) for index construction. Check chrombits parameter for references with very long scaffolds (>100Mb). Use nodisk=t if sufficient RAM available (>10× genome size) but limited temporary storage.
Poor Performance on Nanopore Data
Ensure using mapPacBio.sh rather than standard BBMap. For R10+ high-accuracy Nanopore data, consider increasing minratio=0.55-0.65 for improved specificity. Very long reads (>20kb) may benefit from increased maxindel settings and reduced minhits values.
Threading Performance Issues
MapPacBio auto-adjusts threads based on memory availability (680MB per thread). Override with threads=N if automatic detection is suboptimal. Single-thread mode (threads=1) useful for debugging alignment issues. Memory bandwidth often limits scaling beyond 16-32 threads for large genomes.

Related Tools

Support

For questions and support: