BBRealign
Realigns mapped reads to a reference using dynamic programming alignment algorithms. This tool takes SAM or BAM files and improves the alignment of reads to the reference sequence, particularly useful for correcting alignment artifacts and improving indel detection accuracy.
Basic Usage
bbrealign.sh in=<file> ref=<file> out=<file>
Input may be a sorted or unsorted SAM or BAM file. The reference should be in FASTA format. Realigned reads are output in the same format as the input.
Parameters
Parameters are organized by their function in the realignment process. The tool processes reads by first applying SAM filters, then performing realignment using dynamic programming, and finally applying quality trimming.
I/O parameters
- in=<file>
- Input reads in SAM or BAM format. Can be sorted or unsorted.
- out=<file>
- Output realigned reads in SAM or BAM format (same as input format).
- ref=<file>
- Reference genome in FASTA format used for realignment.
- overwrite=f
- (ow) Set to false to force the program to abort rather than overwrite an existing file.
Trimming parameters
- border=0
- Trim at least this many bases on both ends of reads after realignment.
- qtrim=r
- Quality-trim reads on this end. Options: r (right), l (left), rl (both), f (don't quality-trim).
- trimq=10
- Quality-trim bases below this Phred score threshold.
Realignment parameters
- unclip=f
- Convert clip symbols from exceeding the ends of the realignment zone into matches and substitutions. When true, clipped sequences are included in the realignment process.
- repadding=70
- Pad alignment by this many bases on each end of the realignment zone. Longer padding is more accurate for long indels but reduces speed. Recommended range: 50-200.
- rerows=602
- Maximum number of rows (read length) for the dynamic programming matrix. Reads longer than this cannot be realigned and will be passed through unchanged.
- recols=2000
- Maximum number of columns (reference segment length) for realignment. Must be at least read length plus maximum deletion length plus twice the padding value.
- msa=
- Select the multiple sequence aligner algorithm. Options:
• MultiStateAligner11ts (default): Optimized for Illumina reads
• MultiStateAligner9PacBio: Use for PacBio/Nanopore reads or Illumina reads mapped to long-read assemblies
Sam-filtering parameters
- minpos=
- Ignore alignments that do not overlap this genomic position range (start coordinate).
- maxpos=
- Ignore alignments that do not overlap this genomic position range (end coordinate).
- minreadmapq=4
- Ignore alignments with mapping quality lower than this value. Helps filter poorly mapped reads.
- contigs=
- Comma-delimited list of contig/chromosome names to include in processing. Names should have no spaces, or use underscores instead of spaces.
- secondary=f
- Include secondary alignments (reads with multiple mapping locations) in realignment processing.
- supplementary=f
- Include supplementary alignments (chimeric alignments) in realignment processing.
- invert=f
- Invert all SAM filtering criteria. Reads that would normally be excluded will be included and vice versa.
Java Parameters
- -Xmx
- Set Java's maximum heap memory usage, overriding autodetection. Examples: -Xmx20g (20 GB), -Xmx200m (200 MB). Maximum is typically 85% of physical memory.
- -eoom
- Exit process if an out-of-memory exception occurs. Requires Java 8u92 or later. Useful for preventing zombie processes.
- -da
- Disable Java assertions for slightly better performance in production use.
Examples
Basic Realignment
bbrealign.sh in=mapped.bam ref=genome.fa out=realigned.bam
Realigns all reads in a BAM file to improve alignment accuracy around indels.
PacBio/Nanopore Realignment
bbrealign.sh in=longreads.sam ref=assembly.fa out=realigned.sam msa=MultiStateAligner9PacBio rerows=10000 recols=50000
Optimized settings for long reads with larger matrix dimensions and PacBio-specific algorithm.
Quality Trimming with Realignment
bbrealign.sh in=reads.bam ref=genome.fa out=trimmed_realigned.bam qtrim=rl trimq=20 border=5
Realigns reads then trims low-quality bases from both ends and removes 5 bases from each end.
Filtered Realignment
bbrealign.sh in=all_reads.bam ref=genome.fa out=high_quality.bam minreadmapq=20 secondary=f supplementary=f
Only realigns high-quality primary alignments, excluding secondary and supplementary alignments.
Algorithm Details
Realignment Strategy
BBRealign uses Multiple Sequence Alignment (MSA) with dynamic programming to re-align reads that exhibit poor alignment characteristics. The Realigner class implements a glocal alignment strategy with padding around the original alignment region, retaining only realignments that improve the alignment score.
1. Alignment Quality Assessment (Realigner.realign method)
- Match Symbol Analysis: Uses Read.countMatchSymbols() to count matches, substitutions, clips, insertions, and deletions
- Realignment Criteria: Triggers realignment if reads have clips OR >3 bad symbols (substitutions+indels) OR complex indel patterns (>1 indel OR indel+substitution)
- Quality Filter: Skips high-quality alignments with <3 substitutions, no clips, <2 indels, and <3 total bad symbols
2. Multiple Sequence Alignment Engine
- MSA Factory Pattern: Uses MSA.makeMSA() to create algorithm-specific aligners based on read type
- MultiStateAligner11ts: Default aligner using 11-state transition model optimized for Illumina reads with low error rates
- MultiStateAligner9PacBio: 9-state model with higher gap penalties for long reads with elevated error rates
- Dynamic Matrix Resizing: Automatically expands MSA matrix if read length exceeds current capacity (length+2+length/4+2*padding)
- Glocal Alignment: Uses MSA.fillLimited() with score threshold to perform bounded dynamic programming alignment
3. Reference Padding Strategy
- Padded Region Creation: makeRbases() method extends reference segment by padding bases on each side
- Boundary Handling: Uses 'N' characters for out-of-bounds positions beyond reference length
- Coordinate Correction: Adjusts SiteScore coordinates by adding paddedStart offset to account for padding
4. Alignment Score Evaluation
- Score Calculation: Uses MSA.score() method on match strings to compute alignment quality
- Improvement Threshold: Only retains realignments with score > original (or score > 1 for clipped reads)
- Traceback Generation: Creates new match string using MSA.traceback() for improved alignments
5. Clipping and Coordinate Management
- Clipping Analysis: Uses SamLine.countLeadingClip() and SamLine.countTrailingClip() to identify soft/hard clips
- Terminal Indel Clipping: Calls SiteScore.clipTipIndels() to remove problematic terminal insertions/deletions
- Optional Unclipping: SiteScore.unclip() can restore clipped terminal indels if they improve alignment
- CIGAR Reconstruction: Generates new CIGAR string using SamLine.toCigar14() from improved match string
6. Quality Control and Filtering
- Match String Validation: Rejects alignments containing problematic symbols (X, Y, A, B) via SiteScore.matchContainsXY/AB()
- Quality Trimming: Applies TrimRead.testOptimal() for Phred-based quality trimming after realignment
- Border Trimming: Uses TrimRead.trimReadWithMatch() to remove specified bases from read ends
Implementation Characteristics
- Multithreading: ProcessThread class creates per-thread Realigner instances to avoid synchronization overhead
- Memory Management: Default 4GB heap allocation via calcXmx() with 84% physical memory limit
- Stream Processing: Uses ConcurrentReadInputStream/OutputStream with configurable buffer sizes for parallel I/O
- Reference Loading: ScafMap.loadReference() indexes FASTA sequences for efficient scaffold lookup
- Performance Tracking: Maintains counters for realignmentsAttempted, realignmentsSucceeded, realignmentsImproved, realignmentsRetained
When to Use BBRealign
- Indel-Rich Regions: Improves alignment accuracy for reads with insertions/deletions >1bp poorly handled by simpler aligners
- Variant Calling Preparation: Preprocessing step to improve SNP/indel detection sensitivity and specificity
- Assembly Improvement: Realigning reads to polished assemblies or updated reference sequences
- Long Read Error Correction: Correcting systematic alignment artifacts in PacBio/Nanopore data using MultiStateAligner9PacBio
Support
For questions and support:
- Email: bbushnell@lbl.gov
- Documentation: bbmap.org