BBRealign

Basic Usage

bbrealign.sh in=<file> ref=<file> out=<file>

Input may be a sorted or unsorted SAM or BAM file. The reference should be in FASTA format. Realigned reads are output in the same format as the input.

Parameters

Parameters are organized by their function in the realignment process. The tool processes reads by first applying SAM filters, then performing realignment using dynamic programming, and finally applying quality trimming.

I/O parameters

in=<file>: Input reads in SAM or BAM format. Can be sorted or unsorted.
out=<file>: Output realigned reads in SAM or BAM format (same as input format).
ref=<file>: Reference genome in FASTA format used for realignment.
overwrite=f: (ow) Set to false to force the program to abort rather than overwrite an existing file.

Trimming parameters

border=0: Trim at least this many bases on both ends of reads after realignment.
qtrim=r: Quality-trim reads on this end. Options: r (right), l (left), rl (both), f (don't quality-trim).
trimq=10: Quality-trim bases below this Phred score threshold.

Realignment parameters

unclip=f: Convert clip symbols from exceeding the ends of the realignment zone into matches and substitutions. When true, clipped sequences are included in the realignment process.
repadding=70: Pad alignment by this many bases on each end of the realignment zone. Longer padding is more accurate for long indels but reduces speed. Recommended range: 50-200.
rerows=602: Maximum number of rows (read length) for the dynamic programming matrix. Reads longer than this cannot be realigned and will be passed through unchanged.
recols=2000: Maximum number of columns (reference segment length) for realignment. Must be at least read length plus maximum deletion length plus twice the padding value.
msa=: Select the multiple sequence aligner algorithm. Options:
• MultiStateAligner11ts (default): Optimized for Illumina reads
• MultiStateAligner9PacBio: Use for PacBio/Nanopore reads or Illumina reads mapped to long-read assemblies

Sam-filtering parameters

minpos=: Ignore alignments that do not overlap this genomic position range (start coordinate).
maxpos=: Ignore alignments that do not overlap this genomic position range (end coordinate).
minreadmapq=4: Ignore alignments with mapping quality lower than this value. Helps filter poorly mapped reads.
contigs=: Comma-delimited list of contig/chromosome names to include in processing. Names should have no spaces, or use underscores instead of spaces.
secondary=f: Include secondary alignments (reads with multiple mapping locations) in realignment processing.
supplementary=f: Include supplementary alignments (chimeric alignments) in realignment processing.
invert=f: Invert all SAM filtering criteria. Reads that would normally be excluded will be included and vice versa.

Java Parameters

-Xmx: Set Java's maximum heap memory usage, overriding autodetection. Examples: -Xmx20g (20 GB), -Xmx200m (200 MB). Maximum is typically 85% of physical memory.
-eoom: Exit process if an out-of-memory exception occurs. Requires Java 8u92 or later. Useful for preventing zombie processes.
-da: Disable Java assertions for slightly better performance in production use.

Examples

Basic Realignment

bbrealign.sh in=mapped.bam ref=genome.fa out=realigned.bam

Realigns all reads in a BAM file to improve alignment accuracy around indels.

PacBio/Nanopore Realignment

bbrealign.sh in=longreads.sam ref=assembly.fa out=realigned.sam msa=MultiStateAligner9PacBio rerows=10000 recols=50000

Optimized settings for long reads with larger matrix dimensions and PacBio-specific algorithm.

Quality Trimming with Realignment

bbrealign.sh in=reads.bam ref=genome.fa out=trimmed_realigned.bam qtrim=rl trimq=20 border=5

Realigns reads then trims low-quality bases from both ends and removes 5 bases from each end.

Filtered Realignment

bbrealign.sh in=all_reads.bam ref=genome.fa out=high_quality.bam minreadmapq=20 secondary=f supplementary=f

Only realigns high-quality primary alignments, excluding secondary and supplementary alignments.

Algorithm Details

Realignment Strategy

BBRealign uses Multiple Sequence Alignment (MSA) with dynamic programming to re-align reads that exhibit poor alignment characteristics. The Realigner class implements a glocal alignment strategy with padding around the original alignment region, retaining only realignments that improve the alignment score.

1. Alignment Quality Assessment (Realigner.realign method)

Match Symbol Analysis: Uses Read.countMatchSymbols() to count matches, substitutions, clips, insertions, and deletions
Realignment Criteria: Triggers realignment if reads have clips OR >3 bad symbols (substitutions+indels) OR complex indel patterns (>1 indel OR indel+substitution)
Quality Filter: Skips high-quality alignments with <3 substitutions, no clips, <2 indels, and <3 total bad symbols

2. Multiple Sequence Alignment Engine

MSA Factory Pattern: Uses MSA.makeMSA() to create algorithm-specific aligners based on read type
MultiStateAligner11ts: Default aligner using 11-state transition model optimized for Illumina reads with low error rates
MultiStateAligner9PacBio: 9-state model with higher gap penalties for long reads with elevated error rates
Dynamic Matrix Resizing: Automatically expands MSA matrix if read length exceeds current capacity (length+2+length/4+2*padding)
Glocal Alignment: Uses MSA.fillLimited() with score threshold to perform bounded dynamic programming alignment

3. Reference Padding Strategy

Padded Region Creation: makeRbases() method extends reference segment by padding bases on each side
Boundary Handling: Uses 'N' characters for out-of-bounds positions beyond reference length
Coordinate Correction: Adjusts SiteScore coordinates by adding paddedStart offset to account for padding

4. Alignment Score Evaluation

Score Calculation: Uses MSA.score() method on match strings to compute alignment quality
Improvement Threshold: Only retains realignments with score > original (or score > 1 for clipped reads)
Traceback Generation: Creates new match string using MSA.traceback() for improved alignments

5. Clipping and Coordinate Management

Clipping Analysis: Uses SamLine.countLeadingClip() and SamLine.countTrailingClip() to identify soft/hard clips
Terminal Indel Clipping: Calls SiteScore.clipTipIndels() to remove problematic terminal insertions/deletions
Optional Unclipping: SiteScore.unclip() can restore clipped terminal indels if they improve alignment
CIGAR Reconstruction: Generates new CIGAR string using SamLine.toCigar14() from improved match string

6. Quality Control and Filtering

Match String Validation: Rejects alignments containing problematic symbols (X, Y, A, B) via SiteScore.matchContainsXY/AB()
Quality Trimming: Applies TrimRead.testOptimal() for Phred-based quality trimming after realignment
Border Trimming: Uses TrimRead.trimReadWithMatch() to remove specified bases from read ends

Implementation Characteristics

Multithreading: ProcessThread class creates per-thread Realigner instances to avoid synchronization overhead
Memory Management: Default 4GB heap allocation via calcXmx() with 84% physical memory limit
Stream Processing: Uses ConcurrentReadInputStream/OutputStream with configurable buffer sizes for parallel I/O
Reference Loading: ScafMap.loadReference() indexes FASTA sequences for efficient scaffold lookup
Performance Tracking: Maintains counters for realignmentsAttempted, realignmentsSucceeded, realignmentsImproved, realignmentsRetained

When to Use BBRealign

Indel-Rich Regions: Improves alignment accuracy for reads with insertions/deletions >1bp poorly handled by simpler aligners
Variant Calling Preparation: Preprocessing step to improve SNP/indel detection sensitivity and specificity
Assembly Improvement: Realigning reads to polished assemblies or updated reference sequences
Long Read Error Correction: Correcting systematic alignment artifacts in PacBio/Nanopore data using MultiStateAligner9PacBio

Support

For questions and support:

Email: bbushnell@lbl.gov
Documentation: bbmap.org