WobblePlusAligner

Script: wobbleplusaligner.sh Package: aligner Class: WobblePlusAligner3.java

Aligns a query sequence to a reference using WobblePlusAligner3. The sequences can be any characters, but N is a special case. Outputs the identity, rstart, and rstop positions. Optionally prints a state space exploration map. This map can be fed to visualizealignment.sh to make an image.

Basic Usage

wobbleplusaligner.sh <query> <ref>
wobbleplusaligner.sh <query> <ref> <map>
wobbleplusaligner.sh <query> <ref> <map> <iterations> <simd>

WobblePlusAligner3 performs sequence alignment using a banded dynamic programming algorithm optimized for sequences with high identity and minimal indels. The tool uses only 2 arrays (prev[rLen+1] and curr[rLen+1]) avoiding full O(mn) matrix storage, implements traceback-free alignment calculation with 64-bit bit-packed encoding, and calculates bandwidth using Tools.mid(7, 1+max(qLen,rLen)/24, 24) with mismatch-based adjustment.

Parameters

Parameters are specified as positional arguments in the order shown below.

Required Parameters

query
A literal nucleotide sequence or fasta file containing the query sequence to align.
ref
A literal nucleotide sequence or fasta file containing the reference sequence for alignment.

Optional Parameters

map
Optional output text file for matrix score space visualization. Set to null for benchmarking with no visualization output. This file can be fed to visualizealignment.sh to create an image representation of the alignment matrix.
iterations
Optional integer for benchmarking multiple iterations of the same alignment. Useful for performance testing.
simd
Add this flag to use SIMD (Single Instruction, Multiple Data) mode for vectorized operations using shared.SIMDAlign.alignBandVectorAndReturnMaxPos() on supported processors.

Examples

Basic Alignment

wobbleplusaligner.sh ATCGATCG ATCGATCG

Aligns two identical sequences, should return identity of 1.0.

File-based Alignment

wobbleplusaligner.sh query.fasta reference.fasta

Aligns sequences from FASTA files, outputting identity and alignment coordinates.

Alignment with Visualization

wobbleplusaligner.sh query.fasta reference.fasta alignment_matrix.txt

Performs alignment and saves the score matrix to a text file for visualization.

Performance Benchmarking

wobbleplusaligner.sh query.fasta reference.fasta null 1000 simd

Runs the alignment 1000 times using SIMD acceleration, with no visualization output for pure performance testing.

Algorithm Details

Core Algorithm

WobblePlusAligner3 implements a banded dynamic programming alignment algorithm using the decideBandwidth() method and RingBuffer for quality assessment. The algorithm encodes position and deletion information directly in 64-bit score values using bit fields (21-bit position, 21-bit deletion, 22-bit score), achieving O(n) space complexity through dual rolling arrays. Technical implementation:

Dynamic Bandwidth Calculation

The initial bandwidth is calculated using the formula: Tools.mid(7, 1+max(qLen,rLen)/24, 24), then refined by counting mismatches in sequence prefixes using decideBandwidth(). During alignment, bandwidth adapts using a RingBuffer(ringSize) where ringSize=(bandWidth0*5)/4 that tracks recent alignment quality. Poor local scores trigger bandwidth expansion up to ringSize*2 positions, while good scores maintain narrow bands to reduce computational work.

Band Center Tracking

The band center updates using: center = center + 1 + drift, where drift = Tools.mid(-1, maxPos-center, 2). This allows the band to follow the optimal alignment path while preventing drift beyond ±2 positions per row. The band boundaries are: bandStart = max(bandStart, center-bandWidth+quarterBand) and bandEnd = min(rLen, center+bandWidth+quarterBand).

Dual Array Matrix Processing

The algorithm allocates only two long arrays: prev[rLen+1] and curr[rLen+1], swapping roles each iteration. This eliminates the need for full O(n*m) matrix storage, reducing memory requirements to O(reference_length). Array initialization uses Arrays.fill(curr, BAD) where BAD = Long.MIN_VALUE/2 to prevent overflow during score calculations.

Long-Encoded Sequence Processing

Sequences are converted to long[] arrays using Factory.encodeLong() with N-base encoding: query N=15, reference N=31. The encoding formula ((q|r)>=15) detects N matches, which receive neutral scoring (N_SCORE=0L). This encoding enables vectorized operations and bitwise comparison during alignment matrix calculations using shared.SIMDAlign.alignBandVectorAndReturnMaxPos().

Scoring System

Bit-Field Score Encoding

Alignment information is encoded directly in 64-bit score values without requiring traceback matrices:

Score operations: MATCH = 1L << 42, SUB/INS/DEL = (-1L) << 42, DEL_INCREMENT = DEL + (1L << 21) for position tracking.

Vectorized Processing Mode

SIMD mode (enabled via command line flag) delegates band processing to shared.SIMDAlign.alignBandVectorAndReturnMaxPos(), which processes multiple alignment cells simultaneously using processor vector units. The vectorized code handles the main alignment loop (match/substitution/insertion calculations) while the Java code processes the deletion tail loop sequentially for correctness.

Quality-Based Bandwidth Expansion

A RingBuffer(ringSize) tracks recent maximum scores to detect local alignment degradation. When recentMissingScore = (oldScore + ringSize) - currentScore exceeds thresholds, scoreBonus = min(ringSize*2, recentMissingScore*2) expands bandwidth. Near sequence starts, additional expansion follows: bandWidth0 + max(10 + bandWidth0*8 - maxDrift*i, scoreBonus).

Performance Characteristics

Identity Calculation

The algorithm extracts alignment statistics without traceback using bit field decoding:

Identity calculation solves the system: (1) M+S+I=qLen, (2) M+S+D=refAlnLength, (3) Score=M-S-I-D, where:

Technical Specifications

Limitations

Memory Usage

Memory requirements scale linearly with reference sequence length. Approximate memory usage:

Compatibility

Support

For questions and support: