DriftingPlusAligner

Script: driftingplusaligner.sh Package: aligner Class: DriftingPlusAligner2.java

Aligns a query sequence to a reference using DriftingPlusAligner2. The sequences can be any characters, but N is a special case. Outputs the identity, rstart, and rstop positions. Optionally prints a state space exploration map. This map can be fed to visualizealignment.sh to make an image.

Basic Usage

driftingplusaligner.sh <query> <ref>
driftingplusaligner.sh <query> <ref> <map>
driftingplusaligner.sh <query> <ref> <map> <iterations>

DriftingPlusAligner2 is a highly optimized pairwise sequence aligner designed for high-identity alignments. It can accept sequences as literal nucleotide strings or as fasta files.

Parameters

Parameters are specified as positional arguments rather than named parameters, following the pattern shown in the usage section.

Parameters

query: A literal nucleotide sequence or fasta file. This is the sequence to align against the reference. Can contain any characters, though N nucleotides receive special handling in scoring.
ref: A literal nucleotide sequence or fasta file. This serves as the reference sequence for alignment. The aligner may internally swap query and reference if the query is longer than the reference to optimize performance.
map: Optional output text file for matrix score space visualization. This file contains the state space exploration data that can be fed to visualizealignment.sh to generate alignment visualization images. Set to "null" for benchmarking scenarios where visualization output is not needed.
iterations: Optional integer for benchmarking multiple alignment iterations. Useful for performance testing and timing measurements of the alignment algorithm.
simd: Add this flag to enable SIMD (Single Instruction, Multiple Data) mode for vectorized alignment operations. This can significantly accelerate alignment computation on supported hardware platforms.

Examples

Basic Alignment

driftingplusaligner.sh ATCGATCG ATCGATCGATCG

Aligns a short query sequence against a longer reference sequence, reporting identity and alignment coordinates.

File-Based Alignment

driftingplusaligner.sh query.fasta reference.fasta

Aligns sequences from fasta files. Both single-sequence and multi-sequence fasta files are supported.

Alignment with Visualization

driftingplusaligner.sh query.fasta reference.fasta alignment_map.txt

Performs alignment and saves the state space exploration matrix to a file for visualization with visualizealignment.sh.

SIMD-Accelerated Alignment

driftingplusaligner.sh ATCGATCG ATCGATCGATCG null 1 simd

Uses SIMD vectorization for faster alignment computation. Note that map is set to "null" and iterations to "1" to reach the simd flag.

Benchmarking

driftingplusaligner.sh query.fasta reference.fasta null 1000

Runs alignment 1000 times for performance benchmarking, with visualization disabled for accurate timing measurements.

Algorithm Details

DriftingPlusAligner2 implements an advanced banded dynamic programming algorithm optimized for high-identity sequence alignments. The algorithm incorporates several key innovations:

Adaptive Banding Strategy

The aligner uses a sophisticated adaptive banding approach that balances speed and accuracy:

Dynamic bandwidth calculation: Initial bandwidth is determined by examining sequence similarity in the first portion of the alignment, using the formula: bandwidth = max(8, 1 + max(qLen, rLen)/16, 40 + sqrt(rLen)/4)
Drift compensation: The band center drifts toward the highest-scoring alignment path, with a maximum drift of 2 positions per row
Quality-responsive widening: Low alignment quality triggers automatic bandwidth expansion to accommodate more divergent sequences
Glocal alignment support: Band starts wide and narrows to enable both global and local alignment modes

Memory Optimization

The algorithm achieves exceptional memory efficiency through several design choices:

Two-array approach: Uses only two arrays (current and previous row) instead of storing the entire alignment matrix
Traceback-free design: Calculates alignment coordinates without requiring traceback operations
Bit-packed scoring: Uses 64-bit integers with bit fields for position tracking (21 bits), deletion counting (21 bits), and scoring (22 bits)
Length limitation: Supports sequences up to 2 Mbp with 21-bit position encoding

Scoring System

The alignment scoring incorporates nucleotide-specific handling:

Match scoring: +1 for exact nucleotide matches
Mismatch penalties: -1 for substitutions
Indel penalties: -1 for insertions and deletions
N nucleotide handling: N bases receive neutral scoring (0) to avoid penalizing ambiguous positions
Global vs local modes: Configurable through the GLOBAL constant for different alignment strategies

SIMD Acceleration

When SIMD mode is enabled, the algorithm leverages vectorized operations:

Vectorized scoring: Processes multiple alignment cells simultaneously using SIMD instructions
Hardware optimization: Takes advantage of modern CPU vector units for parallel computation
Bandwidth-aware processing: Maintains the adaptive banding strategy while using vectorized operations within bands

Position Calculation

The algorithm calculates alignment coordinates using a mathematical approach:

System of equations: Solves M + S + I = qLen, M + S + D = refAlnLength, Score = M - S - I - D
Operation counting: Derives match, substitution, insertion, and deletion counts from final scores
Identity calculation: Computes alignment identity as matches/(matches + substitutions + insertions + deletions)

Performance Characteristics

Time complexity: O(n*m*b) where n and m are sequence lengths and b is average bandwidth
Space complexity: O(m) linear space usage regardless of sequence length
Optimized for: High-identity alignments (>70% identity) where banding is most effective
Memory usage: Default 200MB allocation, scales with reference sequence length

Output Format

DriftingPlusAligner2 outputs alignment results in a simple format:

Identity: Floating-point value between 0.0 and 1.0 representing alignment identity
Reference start (rstart): Starting position of the alignment in the reference sequence
Reference stop (rstop): Ending position of the alignment in the reference sequence
Visualization map: Optional matrix file containing state space exploration data for alignment visualization

Support

For questions and support:

Email: bbushnell@lbl.gov
Documentation: bbmap.org