DriftingPlusAligner
Aligns a query sequence to a reference using DriftingPlusAligner2. The sequences can be any characters, but N is a special case. Outputs the identity, rstart, and rstop positions. Optionally prints a state space exploration map. This map can be fed to visualizealignment.sh to make an image.
Basic Usage
driftingplusaligner.sh <query> <ref>
driftingplusaligner.sh <query> <ref> <map>
driftingplusaligner.sh <query> <ref> <map> <iterations>
DriftingPlusAligner2 is a highly optimized pairwise sequence aligner designed for high-identity alignments. It can accept sequences as literal nucleotide strings or as fasta files.
Parameters
Parameters are specified as positional arguments rather than named parameters, following the pattern shown in the usage section.
Parameters
- query
- A literal nucleotide sequence or fasta file. This is the sequence to align against the reference. Can contain any characters, though N nucleotides receive special handling in scoring.
- ref
- A literal nucleotide sequence or fasta file. This serves as the reference sequence for alignment. The aligner may internally swap query and reference if the query is longer than the reference to optimize performance.
- map
- Optional output text file for matrix score space visualization. This file contains the state space exploration data that can be fed to visualizealignment.sh to generate alignment visualization images. Set to "null" for benchmarking scenarios where visualization output is not needed.
- iterations
- Optional integer for benchmarking multiple alignment iterations. Useful for performance testing and timing measurements of the alignment algorithm.
- simd
- Add this flag to enable SIMD (Single Instruction, Multiple Data) mode for vectorized alignment operations. This can significantly accelerate alignment computation on supported hardware platforms.
Examples
Basic Alignment
driftingplusaligner.sh ATCGATCG ATCGATCGATCG
Aligns a short query sequence against a longer reference sequence, reporting identity and alignment coordinates.
File-Based Alignment
driftingplusaligner.sh query.fasta reference.fasta
Aligns sequences from fasta files. Both single-sequence and multi-sequence fasta files are supported.
Alignment with Visualization
driftingplusaligner.sh query.fasta reference.fasta alignment_map.txt
Performs alignment and saves the state space exploration matrix to a file for visualization with visualizealignment.sh.
SIMD-Accelerated Alignment
driftingplusaligner.sh ATCGATCG ATCGATCGATCG null 1 simd
Uses SIMD vectorization for faster alignment computation. Note that map is set to "null" and iterations to "1" to reach the simd flag.
Benchmarking
driftingplusaligner.sh query.fasta reference.fasta null 1000
Runs alignment 1000 times for performance benchmarking, with visualization disabled for accurate timing measurements.
Algorithm Details
DriftingPlusAligner2 implements an advanced banded dynamic programming algorithm optimized for high-identity sequence alignments. The algorithm incorporates several key innovations:
Adaptive Banding Strategy
The aligner uses a sophisticated adaptive banding approach that balances speed and accuracy:
- Dynamic bandwidth calculation: Initial bandwidth is determined by examining sequence similarity in the first portion of the alignment, using the formula: bandwidth = max(8, 1 + max(qLen, rLen)/16, 40 + sqrt(rLen)/4)
- Drift compensation: The band center drifts toward the highest-scoring alignment path, with a maximum drift of 2 positions per row
- Quality-responsive widening: Low alignment quality triggers automatic bandwidth expansion to accommodate more divergent sequences
- Glocal alignment support: Band starts wide and narrows to enable both global and local alignment modes
Memory Optimization
The algorithm achieves exceptional memory efficiency through several design choices:
- Two-array approach: Uses only two arrays (current and previous row) instead of storing the entire alignment matrix
- Traceback-free design: Calculates alignment coordinates without requiring traceback operations
- Bit-packed scoring: Uses 64-bit integers with bit fields for position tracking (21 bits), deletion counting (21 bits), and scoring (22 bits)
- Length limitation: Supports sequences up to 2 Mbp with 21-bit position encoding
Scoring System
The alignment scoring incorporates nucleotide-specific handling:
- Match scoring: +1 for exact nucleotide matches
- Mismatch penalties: -1 for substitutions
- Indel penalties: -1 for insertions and deletions
- N nucleotide handling: N bases receive neutral scoring (0) to avoid penalizing ambiguous positions
- Global vs local modes: Configurable through the GLOBAL constant for different alignment strategies
SIMD Acceleration
When SIMD mode is enabled, the algorithm leverages vectorized operations:
- Vectorized scoring: Processes multiple alignment cells simultaneously using SIMD instructions
- Hardware optimization: Takes advantage of modern CPU vector units for parallel computation
- Bandwidth-aware processing: Maintains the adaptive banding strategy while using vectorized operations within bands
Position Calculation
The algorithm calculates alignment coordinates using a mathematical approach:
- System of equations: Solves M + S + I = qLen, M + S + D = refAlnLength, Score = M - S - I - D
- Operation counting: Derives match, substitution, insertion, and deletion counts from final scores
- Identity calculation: Computes alignment identity as matches/(matches + substitutions + insertions + deletions)
Performance Characteristics
- Time complexity: O(n*m*b) where n and m are sequence lengths and b is average bandwidth
- Space complexity: O(m) linear space usage regardless of sequence length
- Optimized for: High-identity alignments (>70% identity) where banding is most effective
- Memory usage: Default 200MB allocation, scales with reference sequence length
Output Format
DriftingPlusAligner2 outputs alignment results in a simple format:
- Identity: Floating-point value between 0.0 and 1.0 representing alignment identity
- Reference start (rstart): Starting position of the alignment in the reference sequence
- Reference stop (rstop): Ending position of the alignment in the reference sequence
- Visualization map: Optional matrix file containing state space exploration data for alignment visualization
Support
For questions and support:
- Email: bbushnell@lbl.gov
- Documentation: bbmap.org