GlocalAligner
Aligns a query sequence to a reference using GlocalAligner. Explores the entire matrix. The sequences can be any characters, but N is a special case. Outputs the identity, rstart, and rstop positions. Optionally prints a state space exploration map. This map can be fed to visualizealignment.sh to make an image.
Basic Usage
glocalaligner.sh <query> <ref>
glocalaligner.sh <query> <ref> <map>
glocalaligner.sh <query> <ref> <map> <iterations>
The glocal aligner performs global-local alignment, exploring the entire alignment matrix to find the optimal alignment between a query sequence and a reference sequence. It can work with any character sequences but treats 'N' as a special case for nucleotide sequences.
Parameters
The glocal aligner takes positional parameters for input sequences and optional output options.
Input Parameters
- query
- A literal nucleotide sequence or path to a FASTA file containing the query sequence to be aligned. The query can be shorter than the reference for optimal performance.
- ref
- A literal nucleotide sequence or path to a FASTA file containing the reference sequence. The reference sequence is limited to 2MB in length due to the 21-bit position encoding.
Output Parameters
- map
- Optional output text file for matrix score space visualization. The file contains the state space exploration map that can be fed to visualizealignment.sh to generate an image of the alignment matrix. Set to "null" for benchmarking runs without visualization overhead.
- iterations
- Optional integer parameter for benchmarking multiple iterations of the same alignment. Useful for performance testing and timing measurements.
Examples
Basic Sequence Alignment
glocalaligner.sh ATCGATCGATCG ATCGATCGATCGATCG
Aligns the query sequence "ATCGATCGATCG" to the reference "ATCGATCGATCGATCG" and outputs the identity score and alignment coordinates.
FASTA File Alignment
glocalaligner.sh query.fasta reference.fasta
Aligns sequences from FASTA files and reports the optimal alignment identity and position coordinates.
Alignment with Visualization
glocalaligner.sh ATCGATCG ATCGATCGATCG alignment_matrix.txt
Performs alignment and saves the matrix exploration map to a text file for visualization with visualizealignment.sh.
Benchmarking Mode
glocalaligner.sh ATCGATCGATCG ATCGATCGATCGATCG null 1000
Runs the alignment 1000 times for performance benchmarking without generating visualization output.
Algorithm Details
Glocal Alignment Strategy
GlocalAligner implements a space-efficient global-local alignment algorithm that explores the entire dynamic programming matrix to find the optimal alignment. Unlike traditional algorithms that require O(mn) space for traceback, this implementation uses only two arrays and calculates alignment statistics without explicit traceback.
Memory Efficiency
The algorithm uses a dual-array approach with only two long arrays: prev
and curr
, representing the previous and current rows of the alignment matrix. This reduces space complexity from O(mn) to O(n) where n is the reference length, while still exploring the complete alignment space.
Bit Field Encoding
The algorithm employs 64-bit bit field encoding using POSITION_MASK, DEL_MASK, and SCORE_MASK constants to store multiple pieces of information in a single long integer:
- Position bits (21 bits): Track the starting position of optimal alignments, allowing references up to 2MB in length
- Deletion bits (21 bits): Count deletion events during alignment for accurate statistics calculation
- Score bits (22 bits): Store the actual alignment score in the remaining high-order bits
Scoring System
The alignment scoring system uses the following constants defined in the implementation:
- Match: MATCH constant (+1 << SCORE_SHIFT) for identical characters (except N)
- Substitution: SUB constant (-1 << SCORE_SHIFT) for mismatched characters
- Insertion: INS constant (-1 << SCORE_SHIFT) for query characters not in reference
- Deletion: DEL constant (-1 << SCORE_SHIFT) for reference characters not in query
- N characters: N_SCORE constant (0L) when either query or reference contains N
Global vs. Local Behavior
The algorithm can operate in both global and local modes, controlled by the GLOBAL
flag. In local mode, it finds the best-scoring subsequence alignment. In global mode, it performs end-to-end alignment with gap penalties.
Sequence Optimization
For improved performance, the algorithm automatically swaps query and reference sequences when the query is longer than the reference (only when position vector is not needed), ensuring the shorter sequence drives the outer loop.
Statistics Calculation
The algorithm calculates precise alignment statistics using the postprocess() method that solves a system of equations derived from the alignment operations:
- Matches + Substitutions + Insertions = Query Length
- Matches + Substitutions + Deletions = Reference Alignment Length
- Matches - Substitutions - Insertions - Deletions = Raw Score
The postprocess() method extracts the deletions using (bestScore & DEL_MASK) >> POSITION_BITS and calculates matches as ((rawScore+qLen+deletions)/2f), eliminating the need for traceback while providing exact operation counts.
Performance Characteristics
The algorithm has O(mn) time complexity where m is query length and n is reference length, implemented with branchless conditional scoring using Math.max() operations and optimized score calculations such as maxValue=(maxDiagUp&SCORE_MASK)>=leftScore for improved cache performance. The space complexity is O(n) using dual rolling arrays, making it suitable for large reference sequences up to 2MB. Performance tracking uses AtomicLong loops counter to measure computational work.
Visualization Support
When enabled through the output parameter, the algorithm creates a Visualizer instance (new Visualizer(output, POSITION_BITS, DEL_BITS)) that generates a state space exploration map. The Visualizer.print() method outputs the progression of the dynamic programming algorithm through the alignment matrix at each row, useful for understanding alignment behavior and debugging.
Limitations
The current implementation has the following constraints:
- Reference sequences are limited to 2MB (2,097,152 bases) due to 21-bit position encoding
- The algorithm is designed for nucleotide sequences but can work with any character alphabet
- N characters receive special treatment as neutral/unknown bases
Support
For questions and support:
- Email: bbushnell@lbl.gov
- Documentation: bbmap.org