GlocalAligner

Script: glocalaligner.sh Package: aligner Class: GlocalAligner.java

Aligns a query sequence to a reference using GlocalAligner. Explores the entire matrix. The sequences can be any characters, but N is a special case. Outputs the identity, rstart, and rstop positions. Optionally prints a state space exploration map. This map can be fed to visualizealignment.sh to make an image.

Basic Usage

glocalaligner.sh <query> <ref>
glocalaligner.sh <query> <ref> <map>
glocalaligner.sh <query> <ref> <map> <iterations>

The glocal aligner performs global-local alignment, exploring the entire alignment matrix to find the optimal alignment between a query sequence and a reference sequence. It can work with any character sequences but treats 'N' as a special case for nucleotide sequences.

Parameters

The glocal aligner takes positional parameters for input sequences and optional output options.

Input Parameters

query: A literal nucleotide sequence or path to a FASTA file containing the query sequence to be aligned. The query can be shorter than the reference for optimal performance.
ref: A literal nucleotide sequence or path to a FASTA file containing the reference sequence. The reference sequence is limited to 2MB in length due to the 21-bit position encoding.

Output Parameters

map: Optional output text file for matrix score space visualization. The file contains the state space exploration map that can be fed to visualizealignment.sh to generate an image of the alignment matrix. Set to "null" for benchmarking runs without visualization overhead.
iterations: Optional integer parameter for benchmarking multiple iterations of the same alignment. Useful for performance testing and timing measurements.

Examples

Basic Sequence Alignment

glocalaligner.sh ATCGATCGATCG ATCGATCGATCGATCG

Aligns the query sequence "ATCGATCGATCG" to the reference "ATCGATCGATCGATCG" and outputs the identity score and alignment coordinates.

FASTA File Alignment

glocalaligner.sh query.fasta reference.fasta

Aligns sequences from FASTA files and reports the optimal alignment identity and position coordinates.

Alignment with Visualization

glocalaligner.sh ATCGATCG ATCGATCGATCG alignment_matrix.txt

Performs alignment and saves the matrix exploration map to a text file for visualization with visualizealignment.sh.

Benchmarking Mode

glocalaligner.sh ATCGATCGATCG ATCGATCGATCGATCG null 1000

Runs the alignment 1000 times for performance benchmarking without generating visualization output.

Algorithm Details

Glocal Alignment Strategy

GlocalAligner implements a space-efficient global-local alignment algorithm that explores the entire dynamic programming matrix to find the optimal alignment. Unlike traditional algorithms that require O(mn) space for traceback, this implementation uses only two arrays and calculates alignment statistics without explicit traceback.

Memory Efficiency

The algorithm uses a dual-array approach with only two long arrays: prev and curr, representing the previous and current rows of the alignment matrix. This reduces space complexity from O(mn) to O(n) where n is the reference length, while still exploring the complete alignment space.

Bit Field Encoding

The algorithm employs 64-bit bit field encoding using POSITION_MASK, DEL_MASK, and SCORE_MASK constants to store multiple pieces of information in a single long integer:

Position bits (21 bits): Track the starting position of optimal alignments, allowing references up to 2MB in length
Deletion bits (21 bits): Count deletion events during alignment for accurate statistics calculation
Score bits (22 bits): Store the actual alignment score in the remaining high-order bits

Scoring System

The alignment scoring system uses the following constants defined in the implementation:

Match: MATCH constant (+1 << SCORE_SHIFT) for identical characters (except N)
Substitution: SUB constant (-1 << SCORE_SHIFT) for mismatched characters
Insertion: INS constant (-1 << SCORE_SHIFT) for query characters not in reference
Deletion: DEL constant (-1 << SCORE_SHIFT) for reference characters not in query
N characters: N_SCORE constant (0L) when either query or reference contains N

Global vs. Local Behavior

The algorithm can operate in both global and local modes, controlled by the GLOBAL flag. In local mode, it finds the best-scoring subsequence alignment. In global mode, it performs end-to-end alignment with gap penalties.

Sequence Optimization

For improved performance, the algorithm automatically swaps query and reference sequences when the query is longer than the reference (only when position vector is not needed), ensuring the shorter sequence drives the outer loop.

Statistics Calculation

The algorithm calculates precise alignment statistics using the postprocess() method that solves a system of equations derived from the alignment operations:

Matches + Substitutions + Insertions = Query Length
Matches + Substitutions + Deletions = Reference Alignment Length
Matches - Substitutions - Insertions - Deletions = Raw Score

The postprocess() method extracts the deletions using (bestScore & DEL_MASK) >> POSITION_BITS and calculates matches as ((rawScore+qLen+deletions)/2f), eliminating the need for traceback while providing exact operation counts.

Performance Characteristics

The algorithm has O(mn) time complexity where m is query length and n is reference length, implemented with branchless conditional scoring using Math.max() operations and optimized score calculations such as maxValue=(maxDiagUp&SCORE_MASK)>=leftScore for improved cache performance. The space complexity is O(n) using dual rolling arrays, making it suitable for large reference sequences up to 2MB. Performance tracking uses AtomicLong loops counter to measure computational work.

Visualization Support

When enabled through the output parameter, the algorithm creates a Visualizer instance (new Visualizer(output, POSITION_BITS, DEL_BITS)) that generates a state space exploration map. The Visualizer.print() method outputs the progression of the dynamic programming algorithm through the alignment matrix at each row, useful for understanding alignment behavior and debugging.

Limitations

The current implementation has the following constraints:

Reference sequences are limited to 2MB (2,097,152 bases) due to 21-bit position encoding
The algorithm is designed for nucleotide sequences but can work with any character alphabet
N characters receive special treatment as neutral/unknown bases

Support

For questions and support:

Email: bbushnell@lbl.gov
Documentation: bbmap.org