ScrabbleAligner
Aligns sequences using adaptive banding with dynamic bandwidth adjustment. Uses only 2 arrays and avoids traceback while calculating exact ANI. Band center drifts toward highest scores, and bandwidth dynamically widens or narrows in response to sequence identity. Similar to WobbleAligner but trades the ring buffer for scalar variables for band tracking.
Basic Usage
scrabblealigner.sh <query> <ref>
scrabblealigner.sh <query> <ref> <map>
scrabblealigner.sh <query> <ref> <map> <iterations>
ScrabbleAligner accepts literal nucleotide sequences or FASTA files as input. Sequences can be any characters, but N is a special case (scores 0 to avoid penalizing unknown bases). The tool outputs identity, rstart, and rstop positions. Optionally generates a state space exploration map for visualization.
Parameters
Parameters are positional arguments. The tool also accepts standard Java memory flags.
Standard Parameters
- query
- A literal nucleotide sequence or fasta file.
- ref
- A literal nucleotide sequence or fasta file.
- map
- Optional output text file for matrix score space. Set to null for benchmarking with no visualization.
- iterations
- Optional integer for benchmarking multiple iterations.
Java Parameters
- -Xmx
- Set Java's memory usage, overriding autodetection. -Xmx20g will specify 20 gigs of RAM, and -Xmx200m will specify 200 megs. The max is typically 85% of physical memory. Default: 200m (fixed allocation).
- -eoom
- This flag will cause the process to exit if an out-of-memory exception occurs. Requires Java 8u92+.
- -da
- Disable assertions.
Output Format
ScrabbleAligner outputs alignment results to standard output:
Identity: 0.9876
Reference start: 123
Reference stop: 5432
Visualization Map
When the map parameter is specified, ScrabbleAligner generates a state space exploration map showing the alignment matrix cells explored during banded alignment. This map can be fed to visualizealignment.sh to create images showing the alignment pathway.
Examples
Basic Alignment
scrabblealigner.sh ACGTACGTACGT ACGTACGTACGT
Align two literal sequences and report identity.
FASTA File Input
scrabblealigner.sh query.fa reference.fa
Align sequences from FASTA files.
Generate Visualization Map
scrabblealigner.sh query.fa ref.fa alignment_map.txt
Align sequences and generate state space exploration map for visualization.
Visualization Pipeline
scrabblealigner.sh query.fa ref.fa map.txt
visualizealignment.sh map.txt alignment.png
Complete workflow: align sequences, generate map, create visualization image.
Performance Benchmarking
scrabblealigner.sh query.fa ref.fa null 1000
Run 1000 iterations with no visualization output for performance testing.
Algorithm Details
Adaptive Banding Strategy
ScrabbleAligner uses a sophisticated adaptive banding approach that balances speed with accuracy:
- Initial Bandwidth: Calculated by comparing the first min(qLen, rLen) bases. For high-identity sequences, the band starts narrow. For divergent sequences, it starts wider. The formula is:
mid(7, 1+max(qLen,rLen)/32, 20+sqrt(rLen)/8) - Dynamic Bandwidth Adjustment: As alignment progresses, the bandwidth widens when sequences diverge (rapid increase up to 8 cells per row) and narrows when sequences match (cautious reduction by doubling the decrement). This allows the algorithm to handle local variations while maintaining efficiency.
- Band Center Drift: The band center moves toward the highest-scoring position in each row (limited to ±2 cells per row). This allows the alignment to "follow" the optimal path without exploring the entire matrix.
- Glocal Capability: Band starts wide at the top (16 + 12×bandwidth) and narrows toward the bottom, allowing free initial insertions for glocal alignments while maintaining efficiency.
Memory-Efficient Design
- Two-Array Architecture: Uses only current and previous row arrays, avoiding O(m×n) memory of full matrix storage
- No Traceback: Calculates rstart and rstop from bit-packed position fields without storing alignment path
- Scalar Band Tracking: Unlike WobbleAligner's ring buffer approach, ScrabbleAligner uses simple scalar variables (bandStart, bandEnd, center) for band position tracking
- Memory Requirement: O(n) where n is reference length, with 200MB default allocation sufficient for sequences up to 2Mbp
Bit-Packing Scheme
Each cell stores three values in a single 64-bit long:
- Position bits [0-20]: 21 bits for reference start position (max 2,097,151 bp)
- Deletion bits [21-41]: 21 bits for deletion count in alignment
- Score bits [42-63]: 22 bits for alignment score
Scoring Scheme
- Match: +1
- Substitution: -1
- Insertion: -1
- Deletion: -1
- N (unknown base): 0 (neutral, no penalty)
Identity Calculation
ScrabbleAligner calculates exact identity without traceback by solving the system of equations:
M + S + I = qLen
M + S + D = refAlnLength
Score = M - S - I - D
Where M=matches, S=substitutions, I=insertions, D=deletions. The solution yields: matches = (score + qLen + deletions) / 2, from which identity = matches / (matches + substitutions + insertions + deletions).
Performance Characteristics
- Time Complexity: O(m×bandwidth) where bandwidth adapts to sequence similarity
- Space Complexity: O(n) for two row arrays
- Best Case: Near-identical sequences with narrow band (very fast)
- Worst Case: Divergent sequences requiring wide band (gracefully degrades)
- Sequence Limit: 2,097,151 bp (21-bit position encoding)
Comparison to Other Aligners
ScrabbleAligner vs WobbleAligner: Both use adaptive banding with drift, but ScrabbleAligner uses scalar variables for band tracking instead of WobbleAligner's ring buffer approach. The scalar design simplifies the implementation while maintaining comparable performance.
Integration with Visualization: ScrabbleAligner's map output format is compatible with visualizealignment.sh, enabling visual exploration of the adaptive banding pathway through the alignment matrix.
Support
Please contact Brian Bushnell at bbushnell@lbl.gov if you encounter any problems.
For documentation and the latest version, visit: https://bbmap.org