SmithWaterman
Aligns a query sequence to a reference using Smith-Waterman algorithm. Finds optimal local alignment by resetting negative scores to zero. The sequences can be any characters, but N is a special case. Outputs the identity, rstart, and rstop positions. Optionally prints a state space exploration map that can be fed to visualizealignment.sh to make an image.
Basic Usage
smithwaterman.sh <query> <ref>
smithwaterman.sh <query> <ref> <map>
smithwaterman.sh <query> <ref> <map> <iterations>
SmithWaterman performs optimal local sequence alignment using the classic Smith-Waterman dynamic programming algorithm. Unlike global alignment algorithms, it finds the best matching subsequence by allowing scores to reset to zero when they become negative.
Parameters
Parameters control the input sequences, optional visualization output, and benchmarking iterations.
Standard Parameters
- query
- A literal nucleotide sequence or fasta file. This is the sequence being aligned to the reference.
- ref
- A literal nucleotide sequence or fasta file. This is the reference sequence that the query aligns against.
Optional Parameters
- map
- Optional output text file for matrix score space. This creates a visualization-ready file showing the dynamic programming matrix exploration. Set to null for benchmarking with no visualization.
- iterations
- Optional integer for benchmarking multiple iterations. Useful for performance testing by running the alignment algorithm repeatedly.
Java Parameters
- -Xmx
- Set Java's memory usage, overriding autodetection. -Xmx20g will specify 20 gigs of RAM, and -Xmx200m will specify 200 megs. The max is typically 85% of physical memory. Default: 200m (fixed allocation).
- -eoom
- This flag will cause the process to exit if an out-of-memory exception occurs. Requires Java 8u92+.
- -da
- Disable assertions.
Examples
Basic Alignment with Literal Sequences
smithwaterman.sh ACGTACGTACGT TACGTACGTAC
Aligns two literal nucleotide sequences and outputs identity and alignment positions.
Alignment with FASTA Files
smithwaterman.sh query.fa reference.fa
Aligns sequences from FASTA files rather than using literal sequences.
Generate Visualization Map
smithwaterman.sh query.fa ref.fa alignment_map.txt
Produces an alignment and saves the score matrix to alignment_map.txt for visualization with visualizealignment.sh.
Benchmarking Performance
smithwaterman.sh query.fa ref.fa null 1000
Runs the alignment 1000 times for performance benchmarking without generating visualization output.
Algorithm Details
Smith-Waterman Local Alignment
The Smith-Waterman algorithm is the classic dynamic programming approach for finding optimal local alignments between sequences. Key characteristics include:
- Local Alignment: Unlike global alignment (Needleman-Wunsch), Smith-Waterman finds the best matching subsequence, not necessarily aligning the entire sequences
- Score Resetting: The critical modification is resetting negative scores to zero, allowing the alignment to start fresh anywhere in the sequences
- Optimal Guarantee: Guaranteed to find the mathematically optimal local alignment under the given scoring scheme
- Global Maximum Tracking: During matrix filling, tracks the global maximum score and its position for local alignment recovery
Scoring Scheme
The implementation uses a straightforward scoring system:
- Match: +1 (when bases match and neither is N)
- Substitution: -1 (when bases differ)
- Insertion: -1 (gap in reference)
- Deletion: -1 (gap in query)
- N Handling: Score of 0 when either base is N (ambiguous nucleotide)
Implementation Details
The SmithWaterman class implements several optimizations:
- Bit-packed Scoring: Uses 64-bit longs with bit fields for position (21 bits), deletions (21 bits), and score (22 bits)
- Two-Row Matrix: Only maintains current and previous rows rather than full matrix, reducing memory from O(mn) to O(n)
- Branchless Operations: Score calculations use conditional expressions rather than if/else for better CPU pipeline performance
- Sequence Swapping: Automatically swaps sequences so query is shorter than reference when possible (only when position vector not needed)
Output Format
The tool outputs:
- Identity: Fraction of aligned bases that match (0.0-1.0)
- rStart: Starting position in reference sequence (0-indexed)
- rStop: Ending position in reference sequence (0-indexed, inclusive)
Visualization Integration
When a map file is specified, the tool generates a state space exploration map showing:
- Score values at each position in the dynamic programming matrix
- Traceback path showing optimal alignment route
- Visual representation suitable for educational purposes or debugging
- Compatible with visualizealignment.sh for graphical rendering
Memory Requirements
Memory usage scales with reference length: approximately O(n) where n is reference length. The default 200MB allocation handles sequences up to millions of bases. The maximum reference length is limited by the 21-bit position field (2,097,151 bases).
Performance Characteristics
Time complexity is O(mn) where m is query length and n is reference length. The branchless implementation and bit-packing optimizations provide significant speedup over naive implementations, making it suitable for interactive use with sequences up to tens of thousands of bases.
Support
Please contact Brian Bushnell at bbushnell@lbl.gov if you encounter any problems.