SmithWaterman

Script: smithwaterman.sh Package: aligner Class: SmithWaterman.java

Aligns a query sequence to a reference using Smith-Waterman algorithm. Finds optimal local alignment by resetting negative scores to zero. The sequences can be any characters, but N is a special case. Outputs the identity, rstart, and rstop positions. Optionally prints a state space exploration map that can be fed to visualizealignment.sh to make an image.

Basic Usage

smithwaterman.sh <query> <ref>
smithwaterman.sh <query> <ref> <map>
smithwaterman.sh <query> <ref> <map> <iterations>

SmithWaterman performs optimal local sequence alignment using the classic Smith-Waterman dynamic programming algorithm. Unlike global alignment algorithms, it finds the best matching subsequence by allowing scores to reset to zero when they become negative.

Parameters

Parameters control the input sequences, optional visualization output, and benchmarking iterations.

Standard Parameters

query: A literal nucleotide sequence or fasta file. This is the sequence being aligned to the reference.
ref: A literal nucleotide sequence or fasta file. This is the reference sequence that the query aligns against.

Optional Parameters

map: Optional output text file for matrix score space. This creates a visualization-ready file showing the dynamic programming matrix exploration. Set to null for benchmarking with no visualization.
iterations: Optional integer for benchmarking multiple iterations. Useful for performance testing by running the alignment algorithm repeatedly.

Java Parameters

-Xmx: Set Java's memory usage, overriding autodetection. -Xmx20g will specify 20 gigs of RAM, and -Xmx200m will specify 200 megs. The max is typically 85% of physical memory. Default: 200m (fixed allocation).
-eoom: This flag will cause the process to exit if an out-of-memory exception occurs. Requires Java 8u92+.
-da: Disable assertions.

Examples

Basic Alignment with Literal Sequences

smithwaterman.sh ACGTACGTACGT TACGTACGTAC

Aligns two literal nucleotide sequences and outputs identity and alignment positions.

Alignment with FASTA Files

smithwaterman.sh query.fa reference.fa

Aligns sequences from FASTA files rather than using literal sequences.

Generate Visualization Map

smithwaterman.sh query.fa ref.fa alignment_map.txt

Produces an alignment and saves the score matrix to alignment_map.txt for visualization with visualizealignment.sh.

Benchmarking Performance

smithwaterman.sh query.fa ref.fa null 1000

Runs the alignment 1000 times for performance benchmarking without generating visualization output.

Algorithm Details

Smith-Waterman Local Alignment

The Smith-Waterman algorithm is the classic dynamic programming approach for finding optimal local alignments between sequences. Key characteristics include:

Local Alignment: Unlike global alignment (Needleman-Wunsch), Smith-Waterman finds the best matching subsequence, not necessarily aligning the entire sequences
Score Resetting: The critical modification is resetting negative scores to zero, allowing the alignment to start fresh anywhere in the sequences
Optimal Guarantee: Guaranteed to find the mathematically optimal local alignment under the given scoring scheme
Global Maximum Tracking: During matrix filling, tracks the global maximum score and its position for local alignment recovery

Scoring Scheme

The implementation uses a straightforward scoring system:

Match: +1 (when bases match and neither is N)
Substitution: -1 (when bases differ)
Insertion: -1 (gap in reference)
Deletion: -1 (gap in query)
N Handling: Score of 0 when either base is N (ambiguous nucleotide)

Implementation Details

The SmithWaterman class implements several optimizations:

Bit-packed Scoring: Uses 64-bit longs with bit fields for position (21 bits), deletions (21 bits), and score (22 bits)
Two-Row Matrix: Only maintains current and previous rows rather than full matrix, reducing memory from O(mn) to O(n)
Branchless Operations: Score calculations use conditional expressions rather than if/else for better CPU pipeline performance
Sequence Swapping: Automatically swaps sequences so query is shorter than reference when possible (only when position vector not needed)

Output Format

The tool outputs:

Identity: Fraction of aligned bases that match (0.0-1.0)
rStart: Starting position in reference sequence (0-indexed)
rStop: Ending position in reference sequence (0-indexed, inclusive)

Visualization Integration

When a map file is specified, the tool generates a state space exploration map showing:

Score values at each position in the dynamic programming matrix
Traceback path showing optimal alignment route
Visual representation suitable for educational purposes or debugging
Compatible with visualizealignment.sh for graphical rendering

Memory Requirements

Memory usage scales with reference length: approximately O(n) where n is reference length. The default 200MB allocation handles sequences up to millions of bases. The maximum reference length is limited by the 21-bit position field (2,097,151 bases).

Performance Characteristics

Time complexity is O(mn) where m is query length and n is reference length. The branchless implementation and bit-packing optimizations provide significant speedup over naive implementations, making it suitable for interactive use with sequences up to tens of thousands of bases.

Support

Please contact Brian Bushnell at bbushnell@lbl.gov if you encounter any problems.