SmithWaterman

Script: smithwaterman.sh Package: aligner Class: SmithWaterman.java

Aligns a query sequence to a reference using Smith-Waterman algorithm. Finds optimal local alignment by resetting negative scores to zero. The sequences can be any characters, but N is a special case. Outputs the identity, rstart, and rstop positions. Optionally prints a state space exploration map that can be fed to visualizealignment.sh to make an image.

Basic Usage

smithwaterman.sh <query> <ref>
smithwaterman.sh <query> <ref> <map>
smithwaterman.sh <query> <ref> <map> <iterations>

SmithWaterman performs optimal local sequence alignment using the classic Smith-Waterman dynamic programming algorithm. Unlike global alignment algorithms, it finds the best matching subsequence by allowing scores to reset to zero when they become negative.

Parameters

Parameters control the input sequences, optional visualization output, and benchmarking iterations.

Standard Parameters

query
A literal nucleotide sequence or fasta file. This is the sequence being aligned to the reference.
ref
A literal nucleotide sequence or fasta file. This is the reference sequence that the query aligns against.

Optional Parameters

map
Optional output text file for matrix score space. This creates a visualization-ready file showing the dynamic programming matrix exploration. Set to null for benchmarking with no visualization.
iterations
Optional integer for benchmarking multiple iterations. Useful for performance testing by running the alignment algorithm repeatedly.

Java Parameters

-Xmx
Set Java's memory usage, overriding autodetection. -Xmx20g will specify 20 gigs of RAM, and -Xmx200m will specify 200 megs. The max is typically 85% of physical memory. Default: 200m (fixed allocation).
-eoom
This flag will cause the process to exit if an out-of-memory exception occurs. Requires Java 8u92+.
-da
Disable assertions.

Examples

Basic Alignment with Literal Sequences

smithwaterman.sh ACGTACGTACGT TACGTACGTAC

Aligns two literal nucleotide sequences and outputs identity and alignment positions.

Alignment with FASTA Files

smithwaterman.sh query.fa reference.fa

Aligns sequences from FASTA files rather than using literal sequences.

Generate Visualization Map

smithwaterman.sh query.fa ref.fa alignment_map.txt

Produces an alignment and saves the score matrix to alignment_map.txt for visualization with visualizealignment.sh.

Benchmarking Performance

smithwaterman.sh query.fa ref.fa null 1000

Runs the alignment 1000 times for performance benchmarking without generating visualization output.

Algorithm Details

Smith-Waterman Local Alignment

The Smith-Waterman algorithm is the classic dynamic programming approach for finding optimal local alignments between sequences. Key characteristics include:

Scoring Scheme

The implementation uses a straightforward scoring system:

Implementation Details

The SmithWaterman class implements several optimizations:

Output Format

The tool outputs:

Visualization Integration

When a map file is specified, the tool generates a state space exploration map showing:

Memory Requirements

Memory usage scales with reference length: approximately O(n) where n is reference length. The default 200MB allocation handles sequences up to millions of bases. The maximum reference length is limited by the 21-bit position field (2,097,151 bases).

Performance Characteristics

Time complexity is O(mn) where m is query length and n is reference length. The branchless implementation and bit-packing optimizations provide significant speedup over naive implementations, making it suitable for interactive use with sequences up to tens of thousands of bases.

Support

Please contact Brian Bushnell at bbushnell@lbl.gov if you encounter any problems.