MSA

Script: msa.sh Package: jgi Class: FindPrimers.java

Aligns a query sequence to reference sequences. Outputs the best matching position per reference sequence. If there are multiple queries, only the best-matching query will be used. MSA in this context stands for MultiStateAligner, not Multiple Sequence Alignment.

Basic Usage

msa.sh in=<file> out=<file> literal=<literal,literal,...>

or

msa.sh in=<file> out=<file> ref=<file>

MSA aligns query sequences (provided as literals or from a reference file) to input reads, reporting the best matching position per reference sequence in SAM format.

Parameters

Parameters are organized by their function in the alignment process. MSA requires either literal sequences or a reference file for alignment targets.

Input/Output Parameters

in=<file>
File containing reads to align against the reference sequences. Accepts FASTA or FASTQ format, gzipped files supported.
out=<file>
SAM output file containing alignment results. Each read gets aligned to all reference sequences, with only the best match reported.
literal=<sequence1,sequence2,...>
A sequence of bases to match, or a comma-delimited list of sequences. Use this to specify reference sequences directly in the command line. Do not use with ref parameter.
ref=<file>
A FASTA file containing reference sequences to align against. Please set either ref or literal, not both.
primer=<sequence_or_file>
Alias for ref parameter. If the value is an existing file, treated as ref; otherwise treated as literal sequence.
query=<sequence_or_file>
Alias for ref parameter. If the value is an existing file, treated as ref; otherwise treated as literal sequence.

Alignment Parameters

rcomp=t
Also look for reverse-complements of the reference sequences. Default: true. When enabled, MSA will align reads against both forward and reverse-complement orientations of each reference.
addr=f
Add r_ prefix to reverse-complemented alignments. Default: false. When true, reverse-complement matches will have "r_" prepended to the reference name in output.
replicate=f
Make copies of sequences with undefined bases for every possible combination. Default: false. For example, ATN would expand to ATA, ATC, ATG, and ATT, creating separate alignments for each possibility.
cutoff=0
Ignore alignments with identity below this threshold (range 0-1). Default: 0 (no filtering). Set to 0.8 to require 80% identity, for example.
swap=f
Swap the reference and query roles. Default: false. When true, reports read alignments to the reference instead of reference alignments to the reads. Changes the output format perspective.

Algorithm Selection Parameters

usemsa2=f / msa2=f
Use MultiStateAligner9PacBioAdapter2 algorithm. Default: false. Best for handling indels but slowest option. Recommended for long reads or when indel accuracy is critical.
usessa2=t / ssa2=t
Use SingleStateAlignerFlat2 algorithm. Default: true. Fastest option for 16S rRNA sequences. Optimized for speed with good accuracy for most applications.
usessa1d=f / ssa1d=f
Use SingleStateAlignerFlat2_1D algorithm. Default: false. Fastest option for 23S rRNA sequences. Specialized for longer ribosomal sequences.

Output Formatting Parameters

idhist=<file>
Output file for identity histogram. Records the distribution of alignment identity scores as a histogram for quality assessment.
printzeros=t
Print zero counts in histogram output. Default: true. When false, identity bins with zero counts are omitted from histogram output.
twocolumn=f / 2column=f
Use two-column output format for histogram. Default: false. When true, outputs "ID Count" format instead of single column.
onecolumn=t
Use single-column output format for histogram. Default: true. Outputs only count values, one per line.

Java Parameters

-Xmx<memory>
Set Java's maximum memory usage, overriding automatic memory detection. Examples: -Xmx20g for 20 gigabytes, -Xmx200m for 200 megabytes. Maximum is typically 85% of physical memory.
-eoom
Exit on out-of-memory exception. Requires Java 8u92+. Causes the process to terminate cleanly when memory is exhausted rather than hanging.
-da
Disable Java assertions. May provide minor performance improvement in production use.

Examples

Align reads to a literal sequence

msa.sh in=reads.fq out=alignments.sam literal=ATCGATCGATCG

Aligns reads from reads.fq to the sequence ATCGATCGATCG, outputting results in SAM format.

Align to multiple literal sequences

msa.sh in=reads.fq out=alignments.sam literal=ATCGATCG,GCTAGCTA,TTGGCCAA

Aligns reads against three different sequences, reporting the best match for each read.

Align to sequences from FASTA file

msa.sh in=reads.fq out=alignments.sam ref=primers.fasta

Uses sequences from primers.fasta as alignment targets.

High-stringency alignment with identity filtering

msa.sh in=reads.fq out=alignments.sam ref=targets.fasta cutoff=0.95 rcomp=f

Requires 95% identity for reported alignments and disables reverse-complement matching.

Generate identity histogram

msa.sh in=reads.fq out=alignments.sam literal=ATCGATCG idhist=identity.hist

Creates an identity histogram file showing the distribution of alignment scores.

Handle ambiguous bases with replication

msa.sh in=reads.fq out=alignments.sam literal=ATCNGATCG replicate=t

Expands the sequence with N to all possible combinations (ATCAGATCG, ATCCGATCG, ATCGGATCG, ATCTGATCG).

Algorithm Details

MSA (MultiStateAligner) implements dynamic programming alignment using the FindPrimers.java class with configurable algorithm selection based on sequence type and accuracy requirements.

Aligner Selection Strategy

MSA instantiates alignment algorithms through conditional selection logic in ProcessThread.msa (lines 555-556):

Alignment Process

The alignment process follows these implementation steps in FindPrimers.processRead() (lines 394-464):

  1. Query Processing: Reference sequences loaded via FastaReadInputStream.toReads() (line 127) or literal parsing, with reverse-complement generation using r.reverseComplement() when rcomp=true (line 176).
  2. Dynamic Programming Alignment: Each read aligned using msa.fillUnlimited() (lines 411, 413) which explores complete matrix without score thresholding, followed by msa.score() for coordinate calculation.
  3. Best Match Selection: Loop through all reference sequences (line 402), comparing ss.quickScore values to identify maximum scoring alignment using bestSite comparison logic.
  4. Score Calculation: Identity computed via Read.identity(ss.match) using match string encoding, with identity histogram tracking through idHist.incrementAndGet() (line 473).
  5. SAM Output Generation: Alignments converted using SamLine.toCigar14() for CIGAR string generation (lines 490, 494) with YI:f identity annotation (line 518).

Memory and Performance Characteristics

MSA implements specific memory management strategies based on source code analysis:

Output Format Details

MSA generates SAM output through specific implementation methods:

Support

For questions and support: