MSA
Aligns a query sequence to reference sequences. Outputs the best matching position per reference sequence. If there are multiple queries, only the best-matching query will be used. MSA in this context stands for MultiStateAligner, not Multiple Sequence Alignment.
Basic Usage
msa.sh in=<file> out=<file> literal=<literal,literal,...>
or
msa.sh in=<file> out=<file> ref=<file>
MSA aligns query sequences (provided as literals or from a reference file) to input reads, reporting the best matching position per reference sequence in SAM format.
Parameters
Parameters are organized by their function in the alignment process. MSA requires either literal sequences or a reference file for alignment targets.
Input/Output Parameters
- in=<file>
- File containing reads to align against the reference sequences. Accepts FASTA or FASTQ format, gzipped files supported.
- out=<file>
- SAM output file containing alignment results. Each read gets aligned to all reference sequences, with only the best match reported.
- literal=<sequence1,sequence2,...>
- A sequence of bases to match, or a comma-delimited list of sequences. Use this to specify reference sequences directly in the command line. Do not use with ref parameter.
- ref=<file>
- A FASTA file containing reference sequences to align against. Please set either ref or literal, not both.
- primer=<sequence_or_file>
- Alias for ref parameter. If the value is an existing file, treated as ref; otherwise treated as literal sequence.
- query=<sequence_or_file>
- Alias for ref parameter. If the value is an existing file, treated as ref; otherwise treated as literal sequence.
Alignment Parameters
- rcomp=t
- Also look for reverse-complements of the reference sequences. Default: true. When enabled, MSA will align reads against both forward and reverse-complement orientations of each reference.
- addr=f
- Add r_ prefix to reverse-complemented alignments. Default: false. When true, reverse-complement matches will have "r_" prepended to the reference name in output.
- replicate=f
- Make copies of sequences with undefined bases for every possible combination. Default: false. For example, ATN would expand to ATA, ATC, ATG, and ATT, creating separate alignments for each possibility.
- cutoff=0
- Ignore alignments with identity below this threshold (range 0-1). Default: 0 (no filtering). Set to 0.8 to require 80% identity, for example.
- swap=f
- Swap the reference and query roles. Default: false. When true, reports read alignments to the reference instead of reference alignments to the reads. Changes the output format perspective.
Algorithm Selection Parameters
- usemsa2=f / msa2=f
- Use MultiStateAligner9PacBioAdapter2 algorithm. Default: false. Best for handling indels but slowest option. Recommended for long reads or when indel accuracy is critical.
- usessa2=t / ssa2=t
- Use SingleStateAlignerFlat2 algorithm. Default: true. Fastest option for 16S rRNA sequences. Optimized for speed with good accuracy for most applications.
- usessa1d=f / ssa1d=f
- Use SingleStateAlignerFlat2_1D algorithm. Default: false. Fastest option for 23S rRNA sequences. Specialized for longer ribosomal sequences.
Output Formatting Parameters
- idhist=<file>
- Output file for identity histogram. Records the distribution of alignment identity scores as a histogram for quality assessment.
- printzeros=t
- Print zero counts in histogram output. Default: true. When false, identity bins with zero counts are omitted from histogram output.
- twocolumn=f / 2column=f
- Use two-column output format for histogram. Default: false. When true, outputs "ID Count" format instead of single column.
- onecolumn=t
- Use single-column output format for histogram. Default: true. Outputs only count values, one per line.
Java Parameters
- -Xmx<memory>
- Set Java's maximum memory usage, overriding automatic memory detection. Examples: -Xmx20g for 20 gigabytes, -Xmx200m for 200 megabytes. Maximum is typically 85% of physical memory.
- -eoom
- Exit on out-of-memory exception. Requires Java 8u92+. Causes the process to terminate cleanly when memory is exhausted rather than hanging.
- -da
- Disable Java assertions. May provide minor performance improvement in production use.
Examples
Align reads to a literal sequence
msa.sh in=reads.fq out=alignments.sam literal=ATCGATCGATCG
Aligns reads from reads.fq to the sequence ATCGATCGATCG, outputting results in SAM format.
Align to multiple literal sequences
msa.sh in=reads.fq out=alignments.sam literal=ATCGATCG,GCTAGCTA,TTGGCCAA
Aligns reads against three different sequences, reporting the best match for each read.
Align to sequences from FASTA file
msa.sh in=reads.fq out=alignments.sam ref=primers.fasta
Uses sequences from primers.fasta as alignment targets.
High-stringency alignment with identity filtering
msa.sh in=reads.fq out=alignments.sam ref=targets.fasta cutoff=0.95 rcomp=f
Requires 95% identity for reported alignments and disables reverse-complement matching.
Generate identity histogram
msa.sh in=reads.fq out=alignments.sam literal=ATCGATCG idhist=identity.hist
Creates an identity histogram file showing the distribution of alignment scores.
Handle ambiguous bases with replication
msa.sh in=reads.fq out=alignments.sam literal=ATCNGATCG replicate=t
Expands the sequence with N to all possible combinations (ATCAGATCG, ATCCGATCG, ATCGGATCG, ATCTGATCG).
Algorithm Details
MSA (MultiStateAligner) implements dynamic programming alignment using the FindPrimers.java class with configurable algorithm selection based on sequence type and accuracy requirements.
Aligner Selection Strategy
MSA instantiates alignment algorithms through conditional selection logic in ProcessThread.msa (lines 555-556):
- SingleStateAlignerFlat2 (default): Selected when useSSA2=true. Implements flat scoring matrix without complex state management, optimized for 16S rRNA sequences through streamlined traceback and identity calculation.
- SingleStateAlignerFlat2_1D: Selected when useSSA1D=true. Uses 1D optimization techniques for 23S rRNA sequences, providing specialized performance for longer ribosomal sequences through dimensional reduction.
- MultiStateAligner9PacBioAdapter2: Selected when useMSA2=true. Implements three-matrix architecture (MS, DEL, INS states) with streak-aware scoring using time-based penalties for PacBio long-read data.
- SingleStateAlignerFlat2Amino: Automatically selected when Shared.AMINO_IN=true, providing amino acid sequence alignment with specialized scoring matrices.
Alignment Process
The alignment process follows these implementation steps in FindPrimers.processRead() (lines 394-464):
- Query Processing: Reference sequences loaded via FastaReadInputStream.toReads() (line 127) or literal parsing, with reverse-complement generation using r.reverseComplement() when rcomp=true (line 176).
- Dynamic Programming Alignment: Each read aligned using msa.fillUnlimited() (lines 411, 413) which explores complete matrix without score thresholding, followed by msa.score() for coordinate calculation.
- Best Match Selection: Loop through all reference sequences (line 402), comparing ss.quickScore values to identify maximum scoring alignment using bestSite comparison logic.
- Score Calculation: Identity computed via Read.identity(ss.match) using match string encoding, with identity histogram tracking through idHist.incrementAndGet() (line 473).
- SAM Output Generation: Alignments converted using SamLine.toCigar14() for CIGAR string generation (lines 490, 494) with YI:f identity annotation (line 518).
Memory and Performance Characteristics
MSA implements specific memory management strategies based on source code analysis:
- Memory Scaling: Memory usage determined by maxqlen calculation (lines 133, 146, 159) tracking maximum query length, with aligner instantiation per thread avoiding shared state conflicts.
- Threading: Multithreading via ProcessThread class extending Thread (line 297), with thread count determined by Shared.threads() and work distribution through ThreadWaiter.startAndWait() (line 270).
- Algorithm Performance: Comments indicate SingleStateAlignerFlat2 fastest for 16S (line 561), SingleStateAlignerFlat2_1D fastest for 23S (line 562), MultiStateAligner9PacBioAdapter2 best for indels but slowest (line 560).
- Identity Tracking: Real-time histogram via AtomicIntegerArray idHist (line 607) with concurrent updates during alignment, avoiding post-processing overhead.
Output Format Details
MSA generates SAM output through specific implementation methods:
- Reference Naming: FASTA headers preserved via r.name() or generated as "query"+i for literal sequences (line 152), with header construction in sharedHeader ByteBuilder.
- Coordinate System: 1-based SAM coordinates using Tools.max(0, ss.start)+1 calculation (line 486) from SiteScore start position with proper boundary handling.
- CIGAR Strings: Generated via SamLine.toCigar14() method with match string input, start/stop coordinates, and sequence length parameters for accurate alignment representation.
- Custom Tags: YI:f identity tag generated using bb.append("YI:f:").append(100*identity, 2) with 2-decimal precision (line 518) from Read.identity() calculations.
- Strand Information: SAM flags calculated through makeFlag() method (line 591) using ss.strand() comparison with Shared.MINUS for reverse complement detection.
Support
For questions and support:
- Email: bbushnell@lbl.gov
- Documentation: bbmap.org