MicroAlign

Basic Usage

microalign.sh in=<input file> out=<output file> ref=<reference>

Input may be fasta or fastq, compressed or uncompressed.

Parameters

Parameters are organized by their function in the alignment process. All parameters from the shell script are preserved in their original groupings.

Standard parameters

in=<file>: Primary input, or read 1 input. Can be fasta or fastq, compressed or uncompressed.
in2=<file>: Read 2 input if reads are in two files.
out=<file>: Primary output, or read 1 output. Aligned reads in SAM format by default.
out2=<file>: Read 2 output if reads are in two files.
outu=<file>: Optional unmapped read output. Reads that do not align to the reference.
outu2=<file>: Optional unmapped read 2 output. Paired reads that do not align to the reference.
ref=<file>: Reference sequence file. Should be a small, single-contig reference like PhiX. Required parameter.

Processing parameters

k=17: Main kmer length. Used for initial alignment. Also accepts k1 or kbig as aliases.
k2=13: Sub-kmer length for paired reads only. Used when attempting to map the mate of a mapped read. Also accepts ksmall as alias.
minid=0.66: Minimum alignment identity (0.0-1.0). Reads with lower identity are considered unmapped. Also accepts minid1 as alias.
minid2=0.56: Minimum alignment identity if the mate is mapped (0.0-1.0). Allows lower identity threshold for mate mapping.
mm=1: Middle mask length; the index uses gapped kmers. Sets both mm1 and mm2 to this value.
mm1=1: Middle mask length for k1 kmers. Individual control over primary kmer masking.
mm2=1: Middle mask length for k2 kmers. Individual control over secondary kmer masking.

Additional Processing Parameters

verbose=f: Print verbose messages during processing. Useful for debugging alignment issues.
ordered=f: Output reads in the same order as input. Requires additional memory buffering.
mappedonly=f: Only output mapped reads. Unmapped reads are discarded instead of written to main output.

Java Parameters

-Xmx: This will set Java's memory usage, overriding autodetection. -Xmx20g will specify 20 gigs of RAM, and -Xmx200m will specify 200 megs. The max is typically 85% of physical memory.
-eoom: This flag will cause the process to exit if an out-of-memory exception occurs. Requires Java 8u92+.
-da: Disable assertions. May provide minor performance improvement.

Examples

Basic Alignment to PhiX

microalign.sh in=reads.fq out=aligned.sam ref=phix.fa

Align single-end reads to PhiX reference with default parameters.

Paired-End Alignment with Unmapped Output

microalign.sh in=reads_1.fq in2=reads_2.fq out=aligned.sam outu=unmapped.fq ref=phix.fa

Align paired-end reads, writing mapped reads to SAM format and unmapped reads to separate file.

Custom Identity Thresholds

microalign.sh in=reads.fq out=aligned.sam ref=control.fa minid=0.80 minid2=0.70

Use higher identity thresholds for more stringent alignment. Useful for detecting contamination.

Optimized for Very Short Reads

microalign.sh in=reads.fq out=aligned.sam ref=adapter.fa k=11 k2=9 minid=0.50

Use shorter kmers and lower identity threshold for aligning very short reads like adapters.

High Memory Mode with Statistics

microalign.sh in=reads.fq out=aligned.sam ref=phix.fa -Xmx8g verbose=t

Run with high memory allocation and verbose output for detailed alignment statistics.

Algorithm Details

MicroIndex3 Kmer Indexing Architecture

MicroAlign uses single-contig indexing through the MicroIndex3 class implementation:

LongHashMap storage: Uses LongHashMap(ref.length*2) for kmer-to-position mapping with O(1) lookup performance
Canonical kmer indexing: Stores Tools.max(kmer, rkmer) as key with strand encoding in value
Position encoding: Values store position with MINUS_CODE offset (1,000,000,000) for reverse strand hits
Memory scaling: Index size scales as 2x reference length for load factor optimization
Gapped kmer masking: Middle mask removes central positions using makeMidMask() bitwise operations

Dual Kmer Strategy Implementation

MicroAlign uses two MicroIndex3 instances with different kmer lengths for complete mapping:

Primary index (k1=17): Longer kmers provide specificity with index1=new MicroIndex3(k1, mm1, r)
Secondary index (k2=13): Shorter kmers increase sensitivity with index2=new MicroIndex3(k2, mm2, r)
Mate rescue logic: Secondary mapper only runs when one mate maps but other fails
Conditional creation: Secondary index created only if k2 >= 1, otherwise null for single-end mode

MicroAligner3 Alignment Process

Two-stage alignment strategy combining quick alignment with full dynamic programming:

Quick alignment: First attempts simple base-by-base comparison with quickAlign() method
Mismatch threshold: Quick mode fails if >3 substitutions or matches < 25% of read length
Full alignment fallback: Uses SingleStateAlignerFlat2 for gapped alignment with padding
Identity calculation: Precise identity from match string using Read.identity(match) method
Coordinate mapping: Sets r.start, r.stop, r.chrom=1 for successful alignments

Gapped Kmer Middle Masking

Bit-field masking implementation for error-tolerant kmer matching:

Mask calculation: makeMidMask() creates bit pattern masking central positions
Bit manipulation: Uses ~((~((-1L)<<bits))<<shift) for precise position masking
Error tolerance: mm=1 allows matching across single-base differences in central position
Independent control: Separate mm1/mm2 values for different kmer lengths
Index integration: Applied during both indexing and query phases for consistency

Proper Pair Detection Logic

Precise criteria implementation for paired-end classification:

Chromosome matching: boolean properPair requires r1.chrom==r2.chrom
Strand orientation: Checks r1.strand()!=r2.strand() for proper pair geometry
Distance constraint: Tools.absdif(r1.start, r2.start)<=1000 for insert size limit
Flag propagation: Sets r1.setPaired(properPair) and r2.setPaired(properPair)
SAM integration: Proper pair flags transferred to SAM output format

Identity-Based Filtering Implementation

Two-tier threshold system with mathematical precision:

Primary threshold: minIdentity1=0.66f applied to standalone read alignments
Mate threshold: minIdentity2=0.56f used when mate already mapped successfully
Rescue strategy: Conditional logic id2=mapper2.map(r2) for mate recovery
Identity calculation: Float precision using alignment statistics from match string
Threshold enforcement: Reads below threshold marked as unmapped

Performance Characteristics

Memory efficiency: LongHashMap with 2x reference scaling vs genome-wide indices
Threading model: ProcessThread instances with shared index access and thread-local aligners
Single-contig optimization: No chromosome lookup overhead, direct position mapping
SIMD compatibility: SingleStateAlignerFlat2 supports vectorized operations where available
Quick alignment bypass: Avoids DP computation for simple cases with <4 mismatches

Limitations and Design Constraints

Single contig requirement: MicroIndex3 assumes single reference chromosome with chrom=1
Duplicate kmer handling: HashMap overwrites duplicate positions, losing multi-mapping information
Memory scaling: Index size proportional to reference length, unsuitable for large genomes
Threading bottleneck: Shared index access could limit scalability with many threads
Kmer length constraints: Maximum k=32 due to long integer bit limitations

Output Formats

SAM Output

Default output format includes standard SAM fields with MicroAlign-specific features:

MAPQ scores: Based on alignment identity and uniqueness
Proper pair flags: Set according to distance and orientation criteria
Identity information: Encoded in alignment string and optional fields

Statistics Output

Standard statistics include:

Aligned Reads: Count and percentage of successfully mapped reads
Aligned Bases: Total bases in mapped reads
Average Identity: Mean identity score across all alignments

Use Cases

Quality Control

PhiX contamination: Detect Illumina PhiX spike-in contamination
Adapter detection: Identify adapter sequences in reads
Control alignment: Align to known control sequences

Small Reference Alignment

Plasmid alignment: Map reads to plasmid sequences
Viral genomes: Align to small viral reference genomes
Amplicon analysis: Map PCR amplicon sequences

Performance Testing

Pipeline validation: Quick alignment for pipeline testing
Parameter optimization: Test alignment parameters before large runs
Quality assessment: Rapid quality check of sequencing runs

Support

For questions and support:

Email: bbushnell@lbl.gov
Documentation: bbmap.org