MicroAlign
Wrapper for MicroAligner. Can align reads to a small, single-contig reference like PhiX. Performance optimized for single-contig references through specialized indexing. Produces most of the same histograms, like idhist, mhist, etc. Not currently designed for reference with multiple sequences, or duplicate kmers of length used for indexing.
Basic Usage
microalign.sh in=<input file> out=<output file> ref=<reference>
Input may be fasta or fastq, compressed or uncompressed.
Parameters
Parameters are organized by their function in the alignment process. All parameters from the shell script are preserved in their original groupings.
Standard parameters
- in=<file>
- Primary input, or read 1 input. Can be fasta or fastq, compressed or uncompressed.
- in2=<file>
- Read 2 input if reads are in two files.
- out=<file>
- Primary output, or read 1 output. Aligned reads in SAM format by default.
- out2=<file>
- Read 2 output if reads are in two files.
- outu=<file>
- Optional unmapped read output. Reads that do not align to the reference.
- outu2=<file>
- Optional unmapped read 2 output. Paired reads that do not align to the reference.
- ref=<file>
- Reference sequence file. Should be a small, single-contig reference like PhiX. Required parameter.
Processing parameters
- k=17
- Main kmer length. Used for initial alignment. Also accepts k1 or kbig as aliases.
- k2=13
- Sub-kmer length for paired reads only. Used when attempting to map the mate of a mapped read. Also accepts ksmall as alias.
- minid=0.66
- Minimum alignment identity (0.0-1.0). Reads with lower identity are considered unmapped. Also accepts minid1 as alias.
- minid2=0.56
- Minimum alignment identity if the mate is mapped (0.0-1.0). Allows lower identity threshold for mate mapping.
- mm=1
- Middle mask length; the index uses gapped kmers. Sets both mm1 and mm2 to this value.
- mm1=1
- Middle mask length for k1 kmers. Individual control over primary kmer masking.
- mm2=1
- Middle mask length for k2 kmers. Individual control over secondary kmer masking.
Additional Processing Parameters
- verbose=f
- Print verbose messages during processing. Useful for debugging alignment issues.
- ordered=f
- Output reads in the same order as input. Requires additional memory buffering.
- mappedonly=f
- Only output mapped reads. Unmapped reads are discarded instead of written to main output.
Java Parameters
- -Xmx
- This will set Java's memory usage, overriding autodetection. -Xmx20g will specify 20 gigs of RAM, and -Xmx200m will specify 200 megs. The max is typically 85% of physical memory.
- -eoom
- This flag will cause the process to exit if an out-of-memory exception occurs. Requires Java 8u92+.
- -da
- Disable assertions. May provide minor performance improvement.
Examples
Basic Alignment to PhiX
microalign.sh in=reads.fq out=aligned.sam ref=phix.fa
Align single-end reads to PhiX reference with default parameters.
Paired-End Alignment with Unmapped Output
microalign.sh in=reads_1.fq in2=reads_2.fq out=aligned.sam outu=unmapped.fq ref=phix.fa
Align paired-end reads, writing mapped reads to SAM format and unmapped reads to separate file.
Custom Identity Thresholds
microalign.sh in=reads.fq out=aligned.sam ref=control.fa minid=0.80 minid2=0.70
Use higher identity thresholds for more stringent alignment. Useful for detecting contamination.
Optimized for Very Short Reads
microalign.sh in=reads.fq out=aligned.sam ref=adapter.fa k=11 k2=9 minid=0.50
Use shorter kmers and lower identity threshold for aligning very short reads like adapters.
High Memory Mode with Statistics
microalign.sh in=reads.fq out=aligned.sam ref=phix.fa -Xmx8g verbose=t
Run with high memory allocation and verbose output for detailed alignment statistics.
Algorithm Details
MicroIndex3 Kmer Indexing Architecture
MicroAlign uses single-contig indexing through the MicroIndex3 class implementation:
- LongHashMap storage: Uses LongHashMap(ref.length*2) for kmer-to-position mapping with O(1) lookup performance
- Canonical kmer indexing: Stores Tools.max(kmer, rkmer) as key with strand encoding in value
- Position encoding: Values store position with MINUS_CODE offset (1,000,000,000) for reverse strand hits
- Memory scaling: Index size scales as 2x reference length for load factor optimization
- Gapped kmer masking: Middle mask removes central positions using makeMidMask() bitwise operations
Dual Kmer Strategy Implementation
MicroAlign uses two MicroIndex3 instances with different kmer lengths for complete mapping:
- Primary index (k1=17): Longer kmers provide specificity with index1=new MicroIndex3(k1, mm1, r)
- Secondary index (k2=13): Shorter kmers increase sensitivity with index2=new MicroIndex3(k2, mm2, r)
- Mate rescue logic: Secondary mapper only runs when one mate maps but other fails
- Conditional creation: Secondary index created only if k2 >= 1, otherwise null for single-end mode
MicroAligner3 Alignment Process
Two-stage alignment strategy combining quick alignment with full dynamic programming:
- Quick alignment: First attempts simple base-by-base comparison with quickAlign() method
- Mismatch threshold: Quick mode fails if >3 substitutions or matches < 25% of read length
- Full alignment fallback: Uses SingleStateAlignerFlat2 for gapped alignment with padding
- Identity calculation: Precise identity from match string using Read.identity(match) method
- Coordinate mapping: Sets r.start, r.stop, r.chrom=1 for successful alignments
Gapped Kmer Middle Masking
Bit-field masking implementation for error-tolerant kmer matching:
- Mask calculation: makeMidMask() creates bit pattern masking central positions
- Bit manipulation: Uses ~((~((-1L)<<bits))<<shift) for precise position masking
- Error tolerance: mm=1 allows matching across single-base differences in central position
- Independent control: Separate mm1/mm2 values for different kmer lengths
- Index integration: Applied during both indexing and query phases for consistency
Proper Pair Detection Logic
Precise criteria implementation for paired-end classification:
- Chromosome matching: boolean properPair requires r1.chrom==r2.chrom
- Strand orientation: Checks r1.strand()!=r2.strand() for proper pair geometry
- Distance constraint: Tools.absdif(r1.start, r2.start)<=1000 for insert size limit
- Flag propagation: Sets r1.setPaired(properPair) and r2.setPaired(properPair)
- SAM integration: Proper pair flags transferred to SAM output format
Identity-Based Filtering Implementation
Two-tier threshold system with mathematical precision:
- Primary threshold: minIdentity1=0.66f applied to standalone read alignments
- Mate threshold: minIdentity2=0.56f used when mate already mapped successfully
- Rescue strategy: Conditional logic id2=mapper2.map(r2) for mate recovery
- Identity calculation: Float precision using alignment statistics from match string
- Threshold enforcement: Reads below threshold marked as unmapped
Performance Characteristics
- Memory efficiency: LongHashMap with 2x reference scaling vs genome-wide indices
- Threading model: ProcessThread instances with shared index access and thread-local aligners
- Single-contig optimization: No chromosome lookup overhead, direct position mapping
- SIMD compatibility: SingleStateAlignerFlat2 supports vectorized operations where available
- Quick alignment bypass: Avoids DP computation for simple cases with <4 mismatches
Limitations and Design Constraints
- Single contig requirement: MicroIndex3 assumes single reference chromosome with chrom=1
- Duplicate kmer handling: HashMap overwrites duplicate positions, losing multi-mapping information
- Memory scaling: Index size proportional to reference length, unsuitable for large genomes
- Threading bottleneck: Shared index access could limit scalability with many threads
- Kmer length constraints: Maximum k=32 due to long integer bit limitations
Output Formats
SAM Output
Default output format includes standard SAM fields with MicroAlign-specific features:
- MAPQ scores: Based on alignment identity and uniqueness
- Proper pair flags: Set according to distance and orientation criteria
- Identity information: Encoded in alignment string and optional fields
Statistics Output
Standard statistics include:
- Aligned Reads: Count and percentage of successfully mapped reads
- Aligned Bases: Total bases in mapped reads
- Average Identity: Mean identity score across all alignments
Use Cases
Quality Control
- PhiX contamination: Detect Illumina PhiX spike-in contamination
- Adapter detection: Identify adapter sequences in reads
- Control alignment: Align to known control sequences
Small Reference Alignment
- Plasmid alignment: Map reads to plasmid sequences
- Viral genomes: Align to small viral reference genomes
- Amplicon analysis: Map PCR amplicon sequences
Performance Testing
- Pipeline validation: Quick alignment for pipeline testing
- Parameter optimization: Test alignment parameters before large runs
- Quality assessment: Rapid quality check of sequencing runs
Support
For questions and support:
- Email: bbushnell@lbl.gov
- Documentation: bbmap.org