MicroAlign

Script: microalign.sh
Package: aligner
Class: MicroWrapper.java

Wrapper for MicroAligner. Aligns reads to a small, single-contig reference such as PhiX, using an index specialized for single-contig references. Produces most of the same histograms, such as idhist and mhist. Not currently designed for references with multiple sequences, or for references that contain duplicate kmers of the length used for indexing.
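
Because duplicate kmers of the indexing length are not supported, it can be worth checking a candidate reference before aligning. The following is a minimal sketch and not part of MicroAlign: count_duplicate_kmers is a hypothetical helper that counts the distinct kmers occurring more than once on the forward strand of a FASTA file. It defaults to k=17, the default main kmer length listed under Processing parameters. Whether the index also treats reverse-complement matches as duplicates is not stated here, so the sketch ignores them.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not shipped with MicroAlign).
# Usage: count_duplicate_kmers <reference.fa> [k]
# Prints the number of distinct forward-strand kmers that occur more than once.
# Assumes a single-contig FASTA; all sequence lines are concatenated.
count_duplicate_kmers() {
  local fasta=$1 k=${2:-17}
  awk -v k="$k" '
    !/^>/ { sub(/\r$/, ""); seq = seq toupper($0) }   # skip headers, join lines
    END {
      n = length(seq)
      for (i = 1; i + k - 1 <= n; i++) print substr(seq, i, k)
    }
  ' "$fasta" | sort | uniq -d | wc -l | tr -d ' '
}
```

A result of 0 means no forward-strand kmer of that length repeats in the reference.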

Basic Usage

microalign.sh in=<input file> out=<output file> ref=<reference>

Input may be fasta or fastq, compressed or uncompressed.

Parameters

Parameters are organized by their function in the alignment process. All parameters from the shell script are preserved in their original groupings.

Standard parameters

in=<file>
Primary input, or read 1 input. Can be fasta or fastq, compressed or uncompressed.
in2=<file>
Read 2 input if reads are in two files.
out=<file>
Primary output, or read 1 output. Aligned reads in SAM format by default.
out2=<file>
Read 2 output if reads are in two files.
outu=<file>
Optional unmapped read output. Reads that do not align to the reference.
outu2=<file>
Optional unmapped read 2 output. Paired reads that do not align to the reference.
ref=<file>
Reference sequence file. Should be a small, single-contig reference like PhiX. Required parameter.

Processing parameters

k=17
Main kmer length. Used for initial alignment. Also accepts k1 or kbig as aliases.
k2=13
Sub-kmer length for paired reads only. Used when attempting to map the mate of a mapped read. Also accepts ksmall as an alias.
minid=0.66
Minimum alignment identity (0.0-1.0). Reads with lower identity are considered unmapped. Also accepts minid1 as an alias.
minid2=0.56
Minimum alignment identity if the mate is mapped (0.0-1.0). Allows lower identity threshold for mate mapping.
mm=1
Middle mask length; the index uses gapped kmers. Sets both mm1 and mm2 to this value.
mm1=1
Middle mask length for k1 kmers. Individual control over primary kmer masking.
mm2=1
Middle mask length for k2 kmers. Individual control over secondary kmer masking.
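
To make the idea of a gapped kmer concrete, here is a string-level sketch. It is an illustration only: mask_middle is a hypothetical helper, and it assumes the masked window is centered in the kmer. The actual placement and encoding used by the index are not described in this section.

```shell
#!/usr/bin/env bash
# Hypothetical illustration of middle masking (not MicroAlign code).
# Usage: mask_middle <kmer> <mm>
# Replaces the central <mm> bases with N, so kmers that differ only
# inside the masked window become identical after masking.
# Assumption: the mask is centered in the kmer.
mask_middle() {
  local kmer=$1 mm=$2
  local k=${#kmer}
  local start=$(( (k - mm) / 2 ))
  local gap
  gap=$(printf '%*s' "$mm" '' | tr ' ' 'N')   # a run of mm N characters
  printf '%s%s%s\n' "${kmer:0:start}" "$gap" "${kmer:start+mm}"
}
```

With k=17 and mm=1, the kmers ACGTACGTACGTACGTA and ACGTACGTTCGTACGTA both become ACGTACGTNCGTACGTA, so a single mismatch at the central base would not prevent a match.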

Additional Processing Parameters

verbose=f
Print verbose messages during processing. Useful for debugging alignment issues.
ordered=f
Output reads in the same order as input. Requires additional memory buffering.
mappedonly=f
Only output mapped reads. Unmapped reads are discarded instead of written to main output.
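
When mappedonly is left at its default, the main SAM output can contain both mapped and unmapped records. A quick way to tally them is to test the unmapped bit (0x4) of the SAM FLAG field. The helper below is a sketch with a hypothetical name, count_mapped, and relies only on the SAM format itself, not on any MicroAlign-specific behavior.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not shipped with MicroAlign).
# Usage: count_mapped <file.sam>
# Counts records by the SAM FLAG unmapped bit (0x4); header lines (@...) are skipped.
count_mapped() {
  awk -F'\t' '
    !/^@/ {
      # int(flag / 4) % 2 extracts bit 0x4 without gawk-only bitwise functions
      if (int($2 / 4) % 2) unmapped++; else mapped++
    }
    END { printf "mapped=%d unmapped=%d\n", mapped, unmapped }
  ' "$1"
}
```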

Java Parameters

-Xmx
This will set Java's memory usage, overriding autodetection. -Xmx20g will specify 20 gigs of RAM, and -Xmx200m will specify 200 megs. The max is typically 85% of physical memory.
-eoom
This flag will cause the process to exit if an out-of-memory exception occurs. Requires Java 8u92+.
-da
Disable assertions. May provide minor performance improvement.
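
If autodetection needs to be overridden, the 85% guideline above can be turned into an explicit -Xmx value. This is a sketch only: xmx_flag is a hypothetical helper that takes total physical memory in kilobytes and prints a flag in megabytes.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not shipped with MicroAlign).
# Usage: xmx_flag <total_memory_kB>
# Prints an -Xmx flag set to 85% of the given memory, in megabytes.
# On Linux the input could come from:
#   awk '/^MemTotal:/ {print $2}' /proc/meminfo
xmx_flag() {
  local total_kb=$1
  local mb=$(( total_kb * 85 / 100 / 1024 ))
  printf -- '-Xmx%dm\n' "$mb"
}
```

For a machine with 16777216 kB (16 GiB) of memory this prints -Xmx13926m, which can be appended to the microalign.sh command line.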

Examples

Basic Alignment to PhiX

microalign.sh in=reads.fq out=aligned.sam ref=phix.fa

Align single-end reads to PhiX reference with default parameters.

Paired-End Alignment with Unmapped Output

microalign.sh in=reads_1.fq in2=reads_2.fq out=aligned.sam outu=unmapped.fq ref=phix.fa

Align paired-end reads, writing mapped reads to SAM format and unmapped reads to separate file.

Custom Identity Thresholds

microalign.sh in=reads.fq out=aligned.sam ref=control.fa minid=0.80 minid2=0.70

Use higher identity thresholds for more stringent alignment. Useful for detecting contamination.
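
The interaction between minid and minid2 can be sketched as a simple decision: a read is held to minid unless its mate is already mapped, in which case the looser minid2 applies. The function below is illustrative only. Its name, passes_identity, is hypothetical, and treating a read exactly at the threshold as mapped is an assumption rather than documented behavior.

```shell
#!/usr/bin/env bash
# Hypothetical illustration of the two-tier identity filter (not MicroAlign code).
# Usage: passes_identity <identity> <mate_mapped: t|f> [minid] [minid2]
# Returns 0 (success) if the identity meets the applicable threshold.
# Assumption: the comparison is inclusive (identity >= threshold).
passes_identity() {
  local identity=$1 mate_mapped=$2 minid=${3:-0.66} minid2=${4:-0.56}
  local threshold=$minid
  if [ "$mate_mapped" = t ]; then
    threshold=$minid2      # mate already mapped: use the lower threshold
  fi
  awk -v id="$identity" -v th="$threshold" 'BEGIN { exit !(id + 0 >= th + 0) }'
}
```

With the defaults, a read at 0.60 identity fails on its own (passes_identity 0.60 f) but passes when its mate is mapped (passes_identity 0.60 t). With the thresholds from the example above, the same checks would be passes_identity 0.75 f 0.80 0.70 (fails) and passes_identity 0.75 t 0.80 0.70 (passes).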

Optimized for Very Short Reads

microalign.sh in=reads.fq out=aligned.sam ref=adapter.fa k=11 k2=9 minid=0.50

Use shorter kmers and lower identity threshold for aligning very short reads like adapters.
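
The reason shorter kmers help here is arithmetic: a read of length L contains L-k+1 overlapping kmers, and none at all if L is less than k. The sketch below just evaluates that formula. The function name kmers_in_read is hypothetical, and whether the aligner looks up every kmer in a read is an assumption.

```shell
#!/usr/bin/env bash
# Hypothetical illustration (not MicroAlign code).
# Usage: kmers_in_read <read_length> <k>
# Prints the number of overlapping kmers of length k in a read: max(0, L - k + 1).
kmers_in_read() {
  local len=$1 k=$2
  local n=$(( len - k + 1 ))
  if [ "$n" -lt 0 ]; then
    n=0
  fi
  printf '%d\n' "$n"
}
```

A 20-base read has 4 kmers at the default k=17 but 10 at k=11, and a 12-base read has none at k=17.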

High Memory Mode with Statistics

microalign.sh in=reads.fq out=aligned.sam ref=phix.fa -Xmx8g verbose=t

Run with an 8 GB Java heap (-Xmx8g) and verbose messages enabled, which can help when debugging alignment issues.

Algorithm Details

MicroIndex3 Kmer Indexing Architecture

MicroAlign uses single-contig indexing through the MicroIndex3 class implementation:

Dual Kmer Strategy Implementation

MicroAlign uses two MicroIndex3 instances with different kmer lengths for complete mapping:

MicroAligner3 Alignment Process

Two-stage alignment strategy combining quick alignment with full dynamic programming:

Gapped Kmer Middle Masking

Bit-field masking implementation for error-tolerant kmer matching:

Proper Pair Detection Logic

Precise criteria implementation for paired-end classification:

Identity-Based Filtering Implementation

Two-tier threshold system with mathematical precision:

Performance Characteristics

Limitations and Design Constraints

Output Formats

SAM Output

Default output format includes standard SAM fields with MicroAlign-specific features:

Statistics Output

Standard statistics include:

Use Cases

Quality Control

Small Reference Alignment

Performance Testing

Support

For questions and support: