IndelFree

Script: indelfree.sh Package: aligner Class: IndelFreeAligner.java

Aligns sequences, not allowing indels. Brute force mode guarantees all alignments will be found and reported, up to the maximum allowed number of substitutions. Indexed mode may remove this guarantee (depending on kmer length, query length, and number of substitutions) but can be much faster. This loads all reads into memory and streams the reference, unlike a traditional aligner, so it is designed for a relatively small query set and potentially enormous reference set.

Basic Usage

indelfree.sh in=spacers.fa ref=contigs.fa out=mapped.sam

IndelFree is specifically designed for scenarios where you have a small set of query sequences that need to be aligned against one or more large reference sequences without allowing insertions or deletions. This is particularly useful for applications like CRISPR spacer matching, primer design validation, or other exact sequence matching tasks where structural variations are not expected.

Parameters

Parameters are organized by their function in the alignment process. IndelFree provides both brute force and indexed alignment modes, with the indexed mode using k-mer seed matching for improved performance on large references.

Core Parameters

in=<file>
Query input file containing sequences to be aligned. These sequences will be loaded into memory for efficient processing. Supports FASTA and FASTQ formats, optionally compressed with gzip or bzip2. For paired-end queries, use the # symbol (e.g., reads#.fq) to specify both files.
ref=<file>
Reference input file containing sequences to align against. Reference sequences are streamed from disk to minimize memory usage, making this suitable for very large reference genomes. Supports FASTA and FASTQ formats, optionally compressed.
out=<file>
Output file path for SAM format alignment results. Contains all valid alignments with proper CIGAR strings, mapping quality scores, and optional tags including edit distance (NM tag). Use stdout for console output.
subs=5
Maximum allowed substitutions per alignment. This includes both true mismatches and alignments involving ambiguous nucleotides (N's). Higher values increase sensitivity but reduce specificity and performance. Set to 0 for exact matches only.
simd
Enable SIMD vectorization using AVX2/SSE instructions for diagonal alignment. This significantly accelerates brute force mode when query sequences are shorter than 256 bases. Only effective in brute force mode; indexed mode uses different optimization strategies.
threads=
Maximum number of worker threads for parallel processing. Default uses all available logical CPU cores. Each thread processes different reference sequences independently, with work-stealing for optimal load balancing. Memory usage scales approximately linearly with thread count.

Index Parameters

index=t
Enable k-mer indexing for seed-and-extend alignment strategy. When true, builds k-mer indices for both queries and references to identify potential alignment locations before performing full alignment. Generally faster for longer references but may miss some alignments depending on k-mer parameters.
k=13
K-mer length for seed matching (range: 1-15). Longer k-mers provide faster searching with reduced sensitivity, while shorter k-mers are more sensitive but slower. K-mers shorter than 6-8 may be slower than brute force mode. Automatically enables indexing when k > 0.
mm=1
Middle mask length specifying the number of wildcard positions in the center of each k-mer. This allows fuzzy matching to tolerate substitutions in the middle of k-mers. Must be less than k-1. Set to 0 to disable middle masking for exact k-mer matching only.
blacklist=2
Exclude homopolymer k-mers up to this repeat length from the index. Homopolymer k-mers (e.g., AAAA, TTTT) occur frequently and provide poor specificity for seed matching. Higher values remove more repetitive k-mers, improving search precision but potentially reducing sensitivity in repetitive regions.
step=1
Sampling interval for query k-mers. Only every Nth k-mer from query sequences will be used for seed matching. step=1 uses every k-mer (maximum sensitivity), step=2 uses every other k-mer, etc. Higher values reduce computational cost but may miss alignments with few matching k-mers.
minhits=1
Minimum number of seed hits required before attempting full alignment at a position. Higher values reduce false positive alignments and improve performance by filtering out unlikely alignment locations. Should be balanced with the expected number of matching k-mers for valid alignments.
minprob=0.9999
Probability threshold for calculating adaptive minimum hits based on query length and substitution tolerance. The algorithm calculates the expected number of matching k-mers for a valid alignment and sets the minimum hits to achieve this probability of detection. Set to 1.0 for maximum sensitivity, 0 to require all possible seed hits, or negative values to disable and use only the minhits parameter.
prescan=t
Perform fast pre-screening to count shared k-mers between query and reference before building full seed location lists. This optimization quickly identifies query-reference pairs with insufficient shared k-mers to meet the minimum hits threshold, avoiding expensive seed matching for impossible alignments.
list=t
Store seed hits in lists rather than hash maps. Lists are more memory-efficient and faster for most scenarios. Maps are optimized for cases with very short k-mers (k < 8) and high hit densities, but generally list-based storage provides better performance.

Java Parameters

-Xmx
Set Java heap memory limit, overriding automatic detection. Specify as -Xmx20g for 20 gigabytes or -Xmx200m for 200 megabytes. IndelFree loads all query sequences into memory, so memory requirements depend on query set size. The maximum recommended setting is typically 85% of available physical memory.
-eoom
Exit immediately if an out-of-memory exception occurs, rather than attempting recovery. This prevents corrupted output files and long delays when insufficient memory is available. Requires Java 8u92 or later versions.
-da
Disable Java assertions for slightly improved performance. Assertions perform additional validation checks that can help identify software bugs but add computational overhead. Generally safe to disable for production workflows.

Examples

Basic CRISPR Spacer Matching

indelfree.sh in=spacers.fa ref=viral_genomes.fa out=matches.sam subs=2

Align CRISPR spacers against viral genomes allowing up to 2 substitutions. This is useful for identifying potential target sites for CRISPR systems.

Exact Primer Validation

indelfree.sh in=primers.fa ref=genome.fa out=primer_sites.sam subs=0

Find exact matches for PCR primers in a reference genome. Setting subs=0 requires perfect matches with no mismatches.

High-Sensitivity Small Query Search

indelfree.sh in=motifs.fa ref=sequences.fa out=motif_hits.sam index=f simd threads=8

Use brute force mode with SIMD acceleration for comprehensive search of short motifs. Disabling indexing ensures all possible alignments are found.

Large-Scale Indexed Search

indelfree.sh in=queries.fa ref=large_genome.fa out=alignments.sam k=15 mm=0 minhits=3

High-performance search using longer exact k-mers (mm=0) with increased selectivity (minhits=3) for processing large reference genomes efficiently.

Fuzzy Matching with Middle Masking

indelfree.sh in=degraded_reads.fa ref=reference.fa out=fuzzy_matches.sam k=13 mm=2 subs=3

Allow substitutions in the middle of k-mers (mm=2) to handle degraded or error-prone sequences while still maintaining alignment specificity.

Algorithm Details

IndelFree implements a specialized alignment algorithm using two distinct operational modes based on the IndelFreeAligner.java implementation with producer-consumer pattern for reference sequence processing:

Brute Force Mode (index=false)

The alignAllPositions() method tests every possible alignment position systematically across the entire reference sequence using definitive position enumeration:

Indexed Mode (index=true)

Implements a seed-and-extend strategy using k-mer indexing for improved performance on large references:

Query Preprocessing

Reference Processing

Seed Matching and Extension

Memory Architecture

IndelFree uses an inverted memory model compared to traditional aligners:

Output Generation

Performance Characteristics

Support

For questions and support: