IndelFree

Basic Usage

indelfree.sh in=spacers.fa ref=contigs.fa out=mapped.sam

IndelFree is specifically designed for scenarios where you have a small set of query sequences that need to be aligned against one or more large reference sequences without allowing insertions or deletions. This is particularly useful for applications like CRISPR spacer matching, primer design validation, or other exact sequence matching tasks where structural variations are not expected.

Parameters

Parameters are organized by their function in the alignment process. IndelFree provides both brute force and indexed alignment modes, with the indexed mode using k-mer seed matching for improved performance on large references.

Core Parameters

in=<file>: Query input file containing sequences to be aligned. These sequences will be loaded into memory for efficient processing. Supports FASTA and FASTQ formats, optionally compressed with gzip or bzip2. For paired-end queries, use the # symbol (e.g., reads#.fq) to specify both files.
ref=<file>: Reference input file containing sequences to align against. Reference sequences are streamed from disk to minimize memory usage, making this suitable for very large reference genomes. Supports FASTA and FASTQ formats, optionally compressed.
out=<file>: Output file path for SAM format alignment results. Contains all valid alignments with proper CIGAR strings, mapping quality scores, and optional tags including edit distance (NM tag). Use stdout for console output.
outh=<file>: Sam header output (optional). Due to the streaming nature, primary sam output is headerless, but this can be concatenated with the main sam file.
subs=5: Maximum allowed substitutions per alignment. This includes both true mismatches and alignments involving ambiguous nucleotides (N's). Higher values increase sensitivity but reduce specificity and performance. Set to 0 for exact matches only.
simd: Enable SIMD vectorization using AVX2/SSE instructions for diagonal alignment. This significantly accelerates brute force mode when query sequences are shorter than 256 bases. Only effective in brute force mode; indexed mode uses different optimization strategies.
threads=: Maximum number of worker threads for parallel processing. Default uses all available logical CPU cores. Each thread processes different reference sequences independently, with work-stealing for optimal load balancing. Memory usage scales approximately linearly with thread count.

Index Parameters

index=t: Enable k-mer indexing for seed-and-extend alignment strategy. When true, builds k-mer indices for both queries and references to identify potential alignment locations before performing full alignment. Generally faster for longer references but may miss some alignments depending on k-mer parameters.
k=13: K-mer length for seed matching (range: 1-15). Longer k-mers provide faster searching with reduced sensitivity, while shorter k-mers are more sensitive but slower. K-mers shorter than 6-8 may be slower than brute force mode. Automatically enables indexing when k > 0.
mm=1: Middle mask length specifying the number of wildcard positions in the center of each k-mer. This allows fuzzy matching to tolerate substitutions in the middle of k-mers. Must be less than k-1. Set to 0 to disable middle masking for exact k-mer matching only.
blacklist=2: Exclude homopolymer k-mers up to this repeat length from the index. Homopolymer k-mers (e.g., AAAA, TTTT) occur frequently and provide poor specificity for seed matching. Higher values remove more repetitive k-mers, improving search precision but potentially reducing sensitivity in repetitive regions.
step=1: Sampling interval for query k-mers. Only every Nth k-mer from query sequences will be used for seed matching. step=1 uses every k-mer (maximum sensitivity), step=2 uses every other k-mer, etc. Higher values reduce computational cost but may miss alignments with few matching k-mers.
minhits=1: Minimum number of seed hits required before attempting full alignment at a position. Higher values reduce false positive alignments and improve performance by filtering out unlikely alignment locations. Should be balanced with the expected number of matching k-mers for valid alignments.
minprob=0.9999: Probability threshold for calculating adaptive minimum hits based on query length and substitution tolerance. The algorithm calculates the expected number of matching k-mers for a valid alignment and sets the minimum hits to achieve this probability of detection. Set to 1.0 for maximum sensitivity, 0 to require all possible seed hits, or negative values to disable and use only the minhits parameter.
prescan=t: Perform fast pre-screening to count shared k-mers between query and reference before building full seed location lists. This optimization quickly identifies query-reference pairs with insufficient shared k-mers to meet the minimum hits threshold, avoiding expensive seed matching for impossible alignments.
list=t: Store seed hits in lists rather than hash maps. Lists are more memory-efficient and faster for most scenarios. Maps are optimized for cases with very short k-mers (k < 8) and high hit densities, but generally list-based storage provides better performance.

Java Parameters

-Xmx: Set Java heap memory limit, overriding automatic detection. Specify as -Xmx20g for 20 gigabytes or -Xmx200m for 200 megabytes. IndelFree loads all query sequences into memory, so memory requirements depend on query set size. The maximum recommended setting is typically 85% of available physical memory.
-eoom: Exit immediately if an out-of-memory exception occurs, rather than attempting recovery. This prevents corrupted output files and long delays when insufficient memory is available. Requires Java 8u92 or later versions.
-da: Disable Java assertions for slightly improved performance. Assertions perform additional validation checks that can help identify software bugs but add computational overhead. Generally safe to disable for production workflows.

Examples

Basic CRISPR Spacer Matching

indelfree.sh in=spacers.fa ref=viral_genomes.fa out=matches.sam subs=2

Align CRISPR spacers against viral genomes allowing up to 2 substitutions. This is useful for identifying potential target sites for CRISPR systems.

Exact Primer Validation

indelfree.sh in=primers.fa ref=genome.fa out=primer_sites.sam subs=0

Find exact matches for PCR primers in a reference genome. Setting subs=0 requires perfect matches with no mismatches.

High-Sensitivity Small Query Search

indelfree.sh in=motifs.fa ref=sequences.fa out=motif_hits.sam index=f simd threads=8

Use brute force mode with SIMD acceleration for comprehensive search of short motifs. Disabling indexing ensures all possible alignments are found.

Large-Scale Indexed Search

indelfree.sh in=queries.fa ref=large_genome.fa out=alignments.sam k=15 mm=0 minhits=3

High-performance search using longer exact k-mers (mm=0) with increased selectivity (minhits=3) for processing large reference genomes efficiently.

Fuzzy Matching with Middle Masking

indelfree.sh in=degraded_reads.fa ref=reference.fa out=fuzzy_matches.sam k=13 mm=2 subs=3

Allow substitutions in the middle of k-mers (mm=2) to handle degraded or error-prone sequences while still maintaining alignment specificity.

Algorithm Details

IndelFree implements a specialized alignment algorithm using two distinct operational modes based on the IndelFreeAligner.java implementation with producer-consumer pattern for reference sequence processing:

Brute Force Mode (index=false)

The alignAllPositions() method tests every possible alignment position systematically across the entire reference sequence using definitive position enumeration:

SIMDAlignByte.alignDiagonal() Integration: Uses SIMD vectorization through shared.SIMDAlignByte when Shared.SIMD is enabled and query.length<256 or maxSubs<256, calling SIMDAlignByte.alignDiagonal(query, ref, maxSubs, maxClips) for diagonal alignment processing
alignClipped() Method Implementation: Handles query overhangs using alignClipped(query, ref, maxSubs, maxClips, rStart) with leftClip=Math.max(0, -rStart) and rightClip=Math.max(0, rStop1-ref.length) calculations for boundary management
Early Termination via align() Method: Stops alignment calculation when subs>maxSubs using increment logic (q!=r || AminoAcid.baseToNumber[q]<0 ? 1 : 0) counting both mismatches and ambiguous nucleotides

Indexed Mode (index=true)

Implements a seed-and-extend strategy using k-mer indexing for improved performance on large references:

Query Preprocessing

Dual Data Structure Strategy: For each query, pre-computes forward and reverse complement k-mer arrays with rolling hash calculation
Middle Masking: Applies wildcard positions in k-mer centers to tolerate substitutions, using bitwise operations for efficient mask application
Blacklist Filtering: Removes homopolymer k-mers and other repetitive sequences to improve search specificity
Probabilistic Hit Calculation: Calculates expected number of matching k-mers for valid alignments based on query length, k-mer parameters, and substitution tolerance

Reference Processing

Streaming Architecture: Processes reference sequences in batches to minimize memory usage, suitable for genomes larger than available RAM
Dynamic Index Construction: Builds k-mer indices on-the-fly for each reference batch using hash tables optimized for the expected k-mer space
Bidirectional Indexing: Stores both forward and reverse complement k-mers, avoiding duplicate entries for palindromic sequences

Seed Matching and Extension

Adaptive Hit Counting: Uses either list-based or map-based seed hit accumulation depending on expected hit density and k-mer length
Position Filtering: Only performs full alignment at positions with sufficient seed hits, dramatically reducing computational cost
Prescan Optimization: Optional fast pre-screening counts shared k-mers without building full seed lists, quickly eliminating impossible alignments

Memory Architecture

IndelFree uses an inverted memory model compared to traditional aligners:

Query-in-Memory Design: All query sequences are loaded into memory simultaneously, enabling efficient random access during alignment
Reference Streaming: Reference sequences are processed in batches and discarded, supporting arbitrarily large reference sets
Thread-Local Processing: Each worker thread maintains private statistics and processing state, minimizing synchronization overhead
Batch Size Optimization: Reduces default buffer sizes to improve parallelization efficiency when processing many small reference sequences

Output Generation

SAM Compliance: Generates complete SAM format output with proper CIGAR strings, mapping quality scores, and optional tags
Coordinate Conversion: Converts internal 0-based coordinates to 1-based SAM format standards
Edit Distance Calculation: Includes NM tag with accurate edit distance for each alignment
Primary/Secondary Flags: Sets appropriate SAM flags with the first hit marked as primary and subsequent hits as secondary alignments

Performance Characteristics

Time Complexity: O(Q × R × L) for brute force mode, O(Q × R × H × L) for indexed mode where Q=query count, R=reference length, L=query length, H=average hits per position
Memory Usage: Approximately Q × L bytes for queries plus temporary index structures, independent of reference size
Scalability: Linear scaling with thread count for both CPU and memory usage, optimal for high-core systems

Support

For questions and support:

Email: bbushnell@lbl.gov
Documentation: bbmap.org