KMutate

Script: kmutate.sh Package: jgi Class: KExpand.java

Given a reference, generates a kmer spectrum including a specified number of substitutions, insertions, and deletions. The output is useful for analyzing barcodes or other short oligos, and filtering using BBDuk or Seal. Input may be fasta or fastq, compressed or raw.

Basic Usage

kmutate.sh in=<file> out=<file> k=<kmer length> edist=<edit distance>

KMutate reads sequences from the input file and generates all possible kmers of the specified length, along with their mutant variants within the given edit distance or specific substitution/insertion/deletion constraints.

Examples

Basic Edit Distance Mode

kmutate.sh in=x.fa out=y.fa k=31 edist=2

Generates all 31-mers in x.fa, along with all 31-mer mutants with an edit distance of up to 2. For example, 1 substitution, or 1 substitution plus 1 deletion, or any other combination of substitutions, insertions, and deletions that sum to 0, 1, or 2.

SDI Mode (Substitution-Deletion-Insertion)

kmutate.sh in=x.fa out=y.fa k=31 idist=1 ddist=3

Generates all 31-mers in x.fa, along with all 31-mer mutants allowing up to 1 insertion and 3 deletions, but no substitutions. For example, 1 insertion and 3 deletions is possible (edit distance 4), but 1 deletion and 1 substitution is not directly possible (though some equivalent mutants would still be generated because a deletion and subsequent insertion is equivalent to a substitution).

Important Notes

Deletion Event Limitations: Deletion events have limitations; e.g., they cannot occur on input sequences of length k because the resulting sequence is shorter than k. As a result, running the program twice consecutively with edist=1 will not give the same results as running once with edist=2 since the intermediate file will be only k-length sequences. However, running consecutively with sdist or idist would be equivalent since they do not involve deletions.

Mode Selection: Use SDI mode flags (sdist, idist, ddist) OR Edit mode flag (edist), not both. Both modes are equivalent, they just have different defaults.

Parameters

Standard Parameters

in=<file>
Primary input, or read 1 input.
in2=<file>
Read 2 input if reads are in two files.
out=<file>
Primary output, or read 1 output.
overwrite=t
(ow) Set to false to force the program to abort rather than overwrite an existing file.

Processing Parameters

k=31
Kmer length; 1-31.
rcomp=t
Consider kmers equivalent to their reverse-complements.

Edit Mode Parameters (used if edist>0)

edist=0
Set the maximal edit distance (0-3). When set, allows any combination of substitutions, insertions, and deletions up to this total distance.
smax=99
(optional) Don't allow more than this many total substitutions.
dmax=99
(optional) Don't allow more than this many total deletions.
imax=99
(optional) Don't allow more than this many total insertions.

SDI Mode Parameters

sdist=0
Maximum substitutions allowed.
idist=0
Maximum insertions allowed.
ddist=0
Maximum deletions allowed (0-3).
emax=99
(optional) Don't allow more than this many total edits.

Java Parameters

-Xmx
This will set Java's memory usage, overriding autodetection. -Xmx20g will specify 20 gigs of RAM, and -Xmx200m will specify 200 megs. The max is typically 85% of physical memory.
-eoom
This flag will cause the process to exit if an out-of-memory exception occurs. Requires Java 8u92+.
-da
Disable assertions.

Algorithm Details

KMutate uses recursive edit operations implemented in the mutateE method to systematically generate mutant variants within specified edit distance constraints.

Core Recursive Mutation Strategy

The tool implements a recursive mutation algorithm in the mutateE method that generates all possible combinations of edits within the specified constraints:

LongListSet Storage Architecture

Generated kmers are stored in a LongListSet data structure that:

  • Automatically sorts and removes duplicates using shrinkToUnique() method
  • Provides space-efficient storage for large kmer sets using compressed long arrays
  • Supports iterator-based access via LongListSetIterator for output generation

Binary Kmer Encoding

The algorithm uses binary encoding for kmer manipulation based on alphabet type:

  • Nucleotides: 2 bits per base (bitsPerBase=2) with A=0, C=1, G=2, T=3 using symbolToNumber0 arrays
  • Amino Acids: 5 bits per residue (bitsPerBase=5) supporting 20 standard amino acids via AminoAcid.symbolToNumber()
  • Reverse Complement: When rcomp=t, uses canonical representation via Tools.max(kmer, rkmer) after computing reverse complement with AminoAcid.reverseComplementBinaryFast()

Bit Manipulation Implementation

The mutation process uses precomputed bit masks for kmer modification:

  • Clear Masks: clearMasks[i]=~(symbolMask<<(bitsPerBase*i)) zeros out position i for substitutions
  • Set Masks: setMasks[j][i]=(j<<(bitsPerBase*i)) sets position i to symbol j
  • Shift Operations: leftMasks[i]=((-1L)<<(bitsPerBase*i)) and rightMasks[i]=~((-1L)<<(bitsPerBase*i)) enable deletion and insertion operations

Context-Aware Deletions

Deletion operations require additional sequence context (extraBase, extraBase2, extraBase3) to maintain kmer length. The algorithm looks ahead in the sequence to gather these bases, which explains why deletion behavior differs when run consecutively versus in a single pass.

Hash-Based Performance Optimizations

  • Speed Parameter: Uses modular arithmetic filtering ((key&Long.MAX_VALUE)%17) to process only a fraction of kmers when speed > 0
  • Skip Parameter: Processes every nth kmer when len%skip==0 for skip > 1
  • Early Termination: Respects maximum edit constraints (maxSubs, maxDels, maxInss) to avoid excessive recursion depth

Performance Characteristics

  • Memory Usage: Default 4GB allocation via freeRam 4000m 84 in calcXmx(), scales with kmer count and edit distance
  • Time Complexity: Exponential in edit distance due to combinatorial explosion of mutations
  • Output Size: Can grow very large with high edit distances - use caution with edist > 2
  • Thread Safety: Single-threaded processing with buffer capping via Shared.capBuffers(4)

Related Tools

  • kcompress: For kmer compression and analysis
  • kmercountexact: For exact kmer counting
  • bbkmerset: For kmer set operations
  • BBDuk: Can use KMutate output for flexible sequence filtering
  • Seal: Can use KMutate output for sequence matching