KMutate

Script: kmutate.sh Package: jgi Class: KExpand.java

Given a reference, generates a kmer spectrum including a specified number of substitutions, insertions, and deletions. The output is useful for analyzing barcodes or other short oligos, and filtering using BBDuk or Seal. Input may be fasta or fastq, compressed or raw.

Basic Usage

kmutate.sh in=<file> out=<file> k=<kmer length> edist=<edit distance>

KMutate reads sequences from the input file and generates all possible kmers of the specified length, along with their mutant variants within the given edit distance or specific substitution/insertion/deletion constraints.

Examples

Basic Edit Distance Mode

kmutate.sh in=x.fa out=y.fa k=31 edist=2

Generates all 31-mers in x.fa, along with all 31-mer mutants with an edit distance of up to 2. For example, 1 substitution, or 1 substitution plus 1 deletion, or any other combination of substitutions, insertions, and deletions that sum to 0, 1, or 2.

SDI Mode (Substitution-Deletion-Insertion)

kmutate.sh in=x.fa out=y.fa k=31 idist=1 ddist=3

Generates all 31-mers in x.fa, along with all 31-mer mutants allowing up to 1 insertion and 3 deletions, but no substitutions. For example, 1 insertion and 3 deletions is possible (edit distance 4), but 1 deletion and 1 substitution is not directly possible (though some equivalent mutants would still be generated because a deletion and subsequent insertion is equivalent to a substitution).

Important Notes

Deletion Event Limitations: Deletion events have limitations; e.g., they cannot occur on input sequences of length k because the resulting sequence is shorter than k. As a result, running the program twice consecutively with edist=1 will not give the same results as running once with edist=2 since the intermediate file will be only k-length sequences. However, running consecutively with sdist or idist would be equivalent since they do not involve deletions.

Mode Selection: Use SDI mode flags (sdist, idist, ddist) OR Edit mode flag (edist), not both. Both modes are equivalent, they just have different defaults.

Parameters

Standard Parameters

in=<file>: Primary input, or read 1 input.
in2=<file>: Read 2 input if reads are in two files.
out=<file>: Primary output, or read 1 output.
overwrite=t: (ow) Set to false to force the program to abort rather than overwrite an existing file.

Processing Parameters

k=31: Kmer length; 1-31.
rcomp=t: Consider kmers equivalent to their reverse-complements.

Edit Mode Parameters (used if edist>0)

edist=0: Set the maximal edit distance (0-3). When set, allows any combination of substitutions, insertions, and deletions up to this total distance.
smax=99: (optional) Don't allow more than this many total substitutions.
dmax=99: (optional) Don't allow more than this many total deletions.
imax=99: (optional) Don't allow more than this many total insertions.

SDI Mode Parameters

sdist=0: Maximum substitutions allowed.
idist=0: Maximum insertions allowed.
ddist=0: Maximum deletions allowed (0-3).
emax=99: (optional) Don't allow more than this many total edits.

Java Parameters

-Xmx: This will set Java's memory usage, overriding autodetection. -Xmx20g will specify 20 gigs of RAM, and -Xmx200m will specify 200 megs. The max is typically 85% of physical memory.
-eoom: This flag will cause the process to exit if an out-of-memory exception occurs. Requires Java 8u92+.
-da: Disable assertions.

Algorithm Details

KMutate uses recursive edit operations implemented in the mutateE method to systematically generate mutant variants within specified edit distance constraints.

Core Recursive Mutation Strategy

The tool implements a recursive mutation algorithm in the mutateE method that generates all possible combinations of edits within the specified constraints:

Substitution Generation: For each position in the kmer, systematically substitutes each possible symbol using bit operations (kmer&clearMasks[i])|setMasks[j][i] where symbols=4 for nucleotides or symbols=21 for amino acids
Deletion Generation: Removes bases at each position using (kmer&leftMasks[i])|((kmer< and shifts remaining bases, requiring additional bases from sequence context


                Insertion Generation: Inserts each possible symbol at each position using (kmer&leftMasks[i])|((kmer&rightMasks[i])>>bitsPerBase), compressing the kmer by removing the rightmost base



            LongListSet Storage Architecture
            Generated kmers are stored in a LongListSet data structure that:
            
                Automatically sorts and removes duplicates using shrinkToUnique() method
                Provides space-efficient storage for large kmer sets using compressed long arrays
                Supports iterator-based access via LongListSetIterator for output generation
            

            Binary Kmer Encoding
            The algorithm uses binary encoding for kmer manipulation based on alphabet type:
            
                Nucleotides: 2 bits per base (bitsPerBase=2) with A=0, C=1, G=2, T=3 using symbolToNumber0 arrays
                Amino Acids: 5 bits per residue (bitsPerBase=5) supporting 20 standard amino acids via AminoAcid.symbolToNumber()
                Reverse Complement: When rcomp=t, uses canonical representation via Tools.max(kmer, rkmer) after computing reverse complement with AminoAcid.reverseComplementBinaryFast()
            

            Bit Manipulation Implementation
            The mutation process uses precomputed bit masks for kmer modification:
            
                Clear Masks: clearMasks[i]=~(symbolMask<<(bitsPerBase*i)) zeros out position i for substitutions
                Set Masks: setMasks[j][i]=(j<<(bitsPerBase*i)) sets position i to symbol j
                Shift Operations: leftMasks[i]=((-1L)<<(bitsPerBase*i)) and rightMasks[i]=~((-1L)<<(bitsPerBase*i)) enable deletion and insertion operations
            

            Context-Aware Deletions
            Deletion operations require additional sequence context (extraBase, extraBase2, extraBase3) to maintain kmer length. The algorithm looks ahead in the sequence to gather these bases, which explains why deletion behavior differs when run consecutively versus in a single pass.

            Hash-Based Performance Optimizations
            
                Speed Parameter: Uses modular arithmetic filtering ((key&Long.MAX_VALUE)%17) to process only a fraction of kmers when speed > 0

                Skip Parameter: Processes every nth kmer when len%skip==0 for skip > 1
                Early Termination: Respects maximum edit constraints (maxSubs, maxDels, maxInss) to avoid excessive recursion depth



        
        
            Performance Characteristics
            
                Memory Usage: Default 4GB allocation via freeRam 4000m 84 in calcXmx(), scales with kmer count and edit distance
                Time Complexity: Exponential in edit distance due to combinatorial explosion of mutations
                Output Size: Can grow very large with high edit distances - use caution with edist > 2
                Thread Safety: Single-threaded processing with buffer capping via Shared.capBuffers(4)
            
        

        
        
            Related Tools
            
                kcompress: For kmer compression and analysis
                kmercountexact: For exact kmer counting
                bbkmerset: For kmer set operations
                BBDuk: Can use KMutate output for flexible sequence filtering
                Seal: Can use KMutate output for sequence matching