KMutate
Given a reference, generates a kmer spectrum including a specified number of substitutions, insertions, and deletions. The output is useful for analyzing barcodes or other short oligos, and filtering using BBDuk or Seal. Input may be fasta or fastq, compressed or raw.
Basic Usage
kmutate.sh in=<file> out=<file> k=<kmer length> edist=<edit distance>
KMutate reads sequences from the input file and generates all possible kmers of the specified length, along with their mutant variants within the given edit distance or specific substitution/insertion/deletion constraints.
Examples
Basic Edit Distance Mode
kmutate.sh in=x.fa out=y.fa k=31 edist=2
Generates all 31-mers in x.fa, along with all 31-mer mutants with an edit distance of up to 2. For example, 1 substitution, or 1 substitution plus 1 deletion, or any other combination of substitutions, insertions, and deletions that sum to 0, 1, or 2.
SDI Mode (Substitution-Deletion-Insertion)
kmutate.sh in=x.fa out=y.fa k=31 idist=1 ddist=3
Generates all 31-mers in x.fa, along with all 31-mer mutants allowing up to 1 insertion and 3 deletions, but no substitutions. For example, 1 insertion and 3 deletions is possible (edit distance 4), but 1 deletion and 1 substitution is not directly possible (though some equivalent mutants would still be generated because a deletion and subsequent insertion is equivalent to a substitution).
Important Notes
Deletion Event Limitations: Deletion events have limitations; e.g., they cannot occur on input sequences of length k because the resulting sequence is shorter than k. As a result, running the program twice consecutively with edist=1 will not give the same results as running once with edist=2 since the intermediate file will be only k-length sequences. However, running consecutively with sdist or idist would be equivalent since they do not involve deletions.
Mode Selection: Use SDI mode flags (sdist, idist, ddist) OR Edit mode flag (edist), not both. Both modes are equivalent, they just have different defaults.
Parameters
Standard Parameters
- in=<file>
- Primary input, or read 1 input.
- in2=<file>
- Read 2 input if reads are in two files.
- out=<file>
- Primary output, or read 1 output.
- overwrite=t
- (ow) Set to false to force the program to abort rather than overwrite an existing file.
Processing Parameters
- k=31
- Kmer length; 1-31.
- rcomp=t
- Consider kmers equivalent to their reverse-complements.
Edit Mode Parameters (used if edist>0)
- edist=0
- Set the maximal edit distance (0-3). When set, allows any combination of substitutions, insertions, and deletions up to this total distance.
- smax=99
- (optional) Don't allow more than this many total substitutions.
- dmax=99
- (optional) Don't allow more than this many total deletions.
- imax=99
- (optional) Don't allow more than this many total insertions.
SDI Mode Parameters
- sdist=0
- Maximum substitutions allowed.
- idist=0
- Maximum insertions allowed.
- ddist=0
- Maximum deletions allowed (0-3).
- emax=99
- (optional) Don't allow more than this many total edits.
Java Parameters
- -Xmx
- This will set Java's memory usage, overriding autodetection. -Xmx20g will specify 20 gigs of RAM, and -Xmx200m will specify 200 megs. The max is typically 85% of physical memory.
- -eoom
- This flag will cause the process to exit if an out-of-memory exception occurs. Requires Java 8u92+.
- -da
- Disable assertions.
Algorithm Details
KMutate uses recursive edit operations implemented in the mutateE
method to systematically generate mutant variants within specified edit distance constraints.
Core Recursive Mutation Strategy
The tool implements a recursive mutation algorithm in the mutateE
method that generates all possible combinations of edits within the specified constraints:
- Substitution Generation: For each position in the kmer, systematically substitutes each possible symbol using bit operations
(kmer&clearMasks[i])|setMasks[j][i]
where symbols=4 for nucleotides or symbols=21 for amino acids - Deletion Generation: Removes bases at each position using
(kmer&leftMasks[i])|((kmer<
and shifts remaining bases, requiring additional bases from sequence context - Insertion Generation: Inserts each possible symbol at each position using
(kmer&leftMasks[i])|((kmer&rightMasks[i])>>bitsPerBase)
, compressing the kmer by removing the rightmost base
LongListSet Storage Architecture
Generated kmers are stored in a LongListSet
data structure that:
- Automatically sorts and removes duplicates using
shrinkToUnique()
method - Provides space-efficient storage for large kmer sets using compressed long arrays
- Supports iterator-based access via
LongListSetIterator
for output generation
Binary Kmer Encoding
The algorithm uses binary encoding for kmer manipulation based on alphabet type:
- Nucleotides: 2 bits per base (bitsPerBase=2) with A=0, C=1, G=2, T=3 using
symbolToNumber0
arrays - Amino Acids: 5 bits per residue (bitsPerBase=5) supporting 20 standard amino acids via
AminoAcid.symbolToNumber()
- Reverse Complement: When rcomp=t, uses canonical representation via
Tools.max(kmer, rkmer)
after computing reverse complement withAminoAcid.reverseComplementBinaryFast()
Bit Manipulation Implementation
The mutation process uses precomputed bit masks for kmer modification:
- Clear Masks:
clearMasks[i]=~(symbolMask<<(bitsPerBase*i))
zeros out position i for substitutions - Set Masks:
setMasks[j][i]=(j<<(bitsPerBase*i))
sets position i to symbol j - Shift Operations:
leftMasks[i]=((-1L)<<(bitsPerBase*i))
andrightMasks[i]=~((-1L)<<(bitsPerBase*i))
enable deletion and insertion operations
Context-Aware Deletions
Deletion operations require additional sequence context (extraBase, extraBase2, extraBase3) to maintain kmer length. The algorithm looks ahead in the sequence to gather these bases, which explains why deletion behavior differs when run consecutively versus in a single pass.
Hash-Based Performance Optimizations
- Speed Parameter: Uses modular arithmetic filtering
((key&Long.MAX_VALUE)%17)
to process only a fraction of kmers when speed > 0 - Skip Parameter: Processes every nth kmer when
len%skip==0
for skip > 1 - Early Termination: Respects maximum edit constraints (maxSubs, maxDels, maxInss) to avoid excessive recursion depth
Performance Characteristics
- Memory Usage: Default 4GB allocation via
freeRam 4000m 84
in calcXmx(), scales with kmer count and edit distance - Time Complexity: Exponential in edit distance due to combinatorial explosion of mutations
- Output Size: Can grow very large with high edit distances - use caution with edist > 2
- Thread Safety: Single-threaded processing with buffer capping via
Shared.capBuffers(4)
Related Tools
- kcompress: For kmer compression and analysis
- kmercountexact: For exact kmer counting
- bbkmerset: For kmer set operations
- BBDuk: Can use KMutate output for flexible sequence filtering
- Seal: Can use KMutate output for sequence matching