InvertKey
Inverts a sketch key, given a matching reference. This tool takes a reference sequence and a sketch key (or set of keys) and finds the k-mer sequences that generated those hash values.
Basic Usage
invertkey.sh in=<reference> key=<key> k=<31>
The reference file must contain the sequences that the sketch keys were generated from. The k-mer length must match the k used to generate the original sketch.
Parameters
Parameters are organized by their function in the key inversion process. All parameters from the shell script usage are documented below.
I/O parameters
- in=<file>
- Input reference file containing sequences to search for k-mers. Required parameter.
- key=<key>
- Sketch key or comma-separated list of keys to invert. Can also be a .sketch file. Required parameter.
- k=<31>
- K-mer length for matching. Must match the k value used to generate the original sketch. Default: 31
- out=<file>
- Output file for inverted k-mers. Default: stdout.fa
- overwrite=f
- (ow) Set to false to force the program to abort rather than overwrite an existing file.
Processing parameters
- verbose=f
- Enable verbose output for debugging and monitoring progress.
- printonce=t
- When true, stops after finding the first occurrence of each key. When false, reports all occurrences.
Java Parameters
- -Xmx
- This will set Java's memory usage, overriding autodetection. -Xmx20g will specify 20 gigs of RAM, and -Xmx200m will specify 200 megs. The max is typically 85% of physical memory.
- -eoom
- This flag will cause the process to exit if an out-of-memory exception occurs. Requires Java 8u92+.
- -da
- Disable assertions.
Examples
Basic Key Inversion
invertkey.sh in=reference.fasta key=A12B34C56D78 k=31
Inverts a single sketch key using a reference FASTA file with k=31.
Multiple Key Inversion
invertkey.sh in=reference.fasta key=A12B34C56D78,X98Y76Z54W32 k=31 out=inverted.fasta
Inverts multiple comma-separated sketch keys and outputs results to a file.
From Sketch File
invertkey.sh in=reference.fasta key=sample.sketch k=31 printonce=f
Inverts all keys from a sketch file and reports all occurrences (not just the first).
Algorithm Details
InvertKey performs reverse lookup of sketch hash values to recover the original k-mer sequences through a linear scanning process implemented in the invert()
method.
Key Processing Strategy
- Single Key Mode: Converts string to hash using
Sketch.parseA48(keyString)
, then transforms toLong.MAX_VALUE-parsed_value
for consistent hash representation - Multiple Key Mode: Splits comma-separated keys, initializes
LongHashSet
withsplit.length*2
capacity, applies same transformation to each key - Sketch File Mode: Uses
SketchTool.loadSketchesFromFile()
to load sketch, extracts all keys fromsk.keys
array, transforms each viaLong.MAX_VALUE-x
K-mer Scanning Process
- Forward and Reverse Complementarity: For each position, computes both forward k-mer and reverse complement k-mer
- Rolling Hash: Uses efficient bit-shifting operations to compute k-mers incrementally rather than re-computing from scratch
- Hash Computation: Applies the same hash function used in sketch generation to compare against target keys
- Canonical K-mers: Uses the lexicographically smaller of forward and reverse complement for matching
Output Formatting
- FASTA Format: Outputs k-mer sequences in FASTA format with headers containing the original sketch key and position
- Position Information: Each match includes the position where the k-mer was found in the reference
- Early Termination: When printonce=true, stops after finding the first occurrence of each key for efficiency
Performance Characteristics
- Memory Usage: Minimal memory overhead - only stores the target keys in a hash set
- Time Complexity: Linear scan through reference sequences, O(n) where n is total reference length
- Hash Lookup: Constant-time hash lookups using LongHashSet for efficient key matching
- Scalability: Can handle large reference files and many target keys simultaneously
Technical Implementation
The k-mer rolling hash maintains 64-bit representations for both forward (kmer
) and reverse complement (rkmer
) sequences. Base encoding uses AminoAcid.baseToNumber
arrays for 2-bit per base compression. Invalid bases (N, ambiguous) reset length counter and clear reverse k-mer state. The hash function applies the same transformation used in sketch generation (SketchObject.hash()
) ensuring identical hash values for matching k-mers. The algorithm supports k-mer lengths up to 32 bases due to 64-bit integer constraints.
Support
For questions and support:
- Email: bbushnell@lbl.gov
- Documentation: bbmap.org