CompareSketch

Script: comparesketch.sh Package: sketch Class: CompareSketch.java

Compares query sketches to reference sketches using MinHash k-mer matching. Designed to be extremely fast (potentially thousands of times faster than BLAST) with very low memory footprint, while providing approximate sequence similarity results without alignments. Supports both nucleotide and amino acid sequences with taxonomic classification capabilities.

Overview and Theory

CompareSketch uses MinHash to rapidly compare large sequences by creating fixed-size "sketches" containing the smallest hash values from all k-mers in a sequence. This approach achieves O(1) comparison time regardless of sequence size - whether comparing 30kbp queries to 300kbp genomes or 2Gbp queries to 200Mbp genomes, comparison time depends only on the number of related organisms, not database size.

MinHash Sketch Creation Process

  1. K-mer Generation: Extract all overlapping k-mers (typically k=31) from input sequence
  2. Hash Function: Apply hash function to convert each k-mer to a numerical hash code
  3. Sketch Selection: Keep the N smallest hash codes (typically 10,000) as the "sketch"

The resulting k-mer identity strongly correlates with sequence identity. For two sequences with specific average nucleotide identity (ANI), the probability of sharing a k-mer is ANI^k, making shorter k-mers more sensitive for distant relationships while longer k-mers provide better specificity.

Performance Advantages

Traditional alignment requires O(N*M) operations for sequences of length N and M. MinHash comparison requires only O(S) operations where S is the fixed sketch size. Since sketches are indexed by k-mer content, the algorithm only compares against related sketches, achieving expected O(1) time per comparison.

Basic Usage

comparesketch.sh in=<file,file,file...> ref=<file,file,file...>
comparesketch.sh in=<file,file,file...> file file file
comparesketch.sh in=<file,file,file...> *.sketch
comparesketch.sh alltoall *.sketch.gz

Typical Workflow

CompareSketch works best when references are pre-sketched using sketch.sh. The typical pattern is:

  1. Sketch references: sketch.sh in=refseq.fa.gz out=refseq#.sketch files=31
  2. Compare queries: comparesketch.sh in=assembly.fa refseq*.sketch

Parameters

Parameters control sketch generation, comparison algorithms, and output formatting. CompareSketch accepts pre-computed .sketch files for direct comparison or processes raw FASTA/FASTQ through on-the-fly sketching with configurable k-mer parameters.

File parameters

in=<file,file...>
Sketches or fasta files to compare.
out=stdout
Comparison output. Can be set to a file instead.
outsketch=<file>
Optionally write sketch files generated from the input.
ref=<file,file...>
List of sketches to compare against. Files given without a prefix (ref=) will be treated as references, so you can use *.sketch without ref=. You can also do ref=nt#.sketch to load all numbered files fitting that pattern. On NERSC, you can use these abbreviations (e.g. ref=nt): nt: nt sketches, refseq: Refseq sketches, silva: Silva sketches, img: IMG sketches, mito: RefSeq mitochondrial sketches, fungi: RefSeq fungi sketches, protein: RefSeq prokaryotic amino acid sketches. Using an abbreviation automatically sets the blacklist and k. If the reference is in amino space, the query also be in amino acid space with the flag amino added. If the query is in nucleotide space, use the flag 'translate', but this will only work for prokaryotes.

Blacklist and Whitelist parameters

blacklist=<file>
Ignore keys in this sketch file. Additionally, there are built-in blacklists that can be specified: nt: Blacklist for nt, refseq: Blacklist for Refseq, silva: Blacklist for Silva, img: Blacklist for IMG. Blacklists exclude low-complexity or highly conserved k-mers that would inflate apparent relatedness between unrelated organisms.
whitelist=f
Ignore keys that are not in the index. Requires index=t. Useful for specific comparisons like genome vs ribosomal database where most query k-mers are irrelevant.

Sketch-making parameters

mode=perfile
Possible modes, for sequence input: single: Generate one sketch, sequence: Generate one sketch per sequence, perfile: Generate one sketch per file.
sketchonly=f
Don't run comparisons, just write the output sketch file.
k=31
Kmer length, 1-32. To maximize sensitivity and specificity, dual kmer lengths may be used: k=31,24. Dual kmers are fastest if the shorter is a multiple of 4. Query and reference k must match. Shorter k-mers increase sensitivity for distant relationships while longer k-mers improve specificity.
samplerate=1
Set to a lower value to sample a fraction of input reads. For raw reads (rather than an assembly), 1-3x coverage gives best results, by reducing error kmers. Somewhat higher is better for high-error-rate data like PacBio.
minkeycount=1
Ignore kmers that occur fewer times than this. Values over 1 can be used with raw reads to avoid error kmers. Recommended minkeycount=2 for Illumina reads to increase both sensitivity and specificity.
minprob=0.0001
Ignore kmers below this probability of correctness. Default is extremely low to prevent raw Nanopore and PacBio data from being 100% discarded. For Illumina, minprob=0.2 can substantially improve ANI estimates.
minqual=0
Ignore kmers spanning bases below this quality.
entropy=0.66
Ignore sequence with entropy below this value. Removes low-complexity sequence like poly-A runs that are uninformative for comparison. Higher values (0.8) eliminate virtually all uninformative matches between unrelated organisms.
merge=f
Merge paired reads prior to sketching. Improves quality, eliminates adapter sequence, and combines redundant k-mers from overlapping regions, improving depth estimation.
amino=f
Use amino acid mode. Input should be amino acids.
translate=f
Call genes and translate to proteins. Input should be nucleotides. Designed for prokaryotes.
sixframes=f
Translate all 6 frames instead of predicting genes.
ssu=t
Scan for and retain full-length SSU sequence.
printssusequence=f
Print the query SSU sequence (JSON mode only).

Size parameters

size=10000
Desired size of sketches (if not using autosize).
mgf=0.01
(maxfraction) Max fraction of genomic kmers to use. Default 0.01 means sketch length is at most 1% of sequence length.
minsize=100
Do not generate sketches for genomes smaller than this.
autosize=t
Use flexible sizing instead of fixed-length. This is nonlinear; a human sketch is only ~6x a bacterial sketch. Bacterial genomes get ~10,000 k-mers, ribosomal sequences get ~120, vertebrates get ~40,000.
sizemult=1
Multiply the autosized size of sketches by this factor. Normally a bacterial-size genome will get a sketch size of around 10000; if autosizefactor=2, it would be ~20000. To increase sensitivity, use same factor for both reference and query sketches.
density=
If this flag is set (to a number between 0 and 1), autosize and sizemult are ignored, and this fraction of genomic kmers are used. For example, at density=0.001, a 4.5Mbp bacteria will get a 4500-kmer sketch.
sketchheapfactor=4
If minkeycount>1, temporarily track this many kmers until counts are known and low-count kmers are discarded.

Sketch comparing parameters

threads=auto
Use this many threads for comparison.
index=auto
Index the sketches for much faster searching. Requires more memory and adds startup time. Recommended true for many query sketches, false for few. Auto-indexing triggers when more than 8 input sketches or contamination analysis is needed.
prealloc=f
Preallocate the index for greater efficiency. Can be set to a number between 0 and 1 to determine how much of total memory should be used.
alltoall
(ata) Compare all refs to all. Must be sketches. Creates symmetric comparison matrix for phylogenetic analysis.
compareself=f
In all-to-all mode, compare a sketch to itself.

Taxonomy-related parameters

tree=<file>
Specify a TaxTree file. On Genepool, use tree=auto. Only necessary for use with printtaxa and level. Assumes comparisons are done against reference sketches with known taxonomy information.
level=2
Only report the best record per taxa at this level. Either level names or numbers may be used. 0: disabled, 1: subspecies, 2: species, 3: genus, etc. Prevents redundant reporting of closely related organisms.
include=
Restrict output to organisms in these clades. May be a comma-delimited list of names or NCBI TaxIDs.
includelevel=0
Promote the include list to this taxonomic level. For example, include=h.sapiens includelevel=phylum would only include organisms in the same phylum as human.
includestring=
Only report records whose name contains this string.
exclude=
Ignore organisms in these clades. May be a comma-delimited list of names or NCBI TaxIDs.
excludelevel=0
Promote the exclude list to this taxonomic level. For example, exclude=h.sapiens excludelevel=phylum would exclude all organisms in the same phylum as human.
excludestring=
Do not records whose name contains this string.
minlevel=
Use this to restrict comparisons to distantly-related organisms. Intended for finding misclassified organisms using all-to-all comparisons. minlevel=order would only report hits between organisms related at the order level or higher, not between same species or genus.
banunclassified=f
Ignore organisms descending from nodes like 'unclassified Bacteria'
banvirus=f
Ignore viruses.
requiressu=f
Ignore records without SSUs.
minrefsize=0
Ignore ref sketches smaller than this (unique kmers).
minrefsizebases=0
Ignore ref sketches smaller than this (total base pairs).

Output format parameters

format=2
2: Default format with, per query, one query header line; one column header line; and one reference line per hit. 3: One line per hit, with columns query, reference, ANI, and sizeRatio. Useful for all-to-all comparisons. 4: JSON (format=json also works). 5: Constellation (format=constellation also works).
usetaxidname=f
For format 3, print the taxID in the name column.
usetaxname
for format 3, print the taxonomic name in the name column.
useimgname
For format 3, print the img ID in the name column.

Output column parameters (for format=2)

printall=f
Enable all output columns.
printani=t
(ani) Print average nucleotide identity estimate.
completeness=t
Genome completeness estimate.
score=f
Score (used for sorting the output).
printmatches=t
Number of kmer matches to reference.
printlength=f
Number of kmers compared.
printtaxid=t
NCBI taxID.
printimg=f
IMG identifier (only for IMG data).
printgbases=f
Number of genomic bases.
printgkmers=f
Number of genomic kmers.
printgsize=t
Estimated number of unique genomic kmers.
printgseqs=t
Number of sequences (scaffolds/reads).
printtaxname=t
Name associated with this taxID.
printname0=f
(pn0) Original seqeuence name.
printfname=t
Query filename.
printtaxa=f
Full taxonomy of each record.
printcontam=t
Print contamination estimate, and factor contaminant kmers into calculations. Kmers are considered contaminant if present in some ref sketch but not the current one.
printunique=t
Number of matches unique to this reference.
printunique2=f
Number of matches unique to this reference's taxa.
printunique3=f
Number of query kmers unique to this reference's taxa, regardless of whether they are in this reference sketch.
printnohit=f
Number of kmers that don't hit anything.
printrefhits=f
Average number of ref sketches hit by shared kmers.
printgc=f
GC content.
printucontam=f
Contam hits that hit exactly one reference sketch.
printcontam2=f
Print contamination estimate using only kmer hits to unrelated taxa.
contamlevel=species
Taxonomic level to use for contam2/unique2/unique3.
printdepth=f
(depth) Print average depth of sketch kmers; intended for shotgun read input. Estimates genome coverage from k-mer counts.
printdepth2=f
(depth2) Print depth compensating for genomic repeats. Requires reference sketches to be generated with depth. Divides depth by average reference k-mer count to account for repeated genomic regions.
actualdepth=t
If this is false, the raw average count is printed. If true, the raw average (observed depth) is converted to estimated actual depth (including uncovered areas).
printvolume=f
(volume) Product of average depth and matches.
printca=f
Print common ancestor, if query taxID is known.
printcal=f
Print common ancestor tax level, if query taxID is known.
recordsperlevel=0
If query TaxID is known, and this is positive, print this many records per common ancestor level.

NOTE: unique2/unique3/contam2/refhits require an index.

Sorting parameters

sortbyscore=t
Default sort order is by score, a composite metric.
sortbydepth=f
Include depth as a factor in sort order.
sortbydepth2=f
Include depth2 as a factor in sort order.
sortbyvolume=f
Include volume as a factor in sort order.
sortbykid=f
Sort strictly by KID.
sortbyani=f
Sort strictly by ANI/AAI/WKID.
sortbyhits=f
Sort strictly by the number of kmer hits.

Other output parameters

minhits=3
(hits) Only report records with at least this many hits.
minani=0
(ani) Only report records with at least this ANI (0-1).
minwkid=0.0001
(wkid) Only report records with at least this WKID (0-1).
anifromwkid=t
Calculate ani from wkid. If false, use kid.
minbases=0
Ignore ref sketches of sequences shortert than this.
minsizeratio=0
Don't compare sketches if the smaller genome is less than this fraction of the size of the larger.
records=20
Report at most this many best-matching records.
color=family
Color records at the family level. color=f will disable. Colors work in most terminals but may cause odd characters to appear in text editors. So, color defaults to f if writing to a file. Requires the taxtree to be loaded.
intersect=f
Print sketch intersections. delta=f is suggested.

Metadata parameters (optional, for the query sketch header)

taxid=-1
Set the NCBI taxid.
imgid=-1
Set the IMG id.
spid=-1
Set the JGI sequencing project id.
name=
Set the name (taxname).
name0=
Set name0 (normally the first sequence header).
fname=
Set fname (normally the file name).
meta_=
Set an arbitrary metadata field. For example, meta_Month=March.

Other parameters

requiredmeta=
(rmeta) Required optional metadata values. For example: rmeta=subunit:ssu,source:silva
bannedmeta=
(bmeta) Forbidden optional metadata values.

Java Parameters

-Xmx
This will set Java's memory usage, overriding autodetection. -Xmx20g will specify 20 gigs of RAM, and -Xmx200m will specify 200 megs. The max is typically 85% of physical memory.
-eoom
This flag will cause the process to exit if an out-of-memory exception occurs. Requires Java 8u92+.
-da
Disable assertions.

Workflow Recommendations

Assemblies vs Raw Reads

CompareSketch works best with assembled genomes or transcriptomes due to higher quality. For raw reads, specific settings optimize performance:

Illumina/Ion Torrent Reads

comparesketch.sh in=reads.fq ref=database.sketch reads=1m samplerate=0.5 minkeycount=2

Samples 50% of first 1 million reads, discarding error k-mers with minkeycount=2. This provides sufficient coverage for bacteria while improving sensitivity and specificity.

PacBio/Nanopore Reads

comparesketch.sh in=longread.fq ref=database.sketch reads=100k size=200k

Uses more coverage and larger sketch size to compensate for higher error rates. Remove minkeycount filtering to retain low-frequency genuine k-mers.

Improving Quality with Merging

comparesketch.sh in=reads_R1.fq,reads_R2.fq ref=database.sketch merge=t minprob=0.2

Merging paired reads improves quality, removes adapters, and eliminates redundant k-mers from overlapping regions, improving ANI estimates and depth calculations.

Examples

Basic Assembly Classification

comparesketch.sh in=assembly.fa ref=refseq

Compare assembly against RefSeq database using built-in reference and blacklist. Automatically sets appropriate k-mer length and taxonomic parameters.

Metagenome Taxonomic Profiling

comparesketch.sh in=metagenome.fq ref=nt reads=400k samplerate=0.5 minkeycount=2 level=2

Process raw metagenomic reads against nt database with species-level reporting (level=2) and error k-mer filtering optimized for raw reads.

High-Sensitivity Comparison with Dual K-mers

comparesketch.sh in=distant_relative.fa ref=database.sketch k=31,24 sizemult=2

Use dual k-mer lengths for sensitivity (k=24) and specificity (k=31) with 2x larger sketches. Ideal for detecting distant evolutionary relationships.

All-to-All Phylogenetic Analysis

comparesketch.sh alltoall *.sketch format=3 minani=0.75 compareself=f

Compare all sketches to each other in tab-delimited format, excluding self-comparisons and requiring minimum 75% ANI. Useful for phylogenetic tree construction.

Contamination Detection

comparesketch.sh in=sample.fa ref=database.sketch printcontam=t printunique=t index=t

Analyze contamination by tracking k-mers present in multiple references. Index required for detailed contamination metrics.

Protein Sequence Comparison

comparesketch.sh in=proteins.faa ref=protein amino=t k=15

Compare amino acid sequences using k=15 amino acid k-mers against protein sketch database.

Coverage Analysis from Reads

comparesketch.sh in=reads.fq ref=reference.sketch printdepth=t printdepth2=t merge=t

Estimate genome coverage from shotgun reads using k-mer depth analysis. Merge paired reads to improve depth calculations.

Understanding Output Columns

CompareSketch produces detailed similarity metrics across multiple dimensions. Understanding these columns is crucial for interpreting results:

Core Identity Metrics

KID (Kmer Identity)
Fraction of shared k-mers (matches/length). Raw k-mer similarity before size correction.
WKID (Weighted Kmer Identity)
K-mer identity compensated for size differences. Human chr1 vs full human genome would yield 100% WKID but ~10% KID.
ANI (Average Nucleotide Identity)
Estimated sequence identity derived from WKID and k-mer length. Correlates with traditional alignment-based identity.
Complt (Completeness)
Percentage of reference represented in query, derived from WKID and KID ratios.
Contam (Contamination)
Percentage of query that matches other references but not this one, indicating potential contamination.

Depth and Coverage Metrics

Depth
Estimated genomic k-mer coverage from read data, converted from observed to actual depth using coverage formula.
Depth2
Repeat-compensated depth dividing by average reference k-mer count to account for genomic repeats.
Volume
Sum of query counts of shared k-mers divided by 1000, representing total reference k-mer observations in query.

Accuracy Considerations

ANI, completeness, and contamination metrics are interdependent and most accurate when only one assumption is violated (e.g., low identity but complete genome with no contamination). Multiple violations (incomplete contaminated assembly with low identity) reduce accuracy, though results remain useful approximations.

Algorithm Details

CompareSketch implements MinHash-based sketch comparison using atomic threading controls and configurable comparison buffers for genomic sequence similarity analysis.

Core Implementation

The tool uses sketch.CompareSketch.main() entry point with specialized comparison infrastructure:

Threading and Memory Management

CompareSketch uses specialized threading architecture for concurrent processing:

Comparison Processing Modes

The comparison logic branches based on input configuration:

Contamination Analysis Implementation

Contamination detection uses k-mer intersection analysis with taxonomic context:

Taxonomic Processing Integration

TaxTree integration provides phylogenetically-informed comparison filtering:

Performance and Memory Usage

CompareSketch is designed for scalability with large reference databases:

Memory Requirements

Performance Characteristics

Support

For questions and support: