Sketch

Overview

Sketch creates compact hash-based representations of sequences using the MinHash algorithm. These sketches can be used for:

Taxonomic identification - Identify species from assemblies or raw reads
Reference database creation - Build searchable sketch databases from RefSeq, nt, or custom sequences
Contamination detection - Screen for contaminating organisms in sequencing data
Rapid similarity screening - Compare sequences without expensive alignments

Sketch is typically used as part of a larger workflow where sketches are created once and queried many times using CompareSketch or SendSketch.

Quick Start

Basic sketch creation

sketch.sh in=genome.fasta out=genome.sketch

Creates a single sketch from an assembled genome.

Reference database creation (per sequence)

sketch.sh in=refseq.fa.gz out=refseq#.sketch files=31 mode=sequence

Creates individual sketches for each sequence, distributing across 31 files for faster loading.

Taxonomic database creation

sketch.sh in=refseq.fa.gz out=refseq#.sketch files=31 mode=taxa \
  tree=tree.taxtree.gz gi=gitable.int1d.gz taxlevel=subspecies

Groups sequences by taxonomy, creating one sketch per taxonomic unit at subspecies level.

Parameters

Standard parameters

in=<file>: A fasta file containing one or more sequences.
out=<file>: Output filename. If multiple files are desired it must contain the # symbol.
blacklist=<file>: Ignore keys in this sketch file. Additionally, there are built-in blacklists that can be specified:
nt: Blacklist for nt
refseq: Blacklist for Refseq
silva: Blacklist for Silva
img: Blacklist for IMG
files=1: Number of output sketch files to produce, for parallel loading. Independent of the number of sketches produced; sketches will be randomly distributed between files.
k=32,24: Kmer length, 1-32. To maximize sensitivity and specificity, dual kmer lengths may be used, e.g. k=32,24. Query and reference k must match.
rcomp=t: Look at reverse-complement kmers also.
amino=f: Use amino acid mode. Input should be amino acids.
translate=f: Call genes and translate to proteins. Input should be nucleotides. Designed for prokaryotes.
mode=single: Possible modes:
single: Write one sketch.
sequence: Write one sketch per sequence.
taxa: Write one sketch per taxonomic unit. Requires more memory, and taxonomic annotation.
img: Write one sketch per IMG id.
delta=t: Delta-compress sketches.
a48=t: Encode sketches as ASCII-48 rather than hex.
depth=f: Track the number of times kmers appear. Required for the depth2 field in comparisons.
entropy=0.66: Ignore sequence with entropy below this value.
ssu=t: Scan for and retain full-length SSU sequence.

Size parameters

size=10000: Desired size of sketches (if not using autosize).
maxfraction=0.01: (mgf) Max fraction of genomic kmers to use.
minsize=100: Do not generate sketches for genomes smaller than this.
autosize=t: Use flexible sizing instead of fixed-length. This is nonlinear; a human sketch is only ~6x a bacterial sketch.
sizemult=1: Multiply the autosized size of sketches by this factor. Normally a bacterial-size genome will get a sketch size of around 10000; if autosizefactor=2, it would be ~20000.
density=: If this flag is set (to a number between 0 and 1), autosize and sizemult are ignored, and this fraction of genomic kmers are used. For example, at density=0.001, a 4.5Mbp bacteria will get a 4500-kmer sketch.

Metadata parameters (optional; intended for single-sketch mode)

taxid=-1: Set the NCBI taxid.
imgid=-1: Set the IMG id.
spid=-1: Set the JGI sequencing project id.
name=: Set the name (taxname).
name0=: Set name0 (normally the first sequence header).
fname=: Set fname (normally the file name).
meta_=: Set an arbitrary metadata field. For example, meta_Month=March.

Taxonomy-specific parameters

tree=: Specify a taxtree file. On Genepool, use 'auto'.
gi=: Specify a gitable file. On Genepool, use 'auto'.
accession=: Specify one or more comma-delimited NCBI accession to taxid files. On Genepool, use 'auto'.
imgdump=: Specify an IMG dump file containing NCBI taxIDs, for IMG mode.
taxlevel=subspecies: Taxa hits below this rank will be promoted and merged with others.
prefilter=f: For huge datasets full of junk like nt, this flag will save memory by ignoring taxa smaller than minsize. Requires taxonomic information (tree and gi).
tossjunk=t: For taxa mode, discard taxonomically uninformative sequences. This includes sequences with no taxid, with a tax level NO_RANK, of parent taxid of LIFE.
silva=f: Parse headers using Silva or semicolon-delimited syntax.

Ribosomal parameters, which allow SSU sequences to be attached to sketches

processSSU=t: Run gene-calling to detect ribosomal SSU sequences.
16Sfile=<file>: Optional file of 16S sequences, annotated with TaxIDs.
18Sfile=<file>: Optional file of 18S sequences, annotated with TaxIDs.
preferSSUMap=f: Prefer file SSUs over called SSUs.
preferSSUMapEuks=t: Prefer file SSUs over called SSUs for Eukaryotes.
SSUMapOnly=f: Only use file SSUs.
SSUMapOnlyEuks=f: Only use file SSUs for Eukaryotes. This prevents associating an organism with its mitochondrial or chloroplast 16S/18S, which is otherwise a problem.

Java Parameters

-Xmx: This will set Java's memory usage, overriding autodetection. -Xmx20g will specify 20 gigs of RAM, and -Xmx200m will specify 200 megs. The max is typically 85% of physical memory.
-eoom: This flag will cause the process to exit if an out-of-memory exception occurs. Requires Java 8u92+.
-da: Disable assertions.

Workflow Examples

Creating a Reference Database

Build a taxonomic sketch database from RefSeq:

# Download and process taxonomy files (see fetchTaxonomy.sh)
# Download RefSeq sequences

# Create taxonomic sketches
sketch.sh in=refseq.fa.gz out=refseq_taxa#.sketch files=31 \
  mode=taxa tree=tree.taxtree.gz gi=gitable.int1d.gz \
  taxlevel=subspecies blacklist=blacklist_refseq_species_1000.sketch

# Start a sketch server for querying
taxserver.sh port=1234 tree.taxtree.gz gi=gitable.int1d.gz refseq_taxa*.sketch

This creates a searchable taxonomic database that can be queried via HTTP.

Processing Raw Sequencing Reads

For read data, sampling and error filtering improves results:

# For Illumina reads
sendsketch.sh in=reads.fq reads=1m samplerate=0.5 minkeycount=2

# For PacBio/Nanopore reads
sendsketch.sh in=long_reads.fq reads=100k size=200k

Raw reads require careful parameter tuning to handle sequencing errors and coverage.

Multi-Kmer Sketching

sketch.sh in=genomes.fasta out=dual_kmer.sketch k=31,24

Dual kmer lengths provide sensitivity of shorter kmers with specificity of longer kmers.

Protein Sketching

sketch.sh in=proteins.fasta out=proteins.sketch amino=t k=10

Amino acid mode uses different encoding and typically shorter kmers.

Algorithm Details

MinHash Theory

MinHash sketching creates compact sequence representations through a three-step process:

Kmer generation: Extract all length-k kmers from the input sequence
Hash calculation: Apply a hash function to convert each kmer to a numerical value
Minimum selection: Keep only the N smallest hash values as the "sketch"

For two sequences with sketches of size N and M, comparison requires only O(min(N,M)) operations rather than the O(sequence_length²) required for alignment. Since sketch size is typically fixed (e.g., 10,000 kmers), comparisons run in effectively O(1) time regardless of genome size.

Sketch Size Strategies

The SketchMaker class implements three sizing approaches:

Fixed size (size parameter): All sketches contain exactly the same number of kmers
Autosize (default): Sketch size scales nonlinearly with genome complexity. A bacterial genome (~4Mbp) gets ~10,000 kmers, while a human genome gets ~60,000 kmers (only 6x larger despite being 800x larger)
Density-based (density parameter): Fixed fraction of all genomic kmers. A 4.5Mbp bacterium with density=0.001 gets a 4,500-kmer sketch

Taxonomic Processing

In taxa mode, sequences are grouped by taxonomic classification using:

Header parsing: Extracts taxonomic information from sequence headers (NCBI or Silva format)
ID resolution: Maps GI numbers or accessions to NCBI taxonomy using lookup tables
Level promotion: Groups sequences at the specified taxonomic level (e.g., subspecies, species)
Quality filtering: Removes uninformative sequences (tossjunk parameter)

Blacklists and Contamination Control

Blacklists filter out uninformative kmers that occur in many unrelated organisms:

Low-complexity kmers: Poly-A runs, simple repeats
Highly conserved regions: Universal primer sites, ribosomal conserved regions
Contamination patterns: Common laboratory contaminants

Built-in blacklists are optimized for major databases (nt, RefSeq, Silva) and automatically loaded when using SendSketch with remote servers.

Depth Tracking

With depth=t, sketches store kmer occurrence counts along with hash values. This enables:

Coverage estimation: Estimate sequencing depth from kmer frequencies
Copy number detection: Identify repetitive elements and their copy numbers
Contamination quantification: Assess relative abundance of different organisms

Depth calculations assume random read distribution and single-copy genomes.

Performance and Scalability

Memory Usage

Memory requirements scale with:

Sketch size: Larger sketches require more memory but provide higher resolution
Number of taxa: Taxa mode maintains one sketch per taxonomic unit in memory
Thread count: Each thread maintains local sketches, automatically capped at 14 threads

I/O Optimization

The implementation optimizes for large datasets through:

File distribution: Multiple output files enable parallel loading and processing
Delta compression: Stores hash differences rather than absolute values
Streaming processing: Processes sequences without loading entire files into memory
Prefiltering: Automatically enabled for large taxonomic datasets to reduce memory usage

Recommended Usage Patterns

Assemblies: Use default parameters - assemblies provide the highest quality sketches
Illumina reads: Use reads=1m, samplerate=0.5, minkeycount=2 to handle errors
Long reads: Use higher coverage (reads=100k) and larger sketch size (size=200k)
Large databases: Enable prefilter and use multiple output files for parallel processing

Integration with Other Tools

Sketch is designed as part of the BBSketch ecosystem:

CompareSketch: Compare query sketches against sketch databases
SendSketch: Query remote sketch servers (JGI provides public nt, RefSeq, and Silva servers)
TaxServer: Host sketch databases as HTTP services for rapid querying
SketchBlacklist: Create custom blacklists from large sequence collections

Support

For detailed information, please read bbmap/docs/guides/BBSketchGuide.txt.

For questions and support:

Email: bbushnell@lbl.gov
Documentation: bbmap.org
Public servers available at JGI for nt, RefSeq, and Silva databases