Sketch
Creates MinHash sketches from FASTA files for rapid taxonomic identification and sequence comparison. Sketches enable approximate similarity calculations thousands of times faster than alignment-based methods while maintaining high accuracy for species-level identification.
Overview
Sketch creates compact hash-based representations of sequences using the MinHash algorithm. These sketches can be used for:
- Taxonomic identification - Identify species from assemblies or raw reads
- Reference database creation - Build searchable sketch databases from RefSeq, nt, or custom sequences
- Contamination detection - Screen for contaminating organisms in sequencing data
- Rapid similarity screening - Compare sequences without expensive alignments
Sketch is typically used as part of a larger workflow where sketches are created once and queried many times using CompareSketch or SendSketch.
Quick Start
Basic sketch creation
sketch.sh in=genome.fasta out=genome.sketch
Creates a single sketch from an assembled genome.
Reference database creation (per sequence)
sketch.sh in=refseq.fa.gz out=refseq#.sketch files=31 mode=sequence
Creates individual sketches for each sequence, distributing across 31 files for faster loading.
Taxonomic database creation
sketch.sh in=refseq.fa.gz out=refseq#.sketch files=31 mode=taxa \
tree=tree.taxtree.gz gi=gitable.int1d.gz taxlevel=subspecies
Groups sequences by taxonomy, creating one sketch per taxonomic unit at subspecies level.
Parameters
Standard parameters
- in=<file>
- A fasta file containing one or more sequences.
- out=<file>
- Output filename. If multiple files are desired it must contain the # symbol.
- blacklist=<file>
- Ignore keys in this sketch file. Additionally, there are built-in blacklists that can be specified:
nt: Blacklist for nt
refseq: Blacklist for Refseq
silva: Blacklist for Silva
img: Blacklist for IMG - files=1
- Number of output sketch files to produce, for parallel loading. Independent of the number of sketches produced; sketches will be randomly distributed between files.
- k=32,24
- Kmer length, 1-32. To maximize sensitivity and specificity, dual kmer lengths may be used, e.g. k=32,24. Query and reference k must match.
- rcomp=t
- Look at reverse-complement kmers also.
- amino=f
- Use amino acid mode. Input should be amino acids.
- translate=f
- Call genes and translate to proteins. Input should be nucleotides. Designed for prokaryotes.
- mode=single
- Possible modes:
single: Write one sketch.
sequence: Write one sketch per sequence.
taxa: Write one sketch per taxonomic unit. Requires more memory, and taxonomic annotation.
img: Write one sketch per IMG id. - delta=t
- Delta-compress sketches.
- a48=t
- Encode sketches as ASCII-48 rather than hex.
- depth=f
- Track the number of times kmers appear. Required for the depth2 field in comparisons.
- entropy=0.66
- Ignore sequence with entropy below this value.
- ssu=t
- Scan for and retain full-length SSU sequence.
Size parameters
- size=10000
- Desired size of sketches (if not using autosize).
- maxfraction=0.01
- (mgf) Max fraction of genomic kmers to use.
- minsize=100
- Do not generate sketches for genomes smaller than this.
- autosize=t
- Use flexible sizing instead of fixed-length. This is nonlinear; a human sketch is only ~6x a bacterial sketch.
- sizemult=1
- Multiply the autosized size of sketches by this factor. Normally a bacterial-size genome will get a sketch size of around 10000; if autosizefactor=2, it would be ~20000.
- density=
- If this flag is set (to a number between 0 and 1), autosize and sizemult are ignored, and this fraction of genomic kmers are used. For example, at density=0.001, a 4.5Mbp bacteria will get a 4500-kmer sketch.
Metadata parameters (optional; intended for single-sketch mode)
- taxid=-1
- Set the NCBI taxid.
- imgid=-1
- Set the IMG id.
- spid=-1
- Set the JGI sequencing project id.
- name=
- Set the name (taxname).
- name0=
- Set name0 (normally the first sequence header).
- fname=
- Set fname (normally the file name).
- meta_=
- Set an arbitrary metadata field. For example, meta_Month=March.
Taxonomy-specific parameters
- tree=
- Specify a taxtree file. On Genepool, use 'auto'.
- gi=
- Specify a gitable file. On Genepool, use 'auto'.
- accession=
- Specify one or more comma-delimited NCBI accession to taxid files. On Genepool, use 'auto'.
- imgdump=
- Specify an IMG dump file containing NCBI taxIDs, for IMG mode.
- taxlevel=subspecies
- Taxa hits below this rank will be promoted and merged with others.
- prefilter=f
- For huge datasets full of junk like nt, this flag will save memory by ignoring taxa smaller than minsize. Requires taxonomic information (tree and gi).
- tossjunk=t
- For taxa mode, discard taxonomically uninformative sequences. This includes sequences with no taxid, with a tax level NO_RANK, of parent taxid of LIFE.
- silva=f
- Parse headers using Silva or semicolon-delimited syntax.
Ribosomal parameters, which allow SSU sequences to be attached to sketches
- processSSU=t
- Run gene-calling to detect ribosomal SSU sequences.
- 16Sfile=<file>
- Optional file of 16S sequences, annotated with TaxIDs.
- 18Sfile=<file>
- Optional file of 18S sequences, annotated with TaxIDs.
- preferSSUMap=f
- Prefer file SSUs over called SSUs.
- preferSSUMapEuks=t
- Prefer file SSUs over called SSUs for Eukaryotes.
- SSUMapOnly=f
- Only use file SSUs.
- SSUMapOnlyEuks=f
- Only use file SSUs for Eukaryotes. This prevents associating an organism with its mitochondrial or chloroplast 16S/18S, which is otherwise a problem.
Java Parameters
- -Xmx
- This will set Java's memory usage, overriding autodetection. -Xmx20g will specify 20 gigs of RAM, and -Xmx200m will specify 200 megs. The max is typically 85% of physical memory.
- -eoom
- This flag will cause the process to exit if an out-of-memory exception occurs. Requires Java 8u92+.
- -da
- Disable assertions.
Workflow Examples
Creating a Reference Database
Build a taxonomic sketch database from RefSeq:
# Download and process taxonomy files (see fetchTaxonomy.sh)
# Download RefSeq sequences
# Create taxonomic sketches
sketch.sh in=refseq.fa.gz out=refseq_taxa#.sketch files=31 \
mode=taxa tree=tree.taxtree.gz gi=gitable.int1d.gz \
taxlevel=subspecies blacklist=blacklist_refseq_species_1000.sketch
# Start a sketch server for querying
taxserver.sh port=1234 tree.taxtree.gz gi=gitable.int1d.gz refseq_taxa*.sketch
This creates a searchable taxonomic database that can be queried via HTTP.
Processing Raw Sequencing Reads
For read data, sampling and error filtering improves results:
# For Illumina reads
sendsketch.sh in=reads.fq reads=1m samplerate=0.5 minkeycount=2
# For PacBio/Nanopore reads
sendsketch.sh in=long_reads.fq reads=100k size=200k
Raw reads require careful parameter tuning to handle sequencing errors and coverage.
Multi-Kmer Sketching
sketch.sh in=genomes.fasta out=dual_kmer.sketch k=31,24
Dual kmer lengths provide sensitivity of shorter kmers with specificity of longer kmers.
Protein Sketching
sketch.sh in=proteins.fasta out=proteins.sketch amino=t k=10
Amino acid mode uses different encoding and typically shorter kmers.
Algorithm Details
MinHash Theory
MinHash sketching creates compact sequence representations through a three-step process:
- Kmer generation: Extract all length-k kmers from the input sequence
- Hash calculation: Apply a hash function to convert each kmer to a numerical value
- Minimum selection: Keep only the N smallest hash values as the "sketch"
For two sequences with sketches of size N and M, comparison requires only O(min(N,M)) operations rather than the O(sequence_length²) required for alignment. Since sketch size is typically fixed (e.g., 10,000 kmers), comparisons run in effectively O(1) time regardless of genome size.
Sketch Size Strategies
The SketchMaker class implements three sizing approaches:
- Fixed size (size parameter): All sketches contain exactly the same number of kmers
- Autosize (default): Sketch size scales nonlinearly with genome complexity. A bacterial genome (~4Mbp) gets ~10,000 kmers, while a human genome gets ~60,000 kmers (only 6x larger despite being 800x larger)
- Density-based (density parameter): Fixed fraction of all genomic kmers. A 4.5Mbp bacterium with density=0.001 gets a 4,500-kmer sketch
Taxonomic Processing
In taxa mode, sequences are grouped by taxonomic classification using:
- Header parsing: Extracts taxonomic information from sequence headers (NCBI or Silva format)
- ID resolution: Maps GI numbers or accessions to NCBI taxonomy using lookup tables
- Level promotion: Groups sequences at the specified taxonomic level (e.g., subspecies, species)
- Quality filtering: Removes uninformative sequences (tossjunk parameter)
Blacklists and Contamination Control
Blacklists filter out uninformative kmers that occur in many unrelated organisms:
- Low-complexity kmers: Poly-A runs, simple repeats
- Highly conserved regions: Universal primer sites, ribosomal conserved regions
- Contamination patterns: Common laboratory contaminants
Built-in blacklists are optimized for major databases (nt, RefSeq, Silva) and automatically loaded when using SendSketch with remote servers.
Depth Tracking
With depth=t, sketches store kmer occurrence counts along with hash values. This enables:
- Coverage estimation: Estimate sequencing depth from kmer frequencies
- Copy number detection: Identify repetitive elements and their copy numbers
- Contamination quantification: Assess relative abundance of different organisms
Depth calculations assume random read distribution and single-copy genomes.
Performance and Scalability
Memory Usage
Memory requirements scale with:
- Sketch size: Larger sketches require more memory but provide higher resolution
- Number of taxa: Taxa mode maintains one sketch per taxonomic unit in memory
- Thread count: Each thread maintains local sketches, automatically capped at 14 threads
I/O Optimization
The implementation optimizes for large datasets through:
- File distribution: Multiple output files enable parallel loading and processing
- Delta compression: Stores hash differences rather than absolute values
- Streaming processing: Processes sequences without loading entire files into memory
- Prefiltering: Automatically enabled for large taxonomic datasets to reduce memory usage
Recommended Usage Patterns
- Assemblies: Use default parameters - assemblies provide the highest quality sketches
- Illumina reads: Use reads=1m, samplerate=0.5, minkeycount=2 to handle errors
- Long reads: Use higher coverage (reads=100k) and larger sketch size (size=200k)
- Large databases: Enable prefilter and use multiple output files for parallel processing
Integration with Other Tools
Sketch is designed as part of the BBSketch ecosystem:
- CompareSketch: Compare query sketches against sketch databases
- SendSketch: Query remote sketch servers (JGI provides public nt, RefSeq, and Silva servers)
- TaxServer: Host sketch databases as HTTP services for rapid querying
- SketchBlacklist: Create custom blacklists from large sequence collections
Support
For detailed information, please read bbmap/docs/guides/BBSketchGuide.txt
.
For questions and support:
- Email: bbushnell@lbl.gov
- Documentation: bbmap.org
- Public servers available at JGI for nt, RefSeq, and Silva databases