SendSketch

Script: sendsketch.sh
Package: sketch
Class: SendSketch.java

Remote taxonomic identification tool that compares query sketches to reference databases hosted on JGI servers. Designed for rapid species identification from assemblies or raw sequencing reads using MinHash sketching.

Quick Start

Assembly Identification (Recommended)

# Simple assembly identification - works for most cases
sendsketch.sh in=assembly.fasta

# Use specific servers
sendsketch.sh in=genome.fasta refseq
sendsketch.sh in=16s_sequences.fasta silva

For assembled genomes or transcriptomes, SendSketch works best with no additional parameters. Assemblies generally give better results than raw reads, including more accurate genome size and completeness estimates.

Raw Read Analysis

# Illumina/Ion Torrent reads (recommended settings)
sendsketch.sh in=reads.fq reads=1m samplerate=0.5 minkeycount=2

# PacBio/Nanopore reads (higher coverage, larger sketches)
sendsketch.sh in=pacbio.fq reads=100k size=200k

# Paired-end reads with merging for improved accuracy
sendsketch.sh in=reads.fq reads=400k merge=t minprob=0.2

Raw read sets usually contain far more coverage than sketching needs, and they include error k-mers that pollute sketches. The recommended settings sample enough reads for roughly 1-3x coverage and filter out likely sequencing errors with k-mer count thresholds.
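The 1-3x coverage target can be turned into a samplerate value with simple arithmetic. The Python sketch below is illustrative only: the function name is hypothetical, and it assumes you already know the approximate genome size and read length.

```python
def samplerate_for_coverage(num_reads, read_len, genome_size, target_cov=2.0):
    """Return the fraction of reads to keep so expected coverage is about target_cov."""
    if min(num_reads, read_len, genome_size) <= 0:
        raise ValueError("all inputs must be positive")
    # Expected fold coverage if every read were used.
    coverage = num_reads * read_len / genome_size
    # Never sample more than 100% of the reads.
    return min(1.0, target_cov / coverage)

# 1,000,000 reads of 150 bp over a ~5 Mbp genome is 30x; keeping 10% leaves about 3x.
rate = samplerate_for_coverage(1_000_000, 150, 5_000_000, target_cov=3.0)
print(f"samplerate={rate:.2f}")
```

The result would then be passed on the command line as samplerate=<value>, alongside reads= to cap the number of reads processed.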

Server Selection

SendSketch connects to JGI-hosted reference databases, each optimized for different identification tasks. Server selection automatically configures appropriate blacklists and k-mer parameters.

Available Servers

refseq (default): RefSeq bacterial, archaeal, and viral genomes. Best general-purpose option for most prokaryotic identification tasks. Automatically applies RefSeq blacklist and uses k=31.
nt: NCBI Nucleotide database including environmental and uncultured sequences. Broader coverage than RefSeq but may include more false positives. Uses NT-specific blacklist.
silva or ribo: Silva ribosomal RNA database (16S/18S sequences). Specialized for ribosomal gene identification and phylogenetic analysis. Recommended for amplicon sequencing data.
protein: RefSeq prokaryotic amino acid sequences. For nucleotide input, genes are called and translated before comparison. Uses protein-specific blacklist and longer sketches (AUTOSIZE_FACTOR=3.0).
amino protein: For direct amino acid sequence input. Requires protein sequences in FASTA format.

Server-Specific Examples

# Default RefSeq identification
sendsketch.sh in=bacterial_isolate.fasta

# Environmental sample analysis with broader database
sendsketch.sh in=metagenome_contigs.fasta nt

# Ribosomal gene identification
sendsketch.sh in=16s_amplicons.fasta silva

# Protein-based identification (translates first)
sendsketch.sh in=unknown_genome.fasta protein

# Direct protein comparison
sendsketch.sh in=proteins.faa amino protein

Parameters

Standard parameters

in=<file>: Sketch or fasta file to compare. Can be pre-computed sketch from sketch.sh, or sequence data (fasta/fastq) from which sketches will be generated.
out=stdout: Comparison output destination. Defaults to stdout but can be redirected to a file.
outsketch=: Optional filename to save the generated query sketch for future comparisons.
local=f: Have the server load sketches directly from the filesystem. Enables whitelist usage and is recommended for the Silva server. Only works when the client and server share a filesystem (e.g., Genepool and Cori).
address=: Remote server address. Default: https://refseq-sketch.jgi.doe.gov/sketch. Server abbreviations (nt, refseq, silva, protein) automatically set address, blacklist, and k-mer parameters.
aws=f: Use AWS servers instead of NERSC. Helpful when NERSC or SF Bay area connectivity is impaired.

Sketch-making parameters

mode=single

Sketching mode for fasta input:

single: One sketch per file
sequence: One sketch per sequence

k=31

K-mer length (1-32). Automatically set for JGI servers and generally doesn't need manual specification. Only required for custom servers.

samplerate=1

Fraction of reads to sample. For raw reads, values of 0.3-0.7 reduce error k-mers while maintaining sufficient coverage for identification. Higher values are better for high-error data such as PacBio.

minkeycount=1

Minimum k-mer occurrence threshold. Values >1 filter error k-mers from raw reads. Recommended: minkeycount=2 for Illumina reads.

minprob=0.0001

K-mer probability threshold for quality-based filtering. Higher values (e.g., 0.2) substantially improve ANI estimates from Illumina reads.

minqual=0

Ignore k-mers spanning bases below this quality score.

entropy=0.66

Filter low-complexity sequence below this Shannon entropy threshold (0-1 scale). Removes uninformative repetitive sequences.

merge=f

Merge paired reads before sketching. Improves quality, eliminates adapter sequence, and prevents double-counting overlapping k-mers for more accurate depth estimation.

amino=f

Input consists of amino acid sequences. Use with protein servers.

translate=f

Call genes and translate to proteins. Input should be nucleotides. Designed for prokaryotic genomes.

sixframes=f

Translate all 6 reading frames instead of gene prediction. More comprehensive but computationally intensive.

ssu=t

Scan for and retain full-length SSU (16S/18S) sequences. Particularly useful for phylogenetic analysis.

printssusequence=f

Include SSU sequence in JSON output format.

refid=

Specify reference sketch by name or NCBI TaxID instead of query file. Examples: refid=h.sapiens or refid=9606.
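As a rough illustration of the minprob filter described above: under the standard Phred model, a base with quality Q is wrong with probability 10^(-Q/10), so a k-mer is error-free with probability equal to the product of its per-base correctness probabilities. The helpers below are a hypothetical sketch of that calculation, not SendSketch's actual code.

```python
import math

def kmer_correct_prob(quals):
    """Probability that every base in a k-mer is correct, given Phred quality scores."""
    return math.prod(1.0 - 10.0 ** (-q / 10.0) for q in quals)

def passes_minprob(quals, minprob=0.0001):
    """True if the k-mer would survive a minprob-style threshold (assumed semantics)."""
    return kmer_correct_prob(quals) >= minprob

high_quality = [30] * 31   # a 31-mer of Q30 bases, about 0.97 probability of being correct
low_quality = [10] * 31    # a 31-mer of Q10 bases, about 0.04 probability of being correct

print(passes_minprob(high_quality, minprob=0.2))
print(passes_minprob(low_quality, minprob=0.2))
```

With the default threshold of 0.0001 both k-mers would be kept; raising it to 0.2 discards the low-quality one, which is consistent with the advice that higher values help with Illumina reads.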

Size parameters

size=10000: Fixed sketch size when autosize is disabled. Generally not recommended - autosize provides better scaling.
mgf=0.01: (maxgenomefraction) Maximum fraction of genomic k-mers in sketches. Default 0.01 means sketches use at most 1% of total genomic k-mers.
minsize=100: Minimum sequence length for sketch generation. Very short sequences work poorly with sketching.
autosize=t: Flexible sketch sizing based on genome size. Produces ~10,000 k-mers for bacteria, ~40,000 for vertebrates. Recommended over fixed sizing.
sizemult=1: Multiply autosize results by this factor. Doubling sketch size approximately doubles sensitivity for short sequences.
density=: When set (0-1), use this fraction of genomic k-mers, overriding autosize. Example: density=0.001 gives 4,500 k-mers for a 4.5Mbp genome.
sketchheapfactor=4: When minkeycount>1, temporarily track this multiple of k-mers until counts are determined and low-count k-mers discarded.
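The density example above is a direct multiplication, which can be checked with a short Python sketch. The function names are hypothetical, and the way mgf caps a fixed size is an assumption made for illustration; only the density arithmetic comes from the parameter description.

```python
def density_sketch_size(genomic_kmers, density):
    """Sketch length when density= is set: a fixed fraction of genomic k-mers."""
    if not 0 < density <= 1:
        raise ValueError("density must be in (0, 1]")
    return round(genomic_kmers * density)

def capped_fixed_size(genomic_kmers, size=10000, mgf=0.01):
    """Fixed size limited by mgf (assumption: mgf is an upper bound on the sketch length)."""
    return min(size, round(genomic_kmers * mgf))

print(density_sketch_size(4_500_000, 0.001))   # 4.5 Mbp genome at density=0.001
print(capped_fixed_size(500_000))              # small genome: the 1% cap applies
print(capped_fixed_size(4_500_000))            # typical bacterium: size= applies
```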

Taxonomy and filtering parameters

level=2: Report the best record per taxon at this level. Levels: 0=disabled, 1=subspecies, 2=species, 3=genus, etc.
include=: Restrict output to specific clades. Comma-delimited list of taxonomic names or NCBI TaxIDs.
includelevel=0: Promote include list to this taxonomic level. Example: include=h.sapiens includelevel=phylum includes all Chordata.
includestring=: Only report records whose name contains this text string.
exclude=: Ignore specific clades. Comma-delimited list of names or TaxIDs.
excludelevel=0: Promote exclude list to this taxonomic level.
excludestring=: Exclude records whose name contains this text string.
banunclassified=f: Ignore organisms from 'unclassified' taxonomic nodes.
banvirus=f: Exclude viral sequences from results.
requiressu=f: Only include references with associated SSU sequences.
minrefsize=0: Ignore reference sketches smaller than this (unique k-mers).
minrefsizebases=0: Ignore references smaller than this (total base pairs).
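To make the "promote to a taxonomic level" behavior of includelevel concrete, here is a minimal Python sketch over a toy lineage table. The table, function names, and data layout are hypothetical; the real tool resolves lineages from the NCBI taxonomy on the server.

```python
# Toy lineages (hypothetical data): TaxID -> {level name: ancestor TaxID}
LINEAGE = {
    9606:  {"species": 9606,  "phylum": 7711},   # Homo sapiens, Chordata
    10090: {"species": 10090, "phylum": 7711},   # Mus musculus, Chordata
    562:   {"species": 562,   "phylum": 1224},   # Escherichia coli, Proteobacteria
}

def promote(taxids, level):
    """Replace each TaxID with its ancestor at the given level."""
    return {LINEAGE[t][level] for t in taxids}

def filter_included(record_taxids, include, includelevel):
    """Keep records whose ancestor at includelevel is in the promoted include list."""
    allowed = promote(include, includelevel)
    return [t for t in record_taxids if LINEAGE[t][includelevel] in allowed]

# include=9606 promoted to phylum keeps every chordate, not only human.
print(filter_included([9606, 10090, 562], include=[9606], includelevel="phylum"))
```

The exclude and excludelevel parameters can be pictured the same way, with the final test inverted.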

Output format parameters

format=2: Output format: 2=default tabular, 3=single line per hit, 4=JSON, 5=Constellation.
usetaxidname=f: For format 3, use TaxID instead of taxonomic name in name column.
usetaxname: For format 3, use full taxonomic name in name column.
useimgname: For format 3, use IMG identifier in name column.
d3=f: JSON output with tree structure for D3.js visualization.

Output column parameters (for format=2)

printall=f: Enable all output columns.
printani=t: Average nucleotide identity estimate.
completeness=t: Genome completeness estimate.
score=f: Comparison score used for result ranking.
printmatches=t: Number of k-mer matches between query and reference.
printlength=f: Total k-mers compared.
printtaxid=t: NCBI taxonomic identifier.
printimg=f: IMG identifier (when available).
printgbases=f: Estimated genomic bases.
printgkmers=f: Total genomic k-mers.
printgsize=t: Estimated unique genomic k-mers.
printgseqs=t: Number of sequences (scaffolds/reads).
printtaxname=t: Taxonomic name associated with TaxID.
printname0=f: Original sequence name from FASTA header.
printqfname=t: Query filename.
printrfname=f: Reference filename.
printtaxa=f: Complete taxonomic lineage.
printcontam=t: Contamination estimate based on k-mers present in other references.
printunique=t: Matches unique to this reference.
printunique2=f: Matches unique to this reference's taxonomic group.
printunique3=f: Query k-mers unique to this reference's taxonomic group.
printnohit=f: K-mers that match no reference.
printrefhits=f: Average reference sketches hit by shared k-mers.
printgc=f: GC content percentage.
printucontam=f: Contamination hits to exactly one reference.
printcontam2=f: Contamination from taxonomically unrelated references.
contamlevel=species: Taxonomic level for contamination calculations.
printdepth=f: Average depth of sketch k-mers from reads.
printdepth2=f: Depth compensated for genomic repeats.
actualdepth=t: Convert observed to estimated actual depth including uncovered regions.
printvolume=f: Product of average depth and match count.
printca=f: Common ancestor taxonomy (when query TaxID known).
printcal=f: Common ancestor taxonomic level.
recordsperlevel=0: Maximum records per common ancestor level.

Sorting parameters

sortbyscore=t: Default sort by comparison score.
sortbydepth=f: Include depth in sort order.
sortbydepth2=f: Include repeat-compensated depth in sort.
sortbyvolume=f: Include volume in sort order.
sortbykid=f: Sort by k-mer identity only.
sortbyani=f: Sort by ANI/AAI/WKID only.
sortbyhits=f: Sort by match count only.

Other output parameters

minhits=3: Minimum k-mer matches required for reporting.
minani=0: Minimum ANI threshold for reporting (0-1).
minwkid=0.0001: Minimum WKID threshold for reporting (0-1).
anifromwkid=t: Calculate ANI from WKID. If false, use KID.
minbases=0: Ignore references shorter than this length.
minsizeratio=0: Don't compare if size ratio between genomes is below this threshold.
records=20: Maximum records to report per query.
color=family: Color-code results at specified taxonomic level. Set color=f to disable.
intersect=f: Print detailed sketch intersection information.
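The reporting thresholds above act as a filter, followed by sorting and truncation. The Python sketch below shows that pipeline under assumed semantics; the record fields (matches, ani, wkid, score) are hypothetical names chosen for illustration, not SendSketch's internal structure.

```python
def select_hits(hits, minhits=3, minani=0.0, minwkid=0.0001, records=20):
    """Apply minhits/minani/minwkid, sort by score (descending), keep the top records."""
    kept = [
        h for h in hits
        if h["matches"] >= minhits and h["ani"] >= minani and h["wkid"] >= minwkid
    ]
    kept.sort(key=lambda h: h["score"], reverse=True)
    return kept[:records]

hits = [
    {"name": "ref_a", "matches": 950, "ani": 0.99, "wkid": 0.80, "score": 900},
    {"name": "ref_b", "matches": 2,   "ani": 0.85, "wkid": 0.01, "score": 5},
    {"name": "ref_c", "matches": 40,  "ani": 0.93, "wkid": 0.10, "score": 60},
]

# Default thresholds drop ref_b (too few matches); minani=0.95 also drops ref_c.
print([h["name"] for h in select_hits(hits)])
print([h["name"] for h in select_hits(hits, minhits=10, minani=0.95)])
```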

Metadata parameters

taxid=-1: Set query NCBI taxonomic identifier.
imgid=-1: Set query IMG identifier.
spid=-1: Set sequencing project identifier (JGI-specific).
name=: Set taxonomic name for query.
name0=: Set name0 field (normally from first sequence header).
fname=: Set filename field.
meta_=: Set arbitrary metadata fields. Example: meta_Month=March.

Other parameters

requiredmeta=: Required metadata values for reference filtering. Example: requiredmeta=subunit:ssu,source:silva.
rmeta=: Alias for requiredmeta.
bannedmeta=: Forbidden metadata values for reference filtering.

Java Parameters

-Xmx: Set Java memory usage. -Xmx20g = 20GB, -Xmx200m = 200MB. Maximum typically 85% of physical memory.
-eoom: Exit on out-of-memory exception. Requires Java 8u92+.
-da: Disable assertions for performance.

Recommended Workflows

Bacterial Isolate Identification

# Simple assembly identification
sendsketch.sh in=isolate_assembly.fasta

# High-stringency identification
sendsketch.sh in=isolate.fasta minhits=10 minani=0.95 banvirus=t

For bacterial isolates, assembled genomes provide the most accurate identification. Use high stringency settings when precise identification is critical.

Metagenome Analysis

# Contigs from metagenome assembly
sendsketch.sh in=metagenome_contigs.fasta mode=sequence records=50

# Raw metagenomic reads (conservative sampling)
sendsketch.sh in=metagenome_reads.fq reads=500k samplerate=0.3 minkeycount=2

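Metagenomic samples benefit from per-sequence analysis and increased record limits to capture community diversity. Conservative read sampling, combined with minkeycount=2, limits the number of error k-mers entering the sketch; random sampling does not change the relative abundance of community members, so dominant species still contribute most of the k-mers.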

Ribosomal Gene Analysis

# 16S/18S amplicon identification
sendsketch.sh in=amplicons.fasta silva

# Local Silva server with whitelist mode (if available)
sendsketch.sh in=amplicons.fasta silva local=t size=100k

The Silva server specializes in ribosomal sequences. Local mode with whitelists provides enhanced sensitivity when the client and server share filesystem access.

Quality Control and Contamination Screening

# Comprehensive contamination analysis
sendsketch.sh in=sample.fasta printcontam=t printunique=t printunique2=t

# Focus on human contamination
sendsketch.sh in=reads.fq include=9606 includelevel=0 records=5

# Exclude common lab contaminants
sendsketch.sh in=culture.fasta exclude=9606,4932,562 excludelevel=1

SendSketch can identify contamination sources by analyzing unique and shared k-mers across taxonomic groups.

Output Interpretation

Key Metrics

ANI (Average Nucleotide Identity): Estimated sequence identity derived from k-mer sharing. Values >95% typically indicate the same species; values >85% often suggest the same genus, although genus boundaries are less consistent.
Completeness: Percentage of reference genome represented in query. Lower values may indicate incomplete assemblies or partial matches.
Contamination: Percentage of query k-mers matching other references, suggesting mixed samples or contamination.
WKID (Weighted K-mer Identity): K-mer identity compensated for genome size differences. More accurate than KID for comparing genomes of different sizes.
Matches: Number of shared k-mers between query and reference. Higher values indicate stronger evidence for relatedness.

Interpreting Results

Estimates assume a complete, pure sample that is closely related to the reference. Result accuracy depends on how many of these assumptions are violated:

Perfect conditions: Complete, pure, high-identity samples yield accurate ANI, completeness, and contamination estimates
One violation: ANI remains accurate, but completeness and contamination estimates may drift
Multiple violations: All estimates become less reliable but still provide useful relative rankings

Algorithm Details

MinHash Sketching Process

SendSketch uses the SketchTool class to generate MinHash sketches from input sequences. The process extracts all k-mers, applies a hash function to each, and retains only the smallest hash codes, so the sketch length is bounded (fixed by size=, or scaled to genome size when autosize=t). This approach enables rapid comparison by reducing large genomes to representative k-mer sets.
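The idea can be sketched in a few lines of Python. This is a generic bottom-k MinHash under stated assumptions, not the SketchTool implementation: the hash function (BLAKE2b truncated to 64 bits), the use of canonical k-mers, and all function names are choices made for this example.

```python
import hashlib

_COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    return seq.translate(_COMPLEMENT)[::-1]

def canonical(kmer):
    """Strand-independent form: the smaller of a k-mer and its reverse complement."""
    return min(kmer, reverse_complement(kmer))

def hash64(kmer):
    """Deterministic 64-bit hash (assumption: any well-mixed hash would do)."""
    digest = hashlib.blake2b(kmer.encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")

def minhash_sketch(seq, k=31, size=10000):
    """Hash every valid k-mer and keep the `size` smallest distinct hash codes."""
    seq = seq.upper()
    hashes = set()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if set(kmer) <= set("ACGT"):   # skip k-mers containing N or other symbols
            hashes.add(hash64(canonical(kmer)))
    return sorted(hashes)[:size]

seq = "ACGTTGCAAGCTTAGGCTAACGGATCCGTA"
sketch = minhash_sketch(seq, k=5, size=8)
# Canonical k-mers make the sketch identical for either strand.
print(sketch == minhash_sketch(reverse_complement(seq), k=5, size=8))
```

Because only the smallest hash codes are kept, two genomes that share many k-mers will also share many sketch entries, which is what makes the comparison fast.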

Client-Server Architecture

Communication occurs through ServerTools.sendAndReceive() using HTTP POST requests. To prevent server overload, sketches are transmitted in batches limited by SEND_BUFFER_MAX_SKETCHES (400 sketches) and SEND_BUFFER_MAX_BYTES (8MB). For large datasets, these limits scale automatically: 2x for >1000 sketches, 4x for >4000 sketches.
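The batching rule described above can be pictured with the following Python sketch. The constant names and scaling thresholds come from the text; treating 8MB as 8,000,000 bytes, the function names, and the exact flush condition are assumptions, and no network call is made here.

```python
SEND_BUFFER_MAX_SKETCHES = 400
SEND_BUFFER_MAX_BYTES = 8_000_000  # "8MB" in the text; exact byte count assumed

def limit_multiplier(total_sketches):
    """Scale the limits for large inputs: 2x above 1000 sketches, 4x above 4000."""
    if total_sketches > 4000:
        return 4
    if total_sketches > 1000:
        return 2
    return 1

def make_batches(sketches):
    """Group serialized sketches (bytes) so each batch respects both limits."""
    mult = limit_multiplier(len(sketches))
    max_count = SEND_BUFFER_MAX_SKETCHES * mult
    max_bytes = SEND_BUFFER_MAX_BYTES * mult
    batch, batch_bytes = [], 0
    for s in sketches:
        # Flush before adding if either limit would be exceeded.
        if batch and (len(batch) >= max_count or batch_bytes + len(s) > max_bytes):
            yield batch
            batch, batch_bytes = [], 0
        batch.append(s)
        batch_bytes += len(s)
    if batch:
        yield batch

print([len(b) for b in make_batches([b"x" * 10] * 1000)])
print([len(b) for b in make_batches([b"x" * 10] * 1001)])
```

Each yielded batch would correspond to one POST request in the real client.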

Performance Characteristics

Expected query time is roughly constant with respect to total database size. It depends on the size of the query sketch and on the number of related organisms sharing k-mers with the query, not on the total number of references. This means adding unrelated sequences to reference databases doesn't affect query performance.
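One way to get this behavior is an inverted index from hash code to the references that contain it, so a query only touches references it shares k-mers with. The Python sketch below illustrates that general technique; whether the server is organized exactly this way is an assumption, and the names are hypothetical.

```python
from collections import Counter, defaultdict

def build_index(ref_sketches):
    """Map each hash code to the list of reference IDs whose sketch contains it."""
    index = defaultdict(list)
    for ref_id, sketch in ref_sketches.items():
        for h in sketch:
            index[h].append(ref_id)
    return index

def count_hits(query_sketch, index):
    """One dictionary lookup per query hash; unrelated references are never visited."""
    hits = Counter()
    for h in query_sketch:
        for ref_id in index.get(h, ()):
            hits[ref_id] += 1
    return hits

refs = {"A": {1, 2, 3}, "B": {3, 4}, "C": {9}}
index = build_index(refs)
hits = count_hits([1, 2, 3, 5], index)
print(hits.most_common())
```

Reference C shares nothing with the query, so it never appears in the result and costs nothing at query time.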

Reference Database Structure

Server selection automatically configures database-specific parameters:

RefSeq: AUTOSIZE_FACTOR=2.0, RefSeq blacklist, optimized for prokaryotic genomes
NT: Broader coverage including environmental sequences, NT-specific blacklist
Silva: Ribosomal specialization, Silva blacklist, supports whitelist mode locally
Protein: AUTOSIZE_FACTOR=3.0, amino acid k-mers, prokaryotic protein blacklist

Similarity Calculation

Server-side comparison uses Jaccard similarity between MinHash sketch sets. ANI estimation applies empirical formulas that correlate k-mer sharing with nucleotide identity. Completeness estimates query coverage as the fraction of reference k-mers present in the query sketch.
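The quantities in this paragraph can be illustrated with plain set arithmetic. The Python sketch below is a simplification: it computes Jaccard similarity and containment directly on two sketches, and it uses ANI ≈ KID^(1/k) (a k-mer is shared only if all k bases match) as a common first-order approximation. The empirical formulas used by the server are not reproduced here, and the function names are hypothetical.

```python
def jaccard(a, b):
    """Shared hash codes divided by the union of both sketches."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def containment(query, ref):
    """Fraction of query hash codes that also occur in the reference."""
    query, ref = set(query), set(ref)
    return len(query & ref) / len(query) if query else 0.0

def ani_from_kid(kid, k=31):
    """First-order approximation: k-mer identity is about ANI**k, so ANI is KID**(1/k)."""
    if not 0.0 <= kid <= 1.0:
        raise ValueError("kid must be in [0, 1]")
    return kid ** (1.0 / k)

print(jaccard([1, 2, 3], [2, 3, 4]))
print(containment([1, 2, 3, 4], [3, 4, 5]))
print(round(ani_from_kid(0.73, k=31), 4))
```

The exponent explains why small drops in ANI cause large drops in shared k-mers: at k=31, 99% identity already removes roughly a quarter of the k-mers.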

Contamination Detection

The system flags a query k-mer as a contaminant when it appears in some reference sketch but not in the reference being compared. The contamlevel parameter sets the taxonomic level used to decide whether another reference is related (legitimate sequence sharing) or unrelated (likely contamination), as reported by printcontam2.
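A minimal Python sketch of this bookkeeping follows. It partitions query hash codes into matches, contaminant hits, and no-hits relative to one reference; the function name and return layout are hypothetical, and taxonomic relatedness (contamlevel) is deliberately left out.

```python
def classify_query_kmers(query, ref, other_refs):
    """Split query hash codes relative to one reference sketch.

    matches: in the query and in this reference
    contam:  in the query and in some other reference, but not this one
    nohit:   in the query and in no reference at all
    """
    query, ref = set(query), set(ref)
    elsewhere = set().union(*other_refs)
    matches = query & ref
    contam = (query - ref) & elsewhere
    nohit = query - ref - elsewhere
    return matches, contam, nohit

query = [1, 2, 3, 4, 5, 6]
matches, contam, nohit = classify_query_kmers(query, ref={1, 2, 3}, other_refs=[{3, 4}, {5}])
print(len(matches), len(contam), len(nohit))
```

Hash code 3 occurs in both the compared reference and another one, so it counts as a match rather than as contamination.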

Blacklist and Entropy Filtering

Automatic blacklisting removes uninformative k-mers that are highly conserved (like ribosomal primers) or low-complexity (like homopolymers). The default entropy threshold of 0.66 filters repetitive sequences that can cause spurious matches between unrelated organisms.
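To show why an entropy threshold removes repeats, the Python sketch below computes a normalized Shannon entropy over the short k-mers of a sequence. It is an illustration under assumptions: the k-mer length of 3, the normalization, and the function name are choices made here, and the tool's own entropy calculation may differ in detail.

```python
import math
from collections import Counter

def kmer_entropy(seq, k=3):
    """Shannon entropy of the k-mer distribution in seq, scaled to the range 0-1."""
    n = len(seq) - k + 1
    if n <= 1:
        return 0.0
    counts = Counter(seq[i:i + k] for i in range(n))
    h = -sum((c / n) * math.log2(c / n) for c in counts.values())
    # The maximum is reached when every window holds a different k-mer.
    max_h = math.log2(min(4 ** k, n))
    return h / max_h

homopolymer = "A" * 40
complex_seq = "ACGTTGCAAGCTTAGGCTAACGGATCCGTA"

for name, s in [("homopolymer", homopolymer), ("complex", complex_seq)]:
    keep = kmer_entropy(s) >= 0.66
    print(name, "kept" if keep else "filtered")
```

A homopolymer has a single distinct k-mer and therefore zero entropy, so it falls below the 0.66 threshold, while ordinary genomic sequence scores close to 1.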

Support

For questions and support:

Read the BBSketch Guide at bbtools/docs/guides/BBSketchGuide.txt for comprehensive information
Email: bbushnell@lbl.gov
Documentation: bbmap.org