FindSSU

Basic Usage

findssu.sh ssu1.fa ssu2.fa
findssu.sh genome.fa call
findssu.sh literal=ACGTACGT...
findssu.sh name=Escherichia_coli
findssu.sh name=Saccharomyces_cerevisiae its
findssu.sh tid=562
findssu.sh ssu.fa local
findssu.sh ssu.fa ref16s=custom_16S.tsv ref18s=custom_18S.tsv

By default, FindSSU sends sequences to a remote classification server. In call mode, gene-calling is performed locally and only the extracted SSU sequences are sent to the server, minimizing network traffic. Lookup by name or TaxID and literal sequence queries also work through the server by default. Use local to load the reference database locally and skip the server entirely.

Parameters

Input Parameters

in=<file>: Input file(s). Loose filenames are also accepted as positional arguments. Accepts FASTA, FASTQ, and gzipped variants.
ref=<file>: Pre-built combined SSU+ITS DDL reference file (TSV format). Default: resources/ssuSketchDDL.tsv.gz. Only needed in local mode.
ref16s=<file>: Separate 16S reference file. Overrides the default combined reference for 16S sequences.
ref18s=<file>: Separate 18S reference file. Overrides the default combined reference for 18S sequences.
refits=<file>: Separate ITS reference file. Default: resources/itsSketchDDL.tsv.gz if present. Use to provide a custom ITS DDL reference database.
qf=<file>: Pre-built DDL query file for batch comparison. Queries are loaded from this file instead of raw sequences.

Mode Parameters

call: Enable gene-calling mode. Input is treated as genomic sequence; SSU genes are found using the PGM model, then each is classified individually. In client mode, gene-calling runs locally and only the extracted SSUs are sent to the server.
literal=<seq>: Provide a query sequence directly on the command line instead of from a file. Works via server (default) or locally. Example: literal=GATGAACGCTGGCGG...
name=<name>: Look up a reference by organism name instead of comparing sequences. Accepts full names (name=Escherichia_coli), abbreviated genus.species (name=E.coli), or partial prefix matches. Outputs TID, Type, Name, and Sequence. Works via server (default) or locally. Combine with type filter flags (its, 16s, etc.) to restrict results.
tid=<int>: Look up a reference by NCBI TaxID (e.g. tid=562). Outputs TID, Type, Name, and Sequence. Works via server (default) or locally. Combine with type filter flags to restrict results by ribo type.
its: In lookup mode, return only ITS records. Example: findssu.sh name=Saccharomyces_cerevisiae its returns only ITS records for that organism. Combinable with 16s: its 16s returns both ITS and 16S records.
16s: In lookup mode, return only 16S records. Combinable with other type flags.
18s: In lookup mode, return only 18S records. Combinable with other type flags.
ssu: In lookup mode, return only SSU records (matches both 16S and 18S). Equivalent to specifying 16s 18s. Combinable with its.
local: Force local mode. Loads the reference database and performs all classification locally without contacting the server. Requires the SSU DDL reference file in resources/.
server=t: Use the JGI SSU server (default). Equivalent to local=f.
address=<url>: Override the default server address for client mode. Example: address=http://myserver:3070/.
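
The way the type filter flags combine can be sketched in a few lines of Python. This is only an illustration of the documented behavior (ssu is equivalent to 16s 18s, flags are combinable, and a lookup with no type flag returns all record types); the function name resolve_types is hypothetical and not part of FindSSU.

```python
# Hypothetical helper for illustration; not part of FindSSU itself.
def resolve_types(flags):
    """Map lookup-mode type flags to the set of record types returned."""
    expansions = {
        "16s": {"16S"},
        "18s": {"18S"},
        "its": {"ITS"},
        "ssu": {"16S", "18S"},  # documented as equivalent to "16s 18s"
    }
    selected = set()
    for flag in flags:
        selected |= expansions[flag.lower()]
    # With no type flag, a lookup returns every record type (SSU and ITS).
    return selected or {"16S", "18S", "ITS"}


print(sorted(resolve_types(["its", "16s"])))
print(sorted(resolve_types(["ssu", "its"])))
```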

Comparison Parameters

records=5: Maximum hits to display per query sequence.
minhits=8: Minimum shared index keys required to compare a reference. Lower values increase sensitivity but slow down the search.
buffer=0: Alignment buffer size. After index filtering, the top max(buffer, 20+2*records) candidates are aligned, then re-sorted by alignment ANI. This bounds alignment cost while keeping enough candidates that the best match by ANI is captured.
index=t: Use inverted index for query acceleration. Disabling this forces brute-force comparison against all references.
align=t: Perform SSU/ITS alignment for ANI calculation. When enabled, top candidates are aligned using QuantumAligner for precise ANI scores.
banself=f: Skip self-comparisons when query and reference share a TaxID. Useful for leave-one-out evaluation.
k=19: K-mer length for hashing. Must match the reference database.
buckets=128: Number of DDL buckets. Must match the reference database.
exponent=4: DDL exponent bits. Must match the reference database.
t=auto: Number of threads. Defaults to all available cores. Higher values improve speed for batch queries in local mode.
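
As a worked example of how buffer and records interact, the following Python snippet evaluates the max(buffer, 20+2*records) rule for a few settings. The function name is made up for illustration; only the formula comes from the parameter description.

```python
def alignment_candidates(buffer=0, records=5):
    """Number of top candidates that get aligned: max(buffer, 20 + 2*records)."""
    return max(buffer, 20 + 2 * records)


print(alignment_candidates())             # defaults -> 30
print(alignment_candidates(records=20))   # more output rows -> 60
print(alignment_candidates(buffer=100))   # explicit buffer wins -> 100
```

With the defaults, 30 candidates are aligned per query to produce 5 output records.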

Output Parameters

format=tab: Output format. Options: tab (tab-delimited, default), json (JSON array).
sequence=f: Print the SSU or ITS sequence as the last output column. In lookup mode (name= or tid=), the sequence is always included in output.
rank=f: Print rank column in output.
lineage=f: Print full taxonomic lineage column in output.
printname=t: Show the Name column in output.
printtid=t: Show the TID column in output.
loud=f: Print detailed subphase timing and internal configuration. Useful for benchmarking and debugging.

Java Parameters

-Xmx: Set Java's memory usage, overriding autodetection. Example: -Xmx20g for 20 GB. The max is typically 85% of physical memory. Default is 3200m for client mode.
-eoom: Exit if an out-of-memory exception occurs. Requires Java 8u92+.
-da: Disable assertions.

Output Columns

Default tab-delimited output includes the following columns:

ANI: Average Nucleotide Identity from SSU/ITS alignment (0–1 scale). The primary accuracy metric.
WKID: Weighted Kmer Identity from DDL sketch comparison (0–1 scale).
Matches: Number of shared DDL index keys between query and reference.
Type: Reference type: 16S, 18S, or ITS.
qLen / rLen: Query and reference SSU/ITS sequence lengths in bases.
TID: NCBI Taxonomy ID of the reference organism.
Name: Reference organism name from the TaxTree.
File / Contig / Start / Strand: Query source information: input filename, contig name, SSU/ITS start position, and strand (+/-).

In lookup mode (name= or tid=), the output columns are: TID, Type, Name, [Lineage], Sequence.

Resource Files

Required for local mode (automatically loaded from BBTools/resources/). Download missing files from SourceForge.

ssuSketchDDL.tsv.gz: SSU DDL reference sketches (276,000 organisms).
itsSketchDDL.tsv.gz: ITS DDL reference sketches (35,000 organisms).
all_prok_16S_best_taxsorted.fa.gz: 16S rRNA sequences for alignment-based ANI calculation.
all_euk_18S_best_taxsorted.fa.gz: 18S rRNA sequences for alignment-based ANI calculation.
all_ITS_best_taxsorted.fa.gz: ITS sequences for alignment-based ANI calculation.
16S_consensus_sequence.fa / 18S_consensus_sequence.fa: Primary consensus sequences used for fast 16S/18S type classification. Included with BBTools.
ITS_*_consensus_sequence.fq: ITS consensus sequences for fungi, plants, animals, and other eukaryotes. Used when a sequence does not confidently align to SSU consensus. Included with BBTools.
model.pgm: PGM gene-calling model for call mode. Included with BBTools.

How It Works

FindSSU identifies organisms by comparing ribosomal SSU and ITS sequences against a reference database of approximately 312,000 sequences (roughly 276,000 SSU across 16S and 18S, and 35,000 ITS). The pipeline has four stages:

1. Type classification: Each query is aligned against 16S and 18S consensus sequences. Sequences with >64% ANI to SSU consensus are classified as 16S or 18S. Sequences with <56% ANI to all SSU consensuses (including per-clade variants) are classified as ITS. Ambiguous sequences are aligned against all reference types.
2. Sketching: The query is hashed into a compact DynamicDemiLog (DDL) sketch, a fixed-size signature of 128 buckets.
3. Index lookup: An inverted index identifies candidate references sharing at least minhits sketch keys with the query, reducing the search space from hundreds of thousands to typically a few dozen candidates.
4. Alignment and ranking: Top candidates are aligned using QuantumAligner for precise ANI scores, then results are re-sorted by alignment ANI.

In client mode (the default), the query sketch is sent to the server, which performs steps 3–4. In call mode, gene-calling extracts SSU genes from genomic input before entering the pipeline; the gene-calling itself runs locally even in server mode. Lookup by name, TaxID, or literal sequence also routes through the server in client mode.
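
The thresholds in the type-classification stage can be sketched as a small decision function. This is a simplified illustration, not the FindSSU implementation: it assumes the two inputs are the best ANI values (0-1 scale) against the 16S and 18S consensus sequences, it picks whichever SSU type scores higher, and it leaves out the per-clade consensus variants.

```python
SSU_FLOOR = 0.64    # above this ANI to an SSU consensus: classified as 16S or 18S
ITS_CEILING = 0.56  # below this ANI to every SSU consensus: classified as ITS


def classify_type(ani_16s, ani_18s):
    """Return '16S', '18S', 'ITS', or 'ambiguous' from consensus ANIs."""
    best = max(ani_16s, ani_18s)
    if best > SSU_FLOOR:
        # Assumption: the SSU type with the higher consensus ANI wins.
        return "16S" if ani_16s >= ani_18s else "18S"
    if best < ITS_CEILING:
        return "ITS"
    # Between the two thresholds: compare against all reference types.
    return "ambiguous"


print(classify_type(0.92, 0.61))
print(classify_type(0.41, 0.48))
print(classify_type(0.58, 0.60))
```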

ITS Support

FindSSU classifies ITS (Internal Transcribed Spacer) sequences alongside 16S and 18S SSU. ITS sequences are the non-coding spacer regions between ribosomal subunit genes. They are widely used for fungal, plant, and other eukaryotic taxonomy because they evolve faster than SSU and provide finer species-level resolution.

Classification works by a two-pass consensus alignment. First, the query is aligned to universal 16S and 18S consensus sequences. If the best SSU alignment is below the ITS ceiling (56% ANI), the sequence is classified as ITS and compared against ITS references only. This avoids misclassification of non-ribosomal sequences and keeps the ITS search to the relevant reference subset.

The ITS reference database includes 35,000 sequences spanning fungi (the dominant group in standard ITS databases), plants, animals, and other eukaryotes. Multiple ITS consensus sequences are maintained per clade to handle the high variability of ITS regions across kingdoms.

ITS Usage Examples

# Classify an ITS sequence against the full reference database
findssu.sh its_sequences.fa

# Fetch the ITS sequence for a fungus by name
findssu.sh name=Saccharomyces_cerevisiae its

# Fetch both ITS and 18S records for a taxon by TaxID
findssu.sh tid=4932 its 18s

# Force local classification using a custom ITS reference
findssu.sh its_sequences.fa refits=custom_its.tsv local

Sequence Fetching

The name= and tid= modes retrieve reference SSU or ITS sequences from the database rather than classifying a query. This works in both server mode (default) and local mode. The output includes the TaxID, sequence type, organism name, and the actual sequence.

The literal= mode classifies a raw sequence given directly on the command line, also supporting server mode. This is useful for scripting and quick lookups without creating a temporary file.

Sequence Fetching Examples

# Look up the 16S sequence for E. coli (server mode)
findssu.sh name=Escherichia_coli 16s

# Look up by NCBI TaxID (server mode)
findssu.sh tid=562

# Look up with abbreviated name
findssu.sh name=E.coli

# Classify a literal sequence without creating a file
findssu.sh literal=AGAGTTTGATCCTGGCTCAG...

# Fetch all records (SSU and ITS) for a yeast by TaxID
findssu.sh tid=4932
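
For scripting, the same commands can be assembled and run from Python. The sketch below only uses options shown in this document (positional inputs, name=, tid=, literal=, and bare flags); the helper names findssu_args and run_findssu are hypothetical, and it assumes findssu.sh is on the PATH.

```python
import subprocess


def findssu_args(*inputs, name=None, tid=None, literal=None, flags=()):
    """Build an argument list for findssu.sh from the documented options."""
    args = ["findssu.sh", *inputs]
    if name is not None:
        args.append(f"name={name}")
    if tid is not None:
        args.append(f"tid={int(tid)}")
    if literal is not None:
        args.append(f"literal={literal}")
    args.extend(flags)  # bare flags such as "16s", "its", "local"
    return args


def run_findssu(args):
    """Run the command and return its stdout. Assumes findssu.sh is on PATH."""
    result = subprocess.run(args, capture_output=True, text=True, check=True)
    return result.stdout


print(findssu_args(name="Escherichia_coli", flags=["16s"]))
```

Passing the arguments as a list avoids shell quoting problems when the name or sequence comes from user input.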

DDL Sketching

DynamicDemiLog is a sketch data structure that reduces a sequence to a fixed-size probabilistic summary. Each sequence is hashed with k-mers of length k, and the hashes are distributed across 128 buckets. Each bucket stores a compressed count using a floating-point-like encoding: 4 exponent bits and 12 mantissa bits (16 bits per bucket), for a total sketch size of 256 bytes.
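
The 16-bit bucket layout and the 256-byte sketch size can be checked with a short Python sketch. The bit order used here (exponent in the high bits, mantissa in the low bits) is an assumption made for illustration; the text above only specifies the bit counts.

```python
EXP_BITS = 4
MANTISSA_BITS = 16 - EXP_BITS  # 12
BUCKETS = 128


def pack(exponent, mantissa):
    """Pack (exponent, mantissa) into one 16-bit bucket value.

    Assumed layout: exponent in the top 4 bits, mantissa in the low 12.
    """
    if not 0 <= exponent < (1 << EXP_BITS):
        raise ValueError("exponent out of range")
    if not 0 <= mantissa < (1 << MANTISSA_BITS):
        raise ValueError("mantissa out of range")
    return (exponent << MANTISSA_BITS) | mantissa


def unpack(value):
    """Split a 16-bit bucket value back into (exponent, mantissa)."""
    return value >> MANTISSA_BITS, value & ((1 << MANTISSA_BITS) - 1)


sketch_bytes = BUCKETS * 16 // 8
print(sketch_bytes)          # 128 buckets x 2 bytes = 256
print(unpack(pack(3, 2047)))
```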

Two sketches are compared by computing Weighted Kmer Identity (WKID) across all buckets. WKID measures the fraction of shared k-mer content between two sequences and correlates with ANI, though it systematically overestimates ANI by approximately 1.2% for SSU data due to conservation clustering in ribosomal genes. The alignment step corrects for this.

Why 4 Exponent Bits

DDL bucket values use a floating-point encoding with E exponent bits and (16−E) mantissa bits. Analysis of all 276,772 reference sketches showed that with E=5, the top exponent bit is set only 0.044% of the time, which makes it essentially a wasted bit for SSU-length sequences (~1,500 bp). Reducing to E=4 reclaims that bit as mantissa, improving count precision. In benchmarks, E=4 took 37% less time than E=5 for indexed all-to-all comparison (447s vs 711s, 276k×276k, 32 threads) because the improved precision produces more distinct values per bucket, leading to shorter posting lists in the inverted index.
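
The benchmark figures and the precision gain can be reproduced with simple arithmetic. The snippet below uses only the numbers quoted above (447 s and 711 s) plus the 16-bit bucket width.

```python
t_e4, t_e5 = 447.0, 711.0      # seconds, from the benchmark above

time_saved = 1 - t_e4 / t_e5   # fraction of runtime removed by E=4
speedup = t_e5 / t_e4          # equivalent throughput ratio
print(f"{time_saved:.1%} less time, {speedup:.2f}x speedup")
# -> 37.1% less time, 1.59x speedup

# Distinct mantissa values per bucket for each exponent width.
for e in (5, 4):
    print(f"E={e}: {2 ** (16 - e)} mantissa values")
# -> E=5: 2048 mantissa values
# -> E=4: 4096 mantissa values
```

Moving one bit from the exponent to the mantissa doubles the number of distinct mantissa values, which is the source of the shorter posting lists.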

Inverted Index

The inverted index is a three-dimensional structure: int[buckets][values][], mapping each (bucket, value) pair to a list of reference IDs that share that signature. With 128 buckets and 16-bit values, the index has up to 128 × 65,536 = 8.4 million cells, of which roughly 15.7% are populated for the default k=19 reference database.

For a query, the index is probed at each of the 128 buckets using the query's bucket values. Each probe returns a posting list of reference IDs. References appearing in at least minhits posting lists are selected as comparison candidates. With the default minhits=8, this typically reduces the search space from hundreds of thousands of references to a few dozen candidates per query.
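
A minimal Python sketch of the build and probe steps follows. It swaps the int[buckets][values][] array for a dictionary keyed by (bucket, value), and it assumes a bucket value of 0 means "empty" and is not indexed; both choices are for illustration only.

```python
from collections import Counter, defaultdict


def build_index(ref_sketches):
    """Map (bucket, value) -> list of reference IDs with that bucket value."""
    index = defaultdict(list)
    for ref_id, sketch in enumerate(ref_sketches):
        for bucket, value in enumerate(sketch):
            if value != 0:  # assumption: 0 marks an empty bucket
                index[(bucket, value)].append(ref_id)
    return index


def find_candidates(index, query_sketch, minhits=8):
    """Return {ref_id: shared_keys} for references with >= minhits shared keys."""
    hits = Counter()
    for bucket, value in enumerate(query_sketch):
        for ref_id in index.get((bucket, value), ()):
            hits[ref_id] += 1  # one posting-list hit per matching bucket
    return {ref_id: n for ref_id, n in hits.items() if n >= minhits}


# Toy sketches with 4 buckets instead of 128.
refs = [[1, 2, 3, 4], [1, 2, 9, 9], [7, 7, 7, 7]]
index = build_index(refs)
print(find_candidates(index, [1, 2, 3, 5], minhits=2))
```

Only references that appear in a probed posting list are ever touched, so the cost per query scales with posting-list length rather than with database size.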

Soft Minhits Fallback

If the initial pass with minhits=8 yields fewer candidates than the buffer size, a second pass sweeps references with at least 1 shared key that were not compared in the first pass. This ensures sensitivity for divergent or unusual queries without slowing down the common case.
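
The fallback logic can be sketched as follows. The function name and the hit_counts input (reference ID mapped to its number of shared index keys) are hypothetical, and the real implementation may order or score the second-pass references differently.

```python
def select_with_fallback(hit_counts, minhits=8, buffer_size=30):
    """Pick comparison candidates, widening the net if too few pass minhits.

    hit_counts maps reference ID -> number of shared index keys.
    """
    first_pass = [r for r, n in hit_counts.items() if n >= minhits]
    if len(first_pass) >= buffer_size:
        return first_pass  # common case: no second pass needed
    # Second pass: references with at least 1 shared key that were skipped.
    second_pass = [r for r, n in hit_counts.items() if 1 <= n < minhits]
    return first_pass + second_pass


print(select_with_fallback({1: 9, 2: 3, 3: 0}, buffer_size=2))
```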

Why k=19 and minhits=8

A systematic sweep of k=13, 15, 17, 19, 21, and 25 was performed with posting list histograms at multiple minhits thresholds. At k=19 with minhits≥8, exactly 138 out of 276,772 queries (0.05%) find fewer than 5 reference hits through the index — 99.95% coverage. Lower k values produce longer posting lists (more collisions, slower lookups); higher k values reduce sensitivity for divergent pairs. k=19 with minhits=8 was the empirically optimal balance of speed and sensitivity for this reference set.

Alignment Pipeline

After index filtering, candidates are scored by a composite metric: WKID × √matches. A heap collects the top max(buffer, 20 + 2×records) candidates by this composite score. These candidates are then aligned against the query using QuantumAligner, which computes true SSU or ITS alignment ANI.

After alignment, results are re-sorted by a combined score: 1000 × ANI + composite. This ensures that the final ranking is dominated by alignment ANI (the more accurate metric) with the composite score breaking ties. The list is then trimmed to the requested number of records.

This two-phase approach bounds alignment cost — at most a few dozen alignments per query instead of hundreds of thousands — while ensuring that the best match by true ANI is captured. The buffer is deliberately oversized relative to the output to account for cases where the best-by-ANI match is not the best-by-sketch match.
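
The two phases can be sketched in Python as below. The align callable is a stand-in for QuantumAligner (any function returning ANI on a 0-1 scale), and the function name rank_hits is hypothetical; the scoring formulas are the ones described above.

```python
import heapq
import math


def rank_hits(candidates, align, records=5, buffer=0):
    """candidates: (ref_id, wkid, matches) tuples; align(ref_id) -> ANI in [0, 1]."""
    n = max(buffer, 20 + 2 * records)
    # Phase 1: keep the top n candidates by composite = WKID * sqrt(matches).
    scored = [(wkid * math.sqrt(matches), ref_id)
              for ref_id, wkid, matches in candidates]
    top = heapq.nlargest(n, scored)
    # Phase 2: align only those, then re-sort by 1000 * ANI + composite.
    aligned = [(1000 * align(ref_id) + composite, ref_id)
               for composite, ref_id in top]
    aligned.sort(reverse=True)
    return [ref_id for _, ref_id in aligned[:records]]


# "a" has the best sketch score, but "b" has the best alignment ANI.
cands = [("a", 0.99, 100), ("b", 0.95, 90), ("c", 0.90, 50)]
ani = {"a": 0.97, "b": 0.99, "c": 0.80}
print(rank_hits(cands, ani.get, records=2))  # -> ['b', 'a']
```

The toy data shows why the buffer is oversized: reference "b" ranks second by sketch score but first after alignment.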

Related Tools

QuickClade — Genome-level taxonomic classification using whole-genome DDL sketches. Use FindSSU for amplicon or SSU/ITS gene sequences; use QuickClade for assembled genomes.
BBSketch — General-purpose DDL sketching tool for building custom reference databases or sketching arbitrary sequences.