FindSSU

Script: findssu.sh Package: ddl Class: SSUCompare.java

Identifies and classifies ribosomal SSU (16S/18S) and ITS sequences using DynamicDemiLog (DDL) sketching against a pre-built reference database of 312,000 sequences (221,000 prokaryotic 16S, 55,000 eukaryotic 18S, and 35,000 fungal ITS). Query type is determined automatically: sequences aligning with >64% ANI to SSU consensus are classified as 16S or 18S; those aligning with <56% ANI are classified as ITS; others are compared against all types. By default, FindSSU operates in client mode and sends queries to a remote server for classification, requiring no local database. Use the local flag to run entirely locally.

Also available as a web service for browser-based access.

Basic Usage

findssu.sh ssu1.fa ssu2.fa
findssu.sh genome.fa call
findssu.sh literal=ACGTACGT...
findssu.sh name=Escherichia_coli
findssu.sh name=Saccharomyces_cerevisiae its
findssu.sh tid=562
findssu.sh ssu.fa local
findssu.sh ssu.fa ref16s=custom_16S.tsv ref18s=custom_18S.tsv

By default, FindSSU sends sequences to a remote classification server. In call mode, gene-calling is performed locally and only the extracted SSU sequences are sent to the server, minimizing network traffic. Use local to load the reference database locally and skip the server entirely.

Parameters

Input/Output Parameters

in=<file>
Input file(s). Loose filenames are also accepted as positional arguments. Accepts FASTA, FASTQ, and gzipped variants.
ref=<file>
Pre-built SSU DDL reference file (TSV format). Default: resources/ssuSketchDDL.tsv.gz. Only needed in local mode.
ref16s=<file>
Separate 16S reference file. Overrides the default combined reference for 16S sequences.
ref18s=<file>
Separate 18S reference file. Overrides the default combined reference for 18S sequences.
qf=<file>
Pre-built DDL query file for batch comparison. Queries are loaded from this file instead of raw sequences.

Mode Parameters

call
Enable gene-calling mode. Input is treated as genomic sequence; SSU genes are found via PGM model, then each is individually classified. In client mode, gene-calling runs locally and only extracted SSUs are sent to the server.
literal=<seq>
Provide a query sequence directly on the command line instead of from a file. Example: literal=GATGAACGCTGGCGG...
name=<name>
Look up a reference by organism name instead of comparing sequences. Accepts full names (name=Escherichia_coli), abbreviated genus.species (name=E.coli), or partial prefix matches. Outputs TID, Type, Name, and Sequence. Works via server (default) or locally.
tid=<int>
Look up a reference by NCBI TaxID (e.g. tid=562). Outputs TID, Type, Name, and Sequence. Works via server (default) or locally.
its / 16s / 18s / ssu
In lookup mode, return only records of the specified type. Example: findssu.sh name=Saccharomyces_cerevisiae its returns only ITS records. ssu matches both 16S and 18S. Flags are combinable and work in both local and server mode. Over HTTP, use bare prefixes (//its) or key-value (//rtype=its).
local
Force local mode. Loads the reference database and performs all classification locally without contacting the server. Requires the SSU DDL reference file in resources/.
address=<url>
Override the default server address for client mode. Example: address=http://myserver:3070/.

Comparison Parameters

records=5
Maximum hits to display per query sequence.
minhits=8
Minimum shared index keys required to compare a reference. Lower values increase sensitivity but slow down the search.
buffer=0
Alignment buffer size. After index filtering, the top max(buffer, 20+2*records) candidates are aligned, then re-sorted by alignment ANI. Bounds alignment cost while ensuring the best match is captured.
index=t
Use inverted index for query acceleration. Disabling this forces brute-force comparison against all references.
align=t
Perform SSU alignment for ANI calculation. When enabled, top candidates are aligned using QuantumAligner for precise ANI scores.
banself=f
Skip self-comparisons when query and reference share a TaxID. Useful for leave-one-out evaluation.
k=19
K-mer length for hashing. Must match the reference database.
buckets=128
Number of DDL buckets. Must match the reference database.
exponent=4
DDL exponent bits. Must match the reference database.
t=auto
Number of threads. Defaults to all available cores. Higher values improve speed for batch queries in local mode.

Output Parameters

format=tab
Output format. Options: tab (tab-delimited, default), json (JSON array).
sequence=f
Print SSU sequence as the last output column.
rank=f
Print rank column in output.
lineage=f
Print full taxonomic lineage column in output.
printname=t
Show the Name column in output.
printtid=t
Show the TID column in output.
loud=f
Print detailed subphase timing and internal configuration. Useful for benchmarking and debugging.

Java Parameters

-Xmx
Set Java's memory usage, overriding autodetection. Example: -Xmx20g for 20 GB. The max is typically 85% of physical memory. Default is 3200m for client mode.
-eoom
Exit if an out-of-memory exception occurs. Requires Java 8u92+.
-da
Disable assertions.

Output Columns

Default tab-delimited output includes the following columns:

ANI
Average Nucleotide Identity from SSU alignment (0-1 scale). The primary accuracy metric.
WKID
Weighted Kmer Identity from DDL sketch comparison (0-1 scale).
Matches
Number of shared DDL index keys between query and reference.
Type
Reference type: 16S, 18S, or ITS.
qLen / rLen
Query and reference SSU sequence lengths in bases.
TID
NCBI Taxonomy ID of the reference organism.
Name
Reference organism name from the TaxTree.
File / Contig / Start / Strand
Query source information: input filename, contig name, SSU start position, and strand (+/-).

Resource Files

Required for local mode (automatically loaded from BBTools/resources/):

ssuSketchDDL.tsv.gz
Combined SSU+ITS DDL reference sketches (312,000 sequences, ~68 MB). Download from SourceForge if missing.
all_prok_16S_best_taxsorted.fa.gz
16S rRNA sequences for alignment-based ANI calculation.
all_euk_18S_best_taxsorted.fa.gz
18S rRNA sequences for alignment-based ANI calculation.
all_ITS_best_taxsorted.fa.gz
ITS sequences for alignment-based ANI calculation.
16S_consensus_sequence.fa / 18S_consensus_sequence.fa
Consensus sequences for 16S/18S type classification. Included with BBTools.
model.pgm
PGM gene-calling model for call mode. Included with BBTools.

How It Works

FindSSU identifies organisms by comparing ribosomal SSU and ITS sequences against a reference database of 312,000 sequences (221,000 prokaryotic 16S, 55,000 eukaryotic 18S, and 35,000 fungal ITS). The pipeline has four stages:

  1. Type classification: Each query is aligned against 16S and 18S consensus sequences. Sequences with >64% ANI to SSU consensus are classified as 16S or 18S; those with <56% ANI to all SSU consensuses are classified as ITS; others are compared against all reference types.
  2. Sketching: The query is hashed into a compact DynamicDemiLog (DDL) sketch — a fixed-size signature of 128 buckets.
  3. Index lookup: An inverted index identifies candidate references sharing at least minhits sketch keys with the query, reducing the search space from 276,000 to typically a few dozen candidates.
  4. Alignment and ranking: Top candidates are aligned using QuantumAligner for precise ANI scores, then results are re-sorted by alignment ANI.

In client mode (the default), steps 1–2 run locally and the sketch is sent to the server, which performs steps 3–4. In call mode, gene-calling extracts SSU genes from genomic input before entering the pipeline.

DDL Sketching

DynamicDemiLog is a sketch data structure that reduces a sequence to a fixed-size probabilistic summary. Each sequence is hashed with k-mers of length k, and the hashes are distributed across 128 buckets. Each bucket stores a compressed count using a floating-point-like encoding: 4 exponent bits and 12 mantissa bits (16 bits per bucket), for a total sketch size of 256 bytes.

Two sketches are compared by computing Weighted Kmer Identity (WKID) across all buckets. WKID measures the fraction of shared k-mer content between two sequences and correlates with ANI, though it systematically overestimates by approximately 1.2% for SSU data due to conservation clustering in ribosomal genes. The alignment step corrects for this.

Why 4 Exponent Bits

DDL bucket values use a floating-point encoding with E exponent bits and (16−E) mantissa bits. Analysis of all 276,772 reference sketches showed that with E=5, the top exponent bit is set only 0.044% of the time — essentially a wasted bit for SSU-length sequences (~1,500 bp). Reducing to E=4 reclaims that bit as mantissa, improving count precision. In benchmarks, E=4 was 37% faster than E=5 for indexed all-to-all comparison (447s vs 711s, 276k×276k, 32 threads) because the improved precision produces more distinct values per bucket, leading to shorter posting lists in the inverted index.

Inverted Index

The inverted index is a three-dimensional structure: int[buckets][values][], mapping each (bucket, value) pair to a list of reference IDs that share that signature. With 128 buckets and 16-bit values, the index has up to 128 × 65,536 = 8.4 million cells, of which roughly 15.7% are populated for the default k=19 reference database.

For a query, the index is probed at each of the 128 buckets using the query's bucket values. Each probe returns a posting list of reference IDs. References appearing in at least minhits posting lists are selected as comparison candidates. With the default minhits=8, this typically reduces the search space from 276,000 references to a few dozen candidates per query.

Soft Minhits Fallback

If the initial pass with minhits=8 yields fewer candidates than the buffer size, a second pass sweeps references with at least 1 shared key that were not compared in the first pass. This ensures sensitivity for divergent or unusual queries without slowing down the common case.

Why k=19 and minhits=8

A systematic sweep of k=13, 15, 17, 19, 21, and 25 was performed with posting list histograms at multiple minhits thresholds. At k=19 with minhits≥8, exactly 138 out of 276,772 queries (0.05%) find fewer than 5 reference hits through the index — 99.95% coverage. Lower k values produce longer posting lists (more collisions, slower lookups); higher k values reduce sensitivity for divergent pairs. k=19 with minhits=8 was the empirically optimal balance of speed and sensitivity for this reference set.

Alignment Pipeline

After index filtering, candidates are scored by a composite metric: WKID × √matches. A heap collects the top max(buffer, 20 + 2×records) candidates by this composite score. These candidates are then aligned against the query using QuantumAligner, which computes true SSU alignment ANI.

After alignment, results are re-sorted by a combined score: 1000 × ANI + composite. This ensures that the final ranking is dominated by alignment ANI (the more accurate metric) with the composite score breaking ties. The list is then trimmed to the requested number of records.

This two-phase approach bounds alignment cost — at most a few dozen alignments per query instead of 276,000 — while ensuring that the best match by true ANI is captured. The buffer is deliberately oversized relative to the output to account for cases where the best-by-ANI match is not the best-by-sketch match.