QuickClade

Script: quickclade.sh
Package: bin
Class: CladeSearcher.java

Assigns taxonomy to query sequences by comparing k-mer frequencies to those in a reference database. QuickClade was developed for taxonomic assignment of metagenomic bins, but it can also run on a per-sequence basis. It uses a hierarchical k-mer comparison (3-mers, 4-mers, and 5-mers) with an early-exit optimization, and runs with a fixed 4GB memory allocation that does not scale with input size. Accuracy declines for incomplete genomes, and the recommended minimum sequence length is not yet known. In the results, lower values of k5dif are more likely to be correct to a lower taxonomic level; k5dif represents the sum of the absolute values of the differences between the 5-mer frequency spectra, so the range is 0-1. Because no marker genes are used, QuickClade should perform similarly for any clade in the reference dataset. While the default reference is taxonomically labeled, any set of sequences can serve as a reference, with or without taxonomic labels.

Basic Usage

quickclade.sh query1.fa query2.fa query3.fa
quickclade.sh bins
quickclade.sh contigs.fa percontig out=results.tsv usetree

QuickClade accepts multiple query files or directories as input, and can process them either as individual files or on a per-contig basis for more granular taxonomic assignment.

Parameters

Parameters are grouped by their function in the taxonomic assignment process. Every parameter listed in the shell script is covered below.

File Parameters

in=<file,file>: Query files or directories. Loose file or directory names are also permitted. Input can be fasta, fastq, or spectra files; spectra files are made by cladeloader.sh.
ref=<file,file>: Reference files; the current default is: /clusterfs/jgi/groups/gentech/homes/bbushnell/clade/refseq_main.spectra.gz. It is plaintext, human-readable, and pretty small.
out=stdout: Set to a file to redirect output. Only the query results will be written here; progress messages will still go to stderr.

Basic Parameters

percontig: Run one query per contig instead of per file. This enables more detailed taxonomic assignment for individual contigs within multi-contig files.
minlen=0: Ignore sequences shorter than this in percontig mode. Helps filter out very short contigs that may not provide reliable taxonomic signal.
hits=1: Print this many top hits per query. Increasing this value provides multiple potential taxonomic assignments for each query.
steps=7: Only search up to this many GC intervals (of 0.01) away from the query GC. Limits search space by GC content similarity for faster processing.
oneline: Print results one line per query, tab-delimited. Provides machine-readable output format for downstream processing.
callssu=f: Call 16S and 18S for alignment to reference SSU. This will affect the top hit ordering only if hits>1. Enhances taxonomic assignment accuracy by incorporating ribosomal RNA information.
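
The steps parameter can be pictured as a window over GC buckets. The following is a minimal sketch only, assuming references are grouped into GC buckets of width 0.01; the function names and the bucketing scheme are hypothetical and are not taken from QuickClade's source.

```python
def gc_fraction(seq: str) -> float:
    """Fraction of G/C among unambiguous bases (A, C, G, T)."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    total = gc + at
    return gc / total if total else 0.0


def gc_buckets_to_search(query_gc: float, steps: int = 7, width: float = 0.01) -> list:
    """Return the bucket indices within `steps` intervals of the query's bucket.

    Assumption: bucket i holds references whose GC rounds to i * width.
    """
    n_buckets = round(1.0 / width)
    center = round(query_gc / width)
    lo = max(0, center - steps)
    hi = min(n_buckets, center + steps)
    return list(range(lo, hi + 1))
```

Under these assumptions, a query at 50% GC with steps=7 would only be compared against references between roughly 43% and 57% GC, which is why a smaller steps value is faster.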

Advanced Parameters (mainly for benchmarking)

printmetrics: Output accuracy statistics; mainly useful for labeled data. Labeled data should have 'tid_1234' or similar in the header. Works best with 'usetree'. Provides detailed performance metrics for evaluation purposes.
printqtid: Print query TaxID. Useful for evaluation when query sequences have known taxonomic identifiers.
banself: Ignore records with the same TaxID as the query. Makes the program behave like that organism is not in the reference. Useful for leave-one-out cross-validation studies.
simd: Use vector instructions to accelerate comparisons. Enables SIMD (Single Instruction, Multiple Data) optimizations for faster k-mer frequency comparisons.
maxk=5: Can be set to 4 or 3 to restrict kmer frequency comparisons to smaller kmers. This may improve accuracy for small sequences/bins, but slightly reduces accuracy for large sequences/bins.
ccm=1.0: Threshold for using pentamers; lower is faster. Controls when 5-mer comparisons are performed based on preliminary screening results.
ccm2=1.5: Threshold for using tetramers. Controls when 4-mer comparisons are performed, providing a balance between speed and accuracy.
gcdif=0.07: Initial maximum GC difference. Sets the initial tolerance for GC content differences between query and reference sequences.
strdif=0.10: Initial maximum strandedness difference. Controls tolerance for differences in strand bias between sequences.
gcmult=0.5: Max GC difference as a fraction of best 5-mer difference. Dynamically adjusts GC tolerance based on k-mer similarity.
strmult=1.2: Max strandedness difference as a fraction of best 5-mer diff. Dynamically adjusts strand bias tolerance based on k-mer similarity.
ee=t: Early exit; increases speed. Enables early termination of comparisons when sufficient confidence is achieved, improving processing speed.
entropy: Calculate entropy for queries. Slow; negligible utility. Computes sequence complexity metrics that have minimal impact on classification accuracy.
heap=1: Number of intermediate comparisons to store. Controls memory usage for tracking top candidate matches during processing.
usetree: Load a taxonomic tree for better grading for labeled data. Enables phylogenetically-aware evaluation metrics and improves accuracy assessment.
aligner=quantum: Options include ssa2, glocal, drifting, banded, crosscut. Specifies the alignment algorithm used for SSU sequence comparison when callssu is enabled.
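
The interaction between gcdif/gcmult and strdif/strmult can be illustrated with a short sketch. It assumes that the initial limit applies until a best 5-mer difference exists, and that afterwards the limit is the smaller of the initial value and the multiplier times that best difference. That combination rule, the function names, and the treatment of strandedness as a single number per sequence are all assumptions for illustration, not QuickClade's confirmed logic.

```python
def tolerance(initial, mult, best_k5dif=None):
    """Maximum allowed difference for a pre-filter.

    Assumption: `initial` applies until a best 5-mer difference is known,
    after which the limit shrinks to mult * best_k5dif (never above initial).
    """
    if best_k5dif is None:
        return initial
    return min(initial, mult * best_k5dif)


def passes_prefilter(query_gc, ref_gc, query_str, ref_str, best_k5dif=None,
                     gcdif=0.07, strdif=0.10, gcmult=0.5, strmult=1.2):
    """Return True if a reference is close enough in GC and strandedness
    to be worth a k-mer comparison (hypothetical helper)."""
    gc_ok = abs(query_gc - ref_gc) <= tolerance(gcdif, gcmult, best_k5dif)
    str_ok = abs(query_str - ref_str) <= tolerance(strdif, strmult, best_k5dif)
    return gc_ok and str_ok
```

In this reading, a good early hit tightens both filters, so later references are rejected more cheaply.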

Distance Metrics

abs: Use absolute difference of kmer frequencies. Computes L1 distance between k-mer frequency vectors.
cos: Use 1-cosine similarity of kmer frequencies. Measures the angular distance between k-mer frequency vectors, emphasizing relative composition over absolute counts.
euc: Use Euclidean distance. Computes L2 distance between k-mer frequency vectors.
hel: Use Hellinger distance. A probabilistic distance metric that treats k-mer frequencies as probability distributions.
abscomp: GC-compensated version of abs (default). Uses absolute difference with GC content normalization to reduce bias from compositional differences.

Note: The distance metric strongly impacts ccm, gcmult, and strmult. Defaults are optimized for abscomp.
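
For reference, the textbook forms of four of these metrics are sketched below on plain frequency vectors. This is an illustration of the standard definitions only; any scaling QuickClade applies to its reported values, and the GC compensation used by abscomp, are not described here and are not reproduced.

```python
import math


def abs_dif(p, q):
    """L1 distance: sum of absolute differences."""
    return sum(abs(a - b) for a, b in zip(p, q))


def cos_dif(p, q):
    """1 - cosine similarity; defined as 1.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(p, q))
    norm_p = math.sqrt(sum(a * a for a in p))
    norm_q = math.sqrt(sum(b * b for b in q))
    if norm_p == 0 or norm_q == 0:
        return 1.0
    return 1.0 - dot / (norm_p * norm_q)


def euc_dif(p, q):
    """L2 (Euclidean) distance."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def hel_dif(p, q):
    """Hellinger distance between two probability distributions."""
    total = sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q))
    return math.sqrt(total) / math.sqrt(2)
```

Because these functions produce values on different scales for the same pair of vectors, thresholds tuned for one metric do not transfer directly to another.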

Examples

Basic Taxonomic Assignment

quickclade.sh query1.fa query2.fa query3.fa

Assigns taxonomy to multiple query files using the default reference database.

Directory Processing

quickclade.sh bins

Processes all sequence files in the 'bins' directory for taxonomic assignment.

Per-Contig Analysis with Output File

quickclade.sh contigs.fa percontig out=results.tsv usetree

Analyzes each contig separately, writes results to results.tsv, and loads the taxonomic tree for enhanced evaluation.

Multiple Hits with Machine-Readable Output

quickclade.sh query.fa hits=5 oneline out=top5_hits.tsv

Reports the top 5 taxonomic matches for each query in a single-line, tab-delimited format suitable for automated processing.

Benchmarking with Metrics

quickclade.sh labeled_queries.fa printmetrics usetree banself out=evaluation.txt

Evaluates performance using labeled queries, excludes self-matches, and outputs detailed accuracy metrics.
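
Benchmarking depends on labeled headers and, with banself, on excluding references that share the query's TaxID. The sketch below shows one way to do both; it handles only the literal 'tid_<digits>' form, and the function names and the (TaxID, name) tuple layout are hypothetical.

```python
import re

# Matches labels such as 'tid_1234' anywhere in a sequence header.
TID_PATTERN = re.compile(r"tid_(\d+)")


def parse_tid(header):
    """Return the TaxID embedded in a header as an int, or None if absent."""
    match = TID_PATTERN.search(header)
    return int(match.group(1)) if match else None


def ban_self(query_tid, references):
    """Drop references whose TaxID equals the query's TaxID.

    `references` is a list of (tid, name) tuples (hypothetical layout).
    Unlabeled queries (query_tid is None) keep every reference.
    """
    if query_tid is None:
        return list(references)
    return [(tid, name) for tid, name in references if tid != query_tid]
```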

Custom Distance Metric

quickclade.sh query.fa cos ccm=0.8 ccm2=1.2

Uses 1-cosine similarity as the distance metric, with adjusted thresholds for 5-mer (ccm) and 4-mer (ccm2) comparisons.

Algorithm Details

K-mer Frequency Profiling

QuickClade employs a hierarchical k-mer frequency comparison algorithm that operates on multiple k-mer sizes (3, 4, and 5-mers by default). The core approach involves:

Multi-scale Analysis: Uses 3-mer, 4-mer, and 5-mer frequencies, stored in canonical form, to capture different levels of genomic signal
Threshold-Based Screening: ccm=1.0 controls progression to 5-mer comparisons, ccm2=1.5 controls 4-mer comparisons
maxk Parameter: Can restrict comparisons to smaller k-mers (maxk=4 or maxk=3) for small sequences/bins
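
The points above can be sketched as follows: canonical k-mer frequencies are computed for k=3 through maxk, and a reference is compared at larger k only if it survives the smaller-k screen. The exact way ccm and ccm2 are applied is an assumption here (each difference is compared against the best difference so far times the threshold); all function names are hypothetical.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical(kmer):
    """Return the lexicographically smaller of a k-mer and its reverse complement."""
    reverse_complement = kmer.translate(COMPLEMENT)[::-1]
    return min(kmer, reverse_complement)


def kmer_frequencies(seq, k):
    """Canonical k-mer frequencies summing to 1; windows with non-ACGT bases are skipped."""
    counts = {}
    seq = seq.upper()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if all(base in "ACGT" for base in kmer):
            key = canonical(kmer)
            counts[key] = counts.get(key, 0) + 1
    total = sum(counts.values())
    return {key: n / total for key, n in counts.items()} if total else {}


def abs_dif(f1, f2):
    """Sum of absolute frequency differences over the union of observed k-mers."""
    keys = f1.keys() | f2.keys()
    return sum(abs(f1.get(key, 0.0) - f2.get(key, 0.0)) for key in keys)


def profile(seq, maxk=5):
    """Frequency spectra for every k from 3 to maxk, keyed by k."""
    return {k: kmer_frequencies(seq, k) for k in range(3, maxk + 1)}


def screened_distance(query, ref, best, ccm=1.0, ccm2=1.5):
    """Return the 5-mer difference, or None if the reference is screened out.

    Assumption: 3-mers gate 4-mers via ccm2, and 4-mers gate 5-mers via ccm.
    """
    if abs_dif(query[3], ref[3]) > best * ccm2:
        return None  # too different at k=3; skip the 4-mer comparison
    if abs_dif(query[4], ref[4]) > best * ccm:
        return None  # too different at k=4; skip the 5-mer comparison
    return abs_dif(query[5], ref[5])
```

Merging each k-mer with its reverse complement leaves 32 canonical 3-mers, 136 canonical 4-mers, and 512 canonical 5-mers, so the cheaper small-k comparisons touch far fewer values than the 5-mer comparison.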

Distance Metric Selection

The choice of distance metric significantly impacts both accuracy and computational requirements:

AbsComp (Default): GC-compensated absolute difference that normalizes for compositional bias while maintaining computational efficiency
Cosine Similarity: Emphasizes relative k-mer composition patterns, useful for sequences with varying coverage or fragmentation
Hellinger Distance: Probabilistic approach that treats k-mer counts as probability distributions, providing robust comparisons

Multi-threaded Architecture

CladeSearcher implements a round-robin thread assignment system:

Thread Count Calculation: Uses Tools.mid(1, Shared.threads(), Tools.min(maxCompareThreads, queries.size()/16)) to determine optimal threads
Round-Robin Assignment: Each thread processes the queries at indices tid, tid+threads, tid+2*threads, and so on, which distributes work evenly
Index Cloning: Each ProcessThread gets index.clone() to avoid synchronization bottlenecks during search
Accumulator Pattern: Uses ThreadWaiter.startAndWait() with accumulate() method to collect thread results
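
The thread count formula and the round-robin index pattern can be mirrored in a few lines. This sketch assumes Tools.mid returns the median of its three arguments and that queries.size()/16 is integer division; the Python names are hypothetical stand-ins for the Java originals.

```python
def mid(a, b, c):
    """Median of three values (assumed behavior of Tools.mid)."""
    return sorted((a, b, c))[1]


def thread_count(available_threads, max_compare_threads, n_queries):
    """At least 1 thread, at most the available threads, and no more than
    one thread per 16 queries or max_compare_threads."""
    return mid(1, available_threads, min(max_compare_threads, n_queries // 16))


def round_robin(n_queries, threads):
    """Query indices handled by each thread: tid, tid+threads, tid+2*threads, ..."""
    return [list(range(tid, n_queries, threads)) for tid in range(threads)]
```

With fewer than 32 queries the formula yields a single thread, so small jobs avoid the cost of cloning the index for workers that would have almost nothing to do.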

Memory Efficiency

The system is designed for minimal memory usage through several strategies:

Streaming Processing: Queries are processed individually rather than loading all into memory simultaneously
Compressed References: Reference spectra files are gzipped plaintext, so they are human-readable once decompressed yet space-efficient on disk
Controlled Heap Usage: The heap parameter limits intermediate comparison storage to prevent memory bloat
Early Exit Optimization: Terminates comparisons early when sufficient confidence is achieved, reducing computational overhead
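
A bounded heap is a common way to keep only a fixed number of best candidates while scanning many references. The sketch below illustrates that general technique with Python's heapq; the TopHits class is hypothetical and is not QuickClade's data structure.

```python
import heapq


class TopHits:
    """Keep the `capacity` candidates with the smallest distance."""

    def __init__(self, capacity=1):
        self.capacity = capacity
        self._heap = []  # stores (-distance, name); root is the worst kept hit

    def offer(self, distance, name):
        """Add a candidate, evicting the current worst if the heap is full."""
        item = (-distance, name)
        if len(self._heap) < self.capacity:
            heapq.heappush(self._heap, item)
        elif item > self._heap[0]:
            heapq.heapreplace(self._heap, item)

    def results(self):
        """Kept candidates as (distance, name), best first."""
        return sorted((-neg, name) for neg, name in self._heap)
```

Memory stays proportional to the capacity rather than to the number of references scanned, which is the property the heap parameter is described as controlling.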

Entropy Adjustment Model

QuickClade incorporates an entropy adjustment model, which is handled as follows:

Automatic Model Loading: Constructor checks if AdjustEntropy.kLoaded!=4 || AdjustEntropy.wLoaded!=150, then calls AdjustEntropy.load(4, 150)
Model Parameters: Uses k=4 and window=150 specifically for entropy calculations and sequence complexity assessment
Setup Integration: The setup() method reloads entropy models if calcCladeEntropy is enabled and models don't match expected parameters

Taxonomic Tree Integration

When usetree is enabled, the system provides enhanced evaluation capabilities:

Phylogenetic Awareness: Considers taxonomic relationships when evaluating classification accuracy
Level-Specific Metrics: Reports accuracy at different taxonomic levels (species, genus, family, etc.)
Weighted Scoring: Provides phylogenetically-weighted accuracy scores that account for evolutionary relationships
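
Level-specific grading can be illustrated by finding the lowest rank that a predicted TaxID and the true TaxID share. The sketch below uses a toy tree with made-up IDs (not real NCBI TaxIDs); the dictionary layout and function names are assumptions for illustration.

```python
def lineage(tid, parent):
    """Path from a node up to the root, following child -> parent links."""
    path = [tid]
    while tid in parent and parent[tid] != tid:
        tid = parent[tid]
        path.append(tid)
    return path


def lowest_common_rank(predicted, truth, parent, rank):
    """Rank of the lowest ancestor shared by both TaxIDs, or None if unrelated."""
    ancestors = set(lineage(predicted, parent))
    for node in lineage(truth, parent):
        if node in ancestors:
            return rank.get(node)
    return None


# Toy tree: two species in genus 10, one species in genus 11, one family 100.
PARENT = {1: 10, 2: 10, 3: 11, 10: 100, 11: 100, 100: 100}
RANK = {1: "species", 2: "species", 3: "species",
        10: "genus", 11: "genus", 100: "family"}
```

In this toy tree, a hit on species 2 for a query from species 1 is wrong at species level but correct at genus level, which is the kind of partial credit a tree-aware evaluation can report.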

Performance Characteristics

QuickClade implements several performance optimizations:

Fixed Memory Usage: Uses parseJavaArgs("--mem=4g", "--mode=fixed") for consistent 4GB allocation regardless of input size
Early Exit Optimization: The ee=t parameter enables termination of comparisons when sufficient confidence is achieved
Compressed References: Reference .spectra.gz files are gzipped plaintext, so they are human-readable once decompressed yet space-efficient on disk
SIMD Support: Optional vector instructions (simd parameter) for hardware acceleration of k-mer frequency comparisons

Support

For questions and support:

Email: bbushnell@lbl.gov
Documentation: bbmap.org