QuickClade

Script: quickclade.sh Package: bin Class: CladeSearcher.java

Assigns taxonomy to query sequences by comparing their k-mer frequencies to those in a reference database. QuickClade was developed for taxonomic assignment of metagenomic bins, but it can also run on a per-sequence basis. It uses a hierarchical k-mer comparison algorithm (3-mers, 4-mers, and 5-mers) with an early-exit optimization and a fixed 4GB memory allocation that does not scale with input size. Accuracy declines for incomplete genomes. The recommended minimum sequence length is not yet known, but lower values of k5dif are more likely to be correct to a lower taxonomic level. k5dif is the sum of the absolute values of the differences between the 5-mer frequency spectra of the query and the reference, and ranges from 0 to 1. Because no marker genes are used, QuickClade should perform similarly for any clade in the reference dataset. The default reference is taxonomically labeled, but any reference can be used, with or without taxonomic labels.
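
To make the k5dif definition concrete, here is a minimal Python sketch. It is not QuickClade's implementation. It assumes each spectrum is normalized to sum to 1 and that the summed absolute difference is halved so that the result falls in the 0-1 range stated above. The function names, the skipping of non-ACGT k-mers, and that scaling are illustrative assumptions; any reverse-complement merging the real tool may perform is not modeled.

```python
from collections import Counter
from itertools import product

def kmer_spectrum(seq, k=5):
    """Return k-mer frequencies (summing to 1) over all 4**k ACGT k-mers."""
    seq = seq.upper()
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    total = sum(counts[m] for m in kmers)  # k-mers with N or other symbols are ignored
    if total == 0:
        return [0.0] * len(kmers)
    return [counts[m] / total for m in kmers]

def abs_dif(a, b):
    """Half the sum of absolute differences (assumed scaling): 0 = identical, 1 = disjoint."""
    return 0.5 * sum(abs(x - y) for x, y in zip(a, b))

query = kmer_spectrum("ACGTACGTTGCAACGTAGCTAGCTAGGATCCA")
reference = kmer_spectrum("TTGACCATGGCATCGATCGGATTACAGGCTA")
print(abs_dif(query, query))      # identical spectra give 0.0
print(abs_dif(query, reference))  # larger values mean less similar composition
```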

Basic Usage

quickclade.sh query1.fa query2.fa query3.fa
quickclade.sh bins
quickclade.sh contigs.fa percontig out=results.tsv usetree

QuickClade accepts multiple query files or directories as input, and can process them either as individual files or on a per-contig basis for more granular taxonomic assignment.

Parameters

Parameters are grouped by their function in the taxonomic assignment process. All parameters from the shell script are documented here.

File Parameters

in=<file,file>
Query files or directories. Loose file or directory names are also permitted. Input can be fasta, fastq, or spectra files; spectra files are made by cladeloader.sh.
ref=<file,file>
Reference files; the current default is: /clusterfs/jgi/groups/gentech/homes/bbushnell/clade/refseq_main.spectra.gz. It is plaintext, human-readable, and pretty small.
out=stdout
Set to a file to redirect output. Only the query results will be written here; progress messages will still go to stderr.

Basic Parameters

percontig
Run one query per contig instead of per file. This enables more detailed taxonomic assignment for individual contigs within multi-contig files.
minlen=0
Ignore sequences shorter than this in percontig mode. Helps filter out very short contigs that may not provide reliable taxonomic signal.
hits=1
Print this many top hits per query. Increasing this value provides multiple potential taxonomic assignments for each query.
steps=7
Only search up to this many GC intervals (of 0.01) away from the query GC. Limits search space by GC content similarity for faster processing.
oneline
Print results one line per query, tab-delimited. Provides machine-readable output format for downstream processing.
callssu=f
Call 16S and 18S for alignment to reference SSU. This will affect the top hit ordering only if hits>1. Enhances taxonomic assignment accuracy by incorporating ribosomal RNA information.
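
The steps parameter above can be pictured as a window of GC bins around the query. The sketch below is an illustration only. It assumes references are grouped into bins of width 0.01 by GC fraction and that bins are visited from nearest to farthest; the function name gc_bins_to_search and the rounding rule are hypothetical and not taken from QuickClade.

```python
def gc_bins_to_search(query_gc, steps=7, bins=100):
    """List GC bin indices (0..bins, each 1/bins wide) within `steps` of the query, nearest first."""
    center = round(query_gc * bins)
    order = [center]
    for d in range(1, steps + 1):
        for b in (center - d, center + d):
            if 0 <= b <= bins:  # stay inside the valid GC range of 0.0 to 1.0
                order.append(b)
    return order

print(gc_bins_to_search(0.50, steps=2))  # [50, 49, 51, 48, 52]
```

With the default steps=7, this window covers at most 15 bins, which is how a smaller value narrows the search and speeds it up.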

Advanced Parameters (mainly for benchmarking)

printmetrics
Output accuracy statistics; mainly useful for labeled data. Labeled data should have 'tid_1234' or similar in the header. Works best with 'usetree'. Provides detailed performance metrics for evaluation purposes.
printqtid
Print query TaxID. Useful for evaluation when query sequences have known taxonomic identifiers.
banself
Ignore records with the same TaxID as the query. Makes the program behave as if that organism were not in the reference. Useful for leave-one-out cross-validation studies.
simd
Use vector instructions to accelerate comparisons. Enables SIMD (Single Instruction, Multiple Data) optimizations for faster k-mer frequency calculations.
maxk=5
Can be set to 4 or 3 to restrict kmer frequency comparisons to smaller kmers. This may improve accuracy for small sequences/bins, but slightly reduces accuracy for large sequences/bins.
ccm=1.0
Threshold for using pentamers; lower is faster. Controls when 5-mer comparisons are performed based on preliminary screening results.
ccm2=1.5
Threshold for using tetramers. Controls when 4-mer comparisons are performed, providing a balance between speed and accuracy.
gcdif=0.07
Initial maximum GC difference. Sets the initial tolerance for GC content differences between query and reference sequences.
strdif=0.10
Initial maximum strandedness difference. Controls tolerance for differences in strand bias between sequences.
gcmult=0.5
Max GC difference as a fraction of best 5-mer difference. Dynamically adjusts GC tolerance based on k-mer similarity.
strmult=1.2
Max strandedness difference as a fraction of best 5-mer diff. Dynamically adjusts strand bias tolerance based on k-mer similarity.
ee=t
Early exit; increases speed. Enables early termination of comparisons when sufficient confidence is achieved, improving processing speed.
entropy
Calculate entropy for queries. Slow; negligible utility. Computes sequence complexity metrics that have minimal impact on classification accuracy.
heap=1
Number of intermediate comparisons to store. Controls memory usage for tracking top candidate matches during processing.
usetree
Load a taxonomic tree for better grading for labeled data. Enables phylogenetically-aware evaluation metrics and improves accuracy assessment.
aligner=quantum
Options include ssa2, glocal, drifting, banded, crosscut. Specifies the alignment algorithm used for SSU sequence comparison when callssu is enabled.
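
One way to read gcdif, gcmult, strdif, and strmult together is as a tolerance that starts at the initial value and tightens as better 5-mer matches are found. The sketch below encodes that reading. Taking the minimum of the two bounds is an assumption about how they combine, and the function name passes_prefilter is hypothetical; the actual logic in CladeSearcher may differ.

```python
def passes_prefilter(gc_delta, str_delta, best_k5dif=None,
                     gcdif=0.07, strdif=0.10, gcmult=0.5, strmult=1.2):
    """Return True if a reference is close enough in GC and strandedness to compare.

    best_k5dif is the best 5-mer difference seen so far (None before any hit).
    """
    max_gc, max_str = gcdif, strdif
    if best_k5dif is not None:
        # Assumed rule: the dynamic bound can tighten, but never loosen, the initial limit.
        max_gc = min(max_gc, gcmult * best_k5dif)
        max_str = min(max_str, strmult * best_k5dif)
    return gc_delta <= max_gc and str_delta <= max_str

print(passes_prefilter(0.05, 0.02))                   # True: within the initial limits
print(passes_prefilter(0.05, 0.02, best_k5dif=0.06))  # False: GC limit tightened to 0.03
```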

Distance Metrics

abs
Use absolute difference of kmer frequencies. Computes L1 distance between k-mer frequency vectors.
cos
Use 1 minus the cosine similarity of kmer frequencies. Measures the angular distance between k-mer frequency vectors, emphasizing relative composition over absolute counts.
euc
Use Euclidean distance. Computes L2 distance between k-mer frequency vectors.
hel
Use Hellinger distance. A probabilistic distance metric that treats k-mer frequencies as probability distributions.
abscomp
GC-compensated version of abs (default). Uses absolute difference with GC content normalization to reduce bias from compositional differences.

Note: The distance metric strongly impacts ccm, gcmult, and strmult. Defaults are optimized for abscomp.
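
For reference, the four uncompensated metrics can be written in a few lines of Python. These are the textbook, unscaled definitions applied to two frequency vectors of equal length; any scaling QuickClade applies is not reproduced, and the GC compensation used by abscomp is omitted because its formula is not described here. The function names are illustrative.

```python
import math

def abs_dist(a, b):
    """L1 distance: sum of absolute differences."""
    return sum(abs(x - y) for x, y in zip(a, b))

def cos_dist(a, b):
    """1 minus the cosine similarity; returns 1.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm if norm > 0 else 1.0

def euc_dist(a, b):
    """L2 (Euclidean) distance."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def hel_dist(a, b):
    """Hellinger distance between two probability distributions."""
    s = sum((math.sqrt(x) - math.sqrt(y)) ** 2 for x, y in zip(a, b))
    return math.sqrt(s) / math.sqrt(2)

p = [0.5, 0.5, 0.0]
q = [0.5, 0.0, 0.5]
for name, fn in [("abs", abs_dist), ("cos", cos_dist), ("euc", euc_dist), ("hel", hel_dist)]:
    print(name, round(fn(p, q), 4))
```

Because the metrics produce values on different scales for the same pair of vectors, thresholds tuned for one metric generally do not transfer to another.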

Examples

Basic Taxonomic Assignment

quickclade.sh query1.fa query2.fa query3.fa

Assigns taxonomy to multiple query files using the default reference database.

Directory Processing

quickclade.sh bins

Processes all sequence files in the 'bins' directory for taxonomic assignment.

Per-Contig Analysis with Output File

quickclade.sh contigs.fa percontig out=results.tsv usetree

Analyzes each contig separately, writes the results to results.tsv, and loads the taxonomic tree for better grading of labeled data.

Multiple Hits with Machine-Readable Output

quickclade.sh query.fa hits=5 oneline out=top5_hits.tsv

Reports the top 5 taxonomic matches for each query in a single-line, tab-delimited format suitable for automated processing.

Benchmarking with Metrics

quickclade.sh labeled_queries.fa printmetrics usetree banself out=evaluation.txt

Evaluates performance using labeled queries, excludes self-matches, and outputs detailed accuracy metrics.
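
Labeled queries for this kind of run carry the TaxID in the sequence header, in the 'tid_1234' style mentioned under printmetrics. As a small sketch for checking that your own headers are labeled before benchmarking, the following Python extracts that number. The function name parse_tid is hypothetical, and only the literal 'tid_' prefix is handled; other label variants that QuickClade may accept are not covered.

```python
import re

# Matches the 'tid_1234' style label; other label formats are not handled here.
TID_PATTERN = re.compile(r"tid_(\d+)")

def parse_tid(header):
    """Return the TaxID embedded in a FASTA header, or None if it is unlabeled."""
    match = TID_PATTERN.search(header)
    return int(match.group(1)) if match else None

print(parse_tid(">tid_1234 example contig"))  # 1234
print(parse_tid(">unlabeled_contig_7"))       # None
```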

Custom Distance Metric

quickclade.sh query.fa cos ccm=0.8 ccm2=1.2

Uses the cosine distance metric (1 minus cosine similarity) with adjusted thresholds for 5-mer (ccm) and 4-mer (ccm2) comparisons.

Algorithm Details

K-mer Frequency Profiling

QuickClade employs a hierarchical k-mer frequency comparison algorithm that operates on multiple k-mer sizes (3, 4, and 5-mers by default). The core approach involves:

Distance Metric Selection

The choice of distance metric significantly impacts both accuracy and computational requirements:

Multi-threaded Architecture

CladeSearcher implements a round-robin thread assignment system:
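
As a general illustration of what round-robin assignment means, and not a reproduction of CladeSearcher's code, the sketch below deals work items out to workers in turn, so item i goes to worker i modulo the worker count. What the real program distributes (queries, references, or something else) is not specified here; the item names and the function round_robin are hypothetical.

```python
def round_robin(items, n_threads):
    """Assign item i to worker i % n_threads; returns one list per worker."""
    buckets = [[] for _ in range(n_threads)]
    for i, item in enumerate(items):
        buckets[i % n_threads].append(item)
    return buckets

print(round_robin(["q0", "q1", "q2", "q3", "q4"], 2))
# [['q0', 'q2', 'q4'], ['q1', 'q3']]
```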

Memory Efficiency

The system is designed for minimal memory usage through several strategies:

Entropy Adjustment Model

QuickClade incorporates an entropy adjustment system that:

Taxonomic Tree Integration

When usetree is enabled, the system provides enhanced evaluation capabilities:

Performance Characteristics

QuickClade implements several performance optimizations:

Support

For questions and support: