QuickClade
Assigns taxonomy to query sequences by comparing k-mer frequencies to those in a reference database. QuickClade was developed for taxonomic assignment of metagenomic bins, but it can also run on a per-sequence basis. It uses a hierarchical k-mer comparison algorithm (3-mers, 4-mers, and 5-mers) with an early-exit optimization and a fixed 4GB memory allocation that does not scale with input size. Accuracy declines for incomplete genomes. The recommended minimum sequence length is not yet known, but results with lower k5dif values are more likely to be correct to a lower taxonomic level. k5dif represents the sum of the absolute values of the differences between the 5-mer frequency spectra, so the range is 0-1. Because no marker genes are used, QuickClade should perform similarly for any clade in the reference dataset. While the default reference is taxonomically labeled, any reference can be used, with or without taxonomic labels.
Basic Usage
quickclade.sh query1.fa query2.fa query3.fa
quickclade.sh bins
quickclade.sh contigs.fa percontig out=results.tsv usetree
QuickClade accepts multiple query files or directories as input, and can process them either as individual files or on a per-contig basis for more granular taxonomic assignment.
Parameters
Parameters are grouped by their function in the taxonomic assignment process; every parameter from the shell script is listed.
File Parameters
- in=<file,file>
- Query files or directories. Loose file or directory names are also permitted. Input can be fasta, fastq, or spectra files; spectra files are made by cladeloader.sh.
- ref=<file,file>
- Reference files; the current default is: /clusterfs/jgi/groups/gentech/homes/bbushnell/clade/refseq_main.spectra.gz. It is plaintext, human-readable, and pretty small.
- out=stdout
- Set to a file to redirect output. Only the query results will be written here; progress messages will still go to stderr.
Basic Parameters
- percontig
- Run one query per contig instead of per file. This enables more detailed taxonomic assignment for individual contigs within multi-contig files.
- minlen=0
- Ignore sequences shorter than this in percontig mode. Helps filter out very short contigs that may not provide reliable taxonomic signal.
- hits=1
- Print this many top hits per query. Increasing this value provides multiple potential taxonomic assignments for each query.
- steps=7
- Only search up to this many GC intervals (of 0.01) away from the query GC. Limits search space by GC content similarity for faster processing.
- oneline
- Print results one line per query, tab-delimited. Provides machine-readable output format for downstream processing.
- callssu=f
- Call 16S and 18S for alignment to reference SSU. This will affect the top hit ordering only if hits>1. Enhances taxonomic assignment accuracy by incorporating ribosomal RNA information.
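The steps parameter above limits the search to references whose GC content is close to the query's. A minimal Python sketch of that idea follows. It assumes, hypothetically, that references are grouped into buckets of width 0.01 by GC fraction; the bucket layout and all function names are illustrative and are not taken from QuickClade's source.

```python
from collections import defaultdict

def gc_bucket(gc_fraction):
    """Map a GC fraction in [0, 1] to an integer bucket of width 0.01."""
    return int(round(gc_fraction * 100))

def build_gc_index(references):
    """Group (name, gc_fraction) pairs by GC bucket (hypothetical layout)."""
    index = defaultdict(list)
    for name, gc in references:
        index[gc_bucket(gc)].append(name)
    return index

def candidates_within_steps(index, query_gc, steps=7):
    """Return names whose bucket is at most `steps` buckets from the query's bucket."""
    center = gc_bucket(query_gc)
    found = []
    for offset in range(steps + 1):
        # Visit the nearest buckets first: the center, then one step either side, and so on.
        for bucket in sorted({center - offset, center + offset}):
            found.extend(index.get(bucket, []))
    return found

refs = [("refA", 0.50), ("refB", 0.55), ("refC", 0.58), ("refD", 0.42)]
index = build_gc_index(refs)
print(candidates_within_steps(index, 0.50, steps=7))
```

With steps=7, references eight buckets away (refC and refD in this toy data) are never compared, which is where the speedup would come from.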
Advanced Parameters (mainly for benchmarking)
- printmetrics
- Output accuracy statistics; mainly useful for labeled data. Labeled data should have 'tid_1234' or similar in the header. Works best with 'usetree'. Provides detailed performance metrics for evaluation purposes.
- printqtid
- Print query TaxID. Useful for evaluation when query sequences have known taxonomic identifiers.
- banself
- Ignore records with the same TaxID as the query. Makes the program behave like that organism is not in the reference. Useful for leave-one-out cross-validation studies.
- simd
- Use vector instructions to accelerate comparisons. Enables SIMD (Single Instruction, Multiple Data) optimizations for faster k-mer frequency comparisons.
- maxk=5
- Can be set to 4 or 3 to restrict kmer frequency comparisons to smaller kmers. This may improve accuracy for small sequences/bins, but slightly reduces accuracy for large sequences/bins.
- ccm=1.0
- Threshold for using pentamers; lower is faster. Controls when 5-mer comparisons are performed based on preliminary screening results.
- ccm2=1.5
- Threshold for using tetramers. Controls when 4-mer comparisons are performed, providing a balance between speed and accuracy.
- gcdif=0.07
- Initial maximum GC difference. Sets the initial tolerance for GC content differences between query and reference sequences.
- strdif=0.10
- Initial maximum strandedness difference. Controls tolerance for differences in strand bias between sequences.
- gcmult=0.5
- Max GC difference as a fraction of best 5-mer difference. Dynamically adjusts GC tolerance based on k-mer similarity.
- strmult=1.2
- Max strandedness difference as a multiple of the best 5-mer difference. Dynamically adjusts strand bias tolerance based on k-mer similarity.
- ee=t
- Early exit; increases speed. Enables early termination of comparisons when sufficient confidence is achieved, improving processing speed.
- entropy
- Calculate entropy for queries. Slow; negligible utility. Computes sequence complexity metrics that have minimal impact on classification accuracy.
- heap=1
- Number of intermediate comparisons to store. Controls memory usage for tracking top candidate matches during processing.
- usetree
- Load a taxonomic tree for better grading for labeled data. Enables phylogenetically-aware evaluation metrics and improves accuracy assessment.
- aligner=quantum
- Aligner used for SSU sequence comparison when callssu is enabled. Other options include ssa2, glocal, drifting, banded, and crosscut.
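The gcdif, strdif, gcmult, and strmult parameters above describe tolerances that start at a fixed value and then tighten as better hits are found. The Python sketch below shows one way such a filter could work. How QuickClade actually combines the initial limit with the multiplier is not stated here, so taking the smaller of the two is an assumption, and the function and dictionary key names are hypothetical.

```python
def allowed_difference(initial_max, multiplier, best_k5dif=None):
    """Return the tolerance currently in force.

    ASSUMPTION: before any hit exists the initial limit applies; afterwards the
    limit is the smaller of the initial limit and multiplier * best 5-mer difference.
    """
    if best_k5dif is None:
        return initial_max
    return min(initial_max, multiplier * best_k5dif)

def passes_filters(query, ref, best_k5dif=None,
                   gcdif=0.07, strdif=0.10, gcmult=0.5, strmult=1.2):
    """Check whether a reference is close enough in GC and strandedness to compare."""
    max_gc = allowed_difference(gcdif, gcmult, best_k5dif)
    max_str = allowed_difference(strdif, strmult, best_k5dif)
    return (abs(query["gc"] - ref["gc"]) <= max_gc
            and abs(query["strandedness"] - ref["strandedness"]) <= max_str)

query = {"gc": 0.50, "strandedness": 0.50}
ref = {"gc": 0.54, "strandedness": 0.52}
print(passes_filters(query, ref))                   # no hit yet: initial limits apply
print(passes_filters(query, ref, best_k5dif=0.06))  # a good hit tightens the GC limit
```

Under this assumption, a reference that would be compared at the start of a search can be skipped later once a close match has been found.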
Distance Metrics
- abs
- Use absolute difference of kmer frequencies. Computes L1 distance between k-mer frequency vectors.
- cos
- Use 1-cosine similarity of kmer frequencies. Measures the angular distance between k-mer frequency vectors, emphasizing relative composition over absolute counts.
- euc
- Use Euclidean distance. Computes L2 distance between k-mer frequency vectors.
- hel
- Use Hellinger distance. A probabilistic distance metric that treats k-mer frequencies as probability distributions.
- abscomp
- GC-compensated version of abs (default). Uses absolute difference with GC content normalization to reduce bias from compositional differences.
Note: The distance metric strongly impacts ccm, gcmult, and strmult. Defaults are optimized for abscomp.
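For reference, the textbook forms of the abs, cos, euc, and hel metrics listed above can be written in a few lines of Python. This is a sketch of the standard definitions on normalized frequency vectors, not QuickClade's implementation: any rescaling QuickClade applies and the GC compensation used by abscomp are not described in this section, so neither is reproduced. The function names are illustrative.

```python
import math

def normalize(counts):
    """Convert raw k-mer counts to frequencies that sum to 1."""
    total = sum(counts)
    return [c / total for c in counts]

def abs_dif(a, b):
    """L1 distance: sum of absolute differences."""
    return sum(abs(x - y) for x, y in zip(a, b))

def cosine_dif(a, b):
    """1 minus cosine similarity."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def euclidean(a, b):
    """L2 distance."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def hellinger(a, b):
    """Hellinger distance between two probability distributions."""
    total = sum((math.sqrt(x) - math.sqrt(y)) ** 2 for x, y in zip(a, b))
    return math.sqrt(total) / math.sqrt(2)

a = normalize([2, 2, 4, 8])
b = normalize([4, 4, 4, 4])
for name, metric in [("abs", abs_dif), ("cos", cosine_dif),
                     ("euc", euclidean), ("hel", hellinger)]:
    print(name, round(metric(a, b), 4))
```

The metrics produce values on different scales for the same pair of vectors, which is consistent with the note that thresholds such as ccm, gcmult, and strmult need retuning when the metric changes.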
Examples
Basic Taxonomic Assignment
quickclade.sh query1.fa query2.fa query3.fa
Assigns taxonomy to multiple query files using the default reference database.
Directory Processing
quickclade.sh bins
Processes all sequence files in the 'bins' directory for taxonomic assignment.
Per-Contig Analysis with Output File
quickclade.sh contigs.fa percontig out=results.tsv usetree
Analyzes each contig separately, writes results to a tab-delimited file, and uses a taxonomic tree for enhanced evaluation.
Multiple Hits with Machine-Readable Output
quickclade.sh query.fa hits=5 oneline out=top5_hits.tsv
Reports the top 5 taxonomic matches for each query in a single-line, tab-delimited format suitable for automated processing.
Benchmarking with Metrics
quickclade.sh labeled_queries.fa printmetrics usetree banself out=evaluation.txt
Evaluates performance using labeled queries, excludes self-matches, and outputs detailed accuracy metrics.
Custom Distance Metric
quickclade.sh query.fa cos ccm=0.8 ccm2=1.2
Uses the cosine distance metric (1 - cosine similarity) with adjusted thresholds for 5-mer (ccm) and 4-mer (ccm2) comparisons.
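When the commands above are launched from a script, it can help to assemble the argument list programmatically. The Python helper below is a hypothetical convenience wrapper, not part of BBTools; it only uses flags and key=value options that appear in this document and assumes quickclade.sh is on the PATH.

```python
import shlex

def build_quickclade_command(inputs, *flags, **options):
    """Assemble an argument list from input paths, bare flags, and key=value options."""
    command = ["quickclade.sh", *inputs]
    command.extend(flags)
    command.extend(f"{key}={value}" for key, value in options.items())
    return command

command = build_quickclade_command(["query.fa"], "oneline", hits=5, out="top5_hits.tsv")
print(shlex.join(command))
```

To execute the command, the list can be passed to subprocess.run(command, check=True).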
Algorithm Details
K-mer Frequency Profiling
QuickClade employs a hierarchical k-mer frequency comparison algorithm that operates on multiple k-mer sizes (3, 4, and 5-mers by default). The core approach involves:
- Multi-scale Analysis: Uses canonical 3-mer, 4-mer, and 5-mer frequencies to capture different levels of genomic signal
- Threshold-Based Screening: ccm=1.0 controls progression to 5-mer comparisons, ccm2=1.5 controls 4-mer comparisons
- maxk Parameter: Can restrict comparisons to smaller k-mers (maxk=4 or maxk=3) for small sequences/bins
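The frequency spectra that these comparisons operate on can be illustrated with a short Python sketch. It shows only how canonical k-mer frequencies for k=3, 4, and 5 might be built from a sequence; the threshold logic behind ccm and ccm2 is not described in enough detail here to reproduce. The function names are illustrative, and treating the lexicographically smaller of a k-mer and its reverse complement as the canonical form is a common convention assumed for this example.

```python
from collections import Counter

COMPLEMENT = str.maketrans("ACGT", "TGCA")

def canonical(kmer):
    """Return the lexicographically smaller of a k-mer and its reverse complement."""
    reverse_complement = kmer.translate(COMPLEMENT)[::-1]
    return min(kmer, reverse_complement)

def kmer_frequencies(seq, k):
    """Count canonical k-mers in seq and normalize the counts to frequencies."""
    seq = seq.upper()
    counts = Counter()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if set(kmer) <= set("ACGT"):  # skip windows containing N or other symbols
            counts[canonical(kmer)] += 1
    total = sum(counts.values())
    return {kmer: n / total for kmer, n in counts.items()} if total else {}

def spectra(seq, ks=(3, 4, 5)):
    """Build one frequency spectrum per k-mer length."""
    return {k: kmer_frequencies(seq, k) for k in ks}

print(kmer_frequencies("AAATTT", 3))
```

Canonical counting merges each k-mer with its reverse complement, so a sequence and its reverse strand give the same spectrum. For odd k no k-mer is its own reverse complement, which leaves 32 canonical 3-mers and 512 canonical 5-mers.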
Distance Metric Selection
The choice of distance metric significantly impacts both accuracy and computational requirements:
- AbsComp (Default): GC-compensated absolute difference that normalizes for compositional bias while maintaining computational efficiency
- Cosine Similarity: Emphasizes relative k-mer composition patterns, useful for sequences with varying coverage or fragmentation
- Hellinger Distance: Probabilistic approach that treats k-mer counts as probability distributions, providing robust comparisons
Multi-threaded Architecture
CladeSearcher implements a round-robin thread assignment system:
- Thread Count Calculation: Uses Tools.mid(1, Shared.threads(), Tools.min(maxCompareThreads, queries.size()/16)) to determine optimal threads
- Round-Robin Assignment: Each thread processes queries at indices tid, tid+threads, tid+2*threads, and so on, to ensure even distribution
- Index Cloning: Each ProcessThread gets index.clone() to avoid synchronization bottlenecks during search
- Accumulator Pattern: Uses ThreadWaiter.startAndWait() with accumulate() method to collect thread results
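The thread-count expression and the round-robin indexing above can be mirrored in a few lines of Python. This is an illustrative sketch rather than the Java source: it assumes Tools.mid returns the median of its three arguments, and max_compare_threads is passed in as a parameter because its value is not given here.

```python
def mid(a, b, c):
    """Median of three values (assumed behavior of Tools.mid)."""
    return sorted((a, b, c))[1]

def compare_threads(available_threads, max_compare_threads, num_queries):
    """Clamp the thread count between 1 and the available threads, at about 16 queries per thread."""
    return mid(1, available_threads, min(max_compare_threads, num_queries // 16))

def indices_for_thread(tid, threads, num_queries):
    """Round-robin assignment: tid, tid+threads, tid+2*threads, and so on."""
    return list(range(tid, num_queries, threads))

threads = compare_threads(available_threads=8, max_compare_threads=64, num_queries=100)
print(threads)
for tid in range(threads):
    print(tid, indices_for_thread(tid, threads, 100))
```

Under this reading, small inputs fall back to a single thread (fewer than 32 queries gives at most one), and every query index is assigned to exactly one thread.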
Memory Efficiency
The system is designed for minimal memory usage through several strategies:
- Streaming Processing: Queries are processed individually rather than loading all into memory simultaneously
- Compressed References: Uses gzipped spectra files; the contents are human-readable plaintext once decompressed, and the compression keeps them space-efficient
- Controlled Heap Usage: The heap parameter limits intermediate comparison storage to prevent memory bloat
- Early Exit Optimization: Terminates comparisons early when sufficient confidence is achieved, reducing computational overhead
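The idea behind the heap parameter, keeping only a fixed number of the best intermediate comparisons while candidates stream past, can be sketched with Python's heapq module. This is a generic bounded-heap pattern shown under the assumption that a lower difference means a better match; it is not QuickClade's data structure, and the names are illustrative.

```python
import heapq

def best_candidates(comparisons, heap_size=1):
    """Keep the heap_size candidates with the smallest difference from a stream."""
    heap = []  # holds (-difference, name) so the worst kept candidate sits at heap[0]
    for name, dif in comparisons:
        if len(heap) < heap_size:
            heapq.heappush(heap, (-dif, name))
        elif dif < -heap[0][0]:
            # The new candidate beats the worst one kept so far, so swap it in.
            heapq.heapreplace(heap, (-dif, name))
    return sorted((-neg_dif, name) for neg_dif, name in heap)

stream = iter([("a", 0.3), ("b", 0.1), ("c", 0.2), ("d", 0.05)])
print(best_candidates(stream, heap_size=2))
```

Memory stays proportional to heap_size regardless of how many references are compared, since each candidate is either kept or discarded as soon as it is seen.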
Entropy Adjustment Model
QuickClade incorporates an entropy adjustment system that:
- Automatic Model Loading: Constructor checks if AdjustEntropy.kLoaded!=4 || AdjustEntropy.wLoaded!=150, then calls AdjustEntropy.load(4, 150)
- Model Parameters: Uses k=4 and window=150 specifically for entropy calculations and sequence complexity assessment
- Setup Integration: The setup() method reloads entropy models if calcCladeEntropy is enabled and models don't match expected parameters
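To make the k=4, window=150 parameters concrete, the Python sketch below computes a windowed k-mer Shannon entropy as a simple sequence complexity measure. It does not reproduce the AdjustEntropy model itself, whose contents are not described here. The normalization by the largest entropy achievable in a window and the averaging over windows are assumptions made for the example, and the function names are illustrative.

```python
import math
from collections import Counter

def window_entropy(window, k=4):
    """Shannon entropy of the k-mers in one window, scaled to the range 0..1."""
    kmers = [window[i:i + k] for i in range(len(window) - k + 1)]
    if not kmers:
        return 0.0
    n = len(kmers)
    counts = Counter(kmers)
    entropy = sum((c / n) * math.log2(n / c) for c in counts.values())
    # ASSUMPTION: normalize by the largest entropy possible for this window size.
    max_entropy = math.log2(min(n, 4 ** k))
    return entropy / max_entropy if max_entropy > 0 else 0.0

def mean_entropy(seq, k=4, window=150):
    """Average the windowed entropy over every window position in seq."""
    seq = seq.upper()
    if len(seq) <= window:
        return window_entropy(seq, k)
    values = [window_entropy(seq[i:i + window], k)
              for i in range(len(seq) - window + 1)]
    return sum(values) / len(values)

print(mean_entropy("A" * 300))        # a homopolymer has no complexity
print(mean_entropy("ACGTTGCA" * 40))  # a short repeat scores low but above zero
```

Recomputing every window from scratch keeps the sketch short; a production implementation would update counts incrementally as the window slides.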
Taxonomic Tree Integration
When usetree is enabled, the system provides enhanced evaluation capabilities:
- Phylogenetic Awareness: Considers taxonomic relationships when evaluating classification accuracy
- Level-Specific Metrics: Reports accuracy at different taxonomic levels (species, genus, family, etc.)
- Weighted Scoring: Provides phylogenetically-weighted accuracy scores that account for evolutionary relationships
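Level-specific accuracy of the kind described above can be illustrated by comparing the lineage of each query with the lineage of its top hit. The Python sketch below is not QuickClade's grading code: the rank list, the dictionary-based lineage representation, and the function names are assumptions chosen for the example, and the weighted scoring is not reproduced.

```python
RANKS = ["species", "genus", "family", "order", "class", "phylum", "domain"]

def lowest_shared_rank(true_lineage, predicted_lineage):
    """Return the most specific rank at which the two lineages agree, or None."""
    for rank in RANKS:
        true_name = true_lineage.get(rank)
        if true_name is not None and true_name == predicted_lineage.get(rank):
            return rank
    return None

def accuracy_by_rank(pairs):
    """Fraction of (true, predicted) pairs that are correct at each rank."""
    correct = {rank: 0 for rank in RANKS}
    for true_lineage, predicted_lineage in pairs:
        shared = lowest_shared_rank(true_lineage, predicted_lineage)
        if shared is not None:
            # Agreement at one rank is counted as agreement at every higher rank.
            for rank in RANKS[RANKS.index(shared):]:
                correct[rank] += 1
    return {rank: correct[rank] / len(pairs) for rank in RANKS} if pairs else {}

ecoli = {"species": "Escherichia coli", "genus": "Escherichia",
         "family": "Enterobacteriaceae", "domain": "Bacteria"}
salmonella = {"species": "Salmonella enterica", "genus": "Salmonella",
              "family": "Enterobacteriaceae", "domain": "Bacteria"}
bsubtilis = {"species": "Bacillus subtilis", "genus": "Bacillus",
             "family": "Bacillaceae", "domain": "Bacteria"}

pairs = [(ecoli, ecoli), (ecoli, salmonella), (ecoli, bsubtilis)]
print(accuracy_by_rank(pairs))
```

A hit to the wrong species in the right family still counts as correct at family level and above, which is the kind of partial credit a flat TaxID comparison cannot give.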
Performance Characteristics
QuickClade implements several performance optimizations:
- Fixed Memory Usage: Uses parseJavaArgs("--mem=4g", "--mode=fixed") for consistent 4GB allocation regardless of input size
- Early Exit Optimization: The ee=t parameter enables termination of comparisons when sufficient confidence is achieved
- Compressed References: Uses gzipped .spectra.gz files, which are space-efficient on disk and human-readable plaintext once decompressed
- SIMD Support: Optional vector instructions (simd parameter) for hardware acceleration of k-mer frequency calculations
Support
For questions and support:
- Email: bbushnell@lbl.gov
- Documentation: bbmap.org