testsketch.sh
Systematic sensitivity analysis pipeline for sketch-based taxonomic classification. Tests SendSketch performance across multiple taxonomic levels (strain to superkingdom) by progressively excluding closer relatives, enabling comprehensive evaluation of classification accuracy and sensitivity at different phylogenetic distances.
Purpose
This pipeline evaluates the sensitivity and accuracy of sketch-based taxonomic classification by systematically testing identification performance when progressively excluding closer taxonomic relatives. It helps determine the taxonomic resolution limits and optimal database configurations for specific classification needs.
Usage
testsketch.sh <query_file> <target_taxid> <database> [optional_flags] [additional_flags]
# Parameters:
# query_file - FASTA/FASTQ file to test
# target_taxid - NCBI taxonomy ID of the expected organism
# database - Sketch database to query against
# optional_flags - Additional SendSketch parameters
# additional_flags - More SendSketch parameters
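The level commands further down refer to $QUERY, $TID, $DB, $EXTRA and $EXTRA2. A minimal sketch of how the positional arguments could map onto those names, assuming they are assigned in the order shown in the usage line. The function name `parse_args` is illustrative and is not taken from the script itself.

```shell
#!/bin/sh
# Hypothetical helper: maps positional arguments onto the variable names
# used in the level commands. The real script may assign them differently.
parse_args() {
    if [ "$#" -lt 3 ]; then
        echo "usage: testsketch.sh <query_file> <target_taxid> <database> [optional_flags] [additional_flags]" >&2
        return 1
    fi
    QUERY=$1
    TID=$2
    DB=$3
    EXTRA=${4:-}    # optional; empty when omitted
    EXTRA2=${5:-}   # optional; empty when omitted
}
```

With this mapping, `parse_args ecoli_reads.fq 511145 refseq k=31` would leave `k=31` in $EXTRA and $EXTRA2 empty.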
Pipeline Analysis Levels
The pipeline tests taxonomic classification at 9 different taxonomic levels, progressively excluding closer relatives:
Level 1: Strain-Level Classification
sendsketch.sh $QUERY $DB -Xmx1g colors=f records=1 minhits=1 silent banunclassified printcommonancestorlevel tid=$TID include=$TID $EXTRA $EXTRA2
Test: Can the exact strain be identified when all references are available?
Level 2: Species-Level Classification
sendsketch.sh $QUERY $DB -Xmx1g colors=f records=1 minhits=1 silent banunclassified printcommonancestorlevel tid=$TID include=$TID includelevel=species exclude=$TID excludelevel=strain $EXTRA $EXTRA2
Test: Can the species be identified when the exact strain is excluded?
Level 3: Genus-Level Classification
sendsketch.sh $QUERY $DB -Xmx1g colors=f records=1 minhits=1 silent banunclassified printcommonancestorlevel tid=$TID include=$TID includelevel=genus exclude=$TID excludelevel=species $EXTRA $EXTRA2
Test: Can the genus be identified when the species and closer relatives are excluded?
Level 4: Family-Level Classification
sendsketch.sh $QUERY $DB -Xmx1g colors=f records=1 minhits=1 silent banunclassified printcommonancestorlevel tid=$TID include=$TID includelevel=family exclude=$TID excludelevel=genus $EXTRA $EXTRA2
Test: Can the family be identified when genus and closer relatives are excluded?
Levels 5-9: Higher Taxonomic Levels
The pipeline continues through progressively higher taxonomic levels:
- Order: Excludes family and below
- Class: Excludes order and below
- Phylum: Excludes class and below
- Superkingdom: Excludes phylum and below
- Life: Excludes superkingdom and below
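The pattern in Levels 2 through 4 extends mechanically: each step sets includelevel to the level under test and excludelevel to the level just below it. The following is a dry-run sketch that prints one command per level instead of running it. It assumes that Levels 5-9 reuse the same flag pattern and that SendSketch accepts the level names exactly as spelled here; `print_level_cmds` is an illustrative name, not part of the pipeline.

```shell
#!/bin/sh
# Dry run: print (do not execute) one sendsketch.sh command per level.
# Assumption: Levels 5-9 follow the include/exclude pattern of Levels 2-4.
print_level_cmds() {
    query=$1; tid=$2; db=$3; extra=${4:-}; extra2=${5:-}
    base="sendsketch.sh $query $db -Xmx1g colors=f records=1 minhits=1 silent banunclassified printcommonancestorlevel tid=$tid include=$tid"
    prev=""
    for level in strain species genus family order class phylum superkingdom life; do
        if [ -z "$prev" ]; then
            # Level 1: nothing is excluded, all references are available.
            cmd="$base"
        else
            # Include up to the level under test, exclude the level below it.
            cmd="$base includelevel=$level exclude=$tid excludelevel=$prev"
        fi
        # Append the optional pass-through flags only when present.
        if [ -n "$extra" ]; then cmd="$cmd $extra"; fi
        if [ -n "$extra2" ]; then cmd="$cmd $extra2"; fi
        echo "$cmd"
        prev=$level
    done
}
```

Printing the commands first makes it easy to confirm that each include/exclude pair matches the intended level before any queries are sent.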
Key Parameters
Each SendSketch call uses standardized parameters for consistent comparison:
Standard Parameters
-Xmx1g
- Use 1GB memory (adjustable for larger datasets)
colors=f
- Disable colored output for cleaner logging
records=1
- Return only the top hit for focused analysis
minhits=1
- Require at least one kmer hit
silent
- Suppress verbose output
banunclassified
- Exclude unclassified sequences
printcommonancestorlevel
- Report common ancestor taxonomy level
Taxonomic Control Parameters
tid=$TID
- Target taxonomy ID for analysis
include=$TID
- Include target and descendants
includelevel=X
- Include up to taxonomic level X
exclude=$TID
- Exclude target and descendants
excludelevel=Y
- Exclude taxonomic level Y and below
Example Usage
# Test E. coli strain sensitivity using RefSeq database
./testsketch.sh ecoli_reads.fq 511145 refseq k=31
# Test with custom parameters
./testsketch.sh sample.fa 287 nt minhits=3 minani=0.85
# Test archaeal organism with specialized database
./testsketch.sh archaea.fq 2157 archaea.sketch records=5
Output Interpretation
The pipeline produces a systematic sensitivity report:
Expected Output Format
********** Strain **********
[Top hit at strain level with all references]
********** Species **********
[Top hit at species level excluding exact strain]
********** Genus **********
[Top hit at genus level excluding species and below]
[... continuing through all taxonomic levels ...]
********** Life **********
[Top hit at life level excluding superkingdom and below]
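Because each level's result follows a banner line, a saved report can be summarized per level with standard tools. The following is a minimal sketch using awk. It assumes each banner is exactly ten asterisks, a space, the level name, a space, and ten asterisks, and that the report was saved to a file; `list_levels` is an illustrative name.

```shell
#!/bin/sh
# Print each level name with the number of non-empty result lines under it.
# Assumption: banners look exactly like "********** Genus **********".
list_levels() {
    awk '
    {
        is_banner = (length($0) > 22 &&
                     substr($0, 1, 11) == "********** " &&
                     substr($0, length($0) - 10) == " **********")
        if (is_banner) {
            # Emit the count for the previous level before starting a new one.
            if (name != "") print name "\t" n
            name = substr($0, 12, length($0) - 22)
            n = 0
        } else if (name != "" && $0 != "") {
            n++
        }
    }
    END { if (name != "") print name "\t" n }
    ' "$1"
}
```

A level that reports zero result lines is a quick signal that no hit was returned once the closer relatives were excluded.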
Sensitivity Analysis Metrics
- Identification Success: At which taxonomic levels can correct identification be made?
- Classification Confidence: What are the similarity scores at each level?
- Breakpoint Detection: At what taxonomic distance does classification fail?
- Database Coverage: Are there gaps in reference representation?
Applications
Database Evaluation
- Assess reference database completeness at different taxonomic levels
- Identify taxonomic groups with poor representation
- Evaluate the impact of database size on classification accuracy
Method Validation
- Determine taxonomic resolution limits for specific organisms
- Validate classification accuracy for novel or divergent sequences
- Compare performance across different sketch databases
Parameter Optimization
- Find optimal k-mer sizes for different taxonomic distances
- Determine minimum similarity thresholds for reliable classification
- Assess the impact of sketch size on sensitivity
Interpretation Guidelines
Successful Classification Pattern
- Strain Level: High similarity (>95%) indicates good strain match
- Species Level: Good similarity (>85%) shows species-level accuracy
- Progressive Degradation: Gradual similarity decrease as taxonomic distance increases
Classification Failure Indicators
- Sudden Drops: A large similarity decrease between adjacent levels suggests a gap in the database at that level
- Wrong Taxa: Hits to unrelated organisms suggest insufficient references
- No Hits: Complete failure may indicate very divergent sequences
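Sudden drops can be flagged automatically once the per-level similarity values have been collected. The sketch below assumes a hand-prepared, tab-separated file with one `level<TAB>similarity_percent` line per level, ordered from strain upward; this file format is hypothetical and is not something the pipeline emits. The default threshold of 15 percentage points is an arbitrary placeholder, and `find_drops` is an illustrative name.

```shell
#!/bin/sh
# Flag levels whose similarity falls by more than a threshold relative to
# the previous level. Input format (hypothetical): level<TAB>similarity.
# $1 = input file, $2 = maximum tolerated drop in percentage points.
find_drops() {
    awk -F '\t' -v max_drop="${2:-15}" '
        NR > 1 && (prev - $2) > max_drop {
            # Report the level and the size of the drop leading into it.
            printf "%s\t%.1f\n", $1, prev - $2
        }
        { prev = $2 }
    ' "$1"
}
```

Each output line names the level at which the drop occurred and its size, which points to the rank where reference coverage is worth inspecting first.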
Common Use Cases
Novel Organism Analysis
# Test how well a new isolate can be classified
./testsketch.sh new_isolate.fa 0 refseq minani=0.7
Database Quality Assessment
# Compare classification using different databases
./testsketch.sh test_seq.fq 562 refseq.sketch
./testsketch.sh test_seq.fq 562 nt.sketch
./testsketch.sh test_seq.fq 562 custom.sketch
Sensitivity Threshold Determination
# Test with different similarity thresholds
./testsketch.sh query.fa 1234 db.sketch minani=0.95
./testsketch.sh query.fa 1234 db.sketch minani=0.85
./testsketch.sh query.fa 1234 db.sketch minani=0.75
Performance Characteristics
- Memory Usage: 1GB default, scalable for larger databases
- Runtime: Nine SendSketch calls per analysis; a full run typically takes minutes
- Systematic Testing: Consistent parameters ensure comparable results
- Scalability: Can be run in parallel for multiple test organisms
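One way to run several test organisms in parallel is to generate one command per organism and hand the list to `xargs`. The sketch below only prints the commands. It assumes a hypothetical tab-separated list of `query<TAB>taxid` pairs, file names without spaces, and that the report is written to standard output; `make_jobs` and the `.report.txt` suffix are illustrative.

```shell
#!/bin/sh
# Print one testsketch.sh command per line of a "query<TAB>taxid" list.
# $1 = list file (hypothetical format), $2 = sketch database.
make_jobs() {
    list=$1; db=$2
    while IFS="$(printf '\t')" read -r query tid; do
        # Skip blank lines in the list.
        [ -n "$query" ] || continue
        printf './testsketch.sh %s %s %s > %s.report.txt\n' "$query" "$tid" "$db" "$query"
    done < "$list"
}

# To run up to 4 jobs at once (not executed here; -P is supported by
# GNU and BSD xargs):
#   make_jobs organisms.tsv refseq | xargs -P 4 -I {} sh -c '{}'
```

Since each run reserves 1GB by default, the number of parallel jobs should be chosen with the available memory in mind.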
Limitations and Considerations
- Database Dependency: Results depend heavily on reference database quality
- Taxonomy Changes: NCBI taxonomy updates may affect results
- Sequence Quality: Poor quality sequences may show artificially low sensitivity
- Chimeric Sequences: Mixed organisms may produce confusing results
Related Tools
sendsketch.sh
- Core sketch-based taxonomic identification tool
comparesketch.sh
- Local sketch database comparison
sketch.sh
- Create custom sketch databases
taxonomy.sh
- Taxonomy-related utilities and lookups