testsketch.sh

Script: testsketch.sh
Author: Brian Bushnell
Last Modified: November 17, 2019
Type: Sensitivity Analysis Pipeline

Systematic sensitivity analysis pipeline for sketch-based taxonomic classification. It runs SendSketch at nine taxonomic levels (strain through life), excluding a progressively larger clade of close relatives at each step, to show how classification accuracy and sensitivity change with phylogenetic distance.

Purpose

This pipeline measures how well sketch-based classification identifies a query once its closest relatives are removed from consideration. Each step excludes a larger clade around the target, which reveals the taxonomic resolution limit for that organism and helps in choosing a database configuration for a given classification need.

Usage

testsketch.sh <query_file> <target_taxid> <database> [optional_flags] [additional_flags]

# Parameters:
# query_file       - FASTA/FASTQ file to test
# target_taxid     - NCBI taxonomy ID of the expected organism
# database         - Sketch database to query against
# optional_flags   - Extra SendSketch flag, passed through as $EXTRA
# additional_flags - Second extra SendSketch flag, passed through as $EXTRA2
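
The positional arguments line up with the variables used in the SendSketch commands in this document ($QUERY, $TID, $DB, $EXTRA, $EXTRA2). A minimal sketch of that mapping follows. The function name parse_testsketch_args and both validation checks are assumptions made for illustration, not code taken from testsketch.sh.

```shell
# Hypothetical helper: map positional arguments onto the variable names
# that appear in the SendSketch commands in this document.
parse_testsketch_args() {
    if [ "$#" -lt 3 ]; then
        echo "usage: testsketch.sh <query_file> <target_taxid> <database> [flag] [flag]" >&2
        return 1
    fi
    QUERY=$1
    TID=$2
    DB=$3
    EXTRA=${4:-}     # first optional SendSketch flag, may be empty
    EXTRA2=${5:-}    # second optional SendSketch flag, may be empty

    # Assumed sanity check: an NCBI TaxID is a non-negative integer.
    case $TID in
        ''|*[!0-9]*)
            echo "target_taxid must be numeric: $TID" >&2
            return 1
            ;;
    esac
}
```

Because only two pass-through slots exist, a third extra flag on the command line would be silently dropped under this mapping.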

Pipeline Analysis Levels

The pipeline runs SendSketch once at each of nine taxonomic levels. Every run after the first excludes the clade one rank closer to the target, so the best remaining hit must come from a more distant relative:

Level 1: Strain-Level Classification

sendsketch.sh $QUERY $DB -Xmx1g colors=f records=1 minhits=1 silent banunclassified printcommonancestorlevel tid=$TID include=$TID $EXTRA $EXTRA2

Test: Can the exact strain be identified when all references are available?

Level 2: Species-Level Classification

sendsketch.sh $QUERY $DB -Xmx1g colors=f records=1 minhits=1 silent banunclassified printcommonancestorlevel tid=$TID include=$TID includelevel=species exclude=$TID excludelevel=strain $EXTRA $EXTRA2

Test: Can the species be identified when the exact strain is excluded?

Level 3: Genus-Level Classification

sendsketch.sh $QUERY $DB -Xmx1g colors=f records=1 minhits=1 silent banunclassified printcommonancestorlevel tid=$TID include=$TID includelevel=genus exclude=$TID excludelevel=species $EXTRA $EXTRA2

Test: Can the genus be identified when the species and closer relatives are excluded?

Level 4: Family-Level Classification

sendsketch.sh $QUERY $DB -Xmx1g colors=f records=1 minhits=1 silent banunclassified printcommonancestorlevel tid=$TID include=$TID includelevel=family exclude=$TID excludelevel=genus $EXTRA $EXTRA2

Test: Can the family be identified when the genus and closer relatives are excluded?

Levels 5-9: Higher Taxonomic Levels

The pipeline continues through progressively higher taxonomic levels:

Order: Excludes family and below
Class: Excludes order and below
Phylum: Excludes class and below
Superkingdom: Excludes phylum and below
Life: Excludes superkingdom and below
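
The include/exclude pattern across the nine levels can be sketched as a loop that prints the command for each level without running it. Levels 1-4 reproduce the commands shown above. The flags printed for levels 5-9 extrapolate the same pattern and are an assumption, since this document does not list those commands. The function name print_level_commands is hypothetical.

```shell
# Print (do not run) one SendSketch command per taxonomic level.
# The body runs in a subshell so its variables stay local.
print_level_commands() (
    query=$1; tid=$2; db=$3; shift 3
    common="-Xmx1g colors=f records=1 minhits=1 silent banunclassified printcommonancestorlevel"

    # Any remaining arguments are passed through as extra flags.
    extra=""
    if [ "$#" -gt 0 ]; then extra=" $*"; fi

    prev=""
    for level in strain species genus family order class phylum superkingdom life; do
        if [ -z "$prev" ]; then
            # Strain level: include the target itself, exclude nothing.
            tax="tid=$tid include=$tid"
        else
            # Keep the clade at $level, drop the clade one rank closer.
            tax="tid=$tid include=$tid includelevel=$level exclude=$tid excludelevel=$prev"
        fi
        printf '%s\n' "sendsketch.sh $query $db $common $tax$extra"
        prev=$level
    done
)
```

For example, print_level_commands ecoli_reads.fq 511145 refseq prints nine lines, which can be reviewed before being run.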

Key Parameters

Each SendSketch call uses standardized parameters for consistent comparison:

Standard Parameters

-Xmx1g - Limit the Java heap to 1 GB (raise it for larger datasets)
colors=f - Disable colored output for cleaner logging
records=1 - Return only the top hit for focused analysis
minhits=1 - Require at least one k-mer hit
silent - Suppress verbose output
banunclassified - Ignore reference organisms that fall under "unclassified" taxonomy nodes
printcommonancestorlevel - Report common ancestor taxonomy level

Taxonomic Control Parameters

tid=$TID - Target taxonomy ID for analysis
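include=$TID - Restrict results to the clade of the target
includelevel=X - Promote the include list to level X, so everything in the target's level-X clade is kept
exclude=$TID - Ignore the clade of the target
excludelevel=Y - Promote the exclude list to level Y, so everything in the target's level-Y clade is removed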

Example Usage

# Test E. coli strain sensitivity using RefSeq database
./testsketch.sh ecoli_reads.fq 511145 refseq k=31

# Test with custom parameters
./testsketch.sh sample.fa 287 nt minhits=3 minani=0.85

# Test an archaeal query against a specialized database (2157 is the NCBI TaxID for Archaea as a whole)
./testsketch.sh archaea.fq 2157 archaea.sketch records=5

Output Interpretation

The pipeline produces a systematic sensitivity report:

Expected Output Format

********** Strain **********
[Top hit at strain level with all references]

********** Species **********
[Top hit at species level excluding exact strain]

********** Genus **********
[Top hit at genus level excluding species and below]

[... continuing through all taxonomic levels ...]

********** Life **********
[Top hit at life level excluding superkingdom and below]
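
A report in this layout can be condensed to one line per level by splitting on the banner lines. The sketch below assumes that an empty section means no hit was returned at that level, and the function name summarize_report is hypothetical. It keeps the first non-blank line under each banner verbatim, since the exact columns SendSketch prints are not described here.

```shell
# Read a testsketch.sh report on stdin and print one line per level:
#   <level><TAB><first non-blank line of that section, or NO_HIT>
summarize_report() {
    awk '
        # A banner looks like: ********** Genus **********
        /^\*+ .* \*+$/ {
            if (level != "") print level "\t" (hit == "" ? "NO_HIT" : hit)
            level = $0
            gsub(/^\*+ | \*+$/, "", level)   # strip the asterisks
            hit = ""
            next
        }
        # Remember only the first non-blank line under each banner.
        level != "" && hit == "" && NF > 0 { hit = $0 }
        END {
            if (level != "") print level "\t" (hit == "" ? "NO_HIT" : hit)
        }
    '
}
```

Typical use would be ./testsketch.sh query.fa 562 refseq | summarize_report, which gives a compact table in which the first NO_HIT row marks where classification stops.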

Sensitivity Analysis Metrics

Identification Success: At which taxonomic levels can correct identification be made?
Classification Confidence: What are the similarity scores at each level?
Breakpoint Detection: At what taxonomic distance does classification fail?
Database Coverage: Are there gaps in reference representation?

Applications

Database Evaluation

Assess reference database completeness at different taxonomic levels
Identify taxonomic groups with poor representation
Evaluate the impact of database size on classification accuracy

Method Validation

Determine taxonomic resolution limits for specific organisms
Validate classification accuracy for novel or divergent sequences
Compare performance across different sketch databases

Parameter Optimization

Find optimal k-mer sizes for different taxonomic distances
Determine minimum similarity thresholds for reliable classification
Assess the impact of sketch size on sensitivity

Interpretation Guidelines

Successful Classification Pattern

Strain Level: High similarity (>95%) indicates good strain match
Species Level: Good similarity (>85%) shows species-level accuracy
Progressive Degradation: Gradual similarity decrease as taxonomic distance increases

Classification Failure Indicators

Sudden Drops: A large similarity decrease between adjacent levels can indicate a gap in the database at the intervening rank
Wrong Taxa: Hits to unrelated organisms suggest insufficient references
No Hits: Complete failure may indicate very divergent sequences
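
Sudden drops can be flagged automatically once a similarity value has been extracted for each level. The sketch below assumes a hypothetical two-column input (level name, similarity in percent, tab-separated, closest level first) prepared by the user. The default threshold of 15 percentage points is arbitrary and is not a value recommended by this document.

```shell
# stdin: one "<level><TAB><similarity percent>" line per level, closest first.
# $1:    drop size in percentage points that counts as sudden (default 15).
flag_sudden_drops() {
    awk -F '\t' -v max_drop="${1:-15}" '
        # Compare each level with the one before it.
        NR > 1 && prev - $2 > max_drop {
            printf "%s -> %s: dropped %.1f points\n", prev_level, $1, prev - $2
        }
        { prev = $2; prev_level = $1 }
    '
}
```

A flagged transition is only a prompt for closer inspection. Whether it reflects a database gap or genuine divergence has to be judged from the taxa involved.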

Common Use Cases

Novel Organism Analysis

# Test how well a new isolate can be classified
./testsketch.sh new_isolate.fa 0 refseq minani=0.7

Database Quality Assessment

# Compare classification using different databases
./testsketch.sh test_seq.fq 562 refseq.sketch
./testsketch.sh test_seq.fq 562 nt.sketch
./testsketch.sh test_seq.fq 562 custom.sketch
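
To keep the per-database reports side by side, the three runs above can be wrapped in a loop that writes each report to its own file. This is a sketch under assumptions: the function name compare_databases, the TESTSKETCH override variable, and the report_<name>.txt naming scheme are all hypothetical.

```shell
# Path to the pipeline script; override TESTSKETCH to point elsewhere.
TESTSKETCH=${TESTSKETCH:-./testsketch.sh}

# Run the same query and TaxID against several databases,
# writing one report file per database into the current directory.
compare_databases() (
    query=$1; tid=$2; shift 2
    for db in "$@"; do
        # Strip the directory and a trailing ".sketch" to name the report.
        name=$(basename "$db" .sketch)
        "$TESTSKETCH" "$query" "$tid" "$db" > "report_${name}.txt"
    done
)
```

Usage would look like compare_databases test_seq.fq 562 refseq.sketch nt.sketch custom.sketch, after which the report files can be compared level by level.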

Sensitivity Threshold Determination

# Test with different similarity thresholds
./testsketch.sh query.fa 1234 db.sketch minani=0.95
./testsketch.sh query.fa 1234 db.sketch minani=0.85
./testsketch.sh query.fa 1234 db.sketch minani=0.75

Performance Characteristics

Memory Usage: 1 GB Java heap per SendSketch call (-Xmx1g), which can be raised for larger databases
Runtime: Nine SendSketch calls per analysis (one per taxonomic level), typically minutes in total
Systematic Testing: Consistent parameters ensure comparable results
Scalability: Can be run in parallel for multiple test organisms
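
One way to run several test organisms in parallel is to drive the script with xargs. The sketch below assumes a hypothetical input file holding one "<query_file> <taxid>" pair per line with no spaces in file names, an xargs that supports the widely available but non-POSIX -P option, and a report_<taxid>.txt naming scheme. None of these come from testsketch.sh itself.

```shell
# Path to the pipeline script; override TESTSKETCH to point elsewhere.
TESTSKETCH=${TESTSKETCH:-./testsketch.sh}

# $1: file with one "<query_file> <taxid>" pair per line
# $2: database to query
# $3: number of parallel jobs (default 4)
run_parallel() {
    xargs -P "${3:-4}" -n 2 sh -c '
        # Inside sh -c: $1 = script, $2 = database, $3 = query, $4 = taxid
        "$1" "$3" "$4" "$2" > "report_$4.txt"
    ' _ "$TESTSKETCH" "$2" < "$1"
}
```

Each job runs its nine SendSketch calls one after another, so N parallel jobs need roughly N times the 1 GB heap. Two pairs that share a TaxID would overwrite each other's report under this naming scheme.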

Limitations and Considerations

Database Dependency: Results depend heavily on reference database quality
Taxonomy Changes: NCBI taxonomy updates may affect results
Sequence Quality: Poor quality sequences may show artificially low sensitivity
Chimeric Sequences: Mixed organisms may produce confusing results

Related Tools

sendsketch.sh - Core sketch-based taxonomic identification tool
comparesketch.sh - Local sketch database comparison
sketch.sh - Create custom sketch databases
taxonomy.sh - Taxonomy-related utilities and lookups