testsketch.sh

Script: testsketch.sh
Author: Brian Bushnell
Last Modified: November 17, 2019
Type: Sensitivity Analysis Pipeline

Systematic sensitivity analysis pipeline for sketch-based taxonomic classification. Tests SendSketch performance across multiple taxonomic levels (strain to superkingdom) by progressively excluding closer relatives, enabling comprehensive evaluation of classification accuracy and sensitivity at different phylogenetic distances.

Purpose

This pipeline measures how well sketch-based classification identifies a query as its closer taxonomic relatives are progressively excluded from the search. The results show where taxonomic resolution breaks down and help in choosing a database configuration for a given classification need.

Usage

testsketch.sh <query_file> <target_taxid> <database> [optional_flags] [additional_flags]

# Parameters:
# query_file    - FASTA/FASTQ file to test
# target_taxid  - NCBI taxonomy ID of the expected organism
# database      - Sketch database to query against
# optional_flags   - Optional extra SendSketch parameter, appended to every call ($EXTRA below)
# additional_flags - Optional second extra SendSketch parameter, appended to every call ($EXTRA2 below)
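
The usage line implies three required and up to two optional positional arguments. A wrapper could check them before starting the pipeline. The following is a minimal sketch under that assumption; the function name `check_args` and its error messages are hypothetical and are not part of testsketch.sh.

```shell
#!/bin/bash
# Hypothetical pre-flight check mirroring the usage line above.
# Not part of testsketch.sh; names and messages are illustrative.
check_args() {
    # Three required arguments, at most two optional ones.
    if [ "$#" -lt 3 ] || [ "$#" -gt 5 ]; then
        echo "usage: testsketch.sh <query_file> <target_taxid> <database> [optional_flags] [additional_flags]" >&2
        return 1
    fi
    # The taxonomy ID must consist of digits only.
    case $2 in
        ''|*[!0-9]*)
            echo "target_taxid must be a non-negative integer: $2" >&2
            return 1
            ;;
    esac
    return 0
}
```

For example, `check_args sample.fa 287 nt minhits=3 minani=0.85` returns 0, while `check_args sample.fa` prints the usage line to stderr and returns 1.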

Pipeline Analysis Levels

The pipeline runs one SendSketch query per taxonomic level, 9 in total. After the first level, each query excludes relatives at the previous level and below:

Level 1: Strain-Level Classification

sendsketch.sh $QUERY $DB -Xmx1g colors=f records=1 minhits=1 silent banunclassified printcommonancestorlevel tid=$TID include=$TID $EXTRA $EXTRA2

Test: Can the exact strain be identified when all references are available?

Level 2: Species-Level Classification

sendsketch.sh $QUERY $DB -Xmx1g colors=f records=1 minhits=1 silent banunclassified printcommonancestorlevel tid=$TID include=$TID includelevel=species exclude=$TID excludelevel=strain $EXTRA $EXTRA2

Test: Can the species be identified when the exact strain is excluded?

Level 3: Genus-Level Classification

sendsketch.sh $QUERY $DB -Xmx1g colors=f records=1 minhits=1 silent banunclassified printcommonancestorlevel tid=$TID include=$TID includelevel=genus exclude=$TID excludelevel=species $EXTRA $EXTRA2

Test: Can the genus be identified when the species and closer relatives are excluded?

Level 4: Family-Level Classification

sendsketch.sh $QUERY $DB -Xmx1g colors=f records=1 minhits=1 silent banunclassified printcommonancestorlevel tid=$TID include=$TID includelevel=family exclude=$TID excludelevel=genus $EXTRA $EXTRA2

Test: Can the family be identified when the genus and closer relatives are excluded?

Levels 5-9: Higher Taxonomic Levels

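The pipeline continues in the same way through progressively higher taxonomic levels, ending with the Life level shown in the expected output, which excludes superkingdom and below.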
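
The pattern in Levels 2 through 4 (include at one rank, exclude at the rank below) lends itself to a loop. The following is a minimal sketch, not the script's actual source. The function name `print_level_cmds` is hypothetical, and the ranks above family (order, class, phylum, superkingdom, life) are an assumption inferred from the count of nine levels and the Life banner in the expected output. It only prints the commands; it does not run SendSketch.

```shell
#!/bin/bash
# Hypothetical helper: prints (does not run) one sendsketch.sh command per level.
# The rank list above "family" is an assumption, not taken from the script.
print_level_cmds() {
    local query=$1 tid=$2 db=$3
    shift 3
    local extra="$*"   # any remaining arguments are appended to every command
    local common="-Xmx1g colors=f records=1 minhits=1 silent banunclassified printcommonancestorlevel"
    local ranks=(strain species genus family order class phylum superkingdom life)
    local i

    # Level 1: strain, nothing excluded.
    echo "sendsketch.sh $query $db $common tid=$tid include=$tid${extra:+ $extra}"

    # Levels 2 and up: include at rank i, exclude at rank i-1.
    for ((i = 1; i < ${#ranks[@]}; i++)); do
        echo "sendsketch.sh $query $db $common tid=$tid include=$tid includelevel=${ranks[i]} exclude=$tid excludelevel=${ranks[i-1]}${extra:+ $extra}"
    done
}
```

Calling `print_level_cmds sample.fa 287 nt minhits=3` prints nine command lines, which can be inspected before anything is executed.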

Key Parameters

Each SendSketch call uses standardized parameters for consistent comparison:

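Standard Parameters

Every call shown above shares these flags: -Xmx1g, colors=f, records=1, minhits=1, silent, banunclassified, and printcommonancestorlevel.

Taxonomic Control Parameters

The flags tid, include, includelevel, exclude, and excludelevel select which taxa are considered. Of these, includelevel and excludelevel are the ones that change from level to level.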

Example Usage

# Test E. coli strain sensitivity using the RefSeq database
./testsketch.sh ecoli_reads.fq 511145 refseq k=31

# Test with custom parameters
./testsketch.sh sample.fa 287 nt minhits=3 minani=0.85

# Test an archaeal organism with a specialized database
./testsketch.sh archaea.fq 2157 archaea.sketch records=5

Output Interpretation

The pipeline prints one report section per taxonomic level, each introduced by a banner line naming the level:

Expected Output Format

********** Strain **********
[Top hit at strain level with all references]

********** Species **********
[Top hit at species level excluding exact strain]

********** Genus **********
[Top hit at genus level excluding species and below]

[... continuing through all taxonomic levels ...]

********** Life **********
[Top hit at life level excluding superkingdom and below]
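
Because each section starts with a banner, a saved report can be condensed to one line per level. The following is a minimal sketch; the function name `summarize_report` is hypothetical, and it assumes banners look exactly like the ones above. The SendSketch result text is treated as opaque: the sketch simply keeps the first non-empty line under each banner and makes no assumption about its columns.

```shell
#!/bin/bash
# Hypothetical helper: reads a testsketch.sh report on stdin and prints
# "<level><TAB><first non-empty line under that banner>" for each section.
# Assumes banners of the form "********** Name **********".
summarize_report() {
    awk '
        # Print the section collected so far, if any.
        function emit() {
            if (level != "") print level "\t" (hit == "" ? "(no output)" : hit)
        }
        /^\*\*\*\*\*\*\*\*\*\* .+ \*\*\*\*\*\*\*\*\*\*$/ {
            emit()
            level = $0
            gsub(/^\*+ | \*+$/, "", level)   # strip the asterisks, keep the name
            hit = ""
            next
        }
        # Remember only the first non-empty line of each section.
        level != "" && hit == "" && NF > 0 { hit = $0 }
        END { emit() }
    '
}
```

A typical use would be `./testsketch.sh sample.fa 287 nt > report.txt` followed by `summarize_report < report.txt`. A level that prints nothing under its banner shows up as `(no output)`.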

Sensitivity Analysis Metrics

Applications

Database Evaluation

Method Validation

Parameter Optimization

Interpretation Guidelines

Successful Classification Pattern

Classification Failure Indicators

Common Use Cases

Novel Organism Analysis

# Test how well a new isolate can be classified
./testsketch.sh new_isolate.fa 0 refseq minani=0.7

Database Quality Assessment

# Compare classification using different databases
./testsketch.sh test_seq.fq 562 refseq.sketch
./testsketch.sh test_seq.fq 562 nt.sketch
./testsketch.sh test_seq.fq 562 custom.sketch

Sensitivity Threshold Determination

# Test with different similarity thresholds
./testsketch.sh query.fa 1234 db.sketch minani=0.95
./testsketch.sh query.fa 1234 db.sketch minani=0.85
./testsketch.sh query.fa 1234 db.sketch minani=0.75
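
The three runs above differ only in the minani value, so they can be driven by a loop that saves each report to its own file. The following is a minimal sketch; the function name `sweep_minani`, the `TESTSKETCH` override variable, and the output file naming are all assumptions made for illustration.

```shell
#!/bin/bash
# Hypothetical wrapper: runs the pipeline once per minani value and writes
# each report to its own file. TESTSKETCH (path to the script) and the
# "minani_<value>.txt" naming are illustrative assumptions.
sweep_minani() {
    local query=$1 tid=$2 db=$3 outdir=$4
    shift 4                                   # remaining arguments are minani values
    local runner=${TESTSKETCH:-./testsketch.sh}
    local t
    mkdir -p "$outdir" || return 1
    for t in "$@"; do
        "$runner" "$query" "$tid" "$db" "minani=$t" > "$outdir/minani_$t.txt"
    done
}
```

For example, `sweep_minani query.fa 1234 db.sketch results 0.95 0.85 0.75` would leave three report files in `results/`, one per threshold, ready to be compared side by side.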

Performance Characteristics

Limitations and Considerations

Related Tools