AnalyzeSketchResults

Script: analyzesketchresults.sh
Package: sketch
Class: AnalyzeSketchResults.java

Analyzes sketch comparison results in (query, ref, ANI) format to generate taxonomic accuracy assessments and ANI/AAI comparisons. Processes BBSketch, Mash, SourMash, or BLAST output to compute per-taxonomic-level averages, accuracy metrics, and correlations between average nucleotide identity (ANI) and average amino acid identity (AAI).

Basic Usage

analyzesketchresults.sh in=<file> out=<outfile>

Input file should contain sketch comparison results in 3-column format (query, reference, ANI) from BBSketch, Mash, or other sketching tools.
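
To make the 3-column layout concrete, here is a minimal Python sketch of reading such records. It is not the tool's parser: the tab delimiter, the skipping of blank and '#' lines, and the names SketchRecord and parse_records are assumptions made for illustration only.

```python
from typing import Iterable, Iterator, NamedTuple


class SketchRecord(NamedTuple):
    """One comparison: query name, reference name, ANI/similarity value."""
    query: str
    ref: str
    ani: float


def parse_records(lines: Iterable[str]) -> Iterator[SketchRecord]:
    """Yield records from text lines holding query, reference, and ANI.

    Assumptions (not confirmed by the tool's documentation): fields are
    tab-separated; blank lines and lines starting with '#' are skipped.
    """
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) < 3:
            continue  # malformed line; a real parser might report it instead
        yield SketchRecord(fields[0], fields[1], float(fields[2]))
```

Because the function accepts any iterable of strings, it can be fed an open file handle or an in-memory list.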

Parameters

Parameters are organized by input/output files, format parsing modes, processing options, file handling options, and Java runtime settings.

Input/Output Parameters

in=<file>
Required input file of sketch results in 3-column format (query, reference, ANI/similarity).
in2=<file>
Optional second input file of sketch results in amino mode for AAI comparison. Used to generate ANI vs AAI correlation plots.
out=stdout.txt
Output file for summary of per-taxonomic-level averages including mean ANI, SSU similarity, standard deviations, and sample counts.
outaccuracy=<file>
Output file for taxonomic accuracy results; requires query sequences to have taxonomic IDs for validation against known classifications.
outmap=<file>
Output file for ANI vs AAI correlation data. Requires in2 parameter to provide amino acid similarity results for comparison.
outbad=<file>
Output file for records that failed processing or had taxonomic classification errors.

Format Parsing Modes

bbsketch
Parse BBSketch output format (default). Expects standard 3-column format with query name, reference name, and similarity score.
mash
Parse Mash output format. Input file names should follow the convention tid_511145_Escherichia_coli_str._K-12_substr._MG1655.fa.gz, where the number after the tid_ prefix is the taxonomic ID.
sourmash
Parse SourMash output format for sketch comparison results.
blast
Parse BLAST output format. This mode is still under development.
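
The tid-prefixed file name described for mash mode can be decoded with a short regular expression. The following Python sketch is only an illustration of that naming convention; the helper name tid_from_name is hypothetical and is not part of the tool.

```python
import re
from typing import Optional

# Matches a leading "tid_<digits>_" as in the naming convention above.
_TID_PATTERN = re.compile(r"^tid_(\d+)_")


def tid_from_name(name: str) -> Optional[int]:
    """Return the taxonomic ID encoded in a file name, or None if absent."""
    base = name.rsplit("/", 1)[-1]  # drop any directory prefix
    match = _TID_PATTERN.match(base)
    return int(match.group(1)) if match else None


example = "tid_511145_Escherichia_coli_str._K-12_substr._MG1655.fa.gz"
print(tid_from_name(example))  # prints 511145
```

Names without the prefix return None, which lets a caller decide whether to skip the record or report it.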

Processing Options

tree=<file>
Taxonomy tree file for taxonomic level resolution and accuracy assessment.
16S=<file>
16S ribosomal SSU reference file for small subunit RNA similarity calculations.
18S=<file>
18S ribosomal SSU reference file for eukaryotic small subunit RNA analysis.
lines=<number>
Maximum number of lines to process from input file. Use -1 or omit for unlimited processing.
minsamples=1
Minimum number of samples required at a taxonomic level to include in statistical summaries.
shrinkonly=f
Enable shrink-only mode for data compression and filtering without full analysis.
verbose=f
Enable verbose output for detailed processing information and debugging.

File Handling Options

ow=f
(overwrite) Set to true to allow existing output files to be overwritten.
app=f
(append) Append results to existing output files rather than overwriting.

Java Runtime Parameters

-Xmx
Sets Java heap memory usage, overriding autodetection. Examples: -Xmx20g (20 GB RAM), -Xmx200m (200 MB). Maximum is typically 85% of physical memory.
-eoom
Exit process if an out-of-memory exception occurs. Requires Java 8u92 or later.
-da
Disable Java assertions for potentially improved performance in production runs.

Examples

Basic Taxonomic Analysis

analyzesketchresults.sh in=sketch_results.txt out=tax_summary.txt

Analyzes BBSketch results to generate per-taxonomic-level statistics including mean ANI and standard deviations.
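
As a rough illustration of per-level statistics, the Python sketch below groups ANI values by taxonomic level and reports mean, standard deviation, and sample count. It assumes the level of each comparison has already been resolved (for example from a taxonomy tree); the level names, the function name summarize_by_level, and the use of the population standard deviation are assumptions, and the tool's own estimator may differ.

```python
from collections import defaultdict
from statistics import mean, pstdev
from typing import Dict, Iterable, Tuple


def summarize_by_level(
    pairs: Iterable[Tuple[str, float]], min_samples: int = 1
) -> Dict[str, Tuple[float, float, int]]:
    """Map each level to (mean ANI, population std dev, sample count).

    Levels with fewer than min_samples values are left out, in the
    spirit of the minsamples parameter.
    """
    grouped = defaultdict(list)
    for level, ani in pairs:
        grouped[level].append(ani)

    summary = {}
    for level, values in grouped.items():
        if len(values) < min_samples:
            continue
        summary[level] = (mean(values), pstdev(values), len(values))
    return summary
```

Raising min_samples drops sparsely populated levels from the result rather than reporting a standard deviation based on one or two values.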

ANI vs AAI Correlation

analyzesketchresults.sh in=dna_sketch.txt in2=protein_sketch.txt outmap=ani_aai_correlation.txt

Compares nucleotide and amino acid similarity results to assess correlation between ANI and AAI values.
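
One way to think about this comparison is to match records from the two inputs by their (query, reference) pair and then measure how closely the two values track each other. The Python sketch below does this with a Pearson coefficient. Both the pairing key and the choice of Pearson are assumptions for illustration; the documentation does not state which statistic the tool reports or how outmap is laid out.

```python
from math import sqrt
from typing import Dict, List, Tuple

Key = Tuple[str, str]  # (query, reference)


def pair_ani_aai(ani: Dict[Key, float], aai: Dict[Key, float]) -> List[Tuple[float, float]]:
    """Return (ANI, AAI) pairs for comparisons present in both inputs."""
    return [(ani[k], aai[k]) for k in sorted(ani.keys() & aai.keys())]


def pearson(pairs: List[Tuple[float, float]]) -> float:
    """Pearson correlation coefficient of a list of (x, y) pairs."""
    n = len(pairs)
    if n < 2:
        raise ValueError("need at least two pairs")
    mx = sum(x for x, _ in pairs) / n
    my = sum(y for _, y in pairs) / n
    sxy = sum((x - mx) * (y - my) for x, y in pairs)
    sxx = sum((x - mx) ** 2 for x, _ in pairs)
    syy = sum((y - my) ** 2 for _, y in pairs)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation undefined for constant input")
    return sxy / sqrt(sxx * syy)
```

Comparisons that appear in only one of the two inputs are ignored by pair_ani_aai, so the coefficient reflects only genome pairs measured in both modes.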

Accuracy Assessment with SSU

analyzesketchresults.sh in=query_results.txt outaccuracy=accuracy.txt 16S=16s_refs.fa tree=taxonomy.tree

Evaluates taxonomic classification accuracy using 16S SSU sequences and a taxonomic tree for validation.

Mash Format Processing

analyzesketchresults.sh in=mash_output.txt out=mash_analysis.txt mash minsamples=5

Processes Mash-formatted results and includes a taxonomic level in the statistical summary only if it has at least 5 samples.

Algorithm Details

Multi-Format Parsing Strategy

AnalyzeSketchResults employs format-specific parsers to handle different sketching tool outputs: BBSketch (the default), Mash, SourMash, and BLAST (still under development).

Taxonomic Level Aggregation

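The tool processes results across multiple taxonomic levels simultaneously using TaxTree integration. The tree parameter supplies the taxonomy tree, and minsamples sets how many samples a level needs before it appears in the statistical summary.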

SSU Integration and Validation

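Small subunit (SSU) rRNA references supplied through the 16S and 18S parameters are used for SSU similarity calculations, which are reported alongside ANI in the per-level summary.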

ANI-AAI Correlation Analysis

When both nucleotide (in) and amino acid (in2) sketch results are provided, the tool can write ANI vs AAI correlation data to the file given by outmap.

Memory Management and Scalability

Optimized for processing large-scale sketch comparison datasets. The lines parameter caps how much of the input is read, and -Xmx sets the Java heap size.

Output Format Flexibility

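Multiple outputs serve different analysis needs: the per-level summary (out), taxonomic accuracy results (outaccuracy), ANI vs AAI correlation data (outmap), and records that failed processing or were misclassified (outbad).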

Support

For questions and support: