AnalyzeSketchResults
Analyzes sketch comparison results in 3-column (query, ref, ANI) format to assess taxonomic accuracy and compare ANI with AAI. Parses BBSketch, Mash, or SourMash output (BLAST parsing is listed as under development) to compute per-taxonomic-level averages, accuracy metrics, and the correlation between average nucleotide identity (ANI) and average amino acid identity (AAI).
Basic Usage
analyzesketchresults.sh in=<file> out=<outfile>
Input file should contain sketch comparison results in 3-column format (query, reference, ANI) from BBSketch, Mash, or other sketching tools.
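As an illustration of the 3-column input described above, here is a minimal Python sketch of a reader for such lines. It is not the tool's parser: the tab delimiter, the skipping of blank lines and "#" header lines, and the names SketchRecord and parse_three_column are all assumptions made for the example. The max_lines argument mimics the documented lines= option, with -1 meaning unlimited.

```python
from typing import Iterable, Iterator, NamedTuple


class SketchRecord(NamedTuple):
    """One (query, reference, ANI) row; a hypothetical type for this example."""
    query: str
    ref: str
    ani: float


def parse_three_column(lines: Iterable[str], max_lines: int = -1) -> Iterator[SketchRecord]:
    """Yield records from tab-delimited (query, ref, ANI) lines.

    Assumptions: fields are tab-separated; blank lines and lines starting
    with '#' are skipped; max_lines < 0 means no limit.
    """
    emitted = 0
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue
        if max_lines >= 0 and emitted >= max_lines:
            break
        fields = line.split("\t")
        if len(fields) < 3:
            continue  # malformed row; a real tool might log it instead
        yield SketchRecord(fields[0], fields[1], float(fields[2]))
        emitted += 1
```

Reading from a file would look like `records = list(parse_three_column(open("sketch_results.txt")))`, where the filename is only a placeholder.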
Parameters
Parameters are organized by input/output files, format parsing modes, processing options, and Java runtime settings.
Input/Output Parameters
- in=<file>
- Required input file of sketch results in 3-column format (query, reference, ANI/similarity).
- in2=<file>
- Optional second input file of sketch results produced in amino mode, supplying the AAI values. Used to generate ANI vs AAI correlation data.
- out=stdout.txt
- Output file for summary of per-taxonomic-level averages including mean ANI, SSU similarity, standard deviations, and sample counts.
- outaccuracy=<file>
- Output file for taxonomic accuracy results; requires query sequences to have taxonomic IDs for validation against known classifications.
- outmap=<file>
- Output file for ANI vs AAI correlation data. Requires in2 parameter to provide amino acid similarity results for comparison.
- outbad=<file>
- Output file for records that failed processing or had taxonomic classification errors.
Format Parsing Modes
- bbsketch
- Parse BBSketch output format (default). Expects standard 3-column format with query name, reference name, and similarity score.
- mash
- Parse Mash output format. Input files should follow Mash naming convention: tid_511145_Escherichia_coli_str._K-12_substr._MG1655.fa.gz where tid prefix indicates taxonomic ID.
- sourmash
- Parse SourMash output format for sketch comparison results.
- blast
- Parse BLAST output format. This mode is still under development (marked TODO).
Processing Options
- tree=<file>
- Taxonomy tree file for taxonomic level resolution and accuracy assessment.
- 16S=<file>
- 16S ribosomal SSU reference file for small subunit RNA similarity calculations.
- 18S=<file>
- 18S ribosomal SSU reference file for eukaryotic small subunit RNA analysis.
- lines=<number>
- Maximum number of lines to process from input file. Use -1 or omit for unlimited processing.
- minsamples=1
- Minimum number of samples required at a taxonomic level to include in statistical summaries.
- shrinkonly=f
- Enable shrink-only mode for data compression and filtering without full analysis.
- verbose=f
- Enable verbose output for detailed processing information and debugging.
File Handling Options
- ow=f
- (overwrite) When true, overwrites existing output files. Default is false.
- app=f
- (append) Append results to existing output files rather than overwriting.
Java Runtime Parameters
- -Xmx
- Sets Java heap memory usage, overriding autodetection. Examples: -Xmx20g (20 GB RAM), -Xmx200m (200 MB). Maximum is typically 85% of physical memory.
- -eoom
- Exit process if an out-of-memory exception occurs. Requires Java 8u92 or later.
- -da
- Disable Java assertions for potentially improved performance in production runs.
Examples
Basic Taxonomic Analysis
analyzesketchresults.sh in=sketch_results.txt out=tax_summary.txt
Analyzes BBSketch results to generate per-taxonomic-level statistics including mean ANI and standard deviations.
ANI vs AAI Correlation
analyzesketchresults.sh in=dna_sketch.txt in2=protein_sketch.txt outmap=ani_aai_correlation.txt
Compares nucleotide and amino acid similarity results to assess correlation between ANI and AAI values.
Accuracy Assessment with SSU
analyzesketchresults.sh in=query_results.txt outaccuracy=accuracy.txt 16S=16s_refs.fa tree=taxonomy.tree
Evaluates taxonomic classification accuracy using 16S SSU sequences and a taxonomic tree for validation.
Mash Format Processing
analyzesketchresults.sh in=mash_output.txt out=mash_analysis.txt mash minsamples=5
Processes Mash-formatted results and includes a taxonomic level in the summary only if it has at least 5 samples.
Algorithm Details
Multi-Format Parsing Strategy
AnalyzeSketchResults employs format-specific parsers to handle different sketching tool outputs:
- BBSketch Mode: Standard 3-column parsing with query name, reference name, and similarity score
- Mash Mode: Extracts taxonomic IDs from filename prefixes (tid_XXXXX format) for automated taxonomic assignment
- SourMash Mode: Handles SourMash-specific output format and metadata
- BLAST Mode: Planned support for BLAST tabular output formats
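The Mash-mode naming convention (a tid_ prefix followed by the numeric taxonomic ID) can be illustrated with a short Python sketch. The regular expression and the function name extract_tid are assumptions for the example, not the tool's actual code; the sample filename is the one given in the mash parameter description.

```python
import re
from typing import Optional

# Matches "tid_<digits>_" at the start of the string or right after a "/".
_TID_PATTERN = re.compile(r"(?:^|/)tid_(\d+)_")


def extract_tid(filename: str) -> Optional[int]:
    """Return the taxonomic ID embedded in a tid_-prefixed filename, or None."""
    match = _TID_PATTERN.search(filename)
    return int(match.group(1)) if match else None


name = "tid_511145_Escherichia_coli_str._K-12_substr._MG1655.fa.gz"
print(extract_tid(name))  # 511145
```

Names without the prefix return None, which a caller could use to route the record to a rejects file.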
Taxonomic Level Aggregation
The tool processes results across multiple taxonomic levels simultaneously using TaxTree integration:
- Level Processing: Computes statistics for strain, species, genus, family, order, class, phylum, superkingdom, and life levels
- Statistical Calculation: Generates mean, standard deviation, and sample counts for ANI and SSU similarity at each level
- Filtering: Applies the minsamples threshold so that levels with too few samples are left out of the summary
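The aggregation and filtering steps above can be sketched in Python as follows. This is an illustration only: it assumes each record has already been assigned a taxonomic level (the tool resolves this through its taxonomy tree, which is not reproduced here), and it uses the population standard deviation, since the documentation does not say which variant the tool reports. The function name summarize_by_level is hypothetical.

```python
from collections import defaultdict
from statistics import mean, pstdev
from typing import Dict, Iterable, Tuple


def summarize_by_level(
    records: Iterable[Tuple[str, float]], min_samples: int = 1
) -> Dict[str, Tuple[float, float, int]]:
    """Map each level to (mean ANI, standard deviation, sample count).

    records: (level, ani) pairs, with the level resolved beforehand.
    Levels with fewer than min_samples values are omitted.
    """
    by_level = defaultdict(list)
    for level, ani in records:
        by_level[level].append(ani)

    summary = {}
    for level, values in by_level.items():
        if len(values) < min_samples:
            continue  # mirrors the minsamples threshold
        summary[level] = (mean(values), pstdev(values), len(values))
    return summary
```

With min_samples=2, a level that has a single observation is dropped rather than reported with a zero standard deviation.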
SSU Integration and Validation
Small Subunit RNA processing for enhanced taxonomic validation:
- Dual SSU Support: Handles both 16S (prokaryotic) and 18S (eukaryotic) reference databases
- Threading Strategy: Uses multi-threaded processing for SSU similarity calculations across large datasets
- Accuracy Assessment: Compares sketch-based classifications against SSU-based assignments for validation
ANI-AAI Correlation Analysis
When both nucleotide and amino acid sketch results are provided:
- Paired Analysis: Matches corresponding query-reference pairs between DNA and protein sketch results
- Correlation Mapping: Generates correlation data between Average Nucleotide Identity (ANI) and Average Amino Acid Identity (AAI)
- Cross-Validation: Identifies discrepancies between nucleotide and protein-based similarity assessments
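A minimal Python sketch of the pairing and correlation steps follows. It assumes both result sets have been loaded into dictionaries keyed by (query, reference), and it uses the Pearson coefficient as one common choice; the documentation does not state which correlation statistic the tool computes. The names pair_ani_aai and pearson are hypothetical.

```python
from math import sqrt
from typing import Dict, List, Optional, Sequence, Tuple

Pair = Tuple[str, str]


def pair_ani_aai(
    ani: Dict[Pair, float], aai: Dict[Pair, float]
) -> List[Tuple[Pair, float, float]]:
    """Return (pair, ANI, AAI) for every query-reference pair present in both inputs."""
    shared = sorted(ani.keys() & aai.keys())
    return [(key, ani[key], aai[key]) for key in shared]


def pearson(xs: Sequence[float], ys: Sequence[float]) -> Optional[float]:
    """Pearson correlation, or None when it is undefined (n < 2 or zero variance)."""
    n = len(xs)
    if n < 2 or n != len(ys):
        return None
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return None
    return cov / sqrt(vx * vy)
```

Pairs that appear in only one input are left out of the paired list; a caller could report them separately to highlight discrepancies between the nucleotide and protein results.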
Memory Management and Scalability
Optimized for processing large-scale sketch comparison datasets:
- Streaming Processing: Processes input files line-by-line to minimize memory footprint
- HashMap Optimization: Uses efficient key-value storage for ANI and AAI mapping with Long keys for query-reference pairs
- Configurable Limits: Supports line limits and sample filtering to manage processing scope
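To illustrate how a single 64-bit key can stand for a query-reference pair, the Python sketch below packs two numeric IDs into one integer. The bit layout (query in the high 32 bits, reference in the low 32 bits) and the assumption that both IDs are non-negative and fit in 32 bits are choices made for this example; the documentation does not specify how the tool builds its keys.

```python
from typing import Tuple

MASK32 = 0xFFFFFFFF  # largest value that fits in 32 bits


def pack_pair(query_id: int, ref_id: int) -> int:
    """Combine two IDs into one integer key (assumed layout: query high, ref low)."""
    if not (0 <= query_id <= MASK32 and 0 <= ref_id <= MASK32):
        raise ValueError("IDs must be non-negative and fit in 32 bits")
    return (query_id << 32) | ref_id


def unpack_pair(key: int) -> Tuple[int, int]:
    """Recover (query_id, ref_id) from a key produced by pack_pair."""
    return key >> 32, key & MASK32


# A dictionary keyed by the packed value holds one entry per ordered pair.
ani_by_pair = {pack_pair(511145, 562): 99.2}  # illustrative IDs and value
```

Using one integer per pair avoids allocating a tuple or string key for every comparison, which matters when millions of pairs are held in memory. Note that the key is order-sensitive: (1, 2) and (2, 1) map to different values.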
Output Format Flexibility
Multiple output modes for different analysis needs:
- Summary Statistics: Tab-delimited per-level averages with taxonomic rank information
- Accuracy Tables: Detailed classification correctness assessment with SSU validation
- Correlation Data: ANI vs AAI scatter plot data for visualization and correlation analysis
Support
For questions and support:
- Email: bbushnell@lbl.gov
- Documentation: bbmap.org