SummarizeSketch

Script: summarizesketch.sh
Package: sketch
Class: SummarizeSketchStats.java

Summarizes the output of BBSketch by processing sketch comparison results and generating tabular summaries with taxonomic analysis and contamination detection.

Basic Usage

summarizesketch.sh in=<file,file...> out=<file>

Alternatively, pass the input files as bare arguments: summarizesketch.sh *.txt out=out.txt

Parameters

Parameters control input/output processing, taxonomic filtering, and contamination analysis for BBSketch results summarization.

Input/Output Parameters

in=<file>
A comma-separated list of stats files, or a text file containing one stats file name per line.
out=<file>
Destination for summary output. Default is stdout if not specified.

Taxonomic Analysis Parameters

tree=
A TaxTree file for taxonomic analysis. Use "auto" to load the default tax tree file. Required for taxonomic filtering functionality.
level=genus
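Taxonomic level at which to ignore contaminants that share a taxon with the primary hit. For example, with level=genus, secondary hits in the same genus as the primary hit are not counted as contaminants. A more specific level (such as species) makes contamination detection more sensitive.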

Contamination Detection Parameters

unique=f
When selecting the top contaminant, use the one with the most unique hits rather than the one with the highest score.

Output Control Parameters

header=t
(printheader) Include column headers in the output. Set to false to suppress the header line for easier parsing.
printtotal=t
(pt) Print total statistics in the output. Controls whether summary totals are included in the final report.

Legacy Parameters

ignoresametaxa=f
Legacy parameter from SealStats. Ignore results with the same taxonomic assignment.
ignoresamebarcode=f
(ignoresameindex) Legacy parameter from SealStats. Ignore results with the same barcode or index.
ignoresamelocation=f
(ignoresameloc) Legacy parameter from SealStats. Ignore results from the same location.
totaldenominator=f
(usetotal, totald, td) Legacy parameter from SealStats. Use total count as denominator in calculations.

Examples

Basic Summarization

summarizesketch.sh in=sketch_results.txt out=summary.txt

Summarizes a single BBSketch output file into a tabular format.

Multiple Files with Taxonomic Analysis

summarizesketch.sh in=file1.txt,file2.txt,file3.txt out=combined_summary.txt tree=auto level=species

Processes multiple sketch result files with taxonomic filtering at the species level.

Contamination Analysis with Unique Hits

summarizesketch.sh *.txt out=contamination_report.txt tree=auto unique=t level=genus

Analyzes all text files for contamination using unique hit counts rather than similarity scores to identify contaminants.

Minimal Output Without Headers

summarizesketch.sh in=results.txt out=data_only.txt header=f

Generates summary output without column headers for direct data processing.

Output Format

The tool generates tab-delimited output with one row per query, preceded by a header line unless header=f is set.

Algorithm Details

SummarizeSketch processes BBSketch output files to create consolidated summaries with contamination analysis:

Input Processing

The tool parses BBSketch output files that contain query headers and result lines. Each query section begins with a "Query:" line containing metadata (sequence count, bases, genome size, sketch length) followed by tab-delimited result lines with similarity metrics and taxonomic information.
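
The sectioning described above can be sketched in Python. This is a minimal illustration, not the tool's own parser: it assumes only that a section starts at a line beginning with "Query:" and that the following non-empty lines are tab-delimited. The names QuerySection and split_queries, and the sample text, are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class QuerySection:
    """One query header plus the tab-delimited lines that follow it."""
    header: List[str]
    rows: List[List[str]] = field(default_factory=list)


def split_queries(text: str) -> List[QuerySection]:
    """Group lines into per-query sections.

    Assumption: a section starts at a line beginning with "Query:" and
    runs until the next such line. Blank lines are skipped.
    """
    sections: List[QuerySection] = []
    for line in text.splitlines():
        if not line.strip():
            continue
        if line.startswith("Query:"):
            sections.append(QuerySection(header=line.split("\t")))
        elif sections:
            sections[-1].rows.append(line.split("\t"))
    return sections


# Made-up sample text; real BBSketch output carries more fields.
sample = (
    "Query: sampleA\tSeqs: 2\tBases: 1000\n"
    "colA\tcolB\n"
    "1.00\tfoo\n"
    "\n"
    "Query: sampleB\tSeqs: 1\tBases: 500\n"
    "colA\tcolB\n"
)
sections = split_queries(sample)
```

Each section keeps its rows as raw string fields, so the caller decides which row is a column header and which columns to read.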

Contamination Detection Strategy

The algorithm identifies potential contaminants by analyzing secondary hits in the sketch results.
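
As a rough Python illustration of how the top contaminant might be chosen among secondary hits, the sketch below switches between score and unique-hit count in the way the unique flag is described. It assumes the first hit in the list is the primary hit; the Hit type, its field names, and the sample values are all hypothetical.

```python
from typing import List, NamedTuple, Optional


class Hit(NamedTuple):
    name: str
    score: float  # hypothetical overall similarity score
    unique: int   # hypothetical count of unique hits


def top_contaminant(hits: List[Hit], use_unique: bool = False) -> Optional[Hit]:
    """Pick the best secondary hit.

    Assumption: hits[0] is the primary hit, so only hits[1:] are
    candidate contaminants. Returns None when there are none.
    """
    secondary = hits[1:]
    if not secondary:
        return None
    if use_unique:
        return max(secondary, key=lambda h: h.unique)
    return max(secondary, key=lambda h: h.score)


# Made-up values chosen so the two modes disagree.
hits = [
    Hit("primary", 98.0, 500),
    Hit("contamA", 12.0, 3),
    Hit("contamB", 9.5, 40),
]
by_score = top_contaminant(hits)
by_unique = top_contaminant(hits, use_unique=True)
```

With these values, score-based selection picks contamA and unique-based selection picks contamB, which is the kind of difference unique=t is meant to expose.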

Taxonomic Level Filtering

The level parameter controls contamination sensitivity by determining when two organisms are considered "same taxa". The algorithm finds the common ancestor of the primary hit and a potential contaminant, then compares the rank of that ancestor to the specified level. If the common ancestor is at the specified level or more specific (for example, the two hits share a genus when level=genus), the secondary hit is treated as the same taxon and filtered out as a non-contaminant.
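
The common-ancestor check can be illustrated with a toy taxonomy in Python. This sketch assumes that two hits count as the same taxon when their common ancestor sits at the chosen level or at a more specific one. The rank list, TOY_TREE, and the function names are hypothetical and do not reflect the TaxTree file format.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical rank order, most specific first.
RANKS = ["species", "genus", "family", "order", "class", "phylum"]

# Toy taxonomy: node -> (parent, rank). Not real TaxTree data.
TOY_TREE: Dict[str, Tuple[Optional[str], str]] = {
    "Escherichia coli": ("Escherichia", "species"),
    "Escherichia albertii": ("Escherichia", "species"),
    "Salmonella enterica": ("Salmonella", "species"),
    "Escherichia": ("Enterobacteriaceae", "genus"),
    "Salmonella": ("Enterobacteriaceae", "genus"),
    "Enterobacteriaceae": (None, "family"),
}


def lineage(node: Optional[str]) -> List[str]:
    """Return the node followed by its ancestors up to the root."""
    path = []
    while node is not None:
        path.append(node)
        node = TOY_TREE[node][0]
    return path


def common_ancestor(a: str, b: str) -> Optional[str]:
    """Return the most specific node shared by both lineages."""
    seen = set(lineage(a))
    for node in lineage(b):
        if node in seen:
            return node
    return None


def same_taxa(a: str, b: str, level: str = "genus") -> bool:
    """True if the common ancestor is at `level` or more specific."""
    ancestor = common_ancestor(a, b)
    if ancestor is None:
        return False
    return RANKS.index(TOY_TREE[ancestor][1]) <= RANKS.index(level)
```

Under these assumptions, two Escherichia species are the same taxon at level=genus but not at level=species, while an Escherichia and a Salmonella hit only become the same taxon at level=family.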

Output Generation

For each query, the tool combines metadata from the query header with metrics from the primary hit and top contaminant (if identified). Numerical values are formatted to two decimal places for consistency, and taxonomic names are preserved from the original sketch results.
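
A small Python sketch of the formatting step described above: fields are joined with tabs, and floating-point values are written with two decimal places. The function name format_row and the sample fields are hypothetical; the real column set is not assumed here.

```python
from typing import Iterable


def format_row(fields: Iterable[object]) -> str:
    """Join fields with tabs; floats are written with two decimals."""
    parts = []
    for value in fields:
        if isinstance(value, float):
            parts.append(f"{value:.2f}")
        else:
            parts.append(str(value))
    return "\t".join(parts)


# Made-up row: query name, sequence count, a score, a taxon name, a ratio.
row = format_row(["sampleA", 2, 98.456, "Escherichia coli", 1.5])
```

Integers and names pass through unchanged, so only the numeric metrics are rounded.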

Memory and Performance

The tool processes files sequentially and keeps a small memory footprint by handling one query result set at a time. Large batches are supported through the file list input format, so hundreds of sketch result files can be summarized in a single run.
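
For large batches, the file list can be generated programmatically. The Python sketch below writes one stats file name per line, as the in= description specifies, and builds the matching argument list. The helper names and file names are hypothetical, and the command is only constructed, not executed.

```python
import tempfile
from pathlib import Path
from typing import Iterable, List


def write_file_list(stats_files: Iterable[str], list_path: Path) -> None:
    """Write one stats file name per line to list_path."""
    list_path.write_text("".join(f"{name}\n" for name in stats_files))


def build_command(list_path: Path, out_path: str) -> List[str]:
    """Build the argument list for a summarizesketch.sh run."""
    return ["summarizesketch.sh", f"in={list_path}", f"out={out_path}"]


with tempfile.TemporaryDirectory() as tmp:
    list_path = Path(tmp) / "stats_files.txt"
    write_file_list(["a_stats.txt", "b_stats.txt"], list_path)
    listed = list_path.read_text().splitlines()
    command = build_command(list_path, "summary.txt")
```

Passing the resulting list to subprocess.run(command, check=True) would launch the tool, assuming summarizesketch.sh is on the PATH and the list file still exists at that point.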

Support

For questions and support: