SummarizeSketch

Basic Usage

summarizesketch.sh in=<file,file...> out=<file>

Alternatively, pass the stats files as bare arguments so the shell can expand a wildcard: summarizesketch.sh *.txt out=out.txt

Parameters

The parameters below control input and output, taxonomic filtering, and contamination analysis when summarizing BBSketch results.

Input/Output Parameters

in=<file>: A list of stats files, or a text file containing one stats file name per line. Multiple files can be specified with comma separation.
out=<file>: Destination for summary output. Default is stdout if not specified.
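When many stats files are involved, the list-file form of in= may be more convenient than a long comma-separated argument. The Python sketch below writes such a list, one path per line. The helper name write_file_list, the default "*.txt" pattern, and the file names are illustrative assumptions and are not part of BBTools.

```python
from pathlib import Path


def write_file_list(stats_dir, list_path, pattern="*.txt"):
    """Write one matching file path per line to list_path; return the paths.

    Hypothetical helper: it only prepares a text file in the
    "one stats file name per line" form described for in=.
    """
    list_path = Path(list_path)
    paths = sorted(
        str(p)
        for p in Path(stats_dir).glob(pattern)
        # Skip the list file itself in case it lives in the same directory.
        if p.resolve() != list_path.resolve()
    )
    list_path.write_text("".join(f"{p}\n" for p in paths))
    return paths
```

The resulting file (for example stats_files.txt) could then be supplied as in=stats_files.txt.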

Taxonomic Analysis Parameters

tree=: A TaxTree file for taxonomic analysis. Use "auto" to load the default tax tree file. Required for taxonomic filtering functionality.
level=genus: Taxonomic level at which a secondary hit is considered the same taxon as the primary hit and is therefore ignored as a contaminant. With the default, hits in the same genus as the primary hit are not reported as contaminants.

Contamination Detection Parameters

unique=f: Use the contaminant with the most unique hits rather than highest score when selecting the top contaminant. When true, prioritizes unique matches over overall similarity scores.

Output Control Parameters

header=t: (printheader) Include column headers in the output. Set to false to suppress header line for easier parsing.
printtotal=t: (pt) Print total statistics in the output. Controls whether summary totals are included in the final report.

Legacy Parameters

ignoresametaxa=f: Legacy parameter from SealStats. Ignore results with the same taxonomic assignment.
ignoresamebarcode=f: (ignoresameindex) Legacy parameter from SealStats. Ignore results with the same barcode or index.
ignoresamelocation=f: (ignoresameloc) Legacy parameter from SealStats. Ignore results from the same location.
totaldenominator=f: (usetotal, totald, td) Legacy parameter from SealStats. Use total count as denominator in calculations.

Examples

Basic Summarization

summarizesketch.sh in=sketch_results.txt out=summary.txt

Summarizes a single BBSketch output file into a tabular format.

Multiple Files with Taxonomic Analysis

summarizesketch.sh in=file1.txt,file2.txt,file3.txt out=combined_summary.txt tree=auto level=species

Processes multiple sketch result files with taxonomic filtering at the species level.

Contamination Analysis with Unique Hits

summarizesketch.sh *.txt out=contamination_report.txt tree=auto unique=t level=genus

Analyzes all text files for contamination using unique hit counts rather than similarity scores to identify contaminants.

Minimal Output Without Headers

summarizesketch.sh in=results.txt out=data_only.txt header=f

Generates summary output without column headers for direct data processing.

Output Format

The tool generates tab-delimited output with the following columns:

query - Query sequence name
seqs - Number of sequences in query
bases - Total bases in query
gSize - Estimated genome size
sketchLen - Length of the sketch
primaryHits - Number of hits for primary match
primaryUnique - Unique hits for primary match
primaryNoHit - Number of kmers with no hits
WKID - Weighted Kmer Identity
KID - Kmer Identity
ANI - Average Nucleotide Identity
Complt - Completeness percentage
Contam - Contamination percentage
TaxID - Taxonomic ID of primary hit
TaxName - Taxonomic name of primary hit
topContamID - Taxonomic ID of top contaminant
topContamName - Taxonomic name of top contaminant
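For downstream processing, the tab-delimited summary can be loaded with a few lines of Python. This is a minimal sketch that assumes the columns appear in exactly the order listed above and that the file contains no totals line. The function names read_summary and contaminated are hypothetical, and stripping a trailing "%" from numeric fields is a defensive assumption rather than documented behavior.

```python
# Column order as listed in the Output Format section.
COLUMNS = [
    "query", "seqs", "bases", "gSize", "sketchLen",
    "primaryHits", "primaryUnique", "primaryNoHit",
    "WKID", "KID", "ANI", "Complt", "Contam",
    "TaxID", "TaxName", "topContamID", "topContamName",
]


def read_summary(text, has_header=True):
    """Return one dict per data row, keyed by column name."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    if has_header:
        lines = lines[1:]  # drop the header row (use has_header=False with header=f)
    return [dict(zip(COLUMNS, ln.split("\t"))) for ln in lines]


def to_float(value):
    """Parse a numeric field, tolerating a trailing percent sign (assumption)."""
    return float(value.rstrip("%"))


def contaminated(rows, min_contam):
    """Keep rows whose Contam value is at least min_contam."""
    return [r for r in rows if to_float(r["Contam"]) >= min_contam]
```

For instance, contaminated(read_summary(open("summary.txt").read()), 1.0) would list the queries whose reported contamination is 1.0 or higher.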

Algorithm Details

SummarizeSketch processes BBSketch output files to create consolidated summaries with contamination analysis.

Input Processing

The tool parses BBSketch output files that contain query headers and result lines. Each query section begins with a "Query:" line containing metadata (sequence count, bases, genome size, sketch length) followed by tab-delimited result lines with similarity metrics and taxonomic information.
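The structure described above can be sketched as a small parser that groups lines into per-query sections. This is an illustration only, not the tool's actual parser: it assumes that header metadata is stored as tab-separated "Key: value" fields, and the function names are hypothetical. A column-header row, if the file has one, would simply appear as the first result row of its section.

```python
def parse_query_header(line):
    """Split a 'Query:' line into a dict of key -> value strings.

    Assumes tab-separated 'Key: value' fields; the exact field labels
    depend on the BBSketch output and are not fixed here.
    """
    meta = {}
    for field in line.rstrip("\n").split("\t"):
        key, sep, value = field.partition(":")
        if sep:  # ignore fields that carry no 'Key:' prefix
            meta[key.strip()] = value.strip()
    return meta


def split_sections(lines):
    """Group lines into (header_dict, result_rows) pairs, one per query."""
    sections = []
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            continue  # skip blank separator lines
        if line.startswith("Query:"):
            sections.append((parse_query_header(line), []))
        elif sections:
            # Tab-delimited result line belonging to the current query.
            sections[-1][1].append(line.split("\t"))
    return sections
```

Passing an open file handle to split_sections yields one entry per query, with the first data row of each section corresponding to the primary hit.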

Contamination Detection Strategy

The algorithm identifies potential contaminants by analyzing secondary hits in the sketch results:

Primary Hit Selection: The first result line is considered the primary match
Secondary Hit Analysis: Subsequent hits are evaluated as potential contaminants
Taxonomic Filtering: When a tax tree is provided, contaminants are filtered based on taxonomic distance from the primary hit at the specified level
Selection Criteria: Contaminants can be selected by highest score (default) or most unique hits (when unique=t)
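The four steps above can be sketched in Python as follows. The Hit class, its field names, and the same_taxa callback are assumptions made for illustration; the source only states that the first hit is primary, that later hits are candidates, and that the winner is chosen by score or by unique hits.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass
class Hit:
    """Hypothetical container for one result line."""
    tax_id: int
    name: str
    score: float   # similarity score used by default
    unique: int    # unique hit count, used when unique=t


def top_contaminant(
    hits: Sequence[Hit],
    use_unique: bool = False,
    same_taxa: Optional[Callable[[Hit, Hit], bool]] = None,
) -> Optional[Hit]:
    """Return the top contaminant among the secondary hits, or None."""
    if not hits:
        return None
    primary, secondary = hits[0], hits[1:]
    # Taxonomic filtering applies only when a comparison function is supplied,
    # mirroring the requirement that a tax tree be provided.
    candidates = [
        h for h in secondary
        if same_taxa is None or not same_taxa(primary, h)
    ]
    if not candidates:
        return None
    key = (lambda h: h.unique) if use_unique else (lambda h: h.score)
    return max(candidates, key=key)
```

With use_unique=True, a low-scoring hit with many unique matches can outrank a higher-scoring hit whose matches are mostly shared with the primary organism.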

Taxonomic Level Filtering

The level parameter controls contamination sensitivity by determining when two organisms are considered the "same taxa". The algorithm finds the common ancestor of the primary hit and a potential contaminant, then compares the rank of that ancestor to the specified level. If the common ancestor sits at the specified level or at a more specific one (for example, the same genus or the same species when level=genus), the hit is treated as the same taxon as the primary hit and is not reported as a contaminant.
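A toy version of the common-ancestor test may make this concrete. The sketch below interprets "same taxa" as: the common ancestor of the two hits has a rank at the requested level or a more specific one. The tree, node IDs, and simplified rank list are invented for illustration and do not reflect the TaxTree file format or its API.

```python
# Simplified rank order, most specific first (assumption for this sketch).
RANK_ORDER = {
    name: i
    for i, name in enumerate(
        ["species", "genus", "family", "order", "class", "phylum", "superkingdom"]
    )
}

# Toy tree: node id -> (parent id, rank). The root is its own parent.
TOY_TREE = {
    1: (1, "superkingdom"),
    10: (1, "family"),
    20: (10, "genus"),
    21: (10, "genus"),
    200: (20, "species"),
    201: (20, "species"),
    210: (21, "species"),
}


def lineage(tree, node):
    """Return the path from node up to the root, inclusive."""
    path = [node]
    while tree[node][0] != node:
        node = tree[node][0]
        path.append(node)
    return path


def common_ancestor(tree, a, b):
    """Return the most specific node shared by both lineages, or None."""
    seen = set(lineage(tree, a))
    for node in lineage(tree, b):
        if node in seen:
            return node
    return None


def same_taxa(tree, a, b, level):
    """True if a and b share an ancestor at `level` or a more specific rank."""
    ancestor = common_ancestor(tree, a, b)
    if ancestor is None:
        return False
    return RANK_ORDER[tree[ancestor][1]] <= RANK_ORDER[level]
```

In this toy tree, species 200 and 201 share genus 20, so they count as the same taxon at level=genus, while 200 and 210 only meet at family 10 and would remain contaminant candidates unless level=family is used.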

Output Generation

For each query, the tool combines metadata from the query header with metrics from the primary hit and top contaminant (if identified). Numerical values are formatted to two decimal places for consistency, and taxonomic names are preserved from the original sketch results.

Memory and Performance

The tool processes files sequentially and keeps memory use low by handling one query result set at a time. Large batches of stats files can be supplied through the file-list input format.

Support

For questions and support:

Email: bbushnell@lbl.gov
Documentation: bbmap.org