SummarizeSketch
Summarizes the output of BBSketch by processing sketch comparison results and generating tabular summaries with taxonomic analysis and contamination detection.
Basic Usage
summarizesketch.sh in=<file,file...> out=<file>
Alternatively, pass the input files as bare arguments: summarizesketch.sh *.txt out=out.txt
Parameters
Parameters control input/output processing, taxonomic filtering, and contamination analysis for BBSketch results summarization.
Input/Output Parameters
- in=<file>
- A list of stats files, or a text file containing one stats file name per line. Multiple files can be specified with comma separation.
- out=<file>
- Destination for summary output. Defaults to stdout.
Taxonomic Analysis Parameters
- tree=
- A TaxTree file for taxonomic analysis. Use "auto" to load the default tax tree file. Required for taxonomic filtering functionality.
- level=genus
- Taxonomic level at which to ignore contaminants with the same taxonomy as the primary hit. For example, with level=genus, a secondary hit in the same genus as the primary hit is not reported as a contaminant. Requires a tax tree.
Contamination Detection Parameters
- unique=f
- Select the top contaminant by the most unique hits rather than by the highest score. When true, unique matches take priority over overall similarity scores.
Output Control Parameters
- header=t
- (printheader) Include column headers in the output. Set to false to suppress the header line for easier parsing.
- printtotal=t
- (pt) Print total statistics in the output. Controls whether summary totals are included in the final report.
Legacy Parameters
- ignoresametaxa=f
- Legacy parameter from SealStats. Ignore results with the same taxonomic assignment.
- ignoresamebarcode=f
- (ignoresameindex) Legacy parameter from SealStats. Ignore results with the same barcode or index.
- ignoresamelocation=f
- (ignoresameloc) Legacy parameter from SealStats. Ignore results from the same location.
- totaldenominator=f
- (usetotal, totald, td) Legacy parameter from SealStats. Use total count as denominator in calculations.
Examples
Basic Summarization
summarizesketch.sh in=sketch_results.txt out=summary.txt
Summarizes a single BBSketch output file into a tabular format.
Multiple Files with Taxonomic Analysis
summarizesketch.sh in=file1.txt,file2.txt,file3.txt out=combined_summary.txt tree=auto level=species
Processes multiple sketch result files with taxonomic filtering at the species level.
Contamination Analysis with Unique Hits
summarizesketch.sh *.txt out=contamination_report.txt tree=auto unique=t level=genus
Analyzes all text files for contamination using unique hit counts rather than similarity scores to identify contaminants.
Minimal Output Without Headers
summarizesketch.sh in=results.txt out=data_only.txt header=f
Generates summary output without column headers for direct data processing.
Output Format
The tool generates tab-delimited output with the following columns:
- query - Query sequence name
- seqs - Number of sequences in query
- bases - Total bases in query
- gSize - Estimated genome size
- sketchLen - Length of the sketch
- primaryHits - Number of hits for primary match
- primaryUnique - Unique hits for primary match
- primaryNoHit - Number of kmers with no hits
- WKID - Weighted Kmer Identity
- KID - Kmer Identity
- ANI - Average Nucleotide Identity
- Complt - Completeness percentage
- Contam - Contamination percentage
- TaxID - Taxonomic ID of primary hit
- TaxName - Taxonomic name of primary hit
- topContamID - Taxonomic ID of top contaminant
- topContamName - Taxonomic name of top contaminant
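The tab-delimited summary can be loaded with any TSV reader. The following is a minimal Python sketch, assuming the file was written with header=t and that the header names match the column list above. The helper name read_summary and the tolerance for a leading "#" on the header line are assumptions for illustration, and the example rows are invented rather than real tool output.

```python
import csv
import io

def read_summary(handle):
    """Parse a tab-delimited summary with a header line into a list of dicts."""
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    # Assumption: tolerate a leading '#' on the header line, if one is present.
    header[0] = header[0].lstrip("#")
    return [dict(zip(header, row)) for row in reader if row]

# Invented example data using a subset of the documented columns.
example = (
    "query\tANI\tContam\tTaxName\n"
    "sample1\t99.10\t0.50\tEscherichia coli\n"
)
rows = read_summary(io.StringIO(example))
print(rows[0]["TaxName"])
```

All values are returned as strings; convert numeric columns such as ANI or Contam with float() as needed.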
Algorithm Details
SummarizeSketch processes BBSketch output files to create consolidated summaries with contamination analysis:
Input Processing
The tool parses BBSketch output files that contain query headers and result lines. Each query section begins with a "Query:" line containing metadata (sequence count, bases, genome size, sketch length) followed by tab-delimited result lines with similarity metrics and taxonomic information.
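As a rough illustration of this sectioned layout, the Python sketch below groups lines into per-query sections by looking for the "Query:" prefix. It is not the tool's parser: the function name split_queries is hypothetical, the fields inside the query line are left unparsed because their exact layout is not described here, and any column-header line that real output places between the query line and the results would need to be skipped separately.

```python
def split_queries(lines):
    """Group lines into (query_header, result_rows) sections."""
    sections = []
    current = None
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            continue  # skip blank separator lines
        if line.startswith("Query:"):
            current = (line, [])
            sections.append(current)
        elif current is not None:
            # Result lines are tab-delimited; keep the fields as strings.
            current[1].append(line.split("\t"))
    return sections

# Invented input that only mimics the overall shape, not real field layout.
text = "Query: sampleA\nhit1\t98.5\nhit2\t12.0\n\nQuery: sampleB\nhit3\t77.0\n"
sections = split_queries(text.splitlines())
print(len(sections))
```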
Contamination Detection Strategy
The algorithm identifies potential contaminants by analyzing secondary hits in the sketch results:
- Primary Hit Selection: The first result line is considered the primary match
- Secondary Hit Analysis: Subsequent hits are evaluated as potential contaminants
- Taxonomic Filtering: When a tax tree is provided, contaminants are filtered based on taxonomic distance from the primary hit at the specified level
- Selection Criteria: Contaminants can be selected by highest score (default) or most unique hits (when unique=t)
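The selection step described above can be sketched in Python as follows. This is an illustration under stated assumptions rather than the tool's implementation: the Hit class, its field names, and the function top_contaminant are hypothetical, the numbers are invented, and taxonomic filtering is omitted.

```python
from dataclasses import dataclass

@dataclass
class Hit:
    name: str
    score: float      # hypothetical stand-in for the similarity score
    unique_hits: int  # hypothetical stand-in for the unique hit count

def top_contaminant(hits, unique=False):
    """Return the best secondary hit, or None if only a primary hit exists.

    hits[0] is treated as the primary match; the rest are candidates.
    """
    candidates = hits[1:]
    if not candidates:
        return None
    key = (lambda h: h.unique_hits) if unique else (lambda h: h.score)
    return max(candidates, key=key)

hits = [
    Hit("primary", 980.0, 900),
    Hit("contamA", 120.0, 5),
    Hit("contamB", 60.0, 40),
]
by_score = top_contaminant(hits)
by_unique = top_contaminant(hits, unique=True)
print(by_score.name, by_unique.name)
```

In this invented data the two criteria disagree: contamA has the higher score, while contamB has more unique hits, which is the situation unique=t is meant to address.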
Taxonomic Level Filtering
The level parameter controls contamination sensitivity by determining when two organisms count as the same taxon. The algorithm finds the common ancestor of the primary hit and each potential contaminant, then compares that ancestor's rank to the specified level. A hit whose common ancestor with the primary hit is at the specified level or a more specific one (for example, a shared genus or species when level=genus) is treated as the same taxon and is not reported as a contaminant.
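A toy version of this check is sketched below in Python. It assumes that two hits count as the same taxon when their common ancestor's rank is the specified level or a more specific one, which is one reading of the level parameter's description. The tree, node names, rank list, and function names are all invented for illustration and do not reflect the TaxTree file format.

```python
# Rank order from most specific to least specific (illustrative subset).
RANKS = ["species", "genus", "family", "order"]

# Toy tree: node -> (parent, rank). All identifiers are invented.
TREE = {
    "ecoli": ("escherichia", "species"),
    "efergusonii": ("escherichia", "species"),
    "salmonella_enterica": ("salmonella", "species"),
    "escherichia": ("enterobacteriaceae", "genus"),
    "salmonella": ("enterobacteriaceae", "genus"),
    "enterobacteriaceae": (None, "family"),
}

def lineage(node):
    """Return the path from a node up to the root, starting with the node."""
    path = []
    while node is not None:
        path.append(node)
        node = TREE[node][0]
    return path

def common_ancestor(a, b):
    """Return the most specific node shared by both lineages, or None."""
    ancestors_of_a = set(lineage(a))
    for node in lineage(b):
        if node in ancestors_of_a:
            return node
    return None

def same_taxa(a, b, level="genus"):
    """True if the common ancestor's rank is `level` or more specific."""
    anc = common_ancestor(a, b)
    if anc is None:
        return False
    return RANKS.index(TREE[anc][1]) <= RANKS.index(level)

same_genus = same_taxa("ecoli", "efergusonii", level="genus")
cross_genus = same_taxa("ecoli", "salmonella_enterica", level="genus")
cross_genus_family = same_taxa("ecoli", "salmonella_enterica", level="family")
print(same_genus, cross_genus, cross_genus_family)
```

Under these assumptions, raising level from genus to family makes the filter stricter about what counts as a contaminant: the Salmonella hit is a candidate contaminant at level=genus but is filtered out at level=family.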
Output Generation
For each query, the tool combines metadata from the query header with metrics from the primary hit and top contaminant (if identified). Numerical values are formatted to two decimal places for consistency, and taxonomic names are preserved from the original sketch results.
Memory and Performance
The tool processes files sequentially and keeps memory use low by handling one query result set at a time. Large batches are supported through the file list input format (a text file with one stats file name per line), so hundreds of sketch result files can be summarized in a single run.
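For large batches, a list file for the in= parameter can be generated with a short script. The Python sketch below writes one path per line, as the in= description requires. The function name write_file_list and the file name stats_list.txt are hypothetical, and the demonstration runs in a temporary directory with placeholder files.

```python
import tempfile
from pathlib import Path

def write_file_list(directory, pattern, list_path):
    """Write matching file paths to list_path, one per line, and return them."""
    list_path = Path(list_path).resolve()
    paths = sorted(
        str(p) for p in Path(directory).glob(pattern)
        if p.resolve() != list_path  # never list the list file itself
    )
    list_path.write_text("".join(p + "\n" for p in paths))
    return paths

with tempfile.TemporaryDirectory() as tmp:
    # Placeholder files standing in for sketch result files.
    for name in ("b.txt", "a.txt"):
        (Path(tmp) / name).write_text("placeholder\n")
    list_file = Path(tmp) / "stats_list.txt"
    listed = write_file_list(tmp, "*.txt", list_file)
    written = list_file.read_text().splitlines()
    print(len(listed))
```

The resulting list file would then be passed as in=stats_list.txt in place of a comma-separated list of files.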
Support
For questions and support:
- Email: bbushnell@lbl.gov
- Documentation: bbmap.org