SummarizeSeal

Script: summarizeseal.sh
Package: driver
Class: SummarizeSealStats.java

Summarizes the stats output of Seal for evaluation of cross-contamination. The intended use is to map multiple libraries or assemblies, of different multiplexed organisms, to a concatenated reference containing one fused scaffold per organism. This will convert all of the resulting stats files (one per library) to a single text file, with multiple columns, indicating how much of the input hit the primary versus nonprimary scaffolds.

Basic Usage

summarizeseal.sh in=<file,file...> out=<file>

# Alternative usage with wildcards
summarizeseal.sh *.txt out=out.txt

SummarizeSeal processes multiple stats files from Seal mapping results and produces a consolidated summary showing contamination levels between different organisms or libraries.

Reference Name Format

When using the ignoresametaxa, ignoresamebarcode, or ignoresamelocation parameters, reference names must consist of four comma-separated fields in this order:

barcode,library,tax,location

Example: 6-G,N0296,gammaproteobacteria_bacterium,deep_ocean

barcode: Sample barcode identifier (e.g., "6-G")
library: Library identifier (e.g., "N0296")
tax: Taxonomic classification (e.g., "gammaproteobacteria_bacterium")
location: Sampling site (e.g., "deep_ocean")
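The four-field layout above can be illustrated with a small parser. This is a minimal sketch, not the SummarizeSealStats source: the class name RefNameSketch and the choice to throw IllegalArgumentException on a malformed name are assumptions made for the example. The lowercasing and comma splitting follow the Reference Name Parsing notes later in this document.

```java
import java.util.Locale;

// Illustrative sketch only; RefNameSketch is a hypothetical class.
final class RefNameSketch {
    final String barcode, library, tax, location;

    private RefNameSketch(String[] f) {
        barcode = f[0];
        library = f[1];
        tax = f[2];
        location = f[3];
    }

    // Lowercases the name, then splits "barcode,library,tax,location" on commas.
    static RefNameSketch parse(String name) {
        String[] fields = name.toLowerCase(Locale.ROOT).split(",");
        if (fields.length != 4) {
            // Assumption: reject anything that does not have exactly four fields.
            throw new IllegalArgumentException(
                "Expected barcode,library,tax,location but got: " + name);
        }
        return new RefNameSketch(fields);
    }

    public static void main(String[] args) {
        RefNameSketch r = parse("6-G,N0296,gammaproteobacteria_bacterium,deep_ocean");
        System.out.println(r.barcode + " | " + r.library + " | " + r.tax + " | " + r.location);
    }
}
```

Because the name is split on commas, none of the four fields can itself contain a comma.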

Parameters

Parameters control input/output handling and contamination filtering options.

Input/Output Parameters

in=<file>: A comma-separated list of stats files, or a text file containing one stats file name per line. Stats files should be the output of Seal mapping runs.
out=<file>: Destination for summary output. The output will be a tab-delimited file with columns for File, Primary_Name, Primary_Count, Other_Count, Primary_Bases, Other_Bases, and Other_ppm.

Output Control Parameters

printtotal=t: (pt) Print a line summarizing the total contamination rate of all assemblies. When enabled, adds a "TOTAL" row that aggregates statistics across all input files. Default: true
totaldenominator=f: (td) Use all bases as denominator rather than mapped bases when calculating contamination rates (ppm). When false, uses only mapped bases (primary + other) as denominator. When true, uses total bases from input. Default: false

Contamination Filtering Parameters

ignoresametaxa=f: Ignore secondary hits sharing taxonomy. When enabled, hits to references whose taxonomy field matches that of the primary reference (one contains the other as a substring) are not counted as contamination. Requires reference names in the format described under Reference Name Format. Default: false
ignoresamebarcode=f: Ignore secondary hits sharing a barcode. When enabled, hits to references with the same barcode identifier will not be counted as contamination. Useful for multiplexed samples where cross-barcode contamination is of interest. Default: false
ignoresamelocation=f: Ignore secondary hits sharing a sampling site. When enabled, hits to references from the same sampling location will not be counted as contamination. Useful for environmental samples where geographic cross-contamination is expected. Default: false

Examples

Basic Usage

# Process multiple stats files
summarizeseal.sh in=sample1_stats.txt,sample2_stats.txt,sample3_stats.txt out=contamination_summary.txt

# Process all .txt files in current directory
summarizeseal.sh *.txt out=all_samples_summary.txt

Basic usage for consolidating multiple Seal stats files into a single contamination summary.

Advanced Filtering

# Ignore contamination between same taxa
summarizeseal.sh in=stats_list.txt out=filtered_summary.txt ignoresametaxa=t

# Ignore contamination between same barcodes and locations
summarizeseal.sh in=environmental_stats.txt out=cross_location_summary.txt ignoresamebarcode=t ignoresamelocation=t

# Use total bases for contamination calculation
summarizeseal.sh in=mapping_results.txt out=total_based_summary.txt totaldenominator=t printtotal=t

Advanced usage with contamination filtering and different calculation methods.

File List Input

# Create a file list
echo "sample1_seal_stats.txt" > file_list.txt
echo "sample2_seal_stats.txt" >> file_list.txt
echo "sample3_seal_stats.txt" >> file_list.txt

# Process using file list
summarizeseal.sh in=file_list.txt out=batch_summary.txt

Using a text file containing a list of stats files for batch processing.

Output Format

The output is a tab-delimited file with the following columns:

File: Input stats file name
Primary_Name: Name of the primary reference (the one with the most mapped bases; ties are broken by read count)
Primary_Count: Number of reads mapped to the primary reference
Other_Count: Total number of reads mapped to non-primary references
Primary_Bases: Total bases mapped to the primary reference
Other_Bases: Total bases mapped to non-primary references
Other_ppm: Parts per million contamination rate (other bases relative to total mapped bases or total bases, depending on totaldenominator setting)

Sample Output

#File	Primary_Name	Primary_Count	Other_Count	Primary_Bases	Other_Bases	Other_ppm
TOTAL	6-G,N0296,gammaproteobacteria_bacterium,deep_ocean	45823	1247	4582300	124700	26492.46
sample1_stats.txt	6-G,N0296,gammaproteobacteria_bacterium,deep_ocean	15274	423	1527400	42300	26947.82
sample2_stats.txt	7-A,N0298,alphaproteobacteria_sp,surface_water	18359	534	1835900	53400	28264.44
sample3_stats.txt	8-C,N0301,betaproteobacteria_strain,mid_depth	12190	290	1219000	29000	23237.18

Example output showing contamination summary for multiple samples with TOTAL row included.

Algorithm Details

Processing Strategy

SummarizeSeal implements a two-phase algorithm for contamination analysis:

Phase 1: Primary Reference Selection

For each input stats file, the algorithm identifies the primary reference using a dual-criteria selection:

Primary criterion: Reference with the highest number of mapped bases
Tie-breaking criterion: If bases are equal, reference with the highest read count wins
All other references are classified as "other" (potential contamination sources)
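The selection rule above can be sketched as a single pass over the reference rows of one stats file. The class and method names here (PrimarySelectionSketch, Hit, selectPrimary) are hypothetical and chosen for illustration; only the ordering rule (most bases first, read count as tie-breaker) comes from the description above.

```java
import java.util.Arrays;
import java.util.List;

// Illustrative sketch only; not the SummarizeSealStats source.
final class PrimarySelectionSketch {
    // One reference row from a stats file: name, mapped reads, mapped bases.
    static final class Hit {
        final String name;
        final long reads;
        final long bases;

        Hit(String name, long reads, long bases) {
            this.name = name;
            this.reads = reads;
            this.bases = bases;
        }
    }

    // Returns the hit with the most bases; ties go to the higher read count.
    static Hit selectPrimary(List<Hit> hits) {
        Hit best = null;
        for (Hit h : hits) {
            if (best == null || h.bases > best.bases
                    || (h.bases == best.bases && h.reads > best.reads)) {
                best = h;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        List<Hit> hits = Arrays.asList(
            new Hit("a", 10, 1000), new Hit("b", 12, 1000), new Hit("c", 50, 900));
        Hit primary = selectPrimary(hits);
        long otherReads = 0, otherBases = 0;
        for (Hit h : hits) {
            if (h != primary) { // everything except the primary counts as "other"
                otherReads += h.reads;
                otherBases += h.bases;
            }
        }
        // Prints: b other reads=60 other bases=1900
        System.out.println(primary.name + " other reads=" + otherReads
            + " other bases=" + otherBases);
    }
}
```

In the demo, "a" and "b" tie on bases, so "b" wins on read count even though "c" has the most reads overall.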

Phase 2: Contamination Calculation

The contamination rate is calculated using different denominators based on the totaldenominator setting:

Default mode (totaldenominator=f): ppm = other_bases × 1,000,000 / (primary_bases + other_bases)
Total denominator mode (totaldenominator=t): ppm = other_bases × 1,000,000 / total_input_bases
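As a worked instance of the two formulas, the sketch below computes the rate in both modes for made-up inputs (990,000 primary bases, 10,000 other bases, 2,000,000 total input bases). The class and method names are hypothetical. The zero check mirrors the one described under Output Precision.

```java
import java.util.Locale;

// Illustrative sketch only; PpmSketch and otherPpm are hypothetical names.
final class PpmSketch {
    // Contamination rate in parts per million of "other" bases.
    static double otherPpm(long pbases, long obases, long tbases, boolean totalDenominator) {
        if (obases == 0) {
            return 0; // no contamination; also avoids 0/0 when nothing mapped
        }
        long denominator = totalDenominator ? tbases : (pbases + obases);
        return obases * 1000000.0 / denominator;
    }

    public static void main(String[] args) {
        // Default mode: 10,000 * 1,000,000 / (990,000 + 10,000) = 10000.00
        System.out.println(String.format(Locale.ROOT, "%.2f",
            otherPpm(990000, 10000, 2000000, false)));
        // Total denominator mode: 10,000 * 1,000,000 / 2,000,000 = 5000.00
        System.out.println(String.format(Locale.ROOT, "%.2f",
            otherPpm(990000, 10000, 2000000, true)));
    }
}
```

With totaldenominator=t the rate can only be equal or lower, because the total input bases include bases that mapped to nothing.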

Advanced Filtering Logic

When contamination filtering options are enabled, the algorithm compares fields of the primary and secondary reference names as plain strings:

Taxonomic Filtering (ignoresametaxa=t)

Compares the taxonomy fields (name[2]) of the two references using String.contains(). Two references are considered the same taxa if either taxonomy string contains the other as a substring. This handles cases where taxonomic names are given at different levels of specificity.
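The substring rule can be sketched as a symmetric check. The class and method names are hypothetical, and the inputs are assumed to be already lowercased taxonomy fields.

```java
// Illustrative sketch only; not the SummarizeSealStats source.
final class TaxaFilterSketch {
    // True if either taxonomy string contains the other as a substring.
    static boolean sameTaxa(String taxA, String taxB) {
        return taxA.contains(taxB) || taxB.contains(taxA);
    }

    public static void main(String[] args) {
        // A more specific name contains the less specific one: true
        System.out.println(sameTaxa("gammaproteobacteria_bacterium", "gammaproteobacteria"));
        // Neither name contains the other: false
        System.out.println(sameTaxa("alphaproteobacteria_sp", "betaproteobacteria_strain"));
    }
}
```

One consequence of substring matching is that a very short taxonomy field, such as "proteobacteria", would match both of the names in the second call.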

Barcode Filtering (ignoresamebarcode=t)

Splits each barcode identifier with String.split("-") and compares the two components (barcode[0] and barcode[1]) using equals(). A secondary hit that shares either component with the primary reference is not counted as contamination.
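A minimal sketch of the component comparison, assuming two-part barcodes of the form "prefix-suffix". The names are hypothetical, and the exception for malformed barcodes is an assumption of this example rather than documented behavior.

```java
// Illustrative sketch only; not the SummarizeSealStats source.
final class BarcodeFilterSketch {
    // True if the two barcodes share the part before or the part after the "-".
    static boolean sharesBarcodeComponent(String barcodeA, String barcodeB) {
        String[] a = barcodeA.split("-");
        String[] b = barcodeB.split("-");
        if (a.length != 2 || b.length != 2) {
            // Assumption: only "prefix-suffix" barcodes are accepted here.
            throw new IllegalArgumentException(
                "Expected prefix-suffix barcodes: " + barcodeA + ", " + barcodeB);
        }
        return a[0].equals(b[0]) || a[1].equals(b[1]);
    }

    public static void main(String[] args) {
        System.out.println(sharesBarcodeComponent("6-g", "6-a")); // shared prefix: true
        System.out.println(sharesBarcodeComponent("6-g", "7-g")); // shared suffix: true
        System.out.println(sharesBarcodeComponent("6-g", "7-a")); // nothing shared: false
    }
}
```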

Location Filtering (ignoresamelocation=t)

Compares the location fields (name[3]) using equals(), so only exact matches count. Hits to references from the same sampling site are not counted as contamination.

Data Structure Implementation

The algorithm uses ArrayList<SealSummary> collections with TextFile streaming:

SealSummary class: Encapsulates per-file statistics using long counters (pcount, ocount, tcount, pbases, obases, tbases)
TextFile streaming: Processes stats files line-by-line using TextFile.nextLine() to minimize memory usage
Aggregation strategy: Maintains separate long counters for primary and other hits, using simple addition for totaling
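The per-file record and the totaling step could look roughly like the sketch below. It is not the SealSummary class itself: SummarySketch is a hypothetical stand-in that keeps only four of the six counters named above (tcount and tbases are omitted for brevity), and the demo values are the count and base columns from the Sample Output rows.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only; a simplified stand-in for a per-file summary.
final class SummarySketch {
    final String fname;
    long pcount, ocount, pbases, obases; // primary/other reads and bases

    SummarySketch(String fname, long pcount, long ocount, long pbases, long obases) {
        this.fname = fname;
        this.pcount = pcount;
        this.ocount = ocount;
        this.pbases = pbases;
        this.obases = obases;
    }

    // Adds another summary's counters into this one.
    void add(SummarySketch s) {
        pcount += s.pcount;
        ocount += s.ocount;
        pbases += s.pbases;
        obases += s.obases;
    }

    // Builds a TOTAL record by simple addition over all per-file records.
    static SummarySketch total(List<SummarySketch> list) {
        SummarySketch t = new SummarySketch("TOTAL", 0, 0, 0, 0);
        for (SummarySketch s : list) {
            t.add(s);
        }
        return t;
    }

    public static void main(String[] args) {
        List<SummarySketch> list = new ArrayList<SummarySketch>();
        list.add(new SummarySketch("sample1_stats.txt", 15274, 423, 1527400, 42300));
        list.add(new SummarySketch("sample2_stats.txt", 18359, 534, 1835900, 53400));
        list.add(new SummarySketch("sample3_stats.txt", 12190, 290, 1219000, 29000));
        SummarySketch t = total(list);
        // Prints: TOTAL 45823 1247 4582300 124700 (tab-separated)
        System.out.println(t.fname + "\t" + t.pcount + "\t" + t.ocount
            + "\t" + t.pbases + "\t" + t.obases);
    }
}
```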

Memory Characteristics

The implementation processes files with controlled memory usage:

Default memory allocation: 120MB heap space (-Xmx120m)
TextFile streaming: As noted above, files are read line by line, so a whole stats file is never held in memory at once
ArrayList<SealSummary> storage: Only SealSummary objects containing long counters are retained in memory during processing

Output Precision

Contamination rates are calculated and formatted with specific precision:

PPM calculation: Uses double-precision arithmetic (obases*1000000.0/(obases+pbases)) for accuracy
Output formatting: Results are displayed with 2 decimal places using Tools.format("%.2f") method
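Zero contamination handling: An explicit check (obases==0 ? 0 : calculation) returns 0 when there are no other bases, which also avoids dividing by zero when nothing mapped at all (obases and pbases both 0)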

Technical Notes

Input Requirements

Stats files must be in the format produced by Seal (or compatible tools)
Expected format includes tab-delimited columns with reference names, read counts, and base counts
Total counts should be prefixed with "#Total" in the stats files
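A line-by-line reader for such a file might look like the sketch below. The exact column layout is an assumption: the demo uses a simplified three-column file (name, reads, bases) with a "#Total" line carrying reads and bases, and the column indices are passed in rather than hard-coded. BufferedReader stands in for the TextFile class, and all names are hypothetical.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;

// Illustrative sketch only; the column positions are assumptions.
final class StatsParseSketch {
    long totalReads, totalBases;   // taken from the "#Total" line
    long mappedReads, mappedBases; // summed over all reference rows

    // Streams the input one line at a time; readsCol and basesCol are 0-based.
    void read(BufferedReader in, int readsCol, int basesCol) throws IOException {
        for (String line = in.readLine(); line != null; line = in.readLine()) {
            if (line.isEmpty()) {
                continue;
            }
            String[] f = line.split("\t");
            if (line.startsWith("#Total")) {
                totalReads = Long.parseLong(f[1]);
                totalBases = Long.parseLong(f[2]);
            } else if (!line.startsWith("#")) {
                mappedReads += Long.parseLong(f[readsCol]);
                mappedBases += Long.parseLong(f[basesCol]);
            }
            // Other "#" lines (headers) are skipped.
        }
    }

    // Convenience wrapper for in-memory text, used by the demo below.
    static StatsParseSketch parse(String text, int readsCol, int basesCol) {
        StatsParseSketch s = new StatsParseSketch();
        try {
            s.read(new BufferedReader(new StringReader(text)), readsCol, basesCol);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return s;
    }

    public static void main(String[] args) {
        String demo = "#Total\t1000\t100000\n#Name\tReads\tBases\n"
            + "refA\t900\t90000\nrefB\t10\t1000\n";
        StatsParseSketch s = parse(demo, 1, 2);
        System.out.println(s.totalReads + " total reads, " + s.mappedReads + " mapped reads");
    }
}
```

Only the running counters are kept between lines, which is what keeps memory use independent of the size of the stats file.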

Reference Name Parsing

Reference names are converted to lowercase for consistent comparison
Barcode parsing expects format: "prefix-suffix" (e.g., "6-G")
Field parsing uses comma-separated format: "barcode,library,taxonomy,location"

Performance Characteristics

Time complexity: O(n×m) where n is number of files and m is average lines per file
Space complexity: O(n) for storing summary data for each input file
I/O pattern: Sequential reads with single-pass processing per file

Error Handling

Invalid reference name formats trigger assertion failures with diagnostic messages
Missing or malformed input files are handled gracefully
Unknown parameters result in RuntimeException with clear error messages

Support

For questions and support:

Email: bbushnell@lbl.gov
Documentation: bbmap.org