SummarizeSeal

Script: summarizeseal.sh
Package: driver
Class: SummarizeSealStats.java

Summarizes the stats output of Seal to evaluate cross-contamination. The intended use is to map multiple libraries or assemblies, from different multiplexed organisms, to a concatenated reference containing one fused scaffold per organism. SummarizeSeal then converts the resulting stats files (one per library) into a single multi-column text file indicating how much of each input hit its primary scaffold versus nonprimary scaffolds.

Basic Usage

summarizeseal.sh in=<file,file...> out=<file>

# Alternative usage with wildcards
summarizeseal.sh *.txt out=out.txt

SummarizeSeal processes multiple stats files from Seal mapping results and produces a consolidated summary showing contamination levels between different organisms or libraries.

Reference Name Format

When using ignoresametaxa, ignoresamebarcode, or ignoresamelocation parameters, reference names must follow this specific format:

barcode,library,tax,location

Example: 6-G,N0296,gammaproteobacteria_bacterium,deep_ocean
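The four-field layout above can be illustrated with a small Python sketch. This is not the tool's own parser: the `RefName` type and `parse_ref_name` function are hypothetical names, and rejecting names that do not have exactly four comma-separated fields is an assumption made here for clarity.

```python
from typing import NamedTuple


class RefName(NamedTuple):
    """Hypothetical container for the four comma-separated name fields."""
    barcode: str
    library: str
    tax: str
    location: str


def parse_ref_name(name: str) -> RefName:
    """Split a reference name of the form barcode,library,tax,location."""
    fields = name.split(",")
    if len(fields) != 4:
        # Assumption: treat any other field count as malformed.
        raise ValueError(f"expected 4 comma-separated fields, got {len(fields)}: {name!r}")
    return RefName(*fields)


ref = parse_ref_name("6-G,N0296,gammaproteobacteria_bacterium,deep_ocean")
print(ref.barcode, ref.library, ref.tax, ref.location)
```

Because the fields are comma-delimited, none of them can itself contain a comma.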

Parameters

Parameters control input/output handling and contamination filtering options.

Input/Output Parameters

in=<file>
A comma-separated list of stats files, or a text file containing one stats file name per line. Stats files should be the stats output of Seal mapping operations.
out=<file>
Destination for summary output. The output will be a tab-delimited file with columns for File, Primary_Name, Primary_Count, Other_Count, Primary_Bases, Other_Bases, and Other_ppm.

Output Control Parameters

printtotal=t
(pt) Print a line summarizing the total contamination rate of all assemblies. When enabled, adds a "TOTAL" row that aggregates statistics across all input files. Default: true
totaldenominator=f
(td) Use all bases as denominator rather than mapped bases when calculating contamination rates (ppm). When false, uses only mapped bases (primary + other) as denominator. When true, uses total bases from input. Default: false

Contamination Filtering Parameters

ignoresametaxa=f
Ignore secondary hits sharing taxonomy. When enabled, hits to references whose taxonomy field matches that of the primary reference (one name containing the other as a substring) will not be counted as contamination. Requires reference names in the specified format. Default: false
ignoresamebarcode=f
Ignore secondary hits sharing a barcode. When enabled, hits to references with the same barcode identifier will not be counted as contamination. Useful for multiplexed samples where cross-barcode contamination is of interest. Default: false
ignoresamelocation=f
Ignore secondary hits sharing a sampling site. When enabled, hits to references from the same sampling location will not be counted as contamination. Useful for environmental samples where geographic cross-contamination is expected. Default: false

Examples

Basic Usage

# Process multiple stats files
summarizeseal.sh in=sample1_stats.txt,sample2_stats.txt,sample3_stats.txt out=contamination_summary.txt

# Process all .txt files in current directory
summarizeseal.sh *.txt out=all_samples_summary.txt

Basic usage for consolidating multiple Seal stats files into a single contamination summary.

Advanced Filtering

# Ignore contamination between same taxa
summarizeseal.sh in=stats_list.txt out=filtered_summary.txt ignoresametaxa=t

# Ignore contamination between same barcodes and locations
summarizeseal.sh in=environmental_stats.txt out=cross_location_summary.txt ignoresamebarcode=t ignoresamelocation=t

# Use total bases for contamination calculation
summarizeseal.sh in=mapping_results.txt out=total_based_summary.txt totaldenominator=t printtotal=t

Advanced usage with contamination filtering and different calculation methods.

File List Input

# Create a file list
echo "sample1_seal_stats.txt" > file_list.txt
echo "sample2_seal_stats.txt" >> file_list.txt
echo "sample3_seal_stats.txt" >> file_list.txt

# Process using file list
summarizeseal.sh in=file_list.txt out=batch_summary.txt

Using a text file containing a list of stats files for batch processing.

Output Format

The output is a tab-delimited file with the following columns: File, Primary_Name, Primary_Count, Other_Count, Primary_Bases, Other_Bases, and Other_ppm.

Sample Output

#File	Primary_Name	Primary_Count	Other_Count	Primary_Bases	Other_Bases	Other_ppm
TOTAL	6-G,N0296,gammaproteobacteria_bacterium,deep_ocean	45823	1247	4582300	124700	26513.84
sample1_stats.txt	6-G,N0296,gammaproteobacteria_bacterium,deep_ocean	15274	423	1527400	42300	26943.37
sample2_stats.txt	7-A,N0298,alphaproteobacteria_sp,surface_water	18359	534	1835900	53400	28316.45
sample3_stats.txt	8-C,N0301,betaproteobacteria_strain,mid_depth	12190	290	1219000	29000	23213.51

Example output showing contamination summary for multiple samples with TOTAL row included.
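For downstream analysis, a table in this layout can be loaded with a few lines of Python. The following is a minimal sketch, assuming the header line starts with "#File" and the columns appear as in the sample above; `read_summary` is a hypothetical helper, not part of SummarizeSeal, and the embedded rows are copied from the sample output.

```python
import csv
import io

INT_COLUMNS = ("Primary_Count", "Other_Count", "Primary_Bases", "Other_Bases")


def read_summary(handle):
    """Yield one dict per data row of a tab-delimited summary table."""
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    header[0] = header[0].lstrip("#")  # "#File" -> "File"
    for fields in reader:
        if not fields:
            continue  # skip blank lines
        row = dict(zip(header, fields))
        for key in INT_COLUMNS:
            row[key] = int(row[key])
        row["Other_ppm"] = float(row["Other_ppm"])
        yield row


# Two rows taken from the sample output, joined with explicit tabs.
sample = "\n".join([
    "\t".join(["#File", "Primary_Name", "Primary_Count", "Other_Count",
               "Primary_Bases", "Other_Bases", "Other_ppm"]),
    "\t".join(["sample1_stats.txt", "6-G,N0296,gammaproteobacteria_bacterium,deep_ocean",
               "15274", "423", "1527400", "42300", "26943.37"]),
    "\t".join(["sample2_stats.txt", "7-A,N0298,alphaproteobacteria_sp,surface_water",
               "18359", "534", "1835900", "53400", "28316.45"]),
])

rows = list(read_summary(io.StringIO(sample)))
for row in rows:
    print(row["File"], row["Other_ppm"])
```

To read a real summary file, pass an open file handle (for example `open("contamination_summary.txt", newline="")`) in place of the `io.StringIO` object.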

Algorithm Details

Processing Strategy

SummarizeSeal implements a two-phase algorithm for contamination analysis:

Phase 1: Primary Reference Selection

For each input stats file, the algorithm identifies the primary reference using a dual-criteria selection.

Phase 2: Contamination Calculation

The contamination rate (Other_ppm) is calculated using a denominator chosen by the totaldenominator setting: mapped bases (primary + other) when false, or total bases from the input when true.
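The choice of denominator can be sketched in Python as follows. This is an illustration only: the function name `other_ppm` and the `total_bases` argument are hypothetical, and scaling by one million is inferred from the "ppm" column name rather than taken from the tool's source.

```python
def other_ppm(primary_bases, other_bases, total_bases=None, total_denominator=False):
    """Return non-primary bases in parts per million of the chosen denominator.

    total_denominator=False: denominator is mapped bases (primary + other).
    total_denominator=True:  denominator is all input bases (total_bases).
    """
    if total_denominator:
        if total_bases is None:
            raise ValueError("total_bases is required when total_denominator=True")
        denominator = total_bases
    else:
        denominator = primary_bases + other_bases
    if denominator == 0:
        return 0.0  # assumption: report zero rather than divide by zero
    return other_bases * 1_000_000 / denominator


# Mapped-bases denominator: 100,000 of 1,000,000 mapped bases are non-primary.
print(other_ppm(900_000, 100_000))
# Total-bases denominator: the same hits against 2,000,000 input bases.
print(other_ppm(900_000, 100_000, total_bases=2_000_000, total_denominator=True))
```

With totaldenominator=t the reported rate can only stay the same or drop, since unmapped bases enlarge the denominator.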

Advanced Filtering Logic

When contamination filtering options are enabled, the algorithm applies string-based matching algorithms:

Taxonomic Filtering (ignoresametaxa=t)

Compares the taxonomy field (name[2]) of two references using String.contains(). Two references are considered the same taxa if either taxonomy contains the other as a substring. This handles cases where taxonomic names have different levels of specificity.
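The bidirectional substring rule can be sketched in Python, where the `in` operator plays the role of Java's String.contains(). The function name `same_taxa` is hypothetical.

```python
def same_taxa(tax_a: str, tax_b: str) -> bool:
    """True if either taxonomy string contains the other as a substring."""
    return tax_a in tax_b or tax_b in tax_a


# A less specific name matches a more specific one that contains it.
print(same_taxa("gammaproteobacteria", "gammaproteobacteria_bacterium"))
# Unrelated names do not match.
print(same_taxa("alphaproteobacteria_sp", "betaproteobacteria_strain"))
```

One consequence of substring matching is that a very short or generic taxonomy string (for example "proteobacteria") matches every name containing it, so hits to all of those references would be ignored.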

Barcode Filtering (ignoresamebarcode=t)

Splits each barcode identifier with String.split("-") and compares the barcode[0] and barcode[1] components using equals(). References sharing either barcode component are filtered out.
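A minimal Python sketch of this comparison follows, assuming every barcode has exactly two "-"-separated components, as in "6-G". The function name `shares_barcode_component` is hypothetical, and the handling of barcodes without a "-" is an assumption added here.

```python
def shares_barcode_component(barcode_a: str, barcode_b: str) -> bool:
    """True if the two barcodes share their first or their second component."""
    a = barcode_a.split("-")
    b = barcode_b.split("-")
    if len(a) < 2 or len(b) < 2:
        # Assumption: barcodes lacking a "-" are compared as whole strings.
        return barcode_a == barcode_b
    return a[0] == b[0] or a[1] == b[1]


print(shares_barcode_component("6-G", "6-A"))  # shared first component
print(shares_barcode_component("6-G", "7-G"))  # shared second component
print(shares_barcode_component("6-G", "7-A"))  # nothing shared
```

Note that a match on either component is enough, so two different barcodes can still be treated as "the same" for filtering purposes.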

Location Filtering (ignoresamelocation=t)

Compares the location field (name[3]) of two references by exact string matching with equals(). References from identical sampling sites are not counted as contamination sources.
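The location check can be sketched in Python as a plain equality test on the fourth comma-separated field. The function name `same_location` is hypothetical, and the example names are the ones used in the sample output.

```python
def same_location(name_a: str, name_b: str) -> bool:
    """True if the location field (index 3) of both reference names is identical."""
    return name_a.split(",")[3] == name_b.split(",")[3]


primary = "6-G,N0296,gammaproteobacteria_bacterium,deep_ocean"
other = "7-A,N0298,alphaproteobacteria_sp,surface_water"

print(same_location(primary, other))    # different sampling sites
print(same_location(primary, primary))  # identical sampling site
```

Because the match is exact, it is case-sensitive: "deep_ocean" and "Deep_Ocean" count as different sites, so location labels need to be spelled consistently across reference names.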

Data Structure Implementation

The algorithm uses ArrayList<SealSummary> collections with TextFile streaming.

Memory Characteristics

The implementation processes files with controlled memory usage.

Output Precision

Contamination rates (Other_ppm) are reported in parts per million; the sample output above shows them with two decimal places.
