SummarizeSeal
Summarizes the stats output of Seal for evaluation of cross-contamination. The intended use is to map multiple libraries or assemblies of different multiplexed organisms to a concatenated reference containing one fused scaffold per organism. SummarizeSeal then converts the resulting stats files (one per library) into a single multi-column text file indicating how much of the input hit the primary versus nonprimary scaffolds.
Basic Usage
summarizeseal.sh in=<file,file...> out=<file>
# Alternative usage with wildcards
summarizeseal.sh *.txt out=out.txt
SummarizeSeal processes multiple stats files from Seal mapping results and produces a consolidated summary showing contamination levels between different organisms or libraries.
Reference Name Format
When using the ignoresametaxa, ignoresamebarcode, or ignoresamelocation parameters, reference names must follow this specific format:
barcode,library,tax,location
Example: 6-G,N0296,gammaproteobacteria_bacterium,deep_ocean
- barcode: Sample barcode identifier (e.g., "6-G")
- library: Library identifier (e.g., "N0296")
- tax: Taxonomic classification (e.g., "gammaproteobacteria_bacterium")
- location: Sampling site (e.g., "deep_ocean")
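As an illustration of this naming format, the following Python sketch splits a reference name into its four fields. RefName and parse_ref_name are hypothetical names introduced here and are not part of BBTools; rejecting names that do not have exactly four fields is an assumption of this sketch, and the lowercasing follows the note under Reference Name Parsing.

```python
from typing import NamedTuple


class RefName(NamedTuple):
    """Hypothetical container for the four comma-separated fields."""
    barcode: str
    library: str
    tax: str
    location: str


def parse_ref_name(name: str) -> RefName:
    """Split 'barcode,library,tax,location' into lowercase fields."""
    fields = name.lower().split(",")
    if len(fields) != 4:
        # Assumption of this sketch: anything other than 4 fields is an error.
        raise ValueError(f"expected 4 comma-separated fields, got {len(fields)}: {name!r}")
    return RefName(*fields)


ref = parse_ref_name("6-G,N0296,gammaproteobacteria_bacterium,deep_ocean")
print(ref.barcode, ref.tax, ref.location)
```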
Parameters
Parameters control input/output handling and contamination filtering options.
Input/Output Parameters
- in=<file>
- One or more stats files separated by commas, or a text file containing one stats file name per line. Stats files should be the output of Seal mapping operations.
- out=<file>
- Destination for summary output. The output will be a tab-delimited file with columns for File, Primary_Name, Primary_Count, Other_Count, Primary_Bases, Other_Bases, and Other_ppm.
Output Control Parameters
- printtotal=t
- (pt) Print a line summarizing the total contamination rate of all assemblies. When enabled, adds a "TOTAL" row that aggregates statistics across all input files. Default: true
- totaldenominator=f
- (td) Use all bases as denominator rather than mapped bases when calculating contamination rates (ppm). When false, uses only mapped bases (primary + other) as denominator. When true, uses total bases from input. Default: false
Contamination Filtering Parameters
- ignoresametaxa=f
- Ignore secondary hits sharing taxonomy. When enabled, hits to references whose taxonomy field matches that of the primary reference (one name contains the other) are not counted as contamination. Requires reference names in the specified format. Default: false
- ignoresamebarcode=f
- Ignore secondary hits sharing a barcode. When enabled, hits to references with the same barcode identifier will not be counted as contamination. Useful for multiplexed samples where cross-barcode contamination is of interest. Default: false
- ignoresamelocation=f
- Ignore secondary hits sharing a sampling site. When enabled, hits to references from the same sampling location will not be counted as contamination. Useful for environmental samples where geographic cross-contamination is expected. Default: false
Examples
Basic Usage
# Process multiple stats files
summarizeseal.sh in=sample1_stats.txt,sample2_stats.txt,sample3_stats.txt out=contamination_summary.txt
# Process all .txt files in current directory
summarizeseal.sh *.txt out=all_samples_summary.txt
Basic usage for consolidating multiple Seal stats files into a single contamination summary.
Advanced Filtering
# Ignore contamination between same taxa
summarizeseal.sh in=stats_list.txt out=filtered_summary.txt ignoresametaxa=t
# Ignore contamination between same barcodes and locations
summarizeseal.sh in=environmental_stats.txt out=cross_location_summary.txt ignoresamebarcode=t ignoresamelocation=t
# Use total bases for contamination calculation
summarizeseal.sh in=mapping_results.txt out=total_based_summary.txt totaldenominator=t printtotal=t
Advanced usage with contamination filtering and different calculation methods.
File List Input
# Create a file list
echo "sample1_seal_stats.txt" > file_list.txt
echo "sample2_seal_stats.txt" >> file_list.txt
echo "sample3_seal_stats.txt" >> file_list.txt
# Process using file list
summarizeseal.sh in=file_list.txt out=batch_summary.txt
Using a text file containing a list of stats files for batch processing.
Output Format
The output is a tab-delimited file with the following columns:
- File: Input stats file name
- Primary_Name: Name of the reference with the most mapped bases (ties broken by read count)
- Primary_Count: Number of reads mapped to the primary reference
- Other_Count: Total number of reads mapped to non-primary references
- Primary_Bases: Total bases mapped to the primary reference
- Other_Bases: Total bases mapped to non-primary references
- Other_ppm: Parts per million contamination rate (other bases relative to total mapped bases or total bases, depending on totaldenominator setting)
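For downstream processing, a summary in this layout could be read back with a few lines of Python. This is a minimal sketch: read_summary is a hypothetical helper, the demo data is made up, and it assumes the header line begins with "#" as in the sample output.

```python
import csv
import io

COLUMNS = ["File", "Primary_Name", "Primary_Count", "Other_Count",
           "Primary_Bases", "Other_Bases", "Other_ppm"]


def read_summary(handle):
    """Yield one dict per data row of a SummarizeSeal-style summary."""
    for row in csv.reader(handle, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and the '#File ...' header
        rec = dict(zip(COLUMNS, row))
        for key in COLUMNS[2:6]:
            rec[key] = int(rec[key])       # counts and bases are integers
        rec["Other_ppm"] = float(rec["Other_ppm"])
        yield rec


# Made-up summary used only to exercise the reader.
demo = io.StringIO(
    "#File\tPrimary_Name\tPrimary_Count\tOther_Count\tPrimary_Bases\tOther_Bases\tOther_ppm\n"
    "a.txt\t1-A,L1,taxon_x,site_1\t90\t10\t9000\t1000\t100000.00\n"
)
rows = list(read_summary(demo))
print(rows[0]["Primary_Name"], rows[0]["Other_ppm"])
```

Because Primary_Name itself contains commas, splitting on tabs rather than commas keeps the reference name intact.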
Sample Output
#File Primary_Name Primary_Count Other_Count Primary_Bases Other_Bases Other_ppm
TOTAL 6-G,N0296,gammaproteobacteria_bacterium,deep_ocean 45823 1247 4582300 124700 26492.46
sample1_stats.txt 6-G,N0296,gammaproteobacteria_bacterium,deep_ocean 15274 423 1527400 42300 26947.82
sample2_stats.txt 7-A,N0298,alphaproteobacteria_sp,surface_water 18359 534 1835900 53400 28264.44
sample3_stats.txt 8-C,N0301,betaproteobacteria_strain,mid_depth 12190 290 1219000 29000 23237.18
Example output showing contamination summary for multiple samples with TOTAL row included.
Algorithm Details
Processing Strategy
SummarizeSeal implements a two-phase algorithm for contamination analysis:
Phase 1: Primary Reference Selection
For each input stats file, the algorithm identifies the primary reference using a dual-criteria selection:
- Primary criterion: Reference with the highest number of mapped bases
- Tie-breaking criterion: If bases are equal, reference with the highest read count wins
- All other references are classified as "other" (potential contamination sources)
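The selection rule can be sketched in Python as a maximum over (bases, reads) pairs. This is an illustration, not the tool's Java code: split_primary and the (name, reads, bases) tuple layout are hypothetical, and when both bases and reads tie, this sketch simply keeps the first reference encountered.

```python
def split_primary(hits):
    """hits: list of (name, reads, bases). Returns (primary, others)."""
    if not hits:
        return None, []
    # Rank by mapped bases first, then by read count to break ties.
    primary = max(hits, key=lambda h: (h[2], h[1]))
    others = [h for h in hits if h is not primary]
    return primary, others


# Invented hits for demonstration.
hits = [
    ("ref_a", 120, 12000),
    ("ref_b", 150, 12000),  # same bases as ref_a, more reads
    ("ref_c", 10, 900),
]
primary, others = split_primary(hits)
print(primary[0], [h[0] for h in others])
```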
Phase 2: Contamination Calculation
The contamination rate is calculated using different denominators based on the totaldenominator setting:
- Default mode (totaldenominator=f):
ppm = other_bases × 1,000,000 / (primary_bases + other_bases)
- Total denominator mode (totaldenominator=t):
ppm = other_bases × 1,000,000 / total_input_bases
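To make the difference between the two modes concrete, here is a small Python sketch of both formulas with invented counts. The function name other_ppm and its arguments are hypothetical; the early return for zero other bases mirrors the check described under Output Precision.

```python
def other_ppm(primary_bases, other_bases, total_bases=None, total_denominator=False):
    """Contamination rate in parts per million, per the two formulas above."""
    if other_bases == 0:
        return 0.0
    denom = total_bases if total_denominator else primary_bases + other_bases
    if not denom:
        raise ValueError("denominator is zero or missing")
    return other_bases * 1_000_000 / denom


# Invented counts: 9,900 primary bases, 100 other bases, 20,000 input bases.
mapped_mode = other_ppm(9_900, 100)
total_mode = other_ppm(9_900, 100, total_bases=20_000, total_denominator=True)
print(mapped_mode, total_mode)
```

With these numbers the mapped-bases denominator gives 10000 ppm, while the total-bases denominator gives 5000 ppm, because unmapped input dilutes the rate.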
Advanced Filtering Logic
When contamination filtering options are enabled, the algorithm compares fields of the primary and secondary reference names as strings:
Taxonomic Filtering (ignoresametaxa=t)
Applies String.contains() to the taxonomy field (name[2]). Two references are considered the same taxon if either taxonomy string contains the other as a substring, which handles taxonomic names given at different levels of specificity.
Barcode Filtering (ignoresamebarcode=t)
Splits each barcode identifier with String.split("-") and compares the barcode[0] and barcode[1] components with equals(). A reference that shares either barcode component with the primary is not counted as contamination.
Location Filtering (ignoresamelocation=t)
Compares the location field (name[3]) for exact equality using equals(). References from the same sampling site as the primary are not counted as contamination sources.
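The three rules can be sketched together in Python. This is an illustrative translation of the description, not the tool's Java source: all function names are hypothetical, barcodes are assumed to contain a hyphen, and the barcode components are compared position by position, which is one reading of the description.

```python
def same_taxa(tax_a, tax_b):
    # Substring match in either direction, e.g. "bacillus" vs "bacillus_subtilis".
    return tax_a in tax_b or tax_b in tax_a


def same_barcode(bc_a, bc_b):
    # Barcodes look like "prefix-suffix"; sharing either half counts as a match.
    a, b = bc_a.split("-"), bc_b.split("-")
    return a[0] == b[0] or a[1] == b[1]


def same_location(loc_a, loc_b):
    return loc_a == loc_b


def is_ignored(primary, other, taxa=False, barcode=False, location=False):
    """True if a hit to `other` should not count against `primary`."""
    p, o = primary.lower().split(","), other.lower().split(",")
    return ((taxa and same_taxa(p[2], o[2]))
            or (barcode and same_barcode(p[0], o[0]))
            or (location and same_location(p[3], o[3])))


# Invented reference names in barcode,library,tax,location format.
primary = "6-G,N0296,gammaproteobacteria_bacterium,deep_ocean"
other = "6-A,N0300,gammaproteobacteria,surface_water"
print(is_ignored(primary, other, taxa=True))
print(is_ignored(primary, other, barcode=True))
print(is_ignored(primary, other, location=True))
```

In this example the taxonomy and barcode rules each suppress the hit, while the location rule does not, since the two sampling sites differ.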
Data Structure Implementation
The algorithm uses ArrayList<SealSummary> collections with TextFile streaming:
- SealSummary class: Encapsulates per-file statistics using long counters (pcount, ocount, tcount, pbases, obases, tbases)
- TextFile streaming: Processes stats files line-by-line using TextFile.nextLine() to minimize memory usage
- Aggregation strategy: Maintains separate long counters for primary and other hits, using simple addition for totaling
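A Python stand-in for this structure might look like the following. The counter names are taken from the list above, but their interpretation (p = primary, o = other, t = total), the fname and pname fields, and the add method are assumptions made for this sketch.

```python
from dataclasses import dataclass


@dataclass
class SealSummary:
    """Python stand-in for the per-file counters named above."""
    fname: str = ""
    pname: str = ""   # primary reference name (hypothetical field)
    pcount: int = 0   # reads hitting the primary reference
    ocount: int = 0   # reads hitting other references
    tcount: int = 0   # total input reads
    pbases: int = 0
    obases: int = 0
    tbases: int = 0

    def add(self, other: "SealSummary") -> None:
        """Accumulate another summary's counters into this one."""
        for field in ("pcount", "ocount", "tcount", "pbases", "obases", "tbases"):
            setattr(self, field, getattr(self, field) + getattr(other, field))


# Invented per-file summaries, totaled by simple addition.
total = SealSummary(fname="TOTAL")
for s in (SealSummary("a.txt", pcount=90, ocount=10, pbases=9000, obases=1000),
          SealSummary("b.txt", pcount=50, ocount=0, pbases=5000, obases=0)):
    total.add(s)
print(total.pcount, total.ocount, total.pbases, total.obases)
```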
Memory Characteristics
The implementation processes files with controlled memory usage:
- Default memory allocation: 120MB heap space (-Xmx120m)
- TextFile streaming: Files are processed line-by-line using TextFile.nextLine() without loading entire datasets into memory
- ArrayList<SealSummary> storage: Only SealSummary objects containing long counters are retained in memory during processing
Output Precision
Contamination rates are calculated and formatted with specific precision:
- PPM calculation: Uses double-precision arithmetic (obases*1000000.0/(obases+pbases)) for accuracy
- Output formatting: Results are displayed with 2 decimal places using Tools.format("%.2f") method
- Zero contamination handling: Explicit check (obases==0 ? 0 : calculation) prevents division by zero
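The formatting rules above can be mirrored in Python when building an output row. format_row is a hypothetical helper covering only the default (mapped-bases) denominator; Python's ".2f" format stands in for the Java formatting call.

```python
def format_row(fname, pname, pcount, ocount, pbases, obases):
    """Build one tab-delimited summary row with Other_ppm to 2 decimals."""
    # Guard first: with no other bases the rate is 0 and no division is needed.
    ppm = 0.0 if obases == 0 else obases * 1_000_000.0 / (obases + pbases)
    return "\t".join([fname, pname, str(pcount), str(ocount),
                      str(pbases), str(obases), f"{ppm:.2f}"])


# Invented counts; the second row has no mapped bases at all.
row_a = format_row("a.txt", "ref_a", 97, 3, 9700, 300)
row_b = format_row("b.txt", "ref_b", 0, 0, 0, 0)
print(row_a)
print(row_b)
```

The second call shows why the guard matters: without it, a file with no mapped bases would divide zero by zero.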
Technical Notes
Input Requirements
- Stats files must be in the format produced by Seal (or compatible tools)
- Expected format includes tab-delimited columns with reference names, read counts, and base counts
- Total counts should be prefixed with "#Total" in the stats files
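A line-by-line reader for such a file could be sketched in Python as follows. The exact column layout is an assumption: this sketch takes the name from column 0 and, by default, reads from column 1 and bases from column 2, and it assumes the "#Total" line carries reads then bases. Adjust reads_col and bases_col (hypothetical parameters) to match the real files.

```python
import io


def read_stats(handle, reads_col=1, bases_col=2):
    """Stream a Seal-style stats file; return (totals, hits).

    Column positions are assumptions; adjust them to the real file layout.
    """
    totals, hits = None, []
    for line in handle:                      # one line at a time
        fields = line.rstrip("\n").split("\t")
        if line.startswith("#Total"):
            totals = (int(fields[1]), int(fields[2]))
        elif line.startswith("#") or not line.strip():
            continue                         # other header lines, blanks
        else:
            hits.append((fields[0], int(fields[reads_col]), int(fields[bases_col])))
    return totals, hits


# Made-up stats content in the assumed layout.
demo = io.StringIO(
    "#Total\t200\t20000\n"
    "#Name\tReads\tBases\n"
    "ref_a\t150\t15000\n"
    "ref_b\t5\t500\n"
)
totals, hits = read_stats(demo)
print(totals, hits)
```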
Reference Name Parsing
- Reference names are converted to lowercase for consistent comparison
- Barcode parsing expects format: "prefix-suffix" (e.g., "6-G")
- Field parsing uses comma-separated format: "barcode,library,taxonomy,location"
Performance Characteristics
- Time complexity: O(n×m) where n is number of files and m is average lines per file
- Space complexity: O(n) for storing summary data for each input file
- I/O pattern: Sequential reads with single-pass processing per file
Error Handling
- Invalid reference name formats trigger assertion failures with diagnostic messages
- Missing or malformed input files are handled gracefully
- Unknown parameters result in RuntimeException with clear error messages
Support
For questions and support:
- Email: bbushnell@lbl.gov
- Documentation: bbmap.org