SummarizeScafstats

Script: summarizescafstats.sh
Package: driver
Class: SummarizeCoverage.java

Summarizes the scafstats output of BBMap to help evaluate cross-contamination. The intended use is to map multiple libraries or assemblies, each from a different multiplexed organism, to a concatenated reference containing one fused scaffold per organism. The tool then converts the resulting stats files (one per library) into a single multi-column text file showing how much of each input hit its primary scaffold versus the non-primary scaffolds.

Basic Usage

summarizescafstats.sh in=<file,file...> out=<file>

You can alternatively use a wildcard, like this:

summarizescafstats.sh scafstats_*.txt out=summary.txt

Parameters

This tool has a simple parameter set designed for processing multiple scafstats files generated by BBMap's mapping statistics output.

Input/Output Parameters

in=<file>: A list of stats files, or a text file containing one stats file name per line. Can be specified as comma-separated files or using wildcards.
out=<file>: Destination for summary output. Tab-delimited file containing columns for File, Primary_Name, Primary_Count, Other_Count, Primary_MB, and Other_MB.

Examples

Processing Multiple Stats Files

summarizescafstats.sh in=lib1_stats.txt,lib2_stats.txt,lib3_stats.txt out=contamination_summary.txt

Processes three individual stats files and creates a summary table.

Using Wildcard Pattern

summarizescafstats.sh scafstats_*.txt out=summary.txt

Processes all scafstats files matching the pattern and creates a single summary.

Input List File

summarizescafstats.sh in=filelist.txt out=contamination_report.txt

Reads a text file containing one stats filename per line and processes all listed files.
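
A list file like the one above can be produced by hand or by a small script. The following is a minimal sketch, not part of BBTools: the helper name write_file_list and the scafstats_*.txt naming pattern are assumptions chosen for illustration.

```python
import glob
from pathlib import Path


def write_file_list(pattern, list_path):
    """Write every path matching `pattern` to `list_path`, one per line.

    Hypothetical helper: it only prepares a text file in the
    one-name-per-line layout described for the in= parameter.
    """
    names = sorted(glob.glob(pattern))  # sorted for a reproducible order
    Path(list_path).write_text("".join(name + "\n" for name in names))
    return names
```

The resulting file could then be passed as in=filelist.txt, as in the example above.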

Algorithm Details

SummarizeCoverage reads each scafstats file line by line and parses its tab-delimited columns to identify the primary scaffold from the coverage values.

Primary Scaffold Identification

For each input stats file, the tool identifies the "primary" scaffold using a dual-criteria approach:

MB coverage comparison: Compares each scaffold's split[2] value (coverage in megabases) against the current primary using the condition mb>pmb
Count tiebreaker: Falls back to the condition count>pcount when the coverage values are equal (mb==pmb)
Primary reassignment: When a new primary is found, the previous primary's values are added to the "other" totals via ocount+=pcount and omb+=pmb
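
Taken together, the two conditions amount to choosing the scaffold with the largest (mb, count) pair. The sketch below illustrates that reading in Python; it is not the Java source. The function name pick_primary and the (name, mb, count) tuple layout are assumptions, and the sketch further assumes that a scaffold tying on both values does not displace an earlier one, since both documented conditions are strict.

```python
def pick_primary(rows):
    """Return the primary row from an iterable of (name, mb, count) tuples.

    Tuples compare lexicographically, so (mb, count) > (pmb, pcount) is the
    same test as: mb > pmb or (mb == pmb and count > pcount).
    max() keeps the first of several equal maxima, which mirrors a strict
    comparison that never replaces the primary on an exact tie.
    """
    return max(rows, key=lambda row: (row[1], row[2]))
```

For example, given ("orgA", 1.5, 100) and ("orgB", 1.5, 120), the coverage values tie and the higher count makes orgB the primary.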

Coverage Aggregation Strategy

Values for non-primary scaffolds are summed into running "other" totals:

Other count aggregation: Read counts of non-primary scaffolds are accumulated with ocount+=count
Other coverage aggregation: MB values of non-primary scaffolds are accumulated with omb+=mb
Primary promotion handling: When a scaffold is promoted to primary, the displaced primary's statistics move into the "other" totals (ocount+=pcount and omb+=pmb), so every scaffold is counted exactly once
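
The single-pass bookkeeping can be sketched as follows. This is an illustrative Python rendering of the steps described here, not the Java source: the function name summarize_scafstats is hypothetical, and the sketch assumes that header lines start with "#" and that the first data row seeds the primary. The column indices (split[0], split[2], split[5]) and the variable names follow this document.

```python
def summarize_scafstats(lines):
    """Return (pname, pcount, ocount, pmb, omb) for one stats file's lines."""
    pname, pcount, pmb = None, 0, 0.0
    ocount, omb = 0, 0.0
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # assumption: blank and '#' header lines carry no data
        split = line.split("\t")
        name, mb, count = split[0], float(split[2]), int(split[5])
        if pname is None or mb > pmb or (mb == pmb and count > pcount):
            # Promotion: the displaced primary joins the "other" totals.
            ocount += pcount
            omb += pmb
            pname, pcount, pmb = name, count, mb
        else:
            # Non-primary scaffold: accumulate directly into "other".
            ocount += count
            omb += mb
    return pname, pcount, ocount, pmb, omb
```

Because a displaced primary is folded into the totals at the moment it is replaced, no second pass over the file is needed.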

Output Format

The tool generates tab-delimited output using the Tools.format() method, with six columns:

File: Original stats filename (fname)
Primary_Name: Name of the scaffold with highest coverage (pname from split[0])
Primary_Count: Read count mapping to the primary scaffold (pcount from split[5])
Other_Count: Aggregated read count for all non-primary scaffolds (ocount)
Primary_MB: Coverage in megabases for the primary scaffold (pmb from split[2])
Other_MB: Aggregated coverage in megabases for all non-primary scaffolds (omb)
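
For downstream analysis, the summary can be loaded with any tab-aware reader. The sketch below uses Python's csv module; the helper name read_summary is hypothetical, and whether the file begins with a header row (with or without a leading "#") is an assumption, so the code simply skips such a row if it is present.

```python
import csv

COLUMNS = ["File", "Primary_Name", "Primary_Count",
           "Other_Count", "Primary_MB", "Other_MB"]


def read_summary(handle):
    """Parse a tab-delimited summary into a list of typed dictionaries."""
    rows = []
    for fields in csv.reader(handle, delimiter="\t"):
        if not fields or fields[0].lstrip("#") == "File":
            continue  # skip blank lines and a header row, if any
        row = dict(zip(COLUMNS, fields))
        for key in ("Primary_Count", "Other_Count"):
            row[key] = int(row[key])
        for key in ("Primary_MB", "Other_MB"):
            row[key] = float(row[key])
        rows.append(row)
    return rows
```

A typical call would be read_summary(open("summary.txt", newline="")).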

Cross-Contamination Detection

The summary supports contamination analysis in several ways:

Primary vs. other ratios: A high Other_Count/Primary_Count ratio may indicate contamination
Coverage distribution analysis: Unexpected patterns in Primary_MB vs. Other_MB can reveal cross-contamination
Library comparison: Comparing ratios across multiple libraries helps identify systematic contamination issues
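
As a rough illustration of the ratio check, the sketch below flags libraries whose Other_Count/Primary_Count ratio exceeds a cutoff. The function names and the 0.05 default are hypothetical; this document does not define a threshold, and a suitable value depends on the experiment.

```python
def other_ratio(primary_count, other_count):
    """Other_Count / Primary_Count, or infinity when nothing hit the primary."""
    if primary_count == 0:
        return float("inf")
    return other_count / primary_count


def flag_libraries(rows, max_ratio=0.05):
    """Return the files whose ratio exceeds max_ratio.

    rows: iterable of (file, primary_count, other_count) tuples.
    max_ratio: illustrative cutoff only, not a recommended value.
    """
    return [name for name, primary, other in rows
            if other_ratio(primary, other) > max_ratio]
```

Comparing the flagged set across libraries from the same run can help separate one noisy library from a systematic problem.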

Memory and Performance

The tool processes each file in a single pass with a small memory footprint:

Memory allocation: Default maximum memory of 120MB (z="-Xmx120m" in the shell script)
Sequential processing: Each stats file is read through the TextFile.nextLine() iterator without loading the entire file into memory
Single-pass algorithm: Each input file is read once by the for loop in the process() method

Use Cases

Multiplexed Organism Analysis

The primary use case is evaluation of cross-contamination in multiplexed sequencing experiments where:

Multiple organisms are sequenced together
Each organism has a dedicated scaffold in a concatenated reference
Reads should primarily map to their organism's scaffold
Cross-mapping indicates potential contamination

Quality Control Workflow

Typical workflow for contamination assessment:

Create concatenated reference with one scaffold per organism
Map each library to the concatenated reference using BBMap
Generate scafstats output for each mapping
Use summarizescafstats to consolidate results
Analyze Primary/Other ratios to identify contamination
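
The mapping and summarizing steps can be scripted. The sketch below only builds the command lines and does not run them. The library names, read files, reference path, and the scafstats_<name>.txt naming scheme are all placeholders, and the helper build_commands is hypothetical. It assumes bbmap.sh is invoked with its in=, ref=, nodisk, and scafstats= parameters.

```python
def build_commands(libraries, reference, summary="summary.txt"):
    """Build one bbmap.sh command per library plus the final summary command.

    libraries: dict mapping a library name to its reads file (placeholders).
    reference: path to the concatenated reference (placeholder).
    """
    commands, stats_files = [], []
    for name, reads in libraries.items():
        stats = f"scafstats_{name}.txt"
        stats_files.append(stats)
        commands.append(["bbmap.sh", f"in={reads}", f"ref={reference}",
                         "nodisk", f"scafstats={stats}"])
    commands.append(["summarizescafstats.sh",
                     "in=" + ",".join(stats_files), f"out={summary}"])
    return commands
```

Each returned list can be handed to subprocess.run(cmd, check=True) once the placeholder paths are replaced with real files.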

Support

For questions and support:

Email: bbushnell@lbl.gov
Documentation: bbmap.org