SummarizeScafstats

Script: summarizescafstats.sh
Package: driver
Class: SummarizeCoverage.java

Summarizes the scafstats output of BBMap to evaluate cross-contamination. The intended use is to map multiple libraries or assemblies, each from a different multiplexed organism, to a concatenated reference containing one fused scaffold per organism. The tool then converts the resulting stats files (one per library) into a single multi-column text file showing how much of each input hit its primary scaffold versus the nonprimary scaffolds.

Basic Usage

summarizescafstats.sh in=<file,file...> out=<file>

Alternatively, pass the stats files with a wildcard:

summarizescafstats.sh scafstats_*.txt out=summary.txt

Parameters

The tool takes two parameters: the scafstats files produced by BBMap's mapping statistics output, and the destination for the summary.

Input/Output Parameters

in=<file>
The stats files to summarize, given as a comma-separated list, a wildcard pattern, or a text file containing one stats file name per line.
out=<file>
Destination for summary output. Tab-delimited file containing columns for File, Primary_Name, Primary_Count, Other_Count, Primary_MB, and Other_MB.
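
As an illustration of the output layout, the following minimal Python sketch reads a summary file's text into typed rows. It assumes the six columns appear in the order listed above, that the two count columns hold integers, and that any header line starts with '#' or with the column name "File". The names SummaryRow and parse_summary are hypothetical helpers, not part of BBTools.

```python
from typing import List, NamedTuple


class SummaryRow(NamedTuple):
    """One line of the summary, using the six documented columns."""
    file: str
    primary_name: str
    primary_count: int
    other_count: int
    primary_mb: float
    other_mb: float


def parse_summary(text: str) -> List[SummaryRow]:
    """Parse tab-delimited summary text into typed rows (hypothetical helper)."""
    rows = []
    for line in text.splitlines():
        fields = line.split("\t")
        # Skip blank lines and a header line, if present. The header
        # markers checked here are an assumption about the file.
        if not line.strip() or fields[0].startswith("#") or fields[0] == "File":
            continue
        name, primary, p_count, o_count, p_mb, o_mb = fields[:6]
        rows.append(SummaryRow(name, primary, int(p_count), int(o_count),
                               float(p_mb), float(o_mb)))
    return rows
```

Keeping the parser separate from any analysis makes it easy to adjust if the actual header or number formatting differs from these assumptions.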

Examples

Processing Multiple Stats Files

summarizescafstats.sh in=lib1_stats.txt,lib2_stats.txt,lib3_stats.txt out=contamination_summary.txt

Processes three individual stats files and creates a summary table.

Using Wildcard Pattern

summarizescafstats.sh scafstats_*.txt out=summary.txt

Processes all scafstats files matching the pattern and creates a single summary.

Input List File

summarizescafstats.sh in=filelist.txt out=contamination_report.txt

Reads a text file containing one stats filename per line and processes all listed files.
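
A list file like filelist.txt can be written by hand or generated. The Python sketch below is one possible way to build it from a wildcard pattern, writing one matching path per line. The function name write_file_list is a hypothetical helper and the pattern shown in the usage comment is only an example.

```python
import glob
from pathlib import Path
from typing import List


def write_file_list(pattern: str, list_path: str) -> List[str]:
    """Write every path matching `pattern` to `list_path`, one per line."""
    # Sort so the list order is stable from run to run.
    paths = sorted(glob.glob(pattern))
    Path(list_path).write_text("".join(p + "\n" for p in paths))
    return paths


# Example usage (paths are illustrative):
# write_file_list("scafstats_*.txt", "filelist.txt")
```

The resulting file can then be passed as in=filelist.txt, as in the command above.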

Algorithm Details

SummarizeCoverage reads each scafstats file line by line, parsing the tab-delimited columns to identify the primary scaffold from its coverage values.

Primary Scaffold Identification

For each input stats file, the tool identifies the "primary" scaffold using a dual-criteria approach:

Coverage Aggregation Strategy

Values for the non-primary scaffolds are added together into running totals.
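
The split between a primary scaffold and the accumulated remainder can be sketched in Python as follows. This is not the tool's actual code: the exact criteria used by SummarizeCoverage are not spelled out here, so the sketch assumes the primary scaffold is the one with the most reads, with megabases breaking ties. Scaffold and split_primary are hypothetical names, and the scaffolds are assumed to be already parsed.

```python
from typing import Iterable, NamedTuple, Tuple


class Scaffold(NamedTuple):
    name: str
    reads: int
    megabases: float


def split_primary(scaffolds: Iterable[Scaffold]) -> Tuple[Scaffold, int, float]:
    """Return (primary, other_reads, other_megabases) for one stats file."""
    scaffolds = list(scaffolds)
    if not scaffolds:
        raise ValueError("no scaffolds to summarize")
    # Assumed rule, not confirmed by the source: most reads wins,
    # and megabases break ties.
    primary = max(scaffolds, key=lambda s: (s.reads, s.megabases))
    other_reads = 0
    other_mb = 0.0
    for s in scaffolds:
        if s is primary:
            continue
        # Everything that is not the primary is summed into the totals.
        other_reads += s.reads
        other_mb += s.megabases
    return primary, other_reads, other_mb
```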

Output Format

The tool generates tab-delimited output using the Tools.format() method, with six columns: File, Primary_Name, Primary_Count, Other_Count, Primary_MB, and Other_MB.

Cross-Contamination Detection

This output format enables contamination analysis by:

Memory and Performance

The tool processes each input file in a single pass.

Use Cases

Multiplexed Organism Analysis

The primary use case is evaluation of cross-contamination in multiplexed sequencing experiments where:

Quality Control Workflow

Typical workflow for contamination assessment:

  1. Create concatenated reference with one scaffold per organism
  2. Map each library to the concatenated reference using BBMap
  3. Generate scafstats output for each mapping
  4. Use summarizescafstats to consolidate results
  5. Analyze Primary/Other ratios to identify contamination
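
Step 5 can be sketched in Python as below. The sketch computes, for each library, the fraction of counts that hit non-primary scaffolds and lists the libraries above a cutoff. The 1% default threshold and the function names are assumptions for illustration only; an appropriate cutoff depends on the experiment.

```python
from typing import Iterable, List, Tuple


def other_fraction(primary_count: int, other_count: int) -> float:
    """Fraction of counts that hit non-primary scaffolds."""
    total = primary_count + other_count
    # Guard against an empty library to avoid dividing by zero.
    return other_count / total if total else 0.0


def flag_contaminated(rows: Iterable[Tuple[str, int, int]],
                      threshold: float = 0.01) -> List[str]:
    """Return the files whose non-primary fraction exceeds `threshold`.

    Each row is (file, primary_count, other_count), taken from the
    File, Primary_Count, and Other_Count columns of the summary.
    The default threshold is an arbitrary illustrative value.
    """
    return [name for name, primary, other in rows
            if other_fraction(primary, other) > threshold]
```

A fraction is used rather than the raw Other_Count so that libraries of different sizes can be compared directly.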

Support

For questions and support: