SummarizeScafstats
Summarizes the scafstats output of BBMap to evaluate cross-contamination. The intended use is to map multiple libraries or assemblies, each from a different multiplexed organism, to a concatenated reference containing one fused scaffold per organism. The tool then converts the resulting stats files (one per library) into a single multi-column text file indicating how much of each input hit its primary scaffold versus non-primary scaffolds.
Basic Usage
summarizescafstats.sh in=<file,file...> out=<file>
You can alternatively use a wildcard, like this:
summarizescafstats.sh scafstats_*.txt out=summary.txt
Parameters
This tool has a small parameter set for processing multiple scafstats files generated by BBMap.
Input/Output Parameters
- in=<file>
- A list of stats files, or a text file containing one stats file name per line. Can be specified as comma-separated files or using wildcards.
- out=<file>
- Destination for summary output. Tab-delimited file containing columns for File, Primary_Name, Primary_Count, Other_Count, Primary_MB, and Other_MB.
Examples
Processing Multiple Stats Files
summarizescafstats.sh in=lib1_stats.txt,lib2_stats.txt,lib3_stats.txt out=contamination_summary.txt
Processes three individual stats files and creates a summary table.
Using Wildcard Pattern
summarizescafstats.sh scafstats_*.txt out=summary.txt
Processes all scafstats files matching the pattern and creates a single summary.
Input List File
summarizescafstats.sh in=filelist.txt out=contamination_report.txt
Reads a text file containing one stats filename per line and processes all listed files.
Algorithm Details
SummarizeCoverage processes each scafstats file line by line, splitting every line on tabs and using the coverage columns to identify the primary scaffold:
Primary Scaffold Identification
For each input stats file, the tool identifies the "primary" scaffold using a dual-criteria approach:
- MB coverage comparison: A scaffold becomes the new primary when its split[2] value (megabase coverage) exceeds that of the current primary (mb>pmb)
- Count tiebreaker: When coverage values are equal (mb==pmb), the scaffold with the higher read count wins (count>pcount)
- Primary reassignment: When a new primary is found, the previous primary's values are added to the "other" totals via ocount+=pcount and omb+=pmb
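The comparison rule above can be sketched in a few lines of Python. This is an illustration only, not the tool's Java source: the function names beats_primary and pick_primary are hypothetical, and treating the first scaffold seen as the initial primary is an assumption.

```python
def beats_primary(mb, count, pmb, pcount):
    """True if a scaffold with (mb, count) should replace the current primary."""
    # Higher MB coverage wins; read count only breaks exact ties.
    return mb > pmb or (mb == pmb and count > pcount)


def pick_primary(rows):
    """rows: iterable of (name, mb, count). Returns the winning row, or None if empty."""
    primary = None  # assumption: the first scaffold seen starts as the primary
    for name, mb, count in rows:
        if primary is None or beats_primary(mb, count, primary[1], primary[2]):
            primary = (name, mb, count)
    return primary
```

With the rows ("a", 1.0, 10), ("b", 2.5, 5), ("c", 2.5, 7), scaffold b displaces a on coverage and c then displaces b on the count tiebreaker.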
Coverage Aggregation Strategy
The tool keeps running totals for non-primary scaffolds:
- Other count aggregation: Read counts of non-primary scaffolds are accumulated via ocount+=count
- Other coverage aggregation: MB values of non-primary scaffolds are accumulated via omb+=mb
- Primary promotion handling: When a scaffold becomes the new primary, the previous primary's statistics move to the "other" category via ocount+=pcount and omb+=pmb
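A minimal Python sketch of the whole per-file pass, combining primary selection with the "other" totals, might look as follows. The variable names mirror those quoted above; the function name summarize and the zero starting values for pcount and pmb are assumptions, not taken from the Java source.

```python
def summarize(rows):
    """rows: iterable of (name, mb, count) for one scafstats file.

    Returns (pname, pcount, ocount, pmb, omb).
    """
    pname, pcount, pmb = None, 0, 0.0  # assumed starting values
    ocount, omb = 0, 0.0
    for name, mb, count in rows:
        if mb > pmb or (mb == pmb and count > pcount):
            # Demote the old primary into the "other" totals, then promote.
            ocount += pcount
            omb += pmb
            pname, pcount, pmb = name, count, mb
        else:
            # Non-primary scaffold: add straight to the "other" totals.
            ocount += count
            omb += mb
    return pname, pcount, ocount, pmb, omb
```

Because every scaffold ends up in either the primary or the "other" totals, pcount+ocount always equals the sum of all counts in the file, which is a handy sanity check.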
Output Format
The tool generates tab-delimited output, formatted with the Tools.format() method, with six columns:
- File: Original stats filename (fname)
- Primary_Name: Name of the scaffold with highest coverage (pname from split[0])
- Primary_Count: Read count mapping to the primary scaffold (pcount from split[5])
- Other_Count: Aggregated read count for all non-primary scaffolds (ocount)
- Primary_MB: Coverage in megabases for the primary scaffold (pmb from split[2])
- Other_MB: Aggregated coverage in megabases for all non-primary scaffolds (omb)
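To make the layout concrete, here is a hedged Python sketch that renders records in the six-column order listed above. The helper names are hypothetical, and the header spelling and the five-decimal MB formatting are assumptions; the exact format string passed to Tools.format() is not stated here.

```python
COLUMNS = ("File", "Primary_Name", "Primary_Count",
           "Other_Count", "Primary_MB", "Other_MB")


def format_row(fname, pname, pcount, ocount, pmb, omb):
    """Render one summary line; the decimal precision is an arbitrary choice."""
    fields = [fname, pname, str(pcount), str(ocount), f"{pmb:.5f}", f"{omb:.5f}"]
    return "\t".join(fields)


def format_summary(records):
    """records: iterable of 6-tuples in COLUMNS order. Returns the full text."""
    lines = ["\t".join(COLUMNS)]
    lines.extend(format_row(*record) for record in records)
    return "\n".join(lines) + "\n"
```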
Cross-Contamination Detection
The output supports contamination analysis in several ways:
- Primary vs. secondary ratios: High Other_Count/Primary_Count ratios may indicate contamination
- Coverage distribution analysis: Unexpected patterns in Primary_MB vs. Other_MB can reveal cross-contamination
- Library comparison: Comparing ratios across multiple libraries helps identify systematic contamination issues
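As an illustration of the ratio check, the following Python sketch reads a summary and flags libraries whose Other_Count/Primary_Count ratio exceeds a cutoff. The function names and the 0.01 default are hypothetical; a sensible threshold depends on the experiment. The sketch also assumes the summary begins with a header line naming the six columns, optionally prefixed with "#".

```python
import csv
import io


def other_ratios(summary_text):
    """Map each File to its Other_Count / Primary_Count ratio."""
    # Tolerate a header line that starts with '#'.
    reader = csv.DictReader(io.StringIO(summary_text.lstrip("#")), delimiter="\t")
    ratios = {}
    for row in reader:
        primary = float(row["Primary_Count"])
        other = float(row["Other_Count"])
        # A library with no primary reads gets an infinite ratio.
        ratios[row["File"]] = other / primary if primary else float("inf")
    return ratios


def flag_libraries(summary_text, max_ratio=0.01):
    """Return the sorted names of libraries whose ratio exceeds max_ratio."""
    ratios = other_ratios(summary_text)
    return sorted(name for name, ratio in ratios.items() if ratio > max_ratio)
```

Comparing the ratios across all libraries, rather than applying one fixed cutoff, is often more informative, because a uniformly elevated ratio points to a systematic issue rather than a single contaminated library.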
Memory and Performance
The tool reads each input file once, in order:
- Memory allocation: Default maximum memory of 120MB (z="-Xmx120m" in shell script)
- Sequential processing: Each stats file is read line by line via TextFile.nextLine(), without loading the entire file into memory
- Single-pass algorithm: Each input file is read exactly once by the for loop in the process() method
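The same streaming style can be sketched in Python with a generator that yields one parsed record per line, so that memory use stays flat regardless of file size. The column indices follow the split[0], split[2], and split[5] positions described earlier; the function names are hypothetical, and skipping blank lines and lines starting with "#" is an assumption about how headers are handled.

```python
def iter_scafstats(lines):
    """Lazily yield (name, mb, count) from an iterable of scafstats lines."""
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # assumed: header and blank lines carry no data
        split = line.split("\t")
        yield split[0], float(split[2]), int(split[5])


def read_scafstats(path):
    """Stream records from a file; only one line is held in memory at a time."""
    with open(path) as handle:
        yield from iter_scafstats(handle)
```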
Use Cases
Multiplexed Organism Analysis
The primary use case is evaluation of cross-contamination in multiplexed sequencing experiments where:
- Multiple organisms are sequenced together
- Each organism has a dedicated scaffold in a concatenated reference
- Reads should primarily map to their organism's scaffold
- Cross-mapping indicates potential contamination
Quality Control Workflow
Typical workflow for contamination assessment:
- Create concatenated reference with one scaffold per organism
- Map each library to the concatenated reference using BBMap
- Generate scafstats output for each mapping
- Use summarizescafstats to consolidate results
- Analyze Primary/Other ratios to identify contamination
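The mapping and summarizing steps of this workflow can be scripted. The Python sketch below only builds the command lines and does not run them. The helper name build_commands and all file names are hypothetical, and the BBMap flags used (ref=, in=, scafstats=, nodisk=t) are assumed to behave as described in bbmap.sh's own help text, so check them against the installed version.

```python
from pathlib import Path


def build_commands(libraries, reference, summary="summary.txt"):
    """Return one bbmap.sh command per library plus a final summarize command."""
    commands, stats_files = [], []
    for lib in libraries:
        stem = Path(lib).name.split(".")[0]  # "lib1.fq.gz" -> "lib1"
        stats = f"scafstats_{stem}.txt"
        stats_files.append(stats)
        commands.append([
            "bbmap.sh",
            f"ref={reference}",     # concatenated reference, one scaffold per organism
            f"in={lib}",
            f"scafstats={stats}",   # per-library scaffold statistics
            "nodisk=t",             # assumed: keep the index in memory
        ])
    commands.append([
        "summarizescafstats.sh",
        "in=" + ",".join(stats_files),
        f"out={summary}",
    ])
    return commands
```

Each command list can then be passed to subprocess.run(cmd, check=True) in order; building the commands separately from running them makes the plan easy to inspect before any mapping starts.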
Support
For questions and support:
- Email: bbushnell@lbl.gov
- Documentation: bbmap.org