SummarizeCoverage

Script: summarizecoverage.sh
Package: covid
Class: SummarizeCoverage.java

Summarizes coverage information from basecov files created by pileup.sh. These files contain per-base coverage information and should be named like 'sample1_basecov.txt', though other naming styles are acceptable.

Basic Usage

summarizecoverage.sh *basecov.txt out=<output file>

Process multiple basecov files and generate a summary table with coverage statistics.

Parameters

Parameters control the input files, the output destination, the reference length used in coverage statistics, and processing limits.

Input/Output Parameters

in=<file>: Input basecov file. The 'in=' prefix is optional - any filename used as a parameter will be assumed to be an input basecov file. Multiple files can be processed by specifying them as separate parameters.
out=<file>: Write the summary table to this file. Default is stdout. The output is tab-delimited with columns for sample name, average coverage, and percentage of bases at various coverage thresholds.

Reference Parameters

reflen=-1: Reference length in bases. If positive, use this as the total reference length when calculating average coverage and percentages. If negative (default), assume basecov files report coverage for every reference base, and use the number of positions in the file instead. This parameter is useful when basecov files only contain entries for covered bases.
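
The effect of reflen on the percentage columns can be illustrated with a small Python sketch. This is not the tool's code; the function name percent_covered and the numbers are hypothetical, and treating reflen=0 the same as a negative value is an assumption, since only the positive and negative cases are described above.

```python
def percent_covered(covered_bases, listed_positions, reflen=-1):
    """Percent of the reference covered.

    A positive reflen replaces the number of positions listed in the file
    as the denominator (assumption: zero behaves like a negative value).
    """
    denominator = reflen if reflen > 0 else listed_positions
    if denominator <= 0:
        return 0.0
    return 100.0 * covered_bases / denominator

# Hypothetical basecov file listing only the 900 covered positions
# of a 1000 bp reference.
print(percent_covered(900, 900))        # 100.0 (denominator: listed positions)
print(percent_covered(900, 900, 1000))  # 90.0  (denominator: reflen)
```

Without reflen, a file that omits uncovered positions would appear fully covered, which is why the parameter matters in that case.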

Processing Parameters

lines=<long>: Maximum number of lines to process from each basecov file. Default is unlimited (Long.MAX_VALUE). Use this to limit processing for testing or when working with very large files.
verbose=false: Enable verbose output for debugging. Shows additional information about file processing and internal operations.

Output Format

The tool generates a tab-delimited table with the following columns:

Sample: Sample name derived from the input filename (with '_basecov' suffix removed)
AvgCov: Average coverage across all reference positions
%>=1x: Percentage of bases with coverage ≥1
%>=2x: Percentage of bases with coverage ≥2
%>=3x: Percentage of bases with coverage ≥3
%>=4x: Percentage of bases with coverage ≥4
%>=5x: Percentage of bases with coverage ≥5
%>=10x: Percentage of bases with coverage ≥10
%>=20x: Percentage of bases with coverage ≥20
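
For downstream use, the table can be read with any tab-delimited parser. The following Python sketch assumes the first line is a header carrying the column names listed above, possibly prefixed with '#'; the function name read_summary, the sample rows, and the number formatting are invented for illustration and are not real tool output.

```python
import csv
import io

def read_summary(handle):
    """Parse a tab-delimited summary table into one dict per sample row."""
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    header[0] = header[0].lstrip("#")  # assumption: header may start with '#'
    return [dict(zip(header, row)) for row in reader if row]

# Invented values, for illustration only.
example = io.StringIO(
    "Sample\tAvgCov\t%>=1x\t%>=2x\t%>=3x\t%>=4x\t%>=5x\t%>=10x\t%>=20x\n"
    "sampleA\t35.2\t99.1\t98.7\t98.5\t98.0\t97.6\t95.0\t80.3\n"
    "sampleB\t4.8\t90.0\t75.5\t60.2\t48.9\t40.1\t5.5\t0.2\n"
)
rows = read_summary(example)

# Flag samples where fewer than 90% of bases reach 10x (an arbitrary cutoff).
low = [r["Sample"] for r in rows if float(r["%>=10x"]) < 90.0]
print(low)  # ['sampleB']
```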

Examples

Basic Coverage Summary

summarizecoverage.sh sample1_basecov.txt sample2_basecov.txt out=coverage_summary.txt

Process two basecov files and write the coverage summary to a file.

Wildcard Processing

summarizecoverage.sh *_basecov.txt out=all_samples_coverage.txt

Process all basecov files in the current directory matching the pattern and create a comprehensive summary.

With Reference Length

summarizecoverage.sh sample_basecov.txt reflen=4641652 out=summary.txt

Process coverage data with a known reference genome length for accurate percentage calculations when basecov files only contain covered positions.

Limited Processing

summarizecoverage.sh large_basecov.txt lines=1000000 out=partial_summary.txt

Process only the first million lines of a large basecov file for quick analysis.

Algorithm Details

SummarizeCoverage uses a histogram-based algorithm built on a fixed 21-element array, with one bin for each coverage value from 0 to 20.

Coverage Histogram Processing

The algorithm maintains a coverage histogram with bins from 0 to 20, where coverage values above 20 are capped at 20. For each basecov file:

Parses each line to extract the coverage value from the last tab-delimited column
Accumulates coverage values to calculate total coverage sum
Builds a histogram of coverage frequencies up to 20x
Converts the histogram to cumulative form by iterating backwards from index 20 to 1, so that each bin holds the number of positions with coverage at or above that value
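
The steps above can be sketched in Python as follows. This is an illustrative reimplementation, not the Java source; the function name coverage_histogram is hypothetical, and skipping lines that start with '#' assumes that basecov header lines are marked that way.

```python
MAX_BIN = 20  # coverage values above 20 are counted in the 20x bin

def coverage_histogram(lines):
    """Return (cumulative histogram, coverage sum, positions seen)."""
    hist = [0] * (MAX_BIN + 1)
    coverage_sum = 0
    positions = 0
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # assumption: header lines start with '#'
        cov = int(line.rsplit("\t", 1)[-1])  # last tab-delimited column
        coverage_sum += cov
        positions += 1
        hist[min(cov, MAX_BIN)] += 1
    # Walk from index 20 down to 1, adding each bin into the one below it,
    # so hist[i] becomes the number of positions with coverage >= i.
    for i in range(MAX_BIN, 0, -1):
        hist[i - 1] += hist[i]
    return hist, coverage_sum, positions

# Toy input: four positions with coverage 0, 3, 25 and 1.
toy = [
    "#RefName\tPos\tCoverage",
    "chr1\t0\t0",
    "chr1\t1\t3",
    "chr1\t2\t25",
    "chr1\t3\t1",
]
hist, coverage_sum, positions = coverage_histogram(toy)
print(hist[1], hist[3], hist[20])  # 3 2 1
print(coverage_sum, positions)     # 29 4
```

Note that the coverage sum uses the uncapped value (25 here), while the histogram places that position in the 20x bin.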

Statistical Calculations

The tool calculates coverage statistics using the following approach:

Reference Length: Uses either the provided reflen parameter or the total number of positions in the basecov file
Average Coverage: Total coverage sum divided by reference length
Coverage Thresholds: Cumulative histogram values provide the count of bases at or above each threshold (1x, 2x, 3x, 4x, 5x, 10x, 20x)
Percentage Conversion: Threshold counts are converted to percentages using the reference length
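
As a rough sketch of these calculations in Python (the function name summarize is hypothetical, and the guard for a non-positive reference length is an assumption not described in this document):

```python
THRESHOLDS = (1, 2, 3, 4, 5, 10, 20)

def summarize(cumulative, coverage_sum, reference_length):
    """Return (average coverage, {threshold: percent of bases at or above it})."""
    if reference_length <= 0:
        # Assumption: report zeros rather than dividing by zero.
        return 0.0, {t: 0.0 for t in THRESHOLDS}
    average = coverage_sum / reference_length
    percents = {
        t: 100.0 * cumulative[t] / reference_length for t in THRESHOLDS
    }
    return average, percents

# Toy cumulative histogram for four positions with coverage 0, 1, 3 and 25.
cumulative = [4, 3, 2, 2] + [1] * 17
average, percents = summarize(cumulative, coverage_sum=29, reference_length=4)
print(average)       # 7.25
print(percents[1])   # 75.0
print(percents[10])  # 25.0
```

Because the histogram is already cumulative, each threshold is a single array lookup rather than a sum over bins.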

Sample Name Processing

Sample names are automatically derived from input filenames:

Strips the file path to get the core filename
Removes the '_basecov' suffix if present
Uses the resulting string as the sample identifier in the output table
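
A minimal Python sketch of this naming rule is shown below. The function name sample_name is hypothetical, and it assumes that the "core filename" also drops a single file extension such as '.txt'; the tool's handling of multiple or compressed extensions is not described here.

```python
import os

SUFFIX = "_basecov"

def sample_name(path):
    """Derive a sample label from a basecov file path."""
    # Strip the directory, then one extension (assumption, see lead-in).
    core = os.path.splitext(os.path.basename(path))[0]
    if core.endswith(SUFFIX):
        core = core[: -len(SUFFIX)]
    return core

print(sample_name("/data/run1/sample1_basecov.txt"))  # sample1
print(sample_name("other_style.txt"))                 # other_style
```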

Memory Implementation

Memory use is small and does not depend on input size:

Processes files line-by-line using ByteFile streaming without loading entire files into memory
Uses a fixed-size histogram array (21 long elements) regardless of file size
Default JVM heap allocation is 200MB (-Xmx200m), sufficient for processing coverage files of any size
Processes multiple files sequentially, one at a time, so memory use does not grow with the number of input files
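
The streaming approach, together with a line cap like the lines parameter, can be sketched in Python as follows. This is an analogy rather than the ByteFile implementation; the function name sum_coverage is hypothetical, and whether the real limit counts header lines is an assumption made here for simplicity.

```python
import io
from itertools import islice

def sum_coverage(handle, max_lines=None):
    """Stream a basecov file and return (coverage sum, positions read).

    Only one line is held in memory at a time. max_lines caps the number
    of lines read, header included (assumption); None means no limit.
    """
    coverage_sum = 0
    positions = 0
    for line in islice(handle, max_lines):
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # assumption: header lines start with '#'
        coverage_sum += int(line.rsplit("\t", 1)[-1])
        positions += 1
    return coverage_sum, positions

text = "#RefName\tPos\tCoverage\nchr1\t0\t2\nchr1\t1\t4\nchr1\t2\t6\n"
print(sum_coverage(io.StringIO(text)))               # (12, 3)
print(sum_coverage(io.StringIO(text), max_lines=2))  # (2, 1)
```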

Support

For questions and support:

Email: bbushnell@lbl.gov
Documentation: bbmap.org