SummarizeCoverage

Script: summarizecoverage.sh
Package: covid
Class: SummarizeCoverage.java

Summarizes coverage information from basecov files created by pileup.sh. These files contain per-base coverage information and should be named like 'sample1_basecov.txt', though other naming styles are acceptable.

Basic Usage

summarizecoverage.sh *basecov.txt out=<output file>

Process multiple basecov files and generate a summary table with coverage statistics.

Parameters

Parameters control input files, output format, and reference length calculations for coverage statistics.

Input/Output Parameters

in=<file>
Input basecov file. The 'in=' prefix is optional; any filename given as a parameter is treated as an input basecov file. To process multiple files, specify each one as a separate parameter.
out=<file>
Write the summary table to this file. Default is stdout. The output is tab-delimited with columns for sample name, average coverage, and percentage of bases at various coverage thresholds.

Reference Parameters

reflen=-1
Reference length in bases. If positive, use this as the total reference length for percentage calculations. If negative (default), assume basecov files report coverage for every reference base. This parameter is useful when basecov files only contain entries for covered bases.
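
The effect of reflen on percentage calculations can be illustrated with a small sketch. The function name percent_covered and the numbers used here are hypothetical and are not taken from SummarizeCoverage.java; the sketch only assumes that a positive reflen replaces the number of lines seen as the denominator.

```python
def percent_covered(covered_bases, lines_seen, reflen=-1):
    """Percentage of reference bases with coverage above zero.

    Assumption: a positive reflen is used as the denominator; otherwise the
    number of lines seen in the basecov file stands in for the reference length.
    """
    total = reflen if reflen > 0 else lines_seen
    return 100.0 * covered_bases / total if total > 0 else 0.0


# A basecov file listing only the 900 covered positions of a 1000-base reference.
without_reflen = percent_covered(covered_bases=900, lines_seen=900)
with_reflen = percent_covered(covered_bases=900, lines_seen=900, reflen=1000)
print(without_reflen, with_reflen)
```

Without reflen the file appears fully covered, because uncovered positions are never counted; supplying the true length restores them to the denominator.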

Processing Parameters

lines=<long>
Maximum number of lines to process from each basecov file. Default is unlimited (Long.MAX_VALUE). Use this to limit processing for testing or when working with very large files.
verbose=false
Enable verbose output for debugging. Shows additional information about file processing and internal operations.

Output Format

The tool generates a tab-delimited table with columns for sample name, average coverage, and the percentage of bases at various coverage thresholds.

Examples

Basic Coverage Summary

summarizecoverage.sh sample1_basecov.txt sample2_basecov.txt out=coverage_summary.txt

Process two basecov files and write the coverage summary to a file.

Wildcard Processing

summarizecoverage.sh *_basecov.txt out=all_samples_coverage.txt

Process all basecov files in the current directory matching the pattern and create a comprehensive summary.

With Reference Length

summarizecoverage.sh sample_basecov.txt reflen=4641652 out=summary.txt

Process coverage data with a known reference genome length for accurate percentage calculations when basecov files only contain covered positions.

Limited Processing

summarizecoverage.sh large_basecov.txt lines=1000000 out=partial_summary.txt

Process only the first million lines of a large basecov file for quick analysis.

Algorithm Details

The SummarizeCoverage tool implements a histogram-based algorithm that uses a fixed 21-element array (coverage values 0-20) for coverage analysis.

Coverage Histogram Processing

The algorithm maintains a coverage histogram with bins from 0 to 20. Coverage values above 20 are capped at 20, so the last bin counts every position with coverage of 20 or more. This is done for each basecov file.
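
A minimal sketch of the capped histogram described above, written in Python rather than the tool's Java. It assumes the tab-delimited basecov layout written by pileup.sh (a '#'-prefixed header followed by reference name, position, and coverage); build_histogram and MAX_COV are illustrative names, not identifiers from SummarizeCoverage.java.

```python
import io

MAX_COV = 20  # coverage values above this are counted in the last bin


def build_histogram(handle):
    """Return a 21-element list where hist[c] is the number of positions with coverage c."""
    hist = [0] * (MAX_COV + 1)
    for line in handle:
        if line.startswith("#") or not line.strip():
            continue  # skip the header and blank lines
        # Assumed column layout: RefName <tab> Pos <tab> Coverage
        coverage = int(line.rstrip("\n").split("\t")[2])
        hist[min(coverage, MAX_COV)] += 1
    return hist


example = io.StringIO(
    "#RefName\tPos\tCoverage\n"
    "chr1\t0\t0\n"
    "chr1\t1\t5\n"
    "chr1\t2\t20\n"
    "chr1\t3\t137\n"
)
hist = build_histogram(example)
print(hist[0], hist[5], hist[20])
```

Because of the cap, the positions with coverage 20 and 137 both land in the last bin.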

Statistical Calculations

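For each sample, the tool calculates the average coverage and the percentage of bases at various coverage thresholds. When reflen is positive, it is used as the total reference length for the percentage calculations.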
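
One way such statistics could be derived is sketched below. The threshold values (1, 5, 10, 20) and the function name summarize are assumptions chosen for illustration; the thresholds the tool actually reports are not listed here. The sketch also assumes the average is taken from a separate running sum, since a histogram capped at 20 cannot recover it.

```python
def summarize(coverages, thresholds=(1, 5, 10, 20), max_cov=20):
    """Return (average coverage, {threshold: percent of positions at or above it})."""
    hist = [0] * (max_cov + 1)
    total_cov = 0
    for c in coverages:
        hist[min(c, max_cov)] += 1  # capped bin, as in the histogram step
        total_cov += c              # uncapped sum, kept for the average
    positions = sum(hist)
    if positions == 0:
        return 0.0, {t: 0.0 for t in thresholds}
    avg = total_cov / positions
    # Positions at or above threshold t are those in bins t..max_cov.
    pct = {t: 100.0 * sum(hist[t:]) / positions for t in thresholds}
    return avg, pct


avg, pct = summarize([0, 5, 20, 135])
print(avg, pct)
```

Thresholds above max_cov cannot be answered from the capped histogram, which is why the illustrative thresholds stop at 20.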

Sample Name Processing

Sample names are automatically derived from input filenames.
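
The exact derivation rules are not spelled out here, so the following is only a guess at the general idea: strip the directory and a trailing '_basecov.txt' when present, and otherwise drop the file extension. The function sample_name and its suffix parameter are hypothetical.

```python
import os


def sample_name(path, suffix="_basecov.txt"):
    """Derive a sample name from a basecov filename (assumed rule, not the tool's)."""
    base = os.path.basename(path)
    if base.endswith(suffix):
        return base[: -len(suffix)]
    # Other naming styles: fall back to removing only the extension.
    return os.path.splitext(base)[0]


print(sample_name("data/sample1_basecov.txt"))
print(sample_name("other_style.tsv"))
```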

Memory Implementation

The coverage histogram is a fixed 21-element array, so the memory it needs does not depend on the size of the input file.
