SummarizeCoverage
Summarizes coverage information from basecov files created by pileup.sh. These files contain per-base coverage information and should be named like 'sample1_basecov.txt', though other naming styles are acceptable.
Basic Usage
summarizecoverage.sh *basecov.txt out=<output file>
Process multiple basecov files and generate a summary table with coverage statistics.
Parameters
The parameters below select the input files and output destination, set the reference length used in percentage calculations, and limit or trace processing.
Input/Output Parameters
- in=<file>
- Input basecov file. The 'in=' prefix is optional - any filename used as a parameter will be assumed to be an input basecov file. Multiple files can be processed by specifying them as separate parameters.
- out=<file>
- Write the summary table to this file. Default is stdout. The output is tab-delimited with columns for sample name, average coverage, and percentage of bases at various coverage thresholds.
Reference Parameters
- reflen=-1
- Reference length in bases. If positive, this value is used as the total reference length when computing average coverage and percentages. Otherwise (the default is -1), the tool assumes that each basecov file reports coverage for every reference base. Set this parameter when basecov files contain entries only for covered bases.
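To make the effect of reflen concrete, here is a small worked calculation in Python. It is only an illustration of the rule described above, not the tool's code; the function name and the 3,000,000-position file are hypothetical.

```python
def percent_at_least_1x(covered_positions, listed_positions, reflen=-1):
    """Percentage of reference bases with coverage >= 1.

    Illustrative only: the denominator is reflen when it is positive,
    otherwise the number of positions listed in the basecov file.
    """
    denominator = reflen if reflen > 0 else listed_positions
    return 100.0 * covered_positions / denominator


# Hypothetical basecov file that lists only its 3,000,000 covered positions.
without_reflen = percent_at_least_1x(3_000_000, 3_000_000)
with_reflen = percent_at_least_1x(3_000_000, 3_000_000, reflen=4_641_652)
print(f"without reflen: {without_reflen:.2f}%  with reflen: {with_reflen:.2f}%")
```

Without reflen the file appears fully covered, because uncovered bases are simply absent from the denominator; supplying the true reference length brings the figure down to roughly 65%.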
Processing Parameters
- lines=<long>
- Maximum number of lines to process from each basecov file. Default is unlimited (Long.MAX_VALUE). Use this to limit processing for testing or when working with very large files.
- verbose=false
- Enable verbose output for debugging. Shows additional information about file processing and internal operations.
Output Format
The tool generates a tab-delimited table with the following columns:
- Sample: Sample name derived from the input filename (with '_basecov' suffix removed)
- AvgCov: Average coverage across all reference positions
- %>=1x: Percentage of bases with coverage ≥1
- %>=2x: Percentage of bases with coverage ≥2
- %>=3x: Percentage of bases with coverage ≥3
- %>=4x: Percentage of bases with coverage ≥4
- %>=5x: Percentage of bases with coverage ≥5
- %>=10x: Percentage of bases with coverage ≥10
- %>=20x: Percentage of bases with coverage ≥20
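For downstream use, a table with these columns can be read with a few lines of Python. This is a minimal sketch: the function name is hypothetical, the rows are made-up values rather than real tool output, and it assumes that any header row starts with '#' or with the word 'Sample'.

```python
import csv
import io

COLUMNS = ["Sample", "AvgCov", "%>=1x", "%>=2x", "%>=3x",
           "%>=4x", "%>=5x", "%>=10x", "%>=20x"]


def read_summary(handle):
    """Yield one dict per sample row from a tab-delimited summary table."""
    for fields in csv.reader(handle, delimiter="\t"):
        # Assumption: header rows start with '#' or with 'Sample'.
        if not fields or fields[0].startswith("#") or fields[0] == "Sample":
            continue
        row = dict(zip(COLUMNS, fields))
        # Every column after the sample name is numeric.
        for name in COLUMNS[1:]:
            row[name] = float(row[name])
        yield row


# Made-up rows for illustration only.
example = (
    "#Sample\tAvgCov\t%>=1x\t%>=2x\t%>=3x\t%>=4x\t%>=5x\t%>=10x\t%>=20x\n"
    "sample1\t35.2\t99.9\t99.8\t99.7\t99.6\t99.5\t98.0\t91.0\n"
    "sample2\t6.1\t95.0\t90.0\t84.0\t77.0\t70.0\t12.0\t0.5\n"
)
rows = list(read_summary(io.StringIO(example)))
low_coverage = [r["Sample"] for r in rows if r["%>=10x"] < 90.0]
print(low_coverage)
```

Here the filter flags samples in which fewer than 90% of bases reach 10x; the cutoff is arbitrary and chosen only for the example.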
Examples
Basic Coverage Summary
summarizecoverage.sh sample1_basecov.txt sample2_basecov.txt out=coverage_summary.txt
Process two basecov files and write the coverage summary to a file.
Wildcard Processing
summarizecoverage.sh *_basecov.txt out=all_samples_coverage.txt
Process every file in the current directory whose name matches the pattern and summarize them in a single table.
With Reference Length
summarizecoverage.sh sample_basecov.txt reflen=4641652 out=summary.txt
Process coverage data with a known reference genome length for accurate percentage calculations when basecov files only contain covered positions.
Limited Processing
summarizecoverage.sh large_basecov.txt lines=1000000 out=partial_summary.txt
Process only the first million lines of a large basecov file for a quick check. The resulting statistics describe only those lines, not the whole file.
Algorithm Details
SummarizeCoverage uses a histogram-based algorithm built on a fixed 21-element array, with one bin for each coverage value from 0 to 20.
Coverage Histogram Processing
The algorithm maintains a coverage histogram with bins from 0 to 20, where coverage values above 20 are capped at 20. For each basecov file:
- Parses each line to extract the coverage value from the last tab-delimited column
- Accumulates coverage values to calculate total coverage sum
- Builds a histogram of coverage frequencies up to 20x
- Converts the histogram to cumulative form by iterating backwards from index 20 to 1, so that each bin holds the number of positions with coverage at or above that value
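The steps above can be sketched in Python as follows. This is an illustrative reimplementation, not the tool's Java source. It assumes that header lines begin with '#', that the coverage sum uses the uncapped value, and that bin 0 is left as the raw count of zero-coverage positions; the function name is hypothetical.

```python
MAX_BIN = 20  # coverage above this value is counted in the last bin


def coverage_histogram(lines):
    """Return (total_coverage, positions, cumulative_histogram).

    After the backward pass, cumulative_histogram[i] is the number of
    positions with coverage >= i for 1 <= i <= 20.
    """
    hist = [0] * (MAX_BIN + 1)
    total = 0
    positions = 0
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):  # assumed header convention
            continue
        cov = int(line.rsplit("\t", 1)[-1])  # last tab-delimited column
        total += cov                          # assumption: sum is uncapped
        hist[min(cov, MAX_BIN)] += 1
        positions += 1
    # Backward pass: each bin absorbs the count of the bin above it.
    for i in range(MAX_BIN - 1, 0, -1):
        hist[i] += hist[i + 1]
    return total, positions, hist


demo_lines = [
    "#RefName\tPos\tCoverage",
    "chr1\t0\t0",
    "chr1\t1\t3",
    "chr1\t2\t25",
    "chr1\t3\t1",
]
total, positions, cumulative = coverage_histogram(demo_lines)
print(total, positions, cumulative[1], cumulative[3], cumulative[20])
```

Because bin 20 already collects every position at 20x or higher, the backward pass can start at index 19 and still leave index 20 correct.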
Statistical Calculations
The tool calculates coverage statistics using the following approach:
- Reference Length: Uses the reflen parameter if it is positive; otherwise uses the total number of positions in the basecov file
- Average Coverage: Total coverage sum divided by reference length
- Coverage Thresholds: Cumulative histogram values provide the count of bases at or above each threshold (1x, 2x, 3x, 4x, 5x, 10x, 20x)
- Percentage Conversion: Threshold counts are converted to percentages using the reference length
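A minimal Python sketch of these calculations, under the assumption that the cumulative histogram is indexed by coverage value as described in the previous section. The function name and the small input are illustrative, and the guard for an empty input is an addition of this sketch.

```python
THRESHOLDS = (1, 2, 3, 4, 5, 10, 20)


def summarize(total_coverage, positions, cumulative, reflen=-1):
    """Return (average_coverage, percentages) for one basecov file.

    cumulative[t] must hold the number of positions with coverage >= t.
    """
    ref = reflen if reflen > 0 else positions
    if ref <= 0:  # empty input: avoid dividing by zero (sketch-only guard)
        return 0.0, [0.0] * len(THRESHOLDS)
    average = total_coverage / ref
    percentages = [100.0 * cumulative[t] / ref for t in THRESHOLDS]
    return average, percentages


# Four positions with coverage 0, 1, 3 and 25 (total 29).
cumulative = [1, 3, 2, 2] + [1] * 17
avg, pct = summarize(29, 4, cumulative)
print(avg, pct)
# prints: 7.25 [75.0, 50.0, 50.0, 25.0, 25.0, 25.0, 25.0]

avg_ref, pct_ref = summarize(29, 4, cumulative, reflen=8)
print(avg_ref, pct_ref)
# prints: 3.625 [37.5, 25.0, 25.0, 12.5, 12.5, 12.5, 12.5]
```

The second call shows that a larger reference length lowers both the average and every percentage, since the same counts are divided by a larger denominator.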
Sample Name Processing
Sample names are automatically derived from input filenames:
- Strips the file path to get the core filename
- Removes the '_basecov' suffix if present
- Uses the resulting string as the sample identifier in the output table
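The naming rule can be approximated in Python as shown below. This is a sketch, not the tool's code: it assumes that the "core filename" means the name with its directory and final extension removed, and the function name is hypothetical.

```python
import os


def sample_name(path):
    """Derive a sample name from a basecov file path."""
    base = os.path.basename(path)
    core, _extension = os.path.splitext(base)  # assumption: extension dropped
    suffix = "_basecov"
    if core.endswith(suffix):
        core = core[: -len(suffix)]
    return core


print(sample_name("/data/run1/sample1_basecov.txt"))  # prints: sample1
print(sample_name("other_style.txt"))                 # prints: other_style
```

Files that do not follow the '_basecov' convention keep their full core name, which matches the statement that other naming styles are acceptable.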
Memory Implementation
Memory use stays small and does not grow with input size:
- Processes files line-by-line using ByteFile streaming without loading entire files into memory
- Uses a fixed-size histogram array (21 long elements) regardless of file size
- Default JVM heap allocation is 200MB (-Xmx200m), which is sufficient because memory use does not depend on file size
- Processes multiple files sequentially, one at a time
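The same streaming pattern, including a per-file line cap like the lines parameter, looks roughly like this in Python. It is a sketch of the general technique rather than of ByteFile itself; the function names are hypothetical, and skipping lines that start with '#' is an assumption about the header convention.

```python
import itertools


def iter_basecov_lines(path, max_lines=None):
    """Lazily yield lines from one file, reading at most max_lines lines."""
    with open(path, "r") as handle:
        # islice stops early without reading the rest of the file;
        # max_lines=None means no limit.
        yield from itertools.islice(handle, max_lines)


def count_positions(paths, max_lines=None):
    """Count data lines per file, one file at a time."""
    counts = {}
    for path in paths:  # files are handled sequentially
        counts[path] = sum(
            1
            for line in iter_basecov_lines(path, max_lines)
            if not line.startswith("#")  # assumed header convention
        )
    return counts
```

Only the current line and the running counters are held in memory, so the footprint is the same for a small file and for one with billions of lines.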
Support
For questions and support:
- Email: bbushnell@lbl.gov
- Documentation: bbmap.org