COVID-19 Analysis Summary

Script: makeSummary.sh Source Directory: pipelines/covid/ Author: Brian Bushnell Last Updated: April 11, 2020

Pipeline script to summarize and aggregate all SARS-CoV-2 analysis results from processCorona.sh runs. This script performs multi-sample variant calling, coverage summarization, and result packaging for COVID-19 genomic surveillance projects.

Overview

The makeSummary.sh script is designed to aggregate and summarize the results from multiple COVID-19 sample analyses performed by processCorona.sh. It performs multi-sample variant calling across all processed samples, generates comprehensive coverage summaries, and packages all results for sharing or storage.

Important: This script should only be run after all individual samples have been processed using processCorona.sh. It requires the presence of specific input files generated by the individual sample processing pipeline.

Prerequisites

Required Input Files

The following files must be present in the working directory before running makeSummary.sh:

System Requirements

Configuration Parameters

The script uses two key configuration variables that can be modified at the top of the script:

MINCOV - Minimum Coverage Threshold

MINCOV=5

Sets the minimum read depth required for variant calling. Positions with coverage below this threshold will not be considered for variant detection. Default value of 5 provides a balance between sensitivity and reliability.

REF - Reference Genome

REF="NC_045512.fasta"

Specifies the viral reference genome file. NC_045512.fasta contains the complete SARS-CoV-2 genome sequence and is equivalent to bbmap/resources/Covid19_ref.fa.

Pipeline Stages

1. Multi-Sample Variant Calling

callvariants.sh *_deduped_trimclip.sam.gz ref="$REF" multisample out=allVars.vcf ow -Xmx4g \
    usebias=f strandedcov minstrandratio=0 maf=0.6 minreads="$MINCOV" mincov="$MINCOV" \
    minedistmax=30 minedist=16 flagnearby

Performs comprehensive variant calling across all samples simultaneously. This multi-sample approach provides several advantages:

Variant Calling Parameters

multisample
Enables multi-sample mode for joint variant calling across all input files
usebias=f
Disables bias correction, appropriate for viral genomes with expected coverage variations
strandedcov
Calculates strand-specific coverage for more accurate variant detection
minstrandratio=0
Allows variants supported by reads from only one strand (relaxed for viral analysis)
maf=0.6
Minor allele frequency threshold of 60% - variants must be present in majority of reads
minreads=5, mincov=5
Minimum read support and coverage depth required for variant calling
minedistmax=30, minedist=16
Quality filters based on edit distance to reduce false positives
flagnearby
Flags variants that are close to each other, which may indicate alignment issues

2. Coverage Summary Generation

summarizecoverage.sh *basecov_border5.txt out=coverageSummary.txt

Aggregates coverage statistics from all individual sample analyses. This step:

3. Result Organization and Packaging

mkdir output
cp *.sh output
cp *.bam* output
cp *.txt output
cp *.vcf output
cp *genome*.fa output

rm results.tar
tar -cf results.tar output

Organizes all analysis results into a standardized output structure and creates a compressed archive:

Basic Usage

# 1. Ensure all individual samples have been processed with processCorona.sh
# 2. Verify required files are present in the working directory
ls *_deduped_trimclip.sam.gz *basecov_border5.txt NC_045512.fasta

# 3. Run the summary pipeline
bash makeSummary.sh

# 4. Results will be in the output/ directory and packaged as results.tar

Output Files

The script generates several key output files, all organized in the output/ directory:

Primary Analysis Results

Copied Analysis Files

Multi-Sample Variant Analysis

The multi-sample variant calling approach provides several analytical advantages for COVID-19 surveillance:

Comprehensive Genotype Matrix

The allVars.vcf file contains genotypes for ALL samples at EVERY variant position found in ANY sample. This enables:

Population-Level Insights

The joint calling approach reveals:

Coverage Analysis Details

The coverage summary provides critical quality metrics for the entire sample set:

Depth Distribution Analysis

Border Region Analysis

The use of *basecov_border5.txt files indicates special handling of:

Quality Control and Validation

The pipeline includes several quality control measures:

Variant Quality Filters

Result Validation

Customization Options

Key parameters can be modified for different analysis requirements:

Coverage Thresholds

# Increase stringency for high-quality samples
MINCOV=10

# Decrease for low-coverage samples
MINCOV=3

Variant Calling Stringency

# More stringent MAF for population studies
maf=0.8

# More permissive for outbreak investigation
maf=0.3

Alternative Reference Genomes

# Use alternative SARS-CoV-2 reference
REF="alternative_covid_ref.fa"

# Use BBTools built-in reference
REF="bbmap/resources/Covid19_ref.fa"

Integration with COVID-19 Workflows

This script is designed as the final step in a comprehensive COVID-19 analysis workflow:

  1. Individual processing: Run processCorona.sh for each sample
  2. Quality assessment: Review individual sample results
  3. Summary generation: Run makeSummary.sh to aggregate results
  4. Population analysis: Use allVars.vcf for phylogenetic analysis
  5. Surveillance reporting: Generate reports from coverageSummary.txt

Troubleshooting

Common Issues

Validation Steps

# Check input file presence
ls *_deduped_trimclip.sam.gz | wc -l
ls *basecov_border5.txt | wc -l

# Verify VCF output quality
grep -v "^#" allVars.vcf | wc -l

# Check coverage summary completeness
head -20 coverageSummary.txt

Performance Considerations