COVID-19 Analysis Summary

Overview

The makeSummary.sh script is designed to aggregate and summarize the results from multiple COVID-19 sample analyses performed by processCorona.sh. It performs multi-sample variant calling across all processed samples, generates comprehensive coverage summaries, and packages all results for sharing or storage.

Important: This script should only be run after all individual samples have been processed using processCorona.sh. It requires the presence of specific input files generated by the individual sample processing pipeline.

Prerequisites

Required Input Files

The following files must be present in the working directory before running makeSummary.sh:

*_deduped_trimclip.sam.gz - Processed alignment files from each sample
*basecov_border5.txt - Base coverage files from each sample
NC_045512.fasta - SARS-CoV-2 reference genome

System Requirements

BBTools suite with callvariants.sh and summarizecoverage.sh
Minimum 4GB RAM for variant calling (-Xmx4g)
Sufficient disk space for temporary files and tar archive

Configuration Parameters

The script uses two key configuration variables that can be modified at the top of the script:

MINCOV - Minimum Coverage Threshold

MINCOV=5

Sets the minimum read depth required for variant calling. Positions with coverage below this threshold will not be considered for variant detection. Default value of 5 provides a balance between sensitivity and reliability.

REF - Reference Genome

REF="NC_045512.fasta"

Specifies the viral reference genome file. NC_045512.fasta contains the complete SARS-CoV-2 genome sequence and is equivalent to bbmap/resources/Covid19_ref.fa.

Pipeline Stages

1. Multi-Sample Variant Calling

callvariants.sh *_deduped_trimclip.sam.gz ref="$REF" multisample out=allVars.vcf ow -Xmx4g \
    usebias=f strandedcov minstrandratio=0 maf=0.6 minreads="$MINCOV" mincov="$MINCOV" \
    minedistmax=30 minedist=16 flagnearby

Performs comprehensive variant calling across all samples simultaneously. This multi-sample approach provides several advantages:

Comprehensive genotyping: Reports the genotype of ALL samples at any position where a variant is called in ANY sample
Consistent variant sites: Ensures all samples are evaluated at the same genomic positions
Population-level analysis: Enables comparison of variant frequencies across the sample set

Variant Calling Parameters

multisample: Enables multi-sample mode for joint variant calling across all input files
usebias=f: Disables bias correction, appropriate for viral genomes with expected coverage variations
strandedcov: Calculates strand-specific coverage for more accurate variant detection
minstrandratio=0: Allows variants supported by reads from only one strand (relaxed for viral analysis)
maf=0.6: Minor allele frequency threshold of 60% - variants must be present in majority of reads
minreads=5, mincov=5: Minimum read support and coverage depth required for variant calling
minedistmax=30, minedist=16: Quality filters based on edit distance to reduce false positives
flagnearby: Flags variants that are close to each other, which may indicate alignment issues

2. Coverage Summary Generation

summarizecoverage.sh *basecov_border5.txt out=coverageSummary.txt

Aggregates coverage statistics from all individual sample analyses. This step:

Compiles coverage data from all processed samples
Calculates coverage at various depth thresholds
Provides genome-wide coverage statistics
Identifies regions with inadequate coverage across samples

3. Result Organization and Packaging

mkdir output
cp *.sh output
cp *.bam* output
cp *.txt output
cp *.vcf output
cp *genome*.fa output

rm results.tar
tar -cf results.tar output

Organizes all analysis results into a standardized output structure and creates a compressed archive:

Scripts: Copies all shell scripts for reproducibility
Alignments: Includes BAM files and indexes
Analysis results: Coverage summaries, statistics, and other text outputs
Variants: VCF files with called variants
Genomes: Reference and consensus genome sequences
Archive: Creates results.tar for easy transfer

Basic Usage

# 1. Ensure all individual samples have been processed with processCorona.sh
# 2. Verify required files are present in the working directory
ls *_deduped_trimclip.sam.gz *basecov_border5.txt NC_045512.fasta

# 3. Run the summary pipeline
bash makeSummary.sh

# 4. Results will be in the output/ directory and packaged as results.tar

Output Files

The script generates several key output files, all organized in the output/ directory:

Primary Analysis Results

allVars.vcf - Multi-sample variant calls in VCF format
coverageSummary.txt - Comprehensive coverage statistics across all samples
results.tar - Complete analysis archive ready for transfer

Copied Analysis Files

*.sh - All shell scripts used in the analysis
*.bam* - BAM alignment files and their indexes
*.txt - Coverage files, statistics, and other text-based results
*.vcf - All variant calling files
*genome*.fa - Reference genomes and consensus sequences

Multi-Sample Variant Analysis

The multi-sample variant calling approach provides several analytical advantages for COVID-19 surveillance:

Comprehensive Genotype Matrix

The allVars.vcf file contains genotypes for ALL samples at EVERY variant position found in ANY sample. This enables:

Direct comparison of variant patterns between samples
Identification of sample-specific mutations
Detection of shared variant sites across the population
Quality assessment of variant calls across samples

Population-Level Insights

The joint calling approach reveals:

Variant frequency distribution: How common each variant is across samples
Mutation hotspots: Genomic regions with high variant density
Sample relationships: Phylogenetic clustering based on shared variants
Quality patterns: Systematic issues affecting multiple samples

Coverage Analysis Details

The coverage summary provides critical quality metrics for the entire sample set:

Depth Distribution Analysis

Coverage statistics at multiple depth thresholds
Genome completeness assessment for each sample
Identification of consistently low-coverage regions
Quality control metrics for sequencing uniformity

Border Region Analysis

The use of *basecov_border5.txt files indicates special handling of:

Genome termini that may have reduced coverage
Assembly boundaries and their coverage characteristics
Regions requiring special consideration in downstream analysis

Quality Control and Validation

The pipeline includes several quality control measures:

Variant Quality Filters

Coverage thresholds: Minimum 5x coverage ensures reliability
Allele frequency: 60% MAF reduces false positives
Edit distance filters: Removes variants likely due to alignment errors
Strand bias assessment: Identifies potential technical artifacts

Result Validation

Multi-sample consistency checks
Coverage uniformity assessment
Reference genome integrity verification
Output file completeness validation

Customization Options

Key parameters can be modified for different analysis requirements:

Coverage Thresholds

# Increase stringency for high-quality samples
MINCOV=10

# Decrease for low-coverage samples
MINCOV=3

Variant Calling Stringency

# More stringent MAF for population studies
maf=0.8

# More permissive for outbreak investigation
maf=0.3

Alternative Reference Genomes

# Use alternative SARS-CoV-2 reference
REF="alternative_covid_ref.fa"

# Use BBTools built-in reference
REF="bbmap/resources/Covid19_ref.fa"

Integration with COVID-19 Workflows

This script is designed as the final step in a comprehensive COVID-19 analysis workflow:

Individual processing: Run processCorona.sh for each sample
Quality assessment: Review individual sample results
Summary generation: Run makeSummary.sh to aggregate results
Population analysis: Use allVars.vcf for phylogenetic analysis
Surveillance reporting: Generate reports from coverageSummary.txt

Troubleshooting

Common Issues

Missing input files: Ensure all samples were processed successfully
Memory errors: Increase -Xmx parameter if variant calling fails
Empty VCF output: Check that MINCOV threshold is appropriate for your data
Archive creation fails: Verify sufficient disk space for results.tar

Validation Steps

# Check input file presence
ls *_deduped_trimclip.sam.gz | wc -l
ls *basecov_border5.txt | wc -l

# Verify VCF output quality
grep -v "^#" allVars.vcf | wc -l

# Check coverage summary completeness
head -20 coverageSummary.txt

Performance Considerations

Memory usage: Multi-sample variant calling requires 4GB+ RAM
Processing time: Scales with number of samples and genome coverage
Disk usage: Temporary files may require significant space
I/O performance: Benefits from fast storage for large sample sets