COVID-19 Analysis Summary
Pipeline script to summarize and aggregate all SARS-CoV-2 analysis results from processCorona.sh runs. This script performs multi-sample variant calling, coverage summarization, and result packaging for COVID-19 genomic surveillance projects.
Overview
The makeSummary.sh script is designed to aggregate and summarize the results from multiple COVID-19 sample analyses performed by processCorona.sh. It performs multi-sample variant calling across all processed samples, generates comprehensive coverage summaries, and packages all results for sharing or storage.
Prerequisites
Required Input Files
The following files must be present in the working directory before running makeSummary.sh:
- *_deduped_trimclip.sam.gz - Processed alignment files from each sample
- *basecov_border5.txt - Base coverage files from each sample
- NC_045512.fasta - SARS-CoV-2 reference genome
System Requirements
- BBTools suite with callvariants.sh and summarizecoverage.sh
- Minimum 4GB RAM for variant calling (-Xmx4g)
- Sufficient disk space for temporary files and tar archive
Configuration Parameters
The script uses two key configuration variables that can be modified at the top of the script:
MINCOV - Minimum Coverage Threshold
MINCOV=5
Sets the minimum read depth required for variant calling. Positions with coverage below this threshold will not be considered for variant detection. Default value of 5 provides a balance between sensitivity and reliability.
REF - Reference Genome
REF="NC_045512.fasta"
Specifies the viral reference genome file. NC_045512.fasta contains the complete SARS-CoV-2 genome sequence and is equivalent to bbmap/resources/Covid19_ref.fa.
Pipeline Stages
1. Multi-Sample Variant Calling
callvariants.sh *_deduped_trimclip.sam.gz ref="$REF" multisample out=allVars.vcf ow -Xmx4g \
usebias=f strandedcov minstrandratio=0 maf=0.6 minreads="$MINCOV" mincov="$MINCOV" \
minedistmax=30 minedist=16 flagnearby
Performs comprehensive variant calling across all samples simultaneously. This multi-sample approach provides several advantages:
- Comprehensive genotyping: Reports the genotype of ALL samples at any position where a variant is called in ANY sample
- Consistent variant sites: Ensures all samples are evaluated at the same genomic positions
- Population-level analysis: Enables comparison of variant frequencies across the sample set
Variant Calling Parameters
- multisample
- Enables multi-sample mode for joint variant calling across all input files
- usebias=f
- Disables bias correction, appropriate for viral genomes with expected coverage variations
- strandedcov
- Calculates strand-specific coverage for more accurate variant detection
- minstrandratio=0
- Allows variants supported by reads from only one strand (relaxed for viral analysis)
- maf=0.6
- Minor allele frequency threshold of 60% - variants must be present in majority of reads
- minreads=5, mincov=5
- Minimum read support and coverage depth required for variant calling
- minedistmax=30, minedist=16
- Quality filters based on edit distance to reduce false positives
- flagnearby
- Flags variants that are close to each other, which may indicate alignment issues
2. Coverage Summary Generation
summarizecoverage.sh *basecov_border5.txt out=coverageSummary.txt
Aggregates coverage statistics from all individual sample analyses. This step:
- Compiles coverage data from all processed samples
- Calculates coverage at various depth thresholds
- Provides genome-wide coverage statistics
- Identifies regions with inadequate coverage across samples
3. Result Organization and Packaging
mkdir output
cp *.sh output
cp *.bam* output
cp *.txt output
cp *.vcf output
cp *genome*.fa output
rm results.tar
tar -cf results.tar output
Organizes all analysis results into a standardized output structure and creates a compressed archive:
- Scripts: Copies all shell scripts for reproducibility
- Alignments: Includes BAM files and indexes
- Analysis results: Coverage summaries, statistics, and other text outputs
- Variants: VCF files with called variants
- Genomes: Reference and consensus genome sequences
- Archive: Creates results.tar for easy transfer
Basic Usage
# 1. Ensure all individual samples have been processed with processCorona.sh
# 2. Verify required files are present in the working directory
ls *_deduped_trimclip.sam.gz *basecov_border5.txt NC_045512.fasta
# 3. Run the summary pipeline
bash makeSummary.sh
# 4. Results will be in the output/ directory and packaged as results.tar
Output Files
The script generates several key output files, all organized in the output/ directory:
Primary Analysis Results
- allVars.vcf - Multi-sample variant calls in VCF format
- coverageSummary.txt - Comprehensive coverage statistics across all samples
- results.tar - Complete analysis archive ready for transfer
Copied Analysis Files
- *.sh - All shell scripts used in the analysis
- *.bam* - BAM alignment files and their indexes
- *.txt - Coverage files, statistics, and other text-based results
- *.vcf - All variant calling files
- *genome*.fa - Reference genomes and consensus sequences
Multi-Sample Variant Analysis
The multi-sample variant calling approach provides several analytical advantages for COVID-19 surveillance:
Comprehensive Genotype Matrix
The allVars.vcf file contains genotypes for ALL samples at EVERY variant position found in ANY sample. This enables:
- Direct comparison of variant patterns between samples
- Identification of sample-specific mutations
- Detection of shared variant sites across the population
- Quality assessment of variant calls across samples
Population-Level Insights
The joint calling approach reveals:
- Variant frequency distribution: How common each variant is across samples
- Mutation hotspots: Genomic regions with high variant density
- Sample relationships: Phylogenetic clustering based on shared variants
- Quality patterns: Systematic issues affecting multiple samples
Coverage Analysis Details
The coverage summary provides critical quality metrics for the entire sample set:
Depth Distribution Analysis
- Coverage statistics at multiple depth thresholds
- Genome completeness assessment for each sample
- Identification of consistently low-coverage regions
- Quality control metrics for sequencing uniformity
Border Region Analysis
The use of *basecov_border5.txt files indicates special handling of:
- Genome termini that may have reduced coverage
- Assembly boundaries and their coverage characteristics
- Regions requiring special consideration in downstream analysis
Quality Control and Validation
The pipeline includes several quality control measures:
Variant Quality Filters
- Coverage thresholds: Minimum 5x coverage ensures reliability
- Allele frequency: 60% MAF reduces false positives
- Edit distance filters: Removes variants likely due to alignment errors
- Strand bias assessment: Identifies potential technical artifacts
Result Validation
- Multi-sample consistency checks
- Coverage uniformity assessment
- Reference genome integrity verification
- Output file completeness validation
Customization Options
Key parameters can be modified for different analysis requirements:
Coverage Thresholds
# Increase stringency for high-quality samples
MINCOV=10
# Decrease for low-coverage samples
MINCOV=3
Variant Calling Stringency
# More stringent MAF for population studies
maf=0.8
# More permissive for outbreak investigation
maf=0.3
Alternative Reference Genomes
# Use alternative SARS-CoV-2 reference
REF="alternative_covid_ref.fa"
# Use BBTools built-in reference
REF="bbmap/resources/Covid19_ref.fa"
Integration with COVID-19 Workflows
This script is designed as the final step in a comprehensive COVID-19 analysis workflow:
- Individual processing: Run processCorona.sh for each sample
- Quality assessment: Review individual sample results
- Summary generation: Run makeSummary.sh to aggregate results
- Population analysis: Use allVars.vcf for phylogenetic analysis
- Surveillance reporting: Generate reports from coverageSummary.txt
Troubleshooting
Common Issues
- Missing input files: Ensure all samples were processed successfully
- Memory errors: Increase -Xmx parameter if variant calling fails
- Empty VCF output: Check that MINCOV threshold is appropriate for your data
- Archive creation fails: Verify sufficient disk space for results.tar
Validation Steps
# Check input file presence
ls *_deduped_trimclip.sam.gz | wc -l
ls *basecov_border5.txt | wc -l
# Verify VCF output quality
grep -v "^#" allVars.vcf | wc -l
# Check coverage summary completeness
head -20 coverageSummary.txt
Performance Considerations
- Memory usage: Multi-sample variant calling requires 4GB+ RAM
- Processing time: Scales with number of samples and genome coverage
- Disk usage: Temporary files may require significant space
- I/O performance: Benefits from fast storage for large sample sets