processCoronaWrapper
Outer wrapper script for batch processing multiple COVID-19 samples through the comprehensive SARS-CoV-2 variant calling pipeline. This template script coordinates quality score calibration, individual sample processing via processCorona.sh, and final result summarization. Designed for high-throughput analysis of Illumina paired-end amplicon data from SARS-CoV-2 sequencing projects.
Purpose
The processCoronaWrapper.sh script serves as the master coordination script for large-scale COVID-19 sample processing. It orchestrates three key phases:
- Quality Score Calibration: Generates recalibration matrices using a representative sample to improve base quality accuracy across all samples
- Individual Sample Processing: Calls processCorona.sh for each sample to perform the complete variant calling pipeline
- Result Summarization: Aggregates results across all samples for comparative analysis and final reporting
This wrapper is specifically designed for SARS-CoV-2 amplicon sequencing data and follows best practices for viral variant calling established during the COVID-19 pandemic.
Template Structure
This script is provided as a template that must be modified before use according to your specific file naming conventions and sample list. The script includes safety checks to prevent accidental execution without customization.
Template Warning
echo "This template must be modifed before use according to your file names."
exit
These lines must be removed or commented out after customization.
Workflow Components
1. File Cleanup (Optional)
rm *.sam.gz *.bam *.bai *.txt *_genome.fa *_adapters.fa *.vcf *.vcf.gz
Removes previous pipeline output files if rerunning the analysis. This ensures clean processing without file conflicts.
2. Quality Score Calibration
sh ./recal.sh Sample1
Generates quality score recalibration matrices using a single representative sample (typically the largest). This step:
- Discovers adapter sequences from read overlap patterns
- Performs high-sensitivity mapping to the SARS-CoV-2 reference
- Identifies true variants to distinguish from sequencing errors
- Creates recalibration matrices in the ./ref directory
Note: This only needs to be run once per sequencing run, as all samples from the same run share quality score characteristics.
3. Individual Sample Processing
sh processCorona.sh Sample1
sh processCorona.sh Sample2
## etc.
Processes each sample through the complete variant calling pipeline. For each sample, processCorona.sh performs:
- Viral read separation and enrichment
- Quality score recalibration using matrices from step 2
- Adapter discovery and trimming
- Duplicate removal and quality filtering
- Primer trimming (ARTIC3 protocol)
- Reference alignment with local mode for amplicon data
- Mapping-based deduplication
- Variant calling with amplicon-specific parameters
- Consensus genome generation
4. Result Summarization
sh makeSummary.sh 1>makeSummary.o 2>&1
Aggregates results from all processed samples to generate:
- Multi-sample variant calling results
- Coverage summaries across all libraries
- Consolidated output directory
- Compressed results archive (results.tar)
Customization Requirements
Before using this wrapper, you must modify it according to your data:
Sample List Configuration
Replace the template sample processing lines with your actual sample names:
# Replace these template lines:
sh processCorona.sh Sample1
sh processCorona.sh Sample2
# With your actual sample names:
sh processCorona.sh COVID_001
sh processCorona.sh COVID_002
sh processCorona.sh COVID_003
# ... etc.
Alternative: Loop-Based Processing
For many samples, consider using a loop instead of individual lines:
# Example for samples named COVID_001.fq.gz through COVID_100.fq.gz
for i in {1..100}; do
sample=$(printf "COVID_%03d" $i)
sh processCorona.sh $sample
done
File Naming Assumptions
The wrapper assumes files are named according to this pattern:
SampleName.fq.gz
- Interleaved paired-end files- If your data is in twin files (
Sample_R1.fq.gz
andSample_R2.fq.gz
), uncomment the reformat line in processCorona.sh
Pipeline Dependencies
This wrapper coordinates three additional scripts that must be present in the same directory:
recal.sh
Creates quality score recalibration matrices. Requires:
- SARS-CoV-2 reference genome (NC_045512.fasta)
- Sufficient disk space for temporary files
- At least 6GB RAM for high-sensitivity mapping
processCorona.sh
Core variant calling pipeline for individual samples. Requires:
- SARS-CoV-2 reference genome (NC_045512.fasta)
- ARTIC3 primer sequences (artic3.fasta)
- Recalibration matrices in ./ref directory
- Up to 31GB RAM for deduplication step
makeSummary.sh
Summarizes results across all processed samples. Requires:
- Output files from all processCorona.sh runs
- Sufficient disk space for consolidated output
Required Reference Files
SARS-CoV-2 Reference Genome
- NC_045512.fasta
- The SARS-CoV-2 reference genome sequence. Equivalent to bbmap/resources/Covid19_ref.fa. Must be present in the working directory.
PCR Primers
- artic3.fasta
- ARTIC version 3 primer sequences for amplicon trimming. Equivalent to bbmap/resources/artic3.fasta. Required for proper primer removal.
Memory and Performance Requirements
System Requirements
- RAM: Minimum 32GB recommended for large samples (clumpify deduplication)
- CPU: Multi-core recommended for parallel processing
- Storage: 10-50GB per sample for intermediate files
- I/O: Fast storage recommended for temporary files
Processing Time Estimates
- Calibration: 30-60 minutes per calibration sample
- Per Sample: 15-45 minutes depending on read count and depth
- Summarization: 5-15 minutes for typical batch sizes
Usage Examples
Basic Batch Processing
# 1. Prepare reference files
cp path/to/NC_045512.fasta .
cp path/to/artic3.fasta .
# 2. Customize the wrapper script
nano processCoronaWrapper.sh
# Remove the template warning lines
# Add your sample names
# 3. Run the complete pipeline
bash processCoronaWrapper.sh
# 4. Results will be in output/ directory and results.tar
Large-Scale Processing with Loop
# Modified wrapper for 100+ samples
#!/bin/bash
# Quality score calibration (run once)
sh ./recal.sh Sample001
# Process all samples
for sample in Sample{001..150}; do
if [ -f "${sample}.fq.gz" ]; then
echo "Processing $sample..."
sh processCorona.sh $sample
fi
done
# Summarize results
sh makeSummary.sh 1>makeSummary.o 2>&1
Output Files
The wrapper generates comprehensive output for each processed sample plus summary files:
Per-Sample Outputs
SampleName_genome.fa
- Consensus genome sequenceSampleName_vars.vcf
- Called variants in VCF formatSampleName_basecov_border5.txt
- Per-base coverage dataSampleName_sorted.bam/.bai
- Sorted alignments (if enabled)
Summary Outputs
allVars.vcf
- Multi-sample variant callscoverageSummary.txt
- Coverage statistics across all samplesoutput/
- Directory containing all key resultsresults.tar
- Compressed archive of all results
Algorithm Details
This wrapper implements a comprehensive SARS-CoV-2 variant calling strategy optimized for amplicon sequencing data:
Quality Score Recalibration Strategy
The pipeline begins with empirical quality score recalibration to correct for systematic biases in Illumina quality scores:
- High-sensitivity mapping identifies true variants versus sequencing errors
- Recalibration matrices model position-specific and sequence-context-specific quality biases
- Shared matrices across samples from the same run ensure consistency
Amplicon-Specific Processing
The pipeline includes several adaptations for PCR-amplified data:
- Local Alignment: Accommodates primer-induced anomalies at read ends
- Primer Trimming: ARTIC3-specific primer removal with position restrictions
- Strand Bias Handling: Disables strand bias filters (usebias=f) due to amplicon artifacts
- Duplicate Strategy: Two-stage deduplication (sequence-based then mapping-based)
Variant Calling Parameters
Variant calling uses parameters optimized for viral amplicon data:
- Minimum Coverage: 5x depth threshold for reliable calls
- Minimum Allele Frequency: 60% (maf=0.6) for consensus calling
- Edge Distance Filtering: Excludes variants near read ends (minedist=16, minedistmax=30)
- Nearby Variant Flagging: Identifies potential sequencing artifacts
Troubleshooting
Common Issues
- Script exits with template warning
- Remove or comment out the "echo" and "exit" lines after customizing sample names.
- Out of memory errors
- Reduce the -Xmx values in processCorona.sh, particularly for clumpify.sh (currently 31g).
- Missing reference files
- Ensure NC_045512.fasta and artic3.fasta are present in the working directory.
- Adapter discovery failures
- For libraries with few short inserts, comment out adapter discovery and use default Illumina adapters.
- Low coverage warnings
- Adjust MINCOV parameter in the scripts if your sequencing depth is consistently different from 5x.
Related Tools
- processCorona.sh: Individual sample processing pipeline
- recal.sh: Quality score recalibration matrix generation
- makeSummary.sh: Multi-sample result summarization
- callvariants.sh: Core variant calling functionality
- bbduk.sh: Trimming and filtering operations
- bbmap.sh: Reference alignment
- clumpify.sh: Duplicate removal
Citation
If you use this COVID-19 processing pipeline in your research, please cite:
Bushnell, B. (2020). BBTools: A suite of fast, multithreaded bioinformatics tools for sequence analysis. Joint Genome Institute.