processCoronaWrapper

Script: processCoronaWrapper.sh Source Directory: pipelines/covid/ Author: Brian Bushnell

Outer wrapper script for batch processing multiple COVID-19 samples through the comprehensive SARS-CoV-2 variant calling pipeline. This template script coordinates quality score calibration, individual sample processing via processCorona.sh, and final result summarization. Designed for high-throughput analysis of Illumina paired-end amplicon data from SARS-CoV-2 sequencing projects.

Purpose

The processCoronaWrapper.sh script serves as the master coordination script for large-scale COVID-19 sample processing. It orchestrates three key phases:

Quality Score Calibration: Generates recalibration matrices using a representative sample to improve base quality accuracy across all samples
Individual Sample Processing: Calls processCorona.sh for each sample to perform the complete variant calling pipeline
Result Summarization: Aggregates results across all samples for comparative analysis and final reporting

This wrapper is specifically designed for SARS-CoV-2 amplicon sequencing data and follows best practices for viral variant calling established during the COVID-19 pandemic.

Template Structure

This script is provided as a template that must be modified before use according to your specific file naming conventions and sample list. The script includes safety checks to prevent accidental execution without customization.

Template Warning

echo "This template must be modifed before use according to your file names."
exit

These lines must be removed or commented out after customization.

Workflow Components

1. File Cleanup (Optional)

rm *.sam.gz *.bam *.bai *.txt *_genome.fa *_adapters.fa *.vcf *.vcf.gz

Removes previous pipeline output files if rerunning the analysis. This ensures clean processing without file conflicts.

2. Quality Score Calibration

sh ./recal.sh Sample1

Generates quality score recalibration matrices using a single representative sample (typically the largest). This step:

Discovers adapter sequences from read overlap patterns
Performs high-sensitivity mapping to the SARS-CoV-2 reference
Identifies true variants to distinguish from sequencing errors
Creates recalibration matrices in the ./ref directory

Note: This only needs to be run once per sequencing run, as all samples from the same run share quality score characteristics.

3. Individual Sample Processing

sh processCorona.sh Sample1
sh processCorona.sh Sample2
##  etc.

Processes each sample through the complete variant calling pipeline. For each sample, processCorona.sh performs:

Viral read separation and enrichment
Quality score recalibration using matrices from step 2
Adapter discovery and trimming
Duplicate removal and quality filtering
Primer trimming (ARTIC3 protocol)
Reference alignment with local mode for amplicon data
Mapping-based deduplication
Variant calling with amplicon-specific parameters
Consensus genome generation

4. Result Summarization

sh makeSummary.sh 1>makeSummary.o 2>&1

Aggregates results from all processed samples to generate:

Multi-sample variant calling results
Coverage summaries across all libraries
Consolidated output directory
Compressed results archive (results.tar)

Customization Requirements

Before using this wrapper, you must modify it according to your data:

Sample List Configuration

Replace the template sample processing lines with your actual sample names:

# Replace these template lines:
sh processCorona.sh Sample1
sh processCorona.sh Sample2

# With your actual sample names:
sh processCorona.sh COVID_001
sh processCorona.sh COVID_002
sh processCorona.sh COVID_003
# ... etc.

Alternative: Loop-Based Processing

For many samples, consider using a loop instead of individual lines:

# Example for samples named COVID_001.fq.gz through COVID_100.fq.gz
for i in {1..100}; do
    sample=$(printf "COVID_%03d" $i)
    sh processCorona.sh $sample
done

File Naming Assumptions

The wrapper assumes files are named according to this pattern:

SampleName.fq.gz - Interleaved paired-end files
If your data is in twin files (Sample_R1.fq.gz and Sample_R2.fq.gz), uncomment the reformat line in processCorona.sh

Pipeline Dependencies

This wrapper coordinates three additional scripts that must be present in the same directory:

recal.sh

Creates quality score recalibration matrices. Requires:

SARS-CoV-2 reference genome (NC_045512.fasta)
Sufficient disk space for temporary files
At least 6GB RAM for high-sensitivity mapping

processCorona.sh

Core variant calling pipeline for individual samples. Requires:

SARS-CoV-2 reference genome (NC_045512.fasta)
ARTIC3 primer sequences (artic3.fasta)
Recalibration matrices in ./ref directory
Up to 31GB RAM for deduplication step

makeSummary.sh

Summarizes results across all processed samples. Requires:

Output files from all processCorona.sh runs
Sufficient disk space for consolidated output

Required Reference Files

SARS-CoV-2 Reference Genome

NC_045512.fasta: The SARS-CoV-2 reference genome sequence. Equivalent to bbmap/resources/Covid19_ref.fa. Must be present in the working directory.

PCR Primers

artic3.fasta: ARTIC version 3 primer sequences for amplicon trimming. Equivalent to bbmap/resources/artic3.fasta. Required for proper primer removal.

Memory and Performance Requirements

System Requirements

RAM: Minimum 32GB recommended for large samples (clumpify deduplication)
CPU: Multi-core recommended for parallel processing
Storage: 10-50GB per sample for intermediate files
I/O: Fast storage recommended for temporary files

Processing Time Estimates

Calibration: 30-60 minutes per calibration sample
Per Sample: 15-45 minutes depending on read count and depth
Summarization: 5-15 minutes for typical batch sizes

Usage Examples

Basic Batch Processing

# 1. Prepare reference files
cp path/to/NC_045512.fasta .
cp path/to/artic3.fasta .

# 2. Customize the wrapper script
nano processCoronaWrapper.sh
# Remove the template warning lines
# Add your sample names

# 3. Run the complete pipeline
bash processCoronaWrapper.sh

# 4. Results will be in output/ directory and results.tar

Large-Scale Processing with Loop

# Modified wrapper for 100+ samples
#!/bin/bash

# Quality score calibration (run once)
sh ./recal.sh Sample001

# Process all samples
for sample in Sample{001..150}; do
    if [ -f "${sample}.fq.gz" ]; then
        echo "Processing $sample..."
        sh processCorona.sh $sample
    fi
done

# Summarize results
sh makeSummary.sh 1>makeSummary.o 2>&1

Output Files

The wrapper generates comprehensive output for each processed sample plus summary files:

Per-Sample Outputs

SampleName_genome.fa - Consensus genome sequence
SampleName_vars.vcf - Called variants in VCF format
SampleName_basecov_border5.txt - Per-base coverage data
SampleName_sorted.bam/.bai - Sorted alignments (if enabled)

Summary Outputs

allVars.vcf - Multi-sample variant calls
coverageSummary.txt - Coverage statistics across all samples
output/ - Directory containing all key results
results.tar - Compressed archive of all results

Algorithm Details

This wrapper implements a comprehensive SARS-CoV-2 variant calling strategy optimized for amplicon sequencing data:

Quality Score Recalibration Strategy

The pipeline begins with empirical quality score recalibration to correct for systematic biases in Illumina quality scores:

High-sensitivity mapping identifies true variants versus sequencing errors
Recalibration matrices model position-specific and sequence-context-specific quality biases
Shared matrices across samples from the same run ensure consistency

Amplicon-Specific Processing

The pipeline includes several adaptations for PCR-amplified data:

Local Alignment: Accommodates primer-induced anomalies at read ends
Primer Trimming: ARTIC3-specific primer removal with position restrictions
Strand Bias Handling: Disables strand bias filters (usebias=f) due to amplicon artifacts
Duplicate Strategy: Two-stage deduplication (sequence-based then mapping-based)

Variant Calling Parameters

Variant calling uses parameters optimized for viral amplicon data:

Minimum Coverage: 5x depth threshold for reliable calls
Minimum Allele Frequency: 60% (maf=0.6) for consensus calling
Edge Distance Filtering: Excludes variants near read ends (minedist=16, minedistmax=30)
Nearby Variant Flagging: Identifies potential sequencing artifacts

Troubleshooting

Common Issues

Script exits with template warning: Remove or comment out the "echo" and "exit" lines after customizing sample names.
Out of memory errors: Reduce the -Xmx values in processCorona.sh, particularly for clumpify.sh (currently 31g).
Missing reference files: Ensure NC_045512.fasta and artic3.fasta are present in the working directory.
Adapter discovery failures: For libraries with few short inserts, comment out adapter discovery and use default Illumina adapters.
Low coverage warnings: Adjust MINCOV parameter in the scripts if your sequencing depth is consistently different from 5x.

Related Tools

processCorona.sh: Individual sample processing pipeline
recal.sh: Quality score recalibration matrix generation
makeSummary.sh: Multi-sample result summarization
callvariants.sh: Core variant calling functionality
bbduk.sh: Trimming and filtering operations
bbmap.sh: Reference alignment
clumpify.sh: Duplicate removal

Citation

If you use this COVID-19 processing pipeline in your research, please cite:

Bushnell, B. (2020). BBTools: A suite of fast, multithreaded bioinformatics tools for sequence analysis. Joint Genome Institute.