processCoronaWrapper

Script: processCoronaWrapper.sh Source Directory: pipelines/covid/ Author: Brian Bushnell

Outer wrapper script for batch processing multiple COVID-19 samples through the comprehensive SARS-CoV-2 variant calling pipeline. This template script coordinates quality score calibration, individual sample processing via processCorona.sh, and final result summarization. Designed for high-throughput analysis of Illumina paired-end amplicon data from SARS-CoV-2 sequencing projects.

Purpose

The processCoronaWrapper.sh script serves as the master coordination script for large-scale COVID-19 sample processing. It orchestrates three key phases:

This wrapper is specifically designed for SARS-CoV-2 amplicon sequencing data and follows best practices for viral variant calling established during the COVID-19 pandemic.

Template Structure

This script is provided as a template that must be modified before use according to your specific file naming conventions and sample list. The script includes safety checks to prevent accidental execution without customization.

Template Warning

echo "This template must be modifed before use according to your file names."
exit

These lines must be removed or commented out after customization.

Workflow Components

1. File Cleanup (Optional)

rm *.sam.gz *.bam *.bai *.txt *_genome.fa *_adapters.fa *.vcf *.vcf.gz

Removes previous pipeline output files if rerunning the analysis. This ensures clean processing without file conflicts.

2. Quality Score Calibration

sh ./recal.sh Sample1

Generates quality score recalibration matrices using a single representative sample (typically the largest). This step:

  • Discovers adapter sequences from read overlap patterns
  • Performs high-sensitivity mapping to the SARS-CoV-2 reference
  • Identifies true variants to distinguish from sequencing errors
  • Creates recalibration matrices in the ./ref directory

Note: This only needs to be run once per sequencing run, as all samples from the same run share quality score characteristics.

3. Individual Sample Processing

sh processCorona.sh Sample1
sh processCorona.sh Sample2
##  etc.

Processes each sample through the complete variant calling pipeline. For each sample, processCorona.sh performs:

  • Viral read separation and enrichment
  • Quality score recalibration using matrices from step 2
  • Adapter discovery and trimming
  • Duplicate removal and quality filtering
  • Primer trimming (ARTIC3 protocol)
  • Reference alignment with local mode for amplicon data
  • Mapping-based deduplication
  • Variant calling with amplicon-specific parameters
  • Consensus genome generation

4. Result Summarization

sh makeSummary.sh 1>makeSummary.o 2>&1

Aggregates results from all processed samples to generate:

  • Multi-sample variant calling results
  • Coverage summaries across all libraries
  • Consolidated output directory
  • Compressed results archive (results.tar)

Customization Requirements

Before using this wrapper, you must modify it according to your data:

Sample List Configuration

Replace the template sample processing lines with your actual sample names:

# Replace these template lines:
sh processCorona.sh Sample1
sh processCorona.sh Sample2

# With your actual sample names:
sh processCorona.sh COVID_001
sh processCorona.sh COVID_002
sh processCorona.sh COVID_003
# ... etc.

Alternative: Loop-Based Processing

For many samples, consider using a loop instead of individual lines:

# Example for samples named COVID_001.fq.gz through COVID_100.fq.gz
for i in {1..100}; do
    sample=$(printf "COVID_%03d" $i)
    sh processCorona.sh $sample
done

File Naming Assumptions

The wrapper assumes files are named according to this pattern:

  • SampleName.fq.gz - Interleaved paired-end files
  • If your data is in twin files (Sample_R1.fq.gz and Sample_R2.fq.gz), uncomment the reformat line in processCorona.sh

Pipeline Dependencies

This wrapper coordinates three additional scripts that must be present in the same directory:

recal.sh

Creates quality score recalibration matrices. Requires:

  • SARS-CoV-2 reference genome (NC_045512.fasta)
  • Sufficient disk space for temporary files
  • At least 6GB RAM for high-sensitivity mapping

processCorona.sh

Core variant calling pipeline for individual samples. Requires:

  • SARS-CoV-2 reference genome (NC_045512.fasta)
  • ARTIC3 primer sequences (artic3.fasta)
  • Recalibration matrices in ./ref directory
  • Up to 31GB RAM for deduplication step

makeSummary.sh

Summarizes results across all processed samples. Requires:

  • Output files from all processCorona.sh runs
  • Sufficient disk space for consolidated output

Required Reference Files

SARS-CoV-2 Reference Genome

NC_045512.fasta
The SARS-CoV-2 reference genome sequence. Equivalent to bbmap/resources/Covid19_ref.fa. Must be present in the working directory.

PCR Primers

artic3.fasta
ARTIC version 3 primer sequences for amplicon trimming. Equivalent to bbmap/resources/artic3.fasta. Required for proper primer removal.

Memory and Performance Requirements

System Requirements

  • RAM: Minimum 32GB recommended for large samples (clumpify deduplication)
  • CPU: Multi-core recommended for parallel processing
  • Storage: 10-50GB per sample for intermediate files
  • I/O: Fast storage recommended for temporary files

Processing Time Estimates

  • Calibration: 30-60 minutes per calibration sample
  • Per Sample: 15-45 minutes depending on read count and depth
  • Summarization: 5-15 minutes for typical batch sizes

Usage Examples

Basic Batch Processing

# 1. Prepare reference files
cp path/to/NC_045512.fasta .
cp path/to/artic3.fasta .

# 2. Customize the wrapper script
nano processCoronaWrapper.sh
# Remove the template warning lines
# Add your sample names

# 3. Run the complete pipeline
bash processCoronaWrapper.sh

# 4. Results will be in output/ directory and results.tar

Large-Scale Processing with Loop

# Modified wrapper for 100+ samples
#!/bin/bash

# Quality score calibration (run once)
sh ./recal.sh Sample001

# Process all samples
for sample in Sample{001..150}; do
    if [ -f "${sample}.fq.gz" ]; then
        echo "Processing $sample..."
        sh processCorona.sh $sample
    fi
done

# Summarize results
sh makeSummary.sh 1>makeSummary.o 2>&1

Output Files

The wrapper generates comprehensive output for each processed sample plus summary files:

Per-Sample Outputs

  • SampleName_genome.fa - Consensus genome sequence
  • SampleName_vars.vcf - Called variants in VCF format
  • SampleName_basecov_border5.txt - Per-base coverage data
  • SampleName_sorted.bam/.bai - Sorted alignments (if enabled)

Summary Outputs

  • allVars.vcf - Multi-sample variant calls
  • coverageSummary.txt - Coverage statistics across all samples
  • output/ - Directory containing all key results
  • results.tar - Compressed archive of all results

Algorithm Details

This wrapper implements a comprehensive SARS-CoV-2 variant calling strategy optimized for amplicon sequencing data:

Quality Score Recalibration Strategy

The pipeline begins with empirical quality score recalibration to correct for systematic biases in Illumina quality scores:

  • High-sensitivity mapping identifies true variants versus sequencing errors
  • Recalibration matrices model position-specific and sequence-context-specific quality biases
  • Shared matrices across samples from the same run ensure consistency

Amplicon-Specific Processing

The pipeline includes several adaptations for PCR-amplified data:

  • Local Alignment: Accommodates primer-induced anomalies at read ends
  • Primer Trimming: ARTIC3-specific primer removal with position restrictions
  • Strand Bias Handling: Disables strand bias filters (usebias=f) due to amplicon artifacts
  • Duplicate Strategy: Two-stage deduplication (sequence-based then mapping-based)

Variant Calling Parameters

Variant calling uses parameters optimized for viral amplicon data:

  • Minimum Coverage: 5x depth threshold for reliable calls
  • Minimum Allele Frequency: 60% (maf=0.6) for consensus calling
  • Edge Distance Filtering: Excludes variants near read ends (minedist=16, minedistmax=30)
  • Nearby Variant Flagging: Identifies potential sequencing artifacts

Troubleshooting

Common Issues

Script exits with template warning
Remove or comment out the "echo" and "exit" lines after customizing sample names.
Out of memory errors
Reduce the -Xmx values in processCorona.sh, particularly for clumpify.sh (currently 31g).
Missing reference files
Ensure NC_045512.fasta and artic3.fasta are present in the working directory.
Adapter discovery failures
For libraries with few short inserts, comment out adapter discovery and use default Illumina adapters.
Low coverage warnings
Adjust MINCOV parameter in the scripts if your sequencing depth is consistently different from 5x.

Related Tools

Citation

If you use this COVID-19 processing pipeline in your research, please cite:

Bushnell, B. (2020). BBTools: A suite of fast, multithreaded bioinformatics tools for sequence analysis. Joint Genome Institute.