Process Corona Pipeline
Comprehensive SARS-CoV-2 variant calling pipeline specifically designed for Illumina amplicon data. The pipeline performs viral read separation, quality filtering, adapter and primer trimming, alignment, deduplication, variant calling, and consensus genome generation with automated quality control measures.
Overview
This pipeline is specifically designed for calling SARS-CoV-2 variants from Illumina amplicon sequencing data. It assumes paired-end input data and implements a comprehensive workflow that includes viral read separation, quality control, primer trimming, alignment optimization, and consensus genome generation. The pipeline is optimized for the ARTIC v3 primer scheme but can be adapted for other amplicon protocols.
Prerequisites
System Requirements
- BBTools suite installed
- Sufficient memory (minimum 4GB, recommended 32GB for large datasets)
- samtools (optional, for BAM file generation)
Input Requirements
- Paired-end Illumina FASTQ files (can be interleaved or separate R1/R2 files)
- SARS-CoV-2 reference genome (NC_045512.fasta or equivalent)
- ARTIC v3 primer sequences (artic3.fasta)
- Quality recalibration matrix (optional but recommended)
Configuration Variables
- MINCOV=5: Minimum coverage depth for genotype calls (adjustable)
- REF="NC_045512.fasta": SARS-CoV-2 reference genome file
- PRIMERS="artic3.fasta": PCR primer sequences file
Pipeline Stages
1. Data Preparation
1.1 File Cleanup
rm "$NAME"*.sam.gz "$NAME"*.bam "$NAME"*.bai "$NAME"*.txt "$NAME"*.fa "$NAME"*.vcf
Removes any existing output files from previous runs to prevent conflicts and ensure clean execution.
1.2 Read Interleaving (Optional)
# If using separate R1/R2 files:
# reformat.sh in="$NAME"_R#.fq.gz out="$NAME".fq.gz
Interleaves paired-end reads if they are in separate files. Skip this step if reads are already interleaved.
2. Viral Read Separation
2.1 COVID-19 vs Non-COVID Read Classification
bbduk.sh ow -Xmx1g in="$NAME".fq.gz ref="$REF" outm="$NAME"_viral.fq.gz outu="$NAME"_nonviral.fq.gz k=25
Separates reads into viral and non-viral categories using k-mer matching (k=25) against the SARS-CoV-2 reference. This step significantly reduces computational requirements for downstream processing and improves specificity.
3. Quality Control and Preprocessing
3.1 Quality Score Recalibration
bbduk.sh in="$NAME"_viral.fq.gz out="$NAME"_recal.fq.gz recalibrate -Xmx1g ow
Requires recalibration matrix in working directory. Recalibrates quality scores to improve accuracy, particularly beneficial for Illumina binned quality scores. This step is optional but recommended for accurate variant calling.
3.2 Adapter Detection
bbmerge.sh in="$NAME"_recal.fq.gz outa="$NAME"_adapters.fa ow reads=1m
Automatically discovers adapter sequences for this specific library based on read overlap patterns. Uses 1 million reads for detection. If insufficient short-insert pairs are found, the step will fail and default Illumina adapters should be used instead.
3.3 Duplicate Removal
clumpify.sh in="$NAME"_recal.fq.gz out="$NAME"_clumped.fq.gz zl=9 dedupe s=2 passes=4 -Xmx31g
Removes duplicates by sequence similarity (more memory-efficient than mapping-based deduplication). Uses 4 passes with similarity threshold of 2, producing highly compressed output (zl=9).
4. Adapter and Quality Trimming
4.1 Adapter and Quality Trimming
bbduk.sh in="$NAME"_clumped.fq.gz out="$NAME"_trimmed.fq.gz minlen=60 ktrim=r k=21 mink=9 hdist=2 hdist2=1 ref="$NAME"_adapters.fa altref=adapters maq=14 qtrim=r trimq=10 maxns=0 tbo tpe ow -Xmx1g ftm=5
Comprehensive trimming including:
- Adapter trimming: k=21, mink=9, allowing 2 mismatches (hdist=2)
- Quality trimming: Right end trimming at Q10 (qtrim=r trimq=10)
- Length filtering: Minimum length 60bp after trimming
- Quality filtering: Minimum average quality 14 (maq=14)
- N filtering: No Ns allowed (maxns=0)
- Paired-end handling: tbo (trim by overlap), tpe (trim paired ends)
- Force trim: ftm=5 removes first 5 bases if needed
4.2 Primer Trimming
bbduk.sh in="$NAME"_trimmed.fq.gz out="$NAME"_trimmed2.fq.gz ref="$PRIMERS" ktrim=l restrictleft=30 k=22 hdist=3 qhdist=1 rcomp=f mm=f
Trims ARTIC v3 primers from the left end of reads:
- Left trimming: ktrim=l removes primers from 5' end
- Search window: restrictleft=30 only checks first 30bp
- K-mer matching: k=22 with 3 mismatches allowed (hdist=3)
- Quality-aware: qhdist=1 for quality-weighted distance
- No reverse complement: rcomp=f (primers are oriented)
- No masking: mm=f removes rather than masks matches
Note: Disable this step if not using primer amplification.
5. Alignment and Mapping
5.1 Reference Alignment
bbmap.sh ref="$REF" in="$NAME"_trimmed2.fq.gz outm="$NAME"_mapped.sam.gz nodisk local maxindel=500 -Xmx4g ow k=12
Aligns reads to SARS-CoV-2 reference genome with amplicon-specific settings:
- Local alignment: Handles primer-induced end anomalies (omit for unamplified data)
- Large indel support: maxindel=500 for structural variants
- Sensitive mapping: k=12 for high sensitivity
- Memory efficient: nodisk avoids index caching
6. Post-Alignment Processing
6.1 Coordinate-Based Deduplication
dedupebymapping.sh in="$NAME"_mapped.sam.gz out="$NAME"_deduped.sam.gz -Xmx31g ow
Removes duplicates based on mapping coordinates and strand. Warning: Single-ended amplicon data will lose most reads in this step due to identical start positions.
6.2 Chimeric Read Filtering - Phase 1
filtersam.sh ref="$REF" ow in="$NAME"_deduped.sam.gz out="$NAME"_filtered.sam.gz mbad=1 del sub=f mbv=0 -Xmx4g
Removes reads with unsupported unique deletions (often chimeric artifacts):
- mbad=1: One or more bad variations triggers removal
- del: Focus on deletions
- sub=f: Ignore substitutions in this pass
- mbv=0: No minimum bad variation threshold
6.3 Chimeric Read Filtering - Phase 2
filtersam.sh ref="$REF" ow in="$NAME"_filtered.sam.gz out="$NAME"_filtered2.sam.gz mbad=1 sub mbv=2 -Xmx4g
Removes reads with multiple unsupported substitutions (reduces NovaSeq noise):
- mbad=1: One or more bad variations triggers removal
- sub: Focus on substitutions
- mbv=2: Requires 2+ bad variations for removal
6.4 Soft-Clip Trimming
bbduk.sh in="$NAME"_filtered2.sam.gz trimclip out="$NAME"_trimclip.sam.gz -Xmx1g ow
Removes soft-clipped bases from alignments to improve variant calling accuracy by eliminating poorly aligned end sequences.
7. Variant Calling and Consensus Generation
7.1 Variant Detection
callvariants.sh in="$NAME"_trimclip.sam.gz ref="$REF" out="$NAME"_vars.vcf -Xmx4g ow strandedcov usebias=f minstrandratio=0 maf=0.6 minreads="$MINCOV" mincov="$MINCOV" minedistmax=30 minedist=16 flagnearby
Calls variants with amplicon-specific parameters:
- strandedcov: Reports strand-specific coverage
- usebias=f, minstrandratio=0: Disables strand bias filtering (necessary for amplicon data)
- maf=0.6: Minor allele frequency threshold of 60%
- minreads/mincov=MINCOV: Minimum coverage for calling (default: 5)
- minedistmax=30, minedist=16: Edge distance filtering
- flagnearby: Flags variants near each other
Note: Remove usebias=f and minstrandratio=0 for shotgun/metagenomic data.
7.2 Coverage Analysis
pileup.sh in="$NAME"_trimclip.sam.gz basecov="$NAME"_basecov_border5.txt -Xmx4g ow border=5
Calculates per-base coverage ignoring the outermost 5bp of reads (matching CallVariants defaults). This provides accurate coverage estimates for consensus generation.
7.3 Consensus Genome Generation
applyvariants.sh in="$REF" out="$NAME"_genome.fa vcf="$NAME"_vars.vcf basecov="$NAME"_basecov_border5.txt ow mindepth="$MINCOV"
Generates the consensus genome by applying detected variants to the reference:
- Applies all variants from the VCF file
- Sets positions below MINCOV coverage to 'N'
- Does not apply indels below MINCOV coverage
- Produces reference-guided assembly of the sample strain
8. Optional Visualization Preparation
8.1 BAM File Generation (Optional)
# Requires samtools installation
# samtools view -bShu "$NAME"_trimclip.sam.gz | samtools sort -m 2G -@ 3 - -o "$NAME"_sorted.bam
# samtools index "$NAME"_sorted.bam
Creates sorted and indexed BAM files for visualization in IGV or other genome browsers. These files, along with the reference and VCF, provide complete visualization capability.
Basic Usage
# Basic usage with sample prefix
processCorona.sh Sample1
# This assumes input files named:
# Sample1.fq.gz (interleaved) OR Sample1_R1.fq.gz and Sample1_R2.fq.gz (separate)
# Required files in working directory:
# - NC_045512.fasta (SARS-CoV-2 reference)
# - artic3.fasta (ARTIC v3 primers)
# - recalibration matrix (optional but recommended)
Configuration Parameters
Key parameters that can be modified in the script:
- MINCOV: Minimum coverage for consensus calling (default: 5)
- REF: Reference genome file (default: NC_045512.fasta)
- PRIMERS: Primer file (default: artic3.fasta)
Memory Requirements
- clumpify.sh: 31GB for deduplication
- dedupebymapping.sh: 31GB for coordinate-based deduplication
- Other tools: 1-4GB each
Output Files
The pipeline generates the following key output files:
- [NAME]_genome.fa: Consensus genome sequence with variants applied
- [NAME]_vars.vcf: Detected variants in VCF format
- [NAME]_basecov_border5.txt: Per-base coverage information
- [NAME]_viral.fq.gz: Reads classified as viral
- [NAME]_nonviral.fq.gz: Reads classified as non-viral
- [NAME]_trimclip.sam.gz: Final processed alignments
- [NAME]_adapters.fa: Detected adapter sequences
- [NAME]_sorted.bam/.bai: Sorted BAM and index (if samtools steps enabled)
Quality Control Considerations
Amplicon-Specific Optimizations
- Strand bias handling: Disabled due to primer-induced bias
- Local alignment: Handles primer artifacts at read ends
- Primer trimming: Removes ARTIC v3 primers from 5' ends
- Edge distance filtering: Accounts for amplicon boundaries
Coverage and Variant Calling
- Minimum coverage: Configurable threshold for reliable calls
- High MAF threshold: 60% to call variants in viral population
- N-masking: Low-coverage regions marked as ambiguous
- Quality filtering: Multiple stages ensure high-quality input
Troubleshooting
Common Issues
- Adapter detection fails: Use default adapters by changing ref parameter to "adapters"
- Low viral read count: Check input data quality and viral content
- Memory issues: Reduce memory allocation parameters if system has limited RAM
- No recalibration matrix: Skip recalibration step or generate matrix using recal.sh
Adapting for Different Data Types
- Shotgun data: Remove local alignment flag, enable strand bias filtering
- Different primers: Replace artic3.fasta with appropriate primer file
- Single-end data: Skip or modify deduplication steps
- Different viral targets: Replace reference genome and adjust parameters
Pipeline Validation
For visualization and validation in IGV:
- Load the reference genome: NC_045512.fasta
- Load the sorted BAM file: [NAME]_sorted.bam
- Load the variants file: [NAME]_vars.vcf
- Examine coverage patterns and variant calls
- Validate consensus sequence: [NAME]_genome.fa