Process Corona Pipeline

Script: processCorona.sh Source Directory: pipelines/covid/ Author: Brian Bushnell Last Modified: April 11, 2020

Comprehensive SARS-CoV-2 variant calling pipeline specifically designed for Illumina amplicon data. The pipeline performs viral read separation, quality filtering, adapter and primer trimming, alignment, deduplication, variant calling, and consensus genome generation with automated quality control measures.

Overview

This pipeline is specifically designed for calling SARS-CoV-2 variants from Illumina amplicon sequencing data. It assumes paired-end input data and implements a comprehensive workflow that includes viral read separation, quality control, primer trimming, alignment optimization, and consensus genome generation. The pipeline is optimized for the ARTIC v3 primer scheme but can be adapted for other amplicon protocols.

Note: This pipeline is specifically designed for amplicon data and includes special handling for primer-induced strand bias and amplicon-specific artifacts. For shotgun/metagenomic data, several parameters would need adjustment.

Prerequisites

System Requirements

Input Requirements

Configuration Variables

Pipeline Stages

1. Data Preparation

1.1 File Cleanup

rm "$NAME"*.sam.gz "$NAME"*.bam "$NAME"*.bai "$NAME"*.txt "$NAME"*.fa "$NAME"*.vcf

Removes any existing output files from previous runs to prevent conflicts and ensure clean execution.

1.2 Read Interleaving (Optional)

# If using separate R1/R2 files:
# reformat.sh in="$NAME"_R#.fq.gz out="$NAME".fq.gz

Interleaves paired-end reads if they are in separate files. Skip this step if reads are already interleaved.

2. Viral Read Separation

2.1 COVID-19 vs Non-COVID Read Classification

bbduk.sh ow -Xmx1g in="$NAME".fq.gz ref="$REF" outm="$NAME"_viral.fq.gz outu="$NAME"_nonviral.fq.gz k=25

Separates reads into viral and non-viral categories using k-mer matching (k=25) against the SARS-CoV-2 reference. This step significantly reduces computational requirements for downstream processing and improves specificity.

3. Quality Control and Preprocessing

3.1 Quality Score Recalibration

bbduk.sh in="$NAME"_viral.fq.gz out="$NAME"_recal.fq.gz recalibrate -Xmx1g ow

Requires recalibration matrix in working directory. Recalibrates quality scores to improve accuracy, particularly beneficial for Illumina binned quality scores. This step is optional but recommended for accurate variant calling.

3.2 Adapter Detection

bbmerge.sh in="$NAME"_recal.fq.gz outa="$NAME"_adapters.fa ow reads=1m

Automatically discovers adapter sequences for this specific library based on read overlap patterns. Uses 1 million reads for detection. If insufficient short-insert pairs are found, the step will fail and default Illumina adapters should be used instead.

3.3 Duplicate Removal

clumpify.sh in="$NAME"_recal.fq.gz out="$NAME"_clumped.fq.gz zl=9 dedupe s=2 passes=4 -Xmx31g

Removes duplicates by sequence similarity (more memory-efficient than mapping-based deduplication). Uses 4 passes with similarity threshold of 2, producing highly compressed output (zl=9).

4. Adapter and Quality Trimming

4.1 Adapter and Quality Trimming

bbduk.sh in="$NAME"_clumped.fq.gz out="$NAME"_trimmed.fq.gz minlen=60 ktrim=r k=21 mink=9 hdist=2 hdist2=1 ref="$NAME"_adapters.fa altref=adapters maq=14 qtrim=r trimq=10 maxns=0 tbo tpe ow -Xmx1g ftm=5

Comprehensive trimming including:

4.2 Primer Trimming

bbduk.sh in="$NAME"_trimmed.fq.gz out="$NAME"_trimmed2.fq.gz ref="$PRIMERS" ktrim=l restrictleft=30 k=22 hdist=3 qhdist=1 rcomp=f mm=f

Trims ARTIC v3 primers from the left end of reads:

Note: Disable this step if not using primer amplification.

5. Alignment and Mapping

5.1 Reference Alignment

bbmap.sh ref="$REF" in="$NAME"_trimmed2.fq.gz outm="$NAME"_mapped.sam.gz nodisk local maxindel=500 -Xmx4g ow k=12

Aligns reads to SARS-CoV-2 reference genome with amplicon-specific settings:

6. Post-Alignment Processing

6.1 Coordinate-Based Deduplication

dedupebymapping.sh in="$NAME"_mapped.sam.gz out="$NAME"_deduped.sam.gz -Xmx31g ow

Removes duplicates based on mapping coordinates and strand. Warning: Single-ended amplicon data will lose most reads in this step due to identical start positions.

6.2 Chimeric Read Filtering - Phase 1

filtersam.sh ref="$REF" ow in="$NAME"_deduped.sam.gz out="$NAME"_filtered.sam.gz mbad=1 del sub=f mbv=0 -Xmx4g

Removes reads with unsupported unique deletions (often chimeric artifacts):

6.3 Chimeric Read Filtering - Phase 2

filtersam.sh ref="$REF" ow in="$NAME"_filtered.sam.gz out="$NAME"_filtered2.sam.gz mbad=1 sub mbv=2 -Xmx4g

Removes reads with multiple unsupported substitutions (reduces NovaSeq noise):

6.4 Soft-Clip Trimming

bbduk.sh in="$NAME"_filtered2.sam.gz trimclip out="$NAME"_trimclip.sam.gz -Xmx1g ow

Removes soft-clipped bases from alignments to improve variant calling accuracy by eliminating poorly aligned end sequences.

7. Variant Calling and Consensus Generation

7.1 Variant Detection

callvariants.sh in="$NAME"_trimclip.sam.gz ref="$REF" out="$NAME"_vars.vcf -Xmx4g ow strandedcov usebias=f minstrandratio=0 maf=0.6 minreads="$MINCOV" mincov="$MINCOV" minedistmax=30 minedist=16 flagnearby

Calls variants with amplicon-specific parameters:

Note: Remove usebias=f and minstrandratio=0 for shotgun/metagenomic data.

7.2 Coverage Analysis

pileup.sh in="$NAME"_trimclip.sam.gz basecov="$NAME"_basecov_border5.txt -Xmx4g ow border=5

Calculates per-base coverage ignoring the outermost 5bp of reads (matching CallVariants defaults). This provides accurate coverage estimates for consensus generation.

7.3 Consensus Genome Generation

applyvariants.sh in="$REF" out="$NAME"_genome.fa vcf="$NAME"_vars.vcf basecov="$NAME"_basecov_border5.txt ow mindepth="$MINCOV"

Generates the consensus genome by applying detected variants to the reference:

8. Optional Visualization Preparation

8.1 BAM File Generation (Optional)

# Requires samtools installation
# samtools view -bShu "$NAME"_trimclip.sam.gz | samtools sort -m 2G -@ 3 - -o "$NAME"_sorted.bam
# samtools index "$NAME"_sorted.bam

Creates sorted and indexed BAM files for visualization in IGV or other genome browsers. These files, along with the reference and VCF, provide complete visualization capability.

Basic Usage

# Basic usage with sample prefix
processCorona.sh Sample1

# This assumes input files named:
# Sample1.fq.gz (interleaved) OR Sample1_R1.fq.gz and Sample1_R2.fq.gz (separate)

# Required files in working directory:
# - NC_045512.fasta (SARS-CoV-2 reference)
# - artic3.fasta (ARTIC v3 primers)
# - recalibration matrix (optional but recommended)

Configuration Parameters

Key parameters that can be modified in the script:

Memory Requirements

Output Files

The pipeline generates the following key output files:

Quality Control Considerations

Amplicon-Specific Optimizations

Coverage and Variant Calling

Troubleshooting

Common Issues

Adapting for Different Data Types

Pipeline Validation

For visualization and validation in IGV:

  1. Load the reference genome: NC_045512.fasta
  2. Load the sorted BAM file: [NAME]_sorted.bam
  3. Load the variants file: [NAME]_vars.vcf
  4. Examine coverage patterns and variant calls
  5. Validate consensus sequence: [NAME]_genome.fa