COVID-19 Recalibration Pipeline

Script: recal.sh Source Directory: pipelines/covid/ Author: Brian Bushnell Last Updated: May 21, 2020

Creates quality-score recalibration matrices for processing Illumina paired-end reads in COVID-19 genomic analysis. This pipeline must be run once on a representative library from a sequencing run to generate recalibration matrices that can be used to improve base quality scores for all libraries processed in the same directory.

Overview

The COVID-19 recalibration pipeline addresses systematic biases in Illumina quality scores by creating empirical recalibration matrices based on real variant data. Quality score recalibration is essential for accurate variant calling, as sequencing instruments often over- or under-estimate quality scores in systematic ways.

This pipeline maps reads to the SARS-CoV-2 reference genome, identifies true variants, and generates recalibration matrices that can be applied to improve quality score accuracy for downstream analysis. The recalibration process is particularly important for viral genomics where high accuracy is crucial for detecting low-frequency variants and mutations.

Important: This script needs to be run ONCE on a single library (preferably a large one) from a sequencing run. The generated recalibration matrices can then be used for all libraries processed in the same directory.

Prerequisites

System Requirements

Input Requirements

Data Format Compatibility

Usage

recal.sh <prefix>

Where <prefix> is the sample name prefix. For example, if your data is in Sample1.fq.gz, run:

recal.sh Sample1

Pipeline Stages

1. Configuration

The pipeline sets key parameters that may need adjustment for your data:

2. Adapter Discovery

bbmerge.sh in="$NAME".fq.gz outa="$NAME"_adapters.fa ow reads=1m

Discovers adapter sequences specific to this library based on read overlap patterns. This step analyzes 1 million reads to identify adapter sequences that will be used for trimming. This step should be skipped for single-ended reads.

3. Adapter Trimming and Length Filtering

bbduk.sh -Xmx1g in="$NAME".fq.gz out=recal.fq.gz minlen="$READLEN" ktrim=r k=21 mink=9 hdist=2 hdist2=1 ref="$NAME"_adapters.fa altref=adapters ow tbo tpe

Performs aggressive adapter trimming to ensure uniform read lengths. Key parameters:

Alternative Command for Single-End Reads

bbduk.sh -Xmx1g in="$NAME".fq.gz out=trimmed.fq.gz minlen="$READLEN" ktrim=r k=21 mink=9 hdist=2 hdist2=1 ref=adapters ow

For single-end data, use the standard adapter reference and omit paired-read specific flags.

4. High-Sensitivity Mapping

bbmap.sh ref="$REF" in=recal.fq.gz outm=mapped.sam.gz vslow -Xmx6g ow

Maps reads to the SARS-CoV-2 reference genome with maximum sensitivity settings. The vslow flag enables the most sensitive alignment parameters, crucial for accurate variant detection:

5. Variant Discovery

callvariants.sh in=mapped.sam.gz ref="$REF" out=recal.vcf -Xmx6g ow

Identifies true genetic variants in the mapped data. This initial variant calling uses standard parameters to establish a baseline set of variants that will inform the recalibration process.

6. Recalibration Matrix Generation

callvariants.sh in=mapped.sam.gz ref="$REF" out=recal.vcf -Xmx6g ow usebias=f strandedcov minstrandratio=0

Generates the actual recalibration matrices using specialized parameters:

This step creates recalibration matrices stored in the ./ref directory that model the relationship between reported quality scores and empirical error rates based on sequence context, position in read, and other covariates.

Algorithm Details

Quality Score Recalibration Theory

Quality score recalibration addresses systematic biases in Illumina quality scores through empirical analysis. Sequencing instruments assign quality scores based on internal models, but these often contain systematic errors due to:

Recalibration Process

The recalibration pipeline works by:

  1. High-quality mapping: Ensures accurate alignment to distinguish true variants from sequencing errors
  2. Variant identification: Establishes ground truth for distinguishing real variants from errors
  3. Error rate calculation: Compares reported quality scores to empirical error rates at non-variant positions
  4. Covariate modeling: Builds models incorporating sequence context, position, and other factors
  5. Matrix generation: Creates lookup tables for quality score correction

SARS-CoV-2 Specific Considerations

This pipeline is optimized for COVID-19 analysis with several key features:

Output Files and Usage

Generated Files

Using Recalibration Matrices

After generating recalibration matrices, use them with BBDuk for quality score improvement:

# Apply recalibration to new data (must be run from directory containing ./ref)
bbduk.sh in=sample.fq.gz out=recalibrated.fq.gz recal

The recal flag applies the generated recalibration matrices to improve quality scores in any FASTQ file, whether mapped or unmapped. This results in more accurate quality scores that better reflect the actual probability of sequencing errors.

Quality Improvements

Recalibrated data typically shows:

Best Practices

Library Selection

Parameter Adjustments

Quality Control

Troubleshooting

Common Issues

File Format Issues

Performance Optimization

Integration with Other Tools

Pipeline Integration

The recalibration pipeline integrates with other BBTools components:

Workflow Recommendations

  1. Run recalibration pipeline on representative library
  2. Apply recalibration to all libraries using BBDuk with 'recal' flag
  3. Perform downstream analysis (mapping, variant calling) on recalibrated data
  4. Validate improvements through comparison with non-recalibrated results

Performance Characteristics

Memory Requirements

Processing Time

Disk Space