COVID-19 Recalibration Pipeline

Overview

The COVID-19 recalibration pipeline addresses systematic biases in Illumina quality scores by creating empirical recalibration matrices based on real variant data. Quality score recalibration is essential for accurate variant calling, as sequencing instruments often over- or under-estimate quality scores in systematic ways.

This pipeline maps reads to the SARS-CoV-2 reference genome, identifies true variants, and generates recalibration matrices that can be applied to improve quality score accuracy for downstream analysis. The recalibration process is particularly important for viral genomics where high accuracy is crucial for detecting low-frequency variants and mutations.

Important: This script needs to be run ONCE on a single library (preferably a large one) from a sequencing run. The generated recalibration matrices can then be used for all libraries processed in the same directory.

Prerequisites

System Requirements

BBTools suite installed with BBMerge, BBDuk, BBMap, and CallVariants
At least 6GB of available memory for mapping and variant calling
SARS-CoV-2 reference genome (NC_045512.fasta)

Input Requirements

High-quality Illumina paired-end reads (preferably from a large library)
Reads can be single-ended, paired and interleaved, or paired in twin files
Input file named as [PREFIX].fq.gz where PREFIX is provided as command line argument
SARS-CoV-2 reference genome file (NC_045512.fasta) in the working directory

Data Format Compatibility

Interleaved paired-end: Direct processing (recommended format)
Twin file format: First interleave with reformat.sh in1=r1.fq.gz in2=r2.fq.gz out=both.fq.gz
Single-end reads: Skip adapter discovery step and use modified BBDuk command

Usage

recal.sh <prefix>

Where <prefix> is the sample name prefix. For example, if your data is in Sample1.fq.gz, run:

recal.sh Sample1

Pipeline Stages

1. Configuration

The pipeline sets key parameters that may need adjustment for your data:

READLEN=150: Expected read length for uniform trimming
REF="NC_045512.fasta": SARS-CoV-2 reference genome (equivalent to bbmap/resources/Covid19_ref.fa)

2. Adapter Discovery

bbmerge.sh in="$NAME".fq.gz outa="$NAME"_adapters.fa ow reads=1m

Discovers adapter sequences specific to this library based on read overlap patterns. This step analyzes 1 million reads to identify adapter sequences that will be used for trimming. This step should be skipped for single-ended reads.

3. Adapter Trimming and Length Filtering

bbduk.sh -Xmx1g in="$NAME".fq.gz out=recal.fq.gz minlen="$READLEN" ktrim=r k=21 mink=9 hdist=2 hdist2=1 ref="$NAME"_adapters.fa altref=adapters ow tbo tpe

Performs aggressive adapter trimming to ensure uniform read lengths. Key parameters:

ktrim=r: Trim adapters from the right end
k=21, mink=9: K-mer sizes for adapter detection (21bp primary, 9bp minimum)
hdist=2, hdist2=1: Hamming distance tolerance (2 for k=21, 1 for shorter k-mers)
minlen=150: Discard reads shorter than the specified length
tbo, tpe: Trim both reads to same length, trim paired reads to same length

Alternative Command for Single-End Reads

bbduk.sh -Xmx1g in="$NAME".fq.gz out=trimmed.fq.gz minlen="$READLEN" ktrim=r k=21 mink=9 hdist=2 hdist2=1 ref=adapters ow

For single-end data, use the standard adapter reference and omit paired-read specific flags.

4. High-Sensitivity Mapping

bbmap.sh ref="$REF" in=recal.fq.gz outm=mapped.sam.gz vslow -Xmx6g ow

Maps reads to the SARS-CoV-2 reference genome with maximum sensitivity settings. The vslow flag enables the most sensitive alignment parameters, crucial for accurate variant detection:

Slower but more accurate alignment
Better detection of variants and indels
Essential for generating high-quality recalibration data
Requires 6GB memory allocation

5. Variant Discovery

callvariants.sh in=mapped.sam.gz ref="$REF" out=recal.vcf -Xmx6g ow

Identifies true genetic variants in the mapped data. This initial variant calling uses standard parameters to establish a baseline set of variants that will inform the recalibration process.

6. Recalibration Matrix Generation

callvariants.sh in=mapped.sam.gz ref="$REF" out=recal.vcf -Xmx6g ow usebias=f strandedcov minstrandratio=0

Generates the actual recalibration matrices using specialized parameters:

usebias=f: Disable bias filtering to capture all systematic errors
strandedcov: Calculate strand-specific coverage for bias detection
minstrandratio=0: Include all positions regardless of strand ratio

This step creates recalibration matrices stored in the ./ref directory that model the relationship between reported quality scores and empirical error rates based on sequence context, position in read, and other covariates.

Algorithm Details

Quality Score Recalibration Theory

Quality score recalibration addresses systematic biases in Illumina quality scores through empirical analysis. Sequencing instruments assign quality scores based on internal models, but these often contain systematic errors due to:

Sequence context effects: Certain dinucleotides or sequence motifs are harder to sequence accurately
Position-in-read effects: Quality typically degrades toward the 3' end of reads
Machine learning artifacts: Instrument software may have systematic biases in quality assignment
Flow cell effects: Physical properties of the sequencing reaction can vary systematically

Recalibration Process

The recalibration pipeline works by:

High-quality mapping: Ensures accurate alignment to distinguish true variants from sequencing errors
Variant identification: Establishes ground truth for distinguishing real variants from errors
Error rate calculation: Compares reported quality scores to empirical error rates at non-variant positions
Covariate modeling: Builds models incorporating sequence context, position, and other factors
Matrix generation: Creates lookup tables for quality score correction

SARS-CoV-2 Specific Considerations

This pipeline is optimized for COVID-19 analysis with several key features:

Viral genome reference: Uses SARS-CoV-2 reference (NC_045512) for accurate mapping
High sensitivity mapping: Critical for detecting low-frequency variants important in viral evolution
Uniform read length: Ensures consistent quality patterns across all reads
Strand-specific analysis: Accounts for potential strand biases in viral RNA sequencing

Output Files and Usage

Generated Files

[PREFIX]_adapters.fa: Library-specific adapter sequences identified by BBMerge
recal.fq.gz: Adapter-trimmed, length-filtered reads used for mapping
mapped.sam.gz: High-quality alignments to SARS-CoV-2 reference
recal.vcf: Identified variants used for recalibration modeling
./ref/ directory: Contains recalibration matrices and reference data

Using Recalibration Matrices

After generating recalibration matrices, use them with BBDuk for quality score improvement:

# Apply recalibration to new data (must be run from directory containing ./ref)
bbduk.sh in=sample.fq.gz out=recalibrated.fq.gz recal

The recal flag applies the generated recalibration matrices to improve quality scores in any FASTQ file, whether mapped or unmapped. This results in more accurate quality scores that better reflect the actual probability of sequencing errors.

Quality Improvements

Recalibrated data typically shows:

Better calibrated quality scores: Reported quality more accurately reflects error probability
Improved variant calling: More accurate quality scores lead to better variant detection
Reduced false positives: Better quality scores reduce spurious variant calls
Enhanced downstream analysis: All quality-aware tools benefit from improved scores

Best Practices

Library Selection

Use large libraries: More data provides better recalibration statistics
Representative samples: Choose libraries typical of your sequencing run
High quality data: Use your best library for recalibration matrix generation
Single run basis: Generate new matrices for each sequencing run or major protocol change

Parameter Adjustments

Read length (READLEN): Set to your actual read length (150bp default)
Reference genome: Ensure NC_045512.fasta is current SARS-CoV-2 reference
Memory allocation: Increase -Xmx values for very large datasets
Single-end data: Use alternative BBDuk command and skip adapter discovery

Quality Control

Check mapping rates: Should be high for COVID-19 samples (>90%)
Verify variant counts: Should identify reasonable number of variants
Validate recalibration: Compare before/after quality score distributions
Test on subset: Validate recalibration improves downstream analysis

Troubleshooting

Common Issues

Low mapping rates: Verify reference genome and sample quality
Memory errors: Increase memory allocation or reduce dataset size
No adapters found: May indicate high-quality data or single-end reads
Few variants called: Could indicate low coverage or homogeneous sample

File Format Issues

Twin files: Must interleave before running pipeline
Single-end data: Skip adapter discovery and use modified commands
Different read lengths: Adjust READLEN parameter accordingly
Compressed files: Pipeline expects .gz compressed input

Performance Optimization

Large datasets: Consider subsampling for recalibration matrix generation
Memory constraints: Reduce dataset size or increase available memory
Speed vs accuracy: The 'vslow' mapping is essential for quality recalibration

Integration with Other Tools

Pipeline Integration

The recalibration pipeline integrates with other BBTools components:

BBDuk: Uses 'recal' flag to apply recalibration matrices
CallVariants: Benefits from improved quality scores for variant calling
BBMap: Can use recalibrated data for more accurate mapping statistics
Other BBTools: Any quality-aware tool benefits from recalibrated data

Workflow Recommendations

Run recalibration pipeline on representative library
Apply recalibration to all libraries using BBDuk with 'recal' flag
Perform downstream analysis (mapping, variant calling) on recalibrated data
Validate improvements through comparison with non-recalibrated results

Performance Characteristics

Memory Requirements

BBMerge: Standard memory usage (typically <2GB)
BBDuk: 1GB allocated for adapter trimming
BBMap: 6GB allocated for high-sensitivity mapping
CallVariants: 6GB allocated for variant calling and matrix generation

Processing Time

Adapter discovery: Fast, processes 1M reads subset
Trimming: Linear with input size, very fast
Mapping: Slowest step due to 'vslow' sensitivity mode
Variant calling: Moderate, depends on coverage and genome complexity

Disk Space

Intermediate files: Several times input size for temporary files
Final matrices: Small, typically <1MB in ./ref directory
SAM files: Can be large; consider compression for storage
Cleanup: Remove intermediate files after successful completion