COVID-19 Recalibration Pipeline
Creates quality-score recalibration matrices for processing Illumina paired-end reads in COVID-19 genomic analysis. This pipeline must be run once on a representative library from a sequencing run to generate recalibration matrices that can be used to improve base quality scores for all libraries processed in the same directory.
Overview
The COVID-19 recalibration pipeline addresses systematic biases in Illumina quality scores by creating empirical recalibration matrices based on real variant data. Quality score recalibration is essential for accurate variant calling, as sequencing instruments often over- or under-estimate quality scores in systematic ways.
This pipeline maps reads to the SARS-CoV-2 reference genome, identifies true variants, and generates recalibration matrices that can be applied to improve quality score accuracy for downstream analysis. The recalibration process is particularly important for viral genomics where high accuracy is crucial for detecting low-frequency variants and mutations.
Prerequisites
System Requirements
- BBTools suite installed with BBMerge, BBDuk, BBMap, and CallVariants
- At least 6GB of available memory for mapping and variant calling
- SARS-CoV-2 reference genome (NC_045512.fasta)
Input Requirements
- High-quality Illumina paired-end reads (preferably from a large library)
- Reads can be single-ended, paired and interleaved, or paired in twin files
- Input file named as [PREFIX].fq.gz where PREFIX is provided as command line argument
- SARS-CoV-2 reference genome file (NC_045512.fasta) in the working directory
Data Format Compatibility
- Interleaved paired-end: Direct processing (recommended format)
- Twin file format: First interleave with
reformat.sh in1=r1.fq.gz in2=r2.fq.gz out=both.fq.gz
- Single-end reads: Skip adapter discovery step and use modified BBDuk command
Usage
recal.sh <prefix>
Where <prefix>
is the sample name prefix. For example, if your data is in Sample1.fq.gz
, run:
recal.sh Sample1
Pipeline Stages
1. Configuration
The pipeline sets key parameters that may need adjustment for your data:
- READLEN=150: Expected read length for uniform trimming
- REF="NC_045512.fasta": SARS-CoV-2 reference genome (equivalent to bbmap/resources/Covid19_ref.fa)
2. Adapter Discovery
bbmerge.sh in="$NAME".fq.gz outa="$NAME"_adapters.fa ow reads=1m
Discovers adapter sequences specific to this library based on read overlap patterns. This step analyzes 1 million reads to identify adapter sequences that will be used for trimming. This step should be skipped for single-ended reads.
3. Adapter Trimming and Length Filtering
bbduk.sh -Xmx1g in="$NAME".fq.gz out=recal.fq.gz minlen="$READLEN" ktrim=r k=21 mink=9 hdist=2 hdist2=1 ref="$NAME"_adapters.fa altref=adapters ow tbo tpe
Performs aggressive adapter trimming to ensure uniform read lengths. Key parameters:
- ktrim=r: Trim adapters from the right end
- k=21, mink=9: K-mer sizes for adapter detection (21bp primary, 9bp minimum)
- hdist=2, hdist2=1: Hamming distance tolerance (2 for k=21, 1 for shorter k-mers)
- minlen=150: Discard reads shorter than the specified length
- tbo, tpe: Trim both reads to same length, trim paired reads to same length
Alternative Command for Single-End Reads
bbduk.sh -Xmx1g in="$NAME".fq.gz out=trimmed.fq.gz minlen="$READLEN" ktrim=r k=21 mink=9 hdist=2 hdist2=1 ref=adapters ow
For single-end data, use the standard adapter reference and omit paired-read specific flags.
4. High-Sensitivity Mapping
bbmap.sh ref="$REF" in=recal.fq.gz outm=mapped.sam.gz vslow -Xmx6g ow
Maps reads to the SARS-CoV-2 reference genome with maximum sensitivity settings. The vslow
flag enables the most sensitive alignment parameters, crucial for accurate variant detection:
- Slower but more accurate alignment
- Better detection of variants and indels
- Essential for generating high-quality recalibration data
- Requires 6GB memory allocation
5. Variant Discovery
callvariants.sh in=mapped.sam.gz ref="$REF" out=recal.vcf -Xmx6g ow
Identifies true genetic variants in the mapped data. This initial variant calling uses standard parameters to establish a baseline set of variants that will inform the recalibration process.
6. Recalibration Matrix Generation
callvariants.sh in=mapped.sam.gz ref="$REF" out=recal.vcf -Xmx6g ow usebias=f strandedcov minstrandratio=0
Generates the actual recalibration matrices using specialized parameters:
- usebias=f: Disable bias filtering to capture all systematic errors
- strandedcov: Calculate strand-specific coverage for bias detection
- minstrandratio=0: Include all positions regardless of strand ratio
This step creates recalibration matrices stored in the ./ref
directory that model the relationship between reported quality scores and empirical error rates based on sequence context, position in read, and other covariates.
Algorithm Details
Quality Score Recalibration Theory
Quality score recalibration addresses systematic biases in Illumina quality scores through empirical analysis. Sequencing instruments assign quality scores based on internal models, but these often contain systematic errors due to:
- Sequence context effects: Certain dinucleotides or sequence motifs are harder to sequence accurately
- Position-in-read effects: Quality typically degrades toward the 3' end of reads
- Machine learning artifacts: Instrument software may have systematic biases in quality assignment
- Flow cell effects: Physical properties of the sequencing reaction can vary systematically
Recalibration Process
The recalibration pipeline works by:
- High-quality mapping: Ensures accurate alignment to distinguish true variants from sequencing errors
- Variant identification: Establishes ground truth for distinguishing real variants from errors
- Error rate calculation: Compares reported quality scores to empirical error rates at non-variant positions
- Covariate modeling: Builds models incorporating sequence context, position, and other factors
- Matrix generation: Creates lookup tables for quality score correction
SARS-CoV-2 Specific Considerations
This pipeline is optimized for COVID-19 analysis with several key features:
- Viral genome reference: Uses SARS-CoV-2 reference (NC_045512) for accurate mapping
- High sensitivity mapping: Critical for detecting low-frequency variants important in viral evolution
- Uniform read length: Ensures consistent quality patterns across all reads
- Strand-specific analysis: Accounts for potential strand biases in viral RNA sequencing
Output Files and Usage
Generated Files
- [PREFIX]_adapters.fa: Library-specific adapter sequences identified by BBMerge
- recal.fq.gz: Adapter-trimmed, length-filtered reads used for mapping
- mapped.sam.gz: High-quality alignments to SARS-CoV-2 reference
- recal.vcf: Identified variants used for recalibration modeling
- ./ref/ directory: Contains recalibration matrices and reference data
Using Recalibration Matrices
After generating recalibration matrices, use them with BBDuk for quality score improvement:
# Apply recalibration to new data (must be run from directory containing ./ref)
bbduk.sh in=sample.fq.gz out=recalibrated.fq.gz recal
The recal
flag applies the generated recalibration matrices to improve quality scores in any FASTQ file, whether mapped or unmapped. This results in more accurate quality scores that better reflect the actual probability of sequencing errors.
Quality Improvements
Recalibrated data typically shows:
- Better calibrated quality scores: Reported quality more accurately reflects error probability
- Improved variant calling: More accurate quality scores lead to better variant detection
- Reduced false positives: Better quality scores reduce spurious variant calls
- Enhanced downstream analysis: All quality-aware tools benefit from improved scores
Best Practices
Library Selection
- Use large libraries: More data provides better recalibration statistics
- Representative samples: Choose libraries typical of your sequencing run
- High quality data: Use your best library for recalibration matrix generation
- Single run basis: Generate new matrices for each sequencing run or major protocol change
Parameter Adjustments
- Read length (READLEN): Set to your actual read length (150bp default)
- Reference genome: Ensure NC_045512.fasta is current SARS-CoV-2 reference
- Memory allocation: Increase -Xmx values for very large datasets
- Single-end data: Use alternative BBDuk command and skip adapter discovery
Quality Control
- Check mapping rates: Should be high for COVID-19 samples (>90%)
- Verify variant counts: Should identify reasonable number of variants
- Validate recalibration: Compare before/after quality score distributions
- Test on subset: Validate recalibration improves downstream analysis
Troubleshooting
Common Issues
- Low mapping rates: Verify reference genome and sample quality
- Memory errors: Increase memory allocation or reduce dataset size
- No adapters found: May indicate high-quality data or single-end reads
- Few variants called: Could indicate low coverage or homogeneous sample
File Format Issues
- Twin files: Must interleave before running pipeline
- Single-end data: Skip adapter discovery and use modified commands
- Different read lengths: Adjust READLEN parameter accordingly
- Compressed files: Pipeline expects .gz compressed input
Performance Optimization
- Large datasets: Consider subsampling for recalibration matrix generation
- Memory constraints: Reduce dataset size or increase available memory
- Speed vs accuracy: The 'vslow' mapping is essential for quality recalibration
Integration with Other Tools
Pipeline Integration
The recalibration pipeline integrates with other BBTools components:
- BBDuk: Uses 'recal' flag to apply recalibration matrices
- CallVariants: Benefits from improved quality scores for variant calling
- BBMap: Can use recalibrated data for more accurate mapping statistics
- Other BBTools: Any quality-aware tool benefits from recalibrated data
Workflow Recommendations
- Run recalibration pipeline on representative library
- Apply recalibration to all libraries using BBDuk with 'recal' flag
- Perform downstream analysis (mapping, variant calling) on recalibrated data
- Validate improvements through comparison with non-recalibrated results
Performance Characteristics
Memory Requirements
- BBMerge: Standard memory usage (typically <2GB)
- BBDuk: 1GB allocated for adapter trimming
- BBMap: 6GB allocated for high-sensitivity mapping
- CallVariants: 6GB allocated for variant calling and matrix generation
Processing Time
- Adapter discovery: Fast, processes 1M reads subset
- Trimming: Linear with input size, very fast
- Mapping: Slowest step due to 'vslow' sensitivity mode
- Variant calling: Moderate, depends on coverage and genome complexity
Disk Space
- Intermediate files: Several times input size for temporary files
- Final matrices: Small, typically <1MB in ./ref directory
- SAM files: Can be large; consider compression for storage
- Cleanup: Remove intermediate files after successful completion