variantPipeline.sh
Comprehensive variant calling pipeline that preprocesses Illumina sequencing data through deduplication, quality filtering, adapter trimming, contamination removal, mapping, and variant calling. Includes optional advanced modules for error correction, quality recalibration, and NovaSeq-specific error filtering.
Purpose
This pipeline provides a complete workflow for high-quality variant calling from Illumina sequencing data. It includes essential preprocessing steps, reference mapping, primary variant calling, and optional advanced processing modules for error correction and quality improvement. The pipeline is optimized for 2x150bp HiSeq 2500 data but can be adapted for other platforms.
Prerequisites
- Input Data: Illumina FASTQ files (single-end, paired-end in separate files, or interleaved)
- Reference Genome: High-quality reference sequence
- Read Headers: Standard Illumina headers required for some steps (deduplication, tile filtering)
- System Resources: Variable memory requirements depending on data size
Pipeline Structure
The pipeline is organized into three main sections:
- Core Pipeline: Essential preprocessing and variant calling
- Optional Enhancement: Error correction and recalibration
- NovaSeq Module: Platform-specific error filtering
Core Pipeline Stages
Stage 1: Duplicate Removal
# Remove optical and PCR duplicates
clumpify.sh in=reads.fq.gz out=clumped.fq.gz dedupe optical
Key Points:
- optical - Removes optical duplicates (requires standard Illumina headers)
- dedupe - Also removes PCR duplicates
- Note: Skip the optical flag for renamed reads (e.g., SRA data)
- RNA-seq: Deduplication is generally not recommended for quantification experiments
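Whether the optical flag is safe to use can be checked programmatically from the first read header. A minimal sketch; the `is_illumina_header` helper and its seven-field colon pattern are assumptions about standard header layout, not part of the pipeline:

```shell
# Hypothetical pre-check: does the header look like a standard Illumina
# header (instrument:run:flowcell:lane:tile:x:y)?
is_illumina_header() {
  case "$1" in
    @*:*:*:*:*:*:*) return 0 ;;   # at least 7 colon-separated fields
    *)              return 1 ;;
  esac
}

header='@M00123:45:000000000-ABCDE:1:1101:15589:1342 1:N:0:1'
# In practice: header=$(zcat reads.fq.gz | head -n 1)
if is_illumina_header "$header"; then
  clumpflags="dedupe optical"
else
  clumpflags="dedupe"             # renamed reads (e.g., SRA): no optical dedupe
fi
echo "$clumpflags"
# clumpify.sh in=reads.fq.gz out=clumped.fq.gz $clumpflags
```

The same check tells you whether filterbytile.sh (Stage 2) can be run at all.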
Stage 2: Tile-Based Quality Filtering
# Remove reads from low-quality tiles
filterbytile.sh in=clumped.fq.gz out=filtered_by_tile.fq.gz
Requirements: Standard Illumina read headers; will not work with renamed reads from SRA
Stage 3: Adapter Trimming
# Trim adapter sequences
bbduk.sh in=filtered_by_tile.fq.gz out=trimmed.fq.gz ktrim=r k=23 mink=11 hdist=1 tbo tpe minlen=100 ref=bbmap/resources/adapters.fa ftm=5 ordered
Parameters:
- ktrim=r - Trim adapter sequence from the right (3') end
- k=23 mink=11 - Match 23-mers, down to 11-mers for short adapter remnants at read ends
- hdist=1 - Allow one mismatch (Hamming distance 1) during kmer matching
- tbo - Also trim adapters detected by pair overlap
- tpe - Trim both reads of a pair to the same length
- minlen=100 - Discard reads shorter than 100 bp after trimming
- ftm=5 - Force-trim read length to a multiple of 5 (removes the spurious extra base from, e.g., 151 bp reads)
Stage 4: Contamination Removal
# Remove synthetic artifacts and PhiX
bbduk.sh in=trimmed.fq.gz out=filtered.fq.gz k=27 ref=bbmap/resources/sequencing_artifacts.fa.gz,bbmap/resources/phix174_ill.ref.fa.gz ordered
Contaminant References:
- sequencing_artifacts.fa.gz - Synthetic sequences and adapters
- phix174_ill.ref.fa.gz - PhiX control (spike-in) sequences
- Quality Trimming: Add qtrim=r trimq=8 to this command if not doing recalibration later
Stage 5: Reference Mapping
# Map reads to reference genome
bbmap.sh in=filtered.fq.gz out=mapped.sam.gz bs=bs.sh pigz unpigz ref=reference.fa
Mapping Features:
- bs=bs.sh - Writes a helper script (bs.sh) that converts the SAM output into a sorted, indexed BAM file
- pigz unpigz - Use parallel gzip (pigz) for compression and decompression
- Default parameters are optimized for accuracy
Stage 6: Variant Calling
# Call variants from mapped reads
callvariants.sh in=mapped.sam.gz out=vars.vcf.gz ref=reference.fa ploidy=1 prefilter
Parameters:
- ploidy=1 - Assumes a haploid organism (set ploidy=2 for diploids)
- prefilter - Use a Bloom-filter prepass to screen out very low-depth potential variants, reducing memory use
Optional Enhancement Module
Advanced processing for improved accuracy through error correction and recalibration:
Recalibration Matrix Generation
# Generate recalibration matrices from initial variant calls
calctruequality.sh in=mapped.sam.gz vcf=vars.vcf.gz
Quality Recalibration
# Apply quality recalibration to reads
bbduk.sh in=filtered.fq.gz out=recal.fq.gz recalibrate ordered
Note: Recalibration is applied to the pre-mapping reads (filtered.fq.gz), which then go through error correction and re-mapping
Error Correction by Overlap
# Error-correct paired reads using overlap consensus
bbmerge.sh in=recal.fq.gz out=ecco.fq.gz ecco strict mix ordered
Requirements: Paired reads only; single-end reads skip this step
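Since single-end data must skip this stage, a small guard can branch on the mate flags in the first two headers. The `is_interleaved` helper below is hypothetical (not part of the pipeline) and assumes standard Illumina " 1:" / " 2:" mate markers:

```shell
# Hypothetical guard: succeeds only when the first two records look like an
# interleaved pair (assumes standard Illumina " 1:"/" 2:" mate flags).
is_interleaved() {
  hdr1=$1; hdr2=$2                        # headers of records 1 and 2
  case "$hdr1" in *" 1:"*) : ;; *) return 1 ;; esac
  case "$hdr2" in *" 2:"*) return 0 ;; *) return 1 ;; esac
}

# Usage against the real input (headers of records 1 and 2 are lines 1 and 5):
# h1=$(zcat recal.fq.gz | sed -n '1p'); h2=$(zcat recal.fq.gz | sed -n '5p')
# if is_interleaved "$h1" "$h2"; then
#   bbmerge.sh in=recal.fq.gz out=ecco.fq.gz ecco strict mix ordered
# else
#   ln -s recal.fq.gz ecco.fq.gz   # single-end: pass reads through unchanged
# fi
```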
Quality Trimming
# Quality trim after error correction
bbduk.sh in=ecco.fq.gz out=qtrimmed.fq.gz qtrim=r trimq=8 ordered
Re-mapping and Re-calling
# Map error-corrected reads (reuses the index written to ref/ in Stage 5, so ref= can be omitted)
bbmap.sh in=qtrimmed.fq.gz out=mapped2.sam.gz pigz unpigz
# Call variants with improved data
callvariants.sh in=mapped2.sam.gz out=vars2.vcf.gz ref=reference.fa ploidy=1 prefilter
NovaSeq Error Filtering Module
Special processing for NovaSeq-specific error patterns:
Liberal Initial Variant Calling
# Call variants with liberal settings for error detection
callvariants.sh in=mapped.sam.gz vcf=vars.vcf.gz ref=reference.fa clearfilters minreads=2 ploidy=1 prefilter
Bad Read Filtering
# Remove reads with multiple NovaSeq-specific errors
filtersam.sh in=mapped.sam.gz out=clean.sam.gz outb=dirty.sam.gz vcf=vars.vcf.gz maxbadsubs=2 mbsad=2 mbsrd=2
Parameters:
- maxbadsubs=2 - Discard reads containing more than 2 "bad" substitutions
- mbsad=2 - Max bad-sub allele depth: a substitution counts as bad when its allele depth is at most 2
- mbsrd=2 - Min bad-sub read depth: substitutions only count as bad at positions with read depth of at least 2
- outb=dirty.sam.gz - Write the discarded reads to a separate file for inspection
Final Variant Calling
# Call variants on cleaned reads
callvariants.sh in=clean.sam.gz out=vars2.vcf.gz ref=reference.fa ploidy=1 prefilter
Usage Examples
Basic Variant Calling
# Prepare input files
ln -s sample.fq.gz reads.fq.gz
ln -s genome.fa reference.fa
# Run core pipeline only
# Execute stages 1-6 from the script
clumpify.sh in=reads.fq.gz out=clumped.fq.gz dedupe optical
filterbytile.sh in=clumped.fq.gz out=filtered_by_tile.fq.gz
bbduk.sh in=filtered_by_tile.fq.gz out=trimmed.fq.gz ktrim=r k=23 mink=11 hdist=1 tbo tpe minlen=100 ref=bbmap/resources/adapters.fa ftm=5 ordered
bbduk.sh in=trimmed.fq.gz out=filtered.fq.gz k=27 ref=bbmap/resources/sequencing_artifacts.fa.gz,bbmap/resources/phix174_ill.ref.fa.gz ordered
bbmap.sh in=filtered.fq.gz out=mapped.sam.gz bs=bs.sh pigz unpigz ref=reference.fa
callvariants.sh in=mapped.sam.gz out=vars.vcf.gz ref=reference.fa ploidy=1 prefilter
Enhanced Processing with Error Correction
# Run core pipeline plus enhancement module
# ... core pipeline steps ...
calctruequality.sh in=mapped.sam.gz vcf=vars.vcf.gz
bbduk.sh in=filtered.fq.gz out=recal.fq.gz recalibrate ordered
bbmerge.sh in=recal.fq.gz out=ecco.fq.gz ecco strict mix ordered
bbduk.sh in=ecco.fq.gz out=qtrimmed.fq.gz qtrim=r trimq=8 ordered
bbmap.sh in=qtrimmed.fq.gz out=mapped2.sam.gz pigz unpigz
callvariants.sh in=mapped2.sam.gz out=vars2.vcf.gz ref=reference.fa ploidy=1 prefilter
NovaSeq Error Handling
# Include NovaSeq-specific error filtering
# ... initial mapping ...
callvariants.sh in=mapped.sam.gz vcf=vars.vcf.gz ref=reference.fa clearfilters minreads=2 ploidy=1 prefilter
filtersam.sh in=mapped.sam.gz out=clean.sam.gz outb=dirty.sam.gz vcf=vars.vcf.gz maxbadsubs=2 mbsad=2 mbsrd=2
callvariants.sh in=clean.sam.gz out=vars2.vcf.gz ref=reference.fa ploidy=1 prefilter
Platform Adaptations
For Different Illumina Platforms
- MiSeq/NextSeq: May require different adapter sequences
- NovaSeq: Include NovaSeq error filtering module
- SRA Data: Omit optical deduplication and tile filtering
For Paired vs Single-End Data
# For paired files in separate streams
bbduk.sh in1=read1.fq.gz in2=read2.fq.gz out1=clean1.fq.gz out2=clean2.fq.gz [parameters]
# For interleaved paired reads (default assumption)
bbduk.sh in=interleaved.fq.gz out=clean.fq.gz [parameters]
Quality Control Considerations
Essential QC Steps
- Duplicate Removal: Critical for variant calling accuracy
- Adapter Trimming: Prevents false variants from adapter sequences
- Contamination Removal: Eliminates false signals from spike-ins
- Quality Filtering: Improves variant call confidence
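One lightweight way to confirm these steps are not discarding too much data is to count surviving reads after each stage. A minimal sketch using only awk; the `count_reads` name is illustrative:

```shell
# Count FASTQ records on an uncompressed stream (4 lines per record).
count_reads() {
  awk 'END { print NR / 4 }'
}

# e.g. compare totals before and after a filtering stage:
# zcat reads.fq.gz | count_reads
# zcat filtered.fq.gz | count_reads
printf '@r1\nACGT\n+\nFFFF\n@r2\nTTTT\n+\nFFFF\n' | count_reads
```

A large drop at any single stage (especially adapter trimming with minlen=100) is worth investigating before trusting the final variant calls.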
Platform-Specific Issues
- HiSeq 2500: Generally high quality, standard processing
- NovaSeq: May have systematic errors requiring special filtering
- RNA-seq Data: Skip deduplication step to preserve quantification
Output Files
Core Pipeline Outputs
- vars.vcf.gz: Primary variant calls
- mapped.sam.gz: Aligned reads
- filtered.fq.gz: Preprocessed reads
- bs.sh: Helper script to convert the SAM output into a sorted, indexed BAM file
Enhanced Processing Outputs
- vars2.vcf.gz: Improved variant calls after error correction
- mapped2.sam.gz: Re-mapped error-corrected reads
- qtrimmed.fq.gz: Error-corrected and quality-trimmed reads
NovaSeq Module Outputs
- clean.sam.gz: Reads after error filtering
- dirty.sam.gz: Filtered out problematic reads
Performance Optimization
- Memory Usage: Adjust based on genome size and read count
- Parallel Processing: Most tools support multi-threading
- Disk I/O: Use fast storage for intermediate files
- Compression: Pipeline uses parallel gzip where beneficial
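All BBTools wrappers accept a leading -Xmx flag to cap the Java heap. As a rough rule of thumb, BBMap's index needs on the order of 6 bytes per reference base plus JVM overhead; the `suggest_xmx` helper below is an illustrative sketch of that sizing, not part of the pipeline:

```shell
# Illustrative sizing helper (assumption: ~6 bytes per reference base for the
# BBMap index, plus ~500 MB of JVM overhead).
suggest_xmx() {
  ref_bases=$1                              # reference size in bases
  mb=$(( ref_bases * 6 / 1000000 + 500 ))   # megabytes of heap
  printf -- '-Xmx%sm\n' "$mb"
}

# e.g. for a 5 Mbp bacterial genome:
# bbmap.sh $(suggest_xmx 5000000) in=filtered.fq.gz out=mapped.sam.gz ref=reference.fa
suggest_xmx 5000000    # prints -Xmx530m
```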
Related Tools
- clumpify.sh - Duplicate removal and clustering
- filterbytile.sh - Tile-based quality filtering
- bbduk.sh - Quality control, trimming, and recalibration
- bbmap.sh - High-accuracy read mapping
- bbmerge.sh - Error correction by overlap
- callvariants.sh - Variant calling from alignments
- calctruequality.sh - Quality score recalibration
- filtersam.sh - SAM/BAM filtering and quality control