variantPipeline.sh

Script: variantPipeline.sh Author: Brian Bushnell Last Updated: January 29, 2020 Target Platform: Illumina HiSeq 2500 (2x150bp)

Comprehensive variant calling pipeline that preprocesses Illumina sequencing data through deduplication, quality filtering, adapter trimming, contamination removal, mapping, and variant calling. Includes optional advanced modules for error correction, quality recalibration, and NovaSeq-specific error filtering.

Purpose

This pipeline provides a complete workflow for high-quality variant calling from Illumina sequencing data. It includes essential preprocessing steps, reference mapping, primary variant calling, and optional advanced processing modules for error correction and quality improvement. The pipeline is optimized for 2x150bp HiSeq 2500 data but can be adapted for other platforms.

Prerequisites

Pipeline Structure

The pipeline is organized into three main sections:

Core Pipeline Stages

Stage 1: Duplicate Removal

# Remove optical and PCR duplicates
clumpify.sh in=reads.fq.gz out=clumped.fq.gz dedupe optical

Key Points:

Stage 2: Tile-Based Quality Filtering

# Remove reads from low-quality tiles
filterbytile.sh in=clumped.fq.gz out=filtered_by_tile.fq.gz

Requirements: Standard Illumina read headers; will not work with renamed reads from SRA

Stage 3: Adapter Trimming

# Trim adapter sequences
bbduk.sh in=filtered_by_tile.fq.gz out=trimmed.fq.gz ktrim=r k=23 mink=11 hdist=1 tbo tpe minlen=100 ref=bbmap/resources/adapters.fa ftm=5 ordered

Parameters:

Stage 4: Contamination Removal

# Remove synthetic artifacts and PhiX
bbduk.sh in=trimmed.fq.gz out=filtered.fq.gz k=27 ref=bbmap/resources/sequencing_artifacts.fa.gz,bbmap/resources/phix174_ill.ref.fa.gz ordered

Contaminant References:

Stage 5: Reference Mapping

# Map reads to reference genome
bbmap.sh in=filtered.fq.gz out=mapped.sam.gz bs=bs.sh pigz unpigz ref=reference.fa

Mapping Features:

Stage 6: Variant Calling

# Call variants from mapped reads
callvariants.sh in=mapped.sam.gz out=vars.vcf.gz ref=reference.fa ploidy=1 prefilter

Parameters:

Optional Enhancement Module

Advanced processing for improved accuracy through error correction and recalibration:

Recalibration Matrix Generation

# Generate recalibration matrices from initial variant calls
calctruequality.sh in=mapped.sam.gz vcf=vars.vcf.gz

Quality Recalibration

# Apply quality recalibration to reads
bbduk.sh in=filtered.fq.gz out=recal.fq.gz recalibrate ordered

Note: Recalibration applied to unmapped reads for subsequent error correction

Error Correction by Overlap

# Error-correct paired reads using overlap consensus
bbmerge.sh in=recal.fq.gz out=ecco.fq.gz ecco strict mix ordered

Requirements: Paired reads only; single-end reads skip this step

Quality Trimming

# Quality trim after error correction
bbduk.sh in=ecco.fq.gz out=qtrimmed.fq.gz qtrim=r trimq=8 ordered

Re-mapping and Re-calling

# Map error-corrected reads
bbmap.sh in=qtrimmed.fq.gz out=mapped2.sam.gz pigz unpigz

# Call variants with improved data
callvariants.sh in=mapped2.sam.gz out=vars2.vcf.gz ref=reference.fa ploidy=1 prefilter

NovaSeq Error Filtering Module

Special processing for NovaSeq-specific error patterns:

Liberal Initial Variant Calling

# Call variants with liberal settings for error detection
callvariants.sh in=mapped.sam vcf=vars.vcf.gz ref=reference.fa clearfilters minreads=2 ploidy=1 prefilter

Bad Read Filtering

# Remove reads with multiple NovaSeq-specific errors
filtersam.sh in=mapped.sam.gz out=clean.sam.gz outb=dirty.sam.gz vcf=vars.vcf.gz maxbadsubs=2 mbsad=2 mbsrd=2

Parameters:

Final Variant Calling

# Call variants on cleaned reads
callvariants.sh in=clean.sam.gz out=vars2.vcf.gz ref=reference.fa ploidy=1 prefilter

Usage Examples

Basic Variant Calling

# Prepare input files
ln -s sample.fq.gz reads.fq.gz
ln -s genome.fa reference.fa

# Run core pipeline only
# Execute stages 1-6 from the script
clumpify.sh in=reads.fq.gz out=clumped.fq.gz dedupe optical
filterbytile.sh in=clumped.fq.gz out=filtered_by_tile.fq.gz
bbduk.sh in=filtered_by_tile.fq.gz out=trimmed.fq.gz ktrim=r k=23 mink=11 hdist=1 tbo tpe minlen=100 ref=bbmap/resources/adapters.fa ftm=5 ordered
bbduk.sh in=trimmed.fq.gz out=filtered.fq.gz k=27 ref=bbmap/resources/sequencing_artifacts.fa.gz,bbmap/resources/phix174_ill.ref.fa.gz ordered
bbmap.sh in=filtered.fq.gz out=mapped.sam.gz bs=bs.sh pigz unpigz ref=reference.fa
callvariants.sh in=mapped.sam.gz out=vars.vcf.gz ref=reference.fa ploidy=1 prefilter

Enhanced Processing with Error Correction

# Run core pipeline plus enhancement module
# ... core pipeline steps ...
calctruequality.sh in=mapped.sam.gz vcf=vars.vcf.gz
bbduk.sh in=filtered.fq.gz out=recal.fq.gz recalibrate ordered
bbmerge.sh in=recal.fq.gz out=ecco.fq.gz ecco strict mix ordered
bbduk.sh in=ecco.fq.gz out=qtrimmed.fq.gz qtrim=r trimq=8 ordered
bbmap.sh in=qtrimmed.fq.gz out=mapped2.sam.gz pigz unpigz
callvariants.sh in=mapped2.sam.gz out=vars2.vcf.gz ref=reference.fa ploidy=1 prefilter

NovaSeq Error Handling

# Include NovaSeq-specific error filtering
# ... initial mapping ...
callvariants.sh in=mapped.sam vcf=vars.vcf.gz ref=reference.fa clearfilters minreads=2 ploidy=1 prefilter
filtersam.sh in=mapped.sam.gz out=clean.sam.gz outb=dirty.sam.gz vcf=vars.vcf.gz maxbadsubs=2 mbsad=2 mbsrd=2
callvariants.sh in=clean.sam.gz out=vars2.vcf.gz ref=reference.fa ploidy=1 prefilter

Platform Adaptations

For Different Illumina Platforms

For Paired vs Single-End Data

# For paired files in separate streams
bbduk.sh in1=read1.fq.gz in2=read2.fq.gz out1=clean1.fq.gz out2=clean2.fq.gz [parameters]

# For interleaved paired reads (default assumption)
bbduk.sh in=interleaved.fq.gz out=clean.fq.gz [parameters]

Quality Control Considerations

Essential QC Steps

Platform-Specific Issues

Output Files

Core Pipeline Outputs

Enhanced Processing Outputs

NovaSeq Module Outputs

Performance Optimization

Related Tools