testPlatformQuality.sh

Script: testPlatformQuality.sh Author: Brian Bushnell Last Updated: February 21, 2018 Target: New Illumina Platforms (e.g., NovaSeq)

Comprehensive quality assessment pipeline designed to evaluate new Illumina sequencing platforms such as NovaSeq. Performs systematic analysis of read quality, adapter contamination, mapping accuracy, duplicate rates, variant calling, and quality score recalibration to characterize platform performance.

Purpose

This pipeline provides systematic evaluation of new Illumina sequencing platforms by analyzing key quality metrics including optical duplicate rates, mapping accuracy, variant calling performance, and quality score calibration. The example uses P. heparinus as a test organism but can be adapted for any reference genome system.

Prerequisites

Pipeline Stages

Stage 1: Data Preparation and Demultiplexing

# Optional: Demultiplex by barcodes if needed
# demuxbyname.sh in=all.fq.gz delimiter=space suffixmode out=%.fq.gz

# Link input files
ln -s reads.fq.gz raw.fq.gz
ln -s P.heparinus.fa ref.fa

Stage 2: Contamination Removal

# Remove PhiX contamination
# PhiX interferes with adapter detection due to different adapter sequences
bbduk.sh -Xmx1g in=raw.fq.gz out=filtered.fq.gz ref=phix k=31

Note: PhiX removal is essential unless PhiX is the intended target organism, as it uses different adapters that interfere with standard adapter sequence detection.

Stage 3: Adapter Discovery and Trimming

# Optional: Discover actual adapter sequences
# May fail with insufficient overlapping reads
bbmerge.sh in=filtered.fq.gz outa=adapters.fa reads=1m strict

# Trim adapters using discovered or default sequences
bbduk.sh -Xmx1g in=filtered.fq.gz out=trimmed.fq.gz ref=adapters k=23 mink=11 tbo tpe ktrim=r ow

Adapter Options:

Stage 4: Reference Mapping and Quality Metrics

# Map to reference with comprehensive statistics
bbmap.sh -Xmx8g in=trimmed.fq.gz ref=ref.fa maxindel=2000 slow out=mapped.sam.gz \
    covstats=covstats.txt covhist=covhist.txt 32bit bhist=bhist.txt qhist=qhist.txt \
    aqhist=aqhist.txt bqhist=bqhist.txt lhist=lhist.txt ihist=ihist.txt \
    ehist=ehist.txt qahist=qahist.txt indelhist=indelhist.txt mhist=mhist.txt \
    gchist=gchist.txt idhist=idhist.txt delcov=f ow unpigz=t pigz=t zl=6

Key Parameters:

Stage 5: Duplicate Rate Analysis

# Calculate normal duplicate rates
clumpify.sh groups=1 in=trimmed.fq.gz dedupe 1>dedupe_normal.o 2>&1

# Calculate optical duplicate rates
clumpify.sh groups=1 in=trimmed.fq.gz dedupe optical dist=12k 1>dedupe_optical.o 2>&1

Duplicate Analysis Notes:

Stage 6: Variant Calling and Quality Recalibration

# Call variants for recalibration
callvariants.sh in=mapped.sam.gz ref=ref.fa out=vars.txt vcf=vars.vcf ploidy=1 -Xmx8g ow

# Generate recalibration matrices
calctruequality.sh in=mapped.sam.gz vcf=vars.vcf -Xmx8g

# Recalibrate quality scores
bbduk.sh ow in=mapped.sam.gz out=recal.sam.gz recalibrate -Xmx8g ow

Stage 7: Quality Assessment and Validation

# Generate quality statistics ignoring real variants
bbduk.sh ow in=mapped.sam.gz bhist=bhist.txt qhist=qhist.txt aqhist=aqhist.txt \
    bqhist=bqhist.txt ehist=ehist.txt qahist=qahist.txt indelhist=indelhist.txt \
    mhist=mhist.txt idhist=idhist.txt -Xmx8g vcf=vars.vcf ow

# Generate recalibrated statistics
bbduk.sh ow in=recal.sam.gz qhist=qhist2.txt aqhist=aqhist2.txt bqhist=bqhist2.txt \
    qahist=qahist2.txt mhist=mhist2.txt idhist=idhist2.txt -Xmx8g vcf=vars.vcf ow

Stage 8: Contamination and Composition Analysis

# Identify library composition
sendsketch.sh in=trimmed.fq.gz reads=500k

# Map to comprehensive reference set for contamination assessment
seal.sh -Xmx31g in=trimmed.fq.gz ref=all_fused.fa stats=sealstats.txt ambig=toss clearzone=10 ow

Output Files and Metrics

Quality Assessment Files

Mapping and Coverage Files

Duplicate Analysis Files

Variant and Recalibration Files

Contamination Analysis Files

Usage Example

# Prepare input files
ln -s novaseq_test_reads.fq.gz reads.fq.gz
ln -s reference_genome.fa P.heparinus.fa

# Run the complete pipeline
./testPlatformQuality.sh

# Analyze results
echo "=== Quality Score Accuracy ==="
head -20 qahist.txt qahist2.txt

echo "=== Duplicate Rates ==="
grep "Duplicates:" dedupe_normal.o dedupe_optical.o

echo "=== Mapping Statistics ==="
grep "mapped:" mapped.sam.gz.stdout

Platform Evaluation Metrics

Quality Score Accuracy

Duplicate Rate Assessment

Mapping and Variant Performance

Customization Notes

Interpretation Guidelines

Related Tools