testPlatformQuality.sh
Comprehensive quality assessment pipeline designed to evaluate new Illumina sequencing platforms such as NovaSeq. Performs systematic analysis of read quality, adapter contamination, mapping accuracy, duplicate rates, variant calling, and quality score recalibration to characterize platform performance.
Purpose
This pipeline provides systematic evaluation of new Illumina sequencing platforms by analyzing key quality metrics including optical duplicate rates, mapping accuracy, variant calling performance, and quality score calibration. The example uses P. heparinus as a test organism but can be adapted for any reference genome system.
Prerequisites
- Input Data: Interleaved paired-end reads from the platform being tested
- Reference Genome: High-quality reference (P. heparinus in example)
- System Resources: 8GB+ RAM, moderate CPU requirements
- Optional: Pre-existing adapter sequences, or pipeline can discover them
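Before running the pipeline, it can help to confirm the input really is interleaved. Below is a minimal pure-shell sketch, assuming 4-line FASTQ records and pair names that differ only after a `/` (e.g. `@r1/1` and `@r1/2`); adjust the `split` for other naming conventions. With BBTools installed, `reformat.sh in=raw.fq.gz verifyinterleaved=t` performs the same check natively.

```shell
# Check that reads 1 and 2 of each pair are adjacent: in an interleaved
# FASTQ, consecutive headers should share the same name stem before "/".
check_interleaved() {
  zcat -f "$1" | awk 'NR % 4 == 1 {
      split($1, parts, "/"); stem = parts[1]
      if (++n % 2 == 0 && stem != prev) { print "NOT interleaved at record " n; exit 1 }
      prev = stem
    } END { if (n > 0 && n % 2 == 0) print "looks interleaved (" n " reads)" }'
}
# check_interleaved raw.fq.gz
```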
Pipeline Stages
Stage 1: Data Preparation and Demultiplexing
# Optional: Demultiplex by barcodes if needed
# demuxbyname.sh in=all.fq.gz delimiter=space suffixmode out=%.fq.gz
# Link input files
ln -s reads.fq.gz raw.fq.gz
ln -s P.heparinus.fa ref.fa
Stage 2: Contamination Removal
# Remove PhiX contamination
# PhiX interferes with adapter detection due to different adapter sequences
bbduk.sh -Xmx1g in=raw.fq.gz out=filtered.fq.gz ref=phix k=31
Note: PhiX removal is essential unless PhiX itself is the target organism, because PhiX spike-ins carry different adapter sequences that interfere with adapter discovery and trimming.
Stage 3: Adapter Discovery and Trimming
# Optional: Discover actual adapter sequences
# May fail with insufficient overlapping reads
bbmerge.sh in=filtered.fq.gz outa=adapters.fa reads=1m strict
# Trim adapters using discovered or default sequences
bbduk.sh -Xmx1g in=filtered.fq.gz out=trimmed.fq.gz ref=adapters k=23 mink=11 tbo tpe ktrim=r ow
Adapter Options:
- ref=adapters: uses BBMap's bundled default adapter set
- ref=adapters.fa: uses the custom adapters discovered above (if the file looks reasonable)
- Custom-discovered adapters may need trailing poly-A or poly-C runs trimmed off
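When the discovered adapters carry the trailing poly-A or poly-C runs mentioned above, a quick cleanup can be sketched with `sed` before passing the file to `ref=`. This is not part of the pipeline script, and the 4-base run threshold is an arbitrary choice; inspect the cleaned file by eye afterwards.

```shell
# Strip runs of 4+ trailing A's or C's from the sequence lines of a FASTA,
# leaving header lines (">...") untouched.
clean_adapters() {
  sed -E '/^>/!s/(AAAA+|CCCC+)$//' "$1"
}
# clean_adapters adapters.fa > adapters_clean.fa
```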
Stage 4: Reference Mapping and Quality Metrics
# Map to reference with comprehensive statistics
bbmap.sh -Xmx8g in=trimmed.fq.gz ref=ref.fa maxindel=2000 slow out=mapped.sam.gz \
covstats=covstats.txt covhist=covhist.txt 32bit bhist=bhist.txt qhist=qhist.txt \
aqhist=aqhist.txt bqhist=bqhist.txt lhist=lhist.txt ihist=ihist.txt \
ehist=ehist.txt qahist=qahist.txt indelhist=indelhist.txt mhist=mhist.txt \
gchist=gchist.txt idhist=idhist.txt delcov=f ow unpigz=t pigz=t zl=6
Key Parameters:
- maxindel=2000: allows large indels for comprehensive alignment
- slow: high-sensitivity mapping mode
- 32bit: uses 32-bit coverage counters, so per-base coverage above 65,535 is recorded correctly
- The histogram flags together produce 12+ histogram files for quality assessment
Stage 5: Duplicate Rate Analysis
# Calculate normal duplicate rates
clumpify.sh groups=1 in=trimmed.fq.gz dedupe 1>dedupe_normal.o 2>&1
# Calculate optical duplicate rates
clumpify.sh groups=1 in=trimmed.fq.gz dedupe optical dist=12k 1>dedupe_optical.o 2>&1
Duplicate Analysis Notes:
- optical dist=12k: counts only optical/clustering duplicates within a distance of 12,000 (in flow-cell coordinates)
- Fair comparisons between runs may require subsampling to a fixed number of reads
- Take the "first X reads" of a run rather than a random subsample, so that optical and tile-edge duplicates stay spatially adjacent and remain detectable
- Different sampling approaches will yield different duplicate rates
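A "first X reads" subsample can be taken with plain shell, since a FASTQ record is 4 lines and an interleaved pair is 8. The 1M-pair count below is just an example; with BBTools, `reformat.sh reads=1000000` on an interleaved file achieves the same without the line arithmetic.

```shell
# Take the first N pairs from an interleaved FASTQ (8 lines per pair),
# preserving the spatial adjacency that optical-duplicate detection needs.
first_pairs() {  # usage: first_pairs <in.fq(.gz)> <npairs>
  zcat -f "$1" | head -n "$(( $2 * 8 ))"
}
# first_pairs trimmed.fq.gz 1000000 | gzip > first1m.fq.gz
```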
Stage 6: Variant Calling and Quality Recalibration
# Call variants for recalibration
callvariants.sh in=mapped.sam.gz ref=ref.fa out=vars.txt vcf=vars.vcf ploidy=1 -Xmx8g ow
# Generate recalibration matrices
calctruequality.sh in=mapped.sam.gz vcf=vars.vcf -Xmx8g
# Recalibrate quality scores
bbduk.sh in=mapped.sam.gz out=recal.sam.gz recalibrate -Xmx8g ow
Stage 7: Quality Assessment and Validation
# Generate quality statistics ignoring real variants
bbduk.sh ow in=mapped.sam.gz bhist=bhist.txt qhist=qhist.txt aqhist=aqhist.txt \
bqhist=bqhist.txt ehist=ehist.txt qahist=qahist.txt indelhist=indelhist.txt \
mhist=mhist.txt idhist=idhist.txt -Xmx8g vcf=vars.vcf ow
# Generate recalibrated statistics
bbduk.sh ow in=recal.sam.gz qhist=qhist2.txt aqhist=aqhist2.txt bqhist=bqhist2.txt \
qahist=qahist2.txt mhist=mhist2.txt idhist=idhist2.txt -Xmx8g vcf=vars.vcf ow
Stage 8: Contamination and Composition Analysis
# Identify library composition
sendsketch.sh in=trimmed.fq.gz reads=500k
# Map to comprehensive reference set for contamination assessment
seal.sh -Xmx31g in=trimmed.fq.gz ref=all_fused.fa stats=sealstats.txt ambig=toss clearzone=10 ow
Output Files and Metrics
Quality Assessment Files
- qhist.txt / qhist2.txt: Quality score distributions (before/after recalibration)
- aqhist.txt / aqhist2.txt: Accuracy vs quality histograms
- bhist.txt: Base composition histograms
- mhist.txt / mhist2.txt: Match/mismatch distributions
- idhist.txt / idhist2.txt: Identity distributions
Mapping and Coverage Files
- covstats.txt: Coverage statistics per scaffold
- covhist.txt: Coverage depth histograms
- lhist.txt: Read length distributions
- ihist.txt: Insert size distributions
- indelhist.txt: Indel size distributions
- gchist.txt: GC content distributions
Duplicate Analysis Files
- dedupe_normal.o: Standard duplicate rate analysis
- dedupe_optical.o: Optical duplicate rate analysis
Variant and Recalibration Files
- vars.txt / vars.vcf: Called variants for recalibration
- recal.sam.gz: Quality-recalibrated alignments
Contamination Analysis Files
- SendSketch output: Taxonomic composition analysis
- sealstats.txt: Contamination mapping statistics
Usage Example
# Prepare input files
ln -s novaseq_test_reads.fq.gz reads.fq.gz
ln -s reference_genome.fa P.heparinus.fa
# Run the complete pipeline
./testPlatformQuality.sh
# Analyze results
echo "=== Quality Score Accuracy ==="
head -20 qahist.txt qahist2.txt
echo "=== Duplicate Rates ==="
grep "Duplicates:" dedupe_normal.o dedupe_optical.o
echo "=== Mapping Statistics ==="
grep "mapped:" mapped.sam.gz.stdout
Platform Evaluation Metrics
Quality Score Accuracy
- Pre-recalibration: Compare reported vs actual quality scores
- Post-recalibration: Assess improvement in quality score accuracy
- Quality trends: Analyze quality degradation patterns
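One way to reduce the qahist files to a single calibration number is the mean absolute gap between claimed and observed quality. This sketch assumes column 1 is the claimed Phred score and column 6 the empirically measured one; verify that against the `#` header line of your qahist.txt, as the column layout can vary across BBMap versions.

```shell
# Mean |claimed - observed| Phred score across qahist bins; smaller means
# better-calibrated. Column choices are an assumption -- check the header.
qa_error() {
  [ -s "$1" ] || { echo "missing: $1"; return 0; }
  awk -F '\t' '!/^#/ && NF >= 6 {
      d = $1 - $6; if (d < 0) d = -d
      sum += d; n++
    } END { if (n) printf "mean |claimed - observed| = %.2f over %d bins\n", sum / n, n }' "$1"
}
qa_error qahist.txt    # before recalibration
qa_error qahist2.txt   # after recalibration; expect a smaller value
```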
Duplicate Rate Assessment
- PCR Duplicates: Overall duplication due to amplification
- Optical Duplicates: Platform-specific cluster duplication
- Tile Effects: Spatial patterns in duplication rates
Mapping and Variant Performance
- Mapping Accuracy: Alignment quality and coverage uniformity
- Variant Detection: SNP/indel calling accuracy
- Systematic Errors: Platform-specific error patterns
Customization Notes
- Reference Selection: Choose appropriate reference organism for your evaluation
- Memory Scaling: Adjust -Xmx flags based on available system memory
- Sample Size: Consider subsampling for fair platform comparisons
- Adapter Sequences: Platform-specific adapters may require customization
- Quality Thresholds: Adjust parameters based on expected platform characteristics
Interpretation Guidelines
- Quality Recalibration: Effective recalibration indicates systematic quality score biases
- Duplicate Patterns: High optical duplicates may indicate clustering issues
- Error Profiles: Systematic error patterns reveal platform-specific characteristics
- Contamination Levels: Unexpected organisms may indicate cross-contamination
- Coverage Uniformity: Uneven coverage may indicate sequence bias
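Coverage uniformity can be summarized from covstats.txt as a length-weighted coefficient of variation across scaffolds (higher CV = less uniform coverage). The column positions are an assumption; confirm against the header that column 2 is Avg_fold and column 3 is Length.

```shell
# Length-weighted mean and CV of per-scaffold coverage from covstats.txt.
cov_cv() {
  [ -s "$1" ] || { echo "missing: $1"; return 0; }
  awk -F '\t' '!/^#/ {
      w = $3; x = $2
      sw += w; sx += w * x; sxx += w * x * x
    } END {
      if (sw > 0 && sx > 0) {
        mean = sx / sw
        var = sxx / sw - mean * mean; if (var < 0) var = 0
        printf "mean coverage = %.1f, CV = %.3f\n", mean, sqrt(var) / mean
      }
    }' "$1"
}
cov_cv covstats.txt
```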
Related Tools
- bbduk.sh: quality control, filtering, and recalibration
- bbmap.sh: high-sensitivity mapping with comprehensive statistics
- bbmerge.sh: adapter discovery through read overlap analysis
- clumpify.sh: duplicate detection and removal
- callvariants.sh: variant calling for quality assessment
- calctruequality.sh: quality score recalibration matrix generation
- sendsketch.sh: rapid taxonomic identification
- seal.sh: comprehensive contamination screening