process_lane_v007.sh
Comprehensive NovaSeq full-lane analysis pipeline that gathers quality metrics, performs PhiX-based recalibration, and generates barcode counts for downstream per-library processing. Designed as a non-destructive analysis tool that preserves original data while generating essential quality control files.
Purpose
This pipeline analyzes complete Illumina sequencing lanes to generate quality metrics and recalibration data without modifying the original sequencing files. It produces standardized output files (PHIX, TILEDUMP, QHIST, COUNTS) that integrate with the Jamo system for downstream per-library processing and quality assessment.
Prerequisites
- PhiX Spike-in: Requires PhiX control sequences in the lane (minimum 0.1%, recommended 1%+)
- BBTools Version: Requires BBTools v39.09 or later
- System Resources:
  - 64 physical CPU cores (configurable)
  - 48GB RAM for login nodes (85% of physical RAM recommended)
  - High-speed scratch storage for temporary files
- Input Data: Raw Illumina lane FASTQ file
- Optional: expected.txt file containing lane barcode list for Jamo integration
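The BBTools version requirement can be checked before a run. The following is a minimal sketch, not part of the pipeline: `version_ge` is a hypothetical helper that compares dotted version strings numerically, and how the installed version string is obtained is left to the caller.

```shell
# Hypothetical helper (not part of process_lane_v007.sh).
# Succeeds (exit status 0) when version $1 is >= version $2.
# Components are compared numerically, so 39.10 is newer than 39.09.
version_ge() {
    awk -v a="$1" -v b="$2" 'BEGIN {
        na = split(a, x, ".")
        nb = split(b, y, ".")
        n = (na > nb) ? na : nb
        for (i = 1; i <= n; i++) {
            xi = (i <= na) ? x[i] + 0 : 0   # missing components count as 0
            yi = (i <= nb) ? y[i] + 0 : 0
            if (xi > yi) exit 0
            if (xi < yi) exit 1
        }
        exit 0                              # versions are equal
    }'
}

# Example with a placeholder version string
if version_ge "39.10" "39.09"; then
    echo "BBTools version is new enough"
fi
```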
Configuration Variables
The pipeline uses environment variables that must be configured before execution:
Lane Configuration
# Lane identification and file paths
LANEID=ABXYZ # Lane-specific identifier
RAW=ABXYZ.1.fq.gz # Lane fastq filename
RAWPATH=/foo/bar/"$RAW" # Full input path
OUT="$PSCRATCH"/"$LANEID" # Output directory for large files
System Resources
# Hardware configuration
CORES=64 # Physical CPU cores
ZL=9 # Compression level (4 if bgzip unavailable)
MAXRAM=48g # 85% of physical RAM
HIGHRAM=31g # High memory operations
LOWRAM=4g # Low memory operations
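The 85% figure for MAXRAM can be derived from the machine's physical memory instead of being hard-coded. This is a rough sketch under the assumption of a Linux host; `maxram_from_kb` is a hypothetical helper, not something the pipeline provides.

```shell
# Hypothetical helper (not part of process_lane_v007.sh).
# Converts a physical-RAM figure in kB into an 85% budget in whole
# gigabytes, formatted like the MAXRAM value above (e.g. "54g").
maxram_from_kb() {
    echo "$(( $1 * 85 / 100 / 1024 / 1024 ))g"
}

# Example: a node with 64 GiB of RAM (67108864 kB)
maxram_from_kb 67108864   # prints 54g

# On Linux the input could come from /proc/meminfo (assumption):
#   MAXRAM=$(maxram_from_kb "$(awk '/^MemTotal:/ {print $2}' /proc/meminfo)")
```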
Pipeline Stages
Stage 1: PhiX Isolation and Processing
# Filter PhiX reads from raw lane data
bbduk.sh "$LOW" "$ARGS" ref=phix k=25 hdist=2 in="$RAWPATH" outm=phix.fq.gz
# Adapter trim PhiX reads for accurate alignment
bbduk.sh "$LOW" "$ARGS" in=phix.fq.gz out=phix_trimmed.fq.gz ref=adapters k=23 mink=11 hdist=2 hdist2=0 tbo tpe ktrim=r minlen=100 ordered
Parameters:
- k=25: Kmer size for PhiX detection
- hdist=2: Allow up to 2 mismatches for robust PhiX detection
- minlen=100: Minimum read length after trimming
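Once phix.fq.gz exists, the PhiX fraction can be compared against the spike-in thresholds listed under Prerequisites (0.1% minimum, 1% recommended). The sketch below is illustrative only: `phix_status` and its labels are hypothetical, and the read counts are assumed to be obtained separately (for a FASTQ file, the line count divided by 4).

```shell
# Hypothetical helper (not part of process_lane_v007.sh).
# Classifies the PhiX fraction given the PhiX read count ($1) and the
# total lane read count ($2), using integer arithmetic only.
phix_status() {
    if [ "$2" -le 0 ]; then
        echo "unknown"
        return 1
    fi
    if [ $(( $1 * 1000 )) -lt "$2" ]; then
        echo "below-minimum"        # less than 0.1% PhiX
    elif [ $(( $1 * 100 )) -lt "$2" ]; then
        echo "below-recommended"    # at least 0.1% but less than 1%
    else
        echo "ok"                   # 1% or more
    fi
}

# Example with made-up counts: 50 PhiX reads out of 10000 (0.5%)
phix_status 50 10000
```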
Stage 2: PhiX Alignment and Quality Analysis
# Align PhiX reads with comprehensive quality metrics
bbmap.sh "$HIGH" "$ARGS" ref=phix nodisk vslow maxindel=100 in=phix_trimmed.fq.gz outm=phix.sam.gz qhist="$QHIST" qahist=qahist.txt mhist=mhist.txt bhist=bhist.txt ordered
Key Features:
- vslow: Most sensitive alignment mode for accurate mapping
- maxindel=100: Limits alignment indels to at most 100 bp
- nodisk: Builds the reference index in memory instead of writing it to disk
- Generates multiple histogram files for quality assessment
Stage 3: Quality Recalibration Matrix Generation
# Calculate true quality scores using PhiX alignments
calctruequality.sh "$HIGH" "$ARGS" in="$PHIX" usetiles callvars ref=phix
Features:
- usetiles: Generates per-tile recalibration matrices
- callvars: Identifies variants for accurate quality calculation
- Uses the known PhiX sequence as the ground-truth reference
Stage 4: Lane-wide Quality Recalibration
# Apply recalibration to entire lane
bbduk.sh "$LOW" "$ARGS" in="$RAWPATH" out="$RECAL" recalibrate usetiles
Applies calculated recalibration matrices to the complete lane data, correcting systematic quality score biases.
Stage 5: Tile Quality Assessment
# Analyze per-tile quality after recalibration
filterbytile.sh "$MAX" "$ARGS" in="$RECAL" dump="$TILEDUMP"
Identifies problematic tiles and generates comprehensive tile quality metrics. This step is most effective when performed on recalibrated data for the complete lane.
Stage 6: Barcode Quantification
# Count all barcodes in the lane
countbarcodes2.sh "$HIGH" "$ARGS" in="$RAWPATH" counts="$COUNTS"
Generates comprehensive barcode counts for downstream demultiplexing and library quantification.
Output Files
The pipeline generates four essential files for Jamo system integration:
Required Output Files
- PHIX (phix.sam.gz): PhiX alignments for future recalibration
- TILEDUMP (tiledump.txt.gz): Per-tile quality metrics and filtering recommendations
- QHIST (qhist.txt): Quality score histograms for lane assessment
- COUNTS (barcodecounts.txt.gz): Comprehensive barcode quantification
Additional Quality Files
- qahist.txt: Quality vs accuracy histograms
- mhist.txt: Match/mismatch histograms
- bhist.txt: Base composition histograms
Temporary Files
- $RECAL: Recalibrated lane data (temporary, large)
- phix.fq.gz: Extracted PhiX reads
- phix_trimmed.fq.gz: Adapter-trimmed PhiX reads
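Because the temporary files are large, they may be worth removing after a successful run. The following is a cautious sketch, assuming that deleting them is acceptable in your workflow; `cleanup_temp` is a hypothetical helper and refuses to delete anything unless the 'finished' marker is present in the current directory.

```shell
# Hypothetical helper (not part of process_lane_v007.sh).
# Removes the files given as arguments, but only if the pipeline's
# 'finished' marker exists in the current directory.
cleanup_temp() {
    if [ ! -f finished ]; then
        echo "no 'finished' marker here; keeping temporary files" >&2
        return 1
    fi
    rm -f -- "$@"
}

# Possible usage after a completed run:
#   cleanup_temp "$RECAL" phix.fq.gz phix_trimmed.fq.gz
```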
Usage Example
# Configure environment variables
export LANEID=NovaSeq_001_Lane1
export RAW=NS001_L1.fastq.gz
export RAWPATH=/data/raw/"$RAW"
export PSCRATCH=/scratch/analysis
# Run the pipeline
./process_lane_v007.sh
# Check completion
if [ -f finished ]; then
    echo "Pipeline completed successfully"
    ls -la phix.sam.gz tiledump.txt.gz qhist.txt barcodecounts.txt.gz
fi
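The completion check above lists the output files but does not fail if one is absent. A stricter variant might look like the sketch below; `check_outputs` is a hypothetical helper, and the file names are the four required outputs named in this document.

```shell
# Hypothetical helper (not part of process_lane_v007.sh).
# Reports each required output that is missing or empty in directory $1
# (default: current directory) and returns non-zero if any are.
check_outputs() {
    dir=${1:-.}
    missing=0
    for f in phix.sam.gz tiledump.txt.gz qhist.txt barcodecounts.txt.gz; do
        if [ ! -s "$dir/$f" ]; then
            echo "missing or empty: $f"
            missing=1
        fi
    done
    return "$missing"
}

# Possible usage:
#   check_outputs . && echo "all required outputs present"
```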
Jamo System Integration
After pipeline completion, these files must be uploaded to Jamo and associated with the lane:
- phix.sam.gz - Enables future recalibration workflows
- tiledump.txt.gz - Tile quality analysis and filtering
- qhist.txt - Lane-wide quality assessment
- barcodecounts.txt.gz - Demultiplexing and quantification
- expected.txt - List of expected barcodes (should already be in Jamo)
- samplemap.txt - Barcode-to-sample mapping file (alternative to expected.txt)
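Before upload, it can be useful to confirm that every expected barcode actually appears in the counts. The sketch below rests on assumptions about file layout that this document does not specify: expected.txt is taken to hold one barcode per line, and the decompressed counts file is taken to be tab-delimited with the barcode in the first column and header lines starting with `#`. `missing_barcodes` is a hypothetical helper.

```shell
# Hypothetical helper (not part of process_lane_v007.sh).
# Prints barcodes listed in the expected file ($1) that do not occur in
# the first column of the counts file ($2).
# ASSUMED formats: one barcode per line in $1; tab-delimited,
# barcode-first, '#'-prefixed header lines in $2.
missing_barcodes() {
    awk -F'\t' '
        FILENAME == ARGV[1] { if ($0 !~ /^#/) seen[$1] = 1; next }
        $1 != "" && !($1 in seen) { print $1 }
    ' "$2" "$1"
}

# Possible usage (decompress the counts first):
#   gzip -dc barcodecounts.txt.gz > counts.txt
#   missing_barcodes expected.txt counts.txt
```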
Performance Characteristics
- Memory Scaling: Designed for systems with 48GB+ RAM
- CPU Utilization: Efficiently uses 64+ core systems
- I/O Pattern: Sequential reads with temporary scratch space usage
- Runtime: Proportional to lane size, typically hours for NovaSeq lanes
- Non-destructive: Original lane data remains unmodified
Platform Compatibility
- Primary Target: NovaSeq sequencing platforms
- NERSC Systems: Optimized for Perlmutter and similar HPC environments
- Alternative Systems: Adaptable to other high-performance systems
- Container Support: Compatible with containerized BBTools deployments
Quality Control Notes
- PhiX Dependency: Pipeline effectiveness directly correlates with PhiX spike-in percentage
- Tile Analysis: Most effective when performed post-recalibration on complete lanes
- Completion Marker: 'finished' file indicates successful pipeline completion
- Error Handling: Pipeline uses 'set -e' for immediate failure on errors
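Since the script stops at the first error and writes 'finished' only on success, a caller can treat the exit status and the marker together as the success signal. This is a minimal sketch of such a wrapper, not the pipeline's own logic; `run_and_verify` is a hypothetical name, and removing a stale marker before the run is an assumption about how reruns should behave.

```shell
# Hypothetical wrapper (not part of process_lane_v007.sh).
# Runs the given command and reports success only if it exits with
# status 0 AND leaves a 'finished' marker in the current directory.
run_and_verify() {
    rm -f finished              # clear any stale marker from an earlier run
    if "$@" && [ -f finished ]; then
        echo "ok"
    else
        echo "failed"
        return 1
    fi
}

# Possible usage:
#   run_and_verify ./process_lane_v007.sh
```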
Related Tools
- bbduk.sh: Quality control and recalibration
- bbmap.sh: PhiX alignment and quality metrics
- calctruequality.sh: Recalibration matrix generation
- filterbytile.sh: Tile-based quality analysis
- countbarcodes2.sh: Barcode quantification