Poly-G Artifact-Free Isolate Assembly Pipeline
State-of-the-art isolate assembly pipeline specifically designed to eliminate poly-G artifacts while producing high-quality genome assemblies. Includes comprehensive preprocessing, quality recalibration, advanced poly-G filtering, error correction, read extension, and assembly with Spades.
Overview
This pipeline represents the latest advancement in addressing poly-G artifacts that can severely impact genome assembly quality, particularly with newer Illumina sequencing chemistry. It combines traditional preprocessing steps with cutting-edge poly-G detection and removal techniques, quality score recalibration, and optimized assembly strategies for isolate genomes.
Prerequisites
System Requirements
- Perlmutter HPC system access
- BBTools development version (/global/cfs/cdirs/bbtools/jgi-bbtools/)
- Spades assembler v4.0.0 (via Shifter container)
- Mouse/Cat/Dog/Human contamination database
- 64 CPU cores recommended
- 48GB RAM (for login nodes) or 85% of requested memory for jobs
Input Requirements
- Raw (unfiltered) Illumina sequencing reads
- Sufficient coverage for quality recalibration and assembly
- Reads should be linked as "raw.fq.gz"
Configuration Variables
CORES=64 # CPU cores to use
ZL=6 # Compression level
HIGHRAM=31g # High memory allocation
MAXRAM=48g # Maximum memory (adjust for scheduled jobs)
MCDH=/global/cfs/cdirs/bbtools/RQCFilterData_Local/mousecatdoghuman/
These variables control resource allocation and should be adjusted based on your specific job requirements and system configuration.
Pipeline Stages
1. Initial Preprocessing
1.1 Adapter Detection and Trimming
# Detect adapter sequences
bbmerge.sh -Xmx4g in=raw.fq.gz outa=adapters.fa
# Trim adapters with sophisticated parameters
bbduk.sh -Xmx4g in=raw.fq.gz out=raw_trimmed.fq.gz tbo tpe hdist=2 k=23 mink=9 hdist2=1 ref=adapters.fa minlen=135 ktrim=r
Auto-detects adapter sequences from the data itself, then performs precision trimming with dual hamming distance parameters for different k-mer sizes.
1.2 Advanced Deduplication
clumpify.sh -Xmx48g in=raw_trimmed.fq.gz out=deduped.fq.gz passes=4 groups=1 dedupe optical dist=50
Removes optical duplicates and other PCR duplicates using multi-pass processing with optimized distance parameters.
1.3 Artifact Removal
bbduk.sh -Xmx4g ref=artifacts,phix literal=AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA in=deduped.fq.gz k=31 hdist=1 out=filtered.fq.gz
Removes known sequencing artifacts, PhiX contamination, and poly-A sequences that might interfere with assembly.
1.4 Human Contamination Removal
bbsplit.sh -Xmx48g deterministic ordered=false k=14 usemodulo printunmappedcount kfilter=25 maxsites=1 tipsearch=0 minratio=.9 maxindel=3 minhits=2 bw=12 bwr=0.16 fast maxsites2=10 build=1 ef=0.03 bloomfilter bloomk=29 bloomhashes=1 bloomminhits=6 bloomserial path="$MCDH" refstats=refStats.txt forcereadonly in=filtered.fq.gz out=clean.fq.gz outm=human.fq.gz
Sophisticated contamination removal targeting mouse, cat, dog, and human sequences using bloom filter acceleration and optimized alignment parameters.
2. Quality Recalibration System
2.1 Quick Assembly for Recalibration
tadpole.sh -Xmx31g in=clean.fq.gz out=qecc.fq.gz k=62 merge wash ecc tossjunk tu ldf=0.4 tossdepth=1 aecc
tadpole.sh -Xmx31g in=qecc.fq.gz out=quick.fa k=124 mcs=5 mce=4 merge
Creates a quick assembly for use as a reference in quality score recalibration, with aggressive junk removal and error correction.
2.2 Read Mapping and Variant Calling
bbmap.sh -Xmx31g in=clean.fq.gz out=clean.sam.gz vslow maxindel=40 ref=quick.fa
Maps cleaned reads to the quick assembly with very slow, high-accuracy alignment for precise variant detection.
2.3 Quality Score Recalibration
calctruequality.sh -Xmx31g in=clean.sam.gz usetiles ref=quick.fa callvars
bbduk.sh -Xmx4g in=clean.fq.gz out=clean_recal_tile.fq.gz recalibrate usetiles
Calculates true quality scores based on variant calls and tile position, then recalibrates quality scores for improved downstream processing.
2.4 Tile-Based Quality Filtering
filterbytile.sh -Xmx31g in=clean_recal_tile.fq.gz out=fbt_recal_tile.fq.gz lowqualityonly=t
Removes reads from low-quality flowcell tiles using the recalibrated quality information.
3. Poly-G Artifact Removal
3.1 Primary Poly-G Filtering
polyfilter.sh -Xmx31g in=fbt_recal_tile.fq.gz out=polyfilter_fbt_recal_tile.fq.gz
State-of-the-art poly-G artifact detection and removal using advanced algorithms that can distinguish true poly-G sequences from artifacts.
4. Isolate-Specific Assembly Preprocessing
4.1 Error Correction Phase 1
bbmerge.sh -Xmx31g in=hdist2.fq.gz out=ecco.fq.gz ecco mix adapters=adapters.fa kfilter=1 k=31
Overlap-based error correction using detected adapter sequences for improved accuracy.
4.2 Error Correction Phase 2
tadpole.sh -Xmx31g in=ecco.fq.gz out=ecct.fq.gz ecc k=62 wash tu tossdepth=1 ldf=0.15
K-mer based error correction with aggressive depth filtering optimized for isolate genomes.
4.3 Multi-Stage Read Merging
# Initial merging
bbmerge.sh -Xmx31g in=ecct.fq.gz outm=merged0.fq.gz outu=unmerged0.fq.gz kfilter=1 adapters=adapters.fa
# Iterative merging with kmer extension
bbmerge.sh -Xmx31g in=unmerged0.fq.gz extra=merged0.fq.gz out=merged_rem.fq.gz outu=unmerged_rem.fq.gz rem k=124 extend2=120
bbmerge.sh -Xmx31g in=unmerged_rem.fq.gz extra=merged0.fq.gz,merged_rem.fq.gz out=merged_rem2.fq.gz outu=unmerged_rem2.fq.gz rem k=145 extend2=140
bbmerge.sh -Xmx31g in=unmerged_rem2.fq.gz extra=merged0.fq.gz,merged_rem.fq.gz,merged_rem2.fq.gz out=merged_rem3.fq.gz outu=unmerged_rem3.fq.gz rem k=93 extend2=100 strict
Sophisticated multi-stage merging with k-mer extension to maximize the number of merged reads and improve assembly contiguity.
4.4 Read Processing and Extension
# Combine merged reads
zcat merged0.fq.gz merged_rem.fq.gz merged_rem2.fq.gz merged_rem3.fq.gz | reformat.sh -Xmx4g in=stdin.fq int=f out=merged_both.fq.gz
# Quality trim unmerged reads
bbduk.sh -Xmx4g in=unmerged_rem3.fq.gz out=qtrimmed.fq.gz qtrim=r trimq=15 cardinality cardinalityout maq=14 minlen=90 ftr=149 maxns=1
# Extend reads using overlaps
tadpole.sh -Xmx31g in=merged_both.fq.gz extra=qtrimmed.fq.gz out=merged_ext1.fq.gz mode=extend mce=4 er=10 el=10 k=145
tadpole.sh -Xmx31g in=qtrimmed.fq.gz extra=merged_both.fq.gz out=unmerged_ext1.fq.gz mode=extend mce=4 el=10 k=124
Advanced read extension using k-mer overlap information to increase read length and improve assembly quality.
5. Assembly with Spades
5.1 Optimized Spades Assembly
shifter --image=staphb/spades:4.0.0 spades.py -t 64 -k 25,55,95,127 --phred-offset 33 --only-assembler --isolate --pe-m 1 merged_ext1.fq.gz --pe-12 1 unmerged_ext1.fq.gz -o spades_out
High-performance Spades assembly using containerized environment with isolate-specific optimizations and extended reads.
6. Assembly Validation
6.1 Poly-G Contamination Check
bbduk.sh -Xmx4g in=spades_out/contigs.fasta literal=GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG hdist=2 k=25
Validates that the assembly is free from residual poly-G contamination by searching for long poly-G sequences.
6.2 Assembly Statistics
stats.sh in=spades_out/contigs.fasta
Generates comprehensive assembly statistics to evaluate contiguity and quality metrics.
Basic Usage
# 1. Link your raw reads (must be unfiltered)
ln -s path/to/your/reads.fq.gz raw.fq.gz
# 2. Adjust memory settings if running as scheduled job
# Edit MAXRAM variable to 85% of requested memory
# 3. Run the pipeline
bash assemble_polyg_isolate_v1.sh
# 4. Check results in spades_out/contigs.fasta
Key Innovations
Poly-G Artifact Removal
This pipeline includes the latest polyfilter.sh tool which uses sophisticated algorithms to:
- Distinguish between genuine poly-G sequences and sequencing artifacts
- Remove problematic reads while preserving legitimate sequences
- Handle complex poly-G patterns that simple trimming cannot address
Quality Score Recalibration
Advanced quality score recalibration system that:
- Creates a quick assembly for variant calling
- Analyzes tile-specific quality patterns
- Recalibrates quality scores based on observed error rates
- Improves downstream error correction and assembly quality
Multi-Stage Read Merging
Sophisticated merging strategy that:
- Performs multiple rounds of merging with different parameters
- Uses k-mer extension to merge previously unmergeable reads
- Maximizes the number of merged reads for better assembly
Read Extension Technology
Uses tadpole's extend mode to:
- Lengthen reads using k-mer overlap information
- Improve assembly contiguity
- Reduce gaps and improve scaffolding
Output Files
- spades_out/contigs.fasta - Final assembly contigs
- adapters.fa - Auto-detected adapter sequences
- clean.fq.gz - Contamination-free reads
- human.fq.gz - Removed human contamination
- refStats.txt - Contamination removal statistics
- quick.fa - Quick assembly for recalibration
- clean.sam.gz - Mapping results for recalibration
- merged_ext1.fq.gz - Extended merged reads
- unmerged_ext1.fq.gz - Extended unmerged reads
- spades.o - Spades assembly log
Memory and Performance
Memory Configuration
- MAXRAM (48g): For memory-intensive steps, adjust for your system
- HIGHRAM (31g): For high-memory operations
- LOWRAM (4g): For lightweight operations
Performance Optimization
- 64 cores: Fully utilizes modern HPC systems
- Compression level 6: Balances speed and disk usage
- Containerized Spades: Ensures reproducibility and latest version
Troubleshooting
- Memory errors: Adjust MAXRAM based on available system memory
- Poly-G contamination detected: Check if polyfilter.sh completed successfully
- Low assembly quality: Verify input data quality and coverage
- Spades failures: Check spades.o log file for detailed error messages
- Path not found errors: Ensure BBTools dev version path is correct
- Contamination database errors: Verify MCDH path exists and is accessible
Version History
- v1.1 (October 30, 2024): Latest version with improved poly-G filtering
- v1.0: Initial release of poly-G artifact-free pipeline