Poly-G Artifact-Free Isolate Assembly Pipeline

Overview

This pipeline represents the latest advancement in addressing poly-G artifacts that can severely impact genome assembly quality, particularly with newer Illumina sequencing chemistry. It combines traditional preprocessing steps with cutting-edge poly-G detection and removal techniques, quality score recalibration, and optimized assembly strategies for isolate genomes.

Important: This pipeline is optimized for the Perlmutter HPC system and requires specific path configurations. Memory settings are configured for login nodes and should be adjusted for scheduled jobs (set MAXRAM to 85% of requested memory).

Prerequisites

System Requirements

Perlmutter HPC system access
BBTools development version (/global/cfs/cdirs/bbtools/jgi-bbtools/)
Spades assembler v4.0.0 (via Shifter container)
Mouse/Cat/Dog/Human contamination database
64 CPU cores recommended
48GB RAM (for login nodes) or 85% of requested memory for jobs

Input Requirements

Raw (unfiltered) Illumina sequencing reads
Sufficient coverage for quality recalibration and assembly
Reads should be linked as "raw.fq.gz"

Configuration Variables

CORES=64                    # CPU cores to use
ZL=6                        # Compression level
HIGHRAM=31g                 # High memory allocation
MAXRAM=48g                  # Maximum memory (adjust for scheduled jobs)
MCDH=/global/cfs/cdirs/bbtools/RQCFilterData_Local/mousecatdoghuman/

These variables control resource allocation and should be adjusted based on your specific job requirements and system configuration.

Pipeline Stages

1. Initial Preprocessing

1.1 Adapter Detection and Trimming

# Detect adapter sequences
bbmerge.sh -Xmx4g in=raw.fq.gz outa=adapters.fa

# Trim adapters with sophisticated parameters
bbduk.sh -Xmx4g in=raw.fq.gz out=raw_trimmed.fq.gz tbo tpe hdist=2 k=23 mink=9 hdist2=1 ref=adapters.fa minlen=135 ktrim=r

Auto-detects adapter sequences from the data itself, then performs precision trimming with dual hamming distance parameters for different k-mer sizes.

1.2 Advanced Deduplication

clumpify.sh -Xmx48g in=raw_trimmed.fq.gz out=deduped.fq.gz passes=4 groups=1 dedupe optical dist=50

Removes optical duplicates and other PCR duplicates using multi-pass processing with optimized distance parameters.

1.3 Artifact Removal

bbduk.sh -Xmx4g ref=artifacts,phix literal=AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA in=deduped.fq.gz k=31 hdist=1 out=filtered.fq.gz

Removes known sequencing artifacts, PhiX contamination, and poly-A sequences that might interfere with assembly.

1.4 Human Contamination Removal

bbsplit.sh -Xmx48g deterministic ordered=false k=14 usemodulo printunmappedcount kfilter=25 maxsites=1 tipsearch=0 minratio=.9 maxindel=3 minhits=2 bw=12 bwr=0.16 fast maxsites2=10 build=1 ef=0.03 bloomfilter bloomk=29 bloomhashes=1 bloomminhits=6 bloomserial path="$MCDH" refstats=refStats.txt forcereadonly in=filtered.fq.gz out=clean.fq.gz outm=human.fq.gz

Sophisticated contamination removal targeting mouse, cat, dog, and human sequences using bloom filter acceleration and optimized alignment parameters.

2. Quality Recalibration System

2.1 Quick Assembly for Recalibration

tadpole.sh -Xmx31g in=clean.fq.gz out=qecc.fq.gz k=62 merge wash ecc tossjunk tu ldf=0.4 tossdepth=1 aecc
tadpole.sh -Xmx31g in=qecc.fq.gz out=quick.fa k=124 mcs=5 mce=4 merge

Creates a quick assembly for use as a reference in quality score recalibration, with aggressive junk removal and error correction.

2.2 Read Mapping and Variant Calling

bbmap.sh -Xmx31g in=clean.fq.gz out=clean.sam.gz vslow maxindel=40 ref=quick.fa

Maps cleaned reads to the quick assembly with very slow, high-accuracy alignment for precise variant detection.

2.3 Quality Score Recalibration

calctruequality.sh -Xmx31g in=clean.sam.gz usetiles ref=quick.fa callvars
bbduk.sh -Xmx4g in=clean.fq.gz out=clean_recal_tile.fq.gz recalibrate usetiles

Calculates true quality scores based on variant calls and tile position, then recalibrates quality scores for improved downstream processing.

2.4 Tile-Based Quality Filtering

filterbytile.sh -Xmx31g in=clean_recal_tile.fq.gz out=fbt_recal_tile.fq.gz lowqualityonly=t

Removes reads from low-quality flowcell tiles using the recalibrated quality information.

3. Poly-G Artifact Removal

3.1 Primary Poly-G Filtering

polyfilter.sh -Xmx31g in=fbt_recal_tile.fq.gz out=polyfilter_fbt_recal_tile.fq.gz

State-of-the-art poly-G artifact detection and removal using advanced algorithms that can distinguish true poly-G sequences from artifacts.

4. Isolate-Specific Assembly Preprocessing

4.1 Error Correction Phase 1

bbmerge.sh -Xmx31g in=hdist2.fq.gz out=ecco.fq.gz ecco mix adapters=adapters.fa kfilter=1 k=31

Overlap-based error correction using detected adapter sequences for improved accuracy.

4.2 Error Correction Phase 2

tadpole.sh -Xmx31g in=ecco.fq.gz out=ecct.fq.gz ecc k=62 wash tu tossdepth=1 ldf=0.15

K-mer based error correction with aggressive depth filtering optimized for isolate genomes.

4.3 Multi-Stage Read Merging

# Initial merging
bbmerge.sh -Xmx31g in=ecct.fq.gz outm=merged0.fq.gz outu=unmerged0.fq.gz kfilter=1 adapters=adapters.fa

# Iterative merging with kmer extension
bbmerge.sh -Xmx31g in=unmerged0.fq.gz extra=merged0.fq.gz out=merged_rem.fq.gz outu=unmerged_rem.fq.gz rem k=124 extend2=120
bbmerge.sh -Xmx31g in=unmerged_rem.fq.gz extra=merged0.fq.gz,merged_rem.fq.gz out=merged_rem2.fq.gz outu=unmerged_rem2.fq.gz rem k=145 extend2=140
bbmerge.sh -Xmx31g in=unmerged_rem2.fq.gz extra=merged0.fq.gz,merged_rem.fq.gz,merged_rem2.fq.gz out=merged_rem3.fq.gz outu=unmerged_rem3.fq.gz rem k=93 extend2=100 strict

Sophisticated multi-stage merging with k-mer extension to maximize the number of merged reads and improve assembly contiguity.

4.4 Read Processing and Extension

# Combine merged reads
zcat merged0.fq.gz merged_rem.fq.gz merged_rem2.fq.gz merged_rem3.fq.gz | reformat.sh -Xmx4g in=stdin.fq int=f out=merged_both.fq.gz

# Quality trim unmerged reads
bbduk.sh -Xmx4g in=unmerged_rem3.fq.gz out=qtrimmed.fq.gz qtrim=r trimq=15 cardinality cardinalityout maq=14 minlen=90 ftr=149 maxns=1

# Extend reads using overlaps
tadpole.sh -Xmx31g in=merged_both.fq.gz extra=qtrimmed.fq.gz out=merged_ext1.fq.gz mode=extend mce=4 er=10 el=10 k=145
tadpole.sh -Xmx31g in=qtrimmed.fq.gz extra=merged_both.fq.gz out=unmerged_ext1.fq.gz mode=extend mce=4 el=10 k=124

Advanced read extension using k-mer overlap information to increase read length and improve assembly quality.

5. Assembly with Spades

5.1 Optimized Spades Assembly

shifter --image=staphb/spades:4.0.0 spades.py -t 64 -k 25,55,95,127 --phred-offset 33 --only-assembler --isolate --pe-m 1 merged_ext1.fq.gz --pe-12 1 unmerged_ext1.fq.gz -o spades_out

High-performance Spades assembly using containerized environment with isolate-specific optimizations and extended reads.

6. Assembly Validation

6.1 Poly-G Contamination Check

bbduk.sh -Xmx4g in=spades_out/contigs.fasta literal=GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG hdist=2 k=25

Validates that the assembly is free from residual poly-G contamination by searching for long poly-G sequences.

6.2 Assembly Statistics

stats.sh in=spades_out/contigs.fasta

Generates comprehensive assembly statistics to evaluate contiguity and quality metrics.

Basic Usage

# 1. Link your raw reads (must be unfiltered)
ln -s path/to/your/reads.fq.gz raw.fq.gz

# 2. Adjust memory settings if running as scheduled job
# Edit MAXRAM variable to 85% of requested memory

# 3. Run the pipeline
bash assemble_polyg_isolate_v1.sh

# 4. Check results in spades_out/contigs.fasta

Key Innovations

Poly-G Artifact Removal

This pipeline includes the latest polyfilter.sh tool which uses sophisticated algorithms to:

Distinguish between genuine poly-G sequences and sequencing artifacts
Remove problematic reads while preserving legitimate sequences
Handle complex poly-G patterns that simple trimming cannot address

Quality Score Recalibration

Advanced quality score recalibration system that:

Creates a quick assembly for variant calling
Analyzes tile-specific quality patterns
Recalibrates quality scores based on observed error rates
Improves downstream error correction and assembly quality

Multi-Stage Read Merging

Sophisticated merging strategy that:

Performs multiple rounds of merging with different parameters
Uses k-mer extension to merge previously unmergeable reads
Maximizes the number of merged reads for better assembly

Read Extension Technology

Uses tadpole's extend mode to:

Lengthen reads using k-mer overlap information
Improve assembly contiguity
Reduce gaps and improve scaffolding

Output Files

spades_out/contigs.fasta - Final assembly contigs
adapters.fa - Auto-detected adapter sequences
clean.fq.gz - Contamination-free reads
human.fq.gz - Removed human contamination
refStats.txt - Contamination removal statistics
quick.fa - Quick assembly for recalibration
clean.sam.gz - Mapping results for recalibration
merged_ext1.fq.gz - Extended merged reads
unmerged_ext1.fq.gz - Extended unmerged reads
spades.o - Spades assembly log

Memory and Performance

Memory Configuration

MAXRAM (48g): For memory-intensive steps, adjust for your system
HIGHRAM (31g): For high-memory operations
LOWRAM (4g): For lightweight operations

Performance Optimization

64 cores: Fully utilizes modern HPC systems
Compression level 6: Balances speed and disk usage
Containerized Spades: Ensures reproducibility and latest version

Troubleshooting

Memory errors: Adjust MAXRAM based on available system memory
Poly-G contamination detected: Check if polyfilter.sh completed successfully
Low assembly quality: Verify input data quality and coverage
Spades failures: Check spades.o log file for detailed error messages
Path not found errors: Ensure BBTools dev version path is correct
Contamination database errors: Verify MCDH path exists and is accessible

Version History

v1.1 (October 30, 2024): Latest version with improved poly-G filtering
v1.0: Initial release of poly-G artifact-free pipeline