Poly-G Artifact-Free Isolate Assembly Pipeline

Script: assemble_polyg_isolate_v1.sh Author: Brian Bushnell Version: 1.1 Last Updated: October 30, 2024 Platform: Perlmutter HPC

State-of-the-art isolate assembly pipeline specifically designed to eliminate poly-G artifacts while producing high-quality genome assemblies. Includes comprehensive preprocessing, quality recalibration, advanced poly-G filtering, error correction, read extension, and assembly with Spades.

Overview

This pipeline represents the latest advancement in addressing poly-G artifacts that can severely impact genome assembly quality, particularly with newer Illumina sequencing chemistry. It combines traditional preprocessing steps with cutting-edge poly-G detection and removal techniques, quality score recalibration, and optimized assembly strategies for isolate genomes.

Important: This pipeline is optimized for the Perlmutter HPC system and requires specific path configurations. Memory settings are configured for login nodes and should be adjusted for scheduled jobs (set MAXRAM to 85% of requested memory).

Prerequisites

System Requirements

Input Requirements

Configuration Variables

CORES=64                    # CPU cores to use
ZL=6                        # Compression level
HIGHRAM=31g                 # High memory allocation
MAXRAM=48g                  # Maximum memory (adjust for scheduled jobs)
MCDH=/global/cfs/cdirs/bbtools/RQCFilterData_Local/mousecatdoghuman/

These variables control resource allocation and should be adjusted based on your specific job requirements and system configuration.

Pipeline Stages

1. Initial Preprocessing

1.1 Adapter Detection and Trimming

# Detect adapter sequences
bbmerge.sh -Xmx4g in=raw.fq.gz outa=adapters.fa

# Trim adapters with sophisticated parameters
bbduk.sh -Xmx4g in=raw.fq.gz out=raw_trimmed.fq.gz tbo tpe hdist=2 k=23 mink=9 hdist2=1 ref=adapters.fa minlen=135 ktrim=r

Auto-detects adapter sequences from the data itself, then performs precision trimming with dual hamming distance parameters for different k-mer sizes.

1.2 Advanced Deduplication

clumpify.sh -Xmx48g in=raw_trimmed.fq.gz out=deduped.fq.gz passes=4 groups=1 dedupe optical dist=50

Removes optical duplicates and other PCR duplicates using multi-pass processing with optimized distance parameters.

1.3 Artifact Removal

bbduk.sh -Xmx4g ref=artifacts,phix literal=AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA in=deduped.fq.gz k=31 hdist=1 out=filtered.fq.gz

Removes known sequencing artifacts, PhiX contamination, and poly-A sequences that might interfere with assembly.

1.4 Human Contamination Removal

bbsplit.sh -Xmx48g deterministic ordered=false k=14 usemodulo printunmappedcount kfilter=25 maxsites=1 tipsearch=0 minratio=.9 maxindel=3 minhits=2 bw=12 bwr=0.16 fast maxsites2=10 build=1 ef=0.03 bloomfilter bloomk=29 bloomhashes=1 bloomminhits=6 bloomserial path="$MCDH" refstats=refStats.txt forcereadonly in=filtered.fq.gz out=clean.fq.gz outm=human.fq.gz

Sophisticated contamination removal targeting mouse, cat, dog, and human sequences using bloom filter acceleration and optimized alignment parameters.

2. Quality Recalibration System

2.1 Quick Assembly for Recalibration

tadpole.sh -Xmx31g in=clean.fq.gz out=qecc.fq.gz k=62 merge wash ecc tossjunk tu ldf=0.4 tossdepth=1 aecc
tadpole.sh -Xmx31g in=qecc.fq.gz out=quick.fa k=124 mcs=5 mce=4 merge

Creates a quick assembly for use as a reference in quality score recalibration, with aggressive junk removal and error correction.

2.2 Read Mapping and Variant Calling

bbmap.sh -Xmx31g in=clean.fq.gz out=clean.sam.gz vslow maxindel=40 ref=quick.fa

Maps cleaned reads to the quick assembly with very slow, high-accuracy alignment for precise variant detection.

2.3 Quality Score Recalibration

calctruequality.sh -Xmx31g in=clean.sam.gz usetiles ref=quick.fa callvars
bbduk.sh -Xmx4g in=clean.fq.gz out=clean_recal_tile.fq.gz recalibrate usetiles

Calculates true quality scores based on variant calls and tile position, then recalibrates quality scores for improved downstream processing.

2.4 Tile-Based Quality Filtering

filterbytile.sh -Xmx31g in=clean_recal_tile.fq.gz out=fbt_recal_tile.fq.gz lowqualityonly=t

Removes reads from low-quality flowcell tiles using the recalibrated quality information.

3. Poly-G Artifact Removal

3.1 Primary Poly-G Filtering

polyfilter.sh -Xmx31g in=fbt_recal_tile.fq.gz out=polyfilter_fbt_recal_tile.fq.gz

State-of-the-art poly-G artifact detection and removal using advanced algorithms that can distinguish true poly-G sequences from artifacts.

4. Isolate-Specific Assembly Preprocessing

4.1 Error Correction Phase 1

bbmerge.sh -Xmx31g in=hdist2.fq.gz out=ecco.fq.gz ecco mix adapters=adapters.fa kfilter=1 k=31

Overlap-based error correction using detected adapter sequences for improved accuracy.

4.2 Error Correction Phase 2

tadpole.sh -Xmx31g in=ecco.fq.gz out=ecct.fq.gz ecc k=62 wash tu tossdepth=1 ldf=0.15

K-mer based error correction with aggressive depth filtering optimized for isolate genomes.

4.3 Multi-Stage Read Merging

# Initial merging
bbmerge.sh -Xmx31g in=ecct.fq.gz outm=merged0.fq.gz outu=unmerged0.fq.gz kfilter=1 adapters=adapters.fa

# Iterative merging with kmer extension
bbmerge.sh -Xmx31g in=unmerged0.fq.gz extra=merged0.fq.gz out=merged_rem.fq.gz outu=unmerged_rem.fq.gz rem k=124 extend2=120
bbmerge.sh -Xmx31g in=unmerged_rem.fq.gz extra=merged0.fq.gz,merged_rem.fq.gz out=merged_rem2.fq.gz outu=unmerged_rem2.fq.gz rem k=145 extend2=140
bbmerge.sh -Xmx31g in=unmerged_rem2.fq.gz extra=merged0.fq.gz,merged_rem.fq.gz,merged_rem2.fq.gz out=merged_rem3.fq.gz outu=unmerged_rem3.fq.gz rem k=93 extend2=100 strict

Sophisticated multi-stage merging with k-mer extension to maximize the number of merged reads and improve assembly contiguity.

4.4 Read Processing and Extension

# Combine merged reads
zcat merged0.fq.gz merged_rem.fq.gz merged_rem2.fq.gz merged_rem3.fq.gz | reformat.sh -Xmx4g in=stdin.fq int=f out=merged_both.fq.gz

# Quality trim unmerged reads
bbduk.sh -Xmx4g in=unmerged_rem3.fq.gz out=qtrimmed.fq.gz qtrim=r trimq=15 cardinality cardinalityout maq=14 minlen=90 ftr=149 maxns=1

# Extend reads using overlaps
tadpole.sh -Xmx31g in=merged_both.fq.gz extra=qtrimmed.fq.gz out=merged_ext1.fq.gz mode=extend mce=4 er=10 el=10 k=145
tadpole.sh -Xmx31g in=qtrimmed.fq.gz extra=merged_both.fq.gz out=unmerged_ext1.fq.gz mode=extend mce=4 el=10 k=124

Advanced read extension using k-mer overlap information to increase read length and improve assembly quality.

5. Assembly with Spades

5.1 Optimized Spades Assembly

shifter --image=staphb/spades:4.0.0 spades.py -t 64 -k 25,55,95,127 --phred-offset 33 --only-assembler --isolate --pe-m 1 merged_ext1.fq.gz --pe-12 1 unmerged_ext1.fq.gz -o spades_out

High-performance Spades assembly using containerized environment with isolate-specific optimizations and extended reads.

6. Assembly Validation

6.1 Poly-G Contamination Check

bbduk.sh -Xmx4g in=spades_out/contigs.fasta literal=GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG hdist=2 k=25

Validates that the assembly is free from residual poly-G contamination by searching for long poly-G sequences.

6.2 Assembly Statistics

stats.sh in=spades_out/contigs.fasta

Generates comprehensive assembly statistics to evaluate contiguity and quality metrics.

Basic Usage

# 1. Link your raw reads (must be unfiltered)
ln -s path/to/your/reads.fq.gz raw.fq.gz

# 2. Adjust memory settings if running as scheduled job
# Edit MAXRAM variable to 85% of requested memory

# 3. Run the pipeline
bash assemble_polyg_isolate_v1.sh

# 4. Check results in spades_out/contigs.fasta

Key Innovations

Poly-G Artifact Removal

This pipeline includes the latest polyfilter.sh tool which uses sophisticated algorithms to:

Quality Score Recalibration

Advanced quality score recalibration system that:

Multi-Stage Read Merging

Sophisticated merging strategy that:

Read Extension Technology

Uses tadpole's extend mode to:

Output Files

Memory and Performance

Memory Configuration

Performance Optimization

Troubleshooting

Version History