Poly-G Artifact-Free Metagenome Assembly Pipeline

Script: assemble_polyg_meta_v1.sh
Author: Brian Bushnell
Version: 1.0
Last Updated: September 17, 2024
Platform: Perlmutter HPC

Metagenome assembly pipeline designed to eliminate poly-G artifacts while handling the complexities of metagenomic data. It includes specialized preprocessing, quality-score recalibration, poly-G filtering, error correction, and metagenome-specific assembly optimization.

Overview

This pipeline extends the poly-G artifact removal technology to metagenomic assemblies, addressing the unique challenges of assembling complex microbial communities while eliminating sequencing artifacts. It combines traditional metagenomic preprocessing with cutting-edge poly-G detection, quality score recalibration, and assembly strategies optimized for diverse community structures.

Important: This pipeline requires the development version of BBTools and is optimized for Perlmutter HPC. Some tools may not be available in the release version yet. Memory settings are configured for login nodes and should be adjusted for scheduled jobs.

Prerequisites

System Requirements

Input Requirements

Configuration Variables

CORES=64                    # CPU cores to use
ZL=6                        # Compression level
MAXRAM=48g                  # Maximum memory (adjust for scheduled jobs)
HIGHRAM=31g                 # High memory allocation
LOWRAM=4g                   # Low memory allocation

These variables control resource allocation and should be adjusted based on your specific job requirements and metagenomic data complexity.
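
For scheduled jobs, a minimal sketch of deriving these values from the Slurm allocation (assuming SLURM_MEM_PER_NODE is exported in MB, which is site-dependent) might look like:

# Hypothetical helper: set MAXRAM to ~85% of the Slurm memory allocation.
# SLURM_MEM_PER_NODE is reported in MB; convert to whole gigabytes.
if [ -n "${SLURM_MEM_PER_NODE:-}" ]; then
    MAXRAM="$(( SLURM_MEM_PER_NODE * 85 / 100 / 1024 ))g"
fi
echo "Using MAXRAM=$MAXRAM"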

Pipeline Stages

1. Initial Preprocessing (Shared with Isolate Pipeline)

1.1 Adapter Detection and Trimming

# Detect adapter sequences
bbmerge.sh -Xmx4g in=raw.fq.gz outa=adapters.fa

# Trim adapters with sophisticated parameters
bbduk.sh -Xmx4g in=raw.fq.gz out=raw_trimmed.fq.gz tbo tpe hdist=2 k=23 mink=9 hdist2=1 ref=adapters.fa minlen=135 ktrim=r

Auto-detects adapter sequences from metagenomic data, then performs precision trimming optimized for diverse sequence content.
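
If bbmerge detects no adapters (an empty adapters.fa), one option, sketched here, is to fall back to the adapter reference bundled with BBTools via the ref=adapters keyword (assuming that keyword is available in your BBTools install):

# Optional guard: use the bundled adapter reference if detection found nothing.
if [ -s adapters.fa ]; then
    ADAPTER_REF=adapters.fa
else
    ADAPTER_REF=adapters      # built-in reference shipped with BBTools
fi
bbduk.sh -Xmx4g in=raw.fq.gz out=raw_trimmed.fq.gz tbo tpe hdist=2 k=23 mink=9 hdist2=1 ref=$ADAPTER_REF minlen=135 ktrim=r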

1.2 Optical Duplicate Removal

clumpify.sh -Xmx48g in=raw_trimmed.fq.gz out=deduped.fq.gz passes=4 dedupe optical dist=50

Removes optical duplicates while preserving genuine sequence diversity important for metagenomes.

1.3 Artifact and Contamination Removal

# Remove sequencing artifacts
bbduk.sh -Xmx4g ref=artifacts,phix literal=AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA in=deduped.fq.gz k=31 hdist=1 out=filtered.fq.gz

# Remove host contamination
bbsplit.sh -Xmx48g deterministic ordered=false k=14 usemodulo printunmappedcount kfilter=25 maxsites=1 tipsearch=0 minratio=.9 maxindel=3 minhits=2 bw=12 bwr=0.16 fast maxsites2=10 build=1 ef=0.03 bloomfilter bloomk=29 bloomhashes=1 bloomminhits=6 bloomserial path=/global/cfs/cdirs/bbtools/RQCFilterData_Local/mousecatdoghuman/ refstats=refStats.txt forcereadonly in=filtered.fq.gz out=clean.fq.gz outm=human.fq.gz

Removes sequencing artifacts and host contamination using optimized parameters for metagenomic complexity.

2. Quality Recalibration System

2.1 Quick Metagenomic Assembly

tadpole.sh -Xmx31g in=clean.fq.gz out=qecc.fq.gz k=62 merge wash ecc tossjunk tu ldf=0.6 tossdepth=1 aecc
tadpole.sh -Xmx31g in=qecc.fq.gz out=quick.fa k=93 mcs=5 mce=4 merge

Creates a quick assembly optimized for metagenomic diversity with adjusted parameters (ldf=0.6 vs 0.4 for isolates, k=93 vs k=124) to handle community complexity.

2.2 Read Mapping for Recalibration

bbmap.sh -Xmx48g in=clean.fq.gz outm=clean.sam.gz vslow maxindel=40 ref=quick.fa

Maps reads to the metagenomic assembly for quality recalibration, using high memory allocation for complex mapping.

2.3 Quality Score Recalibration

calctruequality.sh -Xmx31g in=clean.sam.gz usetiles ref=quick.fa callvars
bbduk.sh -Xmx4g in=clean.fq.gz out=clean_recal_tile.fq.gz recalibrate usetiles

Recalibrates quality scores based on metagenomic variant patterns and tile-specific effects.

2.4 Tile-Based Quality Filtering

filterbytile.sh -Xmx31g in=clean_recal_tile.fq.gz out=fbt_recal_tile.fq.gz lowqualityonly=t

Removes reads from low-quality flowcell areas using recalibrated quality information.

3. Poly-G Artifact Removal

3.1 Primary Poly-G Filtering

polyfilter.sh -Xmx31g in=fbt_recal_tile.fq.gz out=polyfilter_fbt_recal_tile.fq.gz

Advanced poly-G artifact detection optimized for metagenomic data complexity.

3.2 Additional Poly-G Trimming

bbduk.sh -Xmx4g in=polyfilter_fbt_recal_tile.fq.gz trimpolyg=6 trimpolyc=6 maxnonpoly=2 minlen=135 literal=GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG k=29 hdist=2 out=hdist2.fq.gz

Secondary poly-G removal step that complements the primary filtering, ensuring comprehensive artifact removal while preserving legitimate sequences.
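
To gauge how much the two poly-G steps removed, a simple read count at each stage can be run (a sketch; file names match the commands above):

# Report read counts before and after the poly-G filtering steps.
for f in fbt_recal_tile.fq.gz polyfilter_fbt_recal_tile.fq.gz hdist2.fq.gz; do
    printf '%s\t%d reads\n' "$f" "$(( $(zcat "$f" | wc -l) / 4 ))"
done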

4. Metagenomic Assembly Preprocessing

4.1 Error Correction Phase 1

bbmerge.sh -Xmx48g in=hdist2.fq.gz out=ecco.fq.gz ecco mix adapters=adapters.fa kfilter=1 k=31 prefilter=1

Overlap-based error correction; prefilter=1 reduces memory usage on large metagenomic datasets.

4.2 Optional Error Correction Phase 2

tadpole.sh -Xmx48g in=ecco.fq.gz out=ecct.fq.gz ecc k=62 wash prefilter=1

K-mer based error correction (marked as optional) with prefilter for memory management in complex metagenomes.
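
If this optional step is skipped, the phase-1 output can simply stand in for its result so the downstream merging commands run unchanged (a sketch):

# Skip k-mer correction: reuse the overlap-corrected reads under the expected name.
ln -s ecco.fq.gz ecct.fq.gz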

4.3 Multi-Stage Metagenomic Read Merging

# Initial merging
bbmerge.sh -Xmx48g in=ecct.fq.gz outm=merged0.fq.gz outu=unmerged0.fq.gz kfilter=1 adapters=adapters.fa prefilter=1

# Iterative merging with prefilter for memory efficiency
bbmerge.sh -Xmx31g in=unmerged0.fq.gz extra=merged0.fq.gz out=merged_rem.fq.gz outu=unmerged_rem.fq.gz rem k=124 extend2=120 prefilter=1
bbmerge.sh -Xmx31g in=unmerged_rem.fq.gz extra=merged0.fq.gz,merged_rem.fq.gz out=merged_rem2.fq.gz outu=unmerged_rem2.fq.gz rem k=145 extend2=140 prefilter=1
bbmerge.sh -Xmx31g in=unmerged_rem2.fq.gz extra=merged0.fq.gz,merged_rem.fq.gz,merged_rem2.fq.gz out=merged_rem3.fq.gz outu=unmerged_rem3.fq.gz rem k=93 extend2=100 strict prefilter=1

Multi-stage merging optimized for metagenomic complexity with prefilter flags to manage memory usage with diverse communities.

4.4 Final Read Processing

# Combine merged reads
zcat merged0.fq.gz merged_rem.fq.gz merged_rem2.fq.gz merged_rem3.fq.gz | reformat.sh -Xmx4g in=stdin.fq int=f out=merged_both.fq.gz

# Quality trim unmerged reads
bbduk.sh -Xmx4g in=unmerged_rem3.fq.gz out=qtrimmed.fq.gz qtrim=r trimq=15 cardinality cardinalityout maq=14 minlen=90 ftr=149 maxns=1

Final processing of merged and unmerged reads optimized for metagenomic assembly.

5. Metagenomic Assembly with SPAdes

5.1 Metagenomic SPAdes Assembly

shifter --image=staphb/spades:4.0.0 spades.py -t 64 -k 25,55,95,127 --phred-offset 33 --only-assembler --meta --pe-m 1 merged_both.fq.gz --pe-12 1 qtrimmed.fq.gz -o spades_out

Metagenomic SPAdes assembly using the --meta flag for community-specific optimizations and a range of k-mer sizes.
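
For scheduled jobs it may also be worth capping SPAdes memory explicitly with its -m option (limit in GB); the value below is illustrative, not part of the script:

# Variant with an explicit memory cap (-m, in GB); adjust to ~85% of the allocation.
shifter --image=staphb/spades:4.0.0 spades.py -t 64 -m 100 -k 25,55,95,127 --phred-offset 33 --only-assembler --meta --pe-m 1 merged_both.fq.gz --pe-12 1 qtrimmed.fq.gz -o spades_out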

6. Assembly Validation

6.1 Poly-G Contamination Check

bbduk.sh -Xmx4g in=spades_out/contigs.fasta literal=GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG hdist=2 k=25

Validates that the metagenomic assembly is free from residual poly-G contamination.
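
For a record of the check, the hit counts can also be written to a report file with BBDuk's stats option (the file name here is illustrative); the reported match count should be zero or negligible:

# Write per-sequence poly-G hit counts to a report file for later inspection.
bbduk.sh -Xmx4g in=spades_out/contigs.fasta literal=GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG hdist=2 k=25 stats=polyg_check.txt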

6.2 Assembly Statistics

stats.sh in=spades_out/contigs.fasta

Generates comprehensive assembly statistics for metagenomic contiguity evaluation.

Basic Usage

# 1. Link your raw metagenomic reads
ln -s path/to/your/metagenomic_reads.fq.gz raw.fq.gz

# 2. Adjust memory settings if running as scheduled job
# Edit MAXRAM variable to 85% of requested memory

# 3. Run the metagenomic pipeline
bash assemble_polyg_meta_v1.sh

# 4. Check results in spades_out/contigs.fasta
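
On Perlmutter, a production run is typically submitted through Slurm; a minimal batch wrapper might look like the sketch below (queue, constraint, walltime, and job name are placeholders to adjust for your allocation):

#!/bin/bash
#SBATCH -N 1
#SBATCH -C cpu
#SBATCH -q regular
#SBATCH -t 12:00:00
#SBATCH -J polyg_meta
# Adjust MAXRAM inside the script (or via the environment) before running.
bash assemble_polyg_meta_v1.sh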

Key Differences from Isolate Pipeline

Memory Management

Assembly Parameters

Processing Strategy

Metagenomic Considerations

Community Complexity

The pipeline addresses unique metagenomic challenges:

Quality Control Adaptations

Output Files

Performance and Scalability

Memory Optimization

Computational Efficiency

Troubleshooting

Best Practices