Poly-G Artifact-Free Metagenome Assembly Pipeline
Metagenome assembly pipeline designed to eliminate poly-G artifacts while handling the challenges of complex metagenomic data. It includes specialized preprocessing, quality-score recalibration, poly-G filtering, error correction, and metagenomic assembly optimization.
Overview
This pipeline extends poly-G artifact removal to metagenomic assemblies, addressing the unique challenges of assembling complex microbial communities while eliminating sequencing artifacts. It combines standard metagenomic preprocessing with poly-G detection, quality-score recalibration, and assembly strategies optimized for diverse community structures.
Prerequisites
System Requirements
- Perlmutter HPC system access
- BBTools development version (/global/cfs/cdirs/bbtools/jgi-bbtools/)
- SPAdes assembler v4.0.0 (via Shifter container) with metagenomic mode
- Mouse/Cat/Dog/Human contamination database
- 64 CPU cores recommended
- 48GB RAM (for login nodes) or 85% of requested memory for jobs
Input Requirements
- Raw (unfiltered) metagenomic Illumina sequencing reads
- High coverage recommended for complex communities
- Reads should be linked as "raw.fq.gz"
Configuration Variables
CORES=64 # CPU cores to use
ZL=6 # Compression level
MAXRAM=48g # Maximum memory (adjust for scheduled jobs)
HIGHRAM=31g # High memory allocation
LOWRAM=4g # Low memory allocation
These variables control resource allocation and should be adjusted based on your specific job requirements and metagenomic data complexity.
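If the pipeline runs as a scheduled job, MAXRAM should track the actual allocation rather than the 48g login-node default. A minimal sketch, assuming a SLURM environment where SLURM_MEM_PER_NODE is set (in MB):
# Derive MAXRAM as ~85% of the SLURM allocation; fall back to the default otherwise
if [ -n "$SLURM_MEM_PER_NODE" ]; then
    MAXRAM="$(( SLURM_MEM_PER_NODE * 85 / 100 / 1024 ))g"
fi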
Pipeline Stages
1. Initial Preprocessing (Shared with Isolate Pipeline)
1.1 Adapter Detection and Trimming
# Detect adapter sequences
bbmerge.sh -Xmx4g in=raw.fq.gz outa=adapters.fa
# Trim adapters using the detected adapter reference
bbduk.sh -Xmx4g in=raw.fq.gz out=raw_trimmed.fq.gz tbo tpe hdist=2 k=23 mink=9 hdist2=1 ref=adapters.fa minlen=135 ktrim=r
Auto-detects adapter sequences directly from the metagenomic data, then trims them using kmer matching (k=23, hdist=2) combined with overlap-based (tbo) and pair-preserving (tpe) trimming.
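Before trimming, it is worth confirming that adapters were actually detected (a quick manual check, not part of the pipeline script):
# Count and preview the auto-detected adapter sequences
grep -c ">" adapters.fa
head -4 adapters.fa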
1.2 Optical Duplicate Removal
clumpify.sh -Xmx48g in=raw_trimmed.fq.gz out=deduped.fq.gz passes=4 dedupe optical dist=50
Removes optical duplicates while preserving genuine sequence diversity important for metagenomes.
1.3 Artifact and Contamination Removal
# Remove sequencing artifacts
bbduk.sh -Xmx4g ref=artifacts,phix literal=AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA in=deduped.fq.gz k=31 hdist=1 out=filtered.fq.gz
# Remove host contamination
bbsplit.sh -Xmx48g deterministic ordered=false k=14 usemodulo printunmappedcount kfilter=25 maxsites=1 tipsearch=0 minratio=.9 maxindel=3 minhits=2 bw=12 bwr=0.16 fast maxsites2=10 build=1 ef=0.03 bloomfilter bloomk=29 bloomhashes=1 bloomminhits=6 bloomserial path=/global/cfs/cdirs/bbtools/RQCFilterData_Local/mousecatdoghuman/ refstats=refStats.txt forcereadonly in=filtered.fq.gz out=clean.fq.gz outm=human.fq.gz
Removes sequencing artifacts and host contamination using optimized parameters for metagenomic complexity.
2. Quality Recalibration System
2.1 Quick Metagenomic Assembly
tadpole.sh -Xmx31g in=clean.fq.gz out=qecc.fq.gz k=62 merge wash ecc tossjunk tu ldf=0.6 tossdepth=1 aecc
tadpole.sh -Xmx31g in=qecc.fq.gz out=quick.fa k=93 mcs=5 mce=4 merge
Creates a quick assembly for recalibration, with parameters adjusted for metagenomic diversity (ldf=0.6 vs 0.4 for isolates; k=93 vs k=124 for the final pass) to handle community complexity.
2.2 Read Mapping for Recalibration
bbmap.sh -Xmx48g in=clean.fq.gz outm=clean.sam.gz vslow maxindel=40 ref=quick.fa
Maps reads to the metagenomic assembly for quality recalibration, using high memory allocation for complex mapping.
2.3 Quality Score Recalibration
calctruequality.sh -Xmx31g in=clean.sam.gz usetiles ref=quick.fa callvars
bbduk.sh -Xmx4g in=clean.fq.gz out=clean_recal_tile.fq.gz recalibrate usetiles
Recalibrates quality scores against observed error rates; callvars identifies genuine variants so polymorphisms are not counted as errors, and usetiles models tile-specific quality effects.
2.4 Tile-Based Quality Filtering
filterbytile.sh -Xmx31g in=clean_recal_tile.fq.gz out=fbt_recal_tile.fq.gz lowqualityonly=t
Removes reads from low-quality flowcell areas using recalibrated quality information.
3. Poly-G Artifact Removal
3.1 Primary Poly-G Filtering
polyfilter.sh -Xmx31g in=fbt_recal_tile.fq.gz out=polyfilter_fbt_recal_tile.fq.gz
Primary poly-G removal: detects and discards reads carrying spurious poly-G runs, the characteristic artifact of two-channel Illumina chemistry.
3.2 Additional Poly-G Trimming
bbduk.sh -Xmx4g in=polyfilter_fbt_recal_tile.fq.gz trimpolyg=6 trimpolyc=6 maxnonpoly=2 minlen=135 literal=GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG k=29 hdist=2 out=hdist2.fq.gz
Secondary poly-G removal step that complements the primary filtering, ensuring comprehensive artifact removal while preserving legitimate sequences.
4. Metagenomic Assembly Preprocessing
4.1 Error Correction Phase 1
bbmerge.sh -Xmx48g in=hdist2.fq.gz out=ecco.fq.gz ecco mix adapters=adapters.fa kfilter=1 k=31 prefilter=1
Overlap-based error correction (ecco); prefilter=1 reduces the memory footprint on large metagenomic datasets.
4.2 Optional Error Correction Phase 2
tadpole.sh -Xmx48g in=ecco.fq.gz out=ecct.fq.gz ecc k=62 wash prefilter=1
Optional k-mer-based error correction; the prefilter again limits memory use in complex metagenomes.
4.3 Multi-Stage Metagenomic Read Merging
# Initial merging
bbmerge.sh -Xmx48g in=ecct.fq.gz outm=merged0.fq.gz outu=unmerged0.fq.gz kfilter=1 adapters=adapters.fa prefilter=1
# Iterative merging with prefilter for memory efficiency
bbmerge.sh -Xmx31g in=unmerged0.fq.gz extra=merged0.fq.gz out=merged_rem.fq.gz outu=unmerged_rem.fq.gz rem k=124 extend2=120 prefilter=1
bbmerge.sh -Xmx31g in=unmerged_rem.fq.gz extra=merged0.fq.gz,merged_rem.fq.gz out=merged_rem2.fq.gz outu=unmerged_rem2.fq.gz rem k=145 extend2=140 prefilter=1
bbmerge.sh -Xmx31g in=unmerged_rem2.fq.gz extra=merged0.fq.gz,merged_rem.fq.gz,merged_rem2.fq.gz out=merged_rem3.fq.gz outu=unmerged_rem3.fq.gz rem k=93 extend2=100 strict prefilter=1
Multi-stage merging optimized for metagenomic complexity with prefilter flags to manage memory usage with diverse communities.
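To judge how much each stage contributes, read counts per file can be tallied afterwards (an illustrative check; reformat.sh reports input counts on stderr when run without an output file):
# Tally read counts for each merging stage
for f in merged0 merged_rem merged_rem2 merged_rem3 unmerged_rem3; do
  echo -n "$f: "
  reformat.sh in=$f.fq.gz 2>&1 | grep -m1 "Input:"
done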
4.4 Final Read Processing
# Combine merged reads
zcat merged0.fq.gz merged_rem.fq.gz merged_rem2.fq.gz merged_rem3.fq.gz | reformat.sh -Xmx4g in=stdin.fq int=f out=merged_both.fq.gz
# Quality trim unmerged reads
bbduk.sh -Xmx4g in=unmerged_rem3.fq.gz out=qtrimmed.fq.gz qtrim=r trimq=15 cardinality cardinalityout maq=14 minlen=90 ftr=149 maxns=1
Final processing of merged and unmerged reads optimized for metagenomic assembly.
5. Metagenomic Assembly with SPAdes
5.1 Metagenomic SPAdes Assembly
shifter --image=staphb/spades:4.0.0 spades.py -t 64 -k 25,55,95,127 --phred-offset 33 --only-assembler --meta --pe-m 1 merged_both.fq.gz --pe-12 1 qtrimmed.fq.gz -o spades_out
Runs SPAdes in metagenomic mode (--meta) with a broad k-mer ladder (25,55,95,127); merged reads enter as a merged library (--pe-m) and the remaining pairs as an interleaved library (--pe-12).
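A quick post-run sanity check (illustrative; spades.log is the standard log SPAdes writes into its output directory):
# Confirm SPAdes completed without logged errors and produced contigs
grep -i "error" spades_out/spades.log || echo "no errors logged"
grep -c ">" spades_out/contigs.fasta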
6. Assembly Validation
6.1 Poly-G Contamination Check
bbduk.sh -Xmx4g in=spades_out/contigs.fasta literal=GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG hdist=2 k=25
Validates that the metagenomic assembly is free from residual poly-G contamination.
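To make the check auditable, a variant of the same command (output filenames here are illustrative; outm and stats are standard bbduk.sh options) writes any matching contigs and a summary file:
bbduk.sh -Xmx4g in=spades_out/contigs.fasta outm=polyg_contigs.fasta stats=polyg_stats.txt literal=GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG hdist=2 k=25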
6.2 Assembly Statistics
stats.sh in=spades_out/contigs.fasta
Generates assembly statistics (contig counts, N50/L50, size distribution) for evaluating metagenomic contiguity.
Basic Usage
# 1. Link your raw metagenomic reads
ln -s path/to/your/metagenomic_reads.fq.gz raw.fq.gz
# 2. Adjust memory settings if running as scheduled job
# Edit MAXRAM variable to 85% of requested memory
# 3. Run the metagenomic pipeline
bash assemble_polyg_meta_v1.sh
# 4. Check results in spades_out/contigs.fasta
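On Perlmutter the pipeline would typically run as a batch job rather than on a login node. A minimal submission sketch (the account, QOS, and walltime below are placeholders; adjust to your allocation, and remember to raise MAXRAM to ~85% of the requested memory):
#!/bin/bash
#SBATCH -N 1
#SBATCH -c 64
#SBATCH -C cpu
#SBATCH -q regular
#SBATCH -t 24:00:00
#SBATCH -A your_account
bash assemble_polyg_meta_v1.sh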
Key Differences from Isolate Pipeline
Memory Management
- Prefilter flags: Added throughout the pipeline to manage memory with large metagenomic datasets
- Higher memory allocation: Uses MAXRAM (48g) more frequently for complex operations
- Memory-aware processing: Balances thoroughness with computational feasibility
Assembly Parameters
- Quick assembly k-mer: k=93 (vs k=124 for isolates) to handle diversity
- Error correction LDF: 0.6 (vs 0.4 for isolates) for community complexity
- SPAdes mode: --meta flag for metagenomic-specific optimizations
- Additional poly-G step: Includes secondary poly-G trimming for thoroughness
Processing Strategy
- Conservative error correction: Optional phase 2 to prevent over-correction
- Community-aware filtering: Preserves diversity while removing artifacts
- Memory-efficient merging: Prefilter flags throughout merging stages
Metagenomic Considerations
Community Complexity
The pipeline addresses unique metagenomic challenges:
- Diverse coverage: Accommodates highly variable coverage across species
- Strain variation: Preserves legitimate sequence diversity
- Repeat elements: Handles complex repeat structures across species
- Memory scaling: Manages computational requirements for large datasets
Quality Control Adaptations
- Relaxed filtering: Avoids over-aggressive filtering that might remove rare species
- Diversity preservation: Maintains community structure during processing
- Coverage-aware processing: Handles variable coverage patterns
Output Files
- spades_out/contigs.fasta - Final metagenomic assembly contigs
- adapters.fa - Auto-detected adapter sequences
- clean.fq.gz - Contamination-free metagenomic reads
- human.fq.gz - Removed host contamination
- refStats.txt - Contamination removal statistics
- quick.fa - Quick metagenomic assembly for recalibration
- clean.sam.gz - Mapping results for recalibration
- merged_both.fq.gz - Combined merged reads
- qtrimmed.fq.gz - Quality-trimmed unmerged reads
- spades.o - SPAdes assembly log
- hdist2.fq.gz - Reads after comprehensive poly-G removal
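Since most outputs are gzipped, a simple integrity sweep catches files truncated by interrupted runs (an illustrative check):
# Verify that all gzipped outputs are intact
for f in *.fq.gz; do
  gzip -t "$f" || echo "CORRUPT: $f"
done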
Performance and Scalability
Memory Optimization
- Prefilter strategy: Reduces memory footprint for large datasets
- Staged processing: Breaks complex operations into manageable chunks
- Dynamic allocation: Uses appropriate memory levels for each step
Computational Efficiency
- Parallel processing: 64-core utilization throughout pipeline
- I/O optimization: Compressed intermediate files with optimal compression
- Container deployment: Consistent SPAdes performance via Shifter
Troubleshooting
- Memory errors: Increase MAXRAM or add more prefilter flags to memory-intensive steps
- Low assembly contiguity: Common in metagenomes; consider coverage requirements
- Poly-G contamination: Verify both polyfilter.sh and secondary trimming completed
- SPAdes metagenomic failures: Check diversity and coverage of input data
- Long runtime: Expected for complex metagenomes; consider smaller test datasets
- High fragmentation: Normal for diverse communities; focus on total assembled bases
Best Practices
- Ensure sufficient sequencing depth for complex communities
- Monitor memory usage and adjust parameters for your system
- Consider community complexity when evaluating assembly metrics
- Validate poly-G removal effectiveness before downstream analysis
- Compare results with and without prefilter flags if memory permits
- Use appropriate coverage thresholds for rare species detection