Assembly Pipeline
A comprehensive assembly pipeline for preprocessing and assembling overlapping 2x150bp reads from the Illumina HiSeq 2500. The script covers data preprocessing, error correction, read merging, and assembly with multiple assemblers so the best result can be chosen.
Overview
This pipeline is specifically designed for Illumina HiSeq 2500 data with overlapping 2x150bp reads. It performs comprehensive preprocessing including quality filtering, adapter trimming, error correction, and read merging before assembly. The pipeline supports multiple assemblers (Tadpole, Spades, Megahit) and includes evaluation steps.
Prerequisites
System Requirements
- BBTools suite installed
- Spades assembler (v3.9.0 or later)
- Megahit assembler
- pigz for parallel compression
- Quast for assembly evaluation
Input Requirements
- Interleaved FASTQ file named "reads.fq.gz"
- Sufficient memory for assembly (varies by genome size)
- Reference genome file "ref.fa" (optional, for Quast evaluation)
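If your reads are in two separate paired files rather than one interleaved file, they can be interleaved with BBTools' reformat.sh; the input filenames below are placeholders for your own data:

```shell
# Interleave separate R1/R2 files into the single input file the pipeline expects.
# "sample_R1.fq.gz" and "sample_R2.fq.gz" are placeholder names.
reformat.sh in1=sample_R1.fq.gz in2=sample_R2.fq.gz out=reads.fq.gz
```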
Pipeline Stages
1. Setup and Module Loading
Loads required dependencies based on the system (currently supports Genepool, with placeholders for the denovo and cori systems):
- BBTools
- Spades v3.9.0
- Megahit
- pigz
- Quast
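On a module-based cluster, this stage might look like the following sketch; the exact module names and versions are site-specific assumptions, not the script's literal contents:

```shell
# Example module loading for a Genepool-style system.
# Module names and versions vary by site; adjust to match your environment.
module load bbtools
module load spades/3.9.0
module load megahit
module load pigz
module load quast
```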
2. Preprocessing Phase
2.1 Optical Duplicate Removal
clumpify.sh in=temp.fq.gz out=clumped.fq.gz dedupe optical
Removes optical duplicates that can occur during sequencing, improving data quality.
2.2 Quality Filtering by Tile
filterbytile.sh in=temp.fq.gz out=filtered_by_tile.fq.gz
Removes reads from low-quality regions of the flowcell based on positional quality patterns.
2.3 Adapter Trimming
bbduk.sh in=temp.fq.gz out=trimmed.fq.gz ktrim=r k=23 mink=11 hdist=1 tbo tpe minlen=70 ref=adapters ftm=5 ordered
Trims adapters from read ends. Optional parameters can discard reads with Ns (maxns=0) or very low average quality (maq=8).
2.4 Artifact and Spike-in Removal
bbduk.sh in=temp.fq.gz out=filtered.fq.gz k=31 ref=artifacts,phix ordered cardinality
Removes synthetic artifacts and spike-ins by k-mer matching against known contaminant sequences.
2.5 Optional Decontamination
The pipeline includes placeholders for decontamination mapping, which JGI performs in two phases:
- Common microbial contaminants (E.coli, Pseudomonas, Delftia, others)
- Common animal contaminants (Human, cat, dog, mouse)
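One way to fill in such a placeholder is a bbmap.sh pass that keeps only reads failing to map to a combined contaminant reference; "contaminants.fa" is a placeholder filename here, and JGI's actual mapping parameters differ:

```shell
# Map against a combined contaminant reference and keep only unmapped reads.
# "contaminants.fa" is a placeholder; build it from the genomes listed above.
bbmap.sh in=temp.fq.gz outu=decontaminated.fq.gz ref=contaminants.fa nodisk minid=0.95
```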
3. Error Correction Phase
3.1 Error Correction Phase 1
bbmerge.sh in=temp.fq.gz out=ecco.fq.gz ecco mix vstrict ordered ihist=ihist_merge1.txt
Uses overlap-based error correction with very strict parameters.
3.2 Error Correction Phase 2
clumpify.sh in=temp.fq.gz out=eccc.fq.gz ecc passes=4 reorder
Performs clumping-based error correction with 4 passes and read reordering.
3.3 Error Correction Phase 3
tadpole.sh in=temp.fq.gz out=ecct.fq.gz ecc k=62 ordered
K-mer based error correction using Tadpole. Can use tossjunk, tossdepth, or tossuncorrectable flags to discard low-depth reads. For very large datasets, add prefilter=1 or prefilter=2. Alternatively, bbcms.sh can be used if Tadpole runs out of memory.
3.4 Optional Normalization
# Commented out by default
# bbnorm.sh in=temp.fq.gz out=normalized.fq.gz target=100 hist=khist.txt peaks=peaks.txt
Normalization is beneficial for uneven coverage data (metagenomes, MDA-amplified single cells, RNA-seq) but not usually recommended for isolate DNA.
4. Read Merging Phase
4.1 Overlap-based Merging
bbmerge-auto.sh in=temp.fq.gz out=merged.fq.gz outu=unmerged.fq.gz strict k=93 extend2=80 rem ordered ihist=ihist_merge.txt
Handles overlapping reads and nonoverlapping reads with sufficient coverage and short inter-read gaps. For large datasets, add prefilter=1 or prefilter=2.
4.2 Quality Trimming of Unmerged Reads
bbduk.sh in=unmerged.fq.gz out=qtrimmed.fq.gz qtrim=r trimq=10 minlen=70 ordered
Quality-trims the unmerged reads to remove low-quality ends.
5. Assembly Phase
The pipeline supports several assemblers (Tadpole, its TadWrapper variant, Spades, and Megahit). You can run one or all of them depending on your needs:
5.1 Tadpole Assembly
tadpole.sh in=merged.fq.gz,qtrimmed.fq.gz out=tadpole_contigs.fa k=124
Fast k-mer based assembly. For large datasets, add prefilter=1 or prefilter=2.
5.2 TadWrapper Assembly
tadwrapper.sh in=merged.fq.gz,qtrimmed.fq.gz out=tadwrapper_contigs_%.fa outfinal=tadwrapper_contigs k=40,124,217 bisect
Automatically finds the best K value but takes longer than standard Tadpole.
5.3 Spades Assembly
spades.py -k 25,55,95,125 --phred-offset 33 -s merged.fq.gz --12 qtrimmed.fq.gz -o spades_out
Multi-k Spades assembly using merged single reads and unmerged paired reads.
5.4 Megahit Assembly
megahit --k-min 45 --k-max 225 --k-step 26 --min-count 2 -r merged.fq.gz --12 qtrimmed.fq.gz -o megahit_out
Memory-efficient assembler. Note that the error correction phases tend not to be beneficial for Megahit.
6. Evaluation Phase
6.1 BBTools Assembly Statistics
statswrapper.sh tadpole_contigs.fa spades_out/scaffolds.fasta megahit_out/contigs.fa format=3 out=
Generates comprehensive assembly statistics for all assemblies; with out= left empty, the table is written to standard output.
6.2 Quast Evaluation
quast.py -f -o quast -R ref.fa tadpole_contigs.fa spades_out/scaffolds.fasta megahit_out/contigs.fa
Detailed assembly evaluation. Remove "-R ref.fa" if no reference genome is available.
6.3 Taxonomic Analysis
# Overall taxonomic composition
sendsketch.sh in=tadpole_contigs.fa
# Per-contig taxonomy (if more sensitivity needed, use BLAST)
sendsketch.sh in=tadpole_contigs.fa persequence minhits=1 records=4
Determines taxonomic makeup of the assembly.
6.4 Coverage Analysis
bbmap.sh in=filtered.fq.gz ref=tadpole_contigs.fa nodisk covhist=covhist.txt covstats=covstats.txt outm=assembled.fq.gz outu=unassembled.fq.gz maxindel=200 minid=90 qtrim=10 untrim ambig=all
Calculates coverage distribution and captures reads that didn't make it into the assembly.
Basic Usage
# 1. Link your input file
ln -s your_reads.fq.gz reads.fq.gz
# 2. Run the pipeline
bash assemblyPipeline.sh
# 3. Choose the best assembly from the evaluation results
Memory Considerations
For large genomes, several tools may require the prefilter=2 flag to avoid running out of memory:
- tadpole.sh (error correction phase)
- bbmerge.sh (during merge phase)
Note that using prefilter makes these steps take approximately twice as long, so only use it when necessary.
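For example, the error-correction command from phase 3.3 with prefiltering enabled would read:

```shell
# Same Tadpole command as phase 3.3, with a prefilter pass to reduce memory use.
tadpole.sh in=temp.fq.gz out=ecct.fq.gz ecc k=62 ordered prefilter=2
```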
Output Files
- tadpole_contigs.fa - Tadpole assembly results
- tadwrapper_contigs.fa - TadWrapper assembly results
- spades_out/scaffolds.fasta - Spades assembly results
- megahit_out/contigs.fa - Megahit assembly results
- quast/ - Directory containing Quast evaluation reports
- covhist.txt, covstats.txt - Coverage analysis results
- assembled.fq.gz, unassembled.fq.gz - Reads that did/didn't assemble
Notes and Tips
- The symbolic linking approach (rm temp.fq.gz; ln -s file.fq.gz temp.fq.gz) allows easy disabling of pipeline stages
- Parameters and thresholds may need adjustment for different data types
- Consider system-specific module loading requirements
- Multiple assemblers provide options for comparison and validation
- The pipeline is optimized for Illumina HiSeq 2500 data but can be adapted for other platforms
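The symlink-chaining pattern from the first tip can be sketched as follows; each stage reads temp.fq.gz and writes a named output, and relinking temp.fq.gz to that output feeds it into the next stage. Deleting (or commenting out) one block simply skips that stage:

```shell
# Point temp.fq.gz at the raw input to start the chain.
ln -s reads.fq.gz temp.fq.gz

# Stage: optical duplicate removal, then re-point temp.fq.gz at its output.
clumpify.sh in=temp.fq.gz out=clumped.fq.gz dedupe optical
rm temp.fq.gz; ln -s clumped.fq.gz temp.fq.gz

# Stage: tile-based filtering, then re-point temp.fq.gz again.
filterbytile.sh in=temp.fq.gz out=filtered_by_tile.fq.gz
rm temp.fq.gz; ln -s filtered_by_tile.fq.gz temp.fq.gz
```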