Assembly Pipeline

Script: assemblyPipeline.sh
Author: Brian Bushnell
Last Updated: March 4, 2019

A comprehensive assembly pipeline for preprocessing and assembling overlapping 2x150bp reads from the Illumina HiSeq 2500. The script performs data preprocessing, error correction, read merging, and assembly with multiple assemblers so the results can be compared.

Overview

This pipeline is specifically designed for Illumina HiSeq 2500 data with overlapping 2x150bp reads. It performs comprehensive preprocessing, including quality filtering, adapter trimming, error correction, and read merging, before assembly. The pipeline supports multiple assemblers (Tadpole, SPAdes, MEGAHIT) and includes evaluation steps.

Note: Some numbers and steps may need adjustment for different data types or file paths. For large genomes, tadpole.sh and bbmerge.sh may need the flag "prefilter=2" to avoid running out of memory, though this roughly doubles their runtime.

Prerequisites

System Requirements

Input Requirements

Pipeline Stages

1. Setup and Module Loading

Loads required dependencies based on system (currently supports Genepool, with placeholders for denovo and cori systems):
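The dispatch could look like the sketch below. The system names come from the text above, but the specific module names are assumptions for illustration, not taken from the actual script:

```shell
#!/bin/bash
# Sketch of per-system module loading. Module names are assumed
# placeholders; only Genepool is fully supported by the script.
load_modules() {
    case "$1" in
        genepool)
            echo "module load bbtools spades megahit quast" ;;
        denovo|cori)
            echo "TODO: module loading for $1 not yet implemented" ;;
        *)
            echo "unknown system: $1" ;;
    esac
}
load_modules genepool
```

In the real script the placeholder branches would be filled in with each cluster's actual module names.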

2. Preprocessing Phase

2.1 Optical Duplicate Removal

clumpify.sh in=temp.fq.gz out=clumped.fq.gz dedupe optical

Removes optical duplicates that can occur during sequencing, improving data quality.

2.2 Quality Filtering by Tile

filterbytile.sh in=temp.fq.gz out=filtered_by_tile.fq.gz

Removes reads from low-quality regions of the flowcell based on positional quality patterns.

2.3 Adapter Trimming

bbduk.sh in=temp.fq.gz out=trimmed.fq.gz ktrim=r k=23 mink=11 hdist=1 tbo tpe minlen=70 ref=adapters ftm=5 ordered

Trims adapters from read ends. Optional parameters can discard reads with Ns (maxns=0) or very low average quality (maq=8).
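As a sketch, the optional filters could be toggled as below; the STRICT switch is a hypothetical convenience, not part of the original script, and the command is echoed rather than executed:

```shell
#!/bin/bash
# Dry-run sketch: optionally append maxns=0 (discard reads containing
# Ns) and maq=8 (discard reads with average quality below 8).
STRICT=1
cmd="bbduk.sh in=temp.fq.gz out=trimmed.fq.gz ktrim=r k=23 mink=11 hdist=1 tbo tpe minlen=70 ref=adapters ftm=5 ordered"
if [ "$STRICT" = "1" ]; then
    cmd="$cmd maxns=0 maq=8"
fi
echo "$cmd"   # dry run; use eval "$cmd" to actually execute
```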

2.4 Artifact and Spike-in Removal

bbduk.sh in=temp.fq.gz out=filtered.fq.gz k=31 ref=artifacts,phix ordered cardinality

Removes synthetic artifacts and spike-ins by k-mer matching against known contaminant sequences.

2.5 Optional Decontamination

The pipeline includes placeholders for decontamination mapping, which JGI performs in two phases:

  1. Common microbial contaminants (E.coli, Pseudomonas, Delftia, others)
  2. Common animal contaminants (Human, cat, dog, mouse)
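A dry-run sketch of those two phases, chaining the output of the first into the second. The reference filenames and mapping flags here are illustrative assumptions; JGI's actual references and settings differ:

```shell
#!/bin/bash
# Dry-run sketch of two-phase decontamination with bbmap.sh.
# Reference filenames below are hypothetical placeholders.
in=filtered.fq.gz
for ref in microbial_contam.fa animal_contam.fa; do
    out="clean_${ref%.fa}.fq.gz"
    # outu= keeps only reads that do NOT map to the contaminant reference
    echo "bbmap.sh in=$in outu=$out ref=$ref nodisk minid=95 qtrim=rl trimq=10 untrim"
    in="$out"   # the cleaned output of phase 1 feeds phase 2
done
```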

3. Error Correction Phase

3.1 Error Correction Phase 1

bbmerge.sh in=temp.fq.gz out=ecco.fq.gz ecco mix vstrict ordered ihist=ihist_merge1.txt

Performs overlap-based error correction with very strict (vstrict) settings; the ecco flag corrects errors in overlapping regions without merging the pairs.

3.2 Error Correction Phase 2

clumpify.sh in=temp.fq.gz out=eccc.fq.gz ecc passes=4 reorder

Performs clumping-based error correction with 4 passes and read reordering.

3.3 Error Correction Phase 3

tadpole.sh in=temp.fq.gz out=ecct.fq.gz ecc k=62 ordered

K-mer-based error correction using Tadpole. The tossjunk, tossdepth, or tossuncorrectable flags can be used to discard low-depth reads. For very large datasets, add prefilter=1 or prefilter=2. Alternatively, bbcms.sh can be used if Tadpole runs out of memory.

3.4 Optional Normalization

# Commented out by default
# bbnorm.sh in=temp.fq.gz out=normalized.fq.gz target=100 hist=khist.txt peaks=peaks.txt

Normalization is beneficial for uneven coverage data (metagenomes, MDA-amplified single cells, RNA-seq) but not usually recommended for isolate DNA.

4. Read Merging Phase

4.1 Overlap-based Merging

bbmerge-auto.sh in=temp.fq.gz out=merged.fq.gz outu=unmerged.fq.gz strict k=93 extend2=80 rem ordered ihist=ihist_merge.txt

Merges overlapping read pairs; with extend2 and rem, it can also merge nonoverlapping pairs when coverage is sufficient and the inter-read gap is short. For large datasets, add prefilter=1 or prefilter=2.

4.2 Quality Trimming of Unmerged Reads

bbduk.sh in=unmerged.fq.gz out=qtrimmed.fq.gz qtrim=r trimq=10 minlen=70 ordered

Quality-trims the unmerged reads to remove low-quality ends.

5. Assembly Phase

The pipeline supports several assemblers: Tadpole (optionally via TadWrapper), SPAdes, and MEGAHIT. You can run one or all of them, depending on your needs:

5.1 Tadpole Assembly

tadpole.sh in=merged.fq.gz,qtrimmed.fq.gz out=tadpole_contigs.fa k=124

Fast k-mer based assembly. For large datasets, add prefilter=1 or prefilter=2.

5.2 TadWrapper Assembly

tadwrapper.sh in=merged.fq.gz,qtrimmed.fq.gz out=tadwrapper_contigs_%.fa outfinal=tadwrapper_contigs k=40,124,217 bisect

Automatically finds the best k-mer length, but takes longer than a standard Tadpole run.

5.3 SPAdes Assembly

spades.py -k 25,55,95,125 --phred-offset 33 -s merged.fq.gz --12 qtrimmed.fq.gz -o spades_out

Multi-k SPAdes assembly using the merged reads as single-end input and the unmerged pairs as interleaved paired-end input.

5.4 MEGAHIT Assembly

megahit --k-min 45 --k-max 225 --k-step 26 --min-count 2 -r merged.fq.gz --12 qtrimmed.fq.gz -o megahit_out

Memory-efficient assembler. Note that the error correction phases tend not to benefit MEGAHIT.

6. Evaluation Phase

6.1 BBTools Assembly Statistics

statswrapper.sh tadpole_contigs.fa spades_out/scaffolds.fasta megahit_out/contigs.fa format=3 out=

Generates comprehensive assembly statistics for all assemblies.

6.2 Quast Evaluation

quast.py -f -o quast -R ref.fa tadpole_contigs.fa spades_out/scaffolds.fasta megahit_out/contigs.fa

Detailed assembly evaluation. Remove "-R ref.fa" if no reference genome is available.

6.3 Taxonomic Analysis

# Overall taxonomic composition
sendsketch.sh in=tadpole_contigs.fa

# Per-contig taxonomy (if more sensitivity needed, use BLAST)
sendsketch.sh in=tadpole_contigs.fa persequence minhits=1 records=4

Determines taxonomic makeup of the assembly.

6.4 Coverage Analysis

bbmap.sh in=filtered.fq.gz ref=tadpole_contigs.fa nodisk covhist=covhist.txt covstats=covstats.txt outm=assembled.fq.gz outu=unassembled.fq.gz maxindel=200 minid=90 qtrim=10 untrim ambig=all

Calculates coverage distribution and captures reads that didn't make it into the assembly.

Basic Usage

# 1. Link your input file
ln -s your_reads.fq.gz reads.fq.gz

# 2. Run the pipeline
bash assemblyPipeline.sh

# 3. Choose the best assembly from the evaluation results

Memory Considerations

For large genomes, several tools (notably tadpole.sh and bbmerge.sh) may require the prefilter=2 flag to avoid running out of memory.

Note that using prefilter makes these steps take approximately twice as long, so only use it when necessary.
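One way to handle this conditionally is sketched below; the 100 Mbp threshold is an assumed example value, not a documented cutoff:

```shell
#!/bin/bash
# Sketch: append prefilter=2 only when the estimated genome is large.
# The 100 Mbp threshold is an assumed example, not an official rule.
genome_mbp=3000   # estimated genome size in Mbp (example value)
prefilter=""
if [ "$genome_mbp" -ge 100 ]; then
    prefilter="prefilter=2"
fi
echo "tadpole.sh in=temp.fq.gz out=ecct.fq.gz ecc k=62 ordered $prefilter"
```

The same variable could be appended to the bbmerge-auto.sh and assembly commands, keeping the prefilter decision in one place.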

Output Files

Notes and Tips