Mitochondrial Assembly Pipeline (Illumina)

Script: assembleMito_Illumina.sh Authors: Brian Bushnell and William Andreopoulos Platform: Illumina reads

Specialized pipeline for assembling mitochondrial genomes from Illumina sequencing data. Uses coverage-based filtering to isolate mitochondrial reads, followed by targeted assembly optimized for the high coverage and unique characteristics of mitochondrial DNA.

Overview

This pipeline is specifically designed for assembling mitochondrial genomes from Illumina sequencing data. Mitochondrial DNA is typically present at much higher coverage than nuclear DNA, allowing for coverage-based separation and specialized assembly approaches. The pipeline uses k-mer counting to identify coverage peaks and applies targeted filtering to isolate mitochondrial sequences.

Prerequisites

System Requirements

Input Requirements

Pipeline Stages

1. Setup and Module Loading

Loads required dependencies based on system (currently supports Genepool, with placeholders for denovo and cori systems):

2. Initial Coverage Analysis

2.1 Raw Data K-mer Counting

kmercountexact.sh in=reads.fq.gz khist=khist_raw.txt peaks=peaks_raw.txt

Generates k-mer histogram and identifies coverage peaks in the raw data to determine baseline coverage levels.

2.2 Primary Coverage Calculation

primary=`grep "haploid_fold_coverage" peaks_raw.txt | sed "s/^.*\t//g"`
cutoff=$(( $primary * 3 ))

Extracts the haploid fold coverage from the peak analysis and calculates a high-pass filter cutoff at 3x primary coverage.

3. High-Coverage Read Isolation

3.1 Coverage-Based Filtering

bbnorm.sh in=reads.fq.gz out=highpass.fq.gz pigz passes=1 bits=16 min=$cutoff target=9999999

Filters reads based on k-mer coverage, retaining only high-coverage reads likely to be mitochondrial. The high target value prevents normalization while applying the minimum coverage filter.

3.2 GC Content Filtering

reformat.sh in=highpass.fq.gz out=highpass_gc.fq.gz maxgc=0.45

Filters reads by GC content, removing reads with GC content above 45% to focus on mitochondrial sequences (which typically have lower GC content than nuclear DNA).

4. Mitochondrial Peak Detection

4.1 High-Resolution K-mer Analysis

kmercountexact.sh in=highpass_gc.fq.gz khist=khist_100.txt k=100 peaks=peaks_100.txt smooth ow smoothradius=1 maxradius=1000 progressivemult=1.06 maxpeaks=16 prefilter=2

Performs detailed k-mer counting on filtered reads using k=100 with sophisticated smoothing to identify the mitochondrial coverage peak clearly.

4.2 Assembly Parameters Calculation

mitopeak=`grep "main_peak" peaks_100.txt | sed "s/^.*\t//g"`

upper=$((mitopeak * 6 / 3))     # 2x mitopeak
lower=$((mitopeak * 3 / 7))     # ~0.43x mitopeak
mcs=$((mitopeak * 3 / 4))       # 0.75x mitopeak
mincov=$((mitopeak * 2 / 3))    # ~0.67x mitopeak

Calculates assembly parameters based on the mitochondrial peak coverage to optimize assembly for mitochondrial sequences.

5. Mitochondrial Assembly

5.1 Initial Assembly with TadWrapper

tadwrapper.sh in=highpass_gc.fq.gz out=contigs_intermediate_%.fa k=78,100,120 outfinal=contigs_intermediate.fa prefilter=2 mincr=$lower maxcr=$upper mcs=$mcs mincov=$mincov

Performs multi-k assembly optimized for mitochondrial coverage patterns using coverage constraints derived from the mitochondrial peak.

5.2 Read Selection for Final Assembly

bbduk.sh in=highpass.fq.gz ref=contigs_intermediate.fa outm=bbd005.fq.gz k=31 mm=f mkf=0.05

Uses the intermediate assembly as a reference to select reads that match mitochondrial sequences, further purifying the read set.

5.3 Final Mitochondrial Assembly

tadpole.sh in=bbd005.fq.gz out=contigs_bbd.fa prefilter=2 mincr=$((mitopeak * 3 / 8)) maxcr=$((upper * 2)) mcs=$mcs mincov=$mincov k=100 bm1=6

Generates the final mitochondrial assembly using the purified read set with optimized parameters including bubble merging (bm1=6) to handle repeat regions.

5.4 Final Output

ln -s contigs_bbd.fa contigs.fa

Creates a symbolic link to the final assembly for standardized output naming.

Basic Usage

# 1. Link your Illumina reads
ln -s your_illumina_reads.fq.gz reads.fq.gz

# 2. Run the mitochondrial assembly pipeline
bash assembleMito_Illumina.sh

# 3. Final assembly will be available as contigs.fa

Algorithm Details

Coverage-Based Mitochondrial Isolation

The pipeline exploits the fact that mitochondrial DNA is typically present at 10-1000x higher coverage than nuclear DNA in whole genome sequencing:

  1. Primary Peak Detection: Identifies the nuclear genome coverage peak
  2. High-Pass Filtering: Retains reads with coverage ≥3x nuclear peak
  3. Mitochondrial Peak Analysis: Re-analyzes filtered data to identify mitochondrial peak
  4. Optimized Assembly: Uses mitochondrial-specific coverage constraints

Parameter Calculations

Output Files

Notes and Tips

Troubleshooting