Mitochondrial Assembly Pipeline (Illumina)

Script: assembleMito_Illumina.sh Authors: Brian Bushnell and William Andreopoulos Platform: Illumina reads

Specialized pipeline for assembling mitochondrial genomes from Illumina sequencing data. Uses coverage-based filtering to isolate mitochondrial reads, followed by targeted assembly optimized for the high coverage and unique characteristics of mitochondrial DNA.

Overview

This pipeline is specifically designed for assembling mitochondrial genomes from Illumina sequencing data. Mitochondrial DNA is typically present at much higher coverage than nuclear DNA, allowing for coverage-based separation and specialized assembly approaches. The pipeline uses k-mer counting to identify coverage peaks and applies targeted filtering to isolate mitochondrial sequences.

Prerequisites

System Requirements

BBTools suite installed
Quast for assembly evaluation
JGI-RQC-synbio tools (for supported systems)

Input Requirements

Illumina sequencing reads file linked as "reads.fq.gz"
Sufficient memory for k-mer counting and assembly
Data containing mitochondrial sequences (typically whole genome sequencing)

Pipeline Stages

1. Setup and Module Loading

Loads required dependencies based on system (currently supports Genepool, with placeholders for denovo and cori systems):

BBTools
Quast
JGI-RQC-synbio

2. Initial Coverage Analysis

2.1 Raw Data K-mer Counting

kmercountexact.sh in=reads.fq.gz khist=khist_raw.txt peaks=peaks_raw.txt

Generates k-mer histogram and identifies coverage peaks in the raw data to determine baseline coverage levels.

2.2 Primary Coverage Calculation

primary=`grep "haploid_fold_coverage" peaks_raw.txt | sed "s/^.*\t//g"`
cutoff=$(( $primary * 3 ))

Extracts the haploid fold coverage from the peak analysis and calculates a high-pass filter cutoff at 3x primary coverage.

3. High-Coverage Read Isolation

3.1 Coverage-Based Filtering

bbnorm.sh in=reads.fq.gz out=highpass.fq.gz pigz passes=1 bits=16 min=$cutoff target=9999999

Filters reads based on k-mer coverage, retaining only high-coverage reads likely to be mitochondrial. The high target value prevents normalization while applying the minimum coverage filter.

3.2 GC Content Filtering

reformat.sh in=highpass.fq.gz out=highpass_gc.fq.gz maxgc=0.45

Filters reads by GC content, removing reads with GC content above 45% to focus on mitochondrial sequences (which typically have lower GC content than nuclear DNA).

4. Mitochondrial Peak Detection

4.1 High-Resolution K-mer Analysis

kmercountexact.sh in=highpass_gc.fq.gz khist=khist_100.txt k=100 peaks=peaks_100.txt smooth ow smoothradius=1 maxradius=1000 progressivemult=1.06 maxpeaks=16 prefilter=2

Performs detailed k-mer counting on filtered reads using k=100 with sophisticated smoothing to identify the mitochondrial coverage peak clearly.

4.2 Assembly Parameters Calculation

mitopeak=`grep "main_peak" peaks_100.txt | sed "s/^.*\t//g"`

upper=$((mitopeak * 6 / 3))     # 2x mitopeak
lower=$((mitopeak * 3 / 7))     # ~0.43x mitopeak
mcs=$((mitopeak * 3 / 4))       # 0.75x mitopeak
mincov=$((mitopeak * 2 / 3))    # ~0.67x mitopeak

Calculates assembly parameters based on the mitochondrial peak coverage to optimize assembly for mitochondrial sequences.

5. Mitochondrial Assembly

5.1 Initial Assembly with TadWrapper

tadwrapper.sh in=highpass_gc.fq.gz out=contigs_intermediate_%.fa k=78,100,120 outfinal=contigs_intermediate.fa prefilter=2 mincr=$lower maxcr=$upper mcs=$mcs mincov=$mincov

Performs multi-k assembly optimized for mitochondrial coverage patterns using coverage constraints derived from the mitochondrial peak.

5.2 Read Selection for Final Assembly

bbduk.sh in=highpass.fq.gz ref=contigs_intermediate.fa outm=bbd005.fq.gz k=31 mm=f mkf=0.05

Uses the intermediate assembly as a reference to select reads that match mitochondrial sequences, further purifying the read set.

5.3 Final Mitochondrial Assembly

tadpole.sh in=bbd005.fq.gz out=contigs_bbd.fa prefilter=2 mincr=$((mitopeak * 3 / 8)) maxcr=$((upper * 2)) mcs=$mcs mincov=$mincov k=100 bm1=6

Generates the final mitochondrial assembly using the purified read set with optimized parameters including bubble merging (bm1=6) to handle repeat regions.

5.4 Final Output

ln -s contigs_bbd.fa contigs.fa

Creates a symbolic link to the final assembly for standardized output naming.

Basic Usage

# 1. Link your Illumina reads
ln -s your_illumina_reads.fq.gz reads.fq.gz

# 2. Run the mitochondrial assembly pipeline
bash assembleMito_Illumina.sh

# 3. Final assembly will be available as contigs.fa

Algorithm Details

Coverage-Based Mitochondrial Isolation

The pipeline exploits the fact that mitochondrial DNA is typically present at 10-1000x higher coverage than nuclear DNA in whole genome sequencing:

Primary Peak Detection: Identifies the nuclear genome coverage peak
High-Pass Filtering: Retains reads with coverage ≥3x nuclear peak
Mitochondrial Peak Analysis: Re-analyzes filtered data to identify mitochondrial peak
Optimized Assembly: Uses mitochondrial-specific coverage constraints

Parameter Calculations

Upper limit: 2x mitochondrial peak (handles coverage variation)
Lower limit: ~0.43x mitochondrial peak (excludes low-coverage artifacts)
Minimum contig size (mcs): 0.75x mitochondrial peak
Minimum coverage (mincov): ~0.67x mitochondrial peak

Output Files

contigs.fa - Final mitochondrial assembly
khist_raw.txt - K-mer histogram of raw data
peaks_raw.txt - Coverage peak analysis of raw data
khist_100.txt - K-mer histogram of filtered data (k=100)
peaks_100.txt - Mitochondrial peak analysis
highpass.fq.gz - High-coverage filtered reads
highpass_gc.fq.gz - GC and coverage filtered reads
contigs_intermediate.fa - Intermediate assembly results
bbd005.fq.gz - Mitochondrial-matching reads for final assembly

Notes and Tips

This pipeline is specifically optimized for Illumina data and mitochondrial genome characteristics
GC filtering (maxgc=0.45) is based on typical mitochondrial GC content being lower than nuclear DNA
The two-stage assembly approach (intermediate + final) helps purify mitochondrial sequences
Coverage parameters are automatically calculated based on detected peaks
For very divergent mitochondrial genomes, consider adjusting GC content thresholds
Monitor peak detection results to ensure proper mitochondrial peak identification

Troubleshooting

No clear mitochondrial peak: Check if sample contains sufficient mitochondrial DNA
Assembly fragmented: Consider adjusting coverage thresholds or k-mer sizes
Low yield: Verify GC content filtering isn't too restrictive for your organism
Memory issues: The prefilter=2 flag is already included for memory-intensive steps