Mitochondrial Assembly Pipeline (Illumina)
Specialized pipeline for assembling mitochondrial genomes from Illumina sequencing data. Uses coverage-based filtering to isolate mitochondrial reads, followed by targeted assembly optimized for the high coverage and unique characteristics of mitochondrial DNA.
Overview
This pipeline is specifically designed for assembling mitochondrial genomes from Illumina sequencing data. Mitochondrial DNA is typically present at much higher coverage than nuclear DNA, allowing for coverage-based separation and specialized assembly approaches. The pipeline uses k-mer counting to identify coverage peaks and applies targeted filtering to isolate mitochondrial sequences.
Prerequisites
System Requirements
- BBTools suite installed
- Quast for assembly evaluation
- JGI-RQC-synbio tools (for supported systems)
Input Requirements
- Illumina sequencing reads file linked as "reads.fq.gz"
- Sufficient memory for k-mer counting and assembly
- Data containing mitochondrial sequences (typically whole genome sequencing)
Pipeline Stages
1. Setup and Module Loading
Loads required dependencies based on system (currently supports Genepool, with placeholders for denovo and cori systems):
- BBTools
- Quast
- JGI-RQC-synbio
2. Initial Coverage Analysis
2.1 Raw Data K-mer Counting
kmercountexact.sh in=reads.fq.gz khist=khist_raw.txt peaks=peaks_raw.txt
Generates k-mer histogram and identifies coverage peaks in the raw data to determine baseline coverage levels.
2.2 Primary Coverage Calculation
primary=`grep "haploid_fold_coverage" peaks_raw.txt | sed "s/^.*\t//g"`
cutoff=$(( $primary * 3 ))
Extracts the haploid fold coverage from the peak analysis and calculates a high-pass filter cutoff at 3x primary coverage.
3. High-Coverage Read Isolation
3.1 Coverage-Based Filtering
bbnorm.sh in=reads.fq.gz out=highpass.fq.gz pigz passes=1 bits=16 min=$cutoff target=9999999
Filters reads based on k-mer coverage, retaining only high-coverage reads likely to be mitochondrial. The high target value prevents normalization while applying the minimum coverage filter.
3.2 GC Content Filtering
reformat.sh in=highpass.fq.gz out=highpass_gc.fq.gz maxgc=0.45
Filters reads by GC content, removing reads with GC content above 45% to focus on mitochondrial sequences (which typically have lower GC content than nuclear DNA).
4. Mitochondrial Peak Detection
4.1 High-Resolution K-mer Analysis
kmercountexact.sh in=highpass_gc.fq.gz khist=khist_100.txt k=100 peaks=peaks_100.txt smooth ow smoothradius=1 maxradius=1000 progressivemult=1.06 maxpeaks=16 prefilter=2
Performs detailed k-mer counting on filtered reads using k=100 with sophisticated smoothing to identify the mitochondrial coverage peak clearly.
4.2 Assembly Parameters Calculation
mitopeak=`grep "main_peak" peaks_100.txt | sed "s/^.*\t//g"`
upper=$((mitopeak * 6 / 3)) # 2x mitopeak
lower=$((mitopeak * 3 / 7)) # ~0.43x mitopeak
mcs=$((mitopeak * 3 / 4)) # 0.75x mitopeak
mincov=$((mitopeak * 2 / 3)) # ~0.67x mitopeak
Calculates assembly parameters based on the mitochondrial peak coverage to optimize assembly for mitochondrial sequences.
5. Mitochondrial Assembly
5.1 Initial Assembly with TadWrapper
tadwrapper.sh in=highpass_gc.fq.gz out=contigs_intermediate_%.fa k=78,100,120 outfinal=contigs_intermediate.fa prefilter=2 mincr=$lower maxcr=$upper mcs=$mcs mincov=$mincov
Performs multi-k assembly optimized for mitochondrial coverage patterns using coverage constraints derived from the mitochondrial peak.
5.2 Read Selection for Final Assembly
bbduk.sh in=highpass.fq.gz ref=contigs_intermediate.fa outm=bbd005.fq.gz k=31 mm=f mkf=0.05
Uses the intermediate assembly as a reference to select reads that match mitochondrial sequences, further purifying the read set.
5.3 Final Mitochondrial Assembly
tadpole.sh in=bbd005.fq.gz out=contigs_bbd.fa prefilter=2 mincr=$((mitopeak * 3 / 8)) maxcr=$((upper * 2)) mcs=$mcs mincov=$mincov k=100 bm1=6
Generates the final mitochondrial assembly using the purified read set with optimized parameters including bubble merging (bm1=6) to handle repeat regions.
5.4 Final Output
ln -s contigs_bbd.fa contigs.fa
Creates a symbolic link to the final assembly for standardized output naming.
Basic Usage
# 1. Link your Illumina reads
ln -s your_illumina_reads.fq.gz reads.fq.gz
# 2. Run the mitochondrial assembly pipeline
bash assembleMito_Illumina.sh
# 3. Final assembly will be available as contigs.fa
Algorithm Details
Coverage-Based Mitochondrial Isolation
The pipeline exploits the fact that mitochondrial DNA is typically present at 10-1000x higher coverage than nuclear DNA in whole genome sequencing:
- Primary Peak Detection: Identifies the nuclear genome coverage peak
- High-Pass Filtering: Retains reads with coverage ≥3x nuclear peak
- Mitochondrial Peak Analysis: Re-analyzes filtered data to identify mitochondrial peak
- Optimized Assembly: Uses mitochondrial-specific coverage constraints
Parameter Calculations
- Upper limit: 2x mitochondrial peak (handles coverage variation)
- Lower limit: ~0.43x mitochondrial peak (excludes low-coverage artifacts)
- Minimum contig size (mcs): 0.75x mitochondrial peak
- Minimum coverage (mincov): ~0.67x mitochondrial peak
Output Files
- contigs.fa - Final mitochondrial assembly
- khist_raw.txt - K-mer histogram of raw data
- peaks_raw.txt - Coverage peak analysis of raw data
- khist_100.txt - K-mer histogram of filtered data (k=100)
- peaks_100.txt - Mitochondrial peak analysis
- highpass.fq.gz - High-coverage filtered reads
- highpass_gc.fq.gz - GC and coverage filtered reads
- contigs_intermediate.fa - Intermediate assembly results
- bbd005.fq.gz - Mitochondrial-matching reads for final assembly
Notes and Tips
- This pipeline is specifically optimized for Illumina data and mitochondrial genome characteristics
- GC filtering (maxgc=0.45) is based on typical mitochondrial GC content being lower than nuclear DNA
- The two-stage assembly approach (intermediate + final) helps purify mitochondrial sequences
- Coverage parameters are automatically calculated based on detected peaks
- For very divergent mitochondrial genomes, consider adjusting GC content thresholds
- Monitor peak detection results to ensure proper mitochondrial peak identification
Troubleshooting
- No clear mitochondrial peak: Check if sample contains sufficient mitochondrial DNA
- Assembly fragmented: Consider adjusting coverage thresholds or k-mer sizes
- Low yield: Verify GC content filtering isn't too restrictive for your organism
- Memory issues: The prefilter=2 flag is already included for memory-intensive steps