Mitochondrial Assembly Pipeline (PacBio)

Overview

This pipeline is designed specifically for assembling mitochondrial genomes from error-corrected PacBio long reads. It exploits the typically higher coverage of mitochondrial DNA compared to nuclear DNA, using sophisticated coverage analysis and multi-k assembly strategies optimized for the characteristics of long-read sequencing data.

Important: This pipeline requires error-corrected PacBio reads. Raw PacBio reads should be error-corrected using tools like Canu, Falcon, or other long-read error correction methods before using this pipeline.

Prerequisites

System Requirements

BBTools suite installed
Quast for assembly evaluation
JGI-RQC-synbio tools (for supported systems)

Input Requirements

Error-corrected PacBio reads file linked as "reads.fa"
FASTA format (not FASTQ) as typical for error-corrected long reads
Sufficient memory for k-mer counting and assembly with long reads
Data containing mitochondrial sequences (typically whole genome sequencing)

Pipeline Stages

1. Setup and Module Loading

Loads required dependencies based on system (currently supports Genepool, with placeholders for denovo and cori systems):

BBTools
Quast
JGI-RQC-synbio

2. Initial Coverage Analysis

2.1 Raw Data K-mer Counting

kmercountexact.sh in=reads.fa khist=khist_raw.txt peaks=peaks_raw.txt

Generates k-mer histogram and identifies coverage peaks in the raw error-corrected PacBio data to determine baseline coverage levels.

2.2 Primary Coverage Calculation

primary=`grep "haploid_fold_coverage" peaks_raw.txt | sed "s/^.*\t//g"`
cutoff=$(( $primary * 3 ))

Extracts the haploid fold coverage and calculates a high-pass filter cutoff at 3x primary coverage, similar to the Illumina pipeline but adapted for long-read characteristics.

3. High-Coverage Read Isolation

3.1 Coverage-Based Filtering

bbnorm.sh in=reads.fa out=highpass.fa pigz passes=1 bits=16 min=$cutoff target=9999999

Filters reads based on k-mer coverage, retaining only high-coverage reads likely to be mitochondrial. Note the FASTA format for long reads.

3.2 GC Content Filtering

reformat.sh in=highpass.fa out=highpass_gc.fa maxgc=0.4

Filters reads by GC content with a slightly lower threshold (40%) compared to Illumina pipeline, accounting for potential differences in PacBio error patterns and mitochondrial GC content.

4. Optional Machine Learning Filtration

The pipeline includes a placeholder for optional machine-learning based filtration, which could be used to further refine read selection based on learned characteristics of mitochondrial sequences.

5. Mitochondrial Peak Detection

5.1 High-Resolution K-mer Analysis

kmercountexact.sh in=highpass_gc.fa khist=khist_124.txt k=124 peaks=peaks_124.txt smooth ow smoothradius=1 maxradius=1000 progressivemult=1.06 maxpeaks=16 prefilter=2

Performs detailed k-mer counting using k=124 (larger than Illumina pipeline's k=100) to better suit long-read characteristics, with sophisticated smoothing for clear mitochondrial peak identification.

5.2 Assembly Parameters Calculation

mitopeak=`grep "main_peak" peaks_124.txt | sed "s/^.*\t//g"`

upper=$((mitopeak * 5 / 3))     # ~1.67x mitopeak
lower=$((mitopeak * 1 / 2))     # 0.5x mitopeak
mcs=$((mitopeak * 4 / 5))       # 0.8x mitopeak
mincov=$((mitopeak * 3 / 4))    # 0.75x mitopeak

Calculates assembly parameters optimized for PacBio long reads, with different ratios compared to the Illumina pipeline to account for long-read assembly characteristics.

6. Multi-K Assembly Strategy

6.1 Intermediate Assembly with TadWrapper

tadwrapper.sh in=highpass_gc.fq.gz out=contigs_intermediate_%.fa k=93,124,155 outfinal=contigs_intermediate.fa prefilter=2 mincr=$lower maxcr=$upper mcs=$mcs mincov=$mincov

Performs multi-k assembly with larger k-values (93,124,155) suitable for long reads, using coverage constraints derived from the mitochondrial peak.

6.2 Individual K-mer Assemblies

tadpole.sh in=filtered.fa out=contigs93.fa prefilter=2 mincr=$lower maxcr=$upper mcs=$mcs mincov=$mincov k=93
tadpole.sh in=filtered.fa out=contigs124.fa prefilter=2 mincr=$lower maxcr=$upper mcs=$mcs mincov=$mincov k=124
tadpole.sh in=filtered.fa out=contigs155.fa prefilter=2 mincr=$lower maxcr=$upper mcs=$mcs mincov=$mincov k=155

Generates individual assemblies at different k-values to explore assembly space and handle potential variations in repeat structure.

7. Final Assembly Refinement

7.1 Read Selection for Final Assembly

bbduk.sh in=highpass.fa ref=contigs_intermediate.fa outm=bbd005.fa k=31 mm=f mkf=0.05

Uses the intermediate assembly as a reference to select reads that match mitochondrial sequences, further purifying the read set for final assembly.

7.2 Final Mitochondrial Assembly

tadpole.sh in=bbd005.fa out=contigs_bbd.fa prefilter=2 mincr=$((mitopeak * 3 / 7)) maxcr=$((upper * 2)) mcs=$mcs mincov=$mincov k=155

Generates the final mitochondrial assembly using the largest k-value (155) and refined coverage parameters, with relaxed minimum coverage ratio for final assembly.

7.3 Final Output

ln -s contigs_bbd.fa contigs.fa

Creates a symbolic link to the final assembly for standardized output naming.

Basic Usage

# 1. Link your error-corrected PacBio reads
ln -s your_corrected_pacbio_reads.fa reads.fa

# 2. Run the mitochondrial assembly pipeline
bash assembleMito_PacBio.sh

# 3. Final assembly will be available as contigs.fa

Algorithm Details

Long-Read Adaptations

The pipeline is adapted for PacBio long reads in several key ways:

Larger K-values: Uses k=93,124,155 instead of shorter k-mers, suitable for long reads
FASTA Input: Designed for error-corrected reads typically in FASTA format
Adjusted GC Threshold: Uses maxgc=0.4 instead of 0.45 for different error patterns
Coverage Ratios: Different parameter ratios optimized for long-read assembly

Multi-K Assembly Strategy

The pipeline uses multiple approaches:

TadWrapper: Multi-k assembly combining k=93,124,155
Individual assemblies: Separate assemblies at each k-value for comparison
Final assembly: High-k (155) assembly with refined parameters

Parameter Calculations for PacBio

Upper limit: ~1.67x mitochondrial peak (more conservative than Illumina)
Lower limit: 0.5x mitochondrial peak (higher than Illumina for long-read specificity)
Minimum contig size: 0.8x mitochondrial peak
Minimum coverage: 0.75x mitochondrial peak

Output Files

contigs.fa - Final mitochondrial assembly
khist_raw.txt - K-mer histogram of raw data
peaks_raw.txt - Coverage peak analysis of raw data
khist_124.txt - K-mer histogram of filtered data (k=124)
peaks_124.txt - Mitochondrial peak analysis
highpass.fa - High-coverage filtered reads
highpass_gc.fa - GC and coverage filtered reads
contigs_intermediate.fa - TadWrapper intermediate assembly
contigs93.fa, contigs124.fa, contigs155.fa - Individual k-value assemblies
bbd005.fa - Mitochondrial-matching reads for final assembly
contigs_bbd.fa - Final assembly before linking

Differences from Illumina Pipeline

Aspect	Illumina	PacBio	Rationale
Input Format	FASTQ (.fq.gz)	FASTA (.fa)	Error-corrected long reads typically in FASTA
K-mer Analysis	k=100	k=124	Larger k-mers better for long reads
Assembly K-values	k=78,100,120	k=93,124,155	Higher k-values leverage long-read advantages
GC Threshold	maxgc=0.45	maxgc=0.4	Adjusted for PacBio error characteristics
Coverage Ratios	More permissive	More stringent	Long reads allow higher specificity

Notes and Tips

This pipeline requires error-corrected PacBio reads - raw reads will not work well
The multi-k approach provides flexibility for different mitochondrial genome structures
Monitor individual k-value assemblies to understand optimal parameters for your data
GC filtering is more stringent (40%) to account for PacBio-specific characteristics
The optional machine learning filtration step can be customized for specific organisms
Higher k-values (up to 155) leverage the length advantage of PacBio reads
Coverage parameters are automatically calculated but may need adjustment for unusual coverage patterns

Troubleshooting

Poor assembly quality: Verify reads are properly error-corrected before input
No mitochondrial peak: Check coverage depth and mitochondrial content of sample
Fragmented assembly: Try adjusting k-values or coverage parameters
Low yield after filtering: Consider adjusting GC threshold for your specific organism
Memory issues: Prefilter=2 is included but may need system-specific tuning
Assembly differences: Compare results from different k-values to choose optimal assembly