Mitochondrial Assembly Pipeline (PacBio)

Script: assembleMito_PacBio.sh Authors: Brian Bushnell and William Andreopoulos Platform: Error-corrected PacBio reads

Specialized pipeline for assembling mitochondrial genomes from error-corrected PacBio long reads. Leverages the high coverage of mitochondrial DNA and uses coverage-based filtering with multi-k assembly approaches optimized for long-read data.

Overview

This pipeline is designed specifically for assembling mitochondrial genomes from error-corrected PacBio long reads. It exploits the typically higher coverage of mitochondrial DNA compared to nuclear DNA, using sophisticated coverage analysis and multi-k assembly strategies optimized for the characteristics of long-read sequencing data.

Important: This pipeline requires error-corrected PacBio reads. Raw PacBio reads should be error-corrected using tools like Canu, Falcon, or other long-read error correction methods before using this pipeline.

Prerequisites

System Requirements

Input Requirements

Pipeline Stages

1. Setup and Module Loading

Loads required dependencies based on system (currently supports Genepool, with placeholders for denovo and cori systems):

2. Initial Coverage Analysis

2.1 Raw Data K-mer Counting

kmercountexact.sh in=reads.fa khist=khist_raw.txt peaks=peaks_raw.txt

Generates k-mer histogram and identifies coverage peaks in the raw error-corrected PacBio data to determine baseline coverage levels.

2.2 Primary Coverage Calculation

primary=`grep "haploid_fold_coverage" peaks_raw.txt | sed "s/^.*\t//g"`
cutoff=$(( $primary * 3 ))

Extracts the haploid fold coverage and calculates a high-pass filter cutoff at 3x primary coverage, similar to the Illumina pipeline but adapted for long-read characteristics.

3. High-Coverage Read Isolation

3.1 Coverage-Based Filtering

bbnorm.sh in=reads.fa out=highpass.fa pigz passes=1 bits=16 min=$cutoff target=9999999

Filters reads based on k-mer coverage, retaining only high-coverage reads likely to be mitochondrial. Note the FASTA format for long reads.

3.2 GC Content Filtering

reformat.sh in=highpass.fa out=highpass_gc.fa maxgc=0.4

Filters reads by GC content with a slightly lower threshold (40%) compared to Illumina pipeline, accounting for potential differences in PacBio error patterns and mitochondrial GC content.

4. Optional Machine Learning Filtration

The pipeline includes a placeholder for optional machine-learning based filtration, which could be used to further refine read selection based on learned characteristics of mitochondrial sequences.

5. Mitochondrial Peak Detection

5.1 High-Resolution K-mer Analysis

kmercountexact.sh in=highpass_gc.fa khist=khist_124.txt k=124 peaks=peaks_124.txt smooth ow smoothradius=1 maxradius=1000 progressivemult=1.06 maxpeaks=16 prefilter=2

Performs detailed k-mer counting using k=124 (larger than Illumina pipeline's k=100) to better suit long-read characteristics, with sophisticated smoothing for clear mitochondrial peak identification.

5.2 Assembly Parameters Calculation

mitopeak=`grep "main_peak" peaks_124.txt | sed "s/^.*\t//g"`

upper=$((mitopeak * 5 / 3))     # ~1.67x mitopeak
lower=$((mitopeak * 1 / 2))     # 0.5x mitopeak
mcs=$((mitopeak * 4 / 5))       # 0.8x mitopeak
mincov=$((mitopeak * 3 / 4))    # 0.75x mitopeak

Calculates assembly parameters optimized for PacBio long reads, with different ratios compared to the Illumina pipeline to account for long-read assembly characteristics.

6. Multi-K Assembly Strategy

6.1 Intermediate Assembly with TadWrapper

tadwrapper.sh in=highpass_gc.fq.gz out=contigs_intermediate_%.fa k=93,124,155 outfinal=contigs_intermediate.fa prefilter=2 mincr=$lower maxcr=$upper mcs=$mcs mincov=$mincov

Performs multi-k assembly with larger k-values (93,124,155) suitable for long reads, using coverage constraints derived from the mitochondrial peak.

6.2 Individual K-mer Assemblies

tadpole.sh in=filtered.fa out=contigs93.fa prefilter=2 mincr=$lower maxcr=$upper mcs=$mcs mincov=$mincov k=93
tadpole.sh in=filtered.fa out=contigs124.fa prefilter=2 mincr=$lower maxcr=$upper mcs=$mcs mincov=$mincov k=124
tadpole.sh in=filtered.fa out=contigs155.fa prefilter=2 mincr=$lower maxcr=$upper mcs=$mcs mincov=$mincov k=155

Generates individual assemblies at different k-values to explore assembly space and handle potential variations in repeat structure.

7. Final Assembly Refinement

7.1 Read Selection for Final Assembly

bbduk.sh in=highpass.fa ref=contigs_intermediate.fa outm=bbd005.fa k=31 mm=f mkf=0.05

Uses the intermediate assembly as a reference to select reads that match mitochondrial sequences, further purifying the read set for final assembly.

7.2 Final Mitochondrial Assembly

tadpole.sh in=bbd005.fa out=contigs_bbd.fa prefilter=2 mincr=$((mitopeak * 3 / 7)) maxcr=$((upper * 2)) mcs=$mcs mincov=$mincov k=155

Generates the final mitochondrial assembly using the largest k-value (155) and refined coverage parameters, with relaxed minimum coverage ratio for final assembly.

7.3 Final Output

ln -s contigs_bbd.fa contigs.fa

Creates a symbolic link to the final assembly for standardized output naming.

Basic Usage

# 1. Link your error-corrected PacBio reads
ln -s your_corrected_pacbio_reads.fa reads.fa

# 2. Run the mitochondrial assembly pipeline
bash assembleMito_PacBio.sh

# 3. Final assembly will be available as contigs.fa

Algorithm Details

Long-Read Adaptations

The pipeline is adapted for PacBio long reads in several key ways:

  1. Larger K-values: Uses k=93,124,155 instead of shorter k-mers, suitable for long reads
  2. FASTA Input: Designed for error-corrected reads typically in FASTA format
  3. Adjusted GC Threshold: Uses maxgc=0.4 instead of 0.45 for different error patterns
  4. Coverage Ratios: Different parameter ratios optimized for long-read assembly

Multi-K Assembly Strategy

The pipeline uses multiple approaches:

Parameter Calculations for PacBio

Output Files

Differences from Illumina Pipeline

Aspect Illumina PacBio Rationale
Input Format FASTQ (.fq.gz) FASTA (.fa) Error-corrected long reads typically in FASTA
K-mer Analysis k=100 k=124 Larger k-mers better for long reads
Assembly K-values k=78,100,120 k=93,124,155 Higher k-values leverage long-read advantages
GC Threshold maxgc=0.45 maxgc=0.4 Adjusted for PacBio error characteristics
Coverage Ratios More permissive More stringent Long reads allow higher specificity

Notes and Tips

Troubleshooting