Mitochondrial Assembly Pipeline (PacBio)
Specialized pipeline for assembling mitochondrial genomes from error-corrected PacBio long reads. Leverages the high coverage of mitochondrial DNA and uses coverage-based filtering with multi-k assembly approaches optimized for long-read data.
Overview
This pipeline is designed specifically for assembling mitochondrial genomes from error-corrected PacBio long reads. It exploits the typically higher coverage of mitochondrial DNA compared to nuclear DNA, using sophisticated coverage analysis and multi-k assembly strategies optimized for the characteristics of long-read sequencing data.
Prerequisites
System Requirements
- BBTools suite installed
- Quast for assembly evaluation
- JGI-RQC-synbio tools (for supported systems)
Input Requirements
- Error-corrected PacBio reads file linked as "reads.fa"
- FASTA format (not FASTQ) as typical for error-corrected long reads
- Sufficient memory for k-mer counting and assembly with long reads
- Data containing mitochondrial sequences (typically whole genome sequencing)
Pipeline Stages
1. Setup and Module Loading
Loads required dependencies based on system (currently supports Genepool, with placeholders for denovo and cori systems):
- BBTools
- Quast
- JGI-RQC-synbio
2. Initial Coverage Analysis
2.1 Raw Data K-mer Counting
kmercountexact.sh in=reads.fa khist=khist_raw.txt peaks=peaks_raw.txt
Generates k-mer histogram and identifies coverage peaks in the raw error-corrected PacBio data to determine baseline coverage levels.
2.2 Primary Coverage Calculation
primary=`grep "haploid_fold_coverage" peaks_raw.txt | sed "s/^.*\t//g"`
cutoff=$(( $primary * 3 ))
Extracts the haploid fold coverage and calculates a high-pass filter cutoff at 3x primary coverage, similar to the Illumina pipeline but adapted for long-read characteristics.
3. High-Coverage Read Isolation
3.1 Coverage-Based Filtering
bbnorm.sh in=reads.fa out=highpass.fa pigz passes=1 bits=16 min=$cutoff target=9999999
Filters reads based on k-mer coverage, retaining only high-coverage reads likely to be mitochondrial. Note the FASTA format for long reads.
3.2 GC Content Filtering
reformat.sh in=highpass.fa out=highpass_gc.fa maxgc=0.4
Filters reads by GC content with a slightly lower threshold (40%) compared to Illumina pipeline, accounting for potential differences in PacBio error patterns and mitochondrial GC content.
4. Optional Machine Learning Filtration
The pipeline includes a placeholder for optional machine-learning based filtration, which could be used to further refine read selection based on learned characteristics of mitochondrial sequences.
5. Mitochondrial Peak Detection
5.1 High-Resolution K-mer Analysis
kmercountexact.sh in=highpass_gc.fa khist=khist_124.txt k=124 peaks=peaks_124.txt smooth ow smoothradius=1 maxradius=1000 progressivemult=1.06 maxpeaks=16 prefilter=2
Performs detailed k-mer counting using k=124 (larger than Illumina pipeline's k=100) to better suit long-read characteristics, with sophisticated smoothing for clear mitochondrial peak identification.
5.2 Assembly Parameters Calculation
mitopeak=`grep "main_peak" peaks_124.txt | sed "s/^.*\t//g"`
upper=$((mitopeak * 5 / 3)) # ~1.67x mitopeak
lower=$((mitopeak * 1 / 2)) # 0.5x mitopeak
mcs=$((mitopeak * 4 / 5)) # 0.8x mitopeak
mincov=$((mitopeak * 3 / 4)) # 0.75x mitopeak
Calculates assembly parameters optimized for PacBio long reads, with different ratios compared to the Illumina pipeline to account for long-read assembly characteristics.
6. Multi-K Assembly Strategy
6.1 Intermediate Assembly with TadWrapper
tadwrapper.sh in=highpass_gc.fq.gz out=contigs_intermediate_%.fa k=93,124,155 outfinal=contigs_intermediate.fa prefilter=2 mincr=$lower maxcr=$upper mcs=$mcs mincov=$mincov
Performs multi-k assembly with larger k-values (93,124,155) suitable for long reads, using coverage constraints derived from the mitochondrial peak.
6.2 Individual K-mer Assemblies
tadpole.sh in=filtered.fa out=contigs93.fa prefilter=2 mincr=$lower maxcr=$upper mcs=$mcs mincov=$mincov k=93
tadpole.sh in=filtered.fa out=contigs124.fa prefilter=2 mincr=$lower maxcr=$upper mcs=$mcs mincov=$mincov k=124
tadpole.sh in=filtered.fa out=contigs155.fa prefilter=2 mincr=$lower maxcr=$upper mcs=$mcs mincov=$mincov k=155
Generates individual assemblies at different k-values to explore assembly space and handle potential variations in repeat structure.
7. Final Assembly Refinement
7.1 Read Selection for Final Assembly
bbduk.sh in=highpass.fa ref=contigs_intermediate.fa outm=bbd005.fa k=31 mm=f mkf=0.05
Uses the intermediate assembly as a reference to select reads that match mitochondrial sequences, further purifying the read set for final assembly.
7.2 Final Mitochondrial Assembly
tadpole.sh in=bbd005.fa out=contigs_bbd.fa prefilter=2 mincr=$((mitopeak * 3 / 7)) maxcr=$((upper * 2)) mcs=$mcs mincov=$mincov k=155
Generates the final mitochondrial assembly using the largest k-value (155) and refined coverage parameters, with relaxed minimum coverage ratio for final assembly.
7.3 Final Output
ln -s contigs_bbd.fa contigs.fa
Creates a symbolic link to the final assembly for standardized output naming.
Basic Usage
# 1. Link your error-corrected PacBio reads
ln -s your_corrected_pacbio_reads.fa reads.fa
# 2. Run the mitochondrial assembly pipeline
bash assembleMito_PacBio.sh
# 3. Final assembly will be available as contigs.fa
Algorithm Details
Long-Read Adaptations
The pipeline is adapted for PacBio long reads in several key ways:
- Larger K-values: Uses k=93,124,155 instead of shorter k-mers, suitable for long reads
- FASTA Input: Designed for error-corrected reads typically in FASTA format
- Adjusted GC Threshold: Uses maxgc=0.4 instead of 0.45 for different error patterns
- Coverage Ratios: Different parameter ratios optimized for long-read assembly
Multi-K Assembly Strategy
The pipeline uses multiple approaches:
- TadWrapper: Multi-k assembly combining k=93,124,155
- Individual assemblies: Separate assemblies at each k-value for comparison
- Final assembly: High-k (155) assembly with refined parameters
Parameter Calculations for PacBio
- Upper limit: ~1.67x mitochondrial peak (more conservative than Illumina)
- Lower limit: 0.5x mitochondrial peak (higher than Illumina for long-read specificity)
- Minimum contig size: 0.8x mitochondrial peak
- Minimum coverage: 0.75x mitochondrial peak
Output Files
- contigs.fa - Final mitochondrial assembly
- khist_raw.txt - K-mer histogram of raw data
- peaks_raw.txt - Coverage peak analysis of raw data
- khist_124.txt - K-mer histogram of filtered data (k=124)
- peaks_124.txt - Mitochondrial peak analysis
- highpass.fa - High-coverage filtered reads
- highpass_gc.fa - GC and coverage filtered reads
- contigs_intermediate.fa - TadWrapper intermediate assembly
- contigs93.fa, contigs124.fa, contigs155.fa - Individual k-value assemblies
- bbd005.fa - Mitochondrial-matching reads for final assembly
- contigs_bbd.fa - Final assembly before linking
Differences from Illumina Pipeline
Aspect | Illumina | PacBio | Rationale |
---|---|---|---|
Input Format | FASTQ (.fq.gz) | FASTA (.fa) | Error-corrected long reads typically in FASTA |
K-mer Analysis | k=100 | k=124 | Larger k-mers better for long reads |
Assembly K-values | k=78,100,120 | k=93,124,155 | Higher k-values leverage long-read advantages |
GC Threshold | maxgc=0.45 | maxgc=0.4 | Adjusted for PacBio error characteristics |
Coverage Ratios | More permissive | More stringent | Long reads allow higher specificity |
Notes and Tips
- This pipeline requires error-corrected PacBio reads - raw reads will not work well
- The multi-k approach provides flexibility for different mitochondrial genome structures
- Monitor individual k-value assemblies to understand optimal parameters for your data
- GC filtering is more stringent (40%) to account for PacBio-specific characteristics
- The optional machine learning filtration step can be customized for specific organisms
- Higher k-values (up to 155) leverage the length advantage of PacBio reads
- Coverage parameters are automatically calculated but may need adjustment for unusual coverage patterns
Troubleshooting
- Poor assembly quality: Verify reads are properly error-corrected before input
- No mitochondrial peak: Check coverage depth and mitochondrial content of sample
- Fragmented assembly: Try adjusting k-values or coverage parameters
- Low yield after filtering: Consider adjusting GC threshold for your specific organism
- Memory issues: Prefilter=2 is included but may need system-specific tuning
- Assembly differences: Compare results from different k-values to choose optimal assembly