TadPipe

Script: tadpipe.sh
Package: assemble
Class: TadPipe.java

Runs TadpoleWrapper after some preprocessing, to allow optimal assemblies using long kmers. Only paired reads are supported.

Basic Usage

tadpipe.sh in=reads.fq out=contigs.fa

TadPipe runs a seven-stage pipeline: adapter trimming (BBDuk), error correction by overlap (BBMerge), clumping (Clumpify), paired-read merging (BBMerge), kmer-based error correction (Tadpole), read extension (Tadpole), and multi-k assembly (TadpoleWrapper). The preprocessing stages clean and lengthen the reads so that the final assembly can use long kmers.

Parameters

Parameters are organized into basic control parameters and phase-specific parameters using prefixes. Phase-specific parameters are passed to individual tools in the pipeline by using the appropriate prefix.

Basic Parameters

in=<file>
Input reads file (required). Paired reads may be interleaved in this one file or split across in and in2.
in2=<file>
Second input file, holding read 2 when the paired reads are stored in two files. Optional.
out=contigs.fa
Output file for the final assembly. Default: contigs.fa
temp=$TMPDIR
Directory for temporary files; the pipeline writes several intermediate files there. Default: the value of the TMPDIR environment variable.
delete=t
Delete intermediate files after completion. Set to false (f) to retain intermediate files for debugging. Default: true
gz=f
Gzip intermediate files to save disk space during processing. Default: false
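
As a rough sketch of how the basic parameters combine, the helper below assembles a command line and prints it instead of running it. The name build_cmd and the scratch directory tadpipe_work are hypothetical, chosen only for illustration, and the fallback to /tmp when TMPDIR is unset is an assumption of this sketch rather than documented TadPipe behaviour.

```shell
#!/bin/sh
# Sketch only: build_cmd is a hypothetical helper, not part of BBTools.
# It prints a tadpipe.sh command line built from the basic parameters.

# build_cmd IN1 IN2 OUT KEEP
#   IN2 may be empty (single interleaved file).
#   KEEP=1 retains intermediate files (delete=f); anything else deletes them.
build_cmd() (
  in1=$1 in2=$2 out=$3 keep=$4
  # Hypothetical scratch location; /tmp fallback is this sketch's assumption.
  tmp="${TMPDIR:-/tmp}/tadpipe_work"
  cmd="tadpipe.sh in=$in1"
  if [ -n "$in2" ]; then cmd="$cmd in2=$in2"; fi
  if [ "$keep" = 1 ]; then del=f; else del=t; fi
  printf '%s out=%s temp=%s delete=%s\n' "$cmd" "$out" "$tmp" "$del"
)

build_cmd reads_R1.fq reads_R2.fq contigs.fa 1
```

Printing the command first makes it easy to check the temp path and the delete setting before starting a long run.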

Phase-Specific Parameter Examples

Parameters can be passed to individual phases by prefixing them with the phase name. Here are examples of how to use phase-specific parameters:

assemble_k=200,250
Set kmer lengths for the assembly phase. This passes k=200,250 to the final TadpoleWrapper assembly step.
merge_strict
Set the strict flag in the merge phase. This passes strict=true to BBMerge during the paired-read merging step.
extend_el=120
Set the left-extension distance in the extension phase. This passes el=120 to Tadpole during read extension.
merge_k=75
Set the kmer length for the merge phase. This passes k=75 to BBMerge.
correct_k=50
Set the kmer length for the error-correction phase. This passes k=50 to Tadpole.
trim_k=23
Set the kmer length for the adapter-trimming phase. This passes k=23 to BBDuk.
assemble_expand
Set the expand flag in the assembly phase. This passes expand to TadpoleWrapper.
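
The prefix convention can be illustrated with a small shell function that splits an argument into its phase and the parameter that phase would receive. This is a sketch of the documented behaviour only; route_arg is a hypothetical name, and the code is not TadPipe's own argument parser.

```shell
#!/bin/sh
# Illustration only: split a phase-prefixed argument into phase and parameter.
# route_arg is a hypothetical helper, not TadPipe's parsing code.

route_arg() {
  case $1 in
    filter_*|trim_*|ecco_*|clump_*|merge_*|correct_*|extend_*|extend2_*|assemble_*)
      # Text before the first underscore names the phase;
      # everything after it is what the phase's tool would receive.
      printf '%s %s\n' "${1%%_*}" "${1#*_}" ;;
    *)
      # No known prefix: treat it as a pipeline-level parameter.
      printf 'pipeline %s\n' "$1" ;;
  esac
}

route_arg assemble_k=200,250   # prints: assemble k=200,250
route_arg extend_el=120        # prints: extend el=120
route_arg in=reads.fq          # prints: pipeline in=reads.fq
```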

Valid Phase Prefixes

Use these prefixes to pass parameters to specific phases of the pipeline:

filter_
PhiX and contaminant filtering phase. Parameters are passed to BBDuk for contamination removal (currently disabled in implementation).
trim_
Adapter trimming phase. Parameters are passed to BBDuk for adapter and quality trimming. Default settings include k=23, mink=11, hdist=1, ktrim=r, qtrim=r, trimq=10.
merge_
Paired-read merging phase. Parameters are passed to BBMerge. Default settings include k=75, extend2=120, rem, ecct, adapters=default.
correct_
Error correction phase. Parameters are passed to Tadpole for kmer-based error correction. Default settings include k=50, ecc, tossjunk, deadzone=2.
extend_
Read extension phase. Parameters are passed to Tadpole for extending reads. Default settings include k=81, mode=extend, el=100, er=100.
assemble_
Final assembly phase. Parameters are passed to TadpoleWrapper for multi-k assembly. Default settings include k=210,250,290, expand, bisect, shave, rinse, pop.
ecco_
Error correction and overlap phase. Parameters are passed to BBMerge in error-correction mode with strict settings and adapter detection.
clump_
Clumpify phase. Parameters are passed to Clumpify for error correction and read organization. Default settings include ecc, passes=8, unpair, repair.
extend2_
Second extension phase (optional). Parameters are passed to Tadpole for additional read extension with longer kmers (k=124) when extend2 mode is enabled.
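
When choosing which tool's documentation to consult for a prefixed parameter, the prefix-to-tool mapping listed above can be written as a lookup. The function name tool_for_phase is hypothetical; the mapping itself is taken from the prefix descriptions in this section.

```shell
#!/bin/sh
# Lookup of the tool behind each phase prefix, as described in this section.
# tool_for_phase is a hypothetical helper name.

tool_for_phase() {
  case $1 in
    filter|trim)            echo BBDuk ;;
    ecco|merge)             echo BBMerge ;;
    clump)                  echo Clumpify ;;
    correct|extend|extend2) echo Tadpole ;;
    assemble)               echo TadpoleWrapper ;;
    *) echo "unknown phase: $1" >&2; return 1 ;;
  esac
}

# Example: which tool receives merge_k=62?
arg=merge_k=62
printf '%s is handled by %s\n' "$arg" "$(tool_for_phase "${arg%%_*}")"
```

A non-zero return for an unknown phase makes the lookup usable as a quick typo check on prefixes before launching a run.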

Examples

Basic Assembly

tadpipe.sh in=reads.fq out=contigs.fa

Basic usage with paired-end reads in a single interleaved file. Uses default parameters for all phases.

Separate Paired Files

tadpipe.sh in=reads_R1.fq in2=reads_R2.fq out=assembly.fa

Assembly from separate paired-end read files (R1 and R2).
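
If read files follow an _R1/_R2 naming scheme, the read 2 path can be derived from the read 1 path, as in the sketch below. The naming scheme and the helper name mate_file are assumptions for illustration; TadPipe itself takes whatever paths are given to in and in2. The command is printed rather than executed.

```shell
#!/bin/sh
# mate_file R1_PATH -> prints the matching read 2 path.
# Assumes (hypothetically) that the two file names differ only in the
# last "_R1" versus "_R2" tag.
mate_file() {
  case $1 in
    *_R1*) printf '%s_R2%s\n' "${1%_R1*}" "${1##*_R1}" ;;
    *)     echo "no _R1 tag in: $1" >&2; return 1 ;;
  esac
}

r1=reads_R1.fq
r2=$(mate_file "$r1") || exit 1
# Print the command instead of running it.
printf 'tadpipe.sh in=%s in2=%s out=assembly.fa\n' "$r1" "$r2"
```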

Custom Assembly Parameters

tadpipe.sh in=reads.fq out=contigs.fa assemble_k=150,200,250,300 assemble_expand assemble_bisect

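Uses custom kmer lengths for the assembly phase and sets the expand and bisect flags explicitly. Both flags are already listed among the assemble_ defaults, so they are shown here only to illustrate the syntax.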

Custom Merge and Extension

tadpipe.sh in=reads.fq out=contigs.fa merge_k=62 merge_strict extend_el=150 extend_er=150

Runs the merge phase with a shorter kmer (62 instead of the default 75) and the strict flag, and raises the left and right extension distances from the default 100 to 150.

Retain Intermediate Files

tadpipe.sh in=reads.fq out=contigs.fa delete=f temp=/scratch/assembly_temp/

Keeping intermediate files for debugging and using a custom temporary directory.

Disk-Efficient Processing

tadpipe.sh in=reads.fq out=contigs.fa gz=t temp=/fast_storage/tmp/

Using gzip compression for intermediate files to save disk space, with temporary files on fast storage.

Algorithm Details

Pipeline Overview

TadPipe implements a multi-stage assembly pipeline with seven sequential phases that enable long kmer assembly through extensive preprocessing. The pipeline executes these phases in fixed order:

Phase 1: Adapter Trimming

Uses BBDuk to remove adapter sequences and perform quality trimming. Default parameters include k=23, mink=11, hdist=1, ktrim=r, qtrim=r, trimq=10.

Phase 2: Error Correction and Overlap (ECCO)

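Uses BBMerge in error-correction-by-overlap mode, with strict settings and adapter detection, to correct errors in the overlapping part of each read pair without merging the reads.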

Phase 3: Clumpify Processing

Uses Clumpify to organize and further correct reads. Default settings include ecc, passes=8, unpair, repair.

Phase 4: Paired-Read Merging

Uses BBMerge to merge overlapping paired-end reads. Default settings include k=75, extend2=120, rem, ecct, adapters=default.

Phase 5: Kmer-Based Error Correction

Uses Tadpole for kmer-frequency-based error correction. Default settings include k=50, ecc, tossjunk, deadzone=2.

Phase 6: Read Extension

Uses Tadpole to extend reads using kmer overlap. Default settings include k=81, mode=extend, el=100, er=100.

Phase 7: Multi-K Assembly

Uses TadpoleWrapper for final assembly with multiple kmer lengths. Default settings include k=210,250,290, expand, bisect, shave, rinse, pop.
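
The fixed phase order can be visualised with a dry-run sketch in which each phase reads the previous phase's output from the temp directory. The intermediate file names are invented for illustration, and the chain is simplified to one file per phase; the real phases may write more than one file and choose their own names. plan_pipeline is a hypothetical helper.

```shell
#!/bin/sh
# Dry-run sketch: print the seven phases in order, each consuming the
# previous phase's output. File names are hypothetical.

# plan_pipeline INPUT OUTPUT TMPDIR EXT   (EXT: fq, or fq.gz when gz=t)
plan_pipeline() (
  current=$1 output=$2 tmp=$3 ext=$4
  n=0
  for phase in trim ecco clump merge correct extend assemble; do
    n=$((n + 1))
    # Only the last phase writes the final assembly; the rest write to temp.
    if [ "$phase" = assemble ]; then
      next=$output
    else
      next="$tmp/$phase.$ext"
    fi
    printf '%d %s: %s -> %s\n' "$n" "$phase" "$current" "$next"
    current=$next
  done
)

plan_pipeline reads.fq contigs.fa /tmp/tadpipe fq
```

Laid out this way, it is clear why delete=f and gz=t matter: six of the seven phases leave an intermediate file in the temp directory.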

Performance Characteristics

The pipeline resource requirements:

Algorithm Features

Implementation Strategy

The pipeline uses a dual data structure approach:

Support

For questions and support: