TadPipe
Runs TadpoleWrapper after some preprocessing, to allow optimal assemblies using long kmers. Only paired reads are supported.
Basic Usage
tadpipe.sh in=reads.fq out=contigs.fa
TadPipe runs a seven-stage pipeline that preprocesses reads so that the final assembly can use long kmers. It chains BBDuk, BBMerge, Clumpify, Tadpole, and TadpoleWrapper to perform adapter trimming, error correction, read merging, read extension, and multi-k assembly.
Parameters
Parameters fall into two groups: basic parameters that control the pipeline as a whole, and phase-specific parameters. A phase-specific parameter is passed to one tool in the pipeline by prefixing it with that phase's name.
Basic Parameters
- in=<file>
- Input reads file. Required parameter specifying the primary input file containing sequencing reads.
- in2=<file>
- Optional read 2, if reads are in two files. Use this when input reads are in separate paired-end files.
- out=contigs.fa
- Output file name. Specifies the final assembly output file. Default: contigs.fa
- temp=$TMPDIR
- Path to a directory for temporary files. The pipeline creates multiple intermediate files during processing. Uses system TMPDIR by default.
- delete=t
- Delete intermediate files after completion. Set to false (f) to retain intermediate files for debugging. Default: true
- gz=f
- Gzip intermediate files to save disk space during processing. Default: false
Phase-Specific Parameter Examples
Parameters can be passed to individual phases by prefixing them with the phase name. Here are examples of how to use phase-specific parameters:
- assemble_k=200,250
- Set kmer lengths for the assembly phase. This passes k=200,250 to the final TadpoleWrapper assembly step.
- merge_strict
- Set the strict flag in the merge phase. This passes strict=true to BBMerge during the paired-read merging step.
- extend_el=120
- Set the left-extension distance in the extension phase. This passes el=120 to Tadpole during read extension.
- merge_k=75
- Set the kmer length for the merge phase. This passes k=75 to BBMerge.
- correct_k=50
- Set the kmer length for the error correction phase. This passes k=50 to Tadpole during error correction.
- trim_k=23
- Set the kmer length for the adapter trimming phase. This passes k=23 to BBDuk.
- assemble_expand
- Set the expand flag in the assembly phase. This passes expand to TadpoleWrapper.
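The prefix convention above can be pictured as splitting each argument at the first underscore of its key. The sketch below is illustrative only: split_phase_arg is a hypothetical helper, not TadPipe's actual parser, and it assumes the phase name is everything before that first underscore.

```shell
# Hypothetical helper (not part of TadPipe): split "phase_key=value" into the
# phase name and the argument that would be forwarded to that phase's tool.
split_phase_arg() {
    arg=$1
    key=${arg%%=*}              # part before "=", or the whole flag if there is none
    case $key in
        *_*) ;;                 # key contains a prefix separator
        *) echo "no phase prefix: $arg" >&2; return 1 ;;
    esac
    phase=${key%%_*}            # text before the first underscore
    forwarded=${arg#*_}         # text after the first underscore
    printf '%s %s\n' "$phase" "$forwarded"
}

split_phase_arg assemble_k=200,250   # prints: assemble k=200,250
split_phase_arg merge_strict         # prints: merge strict
split_phase_arg extend_el=120        # prints: extend el=120
```

Basic parameters such as out=my_contigs.fa are rejected by this helper because the underscore appears in the value, not in the key.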
Valid Phase Prefixes
Use these prefixes to pass parameters to specific phases of the pipeline:
- filter_
- PhiX and contaminant filtering phase. Parameters are passed to BBDuk for contamination removal (currently disabled in implementation).
- trim_
- Adapter trimming phase. Parameters are passed to BBDuk for adapter and quality trimming. Default settings include k=23, mink=11, hdist=1, ktrim=r, qtrim=r, trimq=10.
- merge_
- Paired-read merging phase. Parameters are passed to BBMerge. Default settings include k=75, extend2=120, rem, ecct, adapters=default.
- correct_
- Error correction phase. Parameters are passed to Tadpole for kmer-based error correction. Default settings include k=50, ecc, tossjunk, deadzone=2.
- extend_
- Read extension phase. Parameters are passed to Tadpole for extending reads. Default settings include k=81, mode=extend, el=100, er=100.
- assemble_
- Final assembly phase. Parameters are passed to TadpoleWrapper for multi-k assembly. Default settings include k=210,250,290, expand, bisect, shave, rinse, pop.
- ecco_
- Error correction and overlap phase. Parameters are passed to BBMerge in error-correction mode with strict settings and adapter detection.
- clump_
- Clumpify phase. Parameters are passed to Clumpify for error correction and read organization. Default settings include ecc, passes=8, unpair, repair.
- extend2_
- Second extension phase (optional). Parameters are passed to Tadpole for additional read extension with longer kmers (k=124) when extend2 mode is enabled.
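A misspelled prefix such as asemble_k would not match any phase in the list above, and this document does not say how TadPipe treats such an argument. A small pre-flight check can catch the typo before a long run starts. The function name is_phase_arg is hypothetical; the prefix list is copied from the entries above.

```shell
# Hypothetical pre-flight check: succeed if the argument starts with one of the
# phase prefixes listed in this document.
is_phase_arg() {
    case $1 in
        filter_*|trim_*|merge_*|correct_*|extend_*|assemble_*|ecco_*|clump_*|extend2_*)
            return 0 ;;
        *)
            return 1 ;;
    esac
}

for a in trim_k=23 extend2_el=60 asemble_k=200 in=reads.fq; do
    if is_phase_arg "$a"; then
        echo "phase arg: $a"
    else
        echo "not a phase arg: $a"
    fi
done
```

Arguments reported as "not a phase arg" are either basic parameters (in=, out=, temp=, and so on) or likely typos, so they are worth a second look.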
Examples
Basic Assembly
tadpipe.sh in=reads.fq out=contigs.fa
Basic usage with paired-end reads in a single interleaved file. Uses default parameters for all phases.
Separate Paired Files
tadpipe.sh in=reads_R1.fq in2=reads_R2.fq out=assembly.fa
Assembly from separate paired-end read files (R1 and R2).
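Because only paired reads are supported, it can help to confirm that the two files hold the same number of reads before starting. The sketch below assumes uncompressed FASTQ with exactly four lines per record; count_reads and check_pairs are hypothetical helper names, and the demo files are created only for illustration.

```shell
# count_reads FILE: number of records in an uncompressed 4-line-per-record FASTQ file.
count_reads() {
    lines=$(wc -l < "$1")
    echo $(( lines / 4 ))
}

# check_pairs R1 R2: succeed only if both files hold the same number of reads.
check_pairs() {
    n1=$(count_reads "$1")
    n2=$(count_reads "$2")
    if [ "$n1" -eq "$n2" ]; then
        echo "ok: $n1 read pairs"
    else
        echo "mismatch: $1 has $n1 reads, $2 has $n2" >&2
        return 1
    fi
}

# Demo with two tiny made-up files.
d=$(mktemp -d)
printf '@r1/1\nACGT\n+\nIIII\n@r2/1\nTTGA\n+\nIIII\n' > "$d/R1.fq"
printf '@r1/2\nACGT\n+\nIIII\n@r2/2\nCCAT\n+\nIIII\n' > "$d/R2.fq"
check_pairs "$d/R1.fq" "$d/R2.fq"    # prints: ok: 2 read pairs
rm -rf "$d"
```

In practice the check would gate the real command, for example: check_pairs reads_R1.fq reads_R2.fq && tadpipe.sh in=reads_R1.fq in2=reads_R2.fq out=assembly.fa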
Custom Assembly Parameters
tadpipe.sh in=reads.fq out=contigs.fa assemble_k=150,200,250,300 assemble_expand assemble_bisect
Uses custom kmer lengths for the assembly phase. The expand and bisect flags are already among the default assembly settings, so they are shown here only to illustrate flag forwarding.
Custom Merge and Extension
tadpipe.sh in=reads.fq out=contigs.fa merge_k=62 merge_strict extend_el=150 extend_er=150
Customizing the merge phase with shorter kmers and strict mode, and increasing extension distances.
Retain Intermediate Files
tadpipe.sh in=reads.fq out=contigs.fa delete=f temp=/scratch/assembly_temp/
Keeping intermediate files for debugging and using a custom temporary directory.
Disk-Efficient Processing
tadpipe.sh in=reads.fq out=contigs.fa gz=t temp=/fast_storage/tmp/
Using gzip compression for intermediate files to save disk space, with temporary files on fast storage.
Algorithm Details
Pipeline Overview
TadPipe implements a multi-stage assembly pipeline with seven sequential phases that enable long kmer assembly through extensive preprocessing. The pipeline executes these phases in fixed order:
Phase 1: Adapter Trimming
Uses BBDuk to remove adapter sequences and perform quality trimming. Default parameters include:
- k=23, mink=11, hdist=1 for adapter detection
- ktrim=r for right-side trimming
- qtrim=r, trimq=10 for quality trimming
- tbo and tpe flags to trim adapters by pair overlap and to trim both reads of a pair to the same length
- minlen=62 to filter short reads
Phase 2: Error Correction and Overlap (ECCO)
Uses BBMerge in error correction and overlap detection mode to:
- Detect and correct sequencing errors using overlap information
- Identify adapter sequences automatically
- Apply strict quality standards for error correction
- Prepare reads for optimal merging
Phase 3: Clumpify Processing
Uses Clumpify to organize and further correct reads:
- Groups similar reads together for efficient processing
- Performs additional error correction (ecc flag)
- Runs multiple passes (passes=8) for thorough correction
- Clumps both reads of each pair (unpair) and then restores pairing (repair)
Phase 4: Paired-Read Merging
Uses BBMerge to merge overlapping paired-end reads:
- k=75 for overlap detection
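- extend2=120 to extend reads by up to 120 bases when an initial merge attempt fails
- rem flag (require extension match) to reject merges whose predicted insert size changes after extension
- ecct flag to error-correct reads with Tadpole when they fail to merge, then retry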
- Separates successfully merged reads from unmerged pairs
Phase 5: Kmer-Based Error Correction
Uses Tadpole for kmer-frequency-based error correction:
- k=50 for kmer-based error detection and correction
- ecc flag enables error correction mode
- tossjunk discards reads that cannot contribute to assembly
- deadzone=2 leaves the 2 bases nearest each read end uncorrected
- Processes both merged and unmerged reads
Phase 6: Read Extension
Uses Tadpole to extend reads using kmer overlap:
- k=81 for extension overlap detection
- mode=extend specifically enables extension mode
- el=100, er=100 for left and right extension distances
- Optional second extension phase with k=124 and shorter extensions (el=60, er=60)
Phase 7: Multi-K Assembly
Uses TadpoleWrapper for final assembly with multiple kmer lengths:
- Default kmers: k=210,250,290 for comprehensive assembly
- expand flag widens the range of kmer lengths tried until an optimum is found
- bisect flag also tries kmer lengths midway between the best two until improvement stops
- shave, rinse, pop flags simplify the assembly graph by removing dead ends and bubbles
- Automatically selects best assembly from multiple k values
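The seven phases and their documented defaults can be collected into one lookup table for quick reference. The sketch below only transcribes what is listed in this document; phase_plan and phase_defaults are hypothetical helper names, and the ecco row is left without defaults because none are stated explicitly here.

```shell
# Phase plan transcribed from this document: order, prefix, tool, documented defaults.
phase_plan() {
    cat <<'EOF'
1 trim BBDuk k=23 mink=11 hdist=1 ktrim=r qtrim=r trimq=10 tbo tpe minlen=62
2 ecco BBMerge
3 clump Clumpify ecc passes=8 unpair repair
4 merge BBMerge k=75 extend2=120 rem ecct adapters=default
5 correct Tadpole k=50 ecc tossjunk deadzone=2
6 extend Tadpole k=81 mode=extend el=100 er=100
7 assemble TadpoleWrapper k=210,250,290 expand bisect shave rinse pop
EOF
}

# phase_defaults PREFIX: print the documented default arguments for one phase.
phase_defaults() {
    phase_plan | awk -v p="$1" '
        $2 == p {
            out = ""
            for (i = 4; i <= NF; i++) out = out (i > 4 ? " " : "") $i
            print out
        }'
}

phase_defaults extend                              # prints: k=81 mode=extend el=100 er=100
phase_plan | awk '{ print $1 ". " $2 " -> " $3 }'  # prints the phase order with tools
```

A table like this makes it easier to see which default a prefixed argument such as extend_el=120 is meant to replace.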
Performance Characteristics
Resource requirements of the pipeline:
- Memory Usage: Default 14GB (configurable via standard BBTools memory parameters)
- Disk Usage: Creates multiple intermediate files; use gz=t for compression
- Processing Time: Approximately 3-5x longer than simple assembly due to seven-stage preprocessing
- Output Quality: Enables assemblies with kmers up to 290bp through systematic preprocessing
Algorithm Features
- Long Kmer Support: Preprocessing enables assembly with kmers up to 290bp
- Error Correction Layers: Four distinct error correction steps (ECCO, Clumpify, merge-time correction via ecct, Tadpole ECC)
- Adapter Processing: BBDuk with k=23/mink=11 adapter detection and removal
- Multi-K Assembly: TadpoleWrapper tests k=210,250,290 with L50/L90 evaluation
- Assembly Selection: TadpoleWrapper selects best assembly based on contiguity metrics
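To make "best assembly based on contiguity metrics" concrete, the sketch below compares candidate assemblies by one common metric: the contig length at which half of all assembled bases are reached, conventionally called N50. This is a stand-in for illustration, not TadpoleWrapper's actual scoring; the function name n50 and the demo file names are assumptions.

```shell
# n50 FILE: print the contig length at which the running total of contig
# lengths, taken from longest to shortest, first reaches half of all bases.
n50() {
    awk '
        /^>/ { if (len > 0) print len; len = 0; next }   # new record: emit previous length
        { len += length($0) }                            # sequence line: accumulate
        END { if (len > 0) print len }
    ' "$1" | sort -rn | awk '
        { lens[NR] = $1; total += $1 }
        END {
            half = total / 2
            for (i = 1; i <= NR; i++) {
                run += lens[i]
                if (run >= half) { print lens[i]; exit }
            }
        }'
}

# Demo: two made-up assemblies with the same total size but different contiguity.
tmp=$(mktemp -d)
printf '>a\nACGTACGTAC\n>b\nACGT\n>c\nAC\n' > "$tmp/k210.fa"   # contig lengths 10, 4, 2
printf '>a\nACGTAC\n>b\nACGTAC\n>c\nACGT\n' > "$tmp/k250.fa"   # contig lengths 6, 6, 4

best=""
best_n50=0
for f in "$tmp"/k*.fa; do
    v=$(n50 "$f")
    if [ "$v" -gt "$best_n50" ]; then
        best=$f
        best_n50=$v
    fi
done
echo "best: $(basename "$best") (N50=$best_n50)"   # prints: best: k210.fa (N50=10)
rm -rf "$tmp"
```

A single metric is a simplification; a real selection step may weigh several statistics together.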
Implementation Strategy
The pipeline is built around the following practices:
- Temporary File Management: Creates unique temporary files for each phase
- Sequential Processing: Each phase completes before the next begins
- Automatic Cleanup: Removes intermediate files by default (configurable)
- Parameter Forwarding: Passes phase-specific parameters to appropriate tools
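The four points above can be sketched as a small driver loop. This is a minimal illustration under stated assumptions, not TadPipe's implementation: the three phase_* functions are trivial stand-ins for the real tools, and run_pipeline with its delete argument is a hypothetical name chosen to mirror the delete=t option.

```shell
# Stand-in phases: each reads one file and writes another.
phase_upper() { tr '[:lower:]' '[:upper:]' < "$1" > "$2"; }
phase_sort()  { sort < "$1" > "$2"; }
phase_dedup() { uniq < "$1" > "$2"; }

# run_pipeline INPUT OUTPUT [DELETE]: run the phases in fixed order, each
# writing its own intermediate file in a unique working directory.
run_pipeline() {
    input=$1
    output=$2
    delete=${3:-t}
    workdir=$(mktemp -d "${TMPDIR:-/tmp}/tadpipe_sketch.XXXXXX") || return 1
    current=$input
    for phase in phase_upper phase_sort phase_dedup; do
        next="$workdir/$phase.txt"
        # A phase must finish successfully before the next one starts.
        "$phase" "$current" "$next" || { echo "$phase failed" >&2; return 1; }
        current=$next
    done
    cp "$current" "$output"
    if [ "$delete" = t ]; then
        rm -rf "$workdir"
    else
        echo "intermediates kept in $workdir" >&2
    fi
}

demo=$(mktemp -d)
printf 'b\na\nb\n' > "$demo/in.txt"
run_pipeline "$demo/in.txt" "$demo/out.txt"
cat "$demo/out.txt"    # prints A, then B
rm -rf "$demo"
```

Passing f as the third argument keeps the working directory, which corresponds to running with delete=f when intermediate files are needed for debugging.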
Support
For questions and support:
- Email: bbushnell@lbl.gov
- Documentation: bbmap.org