Tadpole
Kmer-based assembler with additional capabilities for error correction and sequence extension. Designed for fast, conservative assembly with extremely low misassembly rates and excellent handling of irregular coverage distributions.
Overview
Tadpole is a De Bruijn graph assembler optimized for specific use cases where speed, accuracy, and handling of irregular coverage are critical. Unlike complex scaffolding assemblers, Tadpole focuses on conservative kmer-based assembly without complicated graph analysis, making it particularly suitable for:
- Microbial single-cell genomics - handles extreme coverage variations
- Viral genome assembly - fast assembly with low misassembly rates
- Organellar genomes - efficient processing of high-coverage circular genomes
- Preliminary assemblies - for binning, quality recalibration, and insert-size estimation
- Coverage-specific assembly - can selectively assemble specific coverage bands
Tadpole supports unlimited kmer lengths and generates no temporary files or directories, making it ideal for automated pipelines.
Processing Modes
Tadpole operates in distinct processing modes, each optimized for different tasks:
Contig Mode (Default)
Single-pass processing where reads are loaded once to count kmers, then contigs are assembled and written to output. This is the fastest mode for pure assembly tasks.
tadpole.sh in=reads.fq out=contigs.fa k=93
Extension Mode
Two-pass processing: first pass counts kmers, second pass extends existing sequences bidirectionally. Can be combined with error correction.
tadpole.sh in=reads.fq out=extended.fq mode=extend k=93 el=50 er=50
Correction Mode
Two-pass processing focused on error correction using three algorithms: pincer (bidirectional), tail (end-focused), and reassemble (local rebuilding).
tadpole.sh in=reads.fq out=corrected.fq mode=correct k=50
Combined Processing
Extension and correction can be performed simultaneously in the second pass, though different optimal kmer lengths may require separate runs.
Kmer Length Selection
Kmer length is critical for optimal results and should be chosen based on the primary goal:
- Assembly: Use approximately 2/3 of read length (e.g., k=93 for 150bp reads)
- Error correction: Use approximately 1/3 of read length (e.g., k=50 for 150bp reads)
- Two-stage approach: Error correct with short kmers first, then assemble extended reads with long kmers for maximum continuity
Supported kmer lengths follow a structured pattern: 1-31 (all values), 32-62 (multiples of 2), 63-93 (multiples of 3), and continuing in this pattern to unlimited length.
Recommended Usage
Optimal Assembly
tadpole.sh in=reads.fq out=contigs.fa shave rinse pop k=62
Recommended parameters including dead-end removal (shave), bubble removal (rinse), and bubble popping for maximum contiguity while maintaining accuracy.
Two-Stage Assembly for Maximum Continuity
# Stage 1: Error correction and extension with short kmers
tadpole.sh in=reads.fq out=corrected.fq mode=correct k=31
# Stage 2: Assembly with longer kmers for better repeat resolution
tadpole.sh in=corrected.fq out=contigs.fa k=93 shave rinse pop
Optimal approach for complex genomes: correct and extend with short kmers, then assemble with long kmers.
Contig Extension with Additional Reads
tadpole.sh in=contigs.fa out=extended.fa el=100 er=100 mode=extend extra=reads.fq k=62
Extend existing contigs using additional sequencing data as kmer source.
Parameters
Input parameters
- in=<file>
- Primary input file for reads to use as kmer data.
- in2=<file>
- Second input file for paired data.
- extra=<file>
- Extra files for use as kmer data, but not for error-correction or extension.
- reads=-1
- Only process this number of reads, then quit (-1 means all).
Note: in, in2, and extra may be comma-delimited lists of files.
Output parameters
- out=<file>
- Write contigs (in contig mode) or corrected/extended reads (in other modes).
- out2=<file>
- Second output file for paired output.
- outd=<file>
- Write discarded reads, if using junk-removal flags.
- dot=<file>
- Write a contigs connectivity graph (partially implemented)
- dump=<file>
- Write kmers and their counts.
- fastadump=t
- Write kmers and counts as fasta versus 2-column tsv.
- mincounttodump=1
- Only dump kmers with at least this depth.
- showstats=t
- Print assembly statistics after writing contigs.
Prefiltering parameters
- prefilter=0
- If set to a positive integer, use a countmin sketch to ignore kmers with depth of that value or lower.
- prehashes=2
- Number of hashes for prefilter.
- prefiltersize=0.2
- (pff) Fraction of memory to use for prefilter.
- minprobprefilter=t
- (mpp) Use minprob for the prefilter.
- prepasses=1
- Use this many prefiltering passes; higher be more thorough if the filter is very full. Set to 'auto' to iteratively prefilter until the remaining kmers will fit in memory.
- onepass=f
- If true, prefilter will be generated in same pass as kmer counts. Much faster but counts will be lower, by up to prefilter's depth limit.
- filtermem=0
- Allows manually specifying prefilter memory in bytes, for deterministic runs. 0 will set it automatically.
Hashing parameters
- k=31
- Kmer length (1 to infinity). Memory use increases with K.
- prealloc=t
- Pre-allocate memory rather than dynamically growing; faster and more memory-efficient. A float fraction (0-1) may be specified; default is 1.
- minprob=0.5
- Ignore kmers with overall probability of correctness below this.
- minprobmain=t
- (mpm) Use minprob for the primary kmer counts.
- threads=X
- Spawn X worker threads; default is number of logical processors.
- buildthreads=X
- Spawn X contig-building threads. If not set, defaults to the same as threads. Setting this to 1 will make contigs deterministic.
- rcomp=t
- Store and count each kmer together and its reverse-complement.
- coremask=t
- All kmer extensions share the same hashcode.
- fillfast=t
- Speed up kmer extension lookups.
Assembly parameters
- mincountseed=3
- (mcs) Minimum kmer count to seed a new contig or begin extension.
- mincountextend=2
- (mce) Minimum kmer count continue extension of a read or contig. It is recommended that mce=1 for low-depth metagenomes.
- mincountretain=0
- (mincr) Discard kmers with count below this.
- maxcountretain=INF
- (maxcr) Discard kmers with count above this.
- branchmult1=20
- (bm1) Min ratio of 1st to 2nd-greatest path depth at high depth.
- branchmult2=3
- (bm2) Min ratio of 1st to 2nd-greatest path depth at low depth.
- branchlower=3
- (blc) Max value of 2nd-greatest path depth to be considered low.
- minextension=2
- (mine) Do not keep contigs that did not extend at least this much.
- mincontig=auto
- (minc) Do not write contigs shorter than this.
- mincoverage=1
- (mincov) Do not write contigs with average coverage below this.
- maxcoverage=inf
- (maxcov) Do not write contigs with average coverage above this.
- trimends=0
- (trim) Trim contig ends by this much. Trimming by K/2 may yield more accurate genome size estimation.
- trimcircular=t
- Trim one end of contigs ending in LOOP/LOOP by K-1, to eliminate the overlapping portion.
- contigpasses=16
- Build contigs with decreasing seed depth for this many iterations.
- contigpassmult=1.7
- Ratio between seed depth of two iterations.
- ownership=auto
- For concurrency; do not touch.
- processcontigs=f
- Explore the contig connectivity graph.
- popbubbles=t
- (pop) Pop bubbles; increases contiguity. Requires additional time and memory and forces processcontigs=t.
Processing mode parameters
- mode=contig
- contig: Make contigs from kmers.
extend: Extend sequences to be longer, and optionally perform error correction.
correct: Error correct only.
insert: Measure insert sizes.
discard: Discard low-depth reads, without error correction.
Extension parameters
- extendleft=100
- (el) Extend to the left by at most this many bases.
- extendright=100
- (er) Extend to the right by at most this many bases.
- ibb=t
- (ignorebackbranches) Do not stop at backward branches.
- extendrollback=3
- Trim a random number of bases, up to this many, on reads that extend only partially. This prevents the creation of sharp coverage discontinuities at branches.
Error-correction parameters
- ecc=f
- Error correct via kmer counts.
- reassemble=t
- If ecc is enabled, use the reassemble algorithm.
- pincer=f
- If ecc is enabled, use the pincer algorithm.
- tail=f
- If ecc is enabled, use the tail algorithm.
- eccfull=f
- If ecc is enabled, use tail over the entire read.
- aggressive=f
- (aecc) Use aggressive error correction settings. Overrides some other flags like errormult1 and deadzone.
- conservative=f
- (cecc) Use conservative error correction settings. Overrides some other flags like errormult1 and deadzone.
- rollback=t
- Undo changes to reads that have lower coverage for any kmer after correction.
- markbadbases=0
- (mbb) Any base fully covered by kmers with count below this will have its quality reduced.
- markdeltaonly=t
- (mdo) Only mark bad bases adjacent to good bases.
- meo=t
- (markerrorreadsonly) Only mark bad bases in reads containing errors.
- markquality=0
- (mq) Set quality scores for marked bases to this. A level of 0 will also convert the base to an N.
- errormult1=16
- (em1) Min ratio between kmer depths to call an error.
- errormult2=2.6
- (em2) Alternate ratio between low-depth kmers.
- errorlowerconst=3
- (elc) Use mult2 when the lower kmer is at most this deep.
- mincountcorrect=3
- (mcc) Don't correct to kmers with count under this.
- pathsimilarityfraction=0.45
- (psf) Max difference ratio considered similar. Controls whether a path appears to be continuous.
- pathsimilarityconstant=3
- (psc) Absolute differences below this are ignored.
- errorextensionreassemble=5
- (eer) Verify this many kmers before the error as having similar depth, for reassemble.
- errorextensionpincer=5
- (eep) Verify this many additional bases after the error as matching current bases, for pincer.
- errorextensiontail=9
- (eet) Verify additional bases before and after the error as matching current bases, for tail.
- deadzone=0
- (dz) Do not try to correct bases within this distance of read ends.
- window=12
- (w) Length of window to use in reassemble mode.
- windowcount=6
- (wc) If more than this many errors are found within a window, halt correction in that direction.
- qualsum=80
- (qs) If the sum of the qualities of corrected bases within a window exceeds this, halt correction in that direction.
- rbi=t
- (requirebidirectional) Require agreement from both directions when correcting errors in the middle part of the read using the reassemble algorithm.
- errorpath=1
- (ep) For debugging purposes.
Junk-removal parameters (to only remove junk, set mode=discard)
- tossjunk=f
- Remove reads that cannot be used for assembly. This means they have no kmers above depth 1 (2 for paired reads) and the outermost kmers cannot be extended. Pairs are removed only if both reads fail.
- tossdepth=-1
- Remove reads containing kmers at or below this depth. Pairs are removed if either read fails.
- lowdepthfraction=0
- (ldf) Require at least this fraction of kmers to be low-depth to discard a read; range 0-1. 0 still requires at least 1 low-depth kmer.
- requirebothbad=f
- (rbb) Only discard pairs if both reads are low-depth.
- tossuncorrectable=f
- (tu) Discard reads containing uncorrectable errors. Requires error-correction to be enabled.
Shaving parameters
- shave=f
- Remove dead ends (aka hair).
- rinse=f
- Remove bubbles.
- wash=
- Set shave and rinse at the same time.
- maxshavedepth=1
- (msd) Shave or rinse kmers at most this deep.
- exploredist=300
- (sed) Quit after exploring this far.
- discardlength=150
- (sdl) Discard shavings up to this long.
Note: Shave and rinse can produce substantially better assemblies for low-depth data, but they are very slow for large metagenomes. They are recommended for optimal results.
Overlap parameters (for overlapping paired-end reads only)
- merge=f
- Attempt to merge overlapping reads prior to kmer-counting, and again prior to correction. Output will still be unmerged pairs.
- ecco=f
- Error correct via overlap, but do not merge reads.
- testmerge=t
- Test kmer counts around the read merge junctions. If it appears that the merge created new errors, undo it.
Java Parameters
- -Xmx
- This will set Java's memory usage, overriding autodetection. -Xmx20g will specify 20 gigs of RAM, and -Xmx200m will specify 200 megs. The max is typically 85% of physical memory.
- -eoom
- This flag will cause the process to exit if an out-of-memory exception occurs. Requires Java 8u92+.
- -da
- Disable assertions.
Advanced Examples
Basic Assembly
tadpole.sh in=reads.fq out=contigs.fa k=93
Each contig consists of unique kmers, so contigs will not overlap by more than K-1 bases. Contigs end when there is a branch or dead-end in the kmer graph.
Error Correction with Multiple Algorithms
tadpole.sh in=reads.fq out=ecc.fq mode=correct k=50
Correction uses pincer algorithm for middle bases (bidirectional validation) and tail algorithm for end bases (where pincer cannot work).
Error Marking Without Correction
tadpole.sh in=reads.fq out=marked.fq mode=correct k=50 ecc=f mbb=2
Marks bases as errors (replacing with N) instead of correcting them. A base is marked if fully covered by kmers with depth below the specified value.
Multi-library Error Correction
tadpole.sh in=libA_r1.fq,libA_merged.fq in2=libA_r2.fq,null extra=libB_r1.fq out=ecc_libA_r1.fq,ecc_libA_merged.fq out2=ecc_libA_r2.fq,null mode=correct
Error corrects multiple files while using additional library data for kmer counts but not for output.
Coverage Band Assembly
tadpole.sh in=reads.fq out=contigs.fa mincoverage=1000 maxcoverage=1500 k=93
Selectively assembles only regions with coverage between 1000x and 1500x, useful for complex metagenomes or amplified samples.
Memory-Optimized Assembly
tadpole.sh in=reads.fq out=contigs.fa k=93 prefilter=2 minprob=0.6 prealloc=t
Uses prefiltering to remove low-depth kmers and quality-based kmer filtering to reduce memory usage on large datasets.
Algorithm Details
Conservative Assembly Strategy
Tadpole implements a conservative approach designed to minimize misassemblies in repetitive regions. The assembly algorithm uses adaptive branch resolution based on kmer depth ratios rather than absolute thresholds:
- High-depth branches: Requires a 20:1 ratio (branchmult1) between primary and secondary paths
- Low-depth branches: Uses a 3:1 ratio (branchmult2) when secondary path depth ≤ branchlower (default 3)
- Conservative stopping: Assembly terminates at ambiguous branches rather than making potentially incorrect extensions
- Iterative assembly: Multiple passes with decreasing seed depth thresholds recover additional sequence
Graph Cleaning Operations
Two complementary graph cleaning algorithms improve assembly continuity:
- Shave: Removes dead-end paths (hair) caused by sequencing errors. Examines terminal branches with very low depth that diverge from higher-depth paths
- Rinse: Removes low-depth bubbles that start and end at the same higher-depth path, typically caused by sequencing errors or minor variants
- Bubble popping: Advanced graph simplification that merges alternative paths, increasing contiguity at the cost of additional computation
Multi-Algorithm Error Correction
Error correction employs three specialized algorithms working in combination:
- Pincer algorithm: Corrects errors in the middle of reads using bidirectional kmer validation. Requires kmers on both sides to confirm corrections, providing high accuracy but limited to positions at least K bases from read ends
- Tail algorithm: Handles errors near read ends where pincer cannot operate. Less robust but essential for complete read correction
- Reassemble algorithm: Rebuilds local sequences around error sites using high-confidence kmers, particularly effective for clustered errors
- Rollback mechanism: Undoes corrections that result in reduced overall kmer coverage, preventing overcorrection
Flexible Kmer Length Implementation
Tadpole automatically switches between optimized implementations based on kmer length:
- K ≤ 31: Uses Tadpole1 with long integer-based hash tables for maximum speed
- K > 31: Switches to Tadpole2 with unlimited-length kmer data structures
- Supported lengths: 1-31 (all), 32-62 (even only), 63-93 (multiples of 3), continuing this pattern indefinitely
- Memory scaling: ~20 bytes per kmer for k≤31, ~30 bytes for k=32-62, increasing in steps of 31
Memory Management and Optimization
Multiple strategies reduce memory requirements for large datasets:
- Prefiltering: CountMin sketch removes low-frequency kmers before main processing, reducing memory by orders of magnitude for error-rich datasets
- Quality-based filtering: minprob parameter ignores kmers below specified correctness probability based on quality scores
- Pre-allocation: prealloc parameter eliminates hash table growth overhead and improves memory locality
- Iterative processing: Multiple passes with different depth thresholds maximize kmer utilization while controlling memory usage
Parallel Processing Architecture
Multi-threaded implementation provides near-linear scaling:
- Kmer counting: Lock-free parallel hash table updates using atomic operations
- Contig building: Independent worker threads with ownership-based kmer claiming to prevent conflicts
- Graph operations: Parallel exploration of graph components for shaving, rinsing, and bubble detection
- Deterministic mode: Setting buildthreads=1 produces identical results across runs for reproducible assembly
Memory and Performance Considerations
Tadpole's memory usage is primarily determined by the number of unique kmers in the dataset:
- Memory per kmer: ~20 bytes (k≤31), ~30 bytes (k=32-62), increasing by ~10 bytes per additional 31 bases
- Error kmer dominance: Most memory is typically consumed by error kmers rather than genomic kmers
- Memory reduction strategies: prefilter removes low-depth kmers, minprob filters by quality, BBNorm can pre-normalize highly irregular datasets
- Default behavior: Claims all available memory by default for optimal performance
- Scalability: Handles datasets from small viral genomes to complex metagenomes with appropriate parameter tuning
Dataset-Specific Recommendations
- Single-cell microbial: Use shave and rinse for graph cleaning, consider coverage band assembly
- Viral genomes: Standard parameters with moderate k (50-70% read length) for speed
- Organellar genomes: Higher k values for repeat resolution, trimcircular=t for circular molecules
- Metagenomes: Lower mincountextend (=1), consider prefiltering for memory management
- Amplified samples: Coverage band assembly to focus on target regions
Troubleshooting
- Fragmented assemblies: Try longer k values, reduce branchmult1/branchmult2 (e.g., bm1=8 bm2=2), enable shave/rinse/pop
- Memory issues: Use prefilter, increase minprob, consider BBNorm preprocessing
- Poor error correction: Adjust k value (try k = read_length/3), tune correction algorithm parameters
- Very low coverage regions: Set mincountextend=1, reduce mincountseed
- Deterministic results needed: Set buildthreads=1
Support
For questions and support:
- Email: bbushnell@lbl.gov
- Documentation: BBTools Documentation
- Guide: TadpoleGuide.txt