Tadpole

Script: tadpole.sh Package: assemble Class: Tadpole.java

Kmer-based assembler with additional capabilities for error correction and sequence extension. Designed for fast, conservative assembly with extremely low misassembly rates and excellent handling of irregular coverage distributions.

Overview

Tadpole is a De Bruijn graph assembler optimized for specific use cases where speed, accuracy, and handling of irregular coverage are critical. Unlike complex scaffolding assemblers, Tadpole focuses on conservative kmer-based assembly without complicated graph analysis, making it particularly suitable for:

Microbial single-cell genomics - handles extreme coverage variations
Viral genome assembly - fast assembly with low misassembly rates
Organellar genomes - efficient processing of high-coverage circular genomes
Preliminary assemblies - for binning, quality recalibration, and insert-size estimation
Coverage-specific assembly - can selectively assemble specific coverage bands

Tadpole supports unlimited kmer lengths and generates no temporary files or directories, making it ideal for automated pipelines.

Processing Modes

Tadpole operates in distinct processing modes, each optimized for different tasks:

Contig Mode (Default)

Single-pass processing where reads are loaded once to count kmers, then contigs are assembled and written to output. This is the fastest mode for pure assembly tasks.

tadpole.sh in=reads.fq out=contigs.fa k=93

Extension Mode

Two-pass processing: first pass counts kmers, second pass extends existing sequences bidirectionally. Can be combined with error correction.

tadpole.sh in=reads.fq out=extended.fq mode=extend k=93 el=50 er=50

Correction Mode

Two-pass processing focused on error correction using three algorithms: pincer (bidirectional), tail (end-focused), and reassemble (local rebuilding).

tadpole.sh in=reads.fq out=corrected.fq mode=correct k=50

Combined Processing

Extension and correction can be performed simultaneously in the second pass, though different optimal kmer lengths may require separate runs.

Kmer Length Selection

Kmer length is critical for optimal results and should be chosen based on the primary goal:

Assembly: Use approximately 2/3 of read length (e.g., k=93 for 150bp reads)
Error correction: Use approximately 1/3 of read length (e.g., k=50 for 150bp reads)
Two-stage approach: Error correct with short kmers first, then assemble extended reads with long kmers for maximum continuity

Supported kmer lengths follow a structured pattern: 1-31 (all values), 32-62 (multiples of 2), 63-93 (multiples of 3), and continuing in this pattern to unlimited length.

Recommended Usage

Optimal Assembly

tadpole.sh in=reads.fq out=contigs.fa shave rinse pop k=62

Recommended parameters including dead-end removal (shave), bubble removal (rinse), and bubble popping for maximum contiguity while maintaining accuracy.

Two-Stage Assembly for Maximum Continuity

# Stage 1: Error correction and extension with short kmers
tadpole.sh in=reads.fq out=corrected.fq mode=correct k=31

# Stage 2: Assembly with longer kmers for better repeat resolution
tadpole.sh in=corrected.fq out=contigs.fa k=93 shave rinse pop

Optimal approach for complex genomes: correct and extend with short kmers, then assemble with long kmers.

Contig Extension with Additional Reads

tadpole.sh in=contigs.fa out=extended.fa el=100 er=100 mode=extend extra=reads.fq k=62

Extend existing contigs using additional sequencing data as kmer source.

Parameters

Input parameters

in=<file>: Primary input file for reads to use as kmer data.
in2=<file>: Second input file for paired data.
extra=<file>: Extra files for use as kmer data, but not for error-correction or extension.
reads=-1: Only process this number of reads, then quit (-1 means all).

Note: in, in2, and extra may be comma-delimited lists of files.

Output parameters

out=<file>: Write contigs (in contig mode) or corrected/extended reads (in other modes).
out2=<file>: Second output file for paired output.
outd=<file>: Write discarded reads, if using junk-removal flags.
dot=<file>: Write a contigs connectivity graph (partially implemented)
dump=<file>: Write kmers and their counts.
fastadump=t: Write kmers and counts as fasta versus 2-column tsv.
mincounttodump=1: Only dump kmers with at least this depth.
showstats=t: Print assembly statistics after writing contigs.

Prefiltering parameters

prefilter=0: If set to a positive integer, use a countmin sketch to ignore kmers with depth of that value or lower.
prehashes=2: Number of hashes for prefilter.
prefiltersize=0.2: (pff) Fraction of memory to use for prefilter.
minprobprefilter=t: (mpp) Use minprob for the prefilter.
prepasses=1: Use this many prefiltering passes; higher be more thorough if the filter is very full. Set to 'auto' to iteratively prefilter until the remaining kmers will fit in memory.
onepass=f: If true, prefilter will be generated in same pass as kmer counts. Much faster but counts will be lower, by up to prefilter's depth limit.
filtermem=0: Allows manually specifying prefilter memory in bytes, for deterministic runs. 0 will set it automatically.

Hashing parameters

k=31: Kmer length (1 to infinity). Memory use increases with K.
prealloc=t: Pre-allocate memory rather than dynamically growing; faster and more memory-efficient. A float fraction (0-1) may be specified; default is 1.
minprob=0.5: Ignore kmers with overall probability of correctness below this.
minprobmain=t: (mpm) Use minprob for the primary kmer counts.
threads=X: Spawn X worker threads; default is number of logical processors.
buildthreads=X: Spawn X contig-building threads. If not set, defaults to the same as threads. Setting this to 1 will make contigs deterministic.
rcomp=t: Store and count each kmer together and its reverse-complement.
coremask=t: All kmer extensions share the same hashcode.
fillfast=t: Speed up kmer extension lookups.

Assembly parameters

mincountseed=3: (mcs) Minimum kmer count to seed a new contig or begin extension.
mincountextend=2: (mce) Minimum kmer count continue extension of a read or contig. It is recommended that mce=1 for low-depth metagenomes.
mincountretain=0: (mincr) Discard kmers with count below this.
maxcountretain=INF: (maxcr) Discard kmers with count above this.
branchmult1=20: (bm1) Min ratio of 1st to 2nd-greatest path depth at high depth.
branchmult2=3: (bm2) Min ratio of 1st to 2nd-greatest path depth at low depth.
branchlower=3: (blc) Max value of 2nd-greatest path depth to be considered low.
minextension=2: (mine) Do not keep contigs that did not extend at least this much.
mincontig=auto: (minc) Do not write contigs shorter than this.
mincoverage=1: (mincov) Do not write contigs with average coverage below this.
maxcoverage=inf: (maxcov) Do not write contigs with average coverage above this.
trimends=0: (trim) Trim contig ends by this much. Trimming by K/2 may yield more accurate genome size estimation.
trimcircular=t: Trim one end of contigs ending in LOOP/LOOP by K-1, to eliminate the overlapping portion.
contigpasses=16: Build contigs with decreasing seed depth for this many iterations.
contigpassmult=1.7: Ratio between seed depth of two iterations.
ownership=auto: For concurrency; do not touch.
processcontigs=f: Explore the contig connectivity graph.
popbubbles=t: (pop) Pop bubbles; increases contiguity. Requires additional time and memory and forces processcontigs=t.

Processing mode parameters

mode=contig: contig: Make contigs from kmers.
extend: Extend sequences to be longer, and optionally perform error correction.
correct: Error correct only.
insert: Measure insert sizes.
discard: Discard low-depth reads, without error correction.

Extension parameters

extendleft=100: (el) Extend to the left by at most this many bases.
extendright=100: (er) Extend to the right by at most this many bases.
ibb=t: (ignorebackbranches) Do not stop at backward branches.
extendrollback=3: Trim a random number of bases, up to this many, on reads that extend only partially. This prevents the creation of sharp coverage discontinuities at branches.

Error-correction parameters

ecc=f: Error correct via kmer counts.
reassemble=t: If ecc is enabled, use the reassemble algorithm.
pincer=f: If ecc is enabled, use the pincer algorithm.
tail=f: If ecc is enabled, use the tail algorithm.
eccfull=f: If ecc is enabled, use tail over the entire read.
aggressive=f: (aecc) Use aggressive error correction settings. Overrides some other flags like errormult1 and deadzone.
conservative=f: (cecc) Use conservative error correction settings. Overrides some other flags like errormult1 and deadzone.
rollback=t: Undo changes to reads that have lower coverage for any kmer after correction.
markbadbases=0: (mbb) Any base fully covered by kmers with count below this will have its quality reduced.
markdeltaonly=t: (mdo) Only mark bad bases adjacent to good bases.
meo=t: (markerrorreadsonly) Only mark bad bases in reads containing errors.
markquality=0: (mq) Set quality scores for marked bases to this. A level of 0 will also convert the base to an N.
errormult1=16: (em1) Min ratio between kmer depths to call an error.
errormult2=2.6: (em2) Alternate ratio between low-depth kmers.
errorlowerconst=3: (elc) Use mult2 when the lower kmer is at most this deep.
mincountcorrect=3: (mcc) Don't correct to kmers with count under this.
pathsimilarityfraction=0.45: (psf) Max difference ratio considered similar. Controls whether a path appears to be continuous.
pathsimilarityconstant=3: (psc) Absolute differences below this are ignored.
errorextensionreassemble=5: (eer) Verify this many kmers before the error as having similar depth, for reassemble.
errorextensionpincer=5: (eep) Verify this many additional bases after the error as matching current bases, for pincer.
errorextensiontail=9: (eet) Verify additional bases before and after the error as matching current bases, for tail.
deadzone=0: (dz) Do not try to correct bases within this distance of read ends.
window=12: (w) Length of window to use in reassemble mode.
windowcount=6: (wc) If more than this many errors are found within a window, halt correction in that direction.
qualsum=80: (qs) If the sum of the qualities of corrected bases within a window exceeds this, halt correction in that direction.
rbi=t: (requirebidirectional) Require agreement from both directions when correcting errors in the middle part of the read using the reassemble algorithm.
errorpath=1: (ep) For debugging purposes.

Junk-removal parameters (to only remove junk, set mode=discard)

tossjunk=f: Remove reads that cannot be used for assembly. This means they have no kmers above depth 1 (2 for paired reads) and the outermost kmers cannot be extended. Pairs are removed only if both reads fail.
tossdepth=-1: Remove reads containing kmers at or below this depth. Pairs are removed if either read fails.
lowdepthfraction=0: (ldf) Require at least this fraction of kmers to be low-depth to discard a read; range 0-1. 0 still requires at least 1 low-depth kmer.
requirebothbad=f: (rbb) Only discard pairs if both reads are low-depth.
tossuncorrectable=f: (tu) Discard reads containing uncorrectable errors. Requires error-correction to be enabled.

Shaving parameters

shave=f: Remove dead ends (aka hair).
rinse=f: Remove bubbles.
wash=: Set shave and rinse at the same time.
maxshavedepth=1: (msd) Shave or rinse kmers at most this deep.
exploredist=300: (sed) Quit after exploring this far.
discardlength=150: (sdl) Discard shavings up to this long.

Note: Shave and rinse can produce substantially better assemblies for low-depth data, but they are very slow for large metagenomes. They are recommended for optimal results.

Overlap parameters (for overlapping paired-end reads only)

merge=f: Attempt to merge overlapping reads prior to kmer-counting, and again prior to correction. Output will still be unmerged pairs.
ecco=f: Error correct via overlap, but do not merge reads.
testmerge=t: Test kmer counts around the read merge junctions. If it appears that the merge created new errors, undo it.

Java Parameters

-Xmx: This will set Java's memory usage, overriding autodetection. -Xmx20g will specify 20 gigs of RAM, and -Xmx200m will specify 200 megs. The max is typically 85% of physical memory.
-eoom: This flag will cause the process to exit if an out-of-memory exception occurs. Requires Java 8u92+.
-da: Disable assertions.

Advanced Examples

Basic Assembly

tadpole.sh in=reads.fq out=contigs.fa k=93

Each contig consists of unique kmers, so contigs will not overlap by more than K-1 bases. Contigs end when there is a branch or dead-end in the kmer graph.

Error Correction with Multiple Algorithms

tadpole.sh in=reads.fq out=ecc.fq mode=correct k=50

Correction uses pincer algorithm for middle bases (bidirectional validation) and tail algorithm for end bases (where pincer cannot work).

Error Marking Without Correction

tadpole.sh in=reads.fq out=marked.fq mode=correct k=50 ecc=f mbb=2

Marks bases as errors (replacing with N) instead of correcting them. A base is marked if fully covered by kmers with depth below the specified value.

Multi-library Error Correction

tadpole.sh in=libA_r1.fq,libA_merged.fq in2=libA_r2.fq,null extra=libB_r1.fq out=ecc_libA_r1.fq,ecc_libA_merged.fq out2=ecc_libA_r2.fq,null mode=correct

Error corrects multiple files while using additional library data for kmer counts but not for output.

Coverage Band Assembly

tadpole.sh in=reads.fq out=contigs.fa mincoverage=1000 maxcoverage=1500 k=93

Selectively assembles only regions with coverage between 1000x and 1500x, useful for complex metagenomes or amplified samples.

Memory-Optimized Assembly

tadpole.sh in=reads.fq out=contigs.fa k=93 prefilter=2 minprob=0.6 prealloc=t

Uses prefiltering to remove low-depth kmers and quality-based kmer filtering to reduce memory usage on large datasets.

Algorithm Details

Conservative Assembly Strategy

Tadpole implements a conservative approach designed to minimize misassemblies in repetitive regions. The assembly algorithm uses adaptive branch resolution based on kmer depth ratios rather than absolute thresholds:

High-depth branches: Requires a 20:1 ratio (branchmult1) between primary and secondary paths
Low-depth branches: Uses a 3:1 ratio (branchmult2) when secondary path depth ≤ branchlower (default 3)
Conservative stopping: Assembly terminates at ambiguous branches rather than making potentially incorrect extensions
Iterative assembly: Multiple passes with decreasing seed depth thresholds recover additional sequence

Graph Cleaning Operations

Two complementary graph cleaning algorithms improve assembly continuity:

Shave: Removes dead-end paths (hair) caused by sequencing errors. Examines terminal branches with very low depth that diverge from higher-depth paths
Rinse: Removes low-depth bubbles that start and end at the same higher-depth path, typically caused by sequencing errors or minor variants
Bubble popping: Advanced graph simplification that merges alternative paths, increasing contiguity at the cost of additional computation

Multi-Algorithm Error Correction

Error correction employs three specialized algorithms working in combination:

Pincer algorithm: Corrects errors in the middle of reads using bidirectional kmer validation. Requires kmers on both sides to confirm corrections, providing high accuracy but limited to positions at least K bases from read ends
Tail algorithm: Handles errors near read ends where pincer cannot operate. Less robust but essential for complete read correction
Reassemble algorithm: Rebuilds local sequences around error sites using high-confidence kmers, particularly effective for clustered errors
Rollback mechanism: Undoes corrections that result in reduced overall kmer coverage, preventing overcorrection

Flexible Kmer Length Implementation

Tadpole automatically switches between optimized implementations based on kmer length:

K ≤ 31: Uses Tadpole1 with long integer-based hash tables for maximum speed
K > 31: Switches to Tadpole2 with unlimited-length kmer data structures
Supported lengths: 1-31 (all), 32-62 (even only), 63-93 (multiples of 3), continuing this pattern indefinitely
Memory scaling: ~20 bytes per kmer for k≤31, ~30 bytes for k=32-62, increasing in steps of 31

Memory Management and Optimization

Multiple strategies reduce memory requirements for large datasets:

Prefiltering: CountMin sketch removes low-frequency kmers before main processing, reducing memory by orders of magnitude for error-rich datasets
Quality-based filtering: minprob parameter ignores kmers below specified correctness probability based on quality scores
Pre-allocation: prealloc parameter eliminates hash table growth overhead and improves memory locality
Iterative processing: Multiple passes with different depth thresholds maximize kmer utilization while controlling memory usage

Parallel Processing Architecture

Multi-threaded implementation provides near-linear scaling:

Kmer counting: Lock-free parallel hash table updates using atomic operations
Contig building: Independent worker threads with ownership-based kmer claiming to prevent conflicts
Graph operations: Parallel exploration of graph components for shaving, rinsing, and bubble detection
Deterministic mode: Setting buildthreads=1 produces identical results across runs for reproducible assembly

Memory and Performance Considerations

Tadpole's memory usage is primarily determined by the number of unique kmers in the dataset:

Memory per kmer: ~20 bytes (k≤31), ~30 bytes (k=32-62), increasing by ~10 bytes per additional 31 bases
Error kmer dominance: Most memory is typically consumed by error kmers rather than genomic kmers
Memory reduction strategies: prefilter removes low-depth kmers, minprob filters by quality, BBNorm can pre-normalize highly irregular datasets
Default behavior: Claims all available memory by default for optimal performance
Scalability: Handles datasets from small viral genomes to complex metagenomes with appropriate parameter tuning

Dataset-Specific Recommendations

Single-cell microbial: Use shave and rinse for graph cleaning, consider coverage band assembly
Viral genomes: Standard parameters with moderate k (50-70% read length) for speed
Organellar genomes: Higher k values for repeat resolution, trimcircular=t for circular molecules
Metagenomes: Lower mincountextend (=1), consider prefiltering for memory management
Amplified samples: Coverage band assembly to focus on target regions

Troubleshooting

Fragmented assemblies: Try longer k values, reduce branchmult1/branchmult2 (e.g., bm1=8 bm2=2), enable shave/rinse/pop
Memory issues: Use prefilter, increase minprob, consider BBNorm preprocessing
Poor error correction: Adjust k value (try k = read_length/3), tune correction algorithm parameters
Very low coverage regions: Set mincountextend=1, reduce mincountseed
Deterministic results needed: Set buildthreads=1

Support

For questions and support: