Tadpole

Script: tadpole.sh Package: assemble Class: Tadpole.java

Kmer-based assembler with additional capabilities for error correction and sequence extension. Designed for fast, conservative assembly with extremely low misassembly rates and excellent handling of irregular coverage distributions.

Overview

Tadpole is a de Bruijn graph assembler optimized for use cases where speed, accuracy, and handling of irregular coverage are critical. Unlike complex scaffolding assemblers, Tadpole focuses on conservative kmer-based assembly without complicated graph analysis.

Tadpole supports unlimited kmer lengths and generates no temporary files or directories, making it ideal for automated pipelines.

Processing Modes

Tadpole operates in distinct processing modes, each optimized for different tasks:

Contig Mode (Default)

Single-pass processing where reads are loaded once to count kmers, then contigs are assembled and written to output. This is the fastest mode for pure assembly tasks.

tadpole.sh in=reads.fq out=contigs.fa k=93
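
To make the counting pass concrete, here is a toy sketch of kmer counting using standard awk and sort. It only illustrates the idea: the function name count_kmers is hypothetical, it reads bare sequences (one per line) rather than FASTQ, and it neither merges reverse-complements nor reflects how Tadpole stores kmers internally.

```shell
# count_kmers K : read one sequence per line on stdin and print
# "KMER COUNT" lines, sorted by kmer. Hypothetical helper for illustration.
count_kmers() {
    awk -v k="$1" '
        {
            # Slide a window of length k across the sequence.
            for (i = 1; i + k - 1 <= length($0); i++) {
                count[substr($0, i, k)]++
            }
        }
        END {
            for (kmer in count) print kmer, count[kmer]
        }
    ' | LC_ALL=C sort
}

# Example: two short reads, k=3
printf 'ACGTA\nCGTAC\n' | count_kmers 3
```

Kmers seen in both reads (CGT and GTA here) receive a count of 2, which is the kind of depth signal that seeding and extension thresholds such as mincountseed and mincountextend operate on.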

Extension Mode

Two-pass processing: first pass counts kmers, second pass extends existing sequences bidirectionally. Can be combined with error correction.

tadpole.sh in=reads.fq out=extended.fq mode=extend k=93 el=50 er=50

Correction Mode

Two-pass processing focused on error correction using three algorithms: pincer (bidirectional), tail (end-focused), and reassemble (local rebuilding).

tadpole.sh in=reads.fq out=corrected.fq mode=correct k=50

Combined Processing

Extension and correction can be performed simultaneously in the second pass, though different optimal kmer lengths may require separate runs.

Kmer Length Selection

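Kmer length is critical for optimal results and should be chosen based on the primary goal. The examples in this document use shorter kmers (k=31 to k=50) for error correction and longer kmers (k=62 to k=93) for assembly and extension.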

Supported kmer lengths follow a structured pattern: 1-31 (all values), 32-62 (multiples of 2), 63-93 (multiples of 3), and continuing in this pattern to unlimited length.
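
The pattern above can be checked before launching a run. The following sketch is derived only from the stated rule (values in the n-th band of 31 must be multiples of n). The function names is_supported_k and largest_supported_k are hypothetical, and how Tadpole itself treats an unsupported value is not covered here.

```shell
# is_supported_k K : succeed if K fits the pattern described above.
# Band n covers 31*(n-1)+1 .. 31*n, and K must be a multiple of n.
is_supported_k() {
    isk_k=$1
    [ "$isk_k" -ge 1 ] || return 1
    isk_band=$(( (isk_k + 30) / 31 ))
    [ $(( isk_k % isk_band )) -eq 0 ]
}

# largest_supported_k K : print the largest supported value <= K.
largest_supported_k() {
    lsk_k=$1
    while [ "$lsk_k" -ge 1 ]; do
        if is_supported_k "$lsk_k"; then
            echo "$lsk_k"
            return 0
        fi
        lsk_k=$(( lsk_k - 1 ))
    done
    return 1
}

largest_supported_k 94   # prints 93
```

For example, 94 falls in the fourth band and is not a multiple of 4, so the nearest supported value below it is 93.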

Recommended Usage

Optimal Assembly

tadpole.sh in=reads.fq out=contigs.fa shave rinse pop k=62

Recommended parameters: dead-end removal (shave), bubble removal (rinse), and bubble popping (pop), for maximum contiguity while maintaining accuracy.

Two-Stage Assembly for Maximum Continuity

# Stage 1: Error correction with short kmers
tadpole.sh in=reads.fq out=corrected.fq mode=correct k=31

# Stage 2: Assembly with longer kmers for better repeat resolution
tadpole.sh in=corrected.fq out=contigs.fa k=93 shave rinse pop

Optimal approach for complex genomes: correct with short kmers, then assemble with long kmers.

Contig Extension with Additional Reads

tadpole.sh in=contigs.fa out=extended.fa el=100 er=100 mode=extend extra=reads.fq k=62

Extend existing contigs using additional sequencing data as kmer source.

Parameters

Input parameters

in=<file>
Primary input file for reads to use as kmer data.
in2=<file>
Second input file for paired data.
extra=<file>
Extra files for use as kmer data, but not for error-correction or extension.
reads=-1
Only process this number of reads, then quit (-1 means all).

Note: in, in2, and extra may be comma-delimited lists of files.

Output parameters

out=<file>
Write contigs (in contig mode) or corrected/extended reads (in other modes).
out2=<file>
Second output file for paired output.
outd=<file>
Write discarded reads, if using junk-removal flags.
dot=<file>
Write a contigs connectivity graph (partially implemented).
dump=<file>
Write kmers and their counts.
fastadump=t
Write kmers and counts as fasta rather than 2-column tsv.
mincounttodump=1
Only dump kmers with at least this depth.
showstats=t
Print assembly statistics after writing contigs.

Prefiltering parameters

prefilter=0
If set to a positive integer, use a count-min sketch to ignore kmers with depth of that value or lower.
prehashes=2
Number of hashes for prefilter.
prefiltersize=0.2
(pff) Fraction of memory to use for prefilter.
minprobprefilter=t
(mpp) Use minprob for the prefilter.
prepasses=1
Use this many prefiltering passes; higher values are more thorough if the filter is very full. Set to 'auto' to iteratively prefilter until the remaining kmers will fit in memory.
onepass=f
If true, prefilter will be generated in same pass as kmer counts. Much faster but counts will be lower, by up to prefilter's depth limit.
filtermem=0
Allows manually specifying prefilter memory in bytes, for deterministic runs. 0 will set it automatically.

Hashing parameters

k=31
Kmer length (1 to infinity). Memory use increases with K.
prealloc=t
Pre-allocate memory rather than dynamically growing; faster and more memory-efficient. A float fraction (0-1) may be specified; default is 1.
minprob=0.5
Ignore kmers with overall probability of correctness below this.
minprobmain=t
(mpm) Use minprob for the primary kmer counts.
threads=X
Spawn X worker threads; default is number of logical processors.
buildthreads=X
Spawn X contig-building threads. If not set, defaults to the same as threads. Setting this to 1 will make contigs deterministic.
rcomp=t
Store and count each kmer together with its reverse-complement.
coremask=t
All kmer extensions share the same hashcode.
fillfast=t
Speed up kmer extension lookups.
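
As an illustration of what counting a kmer together with its reverse-complement means, the sketch below maps both strands to a single key. The helper name canonical_kmer and the choice of the lexicographically smaller string as the shared key are assumptions made for this example; the document does not state which representative Tadpole uses. Only uppercase A, C, G, and T are handled.

```shell
# canonical_kmer KMER : print whichever of KMER and its reverse-complement
# sorts first, so both strands map to the same key. Hypothetical helper.
canonical_kmer() {
    awk -v kmer="$1" '
        BEGIN {
            comp["A"] = "T"; comp["C"] = "G"; comp["G"] = "C"; comp["T"] = "A"
            rc = ""
            # Walk the kmer backwards, complementing each base.
            for (i = length(kmer); i >= 1; i--) {
                rc = rc comp[substr(kmer, i, 1)]
            }
            # Appending "" forces a string comparison.
            print ((kmer "" <= rc "") ? kmer : rc)
        }
    '
}

canonical_kmer ACGTT   # prints AACGT
canonical_kmer AACGT   # prints AACGT
```

Because ACGTT and AACGT are reverse-complements of each other, both calls print the same key, so reads from either strand would contribute to one count.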

Assembly parameters

mincountseed=3
(mcs) Minimum kmer count to seed a new contig or begin extension.
mincountextend=2
(mce) Minimum kmer count to continue extension of a read or contig. It is recommended that mce=1 for low-depth metagenomes.
mincountretain=0
(mincr) Discard kmers with count below this.
maxcountretain=INF
(maxcr) Discard kmers with count above this.
branchmult1=20
(bm1) Min ratio of 1st to 2nd-greatest path depth at high depth.
branchmult2=3
(bm2) Min ratio of 1st to 2nd-greatest path depth at low depth.
branchlower=3
(blc) Max value of 2nd-greatest path depth to be considered low.
minextension=2
(mine) Do not keep contigs that did not extend at least this much.
mincontig=auto
(minc) Do not write contigs shorter than this.
mincoverage=1
(mincov) Do not write contigs with average coverage below this.
maxcoverage=inf
(maxcov) Do not write contigs with average coverage above this.
trimends=0
(trim) Trim contig ends by this much. Trimming by K/2 may yield more accurate genome size estimation.
trimcircular=t
Trim one end of contigs ending in LOOP/LOOP by K-1, to eliminate the overlapping portion.
contigpasses=16
Build contigs with decreasing seed depth for this many iterations.
contigpassmult=1.7
Ratio between seed depth of two iterations.
ownership=auto
For concurrency; do not touch.
processcontigs=f
Explore the contig connectivity graph.
popbubbles=t
(pop) Pop bubbles; increases contiguity. Requires additional time and memory and forces processcontigs=t.

Processing mode parameters

mode=contig
contig: Make contigs from kmers.
extend: Extend sequences to be longer, and optionally perform error correction.
correct: Error correct only.
insert: Measure insert sizes.
discard: Discard low-depth reads, without error correction.

Extension parameters

extendleft=100
(el) Extend to the left by at most this many bases.
extendright=100
(er) Extend to the right by at most this many bases.
ibb=t
(ignorebackbranches) Do not stop at backward branches.
extendrollback=3
Trim a random number of bases, up to this many, on reads that extend only partially. This prevents the creation of sharp coverage discontinuities at branches.

Error-correction parameters

ecc=f
Error correct via kmer counts.
reassemble=t
If ecc is enabled, use the reassemble algorithm.
pincer=f
If ecc is enabled, use the pincer algorithm.
tail=f
If ecc is enabled, use the tail algorithm.
eccfull=f
If ecc is enabled, use tail over the entire read.
aggressive=f
(aecc) Use aggressive error correction settings. Overrides some other flags like errormult1 and deadzone.
conservative=f
(cecc) Use conservative error correction settings. Overrides some other flags like errormult1 and deadzone.
rollback=t
Undo changes to reads that have lower coverage for any kmer after correction.
markbadbases=0
(mbb) Any base fully covered by kmers with count below this will have its quality reduced.
markdeltaonly=t
(mdo) Only mark bad bases adjacent to good bases.
meo=t
(markerrorreadsonly) Only mark bad bases in reads containing errors.
markquality=0
(mq) Set quality scores for marked bases to this. A level of 0 will also convert the base to an N.
errormult1=16
(em1) Min ratio between kmer depths to call an error.
errormult2=2.6
(em2) Alternate ratio between low-depth kmers.
errorlowerconst=3
(elc) Use mult2 when the lower kmer is at most this deep.
mincountcorrect=3
(mcc) Don't correct to kmers with count under this.
pathsimilarityfraction=0.45
(psf) Max difference ratio considered similar. Controls whether a path appears to be continuous.
pathsimilarityconstant=3
(psc) Absolute differences below this are ignored.
errorextensionreassemble=5
(eer) Verify this many kmers before the error as having similar depth, for reassemble.
errorextensionpincer=5
(eep) Verify this many additional bases after the error as matching current bases, for pincer.
errorextensiontail=9
(eet) Verify additional bases before and after the error as matching current bases, for tail.
deadzone=0
(dz) Do not try to correct bases within this distance of read ends.
window=12
(w) Length of window to use in reassemble mode.
windowcount=6
(wc) If more than this many errors are found within a window, halt correction in that direction.
qualsum=80
(qs) If the sum of the qualities of corrected bases within a window exceeds this, halt correction in that direction.
rbi=t
(requirebidirectional) Require agreement from both directions when correcting errors in the middle part of the read using the reassemble algorithm.
errorpath=1
(ep) For debugging purposes.
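
The interaction of errormult1, errormult2, and errorlowerconst can be sketched as a depth-ratio test. This is a reading of the parameter descriptions above, not Tadpole's actual code: the function name looks_like_error is hypothetical, and the greater-or-equal comparison is an assumption.

```shell
# looks_like_error HIGH LOW [EM1 EM2 ELC]
# Succeeds when the depth drop from HIGH to LOW meets the ratio threshold.
# Defaults mirror errormult1=16, errormult2=2.6, errorlowerconst=3.
looks_like_error() {
    awk -v hi="$1" -v lo="$2" \
        -v em1="${3:-16}" -v em2="${4:-2.6}" -v elc="${5:-3}" '
        BEGIN {
            # Use the alternate ratio when the lower kmer is very shallow.
            mult = (lo + 0 <= elc + 0) ? em2 : em1
            exit !(hi + 0 >= lo * mult)
        }
    '
}

if looks_like_error 40 2; then echo "error"; else echo "ok"; fi     # prints error
if looks_like_error 100 10; then echo "error"; else echo "ok"; fi   # prints ok
```

In the first call the lower depth (2) is at most errorlowerconst, so the 2.6 ratio applies and a drop from 40 to 2 qualifies. In the second call the 16x ratio applies and a drop from 100 to 10 does not.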

Junk-removal parameters (to only remove junk, set mode=discard)

tossjunk=f
Remove reads that cannot be used for assembly. This means they have no kmers above depth 1 (2 for paired reads) and the outermost kmers cannot be extended. Pairs are removed only if both reads fail.
tossdepth=-1
Remove reads containing kmers at or below this depth. Pairs are removed if either read fails.
lowdepthfraction=0
(ldf) Require at least this fraction of kmers to be low-depth to discard a read; range 0-1. 0 still requires at least 1 low-depth kmer.
requirebothbad=f
(rbb) Only discard pairs if both reads are low-depth.
tossuncorrectable=f
(tu) Discard reads containing uncorrectable errors. Requires error-correction to be enabled.
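
The way tossdepth and lowdepthfraction combine can be sketched as follows. This follows the descriptions above and is not the tool's implementation: the function name should_discard is hypothetical, and it takes a read's kmer depths as plain arguments.

```shell
# should_discard TOSSDEPTH LDF DEPTH...
# Succeeds if at least one kmer depth is at or below TOSSDEPTH and the
# fraction of such kmers is at least LDF. Hypothetical helper.
should_discard() {
    sd_depth=$1
    sd_fraction=$2
    shift 2
    printf '%s\n' "$@" | awk -v td="$sd_depth" -v ldf="$sd_fraction" '
        NF { total++; if ($1 + 0 <= td + 0) low++ }
        END { exit !(low >= 1 && low / total >= ldf + 0) }
    '
}

# One low-depth kmer out of four:
if should_discard 2 0 10 12 2 9; then echo "discard"; else echo "keep"; fi     # prints discard
if should_discard 2 0.5 10 12 2 9; then echo "discard"; else echo "keep"; fi   # prints keep
```

With ldf=0 a single low-depth kmer is enough to discard the read, while ldf=0.5 requires at least half of the kmers to be low-depth.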

Shaving parameters

shave=f
Remove dead ends (aka hair).
rinse=f
Remove bubbles.
wash=
Set shave and rinse at the same time.
maxshavedepth=1
(msd) Shave or rinse kmers at most this deep.
exploredist=300
(sed) Quit after exploring this far.
discardlength=150
(sdl) Discard shavings up to this long.

Note: Shave and rinse can produce substantially better assemblies for low-depth data, but they are very slow for large metagenomes. They are recommended for optimal results.

Overlap parameters (for overlapping paired-end reads only)

merge=f
Attempt to merge overlapping reads prior to kmer-counting, and again prior to correction. Output will still be unmerged pairs.
ecco=f
Error correct via overlap, but do not merge reads.
testmerge=t
Test kmer counts around the read merge junctions. If it appears that the merge created new errors, undo it.

Java Parameters

-Xmx
This will set Java's memory usage, overriding autodetection. -Xmx20g will specify 20 gigs of RAM, and -Xmx200m will specify 200 megs. The max is typically 85% of physical memory.
-eoom
This flag will cause the process to exit if an out-of-memory exception occurs. Requires Java 8u92+.
-da
Disable assertions.
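
When setting -Xmx by hand, the 85% guideline above can be turned into a flag with a little arithmetic. This is a minimal sketch: the helper name xmx_flag is hypothetical, it expects total memory in kilobytes, and the commented command assumes a Linux system where /proc/meminfo reports MemTotal in kB.

```shell
# xmx_flag TOTAL_KB : print an -Xmx flag sized to 85% of TOTAL_KB kilobytes.
xmx_flag() {
    # Divide before multiplying to keep the intermediate value small.
    echo "-Xmx$(( $1 / 100 * 85 / 1024 ))m"
}

xmx_flag 16777216   # 16 GiB of RAM, prints -Xmx13926m

# On Linux, the flag could then be passed explicitly:
# tadpole.sh "$(xmx_flag "$(awk '/^MemTotal:/ {print $2}' /proc/meminfo)")" in=reads.fq out=contigs.fa
```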

Advanced Examples

Basic Assembly

tadpole.sh in=reads.fq out=contigs.fa k=93

Each contig consists of unique kmers, so contigs will not overlap by more than K-1 bases. Contigs end when there is a branch or dead-end in the kmer graph.

Error Correction with Multiple Algorithms

tadpole.sh in=reads.fq out=ecc.fq mode=correct k=50

Correction uses pincer algorithm for middle bases (bidirectional validation) and tail algorithm for end bases (where pincer cannot work).

Error Marking Without Correction

tadpole.sh in=reads.fq out=marked.fq mode=correct k=50 ecc=f mbb=2

Marks bases as errors (replacing with N) instead of correcting them. A base is marked if fully covered by kmers with depth below the specified value.

Multi-library Error Correction

tadpole.sh in=libA_r1.fq,libA_merged.fq in2=libA_r2.fq,null extra=libB_r1.fq out=ecc_libA_r1.fq,ecc_libA_merged.fq out2=ecc_libA_r2.fq,null mode=correct

Error corrects multiple files while using additional library data for kmer counts but not for output.

Coverage Band Assembly

tadpole.sh in=reads.fq out=contigs.fa mincoverage=1000 maxcoverage=1500 k=93

Selectively assembles only regions with coverage between 1000x and 1500x, useful for complex metagenomes or amplified samples.

Memory-Optimized Assembly

tadpole.sh in=reads.fq out=contigs.fa k=93 prefilter=2 minprob=0.6 prealloc=t

Uses prefiltering to remove low-depth kmers and quality-based kmer filtering to reduce memory usage on large datasets.

Algorithm Details

Conservative Assembly Strategy

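Tadpole implements a conservative approach designed to minimize misassemblies in repetitive regions. The assembly algorithm resolves branches using kmer depth ratios rather than absolute thresholds: the deepest path must exceed the second-deepest by a factor of at least branchmult1 (default 20), or branchmult2 (default 3) when the second-deepest path has a depth of at most branchlower (default 3).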
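
The depth-ratio test can be sketched as a small shell function. It is built only from the descriptions of branchmult1, branchmult2, and branchlower in the Assembly parameters section. The function name can_follow_branch, the integer-only arithmetic, and the greater-or-equal comparison are assumptions, not Tadpole's actual code.

```shell
# can_follow_branch FIRST SECOND [BM1 BM2 BLC]
# FIRST and SECOND are the depths of the deepest and second-deepest paths.
# Succeeds if FIRST dominates SECOND by the applicable ratio.
can_follow_branch() {
    cfb_first=$1
    cfb_second=$2
    cfb_bm1=${3:-20}
    cfb_bm2=${4:-3}
    cfb_blc=${5:-3}
    # A shallow second path is judged with the lower ratio (branchmult2).
    if [ "$cfb_second" -le "$cfb_blc" ]; then
        cfb_mult=$cfb_bm2
    else
        cfb_mult=$cfb_bm1
    fi
    [ "$cfb_first" -ge $(( cfb_second * cfb_mult )) ]
}

if can_follow_branch 100 4; then echo "extend"; else echo "stop"; fi   # prints extend
if can_follow_branch 60 4; then echo "extend"; else echo "stop"; fi    # prints stop
```

Using ratios means the same rule behaves sensibly at 10x and at 1000x coverage, which is one way to read the claim about handling irregular coverage.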

Graph Cleaning Operations

Two complementary graph cleaning operations improve assembly continuity: shave removes dead ends (hair), and rinse removes bubbles.

Multi-Algorithm Error Correction

Error correction employs three specialized algorithms working in combination: pincer (bidirectional), tail (end-focused), and reassemble (local rebuilding).

Flexible Kmer Length Implementation

Tadpole automatically switches between optimized implementations based on kmer length:

Memory Management and Optimization

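Multiple strategies reduce memory requirements for large datasets, including prefiltering of low-depth kmers with a count-min sketch (prefilter), quality-based kmer filtering (minprob), and memory preallocation (prealloc).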

Parallel Processing Architecture

Multi-threaded implementation provides near-linear scaling:

Memory and Performance Considerations

Tadpole's memory usage is primarily determined by the number of unique kmers in the dataset:

Dataset-Specific Recommendations

Troubleshooting

Support

For questions and support: