Reassemble

Basic Usage

reassemble.sh in=<input files> out=<output file> k=<kmer size>

Input may be a single file, a comma-delimited list, a directory, or a wildcard pattern. Each input genome is assembled individually with the specified k-mer size. The assembled contigs are concatenated into the output file, and each contig keeps the TaxID of its source genome.
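To make the accepted input forms concrete, here is a minimal shell sketch of how an in= value could be expanded into a list of files. The function name expand_inputs is hypothetical, and the choice to read only the files directly inside a directory (no recursion) is an assumption; this is not Reassemble's own input parser.

```shell
# Hypothetical helper: expand an in= value into one file path per line.
# The body runs in a subshell so the IFS change does not leak to the caller.
expand_inputs() (
    IFS=,
    # Unquoted on purpose: split on commas and expand any wildcards.
    set -- $1
    for item in "$@"; do
        if [ -d "$item" ]; then
            # Directory: list the regular files directly inside it (assumed behavior).
            for f in "$item"/*; do
                if [ -f "$f" ]; then printf '%s\n' "$f"; fi
            done
        elif [ -f "$item" ]; then
            printf '%s\n' "$item"
        fi
    done
)

# Usage: expand_inputs "genomes/,extra/*.fa,single.fa"
```

Patterns that match nothing and paths that do not exist are silently skipped in this sketch.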

Operational Modes

Reassemble supports two operating modes for temporary file management:

Append Mode (Default)

When tempdir is unset (default), Tadpole appends contigs directly to the output file. This is the most efficient mode, avoiding disk overhead of temporary files.

Temporary File Mode

When tempdir is set, each genome's assembly is written to a separate temporary file, and the temporary files are then concatenated into the output. This mode is less efficient but safer when storage is unreliable. Use delete=t to remove the temporary files after successful completion.

TaxID Handling

Reassemble automatically extracts and preserves TaxID information from input genomes. TaxIDs can be embedded in:

Filename patterns: tid_12345_genome.fa or similar patterns
FASTA headers: TaxID extracted from header metadata

Output contig names include the source genome's TaxID for downstream binning assessment and validation.
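As an illustration of the filename pattern, the following shell sketch pulls the digits that follow tid_ out of a path. The function name taxid_from_filename is hypothetical, and the sketch covers only the tid_<digits> filename form shown above; it does not reproduce Reassemble's own extraction logic or the FASTA header case.

```shell
# Hypothetical helper: print the TaxID found in a "tid_<digits>" filename token.
# Returns a non-zero status and prints nothing when no such token is present.
taxid_from_filename() {
    local name tid
    name=$(basename "$1")
    tid=$(printf '%s\n' "$name" | sed -n 's/.*tid_\([0-9][0-9]*\).*/\1/p')
    [ -n "$tid" ] && printf '%s\n' "$tid"
}

# Usage: taxid_from_filename genomes/tid_12345_ecoli.fa
```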

Common Use Cases

Binning Tool Evaluation

The primary use case is evaluating metagenomic binning tools on ground-truth datasets:

Start with pre-binned genomes (known origin) or known individual genomes
Run Reassemble to assemble each genome individually in isolation
Feed assembled contigs with embedded TaxIDs to binning tool for evaluation
Compare predicted bins against known origins to measure accuracy

Preventing Co-Assembly Artifacts

When assembling multiple related or similar genomes, individual assembly prevents chimeric contigs that span multiple genomes:

Co-assembly risk: Similar regions between genomes can create false contigs combining sequences from multiple origins
Individual assembly benefit: Each genome assembled in isolation, ensuring all contigs are from single source
Consequence: Ground-truth datasets remain clean for benchmarking and validation

Processing Pre-Binned Metagenomes

When metagenomic samples have been pre-binned into separate genome files:

Run Reassemble to generate assembled contigs for each bin
Use assembled output for quality assessment, annotation, or further analysis
TaxID preservation allows tracking of each contig's origin bin

Parameters

Input/Output Parameters

in=<file|dir>: Input files. Supports comma-delimited lists, directories, and wildcards. Each input file is assembled individually.
out=<file>: (Required) Output FASTA file for assembled contigs. All assembled sequences are concatenated here.

Assembly Parameters

k=31: K-mer size for assembly. Required parameter, must be a positive integer. Larger k values are more specific but require more coverage; smaller k values are more sensitive but less specific.
mcs=1: minCountSeed: Minimum kmer count to start extension. Default is 1 for sparse individual genomes. Increase for coverage-rich assemblies to reduce noise.
mce=1: minCountExtend: Minimum kmer count to extend contigs. Default is 1 for sparse individual genomes. Adjust based on expected coverage.
mincontig=1: Minimum contig length to output. Default is 1 to preserve all assembled sequences. Increase to filter very short contigs.

Temporary File Management

tempdir=<path>: Directory for temporary files. If unset (default), Tadpole appends directly to output (more efficient). If set, each genome's assembly is written to a separate temporary file before concatenation.
delete=t: Delete temporary files after successful concatenation. Only relevant if tempdir is set. Set to false to retain temporary assemblies for debugging.

Error Handling

failfast=f: Abort on first genome assembly failure. If false (default), continue processing remaining genomes even if one fails. Set to true to stop immediately on error for quick feedback on problematic inputs.

Logging and Output

verbose=f: Print detailed progress including Tadpole commands and per-file status. Useful for debugging and monitoring long assembly runs.

Tadpole Pass-Through Parameters

Additional Tadpole parameters: Any parameter not explicitly documented above is passed directly to Tadpole for each genome assembly. Refer to Tadpole documentation for complete parameter list. Common parameters include: mindepth, maxdepth, maxbranches, mindepthseed, etc.

Java Parameters

-Xmx8g: Memory limit (default 8GB). Increase for larger genomes or set lower for memory-constrained systems. Applies to each genome assembly independently.
-eoom: Exit on out-of-memory error. If an assembly runs out of memory, terminate gracefully instead of hanging. Requires Java 8u92+.

Special Behaviors

TaxID Extraction and Embedding

Reassemble automatically extracts TaxID information from input filenames and FASTA headers, then embeds these TaxIDs in the output contig names. This preserves the source genome identity through assembly and allows tracking of contigs during binning evaluation.

Contig ID Offsetting

Contig IDs are automatically offset between genomes to ensure uniqueness across all assembled genomes. This prevents contig ID collisions when multiple genomes are assembled in a single run.

Sequential Processing and Garbage Collection

Genomes are processed sequentially, not in parallel. Between each genome assembly, garbage collection is performed to reclaim memory used by the previous assembly. This allows processing many genomes with limited memory by releasing resources as soon as each genome completes.

Summary Table

After all genomes are processed, Reassemble prints a summary table showing:

Genome: Input filename
TaxID: Extracted or detected TaxID
Status: Success or failure code
Contigs: Number of contigs assembled
Bases: Total bases in assembled contigs
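The Contigs and Bases columns can be cross-checked against the output file with a short awk sketch like the one below. The function name fasta_stats is hypothetical, and the sketch counts over a whole FASTA file rather than per genome.

```shell
# Hypothetical helper: print "<contig count><TAB><total bases>" for a FASTA file.
fasta_stats() {
    awk '
        /^>/ { contigs++; next }                       # header line: one more contig
        { gsub(/[ \t\r]/, ""); bases += length($0) }   # sequence line: count bases
        END { printf "%d\t%d\n", contigs, bases }
    ' "$1"
}

# Usage: fasta_stats assembled.fa
```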

Examples

Basic Assembly of Multiple Genomes

reassemble.sh in=genomes/ out=assembled.fa k=31

Description: Assemble all FASTA files in the genomes/ directory individually with k=31. Results concatenated to assembled.fa. Uses default settings with direct append mode (no temporary files).

Assembly with TaxID-Labeled Genomes

reassemble.sh in=tid_12345_ecoli.fa,tid_67890_bsubtilis.fa out=assembled.fa k=31

Description: Assemble two genomes individually, extracting TaxIDs from filenames (12345 for E. coli, 67890 for B. subtilis). Output contigs will be labeled with their source TaxID.

Failfast Mode for Quick Validation

reassemble.sh in=genomes/ out=assembled.fa k=31 failfast=t

Description: Stop immediately on first assembly failure. Useful for testing and debugging to quickly identify problematic input genomes without waiting for entire batch to process.

Temporary File Mode with Cleanup

reassemble.sh in=genomes/ out=assembled.fa k=31 tempdir=/tmp/reassemble delete=t

Description: Each genome assembled to separate temporary file in /tmp/reassemble, then concatenated to assembled.fa. Temporary files deleted after successful completion. Safer for unreliable storage.

Verbose Assembly with Detailed Output

reassemble.sh in=genomes/ out=assembled.fa k=31 verbose=t

Description: Print Tadpole commands and per-genome status to console during assembly. Shows detailed progress for monitoring and debugging.

Custom K-mer and Assembly Parameters

reassemble.sh in=genomes/ out=assembled.fa k=25 mcs=2 mce=2 mincontig=500

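Description: Assemble with k=25 (shorter k-mers tolerate lower coverage), mcs=2 and mce=2 (ignore k-mers seen fewer than twice), and mincontig=500 (discard contigs shorter than 500 bp). Useful for noisy inputs. Note that raising mcs and mce above the default of 1 can drop sequence from sparsely covered genomes.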

Binning Evaluation Workflow

# Step 1: Assemble pre-binned genomes
reassemble.sh in=binned_genomes/ out=assembled_contigs.fa k=31

# Step 2: Run binning tool on assembled contigs
binning_tool in=assembled_contigs.fa out=predicted_bins.txt

# Step 3: Compare predicted bins against known TaxID labels

Workflow description: Start with metagenomic reads binned into separate genome files (pre-binned). Reassemble to get individual assembled contigs with TaxID labels. Run binning tool to predict which contigs belong to same genome. Compare predictions against known TaxIDs to measure binning accuracy.
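Step 3 has no command in the example above, so here is one possible sketch of the comparison. It assumes two things that this document does not specify: that the binning tool's output is a tab-separated file of contig name and bin ID, and that each contig name contains a tid_<digits> token. The function name bin_purity is hypothetical. For each bin it reports the contig count of the most common TaxID, the total contig count, and their ratio.

```shell
# Hypothetical helper: per-bin purity from a "contig<TAB>bin" file.
# Assumes contig names carry the TaxID as a "tid_<digits>" token.
bin_purity() {
    awk -F'\t' '
        match($1, /tid_[0-9]+/) {
            tid = substr($1, RSTART + 4, RLENGTH - 4)   # digits after "tid_"
            count[$2, tid]++
            total[$2]++
            if (count[$2, tid] > best[$2]) best[$2] = count[$2, tid]
        }
        END {
            for (b in total)
                printf "%s\t%d\t%d\t%.2f\n", b, best[b], total[b], best[b] / total[b]
        }
    ' "$1" | sort
}

# Usage: bin_purity predicted_bins.txt
```

Contigs without a recognizable TaxID token are ignored. Counting contigs rather than bases is a simplification; weighting by contig length would need the sequence lengths as an extra input.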

Large Memory Systems

reassemble.sh -Xmx64g in=large_genomes/ out=assembled.fa k=31

Description: Allocate 64GB of memory per genome assembly for very large or complex genomes. Adjust -Xmx based on largest input genome and available system memory.

Memory-Constrained Systems

reassemble.sh -Xmx2g in=genomes/ out=assembled.fa k=25 mcs=2 mce=2

Description: Run on systems with limited memory (2GB per genome) using smaller k-mer size and higher count thresholds. May reduce assembly quality but allows processing on constrained hardware.

Algorithm Details

Individual Assembly Strategy

Reassemble wraps the Tadpole assembler and processes each input genome independently:

For each input genome file:
- Extract or detect TaxID from filename or header
- Run Tadpole assembly with specified k-mer size and parameters
- Embed TaxID in output contig names
- Append (or write to temp file) assembled contigs
- Perform garbage collection to reclaim memory
After all genomes are processed:
- If tempdir is set, concatenate all temporary files to the final output
- Print a summary table with assembly statistics per genome
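The per-genome loop can be sketched in shell roughly as follows. This is an illustration of the control flow only, not Reassemble's implementation: the function name reassemble_sketch and its argument order are hypothetical, it always writes each assembly to a scratch file before concatenating, and it omits TaxID embedding, contig renaming, and the summary table. The assembler command is taken from the ASSEMBLER variable (defaulting to tadpole.sh) and is called with only the in=, out=, and k= arguments used elsewhere in this document.

```shell
# Assembler command; override to test the loop without Tadpole installed.
ASSEMBLER=${ASSEMBLER:-tadpole.sh}

# Hypothetical sketch: reassemble_sketch <out.fa> <k> <failfast: t|f> <genome>...
reassemble_sketch() {
    local out k failfast workdir genome failures
    out=$1
    k=$2
    failfast=$3
    shift 3
    failures=0
    workdir=$(mktemp -d)
    : > "$out"                                    # start with an empty output file
    for genome in "$@"; do
        rm -f "$workdir/contigs.fa"
        if "$ASSEMBLER" in="$genome" out="$workdir/contigs.fa" k="$k" >/dev/null 2>&1; then
            cat "$workdir/contigs.fa" >> "$out"   # concatenate the contigs of this genome
        else
            failures=$((failures + 1))
            echo "assembly failed: $genome" >&2
            if [ "$failfast" = t ]; then          # failfast: stop at the first failure
                rm -rf "$workdir"
                return 1
            fi
        fi
    done
    rm -rf "$workdir"
    echo "genomes: $#, failures: $failures" >&2
}

# Usage: reassemble_sketch assembled.fa 31 f genomes/*.fa
```

With failfast set to f, a failed genome is reported on stderr and the loop continues with the next input, which mirrors the default behavior described under Error Handling.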

Why Individual Assembly Matters

In co-assembly of multiple genomes, the assembler uses global k-mer graphs containing sequences from all genomes. When genomic regions share similarity (common in metagenomic samples), the assembler can create chimeric contigs spanning multiple genomes. Individual assembly prevents this:

Co-assembly: K-mer graph contains all genomes → similar regions bridge → chimeric contigs
Individual assembly: Each genome has isolated k-mer graph → no inter-genome bridges → contigs remain pure
Binning impact: Chimeric contigs confound binning tool evaluation by mixing signals from multiple sources
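To gauge how much two genomes could bridge in a shared k-mer graph, one can count the distinct k-mers they have in common. The awk sketch below does this for two FASTA files; the function name shared_kmers is hypothetical, and for brevity it compares forward-strand k-mers only (no reverse complements), so it undercounts relative to what an assembler would see. It also holds all k-mers of the first file in memory, so it is meant for small test inputs.

```shell
# Hypothetical helper: shared_kmers <k> <a.fa> <b.fa>
# Prints the number of distinct k-mers present in both files (forward strand only).
shared_kmers() {
    awk -v k="$1" '
        # Record the k-mers of the finished record, then reset the buffer.
        function emit_kmers(   i, kmer) {
            for (i = 1; i + k - 1 <= length(seq); i++) {
                kmer = substr(seq, i, k)
                if (file == 1) first[kmer] = 1
                else if (kmer in first) shared[kmer] = 1
            }
            seq = ""
        }
        FNR == 1 { emit_kmers(); file++ }   # a new input file begins
        /^>/     { emit_kmers(); next }     # a new FASTA record begins
                 { seq = seq toupper($0) }
        END {
            emit_kmers()
            n = 0
            for (kmer in shared) n++
            print n
        }
    ' "$2" "$3"
}

# Usage: shared_kmers 31 genomeA.fa genomeB.fa
```

A count of zero at the chosen k means the two inputs share no forward-strand k-mer through which a co-assembly graph could join them; a large count marks a pair where individual assembly matters most.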

Contig ID Management

To maintain globally unique contig identifiers across assembled genomes:

Contig IDs are automatically offset based on genome processing order
Example: First genome contigs named contig_0_1, contig_0_2, etc.; second genome contig_1_1, contig_1_2, etc.
This prevents ID collisions when concatenating results from multiple independent assemblies
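The naming scheme in the example can be imitated with a small awk sketch. The function name rename_contigs is hypothetical, and the contig_<genomeIndex>_<n> pattern is taken from the example above rather than from Reassemble's source, so treat the exact format as an assumption.

```shell
# Hypothetical helper: rename_contigs <genome index> <fasta>
# Rewrites each FASTA header to contig_<genome index>_<n>; sequence lines pass through.
rename_contigs() {
    awk -v g="$1" '
        /^>/ { n++; print ">contig_" g "_" n; next }
        { print }
    ' "$2"
}

# Usage: rename_contigs 0 first.fa > all.fa; rename_contigs 1 second.fa >> all.fa
```

Because the genome index is part of every name, two assemblies that both start numbering at 1 still produce distinct identifiers after concatenation.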

Sequential Processing Benefits

Reassemble processes genomes sequentially rather than in parallel:

Memory efficiency: Each assembly's memory is released before next genome starts
Scalability: Can process 1000+ genomes on a system with enough memory for only one assembly at a time
Simplicity: No thread synchronization overhead or race conditions
Predictability: Peak memory usage depends on the largest single genome, not on the number of genomes processed

Workflow Integration

Input Preprocessing

Reassemble expects relatively clean input genomes. For raw sequencing reads:

Pre-process with BBDuk to remove adapters and low-quality bases
Pre-bin reads using existing binning tools or ground-truth labels
Save each bin to separate FASTA file
Feed to Reassemble for individual assembly

Output Usage

Assembled contigs can be used for:

Binning evaluation: Feed to binning tools and compare predictions against TaxID labels
Functional annotation: Annotate contigs to understand bin composition
Quality assessment: Measure assembly completeness, contamination, and fragmentation
Reference database: Use assembled genomes as reference for downstream analysis

Relationship to Other Tools

Reassemble fits into broader assembly workflows:

Tadpole: The underlying de Bruijn graph assembler (Reassemble is a wrapper)
BBDuk: Use for preprocessing raw reads before binning
BBMap: Align reads to assembled contigs for validation
Binning tools: Evaluate using Reassemble output with embedded TaxID labels

Support

For questions and support:

Email: bbushnell@lbl.gov
Documentation: bbmap.org
Tadpole documentation: For advanced assembly parameters and k-mer selection strategies