Reassemble

Script: reassemble.sh Package: jgi Class: assemble.Reassemble.java

Assembles multiple genome files individually using Tadpole, then concatenates the results. Unlike co-assembly, each genome is processed in isolation, preventing chimeric contigs across genome boundaries. Primarily used for evaluating metagenomic binning tools using ground-truth datasets where individual genome origins are known.

Basic Usage

reassemble.sh in=<input files> out=<output file> k=<kmer size>

Input may be a single file, a comma-delimited list, a directory, or a wildcard pattern. Each input genome is assembled individually with the specified k-mer size. Assembled contigs are concatenated into the output file, and each contig retains the TaxID of its source genome.
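
If the genomes are scattered across several directories, the comma-delimited form can be built with a small shell helper. The sketch below is only a convenience: join_inputs is a hypothetical function, not part of BBTools, and it assumes no filename contains a comma. It prints the command line rather than running it, so the result can be checked first.

```shell
# join_inputs: hypothetical helper (not part of BBTools) that joins its
# arguments with commas, producing a value suitable for in=<list>.
# Assumes no argument contains a comma.
join_inputs() (
    joined=""
    for f in "$@"; do
        if [ -z "$joined" ]; then joined="$f"; else joined="$joined,$f"; fi
    done
    printf '%s\n' "$joined"
)

# Print (do not run) the resulting command line.
echo "reassemble.sh in=$(join_inputs a.fa b.fa c.fa) out=assembled.fa k=31"
```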

Operational Modes

Reassemble supports two operating modes for temporary file management:

Append Mode (Default)

When tempdir is unset (default), Tadpole appends contigs directly to the output file. This is the most efficient mode, avoiding disk overhead of temporary files.

Temporary File Mode

When tempdir is set, each genome's assembly is written to a separate temporary file, then concatenated to the output. This is safer in error-prone environments but less efficient. Use delete=t to clean up temporary files after successful completion.
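
The shape of temporary file mode can be pictured with plain shell. The following sketch mimics the concatenate-then-clean-up step for part files that already exist. concat_parts is a hypothetical helper and says nothing about how Reassemble is implemented internally; the point is only that cleanup happens after every part has been copied.

```shell
# concat_parts: illustration of temporary file mode, not Reassemble's code.
# Usage: concat_parts <output> <delete: t|f> <part files...>
concat_parts() (
    out=$1; delete=$2; shift 2
    : > "$out" || exit 1                 # start with an empty output file
    for part in "$@"; do
        cat "$part" >> "$out" || exit 1  # stop before cleanup if a copy fails
    done
    if [ "$delete" = "t" ]; then
        rm -f "$@"                       # remove parts only after all were copied
    fi
)
```

In append mode there are no part files at all: each assembly is written straight onto the end of the output.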

TaxID Handling

Reassemble automatically extracts and preserves TaxID information from input genomes. TaxIDs can be embedded in input filenames or in FASTA headers.

Output contig names include the source genome's TaxID for downstream binning assessment and validation.

Common Use Cases

Binning Tool Evaluation

The primary use case is evaluating metagenomic binning tools on ground-truth datasets:

  1. Start with pre-binned genomes (known origin) or known individual genomes
  2. Run Reassemble to assemble each genome individually in isolation
  3. Feed the assembled contigs, with their embedded TaxIDs, to the binning tool for evaluation
  4. Compare predicted bins against known origins to measure accuracy

Preventing Co-Assembly Artifacts

When assembling multiple related or similar genomes, individual assembly prevents chimeric contigs that span multiple genomes.

Processing Pre-Binned Metagenomes

When metagenomic samples have been pre-binned into separate genome files, each bin file is passed as a separate input and assembled on its own.

Parameters

Input/Output Parameters

in=<file|dir>
Input files. Supports comma-delimited lists, directories, and wildcards. Each input file is assembled individually.
out=<file>
(Required) Output FASTA file for assembled contigs. All assembled sequences are concatenated here.

Assembly Parameters

k=31
K-mer size for assembly. Required parameter, must be a positive integer. Larger k values are more specific but require more coverage; smaller k values are more sensitive but less specific.
mcs=1
minCountSeed: Minimum k-mer count to start extension. Default is 1 for sparse individual genomes. Increase for coverage-rich assemblies to reduce noise.
mce=1
minCountExtend: Minimum k-mer count to extend contigs. Default is 1 for sparse individual genomes. Adjust based on expected coverage.
mincontig=1
Minimum contig length to output. Default is 1 to preserve all assembled sequences. Increase to filter very short contigs.

Temporary File Management

tempdir=<path>
Directory for temporary files. If unset (default), Tadpole appends directly to output (more efficient). If set, each genome's assembly is written to a separate temporary file before concatenation.
delete=t
Delete temporary files after successful concatenation. Only relevant if tempdir is set. Set to false to retain temporary assemblies for debugging.
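
With delete=f the per-genome assemblies remain in tempdir, where they can be inspected. A minimal sketch, assuming the retained files are ordinary FASTA (their names are not specified in this document): fasta_stats is a hypothetical helper that prints the file name, contig count, and total bases for each file, tab-separated.

```shell
# fasta_stats: hypothetical helper, not part of BBTools.
# Prints "<file>\t<contigs>\t<bases>" for each FASTA file given.
fasta_stats() {
    for f in "$@"; do
        awk '
            /^>/ { contigs++; next }        # header line: one more contig
            { bases += length($0) }         # sequence line: add its length
            END { printf "%s\t%d\t%d\n", FILENAME, contigs, bases }
        ' "$f"
    done
}
```

For example, fasta_stats /tmp/reassemble/* would list every retained assembly, which makes an empty or unexpectedly small one easy to spot.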

Error Handling

failfast=f
Abort on first genome assembly failure. If false (default), continue processing remaining genomes even if one fails. Set to true to stop immediately on error for quick feedback on problematic inputs.
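
The difference between the two settings is the usual continue-or-abort choice in a batch loop. The sketch below illustrates it in shell; run_each is a hypothetical helper, not Reassemble's code, and the per-input command is whatever the caller supplies.

```shell
# run_each: illustration of failfast semantics, not Reassemble's code.
# Usage: run_each <failfast: t|f> <command> <inputs...>
# Runs <command> once per input; the exit status is the number of failures.
run_each() (
    failfast=$1; cmd=$2; shift 2
    failures=0
    for input in "$@"; do
        if ! "$cmd" "$input"; then
            echo "failed: $input" >&2
            failures=$((failures + 1))
            # failfast=t: give up at the first failure
            if [ "$failfast" = "t" ]; then break; fi
        fi
    done
    exit "$failures"
)
```

With failfast=f every input is attempted and the failures are reported together at the end; with failfast=t the inputs after the first failure are never touched.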

Logging and Output

verbose=f
Print detailed progress including Tadpole commands and per-file status. Useful for debugging and monitoring long assembly runs.

Tadpole Pass-Through Parameters

Additional Tadpole parameters
Any parameter not explicitly documented above is passed directly to Tadpole for each genome assembly. Refer to the Tadpole documentation for the complete parameter list. Common parameters include mindepth, maxdepth, maxbranches, and mindepthseed.

Java Parameters

-Xmx8g
Memory limit (default 8GB). Increase for larger genomes or set lower for memory-constrained systems. Applies to each genome assembly independently.
-eoom
Exit on out-of-memory error. If an assembly runs out of memory, terminate gracefully instead of hanging. Requires Java 8u92+.

Special Behaviors

TaxID Extraction and Embedding

Reassemble automatically extracts TaxID information from input filenames and FASTA headers, then embeds these TaxIDs in the output contig names. This preserves the source genome identity through assembly and allows tracking of contigs during binning evaluation.
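
To check in advance which TaxID a filename carries, a pattern match is enough. The sketch below assumes the tid_<digits>_ prefix used in the examples of this document (such as tid_12345_ecoli.fa); the exact rule Reassemble applies may be broader, and taxid_from_name is a hypothetical helper.

```shell
# taxid_from_name: hypothetical helper, not part of BBTools.
# Assumes the filename starts with tid_<digits>_ ; prints the digits,
# or prints nothing when the pattern is absent.
taxid_from_name() {
    basename "$1" | sed -n 's/^tid_\([0-9][0-9]*\)_.*/\1/p'
}
```

An empty result for a file that was expected to be labeled is a hint to rename it before running the assembly.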

Contig ID Offsetting

Contig IDs are automatically offset between genomes to ensure uniqueness across all assembled genomes. This prevents contig ID collisions when multiple genomes are assembled in a single run.
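
The idea of offsetting can be sketched with awk. The example assumes headers of the hypothetical form >contig_<n> followed by optional extra fields, with numbering restarting in each file; the real header layout and the way Reassemble computes its offsets are not described here, so treat renumber_contigs as an illustration only.

```shell
# renumber_contigs: illustration of ID offsetting, not Reassemble's code.
# Assumes headers like ">contig_<n>..." whose numbering restarts per file.
# Each file's IDs are shifted past the highest ID used by earlier files.
renumber_contigs() {
    awk '
        FNR == 1 { offset = next_free }     # new file: shift past IDs already used
        /^>contig_[0-9]+/ {
            id = $0
            sub(/^>contig_/, "", id)        # drop the prefix
            sub(/[^0-9].*$/, "", id)        # keep only the leading digits
            rest = $0
            sub(/^>contig_[0-9]+/, "", rest)  # anything after the ID is preserved
            new_id = id + offset
            if (new_id + 1 > next_free) next_free = new_id + 1
            print ">contig_" new_id rest
            next
        }
        { print }                           # sequence lines pass through unchanged
    ' "$@"
}
```

Given two files that each start at contig_0, the second file's contigs continue from where the first file's numbering ended.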

Sequential Processing and Garbage Collection

Genomes are processed sequentially, not in parallel. Between each genome assembly, garbage collection is performed to reclaim memory used by the previous assembly. This allows processing many genomes with limited memory by releasing resources as soon as each genome completes.

Summary Table

After all genomes are processed, Reassemble prints a summary table with assembly statistics for each genome.

Examples

Basic Assembly of Multiple Genomes

reassemble.sh in=genomes/ out=assembled.fa k=31

Description: Assemble all FASTA files in the genomes/ directory individually with k=31. Results are concatenated to assembled.fa. Uses the default settings, with direct append mode (no temporary files).

Assembly with TaxID-Labeled Genomes

reassemble.sh in=tid_12345_ecoli.fa,tid_67890_bsubtilis.fa out=assembled.fa k=31

Description: Assemble two genomes individually, extracting TaxIDs from filenames (12345 for E. coli, 67890 for B. subtilis). Output contigs will be labeled with their source TaxID.

Failfast Mode for Quick Validation

reassemble.sh in=genomes/ out=assembled.fa k=31 failfast=t

Description: Stop immediately on the first assembly failure. Useful for testing and debugging, since problematic input genomes are identified without waiting for the entire batch to process.

Temporary File Mode with Cleanup

reassemble.sh in=genomes/ out=assembled.fa k=31 tempdir=/tmp/reassemble delete=t

Description: Each genome is assembled to a separate temporary file in /tmp/reassemble, then concatenated to assembled.fa. Temporary files are deleted after successful completion. Safer for unreliable storage.

Verbose Assembly with Detailed Output

reassemble.sh in=genomes/ out=assembled.fa k=31 verbose=t

Description: Print Tadpole commands and per-genome status to console during assembly. Shows detailed progress for monitoring and debugging.

Custom K-mer and Assembly Parameters

reassemble.sh in=genomes/ out=assembled.fa k=25 mcs=2 mce=2 mincontig=500

Description: Assemble with k=25 (shorter k-mers for lower coverage), mcs=2 and mce=2 (require k-mers to be seen at least twice), and mincontig=500 (drop contigs shorter than 500 bp). Useful for noisy or sparse assemblies.
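
If an assembly was already produced with the default mincontig=1, the same kind of length cutoff can be applied afterwards. The sketch below is a post-hoc filter written in awk, not the way Tadpole applies mincontig internally; filter_min_length is a hypothetical helper, and it writes each surviving sequence on a single line.

```shell
# filter_min_length: post-hoc length filter, not Tadpole's implementation.
# Usage: filter_min_length <min_bp> [fasta files...]   (reads stdin if no files)
# Keeps records whose total sequence length is at least <min_bp>.
filter_min_length() (
    min=$1; shift
    awk -v min="$min" '
        function emit() {
            if (header != "" && length(seq) >= min) print header "\n" seq
        }
        /^>/ { emit(); header = $0; seq = ""; next }   # new record: flush the previous one
        { seq = seq $0 }                               # join wrapped sequence lines
        END { emit() }                                 # flush the last record
    ' "$@"
)
```

Usage would look like filter_min_length 500 assembled.fa > assembled.min500.fa, where the output name is arbitrary.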

Binning Evaluation Workflow

# Step 1: Assemble pre-binned genomes
reassemble.sh in=binned_genomes/ out=assembled_contigs.fa k=31

# Step 2: Run binning tool on assembled contigs
binning_tool in=assembled_contigs.fa out=predicted_bins.txt

# Step 3: Compare predicted bins against known TaxID labels

Workflow description: Start with metagenomic reads binned into separate genome files (pre-binned). Run Reassemble to get individually assembled contigs with TaxID labels. Run the binning tool to predict which contigs belong to the same genome. Compare the predictions against the known TaxIDs to measure binning accuracy.
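
Step 3 has no fixed command because it depends on the binning tool's output. As one possible sketch, suppose the known TaxID and the predicted bin of every contig have been collected into a tab-separated table with the columns contig, taxid, bin. That table layout is an assumption, as is the bin_purity helper; how to derive the table from contig names depends on the header format, which is not specified here. The helper reports, for each bin, the fraction of contigs that carry the bin's most common TaxID.

```shell
# bin_purity: hypothetical evaluation helper, not part of BBTools.
# Input (files or stdin): tab-separated "contig<TAB>taxid<TAB>bin".
# Output: "bin<TAB>contigs<TAB>purity", where purity is the share of the
# bin's contigs that carry its most common TaxID.
bin_purity() {
    awk -F '\t' '
        {
            total[$3]++                       # contigs in this bin
            n = ++count[$3 SUBSEP $2]         # contigs of this TaxID in this bin
            if (n > best[$3]) best[$3] = n    # track the dominant TaxID count
        }
        END {
            for (b in total)
                printf "%s\t%d\t%.2f\n", b, total[b], best[b] / total[b]
        }
    ' "$@" | LC_ALL=C sort
}
```

This counts contigs rather than bases, so a bin polluted by one long foreign contig looks purer than it is; weighting by contig length would be a natural refinement.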

Large Memory Systems

reassemble.sh -Xmx64g in=large_genomes/ out=assembled.fa k=31

Description: Allocate 64GB of memory per genome assembly for very large or complex genomes. Adjust -Xmx based on largest input genome and available system memory.

Memory-Constrained Systems

reassemble.sh -Xmx2g in=genomes/ out=assembled.fa k=25 mcs=2 mce=2

Description: Run on systems with limited memory (2GB per genome) using a smaller k-mer size and higher count thresholds. This may reduce assembly quality but allows processing on constrained hardware.

Algorithm Details

Individual Assembly Strategy

Reassemble wraps the Tadpole assembler and processes each input genome independently:

  1. For each input genome file:
    • Extract or detect TaxID from filename or header
    • Run Tadpole assembly with specified k-mer size and parameters
    • Embed TaxID in output contig names
    • Append (or write to temp file) assembled contigs
    • Perform garbage collection to reclaim memory
  2. If tempdir is set, concatenate all temporary files to the final output
  3. Print summary table with assembly statistics per genome

Why Individual Assembly Matters

In co-assembly of multiple genomes, the assembler builds a global k-mer graph containing sequences from all genomes. When genomic regions share similarity (common in metagenomic samples), the assembler can create chimeric contigs spanning multiple genomes. Individual assembly prevents this because each genome's graph is built in isolation, so no path can cross a genome boundary.
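
A rough way to see the risk is to count the k-mers two sequences have in common, since every shared k-mer is a point where a combined graph would join the two genomes. The sketch below does this for two short strings; shared_kmers is a hypothetical helper, it looks at the forward strand only (reverse complements are ignored), and it is meant for toy inputs rather than whole genomes.

```shell
# shared_kmers: toy illustration, not part of BBTools or Tadpole.
# Usage: shared_kmers <k> <seqA> <seqB>
# Prints the number of distinct k-mers of seqB that also occur in seqA
# (forward strand only).
shared_kmers() {
    awk -v k="$1" -v a="$2" -v b="$3" 'BEGIN {
        # Index every k-mer of A.
        for (i = 1; i + k - 1 <= length(a); i++) in_a[substr(a, i, k)] = 1
        # Count each k-mer of B once if A contains it too.
        for (i = 1; i + k - 1 <= length(b); i++) {
            kmer = substr(b, i, k)
            if ((kmer in in_a) && !(kmer in seen)) { seen[kmer] = 1; n++ }
        }
        print n + 0
    }'
}
```

For instance, shared_kmers 4 ACGTACGG TTACGTAA counts the 4-mers TACG, ACGT, and CGTA, which appear in both strings. Larger k makes accidental sharing rarer, but truly homologous regions stay shared at any practical k, which is why isolation rather than a bigger k is what removes the problem.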

Contig ID Management

To maintain globally unique contig identifiers across assembled genomes, contig IDs are offset between genomes so that no two genomes share an ID.

Sequential Processing Benefits

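Reassemble processes genomes sequentially rather than in parallel. Garbage collection runs between assemblies, so the memory used by one genome is released before the next begins, which allows many genomes to be processed with limited memory.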

Workflow Integration

Input Preprocessing

Reassemble expects relatively clean input genomes. For raw sequencing reads:

Output Usage

Assembled contigs can be used for:

Relationship to Other Tools

Reassemble fits into broader assembly workflows:

Support

For questions and support: