Reassemble
Assembles multiple genome files individually using Tadpole, then concatenates the results. Unlike co-assembly, each genome is processed in isolation, which prevents chimeric contigs that cross genome boundaries. The main use is evaluating metagenomic binning tools against ground-truth datasets in which the genome of origin of each contig is known.
Basic Usage
reassemble.sh in=<input files> out=<output file> k=<kmer size>
Input may be a single file, a comma-delimited list, a directory, or a wildcard pattern. Each input genome is assembled individually with the specified k-mer size. The assembled contigs are concatenated into the output file, with each genome's TaxID preserved.
Operational Modes
Reassemble supports two operating modes for temporary file management:
Append Mode (Default)
When tempdir is unset (default), Tadpole appends contigs directly to the output file. This is the most efficient mode, avoiding disk overhead of temporary files.
Temporary File Mode
When tempdir is set, each genome's assembly is written to a separate temporary file, then concatenated to the output. This is safer in error-prone environments but less efficient. Use delete=t to clean up temporary files after successful completion.
TaxID Handling
Reassemble automatically extracts and preserves TaxID information from input genomes. TaxIDs can be embedded in:
- Filename patterns: tid_12345_genome.fa or similar patterns
- FASTA headers: TaxID extracted from header metadata
Output contig names include the source genome's TaxID for downstream binning assessment and validation.
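As an illustration of the filename convention, the sketch below pulls the digits that follow tid_ out of a file name. The helper name taxid_from_name is hypothetical, and the tid_<digits> pattern is taken only from the example above; the rules Reassemble actually applies to filenames and headers may be broader.

```shell
# Hypothetical helper: print the digits that follow "tid_" in a file name.
# Assumes the tid_<digits> convention from the example; prints nothing
# when the name carries no such label.
taxid_from_name() {
    # Drop any directory prefix, then keep only the digits after tid_.
    basename "$1" | sed -n 's/^.*tid_\([0-9][0-9]*\).*$/\1/p'
}

taxid_from_name genomes/tid_12345_ecoli.fa    # prints 12345
taxid_from_name genomes/unlabeled.fa          # prints nothing
```

A check like this can be run over an input directory before assembly to spot files that carry no filename label.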
Common Use Cases
Binning Tool Evaluation
The primary use case is evaluating metagenomic binning tools on ground-truth datasets:
- Start with pre-binned genomes (known origin) or known individual genomes
- Run Reassemble to assemble each genome individually in isolation
- Feed the assembled contigs, with their embedded TaxIDs, to the binning tool under evaluation
- Compare predicted bins against known origins to measure accuracy
Preventing Co-Assembly Artifacts
When assembling multiple related or similar genomes, individual assembly prevents chimeric contigs that span multiple genomes:
- Co-assembly risk: Similar regions between genomes can create false contigs combining sequences from multiple origins
- Individual assembly benefit: Each genome is assembled in isolation, so every contig comes from a single source
- Consequence: Ground-truth datasets remain clean for benchmarking and validation
Processing Pre-Binned Metagenomes
When metagenomic samples have been pre-binned into separate genome files:
- Run Reassemble to generate assembled contigs for each bin
- Use assembled output for quality assessment, annotation, or further analysis
- TaxID preservation allows tracking of each contig's origin bin
Parameters
Input/Output Parameters
- in=<file|dir>
- Input files. Supports comma-delimited lists, directories, and wildcards. Each input file is assembled individually.
- out=<file>
- (Required) Output FASTA file for assembled contigs. All assembled sequences are concatenated here.
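When the input set is chosen by a script rather than typed by hand, the comma-delimited form can be built from a list of paths. The helper below is a small sketch; the name join_with_commas is made up for this example and is not part of BBTools.

```shell
# Hypothetical helper: join the given paths into one comma-delimited
# string with no trailing comma.
join_with_commas() {
    joined=""
    for path in "$@"; do
        # Prepend a comma only when something has already been collected.
        joined="${joined:+$joined,}$path"
    done
    printf '%s\n' "$joined"
}

join_with_commas tid_12345_ecoli.fa tid_67890_bsubtilis.fa
# prints tid_12345_ecoli.fa,tid_67890_bsubtilis.fa
```

The result can then be passed as the value of in=, for example in="$(join_with_commas genomes/tid_*.fa)". Paths that themselves contain commas would not survive this, so the sketch assumes they do not.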
Assembly Parameters
- k=31
- K-mer size for assembly. Required parameter, must be a positive integer. Larger k values are more specific but require more coverage; smaller k values are more sensitive but less specific.
- mcs=1
- minCountSeed: Minimum k-mer count to start extension. Default is 1 for sparse individual genomes. Increase for coverage-rich assemblies to reduce noise.
- mce=1
- minCountExtend: Minimum k-mer count to extend contigs. Default is 1 for sparse individual genomes. Adjust based on expected coverage.
- mincontig=1
- Minimum contig length to output. Default is 1 to preserve all assembled sequences. Increase to filter very short contigs.
Temporary File Management
- tempdir=<path>
- Directory for temporary files. If unset (default), Tadpole appends directly to output (more efficient). If set, each genome's assembly is written to a separate temporary file before concatenation.
- delete=t
- Delete temporary files after successful concatenation. Only relevant if tempdir is set. Set to false to retain temporary assemblies for debugging.
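The concatenate-then-clean-up behavior described for tempdir and delete can be pictured with a few lines of shell. This is a conceptual sketch only: concat_parts is a hypothetical function, and it is not how Reassemble is implemented internally.

```shell
# Conceptual sketch: concatenate per-genome part files into one output,
# removing the parts only after the concatenation has succeeded.
# Usage: concat_parts <out> <t|f> <part>...
concat_parts() (
    out=$1; delete=$2; shift 2
    cat "$@" > "$out" || exit 1          # on failure, leave the parts untouched
    if [ "$delete" = t ]; then rm -f "$@"; fi
)
```

Running the function body in a subshell keeps its variables from leaking into the caller. With the second argument set to f the part files stay in place, which mirrors keeping temporary assemblies for debugging.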
Error Handling
- failfast=f
- Abort on first genome assembly failure. If false (default), continue processing remaining genomes even if one fails. Set to true to stop immediately on error for quick feedback on problematic inputs.
Logging and Output
- verbose=f
- Print detailed progress including Tadpole commands and per-file status. Useful for debugging and monitoring long assembly runs.
Tadpole Pass-Through Parameters
- Additional Tadpole parameters
- Any parameter not explicitly documented above is passed directly to Tadpole for each genome assembly. Refer to the Tadpole documentation for the complete parameter list. Common parameters include mindepth, maxdepth, maxbranches, and mindepthseed.
Java Parameters
- -Xmx8g
- Memory limit (default 8GB). Increase for larger genomes or set lower for memory-constrained systems. Applies to each genome assembly independently.
- -eoom
- Exit on out-of-memory error. If an assembly runs out of memory, terminate gracefully instead of hanging. Requires Java 8u92+.
Special Behaviors
TaxID Extraction and Embedding
Reassemble automatically extracts TaxID information from input filenames and FASTA headers, then embeds these TaxIDs in the output contig names. This preserves the source genome identity through assembly and allows tracking of contigs during binning evaluation.
Contig ID Offsetting
Contig IDs are automatically offset between genomes to ensure uniqueness across all assembled genomes. This prevents contig ID collisions when multiple genomes are assembled in a single run.
Sequential Processing and Garbage Collection
Genomes are processed sequentially, not in parallel. Between each genome assembly, garbage collection is performed to reclaim memory used by the previous assembly. This allows processing many genomes with limited memory by releasing resources as soon as each genome completes.
Summary Table
After all genomes are processed, Reassemble prints a summary table showing:
- Genome: Input filename
- TaxID: Extracted or detected TaxID
- Status: Success or failure code
- Contigs: Number of contigs assembled
- Bases: Total bases in assembled contigs
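The Contigs and Bases columns can be cross-checked independently for any FASTA file. The sketch below counts header lines and sequence characters with awk; fasta_stats is a hypothetical helper, and it assumes plain uncompressed FASTA with Unix line endings.

```shell
# Hypothetical helper: print "<contigs> <bases>" for one FASTA file.
fasta_stats() {
    awk '
        /^>/ { contigs++; next }          # header line: one more contig
             { bases += length($0) }      # sequence line: add its length
        END  { printf "%d %d\n", contigs, bases }
    ' "$1"
}

demo=$(mktemp)
printf '>c1\nACGT\nACG\n>c2\nTTTTT\n' > "$demo"
fasta_stats "$demo"    # prints 2 12
rm -f "$demo"
```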
Examples
Basic Assembly of Multiple Genomes
reassemble.sh in=genomes/ out=assembled.fa k=31
Description: Assemble all FASTA files in the genomes/ directory individually with k=31. Results concatenated to assembled.fa. Uses default settings with direct append mode (no temporary files).
Assembly with TaxID-Labeled Genomes
reassemble.sh in=tid_12345_ecoli.fa,tid_67890_bsubtilis.fa out=assembled.fa k=31
Description: Assemble two genomes individually, extracting TaxIDs from filenames (12345 for E. coli, 67890 for B. subtilis). Output contigs will be labeled with their source TaxID.
Failfast Mode for Quick Validation
reassemble.sh in=genomes/ out=assembled.fa k=31 failfast=t
Description: Stop immediately on the first assembly failure. Useful for testing and debugging, since problematic input genomes are identified without waiting for the entire batch to finish.
Temporary File Mode with Cleanup
reassemble.sh in=genomes/ out=assembled.fa k=31 tempdir=/tmp/reassemble delete=t
Description: Each genome assembled to separate temporary file in /tmp/reassemble, then concatenated to assembled.fa. Temporary files deleted after successful completion. Safer for unreliable storage.
Verbose Assembly with Detailed Output
reassemble.sh in=genomes/ out=assembled.fa k=31 verbose=t
Description: Print Tadpole commands and per-genome status to console during assembly. Shows detailed progress for monitoring and debugging.
Custom K-mer and Assembly Parameters
reassemble.sh in=genomes/ out=assembled.fa k=25 mcs=2 mce=2 mincontig=500
Description: Assemble with k=25 (shorter kmers for lower coverage), mcs/mce=2 (require minimum 2-count kmers), mincontig=500 (filter contigs under 500bp). Useful for noisy or sparse assemblies.
Binning Evaluation Workflow
# Step 1: Assemble pre-binned genomes
reassemble.sh in=binned_genomes/ out=assembled_contigs.fa k=31
# Step 2: Run binning tool on assembled contigs
binning_tool in=assembled_contigs.fa out=predicted_bins.txt
# Step 3: Compare predicted bins against known TaxID labels
Workflow description: Start with metagenomic reads that have already been binned into separate genome files. Run Reassemble to produce individually assembled contigs with TaxID labels. Run the binning tool to predict which contigs belong to the same genome. Compare the predictions against the known TaxIDs to measure binning accuracy.
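Step 3 is left open above because it depends on the binning tool. One possible comparison is per-bin purity, the share of a bin's contigs that carry its most common TaxID. The sketch below rests on two assumptions that are not taken from Reassemble or any particular binner: that predicted_bins.txt holds one contig name and one bin ID per line, separated by a tab, and that each contig name contains a tid_<digits> label. Lines without such a label are skipped.

```shell
# Hypothetical helper: print "<bin><TAB><purity>" for each bin.
# Assumed input: contig_name<TAB>bin_id, with tid_<digits> in the name.
bin_purity() {
    awk -F '\t' '
        match($1, /tid_[0-9]+/) {
            taxid = substr($1, RSTART + 4, RLENGTH - 4)
            count[$2 SUBSEP taxid]++      # contigs per (bin, TaxID) pair
            total[$2]++                   # labeled contigs per bin
        }
        END {
            for (key in count) {
                split(key, part, SUBSEP)
                if (count[key] > best[part[1]]) best[part[1]] = count[key]
            }
            # purity = share of the bin held by its most common TaxID
            for (bin in total) printf "%s\t%.2f\n", bin, best[bin] / total[bin]
        }
    ' "$1" | sort
}
```

Purity is only one side of binning accuracy. A matching completeness figure (how much of each TaxID ends up in a single bin) would need the same counts grouped by TaxID instead of by bin.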
Large Memory Systems
reassemble.sh -Xmx64g in=large_genomes/ out=assembled.fa k=31
Description: Allocate 64GB of memory per genome assembly for very large or complex genomes. Adjust -Xmx based on largest input genome and available system memory.
Memory-Constrained Systems
reassemble.sh -Xmx2g in=genomes/ out=assembled.fa k=25 mcs=2 mce=2
Description: Run on systems with limited memory (2GB per genome) using smaller k-mer size and higher count thresholds. May reduce assembly quality but allows processing on constrained hardware.
Algorithm Details
Individual Assembly Strategy
Reassemble wraps the Tadpole assembler and processes each input genome independently:
- For each input genome file:
  - Extract or detect TaxID from filename or header
  - Run Tadpole assembly with specified k-mer size and parameters
  - Embed TaxID in output contig names
  - Append (or write to temp file) assembled contigs
  - Perform garbage collection to reclaim memory
- If tempdir set, concatenate all temporary files to final output
- Print summary table with assembly statistics per genome
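The control flow of the steps above, including the failfast switch, can be sketched as a driver loop. Everything here is illustrative: run_all and assemble_one are hypothetical names, assemble_one is a stand-in that copies its input instead of calling Tadpole, and TaxID embedding and garbage collection are left out.

```shell
# Stand-in for the per-genome assembly step (NOT Tadpole): it "assembles"
# a readable file by copying it and fails for anything unreadable.
assemble_one() {
    [ -r "$1" ] && cat "$1"
}

# Conceptual driver loop in append mode.
# Usage: run_all <failfast: t|f> <out> <genome>...
run_all() (
    failfast=$1; out=$2; shift 2
    : > "$out"                                # start from an empty output file
    failed=0
    for genome in "$@"; do
        if assemble_one "$genome" >> "$out"; then
            echo "OK   $genome" >&2
        else
            echo "FAIL $genome" >&2
            failed=$((failed + 1))
            [ "$failfast" = t ] && break      # failfast=t: stop at the first failure
        fi
    done
    [ "$failed" -eq 0 ]                       # non-zero exit if any genome failed
)

demo=$(mktemp -d)
printf '>a\nACGT\n' > "$demo/a.fa"
printf '>b\nGGCC\n' > "$demo/b.fa"
run_all f "$demo/out.fa" "$demo/a.fa" "$demo/b.fa"   # logs two OK lines to stderr
cat "$demo/out.fa"                                   # both records, in input order
rm -rf "$demo"
```

In this sketch a step that fails partway could leave partial output in the shared file, which is one way to picture why writing each genome to its own temporary file first can be the safer choice.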
Why Individual Assembly Matters
In co-assembly of multiple genomes, the assembler uses global k-mer graphs containing sequences from all genomes. When genomic regions share similarity (common in metagenomic samples), the assembler can create chimeric contigs spanning multiple genomes. Individual assembly prevents this:
- Co-assembly: K-mer graph contains all genomes → similar regions bridge → chimeric contigs
- Individual assembly: Each genome has isolated k-mer graph → no inter-genome bridges → contigs remain pure
- Binning impact: Chimeric contigs confound binning tool evaluation by mixing signals from multiple sources
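The bridging effect can be made concrete with a toy check that lists the k-mers two sequences share. In a single combined k-mer graph, each shared k-mer is a point where a path could pass from one genome into the other. This is a simplified sketch: kmers and shared_kmers are hypothetical helpers, only the forward strand is considered, and a real assembler also handles reverse complements and coverage.

```shell
# Print every k-mer of a sequence, one per line (forward strand only).
kmers() {
    awk -v k="$1" -v s="$2" 'BEGIN {
        for (i = 1; i + k - 1 <= length(s); i++) print substr(s, i, k)
    }'
}

# Print the k-mers that two sequences have in common.
# Usage: shared_kmers <k> <seq1> <seq2>
shared_kmers() (
    a=$(mktemp); b=$(mktemp)
    kmers "$1" "$2" | sort -u > "$a"
    kmers "$1" "$3" | sort -u > "$b"
    comm -12 "$a" "$b"                 # lines present in both sorted lists
    rm -f "$a" "$b"
)

# Two made-up "genomes" that share the 5-base stretch CGTAC.
shared_kmers 4 AAACGTACCC GGGCGTACTT
# prints:
# CGTA
# GTAC
```

When each genome is assembled on its own, the two k-mer sets never meet, so no such crossing point exists.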
Contig ID Management
To maintain globally unique contig identifiers across assembled genomes:
- Contig IDs are automatically offset based on genome processing order
- Example: First genome contigs named contig_0_1, contig_0_2, etc.; second genome contig_1_1, contig_1_2, etc.
- This prevents ID collisions when concatenating results from multiple independent assemblies
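The naming in the example can be reproduced on any FASTA file with a short awk filter. This is a sketch of the resulting names only; renumber_contigs is a hypothetical helper, and how Reassemble applies the offset internally is not shown here.

```shell
# Hypothetical helper: rewrite FASTA headers as contig_<genome index>_<n>,
# numbering contigs from 1 within the genome.
# Usage: renumber_contigs <genome index> <fasta file>
renumber_contigs() {
    awk -v g="$1" '
        /^>/ { n++; print ">contig_" g "_" n; next }   # replace each header
             { print }                                  # pass sequence lines through
    ' "$2"
}

demo=$(mktemp)
printf '>x\nAC\n>y\nGT\n' > "$demo"
renumber_contigs 1 "$demo"
# prints:
# >contig_1_1
# AC
# >contig_1_2
# GT
rm -f "$demo"
```

Because the genome index differs for every input, two genomes that both produce a first contig still end up with distinct names.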
Sequential Processing Benefits
Reassemble processes genomes sequentially rather than in parallel:
- Memory efficiency: Each assembly's memory is released before the next genome starts
- Scalability: Can process 1000+ genomes on a system with only enough memory for one assembly at a time
- Simplicity: No thread synchronization overhead or race conditions
- Predictability: Peak memory is set by the largest single genome, not by the number of genomes
Workflow Integration
Input Preprocessing
Reassemble expects relatively clean input genomes. For raw sequencing reads:
- Pre-process with BBDuk to remove adapters and low-quality bases
- Pre-bin reads using existing binning tools or ground-truth labels
- Save each bin to separate FASTA file
- Feed to Reassemble for individual assembly
Output Usage
Assembled contigs can be used for:
- Binning evaluation: Feed to binning tools and compare predictions against TaxID labels
- Functional annotation: Annotate contigs to understand bin composition
- Quality assessment: Measure assembly completeness, contamination, and fragmentation
- Reference database: Use assembled genomes as reference for downstream analysis
Relationship to Other Tools
Reassemble fits into broader assembly workflows:
- Tadpole: The underlying de Bruijn graph assembler (Reassemble is a wrapper)
- BBDuk: Use for preprocessing raw reads before binning
- BBMap: Align reads to assembled contigs for validation
- Binning tools: Evaluate using Reassemble output with embedded TaxID labels
Support
For questions and support:
- Email: bbushnell@lbl.gov
- Documentation: bbmap.org
- Tadpole documentation: For advanced assembly parameters and k-mer selection strategies