ExplodeTree

Script: explodetree.sh
Package: tax
Class: ExplodeTree.java

Constructs a directory and file tree of sequences corresponding to a taxonomic tree. Takes a FASTA file with taxonomic annotations and splits sequences into separate files organized in a hierarchical directory structure that mirrors the taxonomic tree.

Basic Usage

explodetree.sh in=<file> out=<path> tree=<file>

ExplodeTree processes FASTA files annotated with taxonomic information in the headers and creates a hierarchical directory structure with separate files for each taxonomic node. This is particularly useful for organizing RefSeq data or other taxonomically annotated sequence collections.

Parameters

Parameters control input/output locations, directory structure creation, and processing options.

Input/Output Parameters

in=
A FASTA file annotated with taxonomic data in headers, such as modified RefSeq. The headers must contain taxonomic information that can be parsed by the taxonomic tree.
out= (or path= or outpath=)
Root directory path where the taxonomic tree structure will be created. Each taxonomic node will get its own subdirectory under this path.
tree= (or taxtree=)
Location of taxonomic tree file. Use "auto" to automatically use the default BBTools taxonomic tree file.
prefix=
Prefix string to add to output filenames. Default: empty string. Each output file is named prefix + taxid + ".fa.gz"; for example, prefix=genome_ and taxid 12345 give genome_12345.fa.gz.
results= (or result=)
Output file to write a summary of sequences processed per taxonomic node. Contains taxid, sequence count, taxonomic level, and taxonomic name.
extin=
Override input file extension for file format detection.

Processing Parameters

makedirectories=true (or mkdirs= or mkdir=)
Create directory structure for the taxonomic tree. When true, creates all necessary directories and writes .name files containing the full taxonomic names.
verbose=false
Print verbose processing information.
maxreads=-1
Process only this many input sequences. Default: -1 (process all sequences).
overwrite=true
Overwrite existing output files. When false, appends to existing files for each taxonomic node.

Java Parameters

-Xmx
Set Java's memory usage, overriding autodetection. -Xmx20g will specify 20 gigs of RAM, and -Xmx200m will specify 200 megs. The max is typically 85% of physical memory.
-eoom
This flag will cause the process to exit if an out-of-memory exception occurs. Requires Java 8u92+.
-da
Disable assertions.

Examples

Basic Usage

explodetree.sh in=refseq.fa out=organized_sequences/ tree=auto

Split RefSeq FASTA file into taxonomic directory structure using the default BBTools taxonomic tree.

Custom Tree with Results File

explodetree.sh in=sequences.fa out=taxonomy/ tree=custom.tree results=summary.txt

Use a custom taxonomic tree and generate a summary file showing sequence counts per taxonomic node.

With Filename Prefix

explodetree.sh in=data.fa out=organized/ tree=auto prefix=genome_ makedirectories=true

Add "genome_" prefix to all output filenames and ensure directory structure is created with taxonomic names.

Process Limited Number of Sequences

explodetree.sh in=large_dataset.fa out=test_output/ tree=auto maxreads=1000

Process only the first 1000 sequences for testing or quick analysis.

Algorithm Details

Core Processing Strategy

ExplodeTree processes FASTA files with a line-by-line streaming parser built on ByteFile.nextLine().
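
The streaming approach can be illustrated with a minimal Python sketch. It is not the ExplodeTree code: ByteFile.nextLine() is replaced by iterating over any sequence of lines, the function name split_by_taxid is hypothetical, and the header tag tid|<number>| is an assumed annotation style. Records whose header has no parsable taxid are collected under None, which is a choice of this sketch rather than documented tool behavior.

```python
import re
from collections import defaultdict

# Assumption: headers carry the taxid as "tid|<number>|"; adjust the pattern
# to match the annotation style of your input.
TAXID_PATTERN = re.compile(r"tid\|(\d+)\|")


def split_by_taxid(lines):
    """Route each FASTA record to a bucket keyed by the taxid in its header."""
    buckets = defaultdict(list)
    current = None  # bucket receiving the lines of the current record
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            match = TAXID_PATTERN.search(line)
            # Headers without a parsable taxid go to the None bucket.
            key = int(match.group(1)) if match else None
            current = buckets[key]
        if current is not None:
            # Sequence lines follow their most recent header.
            current.append(line)
    return dict(buckets)
```

The sketch keeps records in lists so they are easy to inspect; a real splitter would write each line to the node's output file as it is read.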

Directory Structure Creation

The makeDirectoryTree() method creates the hierarchical filesystem structure by traversing TaxNode objects.

Memory-Optimized Processing

The processInner() method streams the input memory-efficiently, with minimal object allocation.

Taxonomic Tree Integration

ExplodeTree loads its TaxTree data structure via TaxTree.loadTaxTree().

Statistics and Results Generation

When resultsFile is specified, ExplodeTree writes tab-delimited output using TextStreamWriter.
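
For downstream use, a results file could be read back with a few lines of Python. This is a sketch under assumptions: the column order is taken to follow the order listed for the results= parameter (taxid, sequence count, taxonomic level, taxonomic name), lines starting with "#" are treated as comments, and the names NodeCount and read_results are hypothetical.

```python
from typing import NamedTuple


class NodeCount(NamedTuple):
    taxid: int
    count: int
    level: str
    name: str


def read_results(lines):
    """Parse tab-delimited result lines into NodeCount records."""
    rows = []
    for line in lines:
        line = line.rstrip("\n")
        # Assumption: blank lines and "#" comment lines carry no data.
        if not line or line.startswith("#"):
            continue
        # Assumed column order: taxid, count, level, name.
        taxid, count, level, name = line.split("\t")[:4]
        rows.append(NodeCount(int(taxid), int(count), level, name))
    return rows
```

Passing an open file handle works as well as a list of strings, for example read_results(open("summary.txt")) for the results file named in the examples above. If the real column order differs, only the unpacking line needs to change.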

Output Format

Directory Structure

ExplodeTree creates a hierarchical directory structure based on the taxonomic tree:

out_directory/
├── Bacteria/
│   ├── Proteobacteria/
│   │   ├── Gammaproteobacteria/
│   │   │   ├── 12345.fa.gz
│   │   │   └── 12345.name
│   │   └── ...
│   └── ...
├── Archaea/
│   └── ...
└── Eukaryota/
    └── ...
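
The layout above, combined with the prefix + taxid + ".fa.gz" naming rule, can be expressed as a small path-building sketch in Python. The helper names node_file_path and node_name_path are hypothetical, and the lineage is passed in as a plain list of names rather than looked up in a TaxTree. Whether the prefix also applies to .name files is not stated here, so the sketch leaves it off.

```python
from pathlib import PurePosixPath


def node_file_path(out_root, lineage, taxid, prefix=""):
    """Build the sequence file path for a node from its lineage names."""
    directory = PurePosixPath(out_root).joinpath(*lineage)
    return directory / f"{prefix}{taxid}.fa.gz"


def node_name_path(out_root, lineage, taxid):
    """Build the path of the .name file that sits next to the sequences."""
    # Assumption: the .name file is named by taxid alone, as in the layout.
    return PurePosixPath(out_root).joinpath(*lineage) / f"{taxid}.name"


if __name__ == "__main__":
    lineage = ["Bacteria", "Proteobacteria", "Gammaproteobacteria"]
    print(node_file_path("out_directory", lineage, 12345))
    print(node_name_path("out_directory", lineage, 12345))
```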

Output Files

Performance Considerations

Memory Usage

Disk I/O Optimization

Processing Efficiency

Common Issues

Taxonomic Header Parsing

Input sequences must have FASTA headers that the tree.parseNodeFromHeader() method can parse.
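
A quick pre-flight check can surface problem headers before a long run. The Python sketch below does not call parseNodeFromHeader(); it only looks for an assumed tid|<number>| tag, so treat the pattern as a placeholder for whatever annotation style your input uses. The function name unparsable_headers and the limit parameter are hypothetical.

```python
import re

# Assumption: a parsable header contains "tid|<number>|".
TAXID_PATTERN = re.compile(r"tid\|(\d+)\|")


def unparsable_headers(lines, limit=10):
    """Return up to `limit` headers that lack the assumed taxid tag."""
    missing = []
    for line in lines:
        if line.startswith(">") and not TAXID_PATTERN.search(line):
            missing.append(line.rstrip("\n"))
            if len(missing) >= limit:
                break  # enough examples to diagnose the problem
    return missing
```

An empty return value means every header seen carried the tag; stopping at a limit keeps the check fast on large inputs.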

File System Limitations

Large taxonomic datasets trigger extensive directory creation via makeDirectoryTree().
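
To gauge how large a tree a run actually produced, the output root can be walked after the fact. This Python sketch is independent of ExplodeTree and uses only the standard library; the function name summarize_tree is hypothetical.

```python
import os


def summarize_tree(root):
    """Count directories, files, and the deepest nesting level under root."""
    dirs = files = max_depth = 0
    for current, subdirs, filenames in os.walk(root):
        dirs += len(subdirs)
        files += len(filenames)
        # Depth of the current directory relative to root (root itself is 0).
        rel = os.path.relpath(current, root)
        depth = 0 if rel == "." else rel.count(os.sep) + 1
        max_depth = max(max_depth, depth)
    return {"directories": dirs, "files": files, "max_depth": max_depth}
```

Running it on a small maxreads= test output first gives a rough sense of scale before processing a full dataset.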

Memory and Performance Issues

Memory usage scales with taxonomic diversity and TaxTree size.

Support

For questions and support: