ExplodeTree

Basic Usage

explodetree.sh in=<file> out=<path> tree=<file>

ExplodeTree processes FASTA files annotated with taxonomic information in the headers and creates a hierarchical directory structure with separate files for each taxonomic node. This is particularly useful for organizing RefSeq data or other taxonomically annotated sequence collections.

Parameters

Parameters control input/output locations, directory structure creation, and processing options.

Input/Output Parameters

in=: A FASTA file annotated with taxonomic data in headers, such as modified RefSeq. The headers must contain taxonomic information that can be parsed by the taxonomic tree.
out= (or path= or outpath=): Root directory path where the taxonomic tree structure will be created. Each taxonomic node will get its own subdirectory under this path.
tree= (or taxtree=): Location of taxonomic tree file. Use "auto" to automatically use the default BBTools taxonomic tree file.
prefix=: Prefix string to add to output filenames. Default: empty string. Each output file will be named as: prefix + taxid + ".fa.gz"
results= (or result=): Output file to write a per-node summary of the sequences processed. Each line contains the taxid, the total sequence length accumulated for that node, the taxonomic level, and the taxonomic name.
extin=: Override input file extension for file format detection.

Processing Parameters

makedirectories=true (or mkdirs= or mkdir=): Create directory structure for the taxonomic tree. When true, creates all necessary directories and writes .name files containing the full taxonomic names.
verbose=false: Print verbose processing information.
maxreads=-1: Process only this many input sequences. Default: -1 (process all sequences).
overwrite=true: Overwrite existing output files. When false, appends to existing files for each taxonomic node.

Java Parameters

-Xmx: Set Java's memory usage, overriding autodetection. -Xmx20g will specify 20 gigs of RAM, and -Xmx200m will specify 200 megs. The max is typically 85% of physical memory.
-eoom: This flag will cause the process to exit if an out-of-memory exception occurs. Requires Java 8u92+.
-da: Disable assertions.

Examples

Basic Usage

explodetree.sh in=refseq.fa out=organized_sequences/ tree=auto

Split RefSeq FASTA file into taxonomic directory structure using the default BBTools taxonomic tree.

Custom Tree with Results File

explodetree.sh in=sequences.fa out=taxonomy/ tree=custom.tree results=summary.txt

Use a custom taxonomic tree and generate a summary file showing sequence counts per taxonomic node.

With Filename Prefix

explodetree.sh in=data.fa out=organized/ tree=auto prefix=genome_ makedirectories=true

Add "genome_" prefix to all output filenames and ensure directory structure is created with taxonomic names.

Process Limited Number of Sequences

explodetree.sh in=large_dataset.fa out=test_output/ tree=auto maxreads=1000

Process only the first 1000 sequences for testing or quick analysis.

Algorithm Details

Core Processing Strategy

ExplodeTree implements a line-by-line streaming parser using ByteFile.nextLine() to process FASTA files:

Header Detection: Identifies FASTA headers by checking whether line[0]=='>', then resolves the taxonomic node by passing the header string, minus the leading '>' character, to tree.parseNodeFromHeader()
Node State Management: Maintains currentNode reference and currentSize counter; switches output files only when taxonomic node changes
File Handle Optimization: Reuses ByteStreamWriter instances when consecutive sequences map to same taxonomic node to minimize file I/O overhead
Automatic Compression: All output files use .fa.gz format with integrated gzip compression via ByteStreamWriter
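
The steps above can be illustrated with a minimal Python sketch. It is not the BBTools implementation: explode_fasta and node_for_header are hypothetical names, Python's gzip module stands in for ByteStreamWriter, and appending when a node reappears later in the input is an assumption made here so that earlier records are not lost.

```python
import gzip
import os
from typing import Callable, Iterable, Optional


def explode_fasta(lines: Iterable[str],
                  node_for_header: Callable[[str], Optional[str]],
                  out_dir: str,
                  prefix: str = "") -> int:
    """Write each record to out_dir/<prefix><node>.fa.gz; return records written."""
    current_node = None   # node of the record currently being read
    writer = None         # the single open output handle
    seen = set()          # nodes that already have a file from this run
    written = 0
    try:
        for line in lines:
            line = line.rstrip("\n")
            if not line:
                continue
            if line.startswith(">"):
                node = node_for_header(line[1:])  # header minus '>'
                if node != current_node:
                    # Node changed: close the old file before opening a new one.
                    if writer is not None:
                        writer.close()
                        writer = None
                    if node is not None:
                        path = os.path.join(out_dir, f"{prefix}{node}.fa.gz")
                        # Assumption: append if this node was written earlier.
                        mode = "at" if node in seen else "wt"
                        writer = gzip.open(path, mode)
                        seen.add(node)
                    current_node = node
                if writer is not None:
                    written += 1
            # Lines of records whose header did not resolve are dropped.
            if writer is not None:
                writer.write(line + "\n")
    finally:
        if writer is not None:
            writer.close()
    return written
```

Because the writer is only replaced when the node changes, input that is already grouped by taxon opens each output file once.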

Directory Structure Creation

The makeDirectoryTree() method creates hierarchical filesystem structure using TaxNode traversal:

Node Iteration: Iterates through tree.nodes array containing all TaxNode objects from loaded taxonomic tree
Path Generation: Uses tree.toDir(node, root) to convert taxonomic hierarchy into filesystem directory path
Directory Creation: Creates directories using File.mkdirs() with existence checking via File.exists()
Name File Generation: Writes [taxid].name files containing the full taxonomic names, using node.simpleName() and ReadWrite.writeString()
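
As a rough illustration of this traversal, the Python sketch below builds one directory per node and drops a .name file into it. The Tree mapping, to_dir, and make_directory_tree are hypothetical stand-ins for TaxTree, tree.toDir(), and makeDirectoryTree(). Using numeric taxids as directory names, and None as the root's parent, are assumptions of this sketch. The real layout is whatever tree.toDir() produces.

```python
import os
from typing import Dict, Optional, Tuple

# Hypothetical stand-in for TaxTree: taxid -> (name, parent taxid or None at the root).
Tree = Dict[int, Tuple[str, Optional[int]]]


def to_dir(tree: Tree, taxid: int, root: str) -> str:
    """Build root/<ancestor>/.../<taxid> by walking parent links up to the root."""
    parts = []
    node: Optional[int] = taxid
    while node is not None:
        parts.append(str(node))
        node = tree[node][1]
    return os.path.join(root, *reversed(parts))


def make_directory_tree(tree: Tree, root: str, write_names: bool = True) -> None:
    """Create a directory for every node and optionally write <taxid>.name into it."""
    for taxid, (name, _parent) in tree.items():
        path = to_dir(tree, taxid, root)
        os.makedirs(path, exist_ok=True)  # no error if the directory already exists
        if write_names:
            with open(os.path.join(path, f"{taxid}.name"), "w") as fh:
                fh.write(name)
```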

Memory-Optimized Processing

The processInner() method implements memory-efficient streaming with minimal object allocation:

Line-by-Line Processing: Uses ByteFile.nextLine() iterator pattern to avoid loading entire FASTA file into memory
Dynamic File Management: Opens ByteStreamWriter instances on demand and closes the previous writer with bsw.poisonAndWait() for proper resource cleanup
State-Based Optimization: Only creates new output files when currentNode changes, keeping a single ByteStreamWriter active for the current taxonomic node
Counter Accumulation: Tracks sequence lengths in currentSize variable, updating LinkedHashMap only on taxonomic node transitions
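
The counter accumulation described above might look like the following Python sketch. accumulate_lengths is a hypothetical name, and the input is simplified to (node, line_length) pairs, one per sequence line, with None marking records whose header did not resolve. A plain dict plays the role of the LinkedHashMap because it also preserves insertion order.

```python
from typing import Dict, Iterable, Optional, Tuple


def accumulate_lengths(lines: Iterable[Tuple[Optional[str], int]]) -> Dict[str, int]:
    """Sum sequence-line lengths per node, touching the map only on node changes."""
    totals: Dict[str, int] = {}
    current_node: Optional[str] = None
    current_size = 0
    for node, length in lines:
        if node != current_node:
            # Flush the running counter for the node we are leaving.
            if current_node is not None:
                totals[current_node] = totals.get(current_node, 0) + current_size
            current_node, current_size = node, 0
        current_size += length
    # Flush whatever is left when the input ends.
    if current_node is not None:
        totals[current_node] = totals.get(current_node, 0) + current_size
    return totals
```

Keeping the running total in a local variable means the map is updated once per run of same-node lines rather than once per line.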

Taxonomic Tree Integration

ExplodeTree relies on a TaxTree data structure loaded via TaxTree.loadTaxTree():

Tree Initialization: Loads the taxonomic tree from file using TaxTree.loadTaxTree(taxTreeFile, outstream, true, false); the two boolean arguments appear to control name hashing rather than validation
Header Parsing: Extracts TaxNode from sequence headers using tree.parseNodeFromHeader(new String(line, 1, line.length-1), false)
Directory Mapping: Converts TaxNode to filesystem path using tree.toDir(tn, outPath) method
Auto-Detection: Uses TaxTree.defaultTreeFile() when tree parameter is "auto"
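
To make the header-to-node step concrete, here is a small Python sketch. It assumes headers carry a "tid|<number>|" tag, which is one convention used for taxonomically annotated FASTA. The real tree.parseNodeFromHeader() may accept other formats. parse_node_from_header and names_by_taxid are hypothetical names used only for illustration.

```python
import re
from typing import Dict, Optional

# Assumed header convention: ...tid|<digits>|...
_TID = re.compile(r"tid\|(\d+)")


def parse_node_from_header(header: str, names_by_taxid: Dict[int, str]) -> Optional[int]:
    """Return the taxid named in the header if the tree knows it, else None."""
    match = _TID.search(header)
    if match is None:
        return None              # no recognizable tag in the header
    taxid = int(match.group(1))
    # A tag that names an id missing from the loaded tree also fails to resolve.
    return taxid if taxid in names_by_taxid else None
```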

Statistics and Results Generation

When resultsFile is specified, generates tab-delimited output using TextStreamWriter:

Per-Node Metrics: LinkedHashMap<TaxNode, Long> tracks total sequence length processed for each taxonomic node
Results Format: Four tab-separated columns per node: tn.id, the accumulated value stored in the map (data), tn.levelStringExtended(false), and tn.name
Performance Metrics: Calculates processing rates using elapsed nanoseconds: rpnano=readsProcessed/(double)(t.elapsed)
Comprehensive Counters: Tracks readsProcessed, linesProcessed, basesProcessed, readsOut, linesOut, basesOut
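
A minimal Python sketch of the four-column output follows. format_results and the nodes mapping are hypothetical. The second column is whatever total was accumulated for the node, and the level strings shown in the usage are placeholders rather than the exact text produced by tn.levelStringExtended(false).

```python
from typing import Dict, Tuple


def format_results(totals: Dict[int, int],
                   nodes: Dict[int, Tuple[str, str]]) -> str:
    """One line per node: taxid, accumulated total, level, name (tab-separated)."""
    lines = []
    for taxid, total in totals.items():   # dicts keep insertion order
        level, name = nodes[taxid]
        lines.append(f"{taxid}\t{total}\t{level}\t{name}\n")
    return "".join(lines)
```

For example, format_results({562: 7}, {562: ("species", "Escherichia coli")}) yields a single line with the four fields separated by tabs.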

Output Format

Directory Structure

ExplodeTree creates a hierarchical directory structure based on the taxonomic tree:

out_directory/
├── Bacteria/
│   ├── Proteobacteria/
│   │   ├── Gammaproteobacteria/
│   │   │   ├── 12345.fa.gz
│   │   │   └── 12345.name
│   │   └── ...
│   └── ...
├── Archaea/
│   └── ...
└── Eukaryota/
    └── ...

Output Files

Sequence Files: [prefix][taxid].fa.gz - Compressed FASTA files containing sequences for each taxonomic node
Name Files: [taxid].name - Text files containing the full taxonomic name (when makedirectories=true)
Results File: Tab-separated file with columns: taxid, total sequence length, taxonomic level, taxonomic name
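
For downstream checks, the output tree can be walked with a few lines of Python. The sketch below is only an illustration, with count_records as a hypothetical helper: it finds every .fa.gz file under the output root and counts its FASTA records by counting header lines.

```python
import gzip
import os
from typing import Dict


def count_records(out_dir: str) -> Dict[str, int]:
    """Map each .fa.gz path (relative to out_dir) to its number of FASTA records."""
    counts: Dict[str, int] = {}
    for folder, _dirs, files in os.walk(out_dir):
        for fname in files:
            if not fname.endswith(".fa.gz"):
                continue          # skip .name files and anything else
            path = os.path.join(folder, fname)
            with gzip.open(path, "rt") as fh:
                n = sum(1 for line in fh if line.startswith(">"))
            counts[os.path.relpath(path, out_dir)] = n
    return counts
```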

Performance Considerations

Memory Usage

Base Memory: Default allocation is 2 GB (-Xmx2000m), set by the z="-Xmx2000m" variable
TaxTree Storage: Complete tree.nodes array loaded into memory containing all TaxNode objects from taxonomic database
Node Tracking: LinkedHashMap<TaxNode, Long> overhead scales with number of unique taxonomic nodes encountered
ByteFile Threading: Sets ByteFile.FORCE_MODE_BF2=true when Shared.threads()>2 for multi-threaded file reading

Disk I/O Optimization

Directory Pre-creation: makeDirectoryTree() creates entire directory structure using File.mkdirs() before processing sequences
Compression Integration: All output uses FileFormat.testOutput() with .fa.gz extension for automatic gzip compression
File Handle Management: Reuses ByteStreamWriter instances when currentNode==previousNode to minimize file open/close operations
Pigz Integration: Enables ReadWrite.USE_PIGZ=true and ReadWrite.USE_UNPIGZ=true for parallel compression

Processing Efficiency

Stream-Based Processing: The ByteFile.nextLine() iterator avoids loading the full file, so inputs larger than available RAM can be processed
State Optimization: Only switches output files on taxonomic node changes, minimizing file system operations
Counter Batching: Accumulates sequence lengths in currentSize; updates LinkedHashMap only on node transitions
Performance Metrics: Calculates real-time processing rates: Tools.format("%.2fk reads/sec", rpnano*1000000)
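
The unit conversion behind the rate formula is easy to misread, so here is the arithmetic as a short Python sketch (function names are hypothetical). Reads per nanosecond times 1e9 gives reads per second, and dividing by 1e3 to express thousands leaves a net factor of 1e6.

```python
def reads_per_nano(reads_processed: int, elapsed_ns: int) -> float:
    """Mirror of rpnano=readsProcessed/(double)(t.elapsed)."""
    return reads_processed / float(elapsed_ns)


def format_rate(rpnano: float) -> str:
    """reads/ns * 1e9 = reads/s; reads/s / 1e3 = k reads/s, so multiply by 1e6."""
    return "%.2fk reads/sec" % (rpnano * 1_000_000)
```

For instance, 250,000 reads in 2 seconds (2,000,000,000 ns) is 125,000 reads per second, which formats as 125.00k reads/sec.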

Common Issues

Taxonomic Header Parsing

Input sequences must have FASTA headers compatible with the tree.parseNodeFromHeader() method. Relevant behavior:

Header Processing: Strips '>' character and passes remaining string to parseNodeFromHeader()
Node Resolution: Returns null TaxNode for unparseable headers, causing sequences to be skipped
Dependency: Parsing success depends on loaded TaxTree containing matching taxonomic identifiers
Error Handling: Sequences with tn==null are not written to any output file
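
Since unresolved sequences are dropped without an output file, it can help to list them before a long run. The Python sketch below is one way to do that. unresolved_headers is a hypothetical helper, and resolve is any callable that returns None for headers it cannot map to a node.

```python
from typing import Callable, Iterable, List, Optional


def unresolved_headers(lines: Iterable[str],
                       resolve: Callable[[str], Optional[object]]) -> List[str]:
    """Return the headers (without '>') for which resolve() gives None."""
    missing: List[str] = []
    for line in lines:
        if line.startswith(">"):
            header = line[1:].rstrip("\n")   # same stripping as the tool describes
            if resolve(header) is None:
                missing.append(header)
    return missing
```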

File System Limitations

Large taxonomic datasets trigger extensive directory creation via makeDirectoryTree():

Directory Count: Creates one directory per TaxNode in tree.nodes array
Path Depth: Hierarchical paths can exceed filesystem limits on deep taxonomic trees
Inode Usage: Each [taxid].name file plus [taxid].fa.gz file consumes filesystem inodes
File Handle Limits: ByteStreamWriter instances may exceed system ulimit for open files
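
To gauge path depth before creating anything on disk, the longest output path can be estimated from parent links alone. This Python sketch assumes directories are named by taxid and that the root node has None as its parent. Both are simplifications, and longest_path is a hypothetical helper. The result can be compared against the path-length limit of the target filesystem.

```python
from typing import Dict, Optional


def longest_path(parents: Dict[int, Optional[int]], root: str) -> str:
    """Return the longest root/<ancestors>/<taxid>/<taxid>.fa.gz path in the tree."""
    worst = ""
    for taxid in parents:
        parts = []
        node: Optional[int] = taxid
        while node is not None:          # walk up until the root (parent None)
            parts.append(str(node))
            node = parents[node]
        path = "/".join([root.rstrip("/")] + parts[::-1] + [f"{taxid}.fa.gz"])
        if len(path) > len(worst):
            worst = path
    return worst
```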

Memory and Performance Issues

Memory usage scales with taxonomic diversity and TaxTree size:

TaxTree Loading: Complete tree.nodes array must fit in heap memory before processing begins
LinkedHashMap Growth: The nodes LinkedHashMap grows with the number of unique taxonomic nodes processed
ByteFile Threading: Multi-threaded mode (FORCE_MODE_BF2) increases memory overhead but improves I/O performance
GC Pressure: High taxonomic diversity, especially when consecutive sequences alternate between nodes, causes frequent ByteStreamWriter creation/destruction

Support

For questions and support:

Email: bbushnell@lbl.gov
Documentation: bbmap.org