ExplodeTree
Constructs a directory and file tree of sequences corresponding to a taxonomic tree. Takes a FASTA file with taxonomic annotations and splits sequences into separate files organized in a hierarchical directory structure that mirrors the taxonomic tree.
Basic Usage
explodetree.sh in=<file> out=<path> tree=<file>
ExplodeTree processes FASTA files annotated with taxonomic information in the headers and creates a hierarchical directory structure with separate files for each taxonomic node. This is particularly useful for organizing RefSeq data or other taxonomically annotated sequence collections.
Parameters
Parameters control input/output locations, directory structure creation, and processing options.
Input/Output Parameters
- in=
- A FASTA file annotated with taxonomic data in headers, such as modified RefSeq. The headers must contain taxonomic information that can be parsed by the taxonomic tree.
- out= (or path= or outpath=)
- Root directory path where the taxonomic tree structure will be created. Each taxonomic node will get its own subdirectory under this path.
- tree= (or taxtree=)
- Location of taxonomic tree file. Use "auto" to automatically use the default BBTools taxonomic tree file.
- prefix=
- Prefix string to add to output filenames. Default: empty string. Each output file will be named as: prefix + taxid + ".fa.gz"
- results= (or result=)
- Output file to write a per-node summary of the sequences processed. Contains taxid, total sequence length, taxonomic level, and taxonomic name.
- extin=
- Override input file extension for file format detection.
Processing Parameters
- makedirectories=true (or mkdirs= or mkdir=)
- Create directory structure for the taxonomic tree. When true, creates all necessary directories and writes .name files containing the full taxonomic names.
- verbose=false
- Print verbose processing information.
- maxreads=-1
- Process only this many input sequences. Default: -1 (process all sequences).
- overwrite=true
- Overwrite existing output files. When false, appends to existing files for each taxonomic node.
Java Parameters
- -Xmx
- Set Java's memory usage, overriding autodetection. -Xmx20g will specify 20 gigs of RAM, and -Xmx200m will specify 200 megs. The max is typically 85% of physical memory.
- -eoom
- This flag will cause the process to exit if an out-of-memory exception occurs. Requires Java 8u92+.
- -da
- Disable assertions.
Examples
Basic Usage
explodetree.sh in=refseq.fa out=organized_sequences/ tree=auto
Split RefSeq FASTA file into taxonomic directory structure using the default BBTools taxonomic tree.
Custom Tree with Results File
explodetree.sh in=sequences.fa out=taxonomy/ tree=custom.tree results=summary.txt
Use a custom taxonomic tree and generate a summary file showing sequence counts per taxonomic node.
With Filename Prefix
explodetree.sh in=data.fa out=organized/ tree=auto prefix=genome_ makedirectories=true
Add "genome_" prefix to all output filenames and ensure directory structure is created with taxonomic names.
Process Limited Number of Sequences
explodetree.sh in=large_dataset.fa out=test_output/ tree=auto maxreads=1000
Process only the first 1000 sequences for testing or quick analysis.
Algorithm Details
Core Processing Strategy
ExplodeTree implements a line-by-line streaming parser using ByteFile.nextLine() to process FASTA files:
- Header Detection: Identifies FASTA headers by checking whether line[0]=='>', then extracts the taxonomic node by passing the header string, minus the '>' character, to tree.parseNodeFromHeader()
- Node State Management: Maintains currentNode reference and currentSize counter; switches output files only when taxonomic node changes
- File Handle Optimization: Reuses ByteStreamWriter instances when consecutive sequences map to same taxonomic node to minimize file I/O overhead
- Automatic Compression: All output files use .fa.gz format with integrated gzip compression via ByteStreamWriter
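The strategy above can be sketched in Python. This is an illustration under stated assumptions, not the Java implementation: explode and taxid_of are hypothetical names, the output goes to a single flat directory rather than the nested tree, and reopening a file in append mode when a taxid reappears is a choice made for the sketch.

```python
import gzip
import os

def explode(lines, out_dir, taxid_of, prefix=""):
    """Stream FASTA lines into <out_dir>/<prefix><taxid>.fa.gz files.

    taxid_of: callable mapping header text (without '>') to a taxid or None.
    Returns the set of taxids written.
    """
    os.makedirs(out_dir, exist_ok=True)
    current, writer, seen = None, None, set()
    try:
        for line in lines:
            line = line.rstrip("\n")
            if line.startswith(">"):                # header detection
                taxid = taxid_of(line[1:])
                if taxid != current:                # switch files only on a node change
                    if writer is not None:
                        writer.close()
                        writer = None
                    current = taxid
                    if taxid is not None:
                        path = os.path.join(out_dir, f"{prefix}{taxid}.fa.gz")
                        # Sketch choice: append if this taxid was already written in this run
                        writer = gzip.open(path, "at" if taxid in seen else "wt")
                        seen.add(taxid)
            if writer is not None:                  # records with no node are skipped
                writer.write(line + "\n")
    finally:
        if writer is not None:
            writer.close()
    return seen
```

Consecutive records with the same taxid share one open writer, so sorted or grouped input opens each file once.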
Directory Structure Creation
The makeDirectoryTree() method creates hierarchical filesystem structure using TaxNode traversal:
- Node Iteration: Iterates through tree.nodes array containing all TaxNode objects from loaded taxonomic tree
- Path Generation: Uses tree.toDir(node, root) to convert taxonomic hierarchy into filesystem directory path
- Directory Creation: Creates directories using File.mkdirs() with existence checking via File.exists()
- Name File Generation: Creates [taxid].name files using node.simpleName() and ReadWrite.writeString() containing full taxonomic names
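As a rough Python sketch of the steps above: the NODES table, to_dir, and make_directory_tree are hypothetical stand-ins for tree.nodes, tree.toDir(), and makeDirectoryTree(). Directory components are named after node names here to match the Directory Structure diagram in this document; how tree.toDir() actually names them is an assumption.

```python
import os

# Hypothetical miniature tree: taxid -> (parent taxid, name). The root is its own parent.
NODES = {
    1: (1, "root"),
    2: (1, "Bacteria"),
    1224: (2, "Proteobacteria"),
    1236: (1224, "Gammaproteobacteria"),
}

def to_dir(taxid, root, nodes=NODES):
    """Build the directory path for a node by walking parent links up to the root."""
    parts = []
    while nodes[taxid][0] != taxid:
        parts.append(nodes[taxid][1])
        taxid = nodes[taxid][0]
    return os.path.join(root, *reversed(parts))

def make_directory_tree(root, nodes=NODES):
    """Create one directory per node and a <taxid>.name file holding its name."""
    for taxid, (_, name) in nodes.items():
        path = to_dir(taxid, root, nodes)
        if not os.path.exists(path):        # existence check before creating
            os.makedirs(path)
        with open(os.path.join(path, f"{taxid}.name"), "w") as f:
            f.write(name + "\n")
```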
Memory-Optimized Processing
The processInner() method implements memory-efficient streaming with minimal object allocation:
- Line-by-Line Processing: Uses ByteFile.nextLine() iterator pattern to avoid loading entire FASTA file into memory
- Dynamic File Management: Opens ByteStreamWriter instances on demand and closes the previous writer with bsw.poisonAndWait() so resources are released
- State-Based Optimization: Creates a new output file only when currentNode changes, keeping a single ByteStreamWriter active at a time
- Counter Accumulation: Tracks sequence lengths in currentSize variable, updating LinkedHashMap only on taxonomic node transitions
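The counter accumulation described above might look like the following Python sketch. The function name and the (taxid, length) input shape are assumptions for illustration, as is summing the totals when a taxid reappears later in the input.

```python
def accumulate_sizes(records):
    """Sum sequence lengths per taxid, touching the map only on node changes.

    records: iterable of (taxid, sequence_length) pairs in input order.
    """
    totals = {}                      # dicts keep insertion order, like a LinkedHashMap
    current, current_size = None, 0
    for taxid, length in records:
        if taxid != current:
            if current is not None:  # flush the finished run into the map
                totals[current] = totals.get(current, 0) + current_size
            current, current_size = taxid, 0
        current_size += length
    if current is not None:          # flush the final run
        totals[current] = totals.get(current, 0) + current_size
    return totals
```

With grouped input, the map is updated once per node rather than once per sequence.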
Taxonomic Tree Integration
ExplodeTree uses a TaxTree data structure loaded via TaxTree.loadTaxTree():
- Tree Initialization: Loads taxonomic tree from file using TaxTree.loadTaxTree(taxTreeFile, outstream, true, false) with validation enabled
- Header Parsing: Extracts TaxNode from sequence headers using tree.parseNodeFromHeader(new String(line, 1, line.length-1), false)
- Directory Mapping: Converts TaxNode to filesystem path using tree.toDir(tn, outPath) method
- Auto-Detection: Uses TaxTree.defaultTreeFile() when tree parameter is "auto"
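A minimal Python sketch of the header-to-node lookup follows. The header convention shown (tid|<number>|...) and the dictionary standing in for the tree are assumptions for this example; the formats accepted by tree.parseNodeFromHeader() are not listed in this document.

```python
def parse_node_from_header(line, nodes):
    """Return the node for a raw FASTA header line (bytes), or None.

    Assumes a hypothetical 'tid|<number>|' header convention.
    nodes: dict mapping taxid -> node, standing in for the loaded tree.
    """
    header = line[1:].decode("ascii", errors="replace")  # drop the leading '>'
    fields = header.split("|")
    if len(fields) < 2 or fields[0] != "tid" or not fields[1].isdigit():
        return None                   # header carries no parseable taxid
    return nodes.get(int(fields[1]))  # None if the tree has no such taxid
```

Both failure modes return None: a header without a taxid, and a taxid that is missing from the tree.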
Statistics and Results Generation
When resultsFile is specified, generates tab-delimited output using TextStreamWriter:
- Per-Node Metrics: LinkedHashMap<TaxNode, Long> tracks total sequence length processed for each taxonomic node
- Results Format: Four-column output: tn.id + data + tn.levelStringExtended(false) + tn.name
- Performance Metrics: Calculates processing rates using elapsed nanoseconds: rpnano=readsProcessed/(double)(t.elapsed)
- Comprehensive Counters: Tracks readsProcessed, linesProcessed, basesProcessed, readsOut, linesOut, basesOut
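The four-column layout can be illustrated with a short Python sketch. write_results and the two dictionaries are hypothetical; the level and name strings in the real tool come from tn.levelStringExtended(false) and tn.name.

```python
import io

def write_results(totals, nodes, out):
    """Write one tab-delimited line per node: taxid, value, level, name.

    totals: taxid -> accumulated value; nodes: taxid -> (level, name).
    """
    for taxid, value in totals.items():
        level, name = nodes[taxid]
        out.write(f"{taxid}\t{value}\t{level}\t{name}\n")

# Usage: write to an in-memory buffer instead of a file
buf = io.StringIO()
write_results({562: 7}, {562: ("species", "Escherichia coli")}, buf)
```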
Output Format
Directory Structure
ExplodeTree creates a hierarchical directory structure based on the taxonomic tree:
out_directory/
├── Bacteria/
│   ├── Proteobacteria/
│   │   ├── Gammaproteobacteria/
│   │   │   ├── 12345.fa.gz
│   │   │   └── 12345.name
│   │   └── ...
│   └── ...
├── Archaea/
│   └── ...
└── Eukaryota/
    └── ...
Output Files
- Sequence Files: [prefix][taxid].fa.gz - Compressed FASTA files containing sequences for each taxonomic node
- Name Files: [taxid].name - Text files containing the full taxonomic name (when makedirectories=true)
- Results File: Tab-separated file with columns: taxid, total_sequence_length, taxonomic_level, taxonomic_name
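To inspect one node's output, a reader along these lines may help. It is a Python sketch rather than part of BBTools: read_node is a hypothetical helper, and it assumes the naming described above ([prefix][taxid].fa.gz next to [taxid].name in the same directory).

```python
import gzip
import os

def read_node(directory, taxid, prefix=""):
    """Return (name, records) for one node directory.

    name is the content of <taxid>.name, or None if the file is absent;
    records is a list of (header, sequence) from <prefix><taxid>.fa.gz.
    """
    name = None
    name_path = os.path.join(directory, f"{taxid}.name")
    if os.path.exists(name_path):
        with open(name_path) as f:
            name = f.read().strip()
    records = []
    with gzip.open(os.path.join(directory, f"{prefix}{taxid}.fa.gz"), "rt") as f:
        header, chunks = None, []
        for line in f:
            line = line.rstrip("\n")
            if line.startswith(">"):
                if header is not None:      # finish the previous record
                    records.append((header, "".join(chunks)))
                header, chunks = line[1:], []
            elif header is not None:
                chunks.append(line)
        if header is not None:              # finish the last record
            records.append((header, "".join(chunks)))
    return name, records
```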
Performance Considerations
Memory Usage
- Base Memory: Default allocation is about 2 GB (-Xmx2000m), set by the z="-Xmx2000m" variable
- TaxTree Storage: Complete tree.nodes array loaded into memory containing all TaxNode objects from taxonomic database
- Node Tracking: LinkedHashMap<TaxNode, Long> overhead scales with number of unique taxonomic nodes encountered
- ByteFile Threading: Uses ByteFile.FORCE_MODE_BF2=true when Shared.threads()>2 for multi-threaded file reading
Disk I/O Optimization
- Directory Pre-creation: makeDirectoryTree() creates entire directory structure using File.mkdirs() before processing sequences
- Compression Integration: All output uses FileFormat.testOutput() with .fa.gz extension for automatic gzip compression
- File Handle Management: Reuses ByteStreamWriter instances when currentNode==previousNode to minimize file open/close operations
- Pigz Integration: Enables ReadWrite.USE_PIGZ=true and ReadWrite.USE_UNPIGZ=true for parallel compression
Processing Efficiency
- Stream-Based Processing: ByteFile.nextLine() iterator prevents full file loading; processes files larger than available RAM
- State Optimization: Only switches output files on taxonomic node changes, minimizing file system operations
- Counter Batching: Accumulates sequence lengths in currentSize; updates LinkedHashMap only on node transitions
- Performance Metrics: Reports processing rates in thousands of reads per second, where rpnano is reads per nanosecond: Tools.format("%.2fk reads/sec", rpnano*1000000)
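The unit conversion behind the rate string can be checked with a small Python sketch. format_rate is a hypothetical name, and the format string is copied from the expression above on the assumption that Tools.format follows printf-style formatting.

```python
def format_rate(reads_processed, elapsed_nanos):
    """Format throughput as thousands of reads per second."""
    rpnano = reads_processed / float(elapsed_nanos)  # reads per nanosecond
    # reads/ns * 1e9 ns/s = reads/s; dividing by 1e3 for the 'k' leaves a factor of 1e6
    return "%.2fk reads/sec" % (rpnano * 1000000)
```

For example, 2,500,000 reads in 10 seconds (10,000,000,000 ns) gives rpnano = 0.00025, which is reported as 250.00k reads/sec.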
Common Issues
Taxonomic Header Parsing
Input sequences must have FASTA headers compatible with tree.parseNodeFromHeader() method. The tool:
- Header Processing: Strips '>' character and passes remaining string to parseNodeFromHeader()
- Node Resolution: Returns null TaxNode for unparseable headers, causing sequences to be skipped
- Dependency: Parsing success depends on loaded TaxTree containing matching taxonomic identifiers
- Error Handling: Sequences with tn==null are not written to any output file
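Because unresolved sequences are skipped without being written, it can be useful to count them before a full run. The following Python sketch is one way to do that; count_skipped and the resolve callable are hypothetical stand-ins for the tree lookup and are not part of BBTools.

```python
def count_skipped(lines, resolve):
    """Count FASTA records that resolve to a node versus those that would be skipped.

    resolve: callable taking the header text (without '>') and returning a
    node or None; a stand-in for the tree lookup.
    """
    kept = skipped = 0
    for line in lines:
        if line.startswith(">"):
            if resolve(line[1:].rstrip("\n")) is None:
                skipped += 1
            else:
                kept += 1
    return kept, skipped
```

A large skipped count suggests the headers are not in the format the loaded tree expects.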
File System Limitations
Large taxonomic datasets trigger extensive directory creation via makeDirectoryTree():
- Directory Count: Creates one directory per TaxNode in tree.nodes array
- Path Depth: Hierarchical paths can exceed filesystem limits on deep taxonomic trees
- Inode Usage: Each [taxid].name file plus [taxid].fa.gz file consumes filesystem inodes
- File Handle Limits: ByteStreamWriter instances may exceed system ulimit for open files
Memory and Performance Issues
Memory usage scales with taxonomic diversity and TaxTree size:
- TaxTree Loading: Complete tree.nodes array must fit in heap memory before processing begins
- LinkedHashMap Growth: The nodes LinkedHashMap grows with the number of unique taxonomic nodes processed
- ByteFile Threading: Multi-threaded mode (FORCE_MODE_BF2) increases memory overhead but improves I/O performance
- GC Pressure: High taxonomic diversity causes frequent ByteStreamWriter creation/destruction
Support
For questions and support:
- Email: bbushnell@lbl.gov
- Documentation: bbmap.org