TaxTree
Creates tree.taxtree from names.dmp, nodes.dmp, and merged.dmp. These files are in taxdmp.zip, available at ftp://ftp.ncbi.nih.gov/pub/taxonomy/. The taxtree file is needed by taxonomy-aware programs such as Seal and SortByTaxa.
Basic Usage
taxtree.sh names.dmp nodes.dmp merged.dmp tree.taxtree.gz
This tool processes NCBI taxonomy dump files to create a binary taxonomic tree file. The input files should be extracted from taxdmp.zip downloaded from NCBI's taxonomy FTP site.
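As a rough illustration of the extraction step, the Python sketch below pulls only the three dump files named in the usage line out of a locally downloaded taxdmp.zip. The function name extract_dumps and the constant NEEDED are hypothetical helpers invented for this example; they are not part of BBTools.

```python
import zipfile
from pathlib import Path

# The three dump files that the taxtree.sh usage line expects.
NEEDED = ("names.dmp", "nodes.dmp", "merged.dmp")

def extract_dumps(zip_path, out_dir):
    """Extract only the needed .dmp files from a local copy of taxdmp.zip."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as zf:
        available = set(zf.namelist())
        missing = [n for n in NEEDED if n not in available]
        if missing:
            raise FileNotFoundError("archive lacks: " + ", ".join(missing))
        for name in NEEDED:
            zf.extract(name, out_dir)
    return [out_dir / n for n in NEEDED]
```

The returned paths can then be passed to taxtree.sh in the order shown above.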
Parameters
TaxTree has few command-line parameters: the standard Java flags plus a single verbose flag.
Java Parameters
- -Xmx
- This will set Java's memory usage, overriding autodetection. -Xmx20g will specify 20 gigs of RAM. The max is typically 85% of physical memory.
- -eoom
- This flag will cause the process to exit if an out-of-memory exception occurs. Requires Java 8u92+.
- -da
- Disable assertions.
Processing Parameters
- verbose=f
- Enable verbose output showing detailed information about tree construction and node processing.
Input and Output Files
Required NCBI Files
- names.dmp
- Contains taxonomic names and their types. TaxTree extracts entries marked as "scientific name" to build the taxonomy.
- nodes.dmp
- Contains the hierarchical structure of the taxonomy, including parent-child relationships and taxonomic ranks.
- merged.dmp
- Maps old taxonomic IDs to new ones for taxa that have been merged or updated.
Output File
- tree.taxtree.gz
- The binary serialized taxonomic tree, which other BBTools programs can load using ReadWrite.read().
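The NCBI dump files share one record layout: fields are separated by a tab, a pipe, and a tab, and each line ends with a tab and a pipe. The Python sketch below shows how such records could be read. It illustrates the file format only and is not TaxTree's Java parser; all function names are hypothetical.

```python
def split_dmp_line(line):
    """Split one NCBI .dmp record into stripped fields."""
    line = line.rstrip("\n")
    if line.endswith("\t|"):
        line = line[:-2]  # drop the record terminator
    return [field.strip() for field in line.split("\t|\t")]

def scientific_names(lines):
    """names.dmp: map tax ID -> name, keeping only 'scientific name' rows."""
    names = {}
    for line in lines:
        fields = split_dmp_line(line)
        # Columns: tax_id, name_txt, unique_name, name_class
        if len(fields) >= 4 and fields[3] == "scientific name":
            names[int(fields[0])] = fields[1]
    return names

def parents_and_ranks(lines):
    """nodes.dmp: map tax ID -> (parent tax ID, rank string)."""
    result = {}
    for line in lines:
        fields = split_dmp_line(line)
        # Columns start with: tax_id, parent tax_id, rank
        result[int(fields[0])] = (int(fields[1]), fields[2])
    return result

def merged_ids(lines):
    """merged.dmp: map old tax ID -> new tax ID."""
    return {int(old): int(new) for old, new in map(split_dmp_line, lines)}
```

Filtering on the name class is what keeps synonyms and common names out of the tree, as described for names.dmp above.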
Examples
Basic Tree Creation
taxtree.sh names.dmp nodes.dmp merged.dmp tree.taxtree.gz
Creates a binary taxonomic tree file from NCBI taxonomy dump files.
Verbose Processing
taxtree.sh names.dmp nodes.dmp merged.dmp tree.taxtree.gz verbose=t
Same as above but with verbose output showing detailed processing information.
With Increased Memory
taxtree.sh -Xmx8g names.dmp nodes.dmp merged.dmp tree.taxtree.gz
Process large taxonomy files with 8GB of RAM allocated to Java.
Algorithm Details
TaxTree builds the tree in several phases. It uses fixpoint iteration to propagate rank information through the tree and a dual (standard and extended) level mapping designed for NCBI's taxonomy database format.
Tree Construction Process
Construction proceeds through the following phases, in order:
- Name Parsing: Reads names.dmp and extracts all entries marked as "scientific name", creating TaxNode objects with IDs and names.
- Hierarchy Building: Processes nodes.dmp to establish parent-child relationships and assign taxonomic ranks using both standard and extended level mappings.
- Child Counting: Counts children for each node using the countChildren() method to populate the numChildren field and identify leaf nodes.
- Information Percolation: Uses the percolate() method with bidirectional tree traversal to propagate the minParentLevelExtended and maxChildLevelExtended fields, calling discussWithParent() repeatedly until no changes occur.
- Strain Assignment: For prokaryotic organisms (Bacteria, taxid 2, and Archaea, taxid 2157), assigns STRAIN_E and SUBSTRAIN_E ranks to unranked nodes below species level using the assignStrains() method.
- Tree Simplification: Optionally removes or reassigns unranked nodes based on configurable parameters.
- Validation: Ensures monotonically non-decreasing taxonomic ranks from root to leaves.
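The child-counting and percolation phases can be sketched in Python as follows. This is one plausible reading of the description above, not the actual Java code. It assumes that rank numbers grow toward the root, that unranked nodes carry no level, and that the root is its own parent. The constants NO_LEVEL_LOW and NO_LEVEL_HIGH are hypothetical.

```python
NO_LEVEL_LOW = -1    # hypothetical "nothing known yet" value for max_child
NO_LEVEL_HIGH = 99   # hypothetical "nothing known yet" value for min_parent

def count_children(parent):
    """parent maps node ID -> parent ID; the root maps to itself."""
    counts = {node: 0 for node in parent}
    for child, par in parent.items():
        if child != par:
            counts[par] += 1
    return counts

def percolate(parent, level):
    """Propagate rank bounds until a full pass changes nothing (a fixpoint).

    level maps node ID -> rank number, or None for unranked nodes.
    min_parent[n] ends as the lowest rank on the path from n to the root,
    max_child[n] as the highest rank in n's subtree (both include n itself).
    """
    max_child = {n: NO_LEVEL_LOW if level[n] is None else level[n] for n in parent}
    min_parent = {n: NO_LEVEL_HIGH if level[n] is None else level[n] for n in parent}
    rounds = 0
    changed = True
    while changed:
        changed = False
        rounds += 1
        for child, par in parent.items():
            if child == par:
                continue  # the root has nobody to talk to
            # Upward: the parent learns about ranks below it.
            if max_child[child] > max_child[par]:
                max_child[par] = max_child[child]
                changed = True
            # Downward: the child learns about ranks above it.
            if min_parent[par] < min_parent[child]:
                min_parent[child] = min_parent[par]
                changed = True
    return min_parent, max_child, rounds
```

After the fixpoint, an unranked node sits between a known upper and lower bound, which is the kind of information a later phase such as strain assignment would need.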
Data Structures
TaxTree uses specific data structures from the Java Collections Framework and BBTools libraries:
- Node Array: TaxNode[] array where nodes[taxID] directly accesses the TaxNode for that ID using array indexing
- Level Arrays: treeLevelsExtended[] arrays organizing nodes by taxonomic level using nodesPerLevelExtended counters
- Name Maps: HashMap<String, ArrayList<TaxNode>> for nameMap and nameMapLower with configurable sizing based on node count
- Child Maps: HashMap<TaxNode, ArrayList<TaxNode>> mapping parent nodes to ArrayList of children sized using numChildren field
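A minimal Python analogue of these structures is sketched below, assuming records arrive as (tax ID, name, parent ID) tuples. The class here is a simplified stand-in for the Java TaxNode, and the child map is keyed by parent ID rather than by node object for brevity.

```python
from dataclasses import dataclass

@dataclass
class TaxNode:
    """Simplified stand-in for the Java TaxNode."""
    id: int
    name: str
    pid: int  # parent tax ID

def build_indexes(records):
    """records: iterable of (tax_id, name, parent_id) tuples."""
    records = list(records)
    # Direct-indexed array: nodes[tax_id] is the node, or None if unused.
    nodes = [None] * (max(r[0] for r in records) + 1)
    name_map, name_map_lower, children = {}, {}, {}
    for tax_id, name, pid in records:
        node = TaxNode(tax_id, name, pid)
        nodes[tax_id] = node
        # Names are not unique, so each key holds a list of nodes.
        name_map.setdefault(name, []).append(node)
        name_map_lower.setdefault(name.lower(), []).append(node)
        if pid != tax_id:
            children.setdefault(pid, []).append(node)
    return nodes, name_map, name_map_lower, children
```

The array trades memory for speed: unused IDs cost one empty slot each, but a lookup by tax ID is a single index operation.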
Level System
TaxTree implements a dual-level system:
- Standard Levels: Traditional ranks like Kingdom, Phylum, Class, Order, Family, Genus, Species
- Extended Levels: Includes additional ranks like Strain, Substrain, and intermediate levels for more precise classification
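One way to picture the relationship between the two level sets is a mapping from each extended rank to the nearest standard rank at or above it. The Python sketch below uses a made-up rank ordering purely for illustration. The real TaxTree level names, numbering, and mapping rules may differ.

```python
# Hypothetical ordering from most specific to most general.
EXTENDED = [
    "substrain", "strain", "subspecies", "species", "subgenus", "genus",
    "subfamily", "family", "suborder", "order", "subclass", "class",
    "subphylum", "phylum", "subkingdom", "kingdom",
]
STANDARD = {"species", "genus", "family", "order", "class", "phylum", "kingdom"}

def standard_rank_for(rank):
    """Return the nearest standard rank at or above the given extended rank."""
    start = EXTENDED.index(rank)  # raises ValueError for unknown ranks
    for candidate in EXTENDED[start:]:
        if candidate in STANDARD:
            return candidate
    return None
```

Under this assumed ordering, a strain collapses to species in the standard view while keeping its finer rank in the extended view.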
Performance Characteristics
- Memory Usage: Approximately 200-300MB for the full NCBI taxonomy (2+ million nodes)
- Construction Time: Typically 30-60 seconds depending on system performance
- Lookup Performance: O(1) for TaxID-based lookups (direct array indexing); expected O(1) for name-based lookups, since the name maps are HashMaps
- Tree Traversal: O(depth) for ancestor/descendant queries, typically <20 operations
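The O(depth) cost of ancestor queries follows from walking parent links one step at a time. The Python sketch below shows this with a parent array indexed by tax ID, in which the root is its own parent. The function names are hypothetical and this is not the BBTools API.

```python
def lineage(parent, tax_id):
    """Return the path from tax_id up to the root, one step per level."""
    path = [tax_id]
    while parent[tax_id] != tax_id:
        tax_id = parent[tax_id]
        path.append(tax_id)
    return path

def common_ancestor(parent, a, b):
    """Return the lowest node that appears in both lineages."""
    seen = set(lineage(parent, a))
    for node in lineage(parent, b):
        if node in seen:
            return node
    return None
```

Each call touches at most one node per level of the tree, so the work is bounded by tree depth rather than by the total number of nodes.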
Validation and Quality Control
The tree construction includes validation using the test() method:
- Ensures all nodes have valid parent references (except root)
- Validates taxonomic rank monotonicity
- Identifies and reports problematic nodes or relationships
- Handles edge cases like self-referencing nodes and circular dependencies
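The checks listed above could look roughly like the Python sketch below. It is a simplified illustration of the kinds of tests involved, not the actual test() method. It assumes rank numbers grow toward the root, and the problem labels are invented for this example.

```python
def reaches_root(parent, node, root):
    """Follow parent links; False on a cycle or a dangling reference."""
    seen = set()
    while node != root:
        if node in seen or node not in parent:
            return False
        seen.add(node)
        node = parent[node]
    return True

def validate(parent, level, root=1):
    """Return {node: problem} for every node that fails a check."""
    problems = {}
    for node, par in parent.items():
        if node == root:
            continue
        if par == node:
            problems[node] = "self-reference"
        elif par not in parent:
            problems[node] = "missing parent"
        elif not reaches_root(parent, node, root):
            problems[node] = "no path to root"
        elif (level.get(node) is not None and level.get(par) is not None
              and level[node] > level[par]):
            # Ranks must not increase when moving away from the root.
            problems[node] = "rank above parent"
    return problems
```

An empty result means every non-root node has a valid parent, a path to the root, and a rank no higher than its parent's.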
Output
TaxTree generates several types of output:
Binary Tree File
The primary output is a compressed binary file (tree.taxtree.gz) containing the serialized TaxTree object. The file is written with Java object serialization so that other BBTools programs that need taxonomic information can load it.
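As a quick sanity check on the output, the Python sketch below tests whether a file is gzip-compressed and begins with the standard Java serialization stream header (bytes AC ED 00 05). It assumes the tree is a plain gzip-wrapped serialization stream, as described above. It does not deserialize the tree, and the function name is hypothetical.

```python
import gzip

# Java serialization streams start with magic 0xACED and version 0x0005.
JAVA_SERIALIZATION_HEADER = b"\xac\xed\x00\x05"

def looks_like_serialized_tree(path):
    """True if path is gzip data whose payload starts with the Java header."""
    try:
        with gzip.open(path, "rb") as handle:
            return handle.read(4) == JAVA_SERIALIZATION_HEADER
    except (OSError, EOFError):
        # Raised when the file is not gzip data or is truncated.
        return False
```

A False result suggests the file was truncated, is not gzip-compressed, or was produced in some other format.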
Processing Statistics
During construction, TaxTree reports:
- Total number of nodes retained
- Node counts by taxonomic level
- Number of percolation rounds required
- Validation results and any errors found
- Processing time and memory usage
Verbose Output
With verbose=t, additional information is displayed including:
- Sample nodes at each taxonomic level
- Unknown taxonomic levels encountered
- Details about strain assignment process
- Tree simplification results
Integration with Other Tools
The TaxTree file created by this tool is used by several other BBTools programs:
- Seal: For taxonomic filtering and classification of sequences
- SortByTaxa: For organizing sequences by taxonomic classification
- Taxonomy-aware tools: Any BBTools program that needs to resolve taxonomic names or traverse the taxonomic hierarchy
Support
For questions and support:
- Email: bbushnell@lbl.gov
- Documentation: bbmap.org