TaxTree

Basic Usage

taxtree.sh names.dmp nodes.dmp merged.dmp tree.taxtree.gz

This tool processes NCBI taxonomy dump files to create a binary taxonomic tree file. The input files should be extracted from taxdmp.zip downloaded from NCBI's taxonomy FTP site.

Parameters

TaxTree takes few command-line parameters beyond the four file paths: the standard Java flags and a single verbose switch.

Java Parameters

-Xmx: This will set Java's memory usage, overriding autodetection. -Xmx20g will specify 20 gigs of RAM. The max is typically 85% of physical memory.
-eoom: This flag will cause the process to exit if an out-of-memory exception occurs. Requires Java 8u92+.
-da: Disable assertions.

Processing Parameters

verbose=f: Set to t to print detailed information about tree construction and node processing. The default is f (off).

Input Files

Required NCBI Files

names.dmp: Contains taxonomic names and their types. TaxTree extracts entries marked as "scientific name" to build the taxonomy.
nodes.dmp: Contains the hierarchical structure of the taxonomy, including parent-child relationships and taxonomic ranks.
merged.dmp: Maps old taxonomic IDs to new ones for taxa that have been merged or updated.
tree.taxtree.gz: Not an NCBI input. This fourth argument names the output file: the binary serialized taxonomic tree that other BBTools programs can load using ReadWrite.read().
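
For readers unfamiliar with the dump format: the NCBI .dmp files separate fields with a tab, pipe, tab sequence and end each line with a tab and a pipe. The Python sketch below illustrates the "scientific name" filtering and the merged-ID mapping described above. It is not BBTools code, and the function names are hypothetical.

```python
def split_dmp_line(line):
    """Split one NCBI .dmp line into stripped fields."""
    line = line.rstrip("\r\n")
    if line.endswith("\t|"):
        line = line[:-2]  # drop the end-of-line marker
    return [field.strip() for field in line.split("\t|\t")]


def scientific_names(lines):
    """Map taxID -> scientific name from names.dmp lines."""
    names = {}
    for line in lines:
        fields = split_dmp_line(line)
        # names.dmp columns: tax_id, name_txt, unique name, name class
        if len(fields) >= 4 and fields[3] == "scientific name":
            names[int(fields[0])] = fields[1]
    return names


def merged_ids(lines):
    """Map old taxID -> current taxID from merged.dmp lines."""
    merged = {}
    for line in lines:
        fields = split_dmp_line(line)
        if len(fields) >= 2:
            merged[int(fields[0])] = int(fields[1])
    return merged
```

Both functions accept any iterable of lines, so an open file handle can be passed directly, for example scientific_names(open("names.dmp")).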

Examples

Basic Tree Creation

taxtree.sh names.dmp nodes.dmp merged.dmp tree.taxtree.gz

Creates a binary taxonomic tree file from NCBI taxonomy dump files.

Verbose Processing

taxtree.sh names.dmp nodes.dmp merged.dmp tree.taxtree.gz verbose=t

Same as above but with verbose output showing detailed processing information.

With Increased Memory

taxtree.sh -Xmx8g names.dmp nodes.dmp merged.dmp tree.taxtree.gz

Allocates 8 GB of RAM to Java, for processing large taxonomy files.

Algorithm Details

TaxTree builds the tree in several phases, using fixpoint iteration to propagate rank information and a dual-level (standard and extended) rank mapping tailored to NCBI's taxonomy dump format.

Tree Construction Process

Construction proceeds in the following phases:

Name Parsing: Reads names.dmp and extracts all entries marked as "scientific name", creating TaxNode objects with IDs and names.
Hierarchy Building: Processes nodes.dmp to establish parent-child relationships and assign taxonomic ranks using both standard and extended level mappings.
Child Counting: Counts the children of each node with the countChildren() method, populating the numChildren field and identifying leaf nodes.
Information Percolation: Uses the percolate() method to traverse the tree in both directions. Each node exchanges information with its parent through discussWithParent(), updating the minParentLevelExtended and maxChildLevelExtended fields, and the passes repeat until no field changes.
Strain Assignment: For prokaryotes (Bacteria, taxid 2, and Archaea, taxid 2157), uses the assignStrains() method to assign the STRAIN_E and SUBSTRAIN_E ranks to unranked nodes below species level.
Tree Simplification: Optionally removes or reassigns unranked nodes based on configurable parameters.
Validation: Checks that taxonomic ranks are monotonic along every root-to-leaf path, so that no node is ranked above one of its ancestors.
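
The percolation phase is the least obvious of these steps, so here is a minimal Python sketch of the general idea: repeat parent-child exchanges until a full pass changes nothing. It is an illustration under stated assumptions, not the BBTools implementation. It assumes that larger level numbers are closer to the root, that unranked nodes carry None, and that the two propagated values are the lowest ranked level above a node and the highest ranked level below it. The function and variable names are hypothetical.

```python
def percolate(parent, level):
    """Propagate rank bounds until a full pass changes nothing.

    parent: dict node -> parent node (the root maps to itself)
    level:  dict node -> int rank level, or None if unranked
            (assumption: larger numbers are closer to the root)
    Returns (min_parent_level, max_child_level, rounds).
    """
    min_parent = {n: None for n in parent}
    max_child = {n: None for n in parent}
    rounds = 0
    changed = True
    while changed:
        changed = False
        rounds += 1
        for node, par in parent.items():
            if node == par:
                continue  # the root has no parent to exchange with
            # Downward: lowest ranked level known at or above the parent.
            above = [v for v in (level[par], min_parent[par]) if v is not None]
            if above:
                down = min(above)
                if min_parent[node] is None or down < min_parent[node]:
                    min_parent[node] = down
                    changed = True
            # Upward: highest ranked level known at or below this node.
            below = [v for v in (level[node], max_child[node]) if v is not None]
            if below:
                up = max(below)
                if max_child[par] is None or up > max_child[par]:
                    max_child[par] = up
                    changed = True
    return min_parent, max_child, rounds
```

Each value can only move in one direction and is bounded by the levels present in the tree, so the loop terminates. The number of rounds depends on the order in which nodes are visited.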

Data Structures

TaxTree uses specific data structures from the Java Collections Framework and BBTools libraries:

Node Array: TaxNode[] array where nodes[taxID] directly accesses the TaxNode for that ID using array indexing
Level Arrays: treeLevelsExtended[] arrays organizing nodes by taxonomic level using nodesPerLevelExtended counters
Name Maps: HashMap<String, ArrayList<TaxNode>> for nameMap and nameMapLower with configurable sizing based on node count
Child Maps: HashMap<TaxNode, ArrayList<TaxNode>> mapping parent nodes to ArrayList of children sized using numChildren field
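
To make the layout concrete, the following Python sketch mirrors the structures listed above: a list indexed by taxID, two name maps, and a child map keyed by parent node. It is a simplified analogue rather than the Java code. The class fields and the build_index function are hypothetical, and it assumes every parent ID also appears in the names input.

```python
from collections import defaultdict


class TaxNode:
    def __init__(self, tax_id, name):
        self.id = tax_id
        self.name = name
        self.pid = tax_id       # parent ID; filled in from nodes.dmp data
        self.num_children = 0


def build_index(names, parents):
    """names: dict taxID -> name; parents: dict taxID -> parent taxID."""
    nodes = [None] * (max(names) + 1)   # nodes[taxID] is a direct lookup
    name_map = defaultdict(list)        # exact name -> nodes
    name_map_lower = defaultdict(list)  # lowercased name -> nodes
    for tax_id, name in names.items():
        node = TaxNode(tax_id, name)
        nodes[tax_id] = node
        name_map[name].append(node)
        name_map_lower[name.lower()].append(node)
    child_map = defaultdict(list)       # parent node -> child nodes
    for tax_id, pid in parents.items():
        node = nodes[tax_id]
        node.pid = pid
        if pid != tax_id:               # the root is its own parent
            nodes[pid].num_children += 1
            child_map[nodes[pid]].append(node)
    return nodes, name_map, name_map_lower, child_map
```

The array trades memory for speed: unused taxIDs leave empty slots, but a lookup by ID is a single index operation. The name maps hold lists because a name is not guaranteed to be unique.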

Level System

TaxTree implements a dual-level system:

Standard Levels: Traditional ranks like Kingdom, Phylum, Class, Order, Family, Genus, Species
Extended Levels: Includes additional ranks like Strain, Substrain, and intermediate levels for more precise classification
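
One way to picture the relationship between the two systems is as a fine-grained ordering that collapses onto a coarser one. The Python sketch below is purely illustrative: the rank lists, their ordering, and the to_standard function are assumptions made for this example and do not reproduce the level constants used by TaxTree.

```python
# Illustrative orderings only, from most specific to most general.
STANDARD = ["species", "genus", "family", "order", "class", "phylum", "kingdom"]
EXTENDED = ["substrain", "strain", "subspecies", "species", "subgenus", "genus",
            "subfamily", "family", "suborder", "order", "subclass", "class",
            "subphylum", "phylum", "subkingdom", "kingdom"]

# Position in the extended ordering serves as the extended level number.
EXTENDED_LEVEL = {rank: i for i, rank in enumerate(EXTENDED)}


def to_standard(rank):
    """Return the nearest standard rank at or above an extended rank.

    Returns None for a rank that is not in the extended ordering.
    """
    start = EXTENDED_LEVEL.get(rank)
    if start is None:
        return None
    for candidate in EXTENDED[start:]:
        if candidate in STANDARD:
            return candidate
    return None
```

Under these assumed orderings, to_standard("strain") gives "species" and to_standard("subfamily") gives "family", while an unrecognized rank such as "no rank" gives None.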

Performance Characteristics

Memory Usage: Approximately 200-300MB for the full NCBI taxonomy (2+ million nodes)
Construction Time: Typically 30-60 seconds depending on system performance
Lookup Performance: O(1) for TaxID-based lookups (array indexing) and expected O(1) for name-based lookups (HashMap)
Tree Traversal: O(depth) for ancestor queries, which follow parent references toward the root, typically <20 operations
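
The O(depth) cost of an ancestor query follows from the fact that each step moves one parent reference closer to the root. A minimal Python sketch, assuming the tree is given as a dict from taxID to parent taxID with the root as its own parent (the function name is hypothetical):

```python
def ancestors(parent, tax_id):
    """Yield the ancestors of tax_id, nearest first, ending at the root."""
    seen = {tax_id}
    current = tax_id
    while parent[current] != current:   # the root is its own parent
        current = parent[current]
        if current in seen:
            raise ValueError(f"cycle detected at taxID {current}")
        seen.add(current)
        yield current
```

The seen set guards against malformed input with circular parent references. On a valid tree the loop runs once per level between the node and the root.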

Validation and Quality Control

The tree construction includes validation using the test() method:

Ensures all nodes have valid parent references (except root)
Validates taxonomic rank monotonicity
Identifies and reports problematic nodes or relationships
Handles edge cases like self-referencing nodes and circular dependencies
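
The checks listed above can be sketched in a few lines of Python. This is a simplified illustration, not the logic of the test() method. It assumes larger level numbers are closer to the root, treats None as unranked, and uses a hypothetical validate function that returns the problems it finds.

```python
def validate(parent, level):
    """Return a list of (taxID, problem) pairs; empty means no issues.

    parent: dict taxID -> parent taxID (the root is its own parent)
    level:  dict taxID -> int level, or None if unranked
            (assumption: larger numbers are closer to the root)
    """
    problems = []
    for node in parent:
        current = node
        # A valid path reaches the root within len(parent) steps.
        for _ in range(len(parent)):
            par = parent[current]
            if par not in parent:
                problems.append((node, "missing parent"))
                break
            if par == current:
                break  # reached the self-referencing root
            if (level[node] is not None and level[par] is not None
                    and level[par] < level[node]):
                problems.append((node, "ranked above ancestor"))
                break
            current = par
        else:
            # The walk never reached a root, so the path loops.
            problems.append((node, "cycle"))
    return problems
```

Note that a node is reported if any ancestor has a missing parent, so one broken reference flags its whole subtree.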

Output

TaxTree generates several types of output:

Binary Tree File

The primary output is a compressed binary file (tree.taxtree.gz) containing the serialized TaxTree object. This file uses Java object serialization for loading by other BBTools programs that need taxonomic information.

Processing Statistics

During construction, TaxTree reports:

Total number of nodes retained
Node counts by taxonomic level
Number of percolation rounds required
Validation results and any errors found
Processing time and memory usage

Verbose Output

With verbose=t, additional information is displayed including:

Sample nodes at each taxonomic level
Unknown taxonomic levels encountered
Details about strain assignment process
Tree simplification results

Integration with Other Tools

The TaxTree file created by this tool is used by several other BBTools programs:

Seal: For taxonomic filtering and classification of sequences
SortByTaxa: For organizing sequences by taxonomic classification
Taxonomy-aware tools: Any BBTools program that needs to resolve taxonomic names or traverse the taxonomic hierarchy

Support

For questions and support:

Email: bbushnell@lbl.gov
Documentation: bbmap.org