TaxTree

Script: taxtree.sh Package: tax Class: TaxTree.java

Creates tree.taxtree from names.dmp, nodes.dmp, and merged.dmp. These files are in taxdmp.zip, available at ftp://ftp.ncbi.nih.gov/pub/taxonomy/. The taxtree file is needed by programs that use taxonomy, such as Seal and SortByTaxa.

Basic Usage

taxtree.sh names.dmp nodes.dmp merged.dmp tree.taxtree.gz

This tool processes NCBI taxonomy dump files to create a binary taxonomic tree file. The input files should be extracted from taxdmp.zip downloaded from NCBI's taxonomy FTP site.

Parameters

TaxTree has few command-line parameters: the standard Java flags and a single verbose processing flag.

Java Parameters

-Xmx
This will set Java's memory usage, overriding autodetection. -Xmx20g will specify 20 gigs of RAM. The max is typically 85% of physical memory.
-eoom
This flag will cause the process to exit if an out-of-memory exception occurs. Requires Java 8u92+.
-da
Disable assertions.

Processing Parameters

verbose=f
Enable verbose output showing detailed information about tree construction and node processing.

Input Files

Required NCBI Files

names.dmp
Contains taxonomic names and their types. TaxTree extracts entries marked as "scientific name" to build the taxonomy.
nodes.dmp
Contains the hierarchical structure of the taxonomy, including parent-child relationships and taxonomic ranks.
merged.dmp
Maps old taxonomic IDs to new ones for taxa that have been merged or updated.
tree.taxtree.gz
Output file, not an NCBI input: the binary serialized taxonomic tree, which other BBTools programs load using ReadWrite.read().
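
To make the input layout concrete, here is a minimal Java sketch of reading the three dump files. It assumes the standard NCBI dump layout, in which fields are separated by a tab, a pipe, and a tab, and each row ends with a tab and a pipe. The class DmpSketch and its method names are hypothetical and are not part of BBTools, and the sample rows are illustrative rather than taken from a real download.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class DmpSketch {

    /** Splits one .dmp row; fields are separated by "\t|\t" and the row ends with "\t|". */
    static String[] fields(String line) {
        String body = line.endsWith("\t|") ? line.substring(0, line.length() - 2) : line;
        String[] parts = body.split("\t\\|\t", -1);
        for (int i = 0; i < parts.length; i++) {
            parts[i] = parts[i].trim();
        }
        return parts;
    }

    /** names.dmp columns: tax_id, name_txt, unique name, name class. Keeps scientific names only. */
    static Map<Integer, String> scientificNames(List<String> lines) {
        Map<Integer, String> names = new HashMap<>();
        for (String line : lines) {
            String[] f = fields(line);
            if (f.length >= 4 && f[3].equals("scientific name")) {
                names.put(Integer.parseInt(f[0]), f[1]);
            }
        }
        return names;
    }

    /** Maps column 1 to column 2: child to parent in nodes.dmp, old ID to new ID in merged.dmp. */
    static Map<Integer, Integer> firstTwoColumns(List<String> lines) {
        Map<Integer, Integer> map = new HashMap<>();
        for (String line : lines) {
            String[] f = fields(line);
            if (f.length >= 2) {
                map.put(Integer.parseInt(f[0]), Integer.parseInt(f[1]));
            }
        }
        return map;
    }

    /** Returns the replacement ID if taxId was merged, otherwise taxId itself. */
    static int resolve(int taxId, Map<Integer, Integer> merged) {
        return merged.getOrDefault(taxId, taxId);
    }

    public static void main(String[] args) {
        // Illustrative rows in the NCBI layout, not a real download.
        List<String> names = List.of(
            "2\t|\tBacteria\t|\tBacteria <bacteria>\t|\tscientific name\t|",
            "2\t|\teubacteria\t|\t\t|\tgenbank common name\t|");
        List<String> nodes = List.of("2\t|\t131567\t|\tsuperkingdom\t|");
        List<String> merged = List.of("12\t|\t74109\t|");

        System.out.println(scientificNames(names));               // {2=Bacteria}
        System.out.println(firstTwoColumns(nodes).get(2));        // 131567
        System.out.println(resolve(12, firstTwoColumns(merged))); // 74109
    }
}
```

The sketch keeps only the columns described above; the remaining nodes.dmp columns, including the rank in the third field, are left unparsed here for brevity.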

Examples

Basic Tree Creation

taxtree.sh names.dmp nodes.dmp merged.dmp tree.taxtree.gz

Creates a binary taxonomic tree file from NCBI taxonomy dump files.

Verbose Processing

taxtree.sh names.dmp nodes.dmp merged.dmp tree.taxtree.gz verbose=t

Same as above but with verbose output showing detailed processing information.

With Increased Memory

taxtree.sh -Xmx8g names.dmp nodes.dmp merged.dmp tree.taxtree.gz

Processes large taxonomy files with 8 GB of RAM allocated to Java.

Algorithm Details

TaxTree builds the taxonomic tree in several phases, using fixpoint iteration and a dual-level rank mapping designed for NCBI's taxonomy dump format.

Tree Construction Process

The algorithm follows a multi-phase approach:

  1. Name Parsing: Reads names.dmp and extracts all entries marked as "scientific name", creating TaxNode objects with IDs and names.
  2. Hierarchy Building: Processes nodes.dmp to establish parent-child relationships and assign taxonomic ranks using both standard and extended level mappings.
  3. Child Counting: Counts the children of each node with the countChildren() method, populating the numChildren field and identifying leaf nodes.
  4. Information Percolation: The percolate() method traverses the tree in both directions, calling discussWithParent() to propagate the minParentLevelExtended and maxChildLevelExtended fields until no further changes occur.
  5. Strain Assignment: For prokaryotes (Bacteria, taxid 2, and Archaea, taxid 2157), the assignStrains() method assigns STRAIN_E and SUBSTRAIN_E ranks to unranked nodes below species level.
  6. Tree Simplification: Optionally removes or reassigns unranked nodes based on configurable parameters.
  7. Validation: Ensures monotonically non-decreasing taxonomic ranks from root to leaves.
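
Steps 3 and 4 can be pictured with a small Java sketch of child counting followed by fixpoint iteration. The Node class here is a hypothetical stand-in for TaxNode, and the update rule inside discussWithParent() is an assumption, since the source names the method but does not show its logic. The sketch also assumes that larger level numbers mean broader ranks, that unranked nodes have level -1, and that the numeric levels in demoTree() are arbitrary rather than BBTools constants.

```java
class PercolateSketch {

    /** Hypothetical stand-in for a taxonomy node; not the real TaxNode. */
    static class Node {
        final int id;
        final int parentId;          // the root is its own parent
        final int level;             // assumed: larger = broader rank, -1 = unranked
        int numChildren = 0;
        int minParentLevelExtended;  // lowest level seen among this node and its ancestors
        int maxChildLevelExtended;   // highest level seen among this node and its descendants

        Node(int id, int parentId, int level) {
            this.id = id;
            this.parentId = parentId;
            this.level = level;
            this.minParentLevelExtended = level >= 0 ? level : Integer.MAX_VALUE;
            this.maxChildLevelExtended = level >= 0 ? level : -1;
        }
    }

    /** Counts direct children; leaves keep numChildren == 0. */
    static void countChildren(Node[] nodes) {
        for (Node n : nodes) {
            if (n != null && n.parentId != n.id) {
                nodes[n.parentId].numChildren++;
            }
        }
    }

    /** Exchanges bounds in both directions; returns true if either node changed. */
    static boolean discussWithParent(Node n, Node parent) {
        boolean changed = false;
        if (n.maxChildLevelExtended > parent.maxChildLevelExtended) {
            parent.maxChildLevelExtended = n.maxChildLevelExtended;  // information flows up
            changed = true;
        }
        if (parent.minParentLevelExtended < n.minParentLevelExtended) {
            n.minParentLevelExtended = parent.minParentLevelExtended;  // information flows down
            changed = true;
        }
        return changed;
    }

    /** Repeats full passes until one pass changes nothing; returns the number of passes. */
    static int percolate(Node[] nodes) {
        int passes = 0;
        boolean changed = true;
        while (changed) {
            changed = false;
            passes++;
            for (Node n : nodes) {
                if (n != null && n.parentId != n.id) {
                    changed |= discussWithParent(n, nodes[n.parentId]);
                }
            }
        }
        return passes;
    }

    /** Four-node chain indexed by ID: root, ranked node, unranked node, ranked leaf. */
    static Node[] demoTree() {
        return new Node[] {
            null,
            new Node(1, 1, 10),
            new Node(2, 1, 5),
            new Node(3, 2, -1),
            new Node(4, 3, 3)
        };
    }

    public static void main(String[] args) {
        Node[] t = demoTree();
        countChildren(t);
        int passes = percolate(t);
        // prints "passes=2 node3 range=3..5"
        System.out.println("passes=" + passes + " node3 range="
            + t[3].maxChildLevelExtended + ".." + t[3].minParentLevelExtended);
    }
}
```

Under these assumptions the iteration always terminates, because each field moves in one direction only and is bounded by the levels present in the tree. After it finishes, the unranked node 3 carries the range between its ranked child and its ranked parent.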

Data Structures

TaxTree uses specific data structures from the Java Collections Framework and BBTools libraries.

Level System

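TaxTree implements a dual-level system: a standard level mapping and an extended level mapping, both of which are used when assigning taxonomic ranks.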

Performance Characteristics

Validation and Quality Control

The tree construction includes validation using the test() method, which checks that taxonomic ranks are monotonically ordered from root to leaves.
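
A rank-ordering check of this kind could be sketched in Java as follows. This is not the body of test(), which the source does not show. RankCheckSketch, its array layout, and the numeric levels are hypothetical, and the sketch assumes that larger level numbers mean broader ranks and that -1 marks an unranked node. Under that convention, no ranked node may sit at a broader rank than its nearest ranked ancestor.

```java
import java.util.ArrayList;
import java.util.List;

class RankCheckSketch {

    static final int UNRANKED = -1;

    /**
     * parent[i] is the parent of node i (the root is its own parent) and
     * level[i] is its rank level. Returns the IDs of nodes that outrank
     * their nearest ranked ancestor.
     */
    static List<Integer> violations(int[] parent, int[] level) {
        List<Integer> bad = new ArrayList<>();
        for (int id = 0; id < parent.length; id++) {
            if (parent[id] == id || level[id] == UNRANKED) {
                continue;  // skip the root and unranked nodes
            }
            int p = parent[id];
            while (level[p] == UNRANKED && parent[p] != p) {
                p = parent[p];  // climb past unranked ancestors
            }
            if (level[p] != UNRANKED && level[id] > level[p]) {
                bad.add(id);
            }
        }
        return bad;
    }

    public static void main(String[] args) {
        int[] parent = {0, 0, 1, 2};
        int[] good = {10, 5, UNRANKED, 3};
        int[] bad = {10, 3, UNRANKED, 5};
        System.out.println(violations(parent, good));  // []
        System.out.println(violations(parent, bad));   // [3]
    }
}
```

Returning the offending IDs instead of failing on the first one makes it easier to see how many nodes in a given taxonomy dump break the ordering.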

Output

TaxTree generates several types of output:

Binary Tree File

The primary output is a compressed binary file (tree.taxtree.gz) containing the serialized TaxTree object. The file is written with Java object serialization so that other BBTools programs needing taxonomic information can load it.
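
The general technique of gzip-compressed Java object serialization can be illustrated with standard library classes alone. The sketch below does not reproduce how BBTools writes or reads the file: TinyTree is a hypothetical stand-in for the real TaxTree class, and the write and read helpers are illustrative names. It works on an in-memory byte array, where a FileOutputStream or FileInputStream would take the place of the byte-array streams for a file on disk.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

class SerializeSketch {

    /** Hypothetical stand-in for the tree; the real TaxTree holds far more state. */
    static class TinyTree implements Serializable {
        private static final long serialVersionUID = 1L;
        final int[] parent;
        final String[] name;

        TinyTree(int[] parent, String[] name) {
            this.parent = parent;
            this.name = name;
        }
    }

    /** Serializes obj and gzip-compresses the result. */
    static byte[] write(Serializable obj) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(new GZIPOutputStream(bytes))) {
            out.writeObject(obj);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();  // closing the stream above finished the gzip trailer
    }

    /** Decompresses and deserializes an object produced by write(). */
    static Object read(byte[] data) {
        try (ObjectInputStream in = new ObjectInputStream(
                new GZIPInputStream(new ByteArrayInputStream(data)))) {
            return in.readObject();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        TinyTree tree = new TinyTree(new int[] {0, 0}, new String[] {"root", "Bacteria"});
        byte[] gz = write(tree);
        TinyTree copy = (TinyTree) read(gz);
        // prints "Bacteria has parent root"
        System.out.println(copy.name[1] + " has parent " + copy.name[copy.parent[1]]);
    }
}
```

One consequence of this format in general is that a serialized file can only be read back by code whose class definitions are compatible with the ones that wrote it, so it is usually safest to rebuild the tree after updating the software.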

Processing Statistics

During construction, TaxTree reports:

Verbose Output

With verbose=t, additional information is displayed including:

Integration with Other Tools

The TaxTree file created by this tool is used by other BBTools programs that work with taxonomy, such as Seal and SortByTaxa.

Support

For questions and support: