IDTree

Basic Usage

idtree.sh in=<input file> out=<output file>

IDTree reads an identity matrix in TSV (tab-separated values) format and constructs a phylogenetic tree in Newick format. The tool is designed to work with identity matrices generated by the idmatrix.sh tool, creating hierarchical clustering trees based on sequence similarity data.

Parameters

IDTree parameters are organized into standard file handling parameters and Java runtime parameters. The tool has a simple parameter set focused on input/output specification and memory management.

Standard parameters

in=<file>: Identity matrix in TSV format. The input file should contain a symmetric matrix where each row represents a sequence and each column represents the identity score to another sequence. The first column should contain sequence names, and subsequent columns contain numerical identity values.
out=<file>: Newick tree output. The output file will contain the constructed phylogenetic tree in standard Newick format, terminated with a semicolon. If not specified, output goes to stdout.
overwrite=f: (ow) Set to false to force the program to abort rather than overwrite an existing file. When set to true, existing output files will be overwritten without warning. Default: false

Java Parameters

-Xmx: This will set Java's memory usage, overriding autodetection. -Xmx20g will specify 20 gigs of RAM, and -Xmx200m will specify 200 megs. The max is typically 85% of physical memory. Default: 800m for IDTree.
-eoom: This flag will cause the process to exit if an out-of-memory exception occurs. Requires Java 8u92+. Useful for preventing hanging processes when memory is exhausted during tree construction.
-da: Disable assertions. This can provide a small performance improvement in production environments by skipping internal consistency checks.

Examples

Basic Tree Construction

idtree.sh in=identity_matrix.tsv out=phylo_tree.newick

Reads an identity matrix from identity_matrix.tsv and writes the resulting phylogenetic tree to phylo_tree.newick in standard Newick format.

Pipeline with IDMatrix

# First create identity matrix
idmatrix.sh in=sequences.fasta out=identity_matrix.tsv

# Then build tree from matrix
idtree.sh in=identity_matrix.tsv out=phylo_tree.newick

Complete workflow showing how to generate an identity matrix from sequences and then construct a phylogenetic tree from that matrix.

High Memory Usage

idtree.sh in=large_matrix.tsv out=tree.newick -Xmx8g

For large identity matrices with many sequences, increase memory allocation to 8GB to handle the tree construction process efficiently.

Algorithm Details

IDTree implements hierarchical agglomerative clustering using the IDNode.makeTree() method. The tree construction follows a greedy priority queue-based algorithm that builds phylogenetic trees from sequence identity matrices through iterative node merging.

Tree Construction Algorithm

The core algorithm follows these steps:

Node Initialization: Each sequence in the identity matrix becomes a leaf node (IDNode) containing its identity scores to all other sequences
Priority Queue Management: Nodes are maintained in a priority queue ordered by their maximum identity score to remaining nodes
Iterative Clustering: The algorithm repeatedly selects the node with the highest maximum similarity and merges it with its most similar partner
Distance Calculation: Branch lengths are calculated based on identity score differences, with the formula: length = max(left.max, right.max) - parent.max
Newick Output: The final tree is serialized to standard Newick format with calculated branch lengths

Data Structures

The implementation uses specialized data structures for efficiency:

IDNode Class: Represents both leaf and internal nodes, containing identity arrays, parent/child relationships, and BitSet tracking for processed sequences
Priority Queue: Maintains nodes sorted by maximum identity score for efficient selection of next clustering candidates
BitSet Tracking: Uses BitSet data structures to efficiently track which sequences have been clustered together

Memory Characteristics

Memory usage scales quadratically with the number of input sequences due to the identity matrix storage. For N sequences:

Matrix Storage: O(N²) space for storing all pairwise identity scores
Node Storage: O(N) space for tree nodes, with each internal node requiring additional array storage
Working Memory: Additional memory for priority queue operations and BitSet tracking

Input Format Requirements

The identity matrix must follow specific formatting requirements:

TSV Format: Tab-separated values with sequence names in the first column
Symmetric Matrix: Identity scores should be symmetric (score[i][j] = score[j][i])
Numerical Values: All identity scores must be valid double-precision numbers
Header Handling: Lines starting with '#' are treated as comments and ignored

Tree Quality and Interpretation

The resulting Newick tree has specific characteristics:

Branch Lengths: Represent identity score differences, with shorter branches indicating higher similarity
Topology: Reflects hierarchical clustering based on maximum identity scores at each step
Name Sanitization: Special Newick characters (parentheses, colons, commas, semicolons, whitespace) are replaced with underscores
Precision: Branch lengths are formatted to 4 decimal places for balance between precision and readability

Support

For questions and support:

Email: bbushnell@lbl.gov
Documentation: bbmap.org