IDTree
Makes a Newick tree from an identity matrix. Intended for use with matrices created by idmatrix.sh.
Basic Usage
idtree.sh in=<input file> out=<output file>
IDTree reads an identity matrix in TSV (tab-separated values) format and constructs a phylogenetic tree in Newick format. The tool is designed to work with identity matrices generated by the idmatrix.sh tool, creating hierarchical clustering trees based on sequence similarity data.
Parameters
IDTree parameters are organized into standard file handling parameters and Java runtime parameters. The tool has a simple parameter set focused on input/output specification and memory management.
Standard parameters
- in=<file>
- Identity matrix in TSV format. The input file should contain a symmetric matrix where each row represents a sequence and each column represents the identity score to another sequence. The first column should contain sequence names, and subsequent columns contain numerical identity values.
- out=<file>
- Newick tree output. The output file will contain the constructed phylogenetic tree in standard Newick format, terminated with a semicolon. If not specified, output goes to stdout.
- overwrite=f
- (ow) Set to false to force the program to abort rather than overwrite an existing file. When set to true, existing output files will be overwritten without warning. Default: false
Java Parameters
- -Xmx
- This will set Java's memory usage, overriding autodetection. -Xmx20g will specify 20 gigs of RAM, and -Xmx200m will specify 200 megs. The max is typically 85% of physical memory. Default: 800m for IDTree.
- -eoom
- This flag will cause the process to exit if an out-of-memory exception occurs. Requires Java 8u92+. Useful for preventing hanging processes when memory is exhausted during tree construction.
- -da
- Disable assertions. This can provide a small performance improvement in production environments by skipping internal consistency checks.
Examples
Basic Tree Construction
idtree.sh in=identity_matrix.tsv out=phylo_tree.newick
Reads an identity matrix from identity_matrix.tsv and writes the resulting phylogenetic tree to phylo_tree.newick in standard Newick format.
Pipeline with IDMatrix
# First create identity matrix
idmatrix.sh in=sequences.fasta out=identity_matrix.tsv
# Then build tree from matrix
idtree.sh in=identity_matrix.tsv out=phylo_tree.newick
Complete workflow showing how to generate an identity matrix from sequences and then construct a phylogenetic tree from that matrix.
High Memory Usage
idtree.sh in=large_matrix.tsv out=tree.newick -Xmx8g
For large identity matrices with many sequences, increase memory allocation to 8GB to handle the tree construction process efficiently.
Algorithm Details
IDTree implements hierarchical agglomerative clustering using the IDNode.makeTree() method. The tree construction follows a greedy priority queue-based algorithm that builds phylogenetic trees from sequence identity matrices through iterative node merging.
Tree Construction Algorithm
The core algorithm follows these steps:
- Node Initialization: Each sequence in the identity matrix becomes a leaf node (IDNode) containing its identity scores to all other sequences
- Priority Queue Management: Nodes are maintained in a priority queue ordered by their maximum identity score to remaining nodes
- Iterative Clustering: The algorithm repeatedly selects the node with the highest maximum similarity and merges it with its most similar partner
- Distance Calculation: Branch lengths are calculated based on identity score differences, with the formula: length = max(left.max, right.max) - parent.max
- Newick Output: The final tree is serialized to standard Newick format with calculated branch lengths
Data Structures
The implementation uses specialized data structures for efficiency:
- IDNode Class: Represents both leaf and internal nodes, containing identity arrays, parent/child relationships, and BitSet tracking for processed sequences
- Priority Queue: Maintains nodes sorted by maximum identity score for efficient selection of next clustering candidates
- BitSet Tracking: Uses BitSet data structures to efficiently track which sequences have been clustered together
Memory Characteristics
Memory usage scales quadratically with the number of input sequences due to the identity matrix storage. For N sequences:
- Matrix Storage: O(N²) space for storing all pairwise identity scores
- Node Storage: O(N) space for tree nodes, with each internal node requiring additional array storage
- Working Memory: Additional memory for priority queue operations and BitSet tracking
Input Format Requirements
The identity matrix must follow specific formatting requirements:
- TSV Format: Tab-separated values with sequence names in the first column
- Symmetric Matrix: Identity scores should be symmetric (score[i][j] = score[j][i])
- Numerical Values: All identity scores must be valid double-precision numbers
- Header Handling: Lines starting with '#' are treated as comments and ignored
Tree Quality and Interpretation
The resulting Newick tree has specific characteristics:
- Branch Lengths: Represent identity score differences, with shorter branches indicating higher similarity
- Topology: Reflects hierarchical clustering based on maximum identity scores at each step
- Name Sanitization: Special Newick characters (parentheses, colons, commas, semicolons, whitespace) are replaced with underscores
- Precision: Branch lengths are formatted to 4 decimal places for balance between precision and readability
Support
For questions and support:
- Email: bbushnell@lbl.gov
- Documentation: bbmap.org