IDTree

Script: idtree.sh Package: tax Class: IDTree.java

Makes a Newick tree from an identity matrix. Intended for use with matrices created by idmatrix.sh.

Basic Usage

idtree.sh in=<input file> out=<output file>

IDTree reads an identity matrix in TSV (tab-separated values) format and constructs a phylogenetic tree in Newick format. The tool is designed to work with identity matrices generated by the idmatrix.sh tool, creating hierarchical clustering trees based on sequence similarity data.

Parameters

IDTree parameters are organized into standard file handling parameters and Java runtime parameters. The tool has a simple parameter set focused on input/output specification and memory management.

Standard parameters

in=<file>
Identity matrix in TSV format. The input file should contain a symmetric matrix where each row represents a sequence and each column represents the identity score to another sequence. The first column should contain sequence names, and subsequent columns contain numerical identity values.
out=<file>
Newick tree output. The output file will contain the constructed phylogenetic tree in standard Newick format, terminated with a semicolon. If not specified, output goes to stdout.
overwrite=f
(ow) Set to false to force the program to abort rather than overwrite an existing file. When set to true, existing output files will be overwritten without warning. Default: false

Java Parameters

-Xmx
This will set Java's memory usage, overriding autodetection. -Xmx20g will specify 20 gigs of RAM, and -Xmx200m will specify 200 megs. The max is typically 85% of physical memory. Default: 800m for IDTree.
-eoom
This flag will cause the process to exit if an out-of-memory exception occurs. Requires Java 8u92+. Useful for preventing hanging processes when memory is exhausted during tree construction.
-da
Disable assertions. This can provide a small performance improvement in production environments by skipping internal consistency checks.

Examples

Basic Tree Construction

idtree.sh in=identity_matrix.tsv out=phylo_tree.newick

Reads an identity matrix from identity_matrix.tsv and writes the resulting phylogenetic tree to phylo_tree.newick in standard Newick format.

Pipeline with IDMatrix

# First create identity matrix
idmatrix.sh in=sequences.fasta out=identity_matrix.tsv

# Then build tree from matrix
idtree.sh in=identity_matrix.tsv out=phylo_tree.newick

Complete workflow showing how to generate an identity matrix from sequences and then construct a phylogenetic tree from that matrix.

High Memory Usage

idtree.sh in=large_matrix.tsv out=tree.newick -Xmx8g

For large identity matrices with many sequences, increase memory allocation to 8GB to handle the tree construction process efficiently.

Algorithm Details

IDTree implements hierarchical agglomerative clustering using the IDNode.makeTree() method. The tree construction follows a greedy priority queue-based algorithm that builds phylogenetic trees from sequence identity matrices through iterative node merging.

Tree Construction Algorithm

The core algorithm follows these steps:

Data Structures

The implementation uses specialized data structures for efficiency:

Memory Characteristics

Memory usage scales quadratically with the number of input sequences due to the identity matrix storage. For N sequences:

Input Format Requirements

The identity matrix must follow specific formatting requirements:

Tree Quality and Interpretation

The resulting Newick tree has specific characteristics:

Support

For questions and support: