GI2Ancestors

Basic Usage

gi2ancestors.sh in=<input file> out=<output file>

Each input line should contain a sequence name, a tab, and a comma-separated list of GI numbers:

ori15	gi|818890693,gi|818890691,gi|837354594

Parameters

Parameters are grouped by function. The tool requires NCBI taxonomic data files (a GI-to-taxID table and a taxonomic tree) to operate correctly.

Standard parameters

in=<file>: Input text file with sequence names and GI numbers. Each line should contain a name, then a tab, then comma-delimited GI numbers in the format: gi|number,gi|number,...
out=<file>: Output file. Results will include ancestor taxIDs, majority consensus, and complete taxonomic lineages for each input sequence.
overwrite=f: (ow) Set to false to force the program to abort rather than overwrite an existing file. Default: false.

Taxonomic Data Files

table=auto: (gi, gitable) Path to GI-to-taxID mapping table file. Use "auto" to use the default location. Required for converting GI numbers to NCBI taxonomic IDs.
tree=auto: (taxtree) Path to NCBI taxonomic tree file. Use "auto" to use the default location. Contains the hierarchical relationships between taxonomic nodes.
invalid=<file>: Output file for sequences with invalid or unmappable GI numbers. Lines that cannot be processed will be written here instead of the main output.

Processing Options

lines=-1: Maximum number of lines to process from input file. Use -1 for unlimited processing. Useful for testing with large datasets.
verbose=f: Enable verbose output for debugging. Shows detailed processing information including file I/O operations and intermediate results.

Java Parameters

-Xmx: Sets Java's maximum heap size, overriding autodetection. For example, -Xmx20g requests 20 GB of RAM and -Xmx200m requests 200 MB. The maximum is typically 85% of physical memory. Default: 10g.
-eoom: This flag will cause the process to exit if an out-of-memory exception occurs. Requires Java 8u92+.
-da: Disable assertions.

Examples

Basic Ancestor Finding

gi2ancestors.sh in=sequences.txt out=ancestors.txt

Processes a file containing sequence names and GI numbers to find their common ancestors.

With Custom Taxonomic Files

gi2ancestors.sh in=sequences.txt out=ancestors.txt tree=/path/to/taxtree.taxtree table=/path/to/gitable.int1d.gz

Uses custom taxonomic tree and GI table files instead of the default locations.

Processing with Invalid Output

gi2ancestors.sh in=sequences.txt out=ancestors.txt invalid=invalid_sequences.txt

Separates valid results from sequences with unmappable GI numbers.

Limited Processing

gi2ancestors.sh in=large_file.txt out=sample_ancestors.txt lines=1000

Processes only the first 1000 lines of input, useful for testing or sampling large datasets.

Algorithm Details

Input Processing

The tool expects tab-delimited input where the first column contains sequence identifiers and subsequent columns contain comma-separated GI numbers. GI numbers can be prefixed with "gi|" or provided as raw numbers.
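
As an illustration of the format described above, here is a minimal Python sketch of parsing one input line. It is not the tool's Java code; the function name parse_line is hypothetical, and the sketch simply assumes that every token is either "gi|number" or a raw number.

```python
def parse_line(line):
    """Split one input line into (name, list of GI numbers).

    Hypothetical helper for illustration; not part of gi2ancestors.sh.
    """
    fields = line.rstrip("\n").split("\t")
    name = fields[0]
    gis = []
    for column in fields[1:]:
        for token in column.split(","):
            token = token.strip()
            if not token:
                continue
            # Accept both "gi|818890693" and the raw number "818890693".
            if token.startswith("gi|"):
                token = token[3:]
            gis.append(int(token))  # raises ValueError on non-numeric tokens
    return name, gis


name, gis = parse_line("ori15\tgi|818890693,gi|818890691,837354594")
print(name, gis)  # ori15 [818890693, 818890691, 837354594]
```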

GI to TaxID Conversion

GI numbers are parsed from input using Parse.parseInt() and converted to NCBI taxonomic IDs via GiToTaxid.getID() hash table lookup. The tool handles both "gi|number" prefixed and raw number formats. Invalid GI numbers (returning -1 from lookup) are filtered out during processing.
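
The lookup-and-filter step can be sketched in Python as follows. This is only an illustration: the dictionary stands in for the real GI table loaded from table=, and the GI numbers and taxIDs in it are made-up toy values, not real NCBI mappings.

```python
# Toy stand-in for the GI-to-taxID table (made-up values, for illustration only).
GI_TO_TAXID = {101: 10, 102: 11, 103: 10}


def gi_to_taxids(gis, table):
    """Map GI numbers to taxIDs, dropping GIs that have no mapping."""
    taxids = []
    for gi in gis:
        taxid = table.get(gi, -1)  # -1 marks an unmapped GI, as described above
        if taxid != -1:
            taxids.append(taxid)
    return taxids


print(gi_to_taxids([101, 102, 999], GI_TO_TAXID))  # [10, 11]
```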

Ancestor Finding Algorithm

The tool implements two complementary algorithms for finding taxonomic relationships:

Common Ancestor Detection

The findAncestor() method traverses the taxonomic tree to find the lowest common ancestor (LCA). The algorithm uses TaxTree.commonAncestor() calls to iteratively refine the result:

Initializes with the first taxID as the ancestor candidate
For each subsequent taxID, calls tree.commonAncestor(ancestor, id) to find shared parent
Updates the ancestor variable to the returned node, which is the most specific node shared by all taxIDs seen so far (either the same node or one closer to the root)
Continues until all taxIDs are processed or ancestor becomes -1 (no common ancestor)
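
The steps above amount to folding a pairwise common-ancestor operation over the list of taxIDs. The Python sketch below shows the idea on a toy parent-pointer tree. It is not the TaxTree implementation: the PARENT table, the helper names, and the choice to return -1 for an empty list are all assumptions made for illustration.

```python
# Toy tree (hypothetical node IDs): 1 is the root; 3 and 4 are children of 2.
PARENT = {1: None, 2: 1, 3: 2, 4: 2, 5: 1}


def lineage(parent, node):
    """Return the path from node up to the root, inclusive."""
    path = []
    while node is not None:
        path.append(node)
        node = parent[node]
    return path


def common_ancestor(parent, a, b):
    """Lowest common ancestor of a and b, or -1 if they share none."""
    seen = set(lineage(parent, a))
    for node in lineage(parent, b):
        if node in seen:
            return node
    return -1


def find_ancestor(parent, taxids):
    """Fold common_ancestor over all taxIDs, stopping early on -1."""
    if not taxids:
        return -1  # assumption: no taxIDs means no ancestor
    ancestor = taxids[0]
    for taxid in taxids[1:]:
        ancestor = common_ancestor(parent, ancestor, taxid)
        if ancestor == -1:
            break
    return ancestor


print(find_ancestor(PARENT, [3, 4]))     # 2
print(find_ancestor(PARENT, [3, 4, 5]))  # 1
```

Each pairwise step can only keep the candidate where it is or move it toward the root, so a single outlier taxID is enough to push the result all the way up.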

Majority Consensus

The findMajority() method returns the common ancestor result for sequences with fewer than 3 taxIDs. For 3 or more taxIDs, it calculates a consensus using a voting algorithm:

Each taxID votes by calling tree.percolateUp(node, 1) to add weight to all ancestors in its lineage
The algorithm then walks from each taxID toward the root, looking for nodes whose countSum is at least the majority threshold (size/2+1)
Selects the node with majority support and lowest levelExtended (most specific taxonomic level)
Resets vote counts using tree.percolateUp(node, -1) to restore tree state
Falls back to lifeNode if no majority consensus is found
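
The voting step can be sketched in Python as shown below. This is a simplified illustration, not the tool's code: it keeps votes in a local Counter instead of on the tree nodes (so no reset pass is needed), it uses depth below the root as a stand-in for levelExtended, and it omits the shortcut for fewer than 3 taxIDs. The PARENT table and ROOT are hypothetical toy data.

```python
from collections import Counter

# Toy tree (hypothetical node IDs): 1 is the root; 3 and 4 are children of 2.
PARENT = {1: None, 2: 1, 3: 2, 4: 2, 5: 1}
ROOT = 1


def lineage(parent, node):
    """Return the path from node up to the root, inclusive."""
    path = []
    while node is not None:
        path.append(node)
        node = parent[node]
    return path


def find_majority(parent, taxids, root=ROOT):
    """Return the deepest node supported by a strict majority of taxIDs."""
    votes = Counter()
    for taxid in taxids:
        for node in lineage(parent, taxid):
            votes[node] += 1  # each taxID adds one vote to every ancestor

    threshold = len(taxids) // 2 + 1
    best, best_depth = root, 0  # fall back to the root if nothing qualifies
    for taxid in taxids:
        path = lineage(parent, taxid)
        for steps_up, node in enumerate(path):
            if votes[node] >= threshold:
                depth = len(path) - 1 - steps_up  # root has depth 0
                if depth > best_depth:
                    best, best_depth = node, depth
                break  # first hit is the most specific node on this path
    return best


print(find_majority(PARENT, [3, 4, 5]))  # 2
print(find_majority(PARENT, [3, 3, 5]))  # 3
```

In the first call, node 2 gets two of three votes, so the majority result is 2 even though the strict common ancestor of 3, 4, and 5 is the root. This is why the majority column can be more specific than the ancestor column when a minority of GI numbers map to an unrelated taxon.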

Output Format

The output uses tab-delimited format with a header line "#Name\tAncestor\tMajority\tTaxonomy...". For each processed sequence:

Summary line: Sequence name, ancestor taxID, majority taxID, followed by tab-delimited taxonomic lineage from majority result
Individual lineages: One line per valid input taxID showing complete taxonomic path from root to species level
Lineage construction: fillTraversal() builds paths by walking from taxID to root, writeTraversal() outputs node names in root-to-leaf order
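
A minimal Python sketch of how a summary line could be assembled from these pieces is shown below. The function names mirror fillTraversal() and writeTraversal() only loosely, and the tree, node names, and summary_line helper are hypothetical toy data and code, not actual NCBI taxonomy or the tool's output routine.

```python
# Toy tree and node names (hypothetical, for illustration only).
PARENT = {1: None, 2: 1, 3: 2}
NAMES = {1: "root", 2: "cladeA", 3: "speciesA1"}


def fill_traversal(parent, taxid):
    """Collect node IDs while walking from taxid up to the root."""
    path = []
    node = taxid
    while node is not None:
        path.append(node)
        node = parent[node]
    return path  # leaf-to-root order


def summary_line(name, ancestor, majority, parent, names):
    """Build one tab-delimited summary line in root-to-leaf order."""
    traversal = fill_traversal(parent, majority)
    lineage_names = [names[node] for node in reversed(traversal)]
    return "\t".join([name, str(ancestor), str(majority)] + lineage_names)


print("#Name\tAncestor\tMajority\tTaxonomy...")
print(summary_line("ori15", 2, 3, PARENT, NAMES))
```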

Memory Management

The tool uses TaxTree.loadTaxTree() to load the complete NCBI taxonomic hierarchy into memory with TaxNode objects linked by parent-child relationships. GiToTaxid.initialize() loads the GI-to-taxID mapping table. Default memory allocation is 10GB (configurable via -Xmx). The lifeNode reference provides the root node for traversal operations.

Performance Characteristics

Processing time is linear with respect to input size, with the dominant factors being:

GI-to-taxID lookup operations (O(1) with hash table)
Tree traversal for ancestor finding (O(depth) per taxID pair)
Lineage reconstruction (O(depth) per result)

Support

For questions and support:

Email: bbushnell@lbl.gov
Documentation: bbmap.org