GI2Ancestors
Finds NCBI taxIDs of common ancestors of GI numbers. Processes input files containing sequence names and GI numbers to determine the lowest common ancestors in the NCBI taxonomic tree.
Basic Usage
gi2ancestors.sh in=<input file> out=<output file>
Each input line should contain a sequence name, then a tab, then a comma-separated list of GI numbers:
ori15 gi|818890693,gi|818890691,gi|837354594
Parameters
Parameters are organized by their function in the ancestor finding process. The tool requires NCBI taxonomic data files to operate correctly.
Standard parameters
- in=<file>
- Input text file with sequence names and GI numbers. Each line should contain a name, followed by a tab and a comma-delimited list of GI numbers in the format: gi|number,gi|number,...
- out=<file>
- Output file. Results will include ancestor taxIDs, majority consensus, and complete taxonomic lineages for each input sequence.
- overwrite=f
- (ow) Set to false to force the program to abort rather than overwrite an existing file. Default: false.
Taxonomic Data Files
- table=auto
- (gi, gitable) Path to GI-to-taxID mapping table file. Use "auto" to use the default location. Required for converting GI numbers to NCBI taxonomic IDs.
- tree=auto
- (taxtree) Path to NCBI taxonomic tree file. Use "auto" to use the default location. Contains the hierarchical relationships between taxonomic nodes.
- invalid=<file>
- Output file for sequences with invalid or unmappable GI numbers. Lines that cannot be processed will be written here instead of the main output.
Processing Options
- lines=-1
- Maximum number of lines to process from input file. Use -1 for unlimited processing. Useful for testing with large datasets.
- verbose=f
- Enable verbose output for debugging. Shows detailed processing information including file I/O operations and intermediate results.
Java Parameters
- -Xmx
- This will set Java's memory usage, overriding autodetection. -Xmx20g will specify 20 gigs of RAM, and -Xmx200m will specify 200 megs. The max is typically 85% of physical memory. Default: 10g.
- -eoom
- This flag will cause the process to exit if an out-of-memory exception occurs. Requires Java 8u92+.
- -da
- Disable assertions.
Examples
Basic Ancestor Finding
gi2ancestors.sh in=sequences.txt out=ancestors.txt
Processes a file containing sequence names and GI numbers to find their common ancestors.
With Custom Taxonomic Files
gi2ancestors.sh in=sequences.txt out=ancestors.txt tree=/path/to/taxtree.taxtree table=/path/to/gitable.int1d.gz
Uses custom taxonomic tree and GI table files instead of the default locations.
Processing with Invalid Output
gi2ancestors.sh in=sequences.txt out=ancestors.txt invalid=invalid_sequences.txt
Separates valid results from sequences with unmappable GI numbers.
Limited Processing
gi2ancestors.sh in=large_file.txt out=sample_ancestors.txt lines=1000
Processes only the first 1000 lines of input, useful for testing or sampling large datasets.
Algorithm Details
Input Processing
The tool expects tab-delimited input where the first column contains sequence identifiers and subsequent columns contain comma-separated GI numbers. GI numbers can be prefixed with "gi|" or provided as raw numbers.
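The input layout described above can be illustrated with a short Python sketch. This is not the tool's Java code; parse_line is a hypothetical helper, and raising ValueError on a malformed token is a choice made for this sketch, not documented behavior of gi2ancestors.sh.

```python
def parse_line(line):
    """Split one input line into (name, [gi, ...]).

    Assumes a tab between the name and the GI columns, and commas
    between GI numbers within a column.
    """
    fields = line.rstrip("\n").split("\t")
    name = fields[0]
    gis = []
    for column in fields[1:]:
        for token in column.split(","):
            token = token.strip()
            if not token:
                continue
            # Accept both "gi|818890693" and "818890693".
            if token.startswith("gi|"):
                token = token[3:]
            gis.append(int(token))  # raises ValueError on non-numeric tokens
    return name, gis


name, gis = parse_line("ori15\tgi|818890693,gi|818890691,837354594")
print(name, gis)  # ori15 [818890693, 818890691, 837354594]
```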
GI to TaxID Conversion
GI numbers are parsed from input using Parse.parseInt() and converted to NCBI taxonomic IDs via GiToTaxid.getID() hash table lookup. A GI number whose lookup returns -1 is treated as invalid and filtered out during processing.
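A minimal Python sketch of the lookup-and-filter step follows. The table contents are invented toy values (the real mapping is loaded from the gitable file), and gi_to_taxid and valid_taxids are hypothetical names used only for illustration.

```python
# Toy GI -> taxID table with made-up entries; not real NCBI data.
GI_TO_TAXID = {101: 562, 102: 562, 103: 561}


def gi_to_taxid(gi, table=GI_TO_TAXID):
    """Return the taxID for a GI number, or -1 if it is not in the table."""
    return table.get(gi, -1)


def valid_taxids(gis, table=GI_TO_TAXID):
    """Convert GI numbers to taxIDs, dropping any that map to -1."""
    taxids = (gi_to_taxid(gi, table) for gi in gis)
    return [t for t in taxids if t != -1]


print(valid_taxids([101, 999, 103]))  # [562, 561]
```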
Ancestor Finding Algorithm
The tool implements two complementary algorithms for finding taxonomic relationships:
Common Ancestor Detection
The findAncestor() method traverses the taxonomic tree to find the lowest common ancestor (LCA). The algorithm uses TaxTree.commonAncestor() calls to iteratively refine the result:
- Initializes with the first taxID as the ancestor candidate
- For each subsequent taxID, calls tree.commonAncestor(ancestor, id) to find shared parent
- Updates the ancestor variable to the returned node, which is either the current candidate or a node closer to the root
- Continues until all taxIDs are processed or ancestor becomes -1 (no common ancestor)
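The steps above can be sketched in Python on a toy tree. The parent map, node numbers, and function names are assumptions made for this example; the real tool works on TaxNode objects through TaxTree.commonAncestor(), whose internals are not shown here.

```python
# Toy tree as child -> parent; the root (1) is its own parent.
#        1
#        |
#        2
#       / \
#      3   4
#     / \   \
#    5   6   7
PARENT = {1: 1, 2: 1, 3: 2, 4: 2, 5: 3, 6: 3, 7: 4}


def path_to_root(node, parent):
    """Return the nodes from `node` up to and including the root."""
    path = [node]
    while parent[node] != node:
        node = parent[node]
        path.append(node)
    return path


def common_ancestor(a, b, parent):
    """Lowest common ancestor of two nodes, or -1 if either is unknown."""
    if a not in parent or b not in parent:
        return -1
    seen = set(path_to_root(a, parent))
    for node in path_to_root(b, parent):
        if node in seen:
            return node
    return -1


def find_ancestor(taxids, parent):
    """Fold common_ancestor over the list, starting from the first taxID."""
    if not taxids:
        return -1
    ancestor = taxids[0]
    for taxid in taxids[1:]:
        ancestor = common_ancestor(ancestor, taxid, parent)
        if ancestor == -1:
            break  # no common ancestor; stop early
    return ancestor


print(find_ancestor([5, 6], PARENT))     # 3
print(find_ancestor([5, 6, 7], PARENT))  # 2
```

Each additional taxID can only keep the candidate where it is or move it toward the root, which is why a single pass over the list is enough.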
Majority Consensus
The findMajority() method calculates consensus for sequences with 3+ taxIDs using a voting algorithm. For fewer than 3 taxIDs, it returns the common ancestor result:
- Each taxID votes by calling tree.percolateUp(node, 1) to add weight to all ancestors in its lineage
- Algorithm traverses from each taxID toward root, finding nodes with countSum >= majority threshold (size/2+1)
- Selects the node with majority support and lowest levelExtended (most specific taxonomic level)
- Resets vote counts using tree.percolateUp(node, -1) to restore tree state
- Falls back to lifeNode if no majority consensus is found
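The voting scheme can be approximated with the following Python sketch. Several details are assumptions: node depth stands in for levelExtended (deeper meaning more specific), each upward walk stops at the first node that meets the threshold, all taxIDs are assumed to exist in the tree, and percolate_up and find_majority are illustrative names rather than the tool's Java methods.

```python
# Toy tree as child -> parent; the root (1) is its own parent.
PARENT = {1: 1, 2: 1, 3: 2, 4: 2, 5: 3, 6: 3, 7: 4}
ROOT = 1  # plays the role of lifeNode in this sketch


def path_to_root(node, parent):
    """Return the nodes from `node` up to and including the root."""
    path = [node]
    while parent[node] != node:
        node = parent[node]
        path.append(node)
    return path


def percolate_up(node, amount, counts, parent):
    """Add `amount` to the vote count of `node` and all of its ancestors."""
    for n in path_to_root(node, parent):
        counts[n] = counts.get(n, 0) + amount


def find_majority(taxids, parent, counts, root=ROOT):
    if not taxids:
        return -1
    if len(taxids) < 3:
        # Fewer than 3 taxIDs: return the plain common ancestor.
        shared = set(path_to_root(taxids[0], parent))
        return next(n for n in path_to_root(taxids[-1], parent) if n in shared)

    threshold = len(taxids) // 2 + 1
    for t in taxids:
        percolate_up(t, 1, counts, parent)  # cast votes

    best, best_depth = root, 0  # fall back to the root if nothing better
    for t in taxids:
        for node in path_to_root(t, parent):
            if counts[node] >= threshold:
                depth = len(path_to_root(node, parent)) - 1
                if depth > best_depth:
                    best, best_depth = node, depth
                break  # assumption: first qualifying node on this walk

    for t in taxids:
        percolate_up(t, -1, counts, parent)  # restore the shared counts
    return best


COUNTS = {}
print(find_majority([5, 6, 7], PARENT, COUNTS))  # 3
```

On this toy tree the majority result for taxIDs 5, 6, 7 is node 3, while their common ancestor is node 2: two of the three taxIDs agree on the more specific node, so the single outlier does not pull the result toward the root.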
Output Format
The output uses tab-delimited format with a header line "#Name\tAncestor\tMajority\tTaxonomy...". For each processed sequence:
- Summary line: Sequence name, ancestor taxID, majority taxID, followed by tab-delimited taxonomic lineage from majority result
- Individual lineages: One line per valid input taxID showing complete taxonomic path from root to species level
- Lineage construction: fillTraversal() builds paths by walking from taxID to root, writeTraversal() outputs node names in root-to-leaf order
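A rough Python sketch of how a summary line might be assembled is shown below. The node names are invented placeholders, the function names only echo fillTraversal() and writeTraversal(), and whether the real output prints the root node's name is not confirmed by this document.

```python
# Toy tree and placeholder names; not real NCBI taxonomy.
PARENT = {1: 1, 2: 1, 3: 2, 5: 3}
NAMES = {1: "root", 2: "cladeA", 3: "genusX", 5: "speciesY"}


def fill_traversal(taxid, parent):
    """Collect nodes walking from `taxid` up to the root (leaf-to-root order)."""
    traversal = [taxid]
    while parent[taxid] != taxid:
        taxid = parent[taxid]
        traversal.append(taxid)
    return traversal


def write_traversal(traversal, names):
    """Render node names in root-to-leaf order, tab-delimited."""
    return "\t".join(names[n] for n in reversed(traversal))


def summary_line(name, ancestor, majority, parent, names):
    """Name, ancestor taxID, majority taxID, then the majority lineage."""
    lineage = write_traversal(fill_traversal(majority, parent), names)
    return f"{name}\t{ancestor}\t{majority}\t{lineage}"


print(summary_line("ori15", 2, 3, PARENT, NAMES))
```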
Memory Management
The tool uses TaxTree.loadTaxTree() to load the complete NCBI taxonomic hierarchy into memory with TaxNode objects linked by parent-child relationships. GiToTaxid.initialize() loads the GI-to-taxID mapping table. Default memory allocation is 10GB (configurable via -Xmx). The lifeNode reference provides the root node for traversal operations.
Performance Characteristics
Processing time is linear with respect to input size, with the dominant factors being:
- GI-to-taxID lookup operations (O(1) with hash table)
- Tree traversal for ancestor finding (O(depth) per taxID pair)
- Lineage reconstruction (O(depth) per result)
Support
For questions and support:
- Email: bbushnell@lbl.gov
- Documentation: bbmap.org