GI2TaxID

Script: gi2taxid.sh | Package: tax | Class: RenameGiToTaxid.java

Converts sequence headers from GI numbers, accession numbers, or organism names to NCBI taxonomy IDs, enabling efficient downstream taxonomy processing and eliminating the need for large lookup tables in subsequent BBTools operations.

Overview

GI2TaxID is the first step of the BBTools taxonomy pipeline: it rewrites sequence identifiers into the preferred tid|123| format. Carrying the taxonomy ID in the header keeps memory usage low and improves performance in downstream taxonomy operations such as filtering, binning, and abundance analysis.

Strategic Importance

Converting sequences to taxonomy ID format provides several critical advantages:

Transformation Examples

GI Number Conversion

# Input header:
>gi|123456789|ref|NZ_CP001234.1| Escherichia coli strain K-12

# Output header:
>tid|511145|gi|123456789|ref|NZ_CP001234.1| Escherichia coli strain K-12

Accession Conversion

# Input header:
>NZ_CP001234.1 Escherichia coli strain K-12 complete genome

# Output header:
>tid|511145|NZ_CP001234.1 Escherichia coli strain K-12 complete genome

Organism Name Conversion

# Input header:
>Escherichia_coli sequence

# Output header:
>tid|511145|Escherichia_coli sequence

Basic Usage

gi2taxid.sh in=<file> out=<file> [server | tree=auto table=auto accession=auto]

Choose between server mode (online lookup, no downloads required) or local mode (requires downloading and processing NCBI taxonomy files).

Parameters

Input/Output Parameters

in=<file>
Input sequences in FASTA format (required). Can be a comma-delimited list, or several space-delimited filenames given without in=. Example: gi2taxid.sh x.fa y.fa z.fa out=tid.fa tree=auto table=auto
out=<file>
Output file for renamed sequences with taxonomy IDs.
invalid=<file>
Destination for sequences where no taxonomy ID could be determined.

Processing Options

keepall=t
Keep sequences with no taxonomy ID in normal output rather than filtering them out. Default: true
prefix=t
Prepend the taxonomy ID to the original header, which is kept intact. Default: true
title=tid
Set the identifier prefix in output headers (e.g., tid, ncbi, taxid). Default: tid
silva=f
Enable parsing of Silva database header format with semicolon-delimited taxonomic information. Default: false
shrinknames=f
Replace concatenated headers (common in nr/nt databases) with just the first header to reduce file size. Default: false
deleteinvalid=f
Delete the entire output file if any invalid headers are encountered (strict quality control). Default: false

Taxonomy Data Source

server=f
Use JGI's online taxonomy server (https://taxonomy.jgi.doe.gov) instead of local files. Recommended for accession-based sequences. Requires an internet connection but eliminates the need for local taxonomy downloads. Default: false
tree=
Path to the taxonomy tree file (.taxtree.gz). Use 'auto' for the default location set by taxpath. Required for organism name lookups.
gi=
Path to the GI table file for GI number to taxonomy ID conversion. Use 'auto' for the default location. Needed only for gi|123| format headers. The usage examples in this document pass this file as table=auto.
accession=
Comma-delimited NCBI accession-to-taxonomy files. Use 'auto' for default. Required for accession number conversion. Can consume 40GB+ RAM.

Advanced Parameters

maxbadheaders=
Maximum number of invalid headers before terminating processing. Useful for quality control.
badheaders=<file>
File to write problematic headers for debugging purposes.
warn=t
Print warnings for bad headers to stderr. Default: true
mode=
Processing mode: accession (0), gi (1), header (2), unite (3). Usually auto-detected from input format.

Compression Parameters

ziplevel=2
Compression level for gzipped output files (1-9). Default: 2
pigz=t
Use parallel gzip (pigz) for faster compression. Requires pigz installation. Default: true

Java Parameters

-Xmx
Java heap memory allocation. For accession processing, 40GB+ is recommended (e.g., -Xmx63g). Default: 7g
-eoom
Exit on out-of-memory exception. Requires Java 8u92+.
-da
Disable Java assertions for slight performance improvement.

Usage Examples

Server Mode (Recommended for Most Users)

gi2taxid.sh in=sequences.fa out=renamed.fa server

Uses JGI's online taxonomy server for fast accession lookups without requiring local taxonomy files. Works best with RefSeq/GenBank sequences.

Local Mode with Auto-Detection

gi2taxid.sh in=sequences.fa out=renamed.fa tree=auto table=auto accession=auto taxpath=/usr/tax/

Uses local taxonomy files downloaded with fetchTaxonomy.sh. Requires significant RAM (40GB+) for accession processing but works offline.

Large-Scale NT Database Processing

gi2taxid.sh -Xmx63g in=nt.fa.gz out=renamed.fa.gz tree=auto accession=auto table=auto shrinknames taxpath=/path/to/taxonomy

Processes NCBI's nt database with concatenated headers, using shrinknames to reduce output size and high memory allocation.
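
The effect of shrinknames on a single header can be illustrated with a small sketch. It assumes that concatenated headers are separated by the Ctrl-A character (byte value 1), as in NCBI's merged nr/nt deflines; how RenameGiToTaxid detects the boundary internally is not described here, and ShrinkNamesSketch is a hypothetical name, not a BBTools class.

```java
// Illustrative sketch only; not the BBTools implementation.
final class ShrinkNamesSketch {

    // Assumption: merged deflines are joined with Ctrl-A (character code 1).
    static final char CTRL_A = (char) 1;

    /** Returns the header truncated at the first Ctrl-A, or unchanged if none is present. */
    static String firstHeader(String header) {
        int idx = header.indexOf(CTRL_A);
        return idx < 0 ? header : header.substring(0, idx);
    }

    public static void main(String[] args) {
        String merged = "NZ_CP001234.1 first description" + CTRL_A + "XX_000001.1 second description";
        System.out.println(firstHeader(merged));
    }
}
```

Keeping only the first header drops the later descriptions, so this is best suited to cases where the first accession is enough to resolve the taxonomy ID.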

Silva Database Format

gi2taxid.sh in=silva_sequences.fa out=renamed.fa silva=t tree=auto taxpath=/usr/tax/

Processes Silva 16S/18S database with specialized header parsing for semicolon-delimited taxonomy.
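
As a rough illustration of what semicolon-delimited parsing involves, the sketch below pulls the last lineage field out of a Silva-style header. It assumes the header consists of an accession, a space, and then the lineage; the class name, the method name, and the sample header are made up for illustration, and the actual parsing rules applied by silva=t may differ.

```java
// Illustrative sketch only; not the BBTools Silva parser.
final class SilvaHeaderSketch {

    /** Returns the last non-empty lineage field of "accession rank1;rank2;...;organism". */
    static String organismName(String header) {
        int space = header.indexOf(' ');
        String lineage = space < 0 ? header : header.substring(space + 1);
        String[] ranks = lineage.split(";");   // trailing empty fields are dropped by split
        return ranks.length == 0 ? "" : ranks[ranks.length - 1].trim();
    }

    public static void main(String[] args) {
        // Hypothetical header in the assumed layout.
        String header = "X00001.1.1500 Bacteria;Proteobacteria;Gammaproteobacteria;Escherichia coli";
        System.out.println(organismName(header));
    }
}
```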

Multiple Input Files

gi2taxid.sh bacteria.fa archaea.fa viruses.fa out=all_renamed.fa server

Combines multiple FASTA files into a single output with taxonomy ID headers for streamlined downstream processing.

Quality Control with Invalid Sequence Separation

gi2taxid.sh in=input.fa out=valid.fa invalid=unresolved.fa server keepall=f

Separates sequences with resolved taxonomy IDs from those that couldn't be assigned, enabling quality assessment.
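
One way to assess the result is to count how many headers in a FASTA file carry the tid| prefix. The sketch below does this for a list of lines; it assumes that unresolved headers simply lack the prefix, which this document does not confirm, and TidHeaderCountSketch is a hypothetical helper rather than part of BBTools.

```java
import java.util.Arrays;
import java.util.List;

// Illustrative sketch only; a standalone check, not a BBTools utility.
final class TidHeaderCountSketch {

    /** Returns {prefixed, unprefixed} header counts; sequence lines are ignored. */
    static int[] count(List<String> fastaLines, String title) {
        String marker = ">" + title + "|";
        int prefixed = 0;
        int unprefixed = 0;
        for (String line : fastaLines) {
            if (!line.startsWith(">")) {
                continue;                      // not a header line
            }
            if (line.startsWith(marker)) {
                prefixed++;
            } else {
                unprefixed++;
            }
        }
        return new int[] {prefixed, unprefixed};
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(">tid|511145|NZ_CP001234.1", "ACGT", ">unknown_contig", "GGCC");
        int[] c = count(lines, "tid");
        System.out.println("prefixed=" + c[0] + " unprefixed=" + c[1]);
    }
}
```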

Workflow Integration

GI2TAXID serves as the foundation for BBTools taxonomy workflows:

Complete Taxonomy Pipeline

# Step 1: Convert identifiers to taxonomy IDs
gi2taxid.sh in=sequences.fa out=tid_sequences.fa server

# Step 2: Filter by taxonomic group
filterbytaxa.sh in=tid_sequences.fa out=bacteria_only.fa names=Bacteria level=superkingdom tree=auto

# Step 3: Bin sequences by family
splitbytaxa.sh in=bacteria_only.fa out=family_%.fa level=family tree=auto

# Step 4: Generate abundance profiles
seal.sh in=reads.fq ref=tid_sequences.fa out=abundance.txt k=31 ambiguous=random tree=auto level=species

This workflow demonstrates how GI2TAXID enables efficient downstream taxonomy operations by eliminating the need for large lookup tables in each subsequent step.
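
To see why no lookup table is needed downstream, consider how a consumer could read the taxonomy ID straight from a renamed header. This is a minimal sketch assuming the default tid title and the tid|123| layout shown in the transformation examples; TidParseSketch is a hypothetical name, and the BBTools tools use their own parsing code.

```java
// Illustrative sketch only; not how filterbytaxa.sh or seal.sh parse headers.
final class TidParseSketch {

    /** Returns the taxonomy ID from ">tid|511145|...", or -1 if the prefix is absent or malformed. */
    static int parseTaxId(String header) {
        String marker = "tid|";
        int start = header.startsWith(">") ? 1 : 0;
        if (!header.startsWith(marker, start)) {
            return -1;
        }
        int from = start + marker.length();
        int end = header.indexOf('|', from);   // closing pipe after the number
        if (end < 0) {
            return -1;
        }
        try {
            return Integer.parseInt(header.substring(from, end));
        } catch (NumberFormatException e) {
            return -1;
        }
    }

    public static void main(String[] args) {
        System.out.println(parseTaxId(">tid|511145|NZ_CP001234.1 Escherichia coli strain K-12"));
    }
}
```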

Algorithm Details

Multi-Modal Processing Architecture

RenameGiToTaxid implements four processing modes, accession, gi, header, and unite (the values accepted by the mode parameter), through the processInner() and processInner_server() methods.

Server vs Local Processing Strategies

Local Processing: TaxTree.loadTaxTree() builds a hierarchical in-memory tree with a nameMap HashMap for organism-name lookups. TaxNode objects hold the parent-child relationships, which enables lineage traversal. The AccessionToTaxid class provides hash-based accession mapping and requires substantial memory.
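
The combination of a name map and parent links can be sketched with a toy tree. Everything below is hypothetical: LineageSketch and its Node class are stand-ins for illustration, not the real TaxTree or TaxNode API, and the node IDs are invented. The only convention borrowed from NCBI is that the root node is its own parent.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Illustrative sketch only; not the BBTools TaxTree implementation.
final class LineageSketch {

    static final class Node {
        final int id;
        final int parentId;
        final String name;

        Node(int id, int parentId, String name) {
            this.id = id;
            this.parentId = parentId;
            this.name = name;
        }
    }

    private final Map<Integer, Node> byId = new HashMap<>();
    private final Map<String, Node> nameMap = new HashMap<>();

    void add(int id, int parentId, String name) {
        Node n = new Node(id, parentId, name);
        byId.put(id, n);
        nameMap.put(name.toLowerCase(Locale.ROOT), n);   // case-insensitive name lookup
    }

    /** Walks from the named node up to the root; returns an empty list for unknown names. */
    List<Integer> lineage(String name) {
        List<Integer> path = new ArrayList<>();
        Node n = nameMap.get(name.toLowerCase(Locale.ROOT));
        while (n != null) {
            path.add(n.id);
            if (n.parentId == n.id) {
                break;                                   // root is its own parent
            }
            n = byId.get(n.parentId);
        }
        return path;
    }

    public static void main(String[] args) {
        LineageSketch tree = new LineageSketch();
        tree.add(1, 1, "root");
        tree.add(10, 1, "Examplegenus");
        tree.add(100, 10, "Examplegenus species");
        System.out.println(tree.lineage("examplegenus species"));
    }
}
```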

Server Processing: Sends batched HTTP requests through updateHeadersFromServer(), buffering headers up to a configurable threshold (maxStoredBytes=10MB). translateToTaxIDs() handles TaxClient communication and retries up to 10 times with a linearly increasing delay (Thread.sleep(500*(i+2))). Queries are concatenated into comma-separated batches with ByteBuilder.
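
The delay expression quoted above, 500*(i+2), grows by a fixed 500 ms per attempt (1000 ms, 1500 ms, and so on). A minimal sketch of a retry loop built on that schedule follows. RetrySketch, callWithRetry, and the injected sleeper are hypothetical names; whether the real code sleeps after the final failed attempt is not stated here, so this version does not.

```java
import java.util.function.LongConsumer;
import java.util.function.Supplier;

// Illustrative sketch only; not the translateToTaxIDs() implementation.
final class RetrySketch {

    static final int MAX_ATTEMPTS = 10;

    /** Delay before the next try, using the expression quoted in the text. */
    static long delayMillis(int attempt) {
        return 500L * (attempt + 2);
    }

    /** Calls the request until it succeeds or MAX_ATTEMPTS is reached, sleeping between failures. */
    static <T> T callWithRetry(Supplier<T> request, LongConsumer sleeper) {
        RuntimeException last = null;
        for (int i = 0; i < MAX_ATTEMPTS; i++) {
            try {
                return request.get();
            } catch (RuntimeException e) {
                last = e;
                if (i < MAX_ATTEMPTS - 1) {
                    sleeper.accept(delayMillis(i));   // assumption: no sleep after the last attempt
                }
            }
        }
        throw last;
    }

    public static void main(String[] args) {
        for (int i = 0; i < MAX_ATTEMPTS; i++) {
            System.out.println("attempt " + i + " -> wait " + delayMillis(i) + " ms");
        }
    }
}
```

Passing the sleeper in as a parameter keeps the loop testable; a real caller would supply a wrapper around Thread.sleep.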

Header Processing Pipeline

The processInner() method executes multi-stage header transformation:

Memory Management and Performance

Default allocation uses calcXmx() with freeRam(7000m, 84) setting -Xmx7g/-Xms7g heap parameters. For accession processing, 40GB+ allocation is recommended. HashArray1D(256000, -1L, true) provides taxonomy occurrence counting with 256K initial capacity and automatic resizing. Server mode batching limits memory through maxStoredBytes thresholds before processing accumulated headers.
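
The batching idea, accumulating headers until a byte threshold is reached and then sending them as one comma-separated query, can be sketched as follows. BatchSketch and its sender callback are hypothetical; the real code uses ByteBuilder and TaxClient, and the exact threshold comparison and byte accounting are assumptions made for this example.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Illustrative sketch only; not the updateHeadersFromServer() implementation.
final class BatchSketch {

    private final long maxStoredBytes;
    private final Consumer<String> sender;
    private final List<String> pending = new ArrayList<>();
    private long storedBytes = 0;

    BatchSketch(long maxStoredBytes, Consumer<String> sender) {
        this.maxStoredBytes = maxStoredBytes;
        this.sender = sender;
    }

    /** Buffers one identifier and flushes once the stored size reaches the threshold. */
    void add(String accession) {
        pending.add(accession);
        storedBytes += accession.getBytes(StandardCharsets.UTF_8).length;
        if (storedBytes >= maxStoredBytes) {   // assumption: inclusive comparison
            flush();
        }
    }

    /** Sends everything buffered so far as a single comma-separated query. */
    void flush() {
        if (pending.isEmpty()) {
            return;
        }
        sender.accept(String.join(",", pending));
        pending.clear();
        storedBytes = 0;
    }

    public static void main(String[] args) {
        BatchSketch batch = new BatchSketch(16, q -> System.out.println("query: " + q));
        batch.add("NZ_CP001234.1");
        batch.add("NZ_CP005678.1");
        batch.flush();   // send whatever remains at end of input
    }
}
```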

Error Handling and Quality Control

Output Format Specification

Header reformatting uses configurable byte arrays with default title=">tid|".getBytes(). Prefix mode appends original headers after taxonomy IDs via character-by-character copying. GFF format processing replaces the first tab-delimited column with "tid|[taxid]" format. Count mode employs HashArray1D.increment() for taxonomy occurrence frequency tracking in output headers.
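
The two output shapes described above can be sketched with plain string handling. OutputFormatSketch is a hypothetical name; the real code works on byte arrays, and leaving GFF comment lines untouched is an assumption of this example rather than documented behavior.

```java
// Illustrative sketch only; not the RenameGiToTaxid byte-level code.
final class OutputFormatSketch {

    /** Prefix mode for FASTA: ">title|taxid|" followed by the original header text. */
    static String prefixHeader(String header, int taxId, String title) {
        String body = header.startsWith(">") ? header.substring(1) : header;
        return ">" + title + "|" + taxId + "|" + body;
    }

    /** GFF: replaces the first tab-delimited column with "title|taxid". */
    static String replaceFirstGffColumn(String line, int taxId, String title) {
        if (line.isEmpty() || line.charAt(0) == '#') {
            return line;                        // assumption: comments pass through unchanged
        }
        int tab = line.indexOf('\t');
        String rest = tab < 0 ? "" : line.substring(tab);
        return title + "|" + taxId + rest;
    }

    public static void main(String[] args) {
        System.out.println(prefixHeader(">NZ_CP001234.1 Escherichia coli strain K-12 complete genome", 511145, "tid"));
        System.out.println(replaceFirstGffColumn("NZ_CP001234.1\tRefSeq\tgene\t1\t100\t.\t+\t.\tID=g1", 511145, "tid"));
    }
}
```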

Performance Considerations

Memory Requirements

Processing Speed

Recommendations

Troubleshooting

Common Issues

Out of Memory Errors
Increase the -Xmx parameter (40GB+ is recommended for accession processing) or switch to server mode on memory-limited systems.
Server Connection Failures
Check internet connectivity. The tool retries failed requests with increasing delays, but persistent failures may require switching to local mode.
High Invalid Sequence Count
Review input format compatibility. Use badheaders parameter to examine problematic sequences. Consider adjusting parsing mode or preprocessing headers.
Slow Processing
For large datasets with accessions, ensure sufficient RAM allocation. Consider server mode for initial processing or preprocessing to reduce dataset size.
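
When diagnosing memory problems, it can help to confirm how much heap the JVM actually received. The sketch below compares the reported maximum heap with the 40GB recommendation; HeapCheckSketch is a hypothetical helper, and the value from Runtime.maxMemory() may be somewhat lower than the -Xmx setting, so treat the result as a rough guide.

```java
// Illustrative sketch only; a standalone check, not part of BBTools.
final class HeapCheckSketch {

    /** True if the given heap size is at least the recommended number of GiB. */
    static boolean meets(long maxHeapBytes, int recommendedGiB) {
        return maxHeapBytes >= recommendedGiB * (1L << 30);
    }

    public static void main(String[] args) {
        long max = Runtime.getRuntime().maxMemory();   // may be slightly below -Xmx
        System.out.printf("Max heap: %.1f GiB; meets 40 GiB recommendation: %b%n",
                max / (double) (1L << 30), meets(max, 40));
    }
}
```

Running it with the same -Xmx flag that is passed to gi2taxid.sh shows whether the setting took effect on the machine in question.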
