GI2TaxID

Overview

GI2TAXID (gi2taxid.sh) is the BBTools taxonomy tool that renames sequence headers into the tid|123| format preferred by the rest of the taxonomy pipeline. Converting once, up front, saves memory and time in downstream taxonomy operations such as filtering, binning, and abundance analysis.

Strategic Importance

Converting sequences to taxonomy ID format provides several critical advantages:

Memory Efficiency: Eliminates the need to load 40GB+ accession tables in subsequent operations
Processing Speed: Direct taxonomy ID lookups are orders of magnitude faster than accession parsing
Pipeline Integration: Enables seamless use with FilterByTaxa, SplitByTaxa, Seal, and BBSketch
Server Independence: Once converted, downstream tools no longer need network access or the large accession tables; tools run with tree=auto still read the local taxonomy tree.

Transformation Examples

GI Number Conversion

# Input header:
>gi|123456789|ref|NZ_CP001234.1| Escherichia coli strain K-12

# Output header:
>tid|511145|gi|123456789|ref|NZ_CP001234.1| Escherichia coli strain K-12

Accession Conversion

# Input header:
>NZ_CP001234.1 Escherichia coli strain K-12 complete genome

# Output header:
>tid|511145|NZ_CP001234.1 Escherichia coli strain K-12 complete genome

Organism Name Conversion

# Input header:
>Escherichia_coli sequence

# Output header (562 is the species-level NCBI taxonomy ID for Escherichia coli):
>tid|562|Escherichia_coli sequence
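
The three cases above share one output pattern: the resolved ID is written as tid|<id>| in front of the untouched original header. The sketch below illustrates only that pattern. It assumes a hypothetical tab-separated lookup file (first header word, then taxonomy ID) and is not how gi2taxid.sh itself resolves IDs; headers without a match receive -1, mirroring the invalid marker described under Error Handling and Quality Control.

```shell
# Illustration only: prefix FASTA headers with tid|<id>| using a small
# hypothetical lookup table (tab-separated: first header word, taxonomy ID).
add_tid_prefix() {
    # $1 = lookup table, $2 = FASTA file
    awk -F '\t' '
        NR == FNR { tid[$1] = $2; next }      # first file: load the table
        /^>/ {
            header = substr($0, 2)            # drop the leading ">"
            split(header, words, " ")         # words[1] = first header word
            id = (words[1] in tid) ? tid[words[1]] : -1
            print ">tid|" id "|" header
            next
        }
        { print }                             # sequence lines pass through
    ' "$1" "$2"
}
```

Called as add_tid_prefix map.tsv in.fa, it writes the renamed FASTA to standard output; map.tsv and in.fa are placeholder names.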

Basic Usage

gi2taxid.sh in=<file> out=<file> [server | tree=auto table=auto accession=auto]

Choose between server mode (online lookup, no downloads required) or local mode (requires downloading and processing NCBI taxonomy files).

Parameters

Input/Output Parameters

in=<file>: Input sequences in FASTA format (required). Can be a comma-delimited list, or several space-delimited filenames given without in=. Example: gi2taxid.sh x.fa y.fa z.fa out=tid.fa tree=auto table=auto
out=<file>: Output file for renamed sequences with taxonomy IDs.
invalid=<file>: Destination for sequences where no taxonomy ID could be determined.

Processing Options

keepall=t: Keep sequences with no taxonomy ID in normal output rather than filtering them out. Default: true
prefix=t: Prepend the taxonomy ID to the original header, keeping the original header intact. Default: true
title=tid: Set the identifier prefix in output headers (e.g., tid, ncbi, taxid). Default: tid
silva=f: Enable parsing of Silva database header format with semicolon-delimited taxonomic information. Default: false
shrinknames=f: Replace concatenated headers (common in nr/nt databases) with just the first header to reduce file size. Default: false
deleteinvalid=f: Delete the entire output file if any invalid headers are encountered (strict quality control). Default: false

Taxonomy Data Source

server=f: Use JGI's online taxonomy server (https://taxonomy.jgi.doe.gov) instead of local files. Recommended for accession-based sequences. Requires internet connection but eliminates need for local taxonomy downloads. Default: false
tree=: Path to taxonomy tree file (.taxtree.gz). Use 'auto' for the default location set by taxpath. Required for organism name lookups.
gi=: Path to GI table file for GI number to taxonomy ID conversion. Use 'auto' for the default location. Needed only for gi|123| format headers. The usage examples in this document write this flag as table=auto.
accession=: Comma-delimited NCBI accession-to-taxonomy files. Use 'auto' for default. Required for accession number conversion. Can consume 40GB+ RAM.

Advanced Parameters

maxbadheaders=: Maximum number of invalid headers before terminating processing. Useful for quality control.
badheaders=<file>: File to write problematic headers for debugging purposes.
warn=t: Print warnings for bad headers to stderr. Default: true
mode=: Processing mode: accession (0), gi (1), header (2), unite (3). Usually auto-detected from input format.

Compression Parameters

ziplevel=2: Compression level for gzipped output files (1-9). Default: 2
pigz=t: Use parallel gzip (pigz) for faster compression. Requires pigz installation. Default: true

Java Parameters

-Xmx: Java heap memory allocation. For accession processing, recommend 40GB+ (e.g., -Xmx63g). Default: 7g
-eoom: Exit on out-of-memory exception. Requires Java 8u92+.
-da: Disable Java assertions for slight performance improvement.

Usage Examples

Server Mode (Recommended for Most Users)

gi2taxid.sh in=sequences.fa out=renamed.fa server

Uses JGI's online taxonomy server for fast accession lookups without requiring local taxonomy files. Works best with RefSeq/GenBank sequences.

Local Mode with Auto-Detection

gi2taxid.sh in=sequences.fa out=renamed.fa tree=auto table=auto accession=auto taxpath=/usr/tax/

Uses local taxonomy files downloaded with fetchTaxonomy.sh. Requires significant RAM (40GB+) for accession processing but works offline.

Large-Scale NT Database Processing

gi2taxid.sh -Xmx63g in=nt.fa.gz out=renamed.fa.gz tree=auto accession=auto table=auto shrinknames taxpath=/path/to/taxonomy

Processes NCBI's nt database with concatenated headers, using shrinknames to reduce output size and high memory allocation.

Silva Database Format

gi2taxid.sh in=silva_sequences.fa out=renamed.fa silva=t tree=auto taxpath=/usr/tax/

Processes Silva 16S/18S database with specialized header parsing for semicolon-delimited taxonomy.

Multiple Input Files

gi2taxid.sh bacteria.fa archaea.fa viruses.fa out=all_renamed.fa server

Combines multiple FASTA files into a single output with taxonomy ID headers for streamlined downstream processing.

Quality Control with Invalid Sequence Separation

gi2taxid.sh in=input.fa out=valid.fa invalid=unresolved.fa server keepall=f

Separates sequences with resolved taxonomy IDs from those that couldn't be assigned, enabling quality assessment.

Workflow Integration

GI2TAXID serves as the foundation for BBTools taxonomy workflows:

Complete Taxonomy Pipeline

# Step 1: Convert identifiers to taxonomy IDs
gi2taxid.sh in=sequences.fa out=tid_sequences.fa server

# Step 2: Filter by taxonomic group
filterbytaxa.sh in=tid_sequences.fa out=bacteria_only.fa names=Bacteria level=superkingdom tree=auto

# Step 3: Bin sequences by family
splitbytaxa.sh in=bacteria_only.fa out=family_%.fa level=family tree=auto

# Step 4: Generate abundance profiles
seal.sh in=reads.fq ref=tid_sequences.fa out=abundance.txt k=31 ambiguous=random tree=auto level=species

This workflow demonstrates how GI2TAXID enables efficient downstream taxonomy operations by eliminating the need for large lookup tables in each subsequent step.

Algorithm Details

Multi-Modal Processing Architecture

RenameGiToTaxid implements four specialized processing modes through the processInner() and processInner_server() methods:

Accession Mode (0): Extracts accession identifiers until space/period delimiters using appendHeaderLine(). Validates format with looksLikeRealAccession() checking 4-18 character length and optional version dot at position n-2. Uses AccessionToTaxid.load() to build HashMap lookup tables from NCBI files.
GI Mode (1): Parses GI numbers by extracting characters until space/pipe delimiters. Utilizes GiToTaxid.initialize() to build integer arrays for binary gi-to-taxid mapping.
Header Mode (2): Processes complete organism names via TaxTree.parseNodeFromHeader() using nameMap HashMap for organism-to-TaxNode matching.
Unite Mode (3): Specialized parsing for UNITE database pipe-delimited headers with TaxTree.UNITE_MODE=true enabling custom taxonomic name extraction.
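
The accession shape check mentioned for Accession Mode can be sketched in shell. This is a reading of the description above (4-18 characters, letters, digits, underscore, hyphen, and dot, with an optional version dot as the second-to-last character), not a port of looksLikeRealAccession(); the rule that a dot may appear only in that position is an assumption made for the sketch.

```shell
# Sketch of the accession shape check; the real looksLikeRealAccession()
# may differ, so every condition here should be treated as an assumption.
looks_like_accession() {
    s=$1
    n=${#s}
    # Length must be 4 to 18 characters.
    [ "$n" -ge 4 ] && [ "$n" -le 18 ] || return 1
    # Only letters, digits, underscore, hyphen, and dot are allowed.
    case $s in
        *[!A-Za-z0-9_.-]*) return 1 ;;
    esac
    # A dot, if present, must be the second-to-last character (e.g. ".1").
    case $s in
        *.*)
            case $s in
                *.?) ;;                       # dot followed by one character
                *) return 1 ;;
            esac
            stem=${s%.?}
            case $stem in *.*) return 1 ;; esac   # no second dot allowed
            ;;
    esac
    return 0
}
```

For example, looks_like_accession NZ_CP001234.1 succeeds, while looks_like_accession 'bad|id' fails because of the pipe character.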

Server vs Local Processing Strategies

Local Processing: TaxTree.loadTaxTree() constructs hierarchical in-memory structures with nameMap HashMap for organism lookups. The tree maintains parent-child relationships through TaxNode objects enabling lineage traversal. AccessionToTaxid class provides hash-based accession mapping requiring substantial memory allocation.

Server Processing: Implements batched HTTP requests through updateHeadersFromServer() with a configurable maxStoredBytes=10MB buffer threshold. translateToTaxIDs() handles TaxClient communication and retries up to 10 times with a linearly increasing delay (Thread.sleep(500*(i+2)), which grows by a fixed 500 ms per attempt rather than exponentially). Queries are concatenated into a comma-separated list with ByteBuilder so that many headers are resolved per request.
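
The sleep expression quoted above grows by a fixed 500 ms per attempt. A small shell sketch makes the schedule concrete; it assumes i runs from 0 to 9 and that a sleep follows every failed attempt, neither of which is stated in the text. The function names are made up for this example.

```shell
# Delay before retry i, taken from the quoted expression 500*(i+2) milliseconds.
retry_delay_ms() {
    echo $(( 500 * ($1 + 2) ))
}

# Print the delay for each of the 10 attempts and the total worst-case wait.
print_retry_schedule() {
    total=0
    i=0
    while [ "$i" -lt 10 ]; do
        d=$(retry_delay_ms "$i")
        echo "attempt $i: sleep ${d} ms"
        total=$(( total + d ))
        i=$(( i + 1 ))
    done
    echo "total: ${total} ms"
}
```

Under those assumptions the delays run from 1000 ms to 5500 ms, so a request that fails every time waits about 32.5 seconds in total before giving up.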

Header Processing Pipeline

The processInner() method executes multi-stage header transformation:

Existing ID Detection: Tools.startsWith() identifies ">tid|" and ">ncbi|" prefixes, advancing the initial pointer past existing taxonomy identifiers to prevent double-processing.
Name Compression: shrinkNames=true scans for SOH (Start of Heading, byte 1) characters, which separate the concatenated headers in nr/nt files, and ends the header at the first SOH so that only the first header is kept.
Format-Specific Parsing: Silva mode modifies TaxTree.parseNodeFromHeader() behavior for semicolon-delimited organism hierarchies commonly found in Silva databases.
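
The first two stages above can be imitated on a single header string in shell. This is a sketch of the described behavior only, with hypothetical function names; it works on headers without the leading ">" and does not reproduce the byte-level pointer handling in processInner().

```shell
SOH=$(printf '\001')   # ASCII 0x01, the separator between concatenated headers

# Keep only the text before the first SOH character (shrinknames behavior).
shrink_name() {
    h=${1%%"$SOH"*}
    printf '%s\n' "$h"
}

# Succeed if the header already starts with a taxonomy prefix (tid| or ncbi|).
has_tax_prefix() {
    case $1 in
        'tid|'* | 'ncbi|'*) return 0 ;;
        *) return 1 ;;
    esac
}
```

A header with no SOH character passes through shrink_name unchanged, and has_tax_prefix can be used to skip headers that were already converted.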

Memory Management and Performance

Default allocation uses calcXmx() with freeRam(7000m, 84) setting -Xmx7g/-Xms7g heap parameters. For accession processing, 40GB+ allocation is recommended. HashArray1D(256000, -1L, true) provides taxonomy occurrence counting with 256K initial capacity and automatic resizing. Server mode batching limits memory through maxStoredBytes thresholds before processing accumulated headers.

Error Handling and Quality Control

Invalid Detection: Sequences receiving taxonomy ID -1 are flagged invalid. dumpBuffer() checks for invalidTitle (">tid|-1") markers and routes invalid sequences to separate output streams.
Accession Validation: looksLikeRealAccession() enforces format constraints including character validation with Tools.isLetterOrDigit() plus underscore, hyphen, and dot allowances.
Server Response Validation: translateToTaxIDs() validates response length matching query term counts, implementing retry logic for null responses or length mismatches.
Threshold Controls: maxInvalidHeaders parameter triggers KillSwitch.kill() when error limits are exceeded, providing quality control for large-scale processing.
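
The routing of invalid sequences can also be reproduced after the fact on an already converted file. The sketch below splits a FASTA on the >tid|-1 marker using awk; it assumes the default title (tid) and the tid|<id>| header layout shown in the transformation examples, and the file names are placeholders.

```shell
# Split a converted FASTA into resolved and unresolved records, using the
# ">tid|-1|" marker. Assumes the default title "tid".
split_invalid() {
    # $1 = converted FASTA, $2 = output for valid records, $3 = output for invalid
    : > "$2"; : > "$3"                       # create or empty both outputs
    awk -v valid="$2" -v invalid="$3" '
        BEGIN { dest = valid }
        /^>/ { dest = (index($0, ">tid|-1|") == 1) ? invalid : valid }
        { print > dest }                     # sequence lines follow their header
    ' "$1"
}
```

Running split_invalid renamed.fa valid.fa unresolved.fa gives roughly the same separation as the invalid= and keepall=f flags, which remain the direct way to do this during conversion.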

Output Format Specification

Header reformatting uses configurable byte arrays with default title=">tid|".getBytes(). Prefix mode appends original headers after taxonomy IDs via character-by-character copying. GFF format processing replaces the first tab-delimited column with "tid|[taxid]" format. Count mode employs HashArray1D.increment() for taxonomy occurrence frequency tracking in output headers.
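
The GFF rule above (first tab-delimited column replaced by tid|[taxid]) can be sketched with awk. The sketch writes one fixed ID for every feature line, whereas a real run resolves an ID per sequence; passing comment lines through unchanged is also an assumption of this example.

```shell
# Rewrite column 1 of GFF feature lines as tid|<taxid>; comment lines
# (starting with "#") are passed through unchanged.
gff_set_tid() {
    # $1 = taxonomy ID to write, $2 = GFF file
    awk -F '\t' -v OFS='\t' -v id="$1" '
        /^#/ { print; next }
        { $1 = "tid|" id; print }
    ' "$2"
}
```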

Performance Considerations

Memory Requirements

GI-only processing: ~500MB (taxonomy tree only)
Organism names: ~1GB (includes nameMap)
Accession processing: 40-60GB (full accession tables)
Server mode: Minimal local memory (~100MB)

Processing Speed

Server mode: Network-limited, but no preprocessing time
Local mode: Very fast after initial 5-10 minute loading period
GI processing: Fastest local option
Accession processing: Slower due to hash lookups

Recommendations

Small datasets (<100K sequences): Use server mode
Large datasets: Use local mode with adequate RAM
Repeated processing: Convert to tid format once, reuse for all downstream analyses
Memory-constrained systems: Process in smaller batches or use server mode

Troubleshooting

Common Issues

Out of Memory Errors: Increase -Xmx parameter (recommend 40GB+ for accession processing) or switch to server mode for memory-limited systems.
Server Connection Failures: Check internet connectivity. The tool retries failed requests with increasing delays, but persistent failures may require switching to local mode.
High Invalid Sequence Count: Review input format compatibility. Use badheaders parameter to examine problematic sequences. Consider adjusting parsing mode or preprocessing headers.
Slow Processing: For large datasets with accessions, ensure sufficient RAM allocation. Consider server mode for initial processing, or preprocess the input to reduce dataset size.

Support

For questions and support:

Email: bbushnell@lbl.gov
Documentation: bbmap.org
Taxonomy Guide: Comprehensive guide covering taxonomy workflow setup and optimization