GI2TaxID
Converts sequence headers from GI numbers, accession numbers, or organism names to NCBI taxonomy IDs, enabling efficient downstream taxonomy processing and eliminating the need for large lookup tables in subsequent BBTools operations.
Overview
GI2TaxID is a fundamental component of BBTools' taxonomy processing pipeline that transforms sequence identifiers into the preferred tid|123| format. This conversion is crucial for efficient memory usage and performance in downstream taxonomy operations like filtering, binning, and abundance analysis.
Strategic Importance
Converting sequences to taxonomy ID format provides several critical advantages:
- Memory Efficiency: Eliminates the need to load 40GB+ accession tables in subsequent operations
- Processing Speed: Direct taxonomy ID lookups are orders of magnitude faster than accession parsing
- Pipeline Integration: Enables seamless use with FilterByTaxa, SplitByTaxa, Seal, and BBSketch
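- Server Independence: Once converted, downstream tools don't require network access or the large GI and accession tables; operations that work at a taxonomic level (such as level=family) still load the taxonomy tree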
Transformation Examples
GI Number Conversion
# Input header:
>gi|123456789|ref|NZ_CP001234.1| Escherichia coli strain K-12
# Output header:
>tid|511145|gi|123456789|ref|NZ_CP001234.1| Escherichia coli strain K-12
Accession Conversion
# Input header:
>NZ_CP001234.1 Escherichia coli strain K-12 complete genome
# Output header:
>tid|511145|NZ_CP001234.1 Escherichia coli strain K-12 complete genome
Organism Name Conversion
# Input header:
>Escherichia_coli sequence
# Output header:
>tid|511145|Escherichia_coli sequence
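The three transformations above share one output pattern: the new identifier is inserted directly after the leading ">" and the original header is kept. A minimal Python sketch of that pattern follows; the function name add_taxid_prefix is hypothetical and this is an illustration, not the tool's own code.

```python
def add_taxid_prefix(header: str, taxid: int, title: str = "tid") -> str:
    """Insert '<title>|<taxid>|' right after the leading '>' of a FASTA header.

    Hypothetical helper: it only illustrates the header layout shown in the
    examples above, not how gi2taxid.sh looks up the taxonomy ID.
    """
    if not header.startswith(">"):
        raise ValueError("FASTA header must start with '>'")
    return f">{title}|{taxid}|{header[1:]}"


if __name__ == "__main__":
    print(add_taxid_prefix(">Escherichia_coli sequence", 511145))
```

The title argument corresponds to the title= parameter described under Processing Options.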
Basic Usage
gi2taxid.sh in=<file> out=<file> [server | tree=auto table=auto accession=auto]
Choose between server mode (online lookup, no downloads required) or local mode (requires downloading and processing NCBI taxonomy files).
Parameters
Input/Output Parameters
- in=<file>
- Input sequences in FASTA format (required). Can also be a comma-delimited list, or several space-delimited filenames. Example:
gi2taxid.sh x.fa y.fa z.fa out=tid.fa tree=auto table=auto
- out=<file>
- Output file for renamed sequences with taxonomy IDs.
- invalid=<file>
- Destination for sequences where no taxonomy ID could be determined.
Processing Options
- keepall=t
- Keep sequences with no taxonomy ID in normal output rather than filtering them out. Default: true
- prefix=t
- Prepend the taxonomy ID to the original header, keeping the original header intact. Default: true
- title=tid
- Set the identifier prefix in output headers (e.g., tid, ncbi, taxid). Default: tid
- silva=f
- Enable parsing of Silva database header format with semicolon-delimited taxonomic information. Default: false
- shrinknames=f
- Replace concatenated headers (common in nr/nt databases) with just the first header to reduce file size. Default: false
- deleteinvalid=f
- Delete the entire output file if any invalid headers are encountered (strict quality control). Default: false
Taxonomy Data Source
- server=f
- Use JGI's online taxonomy server (https://taxonomy.jgi.doe.gov) instead of local files. Recommended for accession-based sequences. Requires internet connection but eliminates need for local taxonomy downloads. Default: false
- tree=
- Path to taxonomy tree file (.taxtree.gz). Use 'auto' to use default location set by taxpath. Required for organism name lookups.
- gi=
- Path to GI table file for GI number to taxonomy ID conversion. Use 'auto' for default location. Needed only for gi|123| format headers.
- accession=
- Comma-delimited NCBI accession-to-taxonomy files. Use 'auto' for default. Required for accession number conversion. Can consume 40GB+ RAM.
Advanced Parameters
- maxbadheaders=
- Maximum number of invalid headers before terminating processing. Useful for quality control.
- badheaders=<file>
- File to write problematic headers for debugging purposes.
- warn=t
- Print warnings for bad headers to stderr. Default: true
- mode=
- Processing mode: accession (0), gi (1), header (2), unite (3). Usually auto-detected from input format.
Compression Parameters
- ziplevel=2
- Compression level for gzipped output files (1-9). Default: 2
- pigz=t
- Use parallel gzip (pigz) for faster compression. Requires pigz installation. Default: true
Java Parameters
- -Xmx
- Java heap memory allocation. For accession processing, recommend 40GB+ (e.g., -Xmx63g). Default: 7g
- -eoom
- Exit on out-of-memory exception. Requires Java 8u92+.
- -da
- Disable Java assertions for slight performance improvement.
Usage Examples
Server Mode (Recommended for Most Users)
gi2taxid.sh in=sequences.fa out=renamed.fa server
Uses JGI's online taxonomy server for fast accession lookups without requiring local taxonomy files. Works best with RefSeq/GenBank sequences.
Local Mode with Auto-Detection
gi2taxid.sh in=sequences.fa out=renamed.fa tree=auto table=auto accession=auto taxpath=/usr/tax/
Uses local taxonomy files downloaded with fetchTaxonomy.sh. Requires significant RAM (40GB+) for accession processing but works offline.
Large-Scale NT Database Processing
gi2taxid.sh -Xmx63g in=nt.fa.gz out=renamed.fa.gz tree=auto accession=auto table=auto shrinknames taxpath=/path/to/taxonomy
Processes NCBI's nt database with concatenated headers, using shrinknames to reduce output size and high memory allocation.
Silva Database Format
gi2taxid.sh in=silva_sequences.fa out=renamed.fa silva=t tree=auto taxpath=/usr/tax/
Processes Silva 16S/18S database with specialized header parsing for semicolon-delimited taxonomy.
Multiple Input Files
gi2taxid.sh bacteria.fa archaea.fa viruses.fa out=all_renamed.fa server
Combines multiple FASTA files into a single output with taxonomy ID headers for streamlined downstream processing.
Quality Control with Invalid Sequence Separation
gi2taxid.sh in=input.fa out=valid.fa invalid=unresolved.fa server keepall=f
Separates sequences with resolved taxonomy IDs from those that couldn't be assigned, enabling quality assessment.
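To judge how well a conversion went, it can help to tally the identifiers in the renamed file. The sketch below is a hypothetical helper (tally_taxids is not part of BBTools) and assumes that unresolved sequences carry the ID -1, as described under Error Handling and Quality Control.

```python
from collections import Counter
from typing import Iterable


def tally_taxids(lines: Iterable[str], title: str = "tid") -> Counter:
    """Count FASTA headers per taxonomy ID.

    Headers without the '<title>|' prefix are counted under 'unlabeled'.
    Assumption: unresolved sequences appear with the ID '-1'.
    """
    prefix = f">{title}|"
    counts: Counter = Counter()
    for line in lines:
        if not line.startswith(">"):
            continue  # sequence line, not a header
        if line.startswith(prefix):
            taxid = line[len(prefix):].split("|", 1)[0]
            counts[taxid] += 1
        else:
            counts["unlabeled"] += 1
    return counts


if __name__ == "__main__":
    demo = [">tid|511145|NZ_CP001234.1 E. coli", "ACGT", ">tid|-1|unknown"]
    print(tally_taxids(demo))
```

A large count under "-1" or "unlabeled" suggests the input headers do not match the selected lookup mode.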
Workflow Integration
GI2TaxID serves as the foundation for BBTools taxonomy workflows:
Complete Taxonomy Pipeline
# Step 1: Convert identifiers to taxonomy IDs
gi2taxid.sh in=sequences.fa out=tid_sequences.fa server
# Step 2: Filter by taxonomic group
filterbytaxa.sh in=tid_sequences.fa out=bacteria_only.fa names=Bacteria level=superkingdom tree=auto
# Step 3: Bin sequences by family
splitbytaxa.sh in=bacteria_only.fa out=family_%.fa level=family tree=auto
# Step 4: Generate abundance profiles
seal.sh in=reads.fq ref=tid_sequences.fa out=abundance.txt k=31 ambiguous=random tree=auto level=species
This workflow demonstrates how GI2TaxID enables efficient downstream taxonomy operations by eliminating the need for large lookup tables in each subsequent step.
Algorithm Details
Multi-Modal Processing Architecture
RenameGiToTaxid implements four specialized processing modes through the processInner() and processInner_server() methods:
- Accession Mode (0): Reads the accession from the header up to the first space or period using appendHeaderLine(). looksLikeRealAccession() validates the format: 4-18 characters, with an optional version dot at position n-2. AccessionToTaxid.load() builds HashMap lookup tables from the NCBI files.
- GI Mode (1): Parses the GI number by reading characters up to the first space or pipe. GiToTaxid.initialize() builds integer arrays for binary gi-to-taxid mapping.
- Header Mode (2): Processes complete organism names via TaxTree.parseNodeFromHeader() using nameMap HashMap for organism-to-TaxNode matching.
- Unite Mode (3): Specialized parsing for UNITE database pipe-delimited headers with TaxTree.UNITE_MODE=true enabling custom taxonomic name extraction.
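The identifier extraction described for the accession and GI modes can be sketched in Python as follows. This is one reading of the delimiter rules above, not the Java implementation; the function names extract_accession and extract_gi are hypothetical.

```python
from typing import Optional


def _body(header: str) -> str:
    """Drop the leading '>' if present."""
    return header[1:] if header.startswith(">") else header


def extract_accession(header: str) -> str:
    """Return the characters up to the first space or period.

    Assumption: stopping at the period drops the version suffix (".1").
    """
    body = _body(header)
    for i, ch in enumerate(body):
        if ch in " .":
            return body[:i]
    return body


def extract_gi(header: str) -> Optional[int]:
    """Return the number after 'gi|', read up to the next space or pipe."""
    body = _body(header)
    if not body.startswith("gi|"):
        return None
    digits = body[3:]
    for i, ch in enumerate(digits):
        if ch in " |":
            digits = digits[:i]
            break
    return int(digits) if digits.isdigit() else None


if __name__ == "__main__":
    print(extract_accession(">NZ_CP001234.1 Escherichia coli strain K-12"))
    print(extract_gi(">gi|123456789|ref|NZ_CP001234.1| Escherichia coli"))
```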
Server vs Local Processing Strategies
Local Processing: TaxTree.loadTaxTree() constructs hierarchical in-memory structures with nameMap HashMap for organism lookups. The tree maintains parent-child relationships through TaxNode objects enabling lineage traversal. AccessionToTaxid class provides hash-based accession mapping requiring substantial memory allocation.
Server Processing: Implements batched HTTP requests through updateHeadersFromServer(), which flushes accumulated headers once the configurable maxStoredBytes threshold (10MB) is reached. translateToTaxIDs() handles TaxClient communication and retries up to 10 times with a linearly increasing delay (Thread.sleep(500*(i+2)), i.e. 1.0 s, 1.5 s, 2.0 s, and so on). Queries are concatenated into comma-separated batches with ByteBuilder for efficient batch processing.
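The retry behavior can be sketched in Python as below. The names translate_with_retry, query, and sleep are hypothetical; the query callable stands in for the server call, and the validation (non-null reply with one ID per term) follows the description under Error Handling. Whether the real code sleeps after the final failed attempt is not stated, so this sketch does not.

```python
import time
from typing import Callable, List, Optional, Sequence


def translate_with_retry(
    terms: Sequence[str],
    query: Callable[[str], Optional[List[int]]],
    attempts: int = 10,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[List[int]]:
    """Send one comma-joined batch and retry on a null or wrong-length reply.

    'query' is a stand-in for the server request (assumption); it receives
    the comma-separated terms and returns one taxonomy ID per term, or None.
    """
    batch = ",".join(terms)
    for i in range(attempts):
        result = query(batch)
        if result is not None and len(result) == len(terms):
            return result
        if i + 1 < attempts:
            # Mirrors Thread.sleep(500*(i+2)) from the text, in seconds.
            sleep(0.5 * (i + 2))
    return None


if __name__ == "__main__":
    # Fake server that always answers; no network access is involved.
    print(translate_with_retry(["NZ_CP001234"], lambda batch: [511145]))
```

Passing sleep as a parameter keeps the sketch testable without real delays.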
Header Processing Pipeline
The processInner() method executes multi-stage header transformation:
- Existing ID Detection: Tools.startsWith() identifies ">tid|" and ">ncbi|" prefixes, advancing the initial pointer past existing taxonomy identifiers to prevent double-processing.
- Name Compression: With shrinkNames=true, the header is scanned for SOH characters (ASCII Start of Heading, byte value 1), which separate concatenated nr/nt headers, and is truncated at the first occurrence to save space.
- Format-Specific Parsing: Silva mode modifies TaxTree.parseNodeFromHeader() behavior for semicolon-delimited organism hierarchies commonly found in Silva databases.
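The first two stages above can be sketched in Python as follows. This is an illustration under stated assumptions, not the Java code: the function names are hypothetical, and the sketch assumes an existing identifier has the form ">tid|123|" with a closing pipe.

```python
SOH = "\x01"  # ASCII Start of Heading, separator in concatenated nr/nt headers
KNOWN_PREFIXES = (">tid|", ">ncbi|")


def skip_existing_taxid(header: str) -> int:
    """Return the index at which the original header text begins.

    For a header such as '>tid|123|rest', the index points just past the
    second pipe; for any other header it is 1 (just past '>').
    """
    for prefix in KNOWN_PREFIXES:
        if header.startswith(prefix):
            end = header.find("|", len(prefix))
            return end + 1 if end >= 0 else len(header)
    return 1


def shrink_name(header: str) -> str:
    """Keep only the first of several SOH-joined headers."""
    cut = header.find(SOH)
    return header if cut < 0 else header[:cut]


if __name__ == "__main__":
    h = ">tid|511145|NZ_CP001234.1 Escherichia coli"
    print(h[skip_existing_taxid(h):])
    print(shrink_name(">first header" + SOH + "second header"))
```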
Memory Management and Performance
Default allocation uses calcXmx() with freeRam(7000m, 84) setting -Xmx7g/-Xms7g heap parameters. For accession processing, 40GB+ allocation is recommended. HashArray1D(256000, -1L, true) provides taxonomy occurrence counting with 256K initial capacity and automatic resizing. Server mode batching limits memory through maxStoredBytes thresholds before processing accumulated headers.
Error Handling and Quality Control
- Invalid Detection: Sequences receiving taxonomy ID -1 are flagged invalid. dumpBuffer() checks for invalidTitle (">tid|-1") markers and routes invalid sequences to separate output streams.
- Accession Validation: looksLikeRealAccession() enforces format constraints including character validation with Tools.isLetterOrDigit() plus underscore, hyphen, and dot allowances.
- Server Response Validation: translateToTaxIDs() validates response length matching query term counts, implementing retry logic for null responses or length mismatches.
- Threshold Controls: maxInvalidHeaders parameter triggers KillSwitch.kill() when error limits are exceeded, providing quality control for large-scale processing.
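The accession format check can be approximated in Python as below. The rules are taken from the description in this document (4-18 characters; letters, digits, underscore, hyphen, dot; version dot at position n-2); treating a dot anywhere else as invalid is an assumption of this sketch, and the real looksLikeRealAccession() may differ in detail.

```python
def looks_like_real_accession(token: str) -> bool:
    """Approximate the format constraints described for accessions.

    Assumptions: length must be 4-18, and a dot is accepted only at
    index n-2 (a single-digit version suffix such as '.1').
    """
    n = len(token)
    if n < 4 or n > 18:
        return False
    for i, ch in enumerate(token):
        if ch == ".":
            if i != n - 2:
                return False
        elif not (ch.isascii() and ch.isalnum()) and ch not in "_-":
            return False
    return True


if __name__ == "__main__":
    for candidate in ("NZ_CP001234.1", "Escherichia coli", "abc"):
        print(candidate, looks_like_real_accession(candidate))
```

Tokens that fail such a check are candidates for organism-name lookup or for the invalid output.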
Output Format Specification
Header reformatting uses configurable byte arrays with default title=">tid|".getBytes(). Prefix mode appends original headers after taxonomy IDs via character-by-character copying. GFF format processing replaces the first tab-delimited column with "tid|[taxid]" format. Count mode employs HashArray1D.increment() for taxonomy occurrence frequency tracking in output headers.
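The GFF handling mentioned above can be sketched in Python as follows. The helper retitle_gff_line is hypothetical, and leaving comment lines and lines without a tab unchanged is an assumption of this sketch rather than documented behavior.

```python
def retitle_gff_line(line: str, taxid: int, title: str = "tid") -> str:
    """Replace the first tab-delimited column with '<title>|<taxid>'.

    Assumption: comment lines ('#...') and lines without a tab pass
    through unchanged.
    """
    if line.startswith("#") or "\t" not in line:
        return line
    _, rest = line.split("\t", 1)
    return f"{title}|{taxid}\t{rest}"


if __name__ == "__main__":
    print(retitle_gff_line("NZ_CP001234.1\tRefSeq\tgene\t1\t100", 511145))
```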
Performance Considerations
Memory Requirements
- GI-only processing: ~500MB (taxonomy tree only)
- Organism names: ~1GB (includes nameMap)
- Accession processing: 40-60GB (full accession tables)
- Server mode: Minimal local memory (~100MB)
Processing Speed
- Server mode: Network-limited, but no preprocessing time
- Local mode: Very fast after initial 5-10 minute loading period
- GI processing: Fastest local option
- Accession processing: Slower due to hash lookups
Recommendations
- Small datasets (<100K sequences): Use server mode
- Large datasets: Use local mode with adequate RAM
- Repeated processing: Convert to tid format once, reuse for all downstream analyses
- Memory-constrained systems: Process in smaller batches or use server mode
Troubleshooting
Common Issues
- Out of Memory Errors
- Increase -Xmx parameter (recommend 40GB+ for accession processing) or switch to server mode for memory-limited systems.
- Server Connection Failures
- Check internet connectivity. The tool retries failed requests with increasing delays between attempts, but persistent failures may require switching to local mode.
- High Invalid Sequence Count
- Review input format compatibility. Use badheaders parameter to examine problematic sequences. Consider adjusting parsing mode or preprocessing headers.
- Slow Processing
- For large datasets with accessions, ensure sufficient RAM allocation. Consider server mode for initial processing or preprocessing to reduce dataset size.
Support
For questions and support:
- Email: bbushnell@lbl.gov
- Documentation: bbmap.org
- Taxonomy Guide: Comprehensive guide covering taxonomy workflow setup and optimization