GITable

Script: gitable.sh
Package: tax
Class: GiToTaxid.java

Creates gitable.int2d from accession files downloaded from NCBI's taxonomy database. This tool processes accession2taxid files to build a lookup table for converting deprecated gi numbers to taxonomy IDs. While gi numbers are no longer used by NCBI and accession numbers are preferred, this tool maintains compatibility with legacy data that still references gi numbers.

Basic Usage

gitable.sh input_files output_file

The tool takes comma-separated accession2taxid files as input and produces a binary lookup table file (.int2d format).

gitable.sh shrunk.dead_nucl.accession2taxid.gz,shrunk.dead_prot.accession2taxid.gz,shrunk.dead_wgs.accession2taxid.gz,shrunk.nucl_gb.accession2taxid.gz,shrunk.nucl_wgs.accession2taxid.gz,shrunk.pdb.accession2taxid.gz,shrunk.prot.accession2taxid.gz gitable.int2d.gz

Parameters

Gitable is a specialized utility tool with minimal configuration options. It primarily accepts Java memory parameters for processing large taxonomy files.

Java Parameters

-Xmx: Sets Java's memory usage, overriding autodetection. Example: -Xmx20g specifies 20 gigabytes of RAM. The maximum is typically 85% of physical memory. By default, the amount is auto-detected from available system memory, with 24GB allocated if sufficient memory is available.
-eoom: Causes the process to exit if an out-of-memory exception occurs. Requires Java 8u92 or later. This prevents the JVM from hanging when memory is exhausted during processing of large taxonomy files.
-da: Disables Java assertions. This can slightly improve performance in production environments by skipping assertion checks, though it's generally recommended to keep assertions enabled for debugging purposes.

Input File Format

Gitable processes NCBI accession2taxid files, which are tab-delimited text files with the following format:

accession	accession.version	taxid	gi

The tool specifically extracts the taxid and gi columns (columns 3 and 4) to build the lookup mapping. Input files can be:

Compressed with gzip (.gz extension)
Multiple files specified as comma-separated list
Wildcard patterns using # (works only in current directory)
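
To make the column layout concrete, here is a minimal Java sketch that pulls the taxid and gi out of a single line. It is an illustration only: the class and method names are hypothetical, the sample values are made up, and it uses String.split for readability rather than reproducing the tool's own parser.

```java
// Hypothetical sketch: extract taxid (column 3) and gi (column 4)
// from one tab-delimited accession2taxid line. Not the tool's own parser.
class Accession2TaxidLine {

    /** Returns {taxid, gi}, or null for a header or malformed line. */
    static long[] parse(String line) {
        String[] cols = line.split("\t");
        if (cols.length < 4) {
            return null; // too few columns
        }
        try {
            long taxid = Long.parseLong(cols[2]);
            long gi = Long.parseLong(cols[3]);
            return new long[] {taxid, gi};
        } catch (NumberFormatException e) {
            return null; // e.g. the header line, whose columns are not numeric
        }
    }

    public static void main(String[] args) {
        // Illustrative values only, not a real record.
        long[] r = parse("ABC00001\tABC00001.1\t9606\t12345");
        System.out.println("taxid=" + r[0] + " gi=" + r[1]);
        // The header line is rejected because its fields are not numbers.
        System.out.println(parse("accession\taccession.version\ttaxid\tgi") == null);
    }
}
```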

Examples

Single File Processing

gitable.sh nucl_gb.accession2taxid.gz gitable.int2d.gz

Process a single nucleotide GenBank accession file to create a gi-to-taxid lookup table.

Multiple File Processing

gitable.sh dead_nucl.accession2taxid.gz,dead_prot.accession2taxid.gz,nucl_gb.accession2taxid.gz,prot.accession2taxid.gz gitable.int2d.gz

Process several accession files in one run to build a single lookup table that covers all of them (here: discontinued and current nucleotide and protein accessions).

High Memory Processing

gitable.sh -Xmx50g nucl_gb.accession2taxid.gz,nucl_wgs.accession2taxid.gz gitable.int2d.gz

Allocate 50GB of memory for processing large taxonomy files, useful when working with complete NCBI datasets.

Wildcard Processing

gitable.sh "#.accession2taxid.gz" gitable.int2d.gz

Process the accession2taxid files in the current directory that match a wildcard pattern. The wildcard character is #, and it works only for relative paths in the current directory.

Algorithm Details

Gitable stores gi-to-taxid mappings in a memory-efficient 2D integer array. Each gi number is split into two indices: an unsigned right shift by 30 bits (gi >>> SHIFT) gives the upper index, and a bitwise AND with a 30-bit mask (gi & LOWERMASK) gives the lower index.

Data Structure Design

2D Array Architecture: Uses a two-level array structure with upper and lower indices derived from bit manipulation (30-bit shift operation)
Memory Optimization: Only allocates array slices when needed, preventing memory waste for sparse gi number ranges
Bit Shifting Strategy: Upper index = gi >>> 30, lower index = gi & ((1L << 30) - 1), allowing efficient access to gi numbers up to 2^60
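
The index split and the allocate-on-demand behavior described above can be sketched in Java as follows. This is a simplified illustration under stated assumptions, not the code in GiToTaxid.java: the class name, the configurable shift, the fixed number of slices, and the use of 0 to mean "no entry" are all hypothetical choices made for the example. The constructor accepts a small shift so the demo does not have to allocate full 2^30-entry slices.

```java
// Hypothetical sketch of a two-level, lazily allocated gi -> taxid table.
// SHIFT = 30 follows the text; the constructor takes the shift so the demo
// can use a small value instead of allocating full-size slices.
class SparseGiTable {
    static final int SHIFT = 30;                     // value described in the text
    static final long LOWERMASK = (1L << SHIFT) - 1; // low 30 bits set

    private final int shift;
    private final long lowerMask;
    private final int[][] slices;                    // inner arrays start as null

    SparseGiTable(int shift, int maxSlices) {
        this.shift = shift;
        this.lowerMask = (1L << shift) - 1;
        this.slices = new int[maxSlices][];
    }

    static int upperIndex(long gi) { return (int) (gi >>> SHIFT); }
    static int lowerIndex(long gi) { return (int) (gi & LOWERMASK); }

    /** Stores a mapping; the caller must pass a non-negative gi within range. */
    void put(long gi, int taxid) {
        int upper = (int) (gi >>> shift);
        int lower = (int) (gi & lowerMask);
        if (slices[upper] == null) {
            slices[upper] = new int[1 << shift];     // allocate on first use only
        }
        slices[upper][lower] = taxid;
    }

    /** Returns the stored taxid, or 0 (assumed "absent" marker) if none. */
    int get(long gi) {
        int upper = (int) (gi >>> shift);
        if (upper >= slices.length || slices[upper] == null) return 0;
        return slices[upper][(int) (gi & lowerMask)];
    }

    int allocatedSlices() {
        int n = 0;
        for (int[] s : slices) if (s != null) n++;
        return n;
    }

    public static void main(String[] args) {
        long gi = (1L << 30) + 5;
        System.out.println(upperIndex(gi) + " " + lowerIndex(gi)); // prints "1 5"

        SparseGiTable t = new SparseGiTable(4, 8);   // tiny 16-entry slices
        t.put(3, 9606);                              // lands in slice 0
        t.put(100, 562);                             // 100 >>> 4 = 6, 100 & 15 = 4
        System.out.println(t.get(100) + " " + t.allocatedSlices()); // prints "562 2"
    }
}
```

Only the two slices that received entries are allocated in the demo; the other six stay null, which is the memory saving the list above refers to.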

File Processing Strategy

Incremental Loading: Processes input files sequentially, building the lookup table incrementally to handle large datasets
Tab-Delimited Parsing: Efficiently parses tab-separated values using byte-level processing to extract taxid and gi values
Duplicate Handling: Detects and reports conflicting gi-to-taxid mappings, maintaining data integrity
Invalid Entry Filtering: Skips malformed lines and negative gi numbers to ensure lookup table consistency
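
The per-line handling in the list above might look roughly like the following Java sketch. It is not the tool's implementation: a HashMap stands in for the real table, all names are hypothetical, and the choice to keep the first mapping when two lines disagree is an assumption made for the example.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: byte-level field parsing, invalid-line filtering,
// and counting of conflicting gi -> taxid mappings.
class GiLineProcessor {
    final Map<Long, Integer> table = new HashMap<>(); // stand-in for the real table
    long valid = 0, invalid = 0, conflicts = 0;

    /** Parses the non-negative decimal in field col (0-based); -1 if absent or non-numeric. */
    static long parseField(byte[] line, int col) {
        int i = 0;
        for (int tabs = 0; tabs < col; i++) {         // advance past col tab characters
            if (i >= line.length) return -1;
            if (line[i] == '\t') tabs++;
        }
        if (i >= line.length || line[i] == '\t') return -1; // empty field
        long value = 0;
        for (; i < line.length && line[i] != '\t'; i++) {
            int d = line[i] - '0';
            if (d < 0 || d > 9) return -1;            // rejects '-', letters, etc.
            value = value * 10 + d;
        }
        return value;
    }

    void processLine(byte[] line) {
        long taxid = parseField(line, 2);
        long gi = parseField(line, 3);
        if (taxid < 0 || gi < 0) { invalid++; return; }
        valid++;
        // Assumed policy: keep the first mapping and count the disagreement.
        Integer old = table.putIfAbsent(gi, (int) taxid);
        if (old != null && old != (int) taxid) conflicts++;
    }

    public static void main(String[] args) {
        GiLineProcessor p = new GiLineProcessor();
        String[] lines = {                             // made-up sample lines
            "accession\taccession.version\ttaxid\tgi", // header: invalid
            "X1\tX1.1\t9606\t100",
            "X2\tX2.1\t562\t100",                      // same gi, different taxid
            "X3\tX3.1\t562\t-7"                        // negative gi: invalid
        };
        for (String s : lines) p.processLine(s.getBytes(StandardCharsets.US_ASCII));
        System.out.println("valid=" + p.valid + " invalid=" + p.invalid
                + " conflicts=" + p.conflicts);
    }
}
```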

Output Format

The tool produces binary .int2d files that can be efficiently loaded by other BBTools for fast gi-to-taxid lookups. The format supports:

Fast random access to any gi number
Compressed storage of sparse data
Thread-safe concurrent access
Automatic detection of maximum loaded gi number

Performance Characteristics

Memory Efficiency: Sparse array implementation minimizes memory usage for datasets with gaps in gi number sequences
Lookup Speed: O(1) constant time access to any gi number through direct array indexing
Scalability: Can handle gi numbers up to 2^60, accommodating NCBI's entire historical range
Processing Speed: Byte-level file processing optimized for large input files (several GB)
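
For a rough sense of the memory involved, the following Java sketch does the arithmetic for a 30-bit lower index. It assumes each entry is a 32-bit int (suggested by the "2D integer array" wording, but not confirmed here), and the maximum gi value used in main is a hypothetical figure chosen only for illustration. It ignores JVM object overhead.

```java
// Back-of-the-envelope sizing, assuming 32-bit int entries
// and 2^30 entries per slice (SHIFT = 30, as in the text).
class SliceMemory {
    static final int SHIFT = 30;

    /** Bytes of entry data in one fully allocated slice. */
    static long bytesPerSlice() {
        return (1L << SHIFT) * Integer.BYTES;
    }

    /** Number of slices touched if gi numbers run from 0 to maxGi inclusive. */
    static long slicesNeeded(long maxGi) {
        return (maxGi >>> SHIFT) + 1;
    }

    public static void main(String[] args) {
        long maxGi = 1_500_000_000L;                  // hypothetical maximum gi
        long slices = slicesNeeded(maxGi);
        long gib = (slices * bytesPerSlice()) >> 30;  // convert bytes to GiB
        System.out.println(slices + " slices, about " + gib + " GiB");
    }
}
```

Under these assumptions one slice holds 4 GiB of entries, so a table spanning two slices needs about 8 GiB of heap before any other overhead.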

Legacy Support Context

While NCBI deprecated gi numbers in favor of accession.version identifiers, this tool maintains compatibility with:

Historical sequence databases that reference gi numbers
Legacy analysis pipelines that haven't migrated to accession numbers
Existing taxonomic classification workflows in BBTools
Research reproducibility requiring original gi-based identifiers

File Requirements

Input Files

Download the required accession2taxid files from NCBI:

ftp://ftp.ncbi.nih.gov/pub/taxonomy/accession2taxid/

Commonly used files include:

dead_nucl.accession2taxid.gz - Discontinued nucleotide accessions
dead_prot.accession2taxid.gz - Discontinued protein accessions
dead_wgs.accession2taxid.gz - Discontinued WGS accessions
nucl_gb.accession2taxid.gz - GenBank nucleotide accessions
nucl_wgs.accession2taxid.gz - WGS nucleotide accessions
pdb.accession2taxid.gz - Protein Data Bank accessions
prot.accession2taxid.gz - GenBank protein accessions

Output Files

The tool generates binary lookup tables in .int2d format:

gitable.int2d - Uncompressed binary lookup table
gitable.int2d.gz - Compressed binary lookup table (recommended for storage)

Usage Notes

Memory Requirements: Processing complete NCBI datasets requires significant memory (24GB+ recommended)
Processing Time: Large taxonomy files can take substantial time to process; monitor system resources
File Size: Output .int2d files can be several gigabytes depending on input coverage
Compression: Use gzip compression for output files to save storage space
Validation: The tool reports processing statistics including valid and invalid entry counts
Updates: Regenerate lookup tables when NCBI updates taxonomy databases

Integration with BBTools

The generated gitable.int2d files are used by other BBTools for taxonomic classification:

Taxonomy Tools: Used by various BBTools for converting gi numbers in sequence headers to taxonomy IDs
Classification Pipelines: Enables taxonomic assignment in metagenomic analysis workflows
Legacy Data Processing: Allows analysis of older datasets that use gi number identifiers
Cross-Reference: See TaxonomyGuide.txt and fetchTaxonomy.sh for complete taxonomic analysis workflows

Support

For questions and support:

Email: bbushnell@lbl.gov
Documentation: bbmap.org

See also: TaxonomyGuide.txt and fetchTaxonomy.sh for comprehensive taxonomic analysis documentation.