GITable

Script: gitable.sh
Package: tax
Class: GiToTaxid.java

Creates gitable.int2d from accession files downloaded from NCBI's taxonomy database. This tool processes accession2taxid files to build a lookup table for converting deprecated gi numbers to taxonomy IDs. While gi numbers are no longer used by NCBI and accession numbers are preferred, this tool maintains compatibility with legacy data that still references gi numbers.

Basic Usage

gitable.sh input_files output_file

The tool takes comma-separated accession2taxid files as input and produces a binary lookup table file (.int2d format).

gitable.sh shrunk.dead_nucl.accession2taxid.gz,shrunk.dead_prot.accession2taxid.gz,shrunk.dead_wgs.accession2taxid.gz,shrunk.nucl_gb.accession2taxid.gz,shrunk.nucl_wgs.accession2taxid.gz,shrunk.pdb.accession2taxid.gz,shrunk.prot.accession2taxid.gz gitable.int2d.gz

Parameters

Gitable is a specialized utility tool with minimal configuration options. It primarily accepts Java memory parameters for processing large taxonomy files.

Java Parameters

-Xmx
Sets Java's memory usage, overriding autodetection. Example: -Xmx20g specifies 20 gigabytes of RAM. The maximum is typically 85% of physical memory. By default, the amount is auto-detected from available system memory, and 24GB is allocated if sufficient memory is available.
-eoom
Causes the process to exit if an out-of-memory exception occurs. Requires Java 8u92 or later. This prevents the JVM from hanging when memory is exhausted during processing of large taxonomy files.
-da
Disables Java assertions. This can slightly improve performance in production environments by skipping assertion checks, though it's generally recommended to keep assertions enabled for debugging purposes.
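
The 85% guideline for -Xmx can be turned into a concrete flag with a little arithmetic. The following is a minimal sketch of that calculation only; the function name xmx_flag is hypothetical, and this is not how gitable.sh performs its own autodetection.

```python
def xmx_flag(physical_bytes, percent=85):
    """Return a -Xmx flag sized to a percentage of physical memory.

    Hypothetical helper for illustration; the 85% default follows the
    guideline quoted in the parameter description.
    """
    usable = physical_bytes * percent // 100
    gib = usable // (1 << 30)          # whole gibibytes, rounded down
    if gib >= 1:
        return f"-Xmx{gib}g"
    # Fall back to megabytes on very small machines.
    return f"-Xmx{usable // (1 << 20)}m"
```

For a machine with 32GB of physical memory this gives -Xmx27g, which could then be passed as the first argument to gitable.sh.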

Input File Format

Gitable processes NCBI accession2taxid files, which are tab-delimited text files with the following format:

accession	accession.version	taxid	gi

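The tool specifically extracts the taxid and gi columns (columns 3 and 4) to build the lookup mapping. Input files can be given singly or as a comma-separated list, and the examples in this document use gzip-compressed (.gz) files.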
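
The column extraction described above can be sketched in a few lines of Python. This is an illustration of the file format, not the GiToTaxid.java implementation; the function names are hypothetical, and skipping rows whose gi or taxid is not a plain number (such as the header row) is an assumption.

```python
import gzip


def parse_line(line):
    """Return (gi, taxid) from one accession2taxid line, or None to skip it."""
    fields = line.rstrip("\r\n").split("\t")
    if len(fields) < 4:
        return None
    taxid_text, gi_text = fields[2], fields[3]   # columns 3 and 4
    for text in (taxid_text, gi_text):
        # Assumption: header rows and non-numeric values are skipped.
        if not (text.isascii() and text.isdigit()):
            return None
    return int(gi_text), int(taxid_text)


def read_pairs(path):
    """Yield (gi, taxid) pairs from a plain or gzip-compressed file."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as handle:
        for line in handle:
            pair = parse_line(line)
            if pair is not None:
                yield pair
```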

Examples

Basic Usage

gitable.sh nucl_gb.accession2taxid.gz gitable.int2d.gz

Process a single nucleotide GenBank accession file to create a gi-to-taxid lookup table.

Multiple File Processing

gitable.sh dead_nucl.accession2taxid.gz,dead_prot.accession2taxid.gz,nucl_gb.accession2taxid.gz,prot.accession2taxid.gz gitable.int2d.gz

Process several accession files in one run to build a single lookup table that covers multiple NCBI databases.

High Memory Processing

gitable.sh -Xmx50g nucl_gb.accession2taxid.gz,nucl_wgs.accession2taxid.gz gitable.int2d.gz

Allocate 50GB of memory for processing large taxonomy files, useful when working with complete NCBI datasets.

Wildcard Processing

gitable.sh "*.accession2taxid.gz" gitable.int2d.gz

Process all accession2taxid files in the current directory using wildcard pattern matching (note: wildcard # syntax only works for relative paths in the current directory).
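
When the files live somewhere other than the current directory, one alternative is to expand the pattern yourself and pass the result as a comma-separated list, as in the Multiple File Processing example. The sketch below only assembles the command; the function name build_gitable_command is hypothetical, and it assumes gitable.sh accepts paths in the comma-separated list.

```python
import glob
import os


def build_gitable_command(directory, output, pattern="*.accession2taxid.gz"):
    """Return the argument list for a gitable.sh run over matching files."""
    files = sorted(glob.glob(os.path.join(directory, pattern)))
    if not files:
        raise FileNotFoundError(f"no files matching {pattern} in {directory}")
    # gitable.sh takes the inputs as one comma-separated argument.
    return ["gitable.sh", ",".join(files), output]
```

The returned list can be handed to subprocess.run, which avoids any shell quoting issues with the pattern.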

Algorithm Details

Gitable implements a memory-efficient 2D integer array structure to store gi-to-taxid mappings. The algorithm uses a 30-bit right shift (gi >>> SHIFT) combined with a bitwise AND mask (gi & LOWERMASK) to split each gi number into an upper index and a lower index into the 2D array.
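
The index split can be illustrated with a small Python sketch. It is not the GiToTaxid.java code: the class name GiTable, the lazily grown rows, and the use of -1 for "no entry" are assumptions made for the example, and LOWERMASK is assumed to cover the low 30 bits. For non-negative values, Python's >> behaves like Java's >>>.

```python
from array import array

SHIFT = 30                      # 30-bit shift, as described in the text
LOWERMASK = (1 << SHIFT) - 1    # assumption: mask of the low SHIFT bits


def split_gi(gi, shift=SHIFT):
    """Return (upper, lower) indices for a non-negative gi number."""
    if gi < 0:
        raise ValueError("gi numbers must be non-negative")
    return gi >> shift, gi & ((1 << shift) - 1)


class GiTable:
    """Toy 2D table: rows[upper][lower] holds the taxid, -1 if unset."""

    def __init__(self, shift=SHIFT):
        self.shift = shift
        self.rows = []          # one array('i') per upper index

    def put(self, gi, taxid):
        upper, lower = split_gi(gi, self.shift)
        while len(self.rows) <= upper:
            self.rows.append(array("i"))
        row = self.rows[upper]
        if len(row) <= lower:   # grow the row on demand
            row.extend([-1] * (lower + 1 - len(row)))
        row[lower] = taxid

    def get(self, gi):
        upper, lower = split_gi(gi, self.shift)
        if upper >= len(self.rows) or lower >= len(self.rows[upper]):
            return -1
        return self.rows[upper][lower]
```

Splitting the index this way keeps each row within the range of a 32-bit array index while still allowing gi values above 2^31. A smaller shift (for example GiTable(shift=4)) is convenient for experimenting without allocating large rows.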

Data Structure Design

File Processing Strategy

Output Format

The tool produces binary .int2d files that can be efficiently loaded by other BBTools for fast gi-to-taxid lookups.

Performance Characteristics

Legacy Support Context

While NCBI deprecated gi numbers in favor of accession.version identifiers, this tool maintains compatibility with legacy data that still references gi numbers.

File Requirements

Input Files

Download the required accession2taxid files from NCBI:

ftp://ftp.ncbi.nih.gov/pub/taxonomy/accession2taxid/
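
Fetching the files can be scripted. The sketch below builds the download URLs from the directory above and retrieves files with Python's standard library; the helper names are hypothetical, and the file list is taken from the examples in this document rather than from a complete listing of the server.

```python
import os
import urllib.request

BASE_URL = "ftp://ftp.ncbi.nih.gov/pub/taxonomy/accession2taxid/"

# File names as they appear in the usage examples of this document.
EXAMPLE_FILES = [
    "dead_nucl.accession2taxid.gz",
    "dead_prot.accession2taxid.gz",
    "nucl_gb.accession2taxid.gz",
    "prot.accession2taxid.gz",
]


def file_url(name, base=BASE_URL):
    """Return the full URL for one accession2taxid file."""
    return base.rstrip("/") + "/" + name


def download(name, dest_dir="."):
    """Download one file into dest_dir and return the local path."""
    target = os.path.join(dest_dir, name)
    urllib.request.urlretrieve(file_url(name), target)
    return target
```

These files are large, so downloading only the ones you need before running gitable.sh is usually worthwhile.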

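Commonly used files include those named in the examples in this document: nucl_gb.accession2taxid.gz, nucl_wgs.accession2taxid.gz, prot.accession2taxid.gz, pdb.accession2taxid.gz, dead_nucl.accession2taxid.gz, dead_prot.accession2taxid.gz, and dead_wgs.accession2taxid.gz.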

Output Files

The tool generates binary lookup tables in .int2d format.

Usage Notes

Integration with BBTools

The generated gitable.int2d files are used by other BBTools for taxonomic classification.

Support

For questions and support:

See also: TaxonomyGuide.txt and fetchTaxonomy.sh for comprehensive taxonomic analysis documentation.