GITable
Creates gitable.int2d from accession files downloaded from NCBI's taxonomy database. This tool processes accession2taxid files to build a lookup table for converting deprecated gi numbers to taxonomy IDs. While gi numbers are no longer used by NCBI and accession numbers are preferred, this tool maintains compatibility with legacy data that still references gi numbers.
Basic Usage
gitable.sh input_files output_file
The tool takes comma-separated accession2taxid files as input and produces a binary lookup table file (.int2d format).
gitable.sh shrunk.dead_nucl.accession2taxid.gz,shrunk.dead_prot.accession2taxid.gz,shrunk.dead_wgs.accession2taxid.gz,shrunk.nucl_gb.accession2taxid.gz,shrunk.nucl_wgs.accession2taxid.gz,shrunk.pdb.accession2taxid.gz,shrunk.prot.accession2taxid.gz gitable.int2d.gz
Parameters
Gitable is a specialized utility tool with minimal configuration options. It primarily accepts Java memory parameters for processing large taxonomy files.
Java Parameters
- -Xmx
- Sets Java's memory usage, overriding autodetection. For example, -Xmx20g specifies 20 gigabytes of RAM. The maximum is typically 85% of physical memory. By default, memory is auto-detected from the available system memory, with 24GB allocated if sufficient memory is available.
- -eoom
- Causes the process to exit if an out-of-memory exception occurs. Requires Java 8u92 or later. This prevents the JVM from hanging when memory is exhausted during processing of large taxonomy files.
- -da
- Disables Java assertions. This can slightly improve performance in production environments by skipping assertion checks, though it's generally recommended to keep assertions enabled for debugging purposes.
Input File Format
Gitable processes NCBI accession2taxid files, which are tab-delimited text files with the following format:
accession accession.version taxid gi
The tool specifically extracts the taxid and gi columns (columns 3 and 4) to build the lookup mapping. Input files can be:
- Compressed with gzip (.gz extension)
- Multiple files specified as comma-separated list
- Wildcard patterns using # (works only in current directory)
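As a rough illustration of the column extraction described above, the following Python sketch parses one line of an accession2taxid file. It is not the tool's own code; the function name parse_taxid_gi, the choice to return None for unusable lines, and the sample line are all made up for this example.

```python
from typing import Optional, Tuple


def parse_taxid_gi(line: str) -> Optional[Tuple[int, int]]:
    """Return (taxid, gi) from one tab-delimited line, or None if unusable."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 4:
        return None  # too few columns
    try:
        taxid = int(fields[2])  # column 3
        gi = int(fields[3])     # column 4
    except ValueError:
        return None  # header row or non-numeric value
    return taxid, gi


# Made-up sample line, followed by the header row.
print(parse_taxid_gi("X1\tX1.1\t9606\t12345\n"))                      # (9606, 12345)
print(parse_taxid_gi("accession\taccession.version\ttaxid\tgi\n"))   # None
```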
Examples
Basic Usage
gitable.sh nucl_gb.accession2taxid.gz gitable.int2d.gz
Process a single nucleotide GenBank accession file to create a gi-to-taxid lookup table.
Multiple File Processing
gitable.sh dead_nucl.accession2taxid.gz,dead_prot.accession2taxid.gz,nucl_gb.accession2taxid.gz,prot.accession2taxid.gz gitable.int2d.gz
Process multiple accession files in a single run to build a lookup table that covers several NCBI databases.
High Memory Processing
gitable.sh -Xmx50g nucl_gb.accession2taxid.gz,nucl_wgs.accession2taxid.gz gitable.int2d.gz
Allocate 50GB of memory for processing large taxonomy files, useful when working with complete NCBI datasets.
Wildcard Processing
gitable.sh "*.accession2taxid.gz" gitable.int2d.gz
Process all accession2taxid files in the current directory using wildcard pattern matching. Note that the # wildcard syntax works only for relative paths in the current directory.
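If you prefer not to rely on the tool's own wildcard handling, the comma-separated input list can be built explicitly. The sketch below uses Python's standard glob module; the helper name comma_list is hypothetical, and the resulting string is simply what would be passed as the first argument to gitable.sh.

```python
import glob
import os


def comma_list(pattern: str, directory: str = ".") -> str:
    """Join all files matching `pattern` in `directory` into a comma-separated string."""
    matches = sorted(glob.glob(os.path.join(directory, pattern)))
    return ",".join(matches)


if __name__ == "__main__":
    inputs = comma_list("*.accession2taxid.gz")
    # Print the command instead of running it, so the list can be checked first.
    print(f"gitable.sh {inputs} gitable.int2d.gz")
```

Sorting the matches keeps the input order stable between runs.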
Algorithm Details
Gitable stores gi-to-taxid mappings in a memory-efficient 2D integer array. Each gi number is split into an upper index by a 30-bit unsigned right shift (gi >>> SHIFT) and a lower index by a bitwise AND mask (gi & LOWERMASK); together the two indices address one cell of the 2D array.
Data Structure Design
- 2D Array Architecture: Uses a two-level array structure with upper and lower indices derived from bit manipulation (30-bit shift operation)
- Memory Optimization: Only allocates array slices when needed, preventing memory waste for sparse gi number ranges
- Bit Shifting Strategy: Upper index = gi >>> 30, lower index = gi & ((1L << 30) - 1), allowing efficient access to gi numbers up to 2^60
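The index split and on-demand slice allocation described above can be sketched in Python as follows. This is an illustrative model, not the tool's source: the class name SparseGiTable, the dict of lists used for slices, the -1 "missing" value, and the demo setting shift=4 (so the example runs with 16-entry slices instead of 2^30-entry ones) are all assumptions. For non-negative integers, Python's >> behaves like the unsigned shift >>> used in the description.

```python
class SparseGiTable:
    """Two-level gi -> taxid table whose lower slices are allocated on first write."""

    def __init__(self, shift: int = 30, missing: int = -1):
        self.shift = shift
        self.lower_mask = (1 << shift) - 1
        self.missing = missing
        self.slices = {}  # upper index -> list of taxids (hypothetical storage)

    def split(self, gi: int):
        """Return (upper, lower) indices for a non-negative gi."""
        return gi >> self.shift, gi & self.lower_mask

    def put(self, gi: int, taxid: int) -> None:
        upper, lower = self.split(gi)
        if upper not in self.slices:
            # Allocate a slice only when a gi first lands in it.
            self.slices[upper] = [self.missing] * (1 << self.shift)
        self.slices[upper][lower] = taxid

    def get(self, gi: int) -> int:
        upper, lower = self.split(gi)
        row = self.slices.get(upper)
        return self.missing if row is None else row[lower]


table = SparseGiTable(shift=4)   # 16 entries per slice, for demonstration only
table.put(5, 9606)               # upper 0, lower 5
table.put(37, 562)               # upper 2, lower 5
print(table.split(37))           # (2, 5)
print(sorted(table.slices))      # [0, 2]; slice 1 was never allocated
print(table.get(21))             # -1, because nothing was stored in slice 1
```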
File Processing Strategy
- Incremental Loading: Processes input files sequentially, building the lookup table incrementally to handle large datasets
- Tab-Delimited Parsing: Efficiently parses tab-separated values using byte-level processing to extract taxid and gi values
- Duplicate Handling: Detects and reports conflicting gi-to-taxid mappings, maintaining data integrity
- Invalid Entry Filtering: Skips malformed lines and negative gi numbers to ensure lookup table consistency
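The loading steps listed above might look roughly like this in Python. It is a sketch under stated assumptions rather than the tool's implementation: a plain dict stands in for the 2D array, the function name load_gi_table and the counter names are invented for illustration, header rows are counted as invalid lines, and on a conflict the first mapping is kept (the description above does not say which mapping the tool keeps).

```python
import gzip
from typing import Dict, Iterable, Tuple


def load_gi_table(paths: Iterable[str]) -> Tuple[Dict[int, int], Dict[str, int]]:
    """Read accession2taxid files one after another into a gi -> taxid dict."""
    table: Dict[int, int] = {}
    stats = {"valid": 0, "invalid": 0, "conflicts": 0}
    for path in paths:  # sequential, so only one file is open at a time
        opener = gzip.open if path.endswith(".gz") else open
        with opener(path, "rt") as handle:
            for line in handle:
                fields = line.rstrip("\n").split("\t")
                try:
                    taxid, gi = int(fields[2]), int(fields[3])
                except (IndexError, ValueError):
                    stats["invalid"] += 1  # header or malformed line
                    continue
                if gi < 0:
                    stats["invalid"] += 1  # negative gi numbers are skipped
                    continue
                previous = table.get(gi)
                if previous is not None and previous != taxid:
                    stats["conflicts"] += 1  # same gi, different taxid
                    continue  # assumption: keep the first mapping seen
                table[gi] = taxid
                stats["valid"] += 1
    return table, stats
```

Calling load_gi_table(["nucl_gb.accession2taxid.gz"]) on a downloaded file would return the mapping together with the three counters.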
Output Format
The tool produces binary .int2d files that can be efficiently loaded by other BBTools for fast gi-to-taxid lookups. The format supports:
- Fast random access to any gi number
- Compressed storage of sparse data
- Thread-safe concurrent access
- Automatic detection of maximum loaded gi number
Performance Characteristics
- Memory Efficiency: Sparse array implementation minimizes memory usage for datasets with gaps in gi number sequences
- Lookup Speed: O(1) constant time access to any gi number through direct array indexing
- Scalability: Can handle gi numbers up to 2^60, accommodating NCBI's entire historical range
- Processing Speed: Byte-level file processing optimized for large input files (several GB)
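A quick back-of-the-envelope calculation shows why memory needs grow with gi coverage. It assumes each table entry is a 4-byte integer and each allocated slice holds 2^30 entries (following the 30-bit split described earlier); the gi value used is an arbitrary example, not an NCBI figure.

```python
SHIFT = 30
BYTES_PER_ENTRY = 4  # assumption: 32-bit integers


def slices_needed(max_gi: int) -> int:
    """Number of slices touched if gi numbers densely cover 0..max_gi."""
    return (max_gi >> SHIFT) + 1


def table_bytes(max_gi: int) -> int:
    """Total bytes for all touched slices, each holding 2**SHIFT entries."""
    return slices_needed(max_gi) * (1 << SHIFT) * BYTES_PER_ENTRY


example_max_gi = 1_500_000_000  # arbitrary illustrative value
print(slices_needed(example_max_gi))          # 2
print(table_bytes(example_max_gi) / 2**30)    # 8.0 (GiB)
```

Under these assumptions every additional slice costs 4 GiB, which is consistent with the large heap sizes suggested for complete datasets.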
Legacy Support Context
While NCBI deprecated gi numbers in favor of accession.version identifiers, this tool maintains compatibility with:
- Historical sequence databases that reference gi numbers
- Legacy analysis pipelines that haven't migrated to accession numbers
- Existing taxonomic classification workflows in BBTools
- Research reproducibility requiring original gi-based identifiers
File Requirements
Input Files
Download the required accession2taxid files from NCBI:
ftp://ftp.ncbi.nih.gov/pub/taxonomy/accession2taxid/
Commonly used files include:
- dead_nucl.accession2taxid.gz: Discontinued nucleotide accessions
- dead_prot.accession2taxid.gz: Discontinued protein accessions
- dead_wgs.accession2taxid.gz: Discontinued WGS accessions
- nucl_gb.accession2taxid.gz: GenBank nucleotide accessions
- nucl_wgs.accession2taxid.gz: WGS nucleotide accessions
- pdb.accession2taxid.gz: Protein Data Bank accessions
- prot.accession2taxid.gz: GenBank protein accessions
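For scripted downloads, the file names above can be combined with the NCBI directory URL. The Python sketch below only builds the URLs and the comma-separated argument for gitable.sh; it does not fetch anything, and the constant and function names are chosen purely for illustration.

```python
BASE_URL = "ftp://ftp.ncbi.nih.gov/pub/taxonomy/accession2taxid/"

FILES = (
    "dead_nucl.accession2taxid.gz",
    "dead_prot.accession2taxid.gz",
    "dead_wgs.accession2taxid.gz",
    "nucl_gb.accession2taxid.gz",
    "nucl_wgs.accession2taxid.gz",
    "pdb.accession2taxid.gz",
    "prot.accession2taxid.gz",
)


def download_urls(files=FILES, base=BASE_URL):
    """Return the full URL for each accession2taxid file."""
    return [base + name for name in files]


def gitable_input_arg(files=FILES):
    """Return the comma-separated list expected as gitable.sh's first argument."""
    return ",".join(files)


for url in download_urls():
    print(url)
print(gitable_input_arg())
```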
Output Files
The tool generates binary lookup tables in .int2d format:
- gitable.int2d: Uncompressed binary lookup table
- gitable.int2d.gz: Compressed binary lookup table (recommended for storage)
Usage Notes
- Memory Requirements: Processing complete NCBI datasets requires significant memory (24GB+ recommended)
- Processing Time: Large taxonomy files can take substantial time to process; monitor system resources
- File Size: Output .int2d files can be several gigabytes depending on input coverage
- Compression: Use gzip compression for output files to save storage space
- Validation: The tool reports processing statistics including valid and invalid entry counts
- Updates: Regenerate lookup tables when NCBI updates taxonomy databases
Integration with BBTools
The generated gitable.int2d files are used by other BBTools for taxonomic classification:
- Taxonomy Tools: Used by various BBTools for converting gi numbers in sequence headers to taxonomy IDs
- Classification Pipelines: Enables taxonomic assignment in metagenomic analysis workflows
- Legacy Data Processing: Allows analysis of older datasets that use gi number identifiers
- Cross-Reference: See TaxonomyGuide.txt and fetchTaxonomy.sh for complete taxonomic analysis workflows
Support
For questions and support:
- Email: bbushnell@lbl.gov
- Documentation: bbmap.org
See also: TaxonomyGuide.txt and fetchTaxonomy.sh for comprehensive taxonomic analysis documentation.