Fetch Taxonomy Pipeline
Automated pipeline for downloading and processing the latest NCBI taxonomy databases, including accession2taxid files and taxonomic tree data. It creates optimized taxonomic databases for use with BBTools taxonomic classification and filtering tools.
Overview
The Fetch Taxonomy pipeline automates the complex process of downloading and formatting NCBI's complete taxonomy database for use with BBTools. This includes accession-to-taxid mapping files, taxonomic tree structures, and optimized lookup tables that enable BBTools to perform rapid taxonomic classification and filtering operations.
The pipeline downloads seven different accession2taxid files covering nucleotide sequences, protein sequences, whole genome shotgun data, and protein database entries. It then processes these files along with the NCBI taxonomic tree to create optimized data structures for fast taxonomic queries.
Prerequisites
System Requirements
- BBTools suite with taxonomic support
- At least 24GB RAM for processing steps
- Network connectivity to ftp.ncbi.nih.gov
- wget for file downloads
- unzip utility for archive extraction
- Sufficient disk space (several GB for raw and processed files)
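Before launching the pipeline, it is worth confirming the required tools are installed. A minimal preflight sketch (POSIX shell; not part of fetchTaxonomy.sh itself):

```shell
#!/bin/sh
# Preflight check: verify the external tools the pipeline needs are on PATH.
have_cmd() { command -v "$1" >/dev/null 2>&1; }

for tool in wget unzip; do
    if have_cmd "$tool"; then
        echo "found: $tool"
    else
        echo "missing: $tool" >&2
    fi
done
```

The same check can be extended to the BBTools wrappers (shrinkaccession.sh, taxtree.sh, gitable.sh, analyzeaccession.sh) if they are expected on PATH.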
Dependencies
- shrinkaccession.sh - For compressing accession2taxid files
- taxtree.sh - For creating taxonomic tree structure
- gitable.sh - For creating optimized gi number lookup tables
- analyzeaccession.sh - For analyzing accession patterns
Pipeline Stages
1. Accession2TaxID Download and Processing Phase
Downloads and processes seven different types of accession-to-taxonomy ID mapping files from NCBI in parallel:
1.1 Dead Nucleotide Sequences
wget -q -O - ftp://ftp.ncbi.nih.gov/pub/taxonomy/accession2taxid/dead_nucl.accession2taxid.gz | \
shrinkaccession.sh in=stdin.txt.gz out=shrunk.dead_nucl.accession2taxid.gz zl=9 t=4 &
Downloads and processes withdrawn/obsolete nucleotide sequence accessions with compression level 9 using 4 threads.
1.2 Dead Protein Sequences
wget -q -O - ftp://ftp.ncbi.nih.gov/pub/taxonomy/accession2taxid/dead_prot.accession2taxid.gz | \
shrinkaccession.sh in=stdin.txt.gz out=shrunk.dead_prot.accession2taxid.gz zl=9 t=6 &
Downloads and processes withdrawn/obsolete protein sequence accessions using 6 threads for processing.
1.3 Dead Whole Genome Shotgun
wget -q -O - ftp://ftp.ncbi.nih.gov/pub/taxonomy/accession2taxid/dead_wgs.accession2taxid.gz | \
shrinkaccession.sh in=stdin.txt.gz out=shrunk.dead_wgs.accession2taxid.gz zl=9 t=6 &
Downloads and processes withdrawn whole genome shotgun accessions using 6 threads.
1.4 GenBank Nucleotide Sequences
wget -q -O - ftp://ftp.ncbi.nih.gov/pub/taxonomy/accession2taxid/nucl_gb.accession2taxid.gz | \
shrinkaccession.sh in=stdin.txt.gz out=shrunk.nucl_gb.accession2taxid.gz zl=9 t=8 &
Downloads and processes current GenBank nucleotide sequence accessions using 8 threads for higher throughput on this large dataset.
1.5 Whole Genome Shotgun Nucleotide
wget -q -O - ftp://ftp.ncbi.nih.gov/pub/taxonomy/accession2taxid/nucl_wgs.accession2taxid.gz | \
shrinkaccession.sh in=stdin.txt.gz out=shrunk.nucl_wgs.accession2taxid.gz zl=9 t=10 &
Downloads and processes current whole genome shotgun nucleotide accessions using 10 threads due to the large size of this dataset.
1.6 Protein Data Bank (PDB)
wget -q -O - ftp://ftp.ncbi.nih.gov/pub/taxonomy/accession2taxid/pdb.accession2taxid.gz | \
shrinkaccession.sh in=stdin.txt.gz out=shrunk.pdb.accession2taxid.gz zl=9 t=4 &
Downloads and processes Protein Data Bank accessions using 4 threads (smaller dataset).
1.7 Protein Sequences
wget -q -O - ftp://ftp.ncbi.nih.gov/pub/taxonomy/accession2taxid/prot.accession2taxid.gz | \
shrinkaccession.sh in=stdin.txt.gz out=shrunk.prot.accession2taxid.gz zl=9 t=10
Downloads and processes current protein sequence accessions using 10 threads. This command runs in the foreground (no trailing &), and the script then waits for the backgrounded downloads to finish before proceeding.
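The dispatch pattern above — six backgrounded pipelines plus one foreground command — can be sketched with placeholder jobs (the subshells below stand in for the wget | shrinkaccession.sh pipelines):

```shell
#!/bin/sh
tmp=$(mktemp -d)
# Launch placeholder jobs in the background with '&' ...
( sleep 1; touch "$tmp/job1.done" ) &
( sleep 1; touch "$tmp/job2.done" ) &
# ... run the largest job in the foreground (stands in for the prot download) ...
sleep 1
# ... then block until every background job has finished.
wait
echo "all downloads complete"
```

Without the final wait, the script could move on to the database-generation phase while some shrunk.*.accession2taxid.gz files were still being written.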
2. Taxonomic Tree Download Phase
2.1 Taxonomy Dump Download
wget -nv ftp://ftp.ncbi.nih.gov/pub/taxonomy/taxdmp.zip
Downloads the complete NCBI taxonomic database dump containing names.dmp, nodes.dmp, merged.dmp and other taxonomic structure files.
2.2 Archive Extraction
unzip -o taxdmp.zip
Extracts the taxonomic dump files, overwriting any existing files to ensure fresh data.
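A quick sanity check after extraction confirms the files the next stage consumes are present and non-empty (a sketch, not part of the original script):

```shell
#!/bin/sh
# Confirm the taxonomy dump files needed by taxtree.sh were extracted.
for f in names.dmp nodes.dmp merged.dmp; do
    if [ -s "$f" ]; then
        echo "ok: $f"
    else
        echo "missing or empty: $f" >&2
    fi
done
```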
3. Database Generation Phase
3.1 Taxonomic Tree Construction
time taxtree.sh names.dmp nodes.dmp merged.dmp tree.taxtree.gz -Xmx16g
Creates the optimized taxonomic tree structure (tree.taxtree.gz) from the NCBI taxonomy files using 16GB RAM. This tree enables rapid taxonomic hierarchy traversal and lookup operations.
3.2 GI Table Construction
time gitable.sh shrunk.dead_nucl.accession2taxid.gz,shrunk.dead_prot.accession2taxid.gz,\
shrunk.dead_wgs.accession2taxid.gz,shrunk.nucl_gb.accession2taxid.gz,\
shrunk.nucl_wgs.accession2taxid.gz,shrunk.pdb.accession2taxid.gz,\
shrunk.prot.accession2taxid.gz gitable.int1d.gz -Xmx24g
Creates the optimized GI number lookup table (gitable.int1d.gz) from all processed accession2taxid files using 24GB RAM. This enables rapid conversion from GI numbers to taxonomic IDs.
3.3 Accession Pattern Analysis
time analyzeaccession.sh shrunk.*.accession2taxid.gz out=patterns.txt
Analyzes all processed accession files to identify patterns for compression optimization and creates a patterns report.
4. Cleanup Phase
Removes temporary and intermediate files to conserve disk space:
- gi_*.dmp.gz - Temporary GI mapping files
- *.dmp - Extracted taxonomy dump files
- gc.prt - Genetic code table (not needed for BBTools)
- readme.txt - NCBI readme file
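The cleanup amounts to a handful of rm commands matching the file list above (a sketch of the equivalent commands):

```shell
#!/bin/sh
# Remove intermediates that downstream tools no longer need.
rm -f gi_*.dmp.gz   # temporary GI mapping files
rm -f *.dmp         # extracted taxonomy dump files (the tree is already built)
rm -f gc.prt        # genetic code table, unused by BBTools
rm -f readme.txt    # NCBI readme file
```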
Basic Usage
# 1. Navigate to a working directory with sufficient space
cd /path/to/taxonomy/directory
# 2. Run the fetch taxonomy pipeline
bash fetchTaxonomy.sh
# 3. Generated files will be ready for use with BBTools taxonomic programs
Output Files
Core Taxonomic Database Files
- tree.taxtree.gz - Optimized taxonomic tree structure for hierarchy traversal
- gitable.int1d.gz - Optimized GI number to taxonomic ID lookup table
- patterns.txt - Analysis report of accession patterns for optimization
Processed Accession Files
- shrunk.dead_nucl.accession2taxid.gz - Compressed obsolete nucleotide accessions
- shrunk.dead_prot.accession2taxid.gz - Compressed obsolete protein accessions
- shrunk.dead_wgs.accession2taxid.gz - Compressed obsolete WGS accessions
- shrunk.nucl_gb.accession2taxid.gz - Compressed GenBank nucleotide accessions
- shrunk.nucl_wgs.accession2taxid.gz - Compressed WGS nucleotide accessions
- shrunk.pdb.accession2taxid.gz - Compressed PDB protein accessions
- shrunk.prot.accession2taxid.gz - Compressed protein sequence accessions
Downloaded Raw Files
- taxdmp.zip - Original NCBI taxonomy dump (retained for reference)
- Temporary .dmp files are automatically removed after processing
Algorithm Details
Parallel Download Strategy
The pipeline uses parallel downloads with streaming compression to optimize both network and disk usage:
Stream Processing Architecture
Each accession2taxid file is processed using a streaming approach:
- Direct streaming: wget outputs to stdout, piped directly to shrinkaccession.sh
- No disk buffering: Raw files never touch disk, saving space
- Immediate compression: Output is compressed during processing
- Thread optimization: Different thread counts based on expected file sizes
Thread Allocation Strategy
Thread counts are optimized based on file characteristics:
- Large datasets: nucl_wgs (10 threads), prot (10 threads), nucl_gb (8 threads)
- Medium datasets: dead_prot (6 threads), dead_wgs (6 threads)
- Small datasets: dead_nucl (4 threads), pdb (4 threads)
- Balancing strategy: Prevents thread contention while maximizing throughput
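The size classes above could be encoded as a small lookup — a hypothetical helper illustrating the allocation, not part of fetchTaxonomy.sh:

```shell
#!/bin/sh
# Map each accession2taxid dataset to the thread count used by the pipeline.
threads_for() {
    case "$1" in
        nucl_wgs|prot)       echo 10 ;;  # largest datasets
        nucl_gb)             echo 8  ;;
        dead_prot|dead_wgs)  echo 6  ;;  # medium datasets
        dead_nucl|pdb)       echo 4  ;;  # small datasets
        *)                   echo 4  ;;  # conservative default
    esac
}
echo "nucl_wgs -> $(threads_for nucl_wgs) threads"
```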
Database Optimization Process
Accession File Compression
ShrinkAccession tool removes unnecessary columns and optimizes format:
- Column reduction: Removes unneeded metadata columns
- Format optimization: Converts to binary or optimized text format
- Compression level 9: Maximum compression for long-term storage
- Faster loading: Processed files load significantly faster than raw NCBI files
Taxonomic Tree Generation
TaxTree creates an optimized tree structure from NCBI dump files:
- Binary tree format: Enables O(log n) taxonomic lookups
- Hierarchy preservation: Maintains complete NCBI taxonomic structure
- Merged node handling: Processes merged.dmp to handle taxonomic ID changes
- Memory optimization: Uses 16GB for large taxonomic tree construction
GI Table Creation
GiTable builds optimized lookup structures for legacy GI number support:
- Integer array format: Direct mapping from GI numbers to taxonomic IDs
- Memory efficiency: Compressed integer arrays reduce memory footprint
- Legacy support: Enables tools to work with older sequence files containing GI numbers
- High memory usage: Requires 24GB due to large GI number space
Performance Characteristics
- Download time: Several minutes to hours depending on network speed
- Processing memory: Peak usage 24GB for GI table generation
- Parallel efficiency: Six backgrounded downloads plus one foreground download reduce total pipeline time
- Compression efficiency: Level 9 compression significantly reduces final file sizes
- Error handling: Pipeline stops on first failure (set -e) for data integrity
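The fail-fast behavior comes from the shell's set -e option; a minimal demonstration:

```shell
#!/bin/sh
# With 'set -e', the shell exits as soon as a command fails,
# so later stages never run against partial data.
status=$(sh -c 'set -e; false; echo "unreachable"'; echo "exit=$?")
echo "$status"   # prints only exit=1: the echo after 'false' never ran
```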
Data Sources
NCBI Accession2TaxID Files
The pipeline downloads seven categories of accession-to-taxonomy mappings:
Active Database Files
- nucl_gb.accession2taxid.gz - All GenBank nucleotide sequences (largest dataset)
- nucl_wgs.accession2taxid.gz - Whole genome shotgun nucleotide sequences
- prot.accession2taxid.gz - All protein sequences (large dataset)
- pdb.accession2taxid.gz - Protein Data Bank structure accessions
Historical/Dead Database Files
- dead_nucl.accession2taxid.gz - Withdrawn nucleotide accessions
- dead_prot.accession2taxid.gz - Withdrawn protein accessions
- dead_wgs.accession2taxid.gz - Withdrawn WGS accessions
NCBI Taxonomy Dump (taxdmp.zip)
Contains the complete NCBI taxonomic hierarchy:
- names.dmp - Taxonomic names and synonyms
- nodes.dmp - Taxonomic tree structure and ranks
- merged.dmp - Merged taxonomic IDs (for handling ID changes)
- Other files: citations.dmp, delnodes.dmp, division.dmp, gc.prt, gencode.dmp
Integration with BBTools
Using Generated Databases
The generated files enable taxonomic functionality in BBTools programs. Set the taxonomy path using:
Basic Taxonomy Path Setting
# For most BBTools programs
tool.sh in=sequences.fq taxpath=/path/to/taxonomy/files
# For automatic tree loading
tool.sh in=sequences.fq taxpath=/path/to/taxonomy/files tree=auto
Programs That Use These Files
- seal.sh - Taxonomic filtering and classification
- sortbytaxa.sh - Sort sequences by taxonomic hierarchy
- filterbytaxa.sh - Filter sequences by taxonomic criteria
- taxonomy.sh - Taxonomic lookup and conversion tools
- gi2taxid.sh - Convert GI numbers to taxonomic IDs
- sketch.sh - Create taxonomically-aware sketch databases
- sendsketch.sh - Taxonomic classification via sketch matching
- bbmap.sh - Alignment with taxonomic annotation
File Locations for BBTools
After running this pipeline, BBTools programs can find taxonomy files using:
- tree.taxtree.gz - Automatically detected when using tree=auto
- gitable.int1d.gz - Automatically detected for GI number support
- shrunk.*.accession2taxid.gz - Used internally by taxonomic tools
- patterns.txt - Used for accession compression optimization
Database Background
NCBI Taxonomy Database
The NCBI Taxonomy database is the authoritative source for organismal taxonomy used by molecular databases:
- Universal coverage: Includes all organisms represented in NCBI sequence databases
- Hierarchical structure: Standard taxonomic ranks from kingdom to species
- Regular updates: Updated continuously as new species are sequenced and described
- Cross-references: Links organisms to all associated sequence data
- Merged nodes: Handles taxonomic ID changes and reorganization over time
Accession2TaxID System
NCBI's accession-to-taxonomy mapping system enables sequence-level taxonomic annotation:
- Accession coverage: Maps sequence accessions to taxonomic IDs
- Multiple databases: Separate files for different sequence types
- Historical data: Includes withdrawn sequences for backward compatibility
- File structure: Tab-delimited format with accession, version, taxid, gi columns
BBTools Optimization
BBTools requires specific optimizations for efficient taxonomic operations:
- Binary tree format: Faster than text-based hierarchies
- Integer arrays: Direct indexing rather than hash table lookups
- Compressed storage: Reduces memory footprint for large databases
- Streaming support: Can process data without loading entire databases
Memory and Performance
Memory Requirements by Stage
- Download phase: Minimal memory, primarily network and disk I/O
- ShrinkAccession: Low memory per process, but up to seven parallel processes
- TaxTree generation: 16GB peak memory for tree construction
- GiTable generation: 24GB peak memory for integer array creation
- Pattern analysis: Low memory, file scanning operation
Processing Time Estimates
- Download phase: 10-60 minutes depending on network speed
- Accession processing: 20-40 minutes for compression and format conversion
- Tree generation: 5-15 minutes for taxonomic hierarchy construction
- GI table generation: 30-90 minutes for large integer array construction
- Pattern analysis: 5-10 minutes for file scanning
- Total pipeline time: 1.5-3 hours for complete execution
Disk Space Requirements
- Raw downloads: 8-12 GB for all accession2taxid files
- Processed files: 4-6 GB after compression and optimization
- Working space: Additional 2-4 GB for temporary files during processing
- Final database: 3-5 GB for complete taxonomic infrastructure
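A simple guard before starting avoids a failed run hours in; the 20 GB threshold below is illustrative (adjust to the estimates above):

```shell
#!/bin/sh
# Abort early if the working directory has less free space than required.
required_kb=$((20 * 1024 * 1024))            # illustrative 20 GB threshold
free_kb=$(df -Pk . | awk 'NR==2 {print $4}') # POSIX df: free KB in column 4
if [ "$free_kb" -lt "$required_kb" ]; then
    echo "only ${free_kb} KB free; need ${required_kb} KB" >&2
else
    echo "disk space OK: ${free_kb} KB free"
fi
```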
Troubleshooting
Download Issues
- Network timeouts: NCBI FTP server may be slow or temporarily unavailable
- Partial downloads: Check file sizes against expected values
- Connection limits: Some networks limit simultaneous connections
- File corruption: Re-run pipeline if checksums don't match
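Stock wget retry flags can harden the downloads against transient failures; a wrapper sketch (these are standard wget options, not part of the original script):

```shell
#!/bin/sh
# Wrapper adding retry/timeout flags to each streamed download:
#   --tries=5            retry each file up to 5 times
#   --waitretry=30       back off up to 30s between retries
#   --retry-connrefused  treat "connection refused" as transient
#   --timeout=60         give up on stalled reads after 60s
fetch() {
    wget --tries=5 --waitretry=30 --retry-connrefused --timeout=60 -q -O - "$1"
}
# Usage inside the pipeline, e.g.:
# fetch ftp://ftp.ncbi.nih.gov/pub/taxonomy/accession2taxid/pdb.accession2taxid.gz | \
#     shrinkaccession.sh in=stdin.txt.gz out=shrunk.pdb.accession2taxid.gz zl=9 t=4
```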
Memory Issues
- TaxTree failures: Reduce -Xmx16g if system has insufficient RAM
- GiTable failures: GI table generation requires substantial memory (24GB minimum recommended)
- System swapping: Ensure adequate physical RAM to avoid swap thrashing
- Java heap errors: Monitor system memory during processing phases
Processing Failures
- Parallel process failures: One failed download will cause pipeline exit
- Disk space exhaustion: Monitor disk usage during processing
- Permission errors: Ensure write permissions in working directory
- Tool dependencies: Verify all BBTools components are properly installed
Data Integrity
- File size verification: Compare generated file sizes to expected ranges
- Content validation: Check that tree.taxtree.gz and gitable.int1d.gz load properly
- Pattern analysis: Review patterns.txt for anomalies in accession data
- Test integration: Verify generated files work with taxonomic BBTools programs
Database Maintenance
Regular Updates
NCBI taxonomy is updated regularly, requiring periodic database regeneration:
- Update frequency: Monthly updates recommended for active research
- Incremental changes: New species additions, taxonomic reclassifications
- Accession additions: New sequence submissions require updated mappings
- Backward compatibility: Old files remain functional but may lack newest data
Version Control
- Timestamp tracking: Note download date for reproducibility
- File versioning: Consider backing up working databases before updates
- Consistency checking: Verify all files are from the same update cycle
- Change documentation: Track significant taxonomic changes between updates
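Timestamp tracking can be as simple as writing a stamp file alongside the database (the file name TAXONOMY_VERSION.txt is a hypothetical convention):

```shell
#!/bin/sh
# Record when the database was fetched so analyses can be tied to a taxonomy version.
date -u +%Y-%m-%d > TAXONOMY_VERSION.txt
echo "fetched from ftp.ncbi.nih.gov/pub/taxonomy" >> TAXONOMY_VERSION.txt
cat TAXONOMY_VERSION.txt
```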
Storage Management
- Archive old versions: Keep previous databases for reproducibility
- Compression efficiency: Use maximum compression for archived versions
- Access patterns: Most recent databases should be on fastest storage
- Cleanup automation: Pipeline automatically removes unnecessary temporary files
Example Workflows
Initial Setup for New System
# 1. Create dedicated taxonomy directory
mkdir -p /data/taxonomy/ncbi
cd /data/taxonomy/ncbi
# 2. Run fetch taxonomy pipeline
bash /path/to/bbtools/pipelines/fetch/fetchTaxonomy.sh
# 3. Set environment variable for other BBTools
export TAXPATH=/data/taxonomy/ncbi
# 4. Test with a simple taxonomic tool
taxonomy.sh tree=tree.taxtree.gz homo_sapiens
Updating Existing Database
# 1. Backup current database
cp -r /data/taxonomy/ncbi /data/taxonomy/ncbi_backup_$(date +%Y%m%d)
# 2. Clean working directory
cd /data/taxonomy/ncbi
rm -f shrunk.* tree.taxtree.gz gitable.int1d.gz patterns.txt
# 3. Run update
bash fetchTaxonomy.sh
# 4. Verify database integrity
du -h tree.taxtree.gz gitable.int1d.gz # Check file sizes are reasonable
taxonomy.sh tree=tree.taxtree.gz escherichia_coli # Test functionality
Integrating with Analysis Pipeline
# 1. Set taxonomy path for session
export TAXPATH=/data/taxonomy/ncbi
# 2. Use in sequence classification
sendsketch.sh in=unknown_sequences.fa.gz tree=auto
# 3. Use in filtering pipeline
filterbytaxa.sh in=mixed_sequences.fa.gz out=bacteria_only.fa.gz \
include tree=auto level=superkingdom id=bacteria
# 4. Use in mapping with taxonomic annotation
bbmap.sh in=reads.fq.gz ref=reference.fa.gz \
taxpath=auto printunmappedcount
Advanced Configuration
Customizing Thread Usage
For systems with different CPU configurations, you can modify the shrinkaccession thread counts in the script:
- High-CPU systems: Increase thread counts for faster processing
- Low-CPU systems: Decrease thread counts to prevent overload
- Memory-bound systems: Reduce parallel downloads to lower peak memory
Memory Optimization
For systems with memory constraints:
- TaxTree: Can reduce from 16GB but may increase processing time
- GiTable: 24GB is typically minimum for complete GI number space
- Sequential processing: Process accession files one at a time instead of parallel
Network Configuration
- Proxy support: Configure wget proxy settings if needed
- Connection limits: Some networks restrict simultaneous FTP connections
- Bandwidth management: Use wget rate limiting on shared connections
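These adjustments map onto stock wget options and standard proxy environment variables; the proxy host and rate below are illustrative:

```shell
#!/bin/sh
# Route downloads through an FTP proxy (hypothetical proxy host):
export ftp_proxy="http://proxy.example.org:3128"
# Cap bandwidth so the downloads do not saturate a shared link:
RATE="--limit-rate=5m"
# Example (commented out; requires network access):
# wget $RATE -q -O - ftp://ftp.ncbi.nih.gov/pub/taxonomy/accession2taxid/pdb.accession2taxid.gz
echo "proxy=$ftp_proxy rate=$RATE"
```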
Notes and Considerations
- One-time setup: This pipeline typically needs to be run only once per system, then periodically for updates
- Resource intensive: Requires substantial computational resources and time
- Network dependency: Requires reliable internet connection to NCBI FTP servers
- System compatibility: Works on any system with BBTools, wget, and unzip
- Error tolerance: Pipeline stops on first failure to prevent corrupted databases
- Cleanup automation: Automatically removes intermediate files to save disk space
- Performance monitoring: Key steps are timed for performance analysis
- Thread optimization: Different thread counts optimized for different file sizes
- Compression efficiency: Maximum compression reduces storage requirements
- Legacy support: Maintains compatibility with older sequence files using GI numbers