Fetch Taxonomy Pipeline
Automated pipeline for downloading and processing the latest NCBI taxonomy databases, including accession2taxid files and taxonomic tree data. It creates optimized taxonomic databases for use with BBTools taxonomic classification and filtering tools.
Overview
The Fetch Taxonomy pipeline automates the complex process of downloading and formatting NCBI's complete taxonomy database for use with BBTools. This includes accession-to-taxid mapping files, taxonomic tree structures, and optimized lookup tables that enable BBTools to perform rapid taxonomic classification and filtering operations.
The pipeline downloads seven different accession2taxid files covering nucleotide sequences, protein sequences, whole genome shotgun data, and protein database entries. It then processes these files along with the NCBI taxonomic tree to create optimized data structures for fast taxonomic queries.
Prerequisites
System Requirements
- BBTools suite with taxonomic support
- At least 24GB RAM for processing steps
- Network connectivity to ftp.ncbi.nih.gov
- wget for file downloads
- unzip utility for archive extraction
- Sufficient disk space (several GB for raw and processed files)
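Before launching the pipeline, it is worth confirming the required tools are installed. A minimal preflight sketch (POSIX shell; not part of fetchTaxonomy.sh itself):

```shell
#!/bin/sh
# Preflight check: verify the external tools the pipeline needs are on PATH.
have_cmd() { command -v "$1" >/dev/null 2>&1; }

for tool in wget unzip; do
    if have_cmd "$tool"; then
        echo "found: $tool"
    else
        echo "missing: $tool" >&2
    fi
done
```

The same check can be extended to the BBTools wrappers (shrinkaccession.sh, taxtree.sh, gitable.sh, analyzeaccession.sh) if they are expected on PATH.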
Dependencies
- shrinkaccession.sh - For compressing accession2taxid files
- taxtree.sh - For creating taxonomic tree structure
- gitable.sh - For creating optimized gi number lookup tables
- analyzeaccession.sh - For analyzing accession patterns
Pipeline Stages
1. Accession2TaxID Download and Processing Phase
Downloads and processes seven different types of accession-to-taxonomy ID mapping files from NCBI in parallel:
1.1 Dead Nucleotide Sequences
wget -q -O - ftp://ftp.ncbi.nih.gov/pub/taxonomy/accession2taxid/dead_nucl.accession2taxid.gz | \
shrinkaccession.sh in=stdin.txt.gz out=shrunk.dead_nucl.accession2taxid.gz zl=9 t=4 &
Downloads and processes withdrawn/obsolete nucleotide sequence accessions with compression level 9 using 4 threads.
1.2 Dead Protein Sequences
wget -q -O - ftp://ftp.ncbi.nih.gov/pub/taxonomy/accession2taxid/dead_prot.accession2taxid.gz | \
shrinkaccession.sh in=stdin.txt.gz out=shrunk.dead_prot.accession2taxid.gz zl=9 t=6 &
Downloads and processes withdrawn/obsolete protein sequence accessions using 6 threads for processing.
1.3 Dead Whole Genome Shotgun
wget -q -O - ftp://ftp.ncbi.nih.gov/pub/taxonomy/accession2taxid/dead_wgs.accession2taxid.gz | \
shrinkaccession.sh in=stdin.txt.gz out=shrunk.dead_wgs.accession2taxid.gz zl=9 t=6 &
Downloads and processes withdrawn whole genome shotgun accessions using 6 threads.
1.4 GenBank Nucleotide Sequences
wget -q -O - ftp://ftp.ncbi.nih.gov/pub/taxonomy/accession2taxid/nucl_gb.accession2taxid.gz | \
shrinkaccession.sh in=stdin.txt.gz out=shrunk.nucl_gb.accession2taxid.gz zl=9 t=8 &
Downloads and processes current GenBank nucleotide sequence accessions using 8 threads for higher throughput on this large dataset.
1.5 Whole Genome Shotgun Nucleotide
wget -q -O - ftp://ftp.ncbi.nih.gov/pub/taxonomy/accession2taxid/nucl_wgs.accession2taxid.gz | \
shrinkaccession.sh in=stdin.txt.gz out=shrunk.nucl_wgs.accession2taxid.gz zl=9 t=10 &
Downloads and processes current whole genome shotgun nucleotide accessions using 10 threads due to the large size of this dataset.
1.6 Protein Data Bank (PDB)
wget -q -O - ftp://ftp.ncbi.nih.gov/pub/taxonomy/accession2taxid/pdb.accession2taxid.gz | \
shrinkaccession.sh in=stdin.txt.gz out=shrunk.pdb.accession2taxid.gz zl=9 t=4 &
Downloads and processes Protein Data Bank accessions using 4 threads (smaller dataset).
1.7 Protein Sequences
wget -q -O - ftp://ftp.ncbi.nih.gov/pub/taxonomy/accession2taxid/prot.accession2taxid.gz | \
shrinkaccession.sh in=stdin.txt.gz out=shrunk.prot.accession2taxid.gz zl=9 t=10
Downloads and processes current protein sequence accessions using 10 threads. This command runs in the foreground (no trailing &), and the script then waits for the backgrounded downloads to finish before proceeding.
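The dispatch pattern above — six backgrounded pipelines plus one foreground command — can be sketched with placeholder jobs (the subshells below stand in for the wget | shrinkaccession.sh pipelines):

```shell
#!/bin/sh
tmp=$(mktemp -d)
# Launch placeholder jobs in the background with '&' ...
( sleep 1; touch "$tmp/job1.done" ) &
( sleep 1; touch "$tmp/job2.done" ) &
# ... run the largest job in the foreground (stands in for the prot download) ...
sleep 1
# ... then block until every background job has finished.
wait
echo "all downloads complete"
```

Without the final wait, the script could move on to the database-generation phase while some shrunk.*.accession2taxid.gz files were still being written.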
2. Taxonomic Tree Download Phase
2.1 Taxonomy Dump Download
wget -nv ftp://ftp.ncbi.nih.gov/pub/taxonomy/taxdmp.zip
Downloads the complete NCBI taxonomic database dump containing names.dmp, nodes.dmp, merged.dmp and other taxonomic structure files.
2.2 Archive Extraction
unzip -o taxdmp.zip
Extracts the taxonomic dump files, overwriting any existing files to ensure fresh data.
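A quick sanity check after extraction confirms the files the next stage consumes are present and non-empty (a sketch, not part of the original script):

```shell
#!/bin/sh
# Confirm the taxonomy dump files needed by taxtree.sh were extracted.
for f in names.dmp nodes.dmp merged.dmp; do
    if [ -s "$f" ]; then
        echo "ok: $f"
    else
        echo "missing or empty: $f" >&2
    fi
done
```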
3. Database Generation Phase
3.1 Taxonomic Tree Construction
time taxtree.sh names.dmp nodes.dmp merged.dmp tree.taxtree.gz -Xmx16g
Creates the optimized taxonomic tree structure (tree.taxtree.gz) from the NCBI taxonomy files using 16GB RAM. This tree enables rapid taxonomic hierarchy traversal and lookup operations.
3.2 GI Table Construction
time gitable.sh shrunk.dead_nucl.accession2taxid.gz,shrunk.dead_prot.accession2taxid.gz,\
shrunk.dead_wgs.accession2taxid.gz,shrunk.nucl_gb.accession2taxid.gz,\
shrunk.nucl_wgs.accession2taxid.gz,shrunk.pdb.accession2taxid.gz,\
shrunk.prot.accession2taxid.gz gitable.int1d.gz -Xmx24g
Creates the optimized GI number lookup table (gitable.int1d.gz) from all processed accession2taxid files using 24GB RAM. This enables rapid conversion from GI numbers to taxonomic IDs.
3.3 Accession Pattern Analysis
time analyzeaccession.sh shrunk.*.accession2taxid.gz out=patterns.txt
Analyzes all processed accession files to identify patterns for compression optimization and creates a patterns report.
4. Cleanup Phase
Removes temporary and intermediate files to conserve disk space:
- gi_*.dmp.gz - Temporary GI mapping files
- *.dmp - Extracted taxonomy dump files
- gc.prt - Genetic code table (not needed for BBTools)
- readme.txt - NCBI readme file
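The cleanup amounts to a handful of rm commands matching the file list above (a sketch of the equivalent commands):

```shell
#!/bin/sh
# Remove intermediates that downstream tools no longer need.
rm -f gi_*.dmp.gz   # temporary GI mapping files
rm -f *.dmp         # extracted taxonomy dump files (the tree is already built)
rm -f gc.prt        # genetic code table, unused by BBTools
rm -f readme.txt    # NCBI readme file
```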
Basic Usage
# 1. Navigate to a working directory with sufficient space
cd /path/to/taxonomy/directory
# 2. Run the fetch taxonomy pipeline
bash fetchTaxonomy.sh
# 3. Generated files will be ready for use with BBTools taxonomic programs
Output Files
Core Taxonomic Database Files
- tree.taxtree.gz - Optimized taxonomic tree structure for hierarchy traversal
- gitable.int1d.gz - Optimized GI number to taxonomic ID lookup table
- patterns.txt - Analysis report of accession patterns for optimization
Processed Accession Files
- shrunk.dead_nucl.accession2taxid.gz - Compressed obsolete nucleotide accessions
- shrunk.dead_prot.accession2taxid.gz - Compressed obsolete protein accessions
- shrunk.dead_wgs.accession2taxid.gz - Compressed obsolete WGS accessions
- shrunk.nucl_gb.accession2taxid.gz - Compressed GenBank nucleotide accessions
- shrunk.nucl_wgs.accession2taxid.gz - Compressed WGS nucleotide accessions
- shrunk.pdb.accession2taxid.gz - Compressed PDB protein accessions
- shrunk.prot.accession2taxid.gz - Compressed protein sequence accessions
Downloaded Raw Files
- taxdmp.zip - Original NCBI taxonomy dump (retained for reference)
- Temporary .dmp files are automatically removed after processing
Algorithm Details
Parallel Download Strategy
The pipeline uses parallel downloads with streaming compression to optimize both network and disk usage:
Stream Processing Architecture
Each accession2taxid file is processed using a streaming approach:
- Direct streaming: wget outputs to stdout, piped directly to shrinkaccession.sh
- No disk buffering: Raw files never touch disk, saving space
- Immediate compression: Output is compressed during processing
- Thread optimization: Different thread counts based on expected file sizes
Thread Allocation Strategy
Thread counts are optimized based on file characteristics:
- Large datasets: nucl_wgs (10 threads), prot (10 threads), nucl_gb (8 threads)
- Medium datasets: dead_prot (6 threads), dead_wgs (6 threads)
- Small datasets: dead_nucl (4 threads), pdb (4 threads)
- Balancing strategy: Prevents thread contention while maximizing throughput
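The size classes above could be encoded as a small lookup — a hypothetical helper illustrating the allocation, not part of fetchTaxonomy.sh:

```shell
#!/bin/sh
# Map each accession2taxid dataset to the thread count used by the pipeline.
threads_for() {
    case "$1" in
        nucl_wgs|prot)       echo 10 ;;  # largest datasets
        nucl_gb)             echo 8  ;;
        dead_prot|dead_wgs)  echo 6  ;;  # medium datasets
        dead_nucl|pdb)       echo 4  ;;  # small datasets
        *)                   echo 4  ;;  # conservative default
    esac
}
echo "nucl_wgs -> $(threads_for nucl_wgs) threads"
```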
Database Optimization Process
Accession File Compression
ShrinkAccession tool removes unnecessary columns and optimizes format:
- Column reduction: Removes unneeded metadata columns
- Format optimization: Converts to binary or optimized text format
- Compression level 9: Maximum compression for long-term storage
- Faster loading: Processed files load significantly faster than raw NCBI files
Taxonomic Tree Generation
TaxTree creates an optimized tree structure from NCBI dump files:
- Binary tree format: Enables O(log n) taxonomic lookups
- Hierarchy preservation: Maintains complete NCBI taxonomic structure
- Merged node handling: Processes merged.dmp to handle taxonomic ID changes
- Memory optimization: Uses 16GB for large taxonomic tree construction
GI Table Creation
GiTable builds optimized lookup structures for legacy GI number support:
- Integer array format: Direct mapping from GI numbers to taxonomic IDs
- Memory efficiency: Compressed integer arrays reduce memory footprint
- Legacy support: Enables tools to work with older sequence files containing GI numbers
- High memory usage: Requires 24GB due to large GI number space
Performance Characteristics
- Download time: Several minutes to hours depending on network speed
- Processing memory: Peak usage 24GB for GI table generation
- Parallel efficiency: Six backgrounded downloads plus one foreground download reduce total pipeline time
- Compression efficiency: Level 9 compression significantly reduces final file sizes
- Error handling: Pipeline stops on first failure (set -e) for data integrity
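The fail-fast behavior comes from the shell's set -e option; a minimal demonstration:

```shell
#!/bin/sh
# With 'set -e', the shell exits as soon as a command fails,
# so later stages never run against partial data.
status=$(sh -c 'set -e; false; echo "unreachable"'; echo "exit=$?")
echo "$status"   # prints only exit=1: the echo after 'false' never ran
```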
Data Sources
NCBI Accession2TaxID Files
The pipeline downloads seven categories of accession-to-taxonomy mappings:
Active Database Files
- nucl_gb.accession2taxid.gz - All GenBank nucleotide sequences (largest dataset)
- nucl_wgs.accession2taxid.gz - Whole genome shotgun nucleotide sequences
- prot.accession2taxid.gz - All protein sequences (large dataset)
- pdb.accession2taxid.gz - Protein Data Bank structure accessions
Historical/Dead Database Files
- dead_nucl.accession2taxid.gz - Withdrawn nucleotide accessions
- dead_prot.accession2taxid.gz - Withdrawn protein accessions
- dead_wgs.accession2taxid.gz - Withdrawn WGS accessions
NCBI Taxonomy Dump (taxdmp.zip)
Contains the complete NCBI taxonomic hierarchy:
- names.dmp - Taxonomic names and synonyms
- nodes.dmp - Taxonomic tree structure and ranks
- merged.dmp - Merged taxonomic IDs (for handling ID changes)
- Other files: citations.dmp, delnodes.dmp, division.dmp, gc.prt, gencode.dmp
Integration with BBTools
Using Generated Databases
The generated files enable taxonomic functionality in BBTools programs. Set the taxonomy path using:
Basic Taxonomy Path Setting
# For most BBTools programs
tool.sh in=sequences.fq taxpath=/path/to/taxonomy/files
# For automatic tree loading
tool.sh in=sequences.fq taxpath=/path/to/taxonomy/files tree=auto
Programs That Use These Files
- seal.sh - Taxonomic filtering and classification
- sortbytaxa.sh - Sort sequences by taxonomic hierarchy
- filterbytaxa.sh - Filter sequences by taxonomic criteria
- taxonomy.sh - Taxonomic lookup and conversion tools
- gi2taxid.sh - Convert GI numbers to taxonomic IDs
- sketch.sh - Create taxonomically-aware sketch databases
- sendsketch.sh - Taxonomic classification via sketch matching
- bbmap.sh - Alignment with taxonomic annotation
File Locations for BBTools
After running this pipeline, BBTools programs can find taxonomy files using:
- tree.taxtree.gz - Automatically detected when using tree=auto
- gitable.int1d.gz - Automatically detected for GI number support
- shrunk.*.accession2taxid.gz - Used internally by taxonomic tools
- patterns.txt - Used for accession compression optimization
Database Background
NCBI Taxonomy Database
The NCBI Taxonomy database is the authoritative source for organismal taxonomy used by molecular databases:
- Universal coverage: Includes all organisms represented in NCBI sequence databases
- Hierarchical structure: Standard taxonomic ranks from kingdom to species
- Regular updates: Updated continuously as new species are sequenced and described
- Cross-references: Links organisms to all associated sequence data
- Merged nodes: Handles taxonomic ID changes and reorganization over time
Accession2TaxID System
NCBI's accession-to-taxonomy mapping system enables sequence-level taxonomic annotation:
- Accession coverage: Maps sequence accessions to taxonomic IDs
- Multiple databases: Separate files for different sequence types
- Historical data: Includes withdrawn sequences for backward compatibility
- File structure: Tab-delimited format with accession, version, taxid, gi columns
BBTools Optimization
BBTools requires specific optimizations for efficient taxonomic operations:
- Binary tree format: Faster than text-based hierarchies
- Integer arrays: Direct indexing rather than hash table lookups
- Compressed storage: Reduces memory footprint for large databases
- Streaming support: Can process data without loading entire databases
Memory and Performance
Memory Requirements by Stage
- Download phase: Minimal memory, primarily network and disk I/O
- ShrinkAccession: Low memory per process, but up to seven parallel processes
- TaxTree generation: 16GB peak memory for tree construction
- GiTable generation: 24GB peak memory for integer array creation
- Pattern analysis: Low memory, file scanning operation
Processing Time Estimates
- Download phase: 10-60 minutes depending on network speed
- Accession processing: 20-40 minutes for compression and format conversion
- Tree generation: 5-15 minutes for taxonomic hierarchy construction
- GI table generation: 30-90 minutes for large integer array construction
- Pattern analysis: 5-10 minutes for file scanning
- Total pipeline time: 1.5-3 hours for complete execution
Disk Space Requirements
- Raw downloads: 8-12 GB for all accession2taxid files
- Processed files: 4-6 GB after compression and optimization
- Working space: Additional 2-4 GB for temporary files during processing
- Final database: 3-5 GB for complete taxonomic infrastructure
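A simple guard before starting avoids a failed run hours in; the 20 GB threshold below is illustrative (adjust to the estimates above):

```shell
#!/bin/sh
# Abort early if the working directory has less free space than required.
required_kb=$((20 * 1024 * 1024))            # illustrative 20 GB threshold
free_kb=$(df -Pk . | awk 'NR==2 {print $4}') # POSIX df: free KB in column 4
if [ "$free_kb" -lt "$required_kb" ]; then
    echo "only ${free_kb} KB free; need ${required_kb} KB" >&2
else
    echo "disk space OK: ${free_kb} KB free"
fi
```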
Troubleshooting
Download Issues
- Network timeouts: NCBI FTP server may be slow or temporarily unavailable
- Partial downloads: Check file sizes against expected values
- Connection limits: Some networks limit simultaneous connections
- File corruption: Re-run pipeline if checksums don't match
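Stock wget retry flags can harden the downloads against transient failures; a wrapper sketch (these are standard wget options, not part of the original script):

```shell
#!/bin/sh
# Wrapper adding retry/timeout flags to each streamed download:
#   --tries=5            retry each file up to 5 times
#   --waitretry=30       back off up to 30s between retries
#   --retry-connrefused  treat "connection refused" as transient
#   --timeout=60         give up on stalled reads after 60s
fetch() {
    wget --tries=5 --waitretry=30 --retry-connrefused --timeout=60 -q -O - "$1"
}
# Usage inside the pipeline, e.g.:
# fetch ftp://ftp.ncbi.nih.gov/pub/taxonomy/accession2taxid/pdb.accession2taxid.gz | \
#     shrinkaccession.sh in=stdin.txt.gz out=shrunk.pdb.accession2taxid.gz zl=9 t=4
```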
Memory Issues
- TaxTree failures: Reduce -Xmx16g if system has insufficient RAM
- GiTable failures: GI table generation requires substantial memory (24GB minimum recommended)
- System swapping: Ensure adequate physical RAM to avoid swap thrashing
- Java heap errors: Monitor system memory during processing phases
Processing Failures
- Parallel process failures: One failed download will cause pipeline exit
- Disk space exhaustion: Monitor disk usage during processing
- Permission errors: Ensure write permissions in working directory
- Tool dependencies: Verify all BBTools components are properly installed
Data Integrity
- File size verification: Compare generated file sizes to expected ranges
- Content validation: Check that tree.taxtree.gz and gitable.int1d.gz load properly
- Pattern analysis: Review patterns.txt for anomalies in accession data
- Test integration: Verify generated files work with taxonomic BBTools programs
Database Maintenance
Regular Updates
NCBI taxonomy is updated regularly, requiring periodic database regeneration:
- Update frequency: Monthly updates recommended for active research
- Incremental changes: New species additions, taxonomic reclassifications
- Accession additions: New sequence submissions require updated mappings
- Backward compatibility: Old files remain functional but may lack newest data
Version Control
- Timestamp tracking: Note download date for reproducibility
- File versioning: Consider backing up working databases before updates
- Consistency checking: Verify all files are from the same update cycle
- Change documentation: Track significant taxonomic changes between updates
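Timestamp tracking can be as simple as writing a stamp file alongside the database (the file name TAXONOMY_VERSION.txt is a hypothetical convention):

```shell
#!/bin/sh
# Record when the database was fetched so analyses can be tied to a taxonomy version.
date -u +%Y-%m-%d > TAXONOMY_VERSION.txt
echo "fetched from ftp.ncbi.nih.gov/pub/taxonomy" >> TAXONOMY_VERSION.txt
cat TAXONOMY_VERSION.txt
```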
Storage Management
- Archive old versions: Keep previous databases for reproducibility
- Compression efficiency: Use maximum compression for archived versions
- Access patterns: Most recent databases should be on fastest storage
- Cleanup automation: Pipeline automatically removes unnecessary temporary files
Example Workflows
Initial Setup for New System
# 1. Create dedicated taxonomy directory
mkdir -p /data/taxonomy/ncbi
cd /data/taxonomy/ncbi
# 2. Run fetch taxonomy pipeline
bash /path/to/bbtools/pipelines/fetch/fetchTaxonomy.sh
# 3. Set environment variable for other BBTools
export TAXPATH=/data/taxonomy/ncbi
# 4. Test with a simple taxonomic tool
taxonomy.sh tree=tree.taxtree.gz homo_sapiens
Updating Existing Database
# 1. Backup current database
cp -r /data/taxonomy/ncbi /data/taxonomy/ncbi_backup_$(date +%Y%m%d)
# 2. Clean working directory
cd /data/taxonomy/ncbi
rm -f shrunk.* tree.taxtree.gz gitable.int1d.gz patterns.txt
# 3. Run update
bash fetchTaxonomy.sh
# 4. Verify database integrity
du -h tree.taxtree.gz gitable.int1d.gz # Check file sizes are reasonable
taxonomy.sh tree=tree.taxtree.gz escherichia_coli # Test functionality
Integrating with Analysis Pipeline
# 1. Set taxonomy path for session
export TAXPATH=/data/taxonomy/ncbi
# 2. Use in sequence classification
sendsketch.sh in=unknown_sequences.fa.gz tree=auto
# 3. Use in filtering pipeline
filterbytaxa.sh in=mixed_sequences.fa.gz out=bacteria_only.fa.gz \
include tree=auto level=superkingdom id=bacteria
# 4. Use in mapping with taxonomic annotation
bbmap.sh in=reads.fq.gz ref=reference.fa.gz \
taxpath=auto printunmappedcount
Advanced Configuration
Customizing Thread Usage
For systems with different CPU configurations, you can modify the shrinkaccession thread counts in the script:
- High-CPU systems: Increase thread counts for faster processing
- Low-CPU systems: Decrease thread counts to prevent overload
- Memory-bound systems: Reduce parallel downloads to lower peak memory
Memory Optimization
For systems with memory constraints:
- TaxTree: Can reduce from 16GB but may increase processing time
- GiTable: 24GB is typically minimum for complete GI number space
- Sequential processing: Process accession files one at a time instead of parallel
Network Configuration
- Proxy support: Configure wget proxy settings if needed
- Connection limits: Some networks restrict simultaneous FTP connections
- Bandwidth management: Use wget rate limiting on shared connections
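These adjustments map onto stock wget options and standard proxy environment variables; the proxy host and rate below are illustrative:

```shell
#!/bin/sh
# Route downloads through an FTP proxy (hypothetical proxy host):
export ftp_proxy="http://proxy.example.org:3128"
# Cap bandwidth so the downloads do not saturate a shared link:
RATE="--limit-rate=5m"
# Example (commented out; requires network access):
# wget $RATE -q -O - ftp://ftp.ncbi.nih.gov/pub/taxonomy/accession2taxid/pdb.accession2taxid.gz
echo "proxy=$ftp_proxy rate=$RATE"
```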
Notes and Considerations
- One-time setup: This pipeline typically needs to be run only once per system, then periodically for updates
- Resource intensive: Requires substantial computational resources and time
- Network dependency: Requires reliable internet connection to NCBI FTP servers
- System compatibility: Works on any system with BBTools, wget, and unzip
- Error tolerance: Pipeline stops on first failure to prevent corrupted databases
- Cleanup automation: Automatically removes intermediate files to save disk space
- Performance monitoring: Key steps are timed for performance analysis
- Thread optimization: Different thread counts optimized for different file sizes
- Compression efficiency: Maximum compression reduces storage requirements
- Legacy support: Maintains compatibility with older sequence files using GI numbers