Fetch Taxonomy Pipeline

Script: fetchTaxonomy.sh
Source Directory: pipelines/fetch/
Author: Brian Bushnell
Last Updated: August 7, 2019

Automated pipeline for downloading and processing the latest NCBI taxonomy databases including accession2taxid files and taxonomic tree data. Creates optimized taxonomic databases for use with BBTools taxonomic classification and filtering tools.

Overview

The Fetch Taxonomy pipeline automates the complex process of downloading and formatting NCBI's complete taxonomy database for use with BBTools. This includes accession-to-taxid mapping files, taxonomic tree structures, and optimized lookup tables that enable BBTools to perform rapid taxonomic classification and filtering operations.

The pipeline downloads seven accession2taxid files: four covering current GenBank nucleotide, whole genome shotgun (WGS), protein, and Protein Data Bank (PDB) accessions, and three covering withdrawn ("dead") nucleotide, protein, and WGS accessions. It then processes these files along with the NCBI taxonomic tree to create optimized data structures for fast taxonomic queries.

Note: This pipeline creates the core taxonomic infrastructure required by many BBTools programs. The generated files can be referenced using "taxpath=X" where X is the location of the files generated by this script.

Prerequisites

System Requirements

Dependencies

Pipeline Stages

1. Accession2TaxID Download and Processing Phase

Downloads and processes seven different types of accession-to-taxonomy ID mapping files from NCBI in parallel:

1.1 Dead Nucleotide Sequences

wget -q -O - ftp://ftp.ncbi.nih.gov/pub/taxonomy/accession2taxid/dead_nucl.accession2taxid.gz | \
    shrinkaccession.sh in=stdin.txt.gz out=shrunk.dead_nucl.accession2taxid.gz zl=9 t=4 &

Downloads and processes withdrawn/obsolete nucleotide sequence accessions with compression level 9 using 4 threads.

1.2 Dead Protein Sequences

wget -q -O - ftp://ftp.ncbi.nih.gov/pub/taxonomy/accession2taxid/dead_prot.accession2taxid.gz | \
    shrinkaccession.sh in=stdin.txt.gz out=shrunk.dead_prot.accession2taxid.gz zl=9 t=6 &

Downloads and processes withdrawn/obsolete protein sequence accessions using 6 threads for processing.

1.3 Dead Whole Genome Shotgun

wget -q -O - ftp://ftp.ncbi.nih.gov/pub/taxonomy/accession2taxid/dead_wgs.accession2taxid.gz | \
    shrinkaccession.sh in=stdin.txt.gz out=shrunk.dead_wgs.accession2taxid.gz zl=9 t=6 &

Downloads and processes withdrawn whole genome shotgun accessions using 6 threads.

1.4 GenBank Nucleotide Sequences

wget -q -O - ftp://ftp.ncbi.nih.gov/pub/taxonomy/accession2taxid/nucl_gb.accession2taxid.gz | \
    shrinkaccession.sh in=stdin.txt.gz out=shrunk.nucl_gb.accession2taxid.gz zl=9 t=8 &

Downloads and processes current GenBank nucleotide sequence accessions using 8 threads for higher throughput on this large dataset.

1.5 Whole Genome Shotgun Nucleotide

wget -q -O - ftp://ftp.ncbi.nih.gov/pub/taxonomy/accession2taxid/nucl_wgs.accession2taxid.gz | \
    shrinkaccession.sh in=stdin.txt.gz out=shrunk.nucl_wgs.accession2taxid.gz zl=9 t=10 &

Downloads and processes current whole genome shotgun nucleotide accessions using 10 threads due to the large size of this dataset.

1.6 Protein Data Bank (PDB)

wget -q -O - ftp://ftp.ncbi.nih.gov/pub/taxonomy/accession2taxid/pdb.accession2taxid.gz | \
    shrinkaccession.sh in=stdin.txt.gz out=shrunk.pdb.accession2taxid.gz zl=9 t=4 &

Downloads and processes Protein Data Bank accessions using 4 threads (smaller dataset).

1.7 Protein Sequences

wget -q -O - ftp://ftp.ncbi.nih.gov/pub/taxonomy/accession2taxid/prot.accession2taxid.gz | \
    shrinkaccession.sh in=stdin.txt.gz out=shrunk.prot.accession2taxid.gz zl=9 t=10

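Downloads and processes current protein sequence accessions using 10 threads. Unlike the previous six commands, this one has no trailing &, so it runs in the foreground and the script does not reach the next phase until it finishes. Note that a foreground command blocks only on itself; as shown here, nothing explicitly waits for the six background jobs.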
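
When every background job must be finished before the next phase starts, the usual shell idiom is to record each job's PID and call wait on it. The following is a minimal sketch of that idiom, not code taken from fetchTaxonomy.sh: wait_all and FAILED are hypothetical names, and the stand-in jobs take the place of the six background wget | shrinkaccession.sh pipelines.

```shell
#!/usr/bin/env bash
# Hypothetical helper, not part of fetchTaxonomy.sh.
# Waits for each PID given as an argument and records how many of those
# jobs exited non-zero in the global variable FAILED.
# Call it directly (not inside $(...)), because wait only works on
# children of the current shell.
wait_all() {
    FAILED=0
    local pid
    for pid in "$@"; do
        # 'wait PID' returns that job's exit status.
        wait "$pid" || FAILED=$((FAILED + 1))
    done
}

# Stand-in jobs; in the pipeline these would be the background
# 'wget ... | shrinkaccession.sh ... &' commands.
pids=()
true & pids+=("$!")
true & pids+=("$!")

wait_all "${pids[@]}"
echo "background jobs that failed: $FAILED"
```

A non-zero FAILED count could be used to stop before the database generation phase rather than building tables from incomplete inputs.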

2. Taxonomic Tree Download Phase

2.1 Taxonomy Dump Download

wget -nv ftp://ftp.ncbi.nih.gov/pub/taxonomy/taxdmp.zip

Downloads the complete NCBI taxonomic database dump containing names.dmp, nodes.dmp, merged.dmp and other taxonomic structure files.
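
Large FTP transfers can fail transiently, and one common way to cope is a small retry wrapper around the download command. This is a sketch under stated assumptions: retry is a hypothetical helper that does not appear in fetchTaxonomy.sh, and the attempt count and delay in the example are arbitrary.

```shell
#!/usr/bin/env bash
# Hypothetical helper: run a command up to MAX times, sleeping DELAY
# seconds between failed attempts. Returns 0 on the first success and
# 1 if every attempt fails.
# Usage: retry MAX DELAY command [args...]
retry() {
    local max=$1 delay=$2 attempt=1
    shift 2
    while true; do
        "$@" && return 0
        if [ "$attempt" -ge "$max" ]; then
            echo "giving up after $attempt attempts: $*" >&2
            return 1
        fi
        attempt=$((attempt + 1))
        sleep "$delay"
    done
}

# Example (left as a comment because it needs network access):
#   retry 3 30 wget -nv ftp://ftp.ncbi.nih.gov/pub/taxonomy/taxdmp.zip
```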

2.2 Archive Extraction

unzip -o taxdmp.zip

Extracts the taxonomic dump files, overwriting any existing files to ensure fresh data.
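
Before building the tree, it can be worth confirming that the extraction actually produced the three files that taxtree.sh consumes. The helper below is a hypothetical addition (check_dump_files is not part of the pipeline) that only tests for the presence of non-empty files.

```shell
#!/usr/bin/env bash
# Hypothetical helper: confirm that the three dump files passed to
# taxtree.sh exist and are non-empty in directory $1 (default: current
# directory). Returns 0 if all are present, 1 otherwise.
check_dump_files() {
    local dir=${1:-.} f missing=0
    for f in names.dmp nodes.dmp merged.dmp; do
        if [ ! -s "$dir/$f" ]; then
            echo "missing or empty: $dir/$f" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Typical use right after 'unzip -o taxdmp.zip':
#   check_dump_files . || exit 1
```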

3. Database Generation Phase

3.1 Taxonomic Tree Construction

time taxtree.sh names.dmp nodes.dmp merged.dmp tree.taxtree.gz -Xmx16g

Creates the optimized taxonomic tree structure (tree.taxtree.gz) from the NCBI taxonomy files, with the Java heap capped at 16 GB (-Xmx16g). This tree enables rapid taxonomic hierarchy traversal and lookup operations.

3.2 GI Table Construction

time gitable.sh shrunk.dead_nucl.accession2taxid.gz,shrunk.dead_prot.accession2taxid.gz,\
shrunk.dead_wgs.accession2taxid.gz,shrunk.nucl_gb.accession2taxid.gz,\
shrunk.nucl_wgs.accession2taxid.gz,shrunk.pdb.accession2taxid.gz,\
shrunk.prot.accession2taxid.gz gitable.int1d.gz -Xmx24g

Creates the optimized GI number lookup table (gitable.int1d.gz) from all seven processed accession2taxid files, passed as a single comma-separated argument, with the Java heap capped at 24 GB (-Xmx24g). This enables rapid conversion from GI numbers to taxonomic IDs.
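
The comma-separated input list is easy to mistype by hand. As an illustration, it can be assembled from the seven dataset names used in phase 1; join_by_comma and the variable names are hypothetical and not part of the pipeline script.

```shell
#!/usr/bin/env bash
# Hypothetical helper: join all arguments with commas.
join_by_comma() {
    local IFS=,
    echo "$*"
}

# The seven dataset names used in phase 1, in the order of step 3.2.
names=(dead_nucl dead_prot dead_wgs nucl_gb nucl_wgs pdb prot)

files=()
for n in "${names[@]}"; do
    files+=("shrunk.$n.accession2taxid.gz")
done

input_list=$(join_by_comma "${files[@]}")
echo "$input_list"

# The list can then be passed exactly as in step 3.2:
#   gitable.sh "$input_list" gitable.int1d.gz -Xmx24g
```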

3.3 Accession Pattern Analysis

time analyzeaccession.sh shrunk.*.accession2taxid.gz out=patterns.txt

Analyzes all processed accession files to identify patterns for compression optimization and creates a patterns report.

4. Cleanup Phase

Removes temporary and intermediate files to conserve disk space:

Basic Usage

# 1. Navigate to a working directory with sufficient space
cd /path/to/taxonomy/directory

# 2. Run the fetch taxonomy pipeline
bash fetchTaxonomy.sh

# 3. Generated files will be ready for use with BBTools taxonomic programs

Warning: This pipeline downloads several GB of data and requires significant processing time and memory. Ensure adequate system resources before starting.
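
One way to act on this warning is a preflight check of free disk space in the working directory. The sketch below uses the portable df -Pk output format; free_kb and has_space are hypothetical helpers, and the 100 GB figure in the example is an illustrative threshold, not a requirement stated by the pipeline.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the free space, in kilobytes, on the file
# system that holds directory $1 (default: current directory).
free_kb() {
    # -P forces the portable one-line-per-filesystem format;
    # -k reports sizes in 1024-byte blocks. Column 4 is "Available".
    df -Pk "${1:-.}" | awk 'NR == 2 { print $4 }'
}

# Hypothetical helper: succeed if at least $2 kilobytes are free in $1.
has_space() {
    [ "$(free_kb "$1")" -ge "$2" ]
}

echo "free space here: $(free_kb .) kB"

# Example with an assumed 100 GB threshold:
#   has_space . $((100 * 1024 * 1024)) || { echo "not enough disk" >&2; exit 1; }
```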

Output Files

Core Taxonomic Database Files

Processed Accession Files

Downloaded Raw Files (Cleaned Up)

Algorithm Details

Parallel Download Strategy

The pipeline uses parallel downloads with streaming compression to optimize both network and disk usage:

Stream Processing Architecture

Each accession2taxid file is processed using a streaming approach:

Thread Allocation Strategy

Thread counts are optimized based on file characteristics:
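
The per-file thread counts given in phase 1 can be summarized as a lookup. The helper name threads_for is hypothetical; the numbers are the t= values from the seven commands shown earlier.

```shell
#!/usr/bin/env bash
# Thread counts as listed in phase 1 of this document.
# threads_for is a hypothetical helper name.
threads_for() {
    case "$1" in
        dead_nucl|pdb)      echo 4 ;;
        dead_prot|dead_wgs) echo 6 ;;
        nucl_gb)            echo 8 ;;
        nucl_wgs|prot)      echo 10 ;;
        *) echo "unknown dataset: $1" >&2; return 1 ;;
    esac
}

total=0
for n in dead_nucl dead_prot dead_wgs nucl_gb nucl_wgs pdb prot; do
    t=$(threads_for "$n")
    printf '%-10s t=%s\n' "$n" "$t"
    total=$((total + t))
done
echo "total threads requested: $total"
```

While all seven jobs are running at once, the t= values add up to 48 requested threads, which is useful to keep in mind when running on a smaller machine.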

Database Optimization Process

Accession File Compression

The ShrinkAccession tool (shrinkaccession.sh) removes unnecessary columns and optimizes the format:
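
To make the idea of streaming column removal concrete, here is an illustration built from standard tools only. It is not what shrinkaccession.sh does internally: the choice to keep the accession.version and taxid columns is an assumption made for the example, trim_columns is a hypothetical name, and the sample record is made up. The input layout (accession, accession.version, taxid, gi, tab-separated with a header line) is the one NCBI uses for accession2taxid files.

```shell
#!/usr/bin/env bash
# Illustration only; NOT the ShrinkAccession algorithm.
# Reads gzip data on stdin, keeps columns 2 (accession.version) and
# 3 (taxid), and writes gzip data to stdout without a temporary file.
trim_columns() {
    gzip -dc | cut -f 2,3 | gzip -c
}

# Tiny made-up input in the four-column accession2taxid layout.
sample=$(printf 'accession\taccession.version\ttaxid\tgi\nX00001\tX00001.1\t9606\t12345')

# Compress, trim, and decompress again to show the result.
printf '%s\n' "$sample" | gzip -c | trim_columns | gzip -dc
```

Because each stage reads from the previous one through a pipe, the uncompressed data never has to be written to disk, which is the same property the wget | shrinkaccession.sh commands in phase 1 rely on.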

Taxonomic Tree Generation

TaxTree creates an optimized tree structure from NCBI dump files:

GI Table Creation

GiTable builds optimized lookup structures for legacy GI number support:

Performance Characteristics

Data Sources

NCBI Accession2TaxID Files

The pipeline downloads seven categories of accession-to-taxonomy mappings:

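Active Database Files

nucl_gb.accession2taxid.gz: current GenBank nucleotide sequence accessions
nucl_wgs.accession2taxid.gz: current whole genome shotgun nucleotide accessions
pdb.accession2taxid.gz: Protein Data Bank accessions
prot.accession2taxid.gz: current protein sequence accessions

Historical/Dead Database Files

dead_nucl.accession2taxid.gz: withdrawn or obsolete nucleotide sequence accessions
dead_prot.accession2taxid.gz: withdrawn or obsolete protein sequence accessions
dead_wgs.accession2taxid.gz: withdrawn whole genome shotgun accessions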

NCBI Taxonomy Dump (taxdmp.zip)

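Contains the complete NCBI taxonomic hierarchy, including names.dmp, nodes.dmp, and merged.dmp, the three files passed to taxtree.sh, along with other taxonomic structure files.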

Integration with BBTools

Using Generated Databases

The generated files enable taxonomic functionality in BBTools programs. Set the taxonomy path using:

Basic Taxonomy Path Setting

# For most BBTools programs
tool.sh in=sequences.fq taxpath=/path/to/taxonomy/files

# For automatic tree loading
tool.sh in=sequences.fq taxpath=/path/to/taxonomy/files tree=auto

Programs That Use These Files

File Locations for BBTools

After running this pipeline, BBTools programs can find taxonomy files using:

Database Background

NCBI Taxonomy Database

The NCBI Taxonomy database is the authoritative source for organismal taxonomy used by molecular databases:

Accession2TaxID System

NCBI's accession-to-taxonomy mapping system enables sequence-level taxonomic annotation:

BBTools Optimization

BBTools requires specific optimizations for efficient taxonomic operations:

Memory and Performance

Memory Requirements by Stage

Processing Time Estimates

Disk Space Requirements

Troubleshooting

Download Issues

Memory Issues

Processing Failures

Data Integrity

Database Maintenance

Regular Updates

NCBI taxonomy is updated regularly, requiring periodic database regeneration:
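
A simple way to decide when a rebuild is due is to look at the age of a generated file. The sketch below treats a file as stale when it is missing or older than a chosen number of days; is_stale is a hypothetical helper, and the 30-day interval in the example is an assumption, not a recommendation from the pipeline.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed if file $1 is missing or was last
# modified more than $2 days ago.
is_stale() {
    local file=$1 days=$2
    [ ! -e "$file" ] && return 0
    # find prints the path only when the age test matches.
    [ -n "$(find "$file" -mtime +"$days" -print)" ]
}

# Example with an assumed 30-day refresh interval:
#   is_stale tree.taxtree.gz 30 && bash fetchTaxonomy.sh
```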

Version Control

Storage Management

Example Workflows

Initial Setup for New System

# 1. Create dedicated taxonomy directory
mkdir -p /data/taxonomy/ncbi
cd /data/taxonomy/ncbi

# 2. Run fetch taxonomy pipeline
bash /path/to/bbtools/pipelines/fetch/fetchTaxonomy.sh

# 3. Set environment variable for other BBTools
export TAXPATH=/data/taxonomy/ncbi

# 4. Test with a simple taxonomic tool
taxonomy.sh tree=tree.taxtree.gz homo_sapiens

Updating Existing Database

# 1. Backup current database
cp -r /data/taxonomy/ncbi /data/taxonomy/ncbi_backup_$(date +%Y%m%d)

# 2. Clean working directory
cd /data/taxonomy/ncbi
rm -f shrunk.* tree.taxtree.gz gitable.int1d.gz patterns.txt

# 3. Run update
bash fetchTaxonomy.sh

# 4. Verify database integrity
du -h tree.taxtree.gz gitable.int1d.gz  # Check file sizes are reasonable
taxonomy.sh tree=tree.taxtree.gz escherichia_coli  # Test functionality
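
File size alone does not reveal a truncated download. Since every generated file is gzip-compressed, gzip -t offers an additional structural check. verify_gz below is a hypothetical helper, offered as a sketch rather than as part of the pipeline.

```shell
#!/usr/bin/env bash
# Hypothetical helper: test every gzip file given as an argument and
# report the ones that are missing, empty, or fail 'gzip -t'.
# Returns 0 only if all files pass.
verify_gz() {
    local f bad=0
    for f in "$@"; do
        if [ ! -s "$f" ] || ! gzip -t "$f" 2>/dev/null; then
            echo "corrupt or missing: $f" >&2
            bad=1
        fi
    done
    return "$bad"
}

# Example:
#   verify_gz tree.taxtree.gz gitable.int1d.gz shrunk.*.accession2taxid.gz
```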

Integrating with Analysis Pipeline

# 1. Set taxonomy path for session
export TAXPATH=/data/taxonomy/ncbi

# 2. Use in sequence classification
sendsketch.sh in=unknown_sequences.fa.gz tree=auto

# 3. Use in filtering pipeline
filterbytaxa.sh in=mixed_sequences.fa.gz out=bacteria_only.fa.gz \
    include tree=auto level=superkingdom id=bacteria

# 4. Use in mapping with taxonomic annotation
bbmap.sh in=reads.fq.gz ref=reference.fa.gz \
    taxpath=auto printunmappedcount

Advanced Configuration

Customizing Thread Usage

For systems with different CPU configurations, you can modify the shrinkaccession thread counts in the script:
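
As one possible approach, the t= values could be scaled in proportion to the number of cores available. The sketch assumes that the sum of the seven t= values in phase 1 (48) is a sensible reference point, which is an assumption of this example rather than something the script states; scale_threads is a hypothetical helper.

```shell
#!/usr/bin/env bash
# Hypothetical helper: scale a thread count from the script ($1) to a
# machine with $2 cores, using 48 (the sum of the seven t= values in
# phase 1) as the assumed reference. Never returns less than 1.
scale_threads() {
    local base=$1 cores=$2 scaled
    scaled=$(( base * cores / 48 ))
    [ "$scaled" -lt 1 ] && scaled=1
    echo "$scaled"
}

# Detect the online core count; fall back to 1 if getconf cannot report it.
cores=$(getconf _NPROCESSORS_ONLN 2>/dev/null || echo 1)

for base in 4 6 8 10; do
    echo "t=$base in the script -> t=$(scale_threads "$base" "$cores") on $cores cores"
done
```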

Memory Optimization

For systems with memory constraints:
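
The heap limits in the script are the -Xmx16g and -Xmx24g flags passed to taxtree.sh and gitable.sh. The sketch below derives a smaller -Xmx value from a memory size; xmx_for_kb is a hypothetical helper and the 85% fraction is an arbitrary choice for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical helper: turn a memory size in kilobytes into a -Xmx flag
# that uses 85% of it, rounded down to whole gigabytes (minimum 1g).
# The 85% fraction is an assumption chosen for illustration.
xmx_for_kb() {
    local kb=$1 gb
    gb=$(( kb * 85 / 100 / 1048576 ))
    [ "$gb" -lt 1 ] && gb=1
    echo "-Xmx${gb}g"
}

# On Linux, total RAM in kB can be read from /proc/meminfo:
#   total_kb=$(awk '/^MemTotal:/ { print $2 }' /proc/meminfo)
#   taxtree.sh names.dmp nodes.dmp merged.dmp tree.taxtree.gz "$(xmx_for_kb "$total_kb")"

xmx_for_kb $((32 * 1024 * 1024))   # 32 GiB expressed in kB
```

Whether taxtree.sh or gitable.sh complete with less heap than the script allots is not stated in this document, so a reduced value should be tested before it is relied on.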

Network Configuration

Notes and Considerations