FetchNt Pipeline

Script: fetchNt.sh
Source Directory: pipelines/fetch/
Author: Brian Bushnell
Last Updated: August 7, 2019

Comprehensive pipeline for fetching, processing, and sketching the NCBI NT (nucleotide) database. This pipeline downloads the complete NCBI NT database, processes it with taxonomic information, sorts by taxonomy, creates blacklists of common k-mers, and generates taxonomic sketches for high-speed sequence identification.

Overview

The fetchNt pipeline automates the complete process of preparing the NCBI NT database for use with BBTools taxonomic identification tools. The NT database is NCBI's comprehensive collection of nucleotide sequences from all public databases, making it ideal for broad taxonomic identification tasks.

The pipeline performs four main operations:

  1. Database Download and Processing: Downloads nt.gz and renames sequences with taxonomic information
  2. Taxonomic Sorting: Sorts sequences by taxonomy to optimize memory usage during sketching
  3. Blacklist Generation: Creates k-mer blacklists to filter overly common sequences
  4. Sketch Creation: Generates multiple sketch files with one sketch per species for fast taxonomic queries

Important: This pipeline requires significant computational resources and time. It is designed to run on high-performance computing systems and may take many hours to complete. Ensure taxonomy data is updated before running this pipeline.

Prerequisites

System Requirements

Required Setup

SLURM Configuration

The script includes SLURM directives optimized for NERSC Genepool systems:

Pipeline Stages

1. Environment Setup

The pipeline begins by setting up the environment and loading necessary modules:

set -e                    # Exit on any error
TAXPATH="auto"            # Automatic taxonomy path detection
# module load pigz        # Load parallel compression (uncomment if needed)

Configuration Note: For systems outside NERSC, modify the TAXPATH variable to point to your BBTools taxonomy directory, e.g., TAXPATH="/path/to/taxonomy_directory/"

2. Database Download and Processing

Downloads the complete NCBI NT database and processes it with taxonomic renaming:

wget -q -O - ftp://ftp.ncbi.nih.gov/blast/db/FASTA/nt.gz | \
gi2taxid.sh -Xmx1g in=stdin.fa.gz out=renamed.fa.gz pigz=32 unpigz bgzip zl=8 \
server ow shrinknames maxbadheaders=5000 badheaders=badHeaders.txt taxpath=$TAXPATH
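
One caveat about the streaming command above: `set -e` only looks at the exit status of the last command in a pipeline, so a failed or truncated `wget` can go unnoticed as long as the downstream command exits cleanly. The sketch below demonstrates the difference that bash's `pipefail` option makes, using `false | cat` as a stand-in for a failed download. Adding `set -o pipefail` to fetchNt.sh is a suggestion here, not something the script as shown does.

```shell
# Each case runs in its own bash process so the options do not leak.
# "false | cat" stands in for a failed download feeding a healthy consumer.

# With set -e alone, the pipeline's status is that of "cat" (success),
# so execution continues past the failed producer.
without=$(bash -c 'set -e; false | cat; echo continued')

# With pipefail, the pipeline reports the producer's failure and
# set -e stops the script before the echo is reached.
with=$(bash -c 'set -e -o pipefail; false | cat; echo continued' || echo stopped)

echo "set -e only:          $without"
echo "set -e with pipefail: $with"
```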

Key Parameters:

3. Taxonomic Sorting

Sorts sequences by taxonomy to optimize memory usage during downstream processing:

sortbyname.sh -Xmx96g in=renamed.fa.gz out=sorted.fa.gz ow taxa tree=auto \
fastawrap=1023 zl=9 pigz=32 minlen=60 bgzip unbgzip

Benefits of Taxonomic Sorting:

Key Parameters:

4. K-mer Blacklist Generation

Creates a blacklist of k-mers that occur in too many different genera, which helps improve specificity:

sketchblacklist.sh -Xmx31g in=sorted.fa.gz prepasses=1 tree=auto taxa \
taxlevel=genus ow out=blacklist_nt_genus_100.sketch mincount=120 k=32,24 taxpath=$TAXPATH
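
The idea behind the blacklist can be illustrated on toy data. The sketch below is not how sketchblacklist.sh is implemented. It assumes a hypothetical input of one "kmer genus" pair per line and a threshold playing the role that `mincount` appears to play in the command above, and it reports every k-mer seen in at least that many distinct genera.

```shell
# Toy illustration only: input format and function name are hypothetical.
# Reads "kmer<whitespace>genus" pairs on stdin; $1 is the minimum number
# of distinct genera a k-mer must appear in to be blacklisted.
blacklist_kmers() {
    # sort -u removes repeated (kmer, genus) pairs, so the per-kmer line
    # count that awk accumulates equals the number of distinct genera.
    sort -u |
        awk -v min="$1" '{ n[$1]++ } END { for (k in n) if (n[k] >= min) print k }' |
        sort
}

# AAAA occurs in two genera, CCCC in one; with a threshold of 2 only
# AAAA is reported.
printf 'AAAA\tEscherichia\nAAAA\tSalmonella\nAAAA\tEscherichia\nCCCC\tEscherichia\n' |
    blacklist_kmers 2
```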

Blacklist Purpose:

Key Parameters:

5. Taxonomic Sketch Generation

Generates the final taxonomic sketches used for sequence identification:

bbsketch.sh -Xmx31g in=sorted.fa.gz out=taxa#.sketch mode=taxa tree=auto files=31 \
ow unpigz minsize=300 prefilter autosize blacklist=blacklist_nt_genus_100.sketch \
k=32,24 depth taxpath=$TAXPATH

Sketch Organization:

Key Parameters:

Usage Examples

Basic Pipeline Execution

# 1. Ensure taxonomy is updated
fetchTaxonomy.sh

# 2. Configure taxonomy path (if needed). The script assigns TAXPATH itself
#    (see Environment Setup), so edit that line in fetchNt.sh, e.g.:
#    TAXPATH="/path/to/your/taxonomy/"

# 3. Run the complete pipeline
sbatch fetchNt.sh

# Or run directly (not recommended for full NT database):
bash fetchNt.sh

Using the Generated Sketches

Once the pipeline completes, you can use the generated sketches for taxonomic identification:

Basic Taxonomic Identification

# Compare contigs against NT sketches
comparesketch.sh in=contigs.fa k=32,24 tree=auto taxa#.sketch \
blacklist=blacklist_nt_genus_100.sketch

Using Default NT Path (NERSC Systems)

# Set up default path (NERSC systems only)
ln -s /path/to/new/sketches /global/projectb/sandbox/gaag/bbtools/nt/current

# Then use simplified syntax
comparesketch.sh in=contigs.fa nt tree=auto

Note: The simplified "nt" syntax automatically includes the correct sketch files, blacklist, and k-mer values, making it the preferred method when available.

Resource Requirements and Runtime

Computational Requirements

Stage                   Memory   CPU Threads   Approximate Time
Download & Processing   1 GB     32 (pigz)     2-6 hours
Taxonomic Sorting       96 GB    32 (pigz)     8-12 hours
Blacklist Generation    31 GB    Variable      4-8 hours
Sketch Generation       31 GB    Variable      12-24 hours
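
Because the four stages run one after another in a single script, the per-stage estimates add up. The arithmetic below sums the lower and upper bounds from the figures above. It assumes no overlap between stages and no queue wait, so treat the result as a rough planning number for any scheduler wall-clock limit.

```shell
# Lower and upper bounds per stage, in hours, taken from the estimates above:
# download 2-6, sorting 8-12, blacklist 4-8, sketching 12-24.
min_total=$((2 + 8 + 4 + 12))
max_total=$((6 + 12 + 8 + 24))
echo "Estimated total runtime: ${min_total}-${max_total} hours"
```

This gives a total of 26 to 50 hours, with peak memory set by the 96 GB sorting stage.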

Disk Space Requirements

Network Requirements

Output Files

Configuration and Customization

Taxonomy Path Configuration

For systems outside NERSC, modify the TAXPATH variable:

# Default (auto-detection)
TAXPATH="auto"

# Custom path
TAXPATH="/path/to/taxonomy_directory/"

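# Environment variable (takes effect only if the script's own
# TAXPATH="auto" assignment is removed or changed, because that
# assignment overrides any exported value)
export TAXPATH="/custom/taxonomy/location/"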
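
The script as shown in the Environment Setup stage assigns TAXPATH unconditionally, so an exported value is picked up only if that assignment is changed. A minimal sketch of one way to do this, using shell default expansion, follows. The helper name `resolve_taxpath` is hypothetical and is not part of fetchNt.sh.

```shell
# Hypothetical replacement for the script's TAXPATH="auto" line:
# keep a value exported by the caller, otherwise fall back to "auto".
resolve_taxpath() {
    printf '%s\n' "${TAXPATH:-auto}"
}

TAXPATH="$(resolve_taxpath)"
echo "Using taxonomy path: $TAXPATH"
```

With this change, `export TAXPATH="/custom/taxonomy/location/"` before launching the script selects the custom directory, and leaving the variable unset keeps the automatic detection.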

Memory Optimization

Memory allocations can be adjusted based on available resources:

Compression Settings

Compression parameters can be tuned for different priorities:

Troubleshooting

Common Issues

Performance Optimization

Validation

After completion, validate the sketches:

# Test with a known sequence (same k-mer lengths and tree as used for sketching)
comparesketch.sh in=test_sequence.fa k=32,24 tree=auto taxa#.sketch \
blacklist=blacklist_nt_genus_100.sketch

# Check sketch file sizes and counts
ls -la taxa*.sketch
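
The listing can also be checked automatically. The sketch below counts the files matching `taxa*.sketch` in a directory and flags any that are empty. The function name `check_sketches` is hypothetical, and the expected count of 31 comes from the `files=31` setting in the sketch generation command.

```shell
# Hypothetical helper: verify that a directory holds the expected number
# of non-empty taxa*.sketch files.
# Usage: check_sketches . 31
check_sketches() {
    dir=$1
    expected=$2
    count=0
    empty=0
    for f in "$dir"/taxa*.sketch; do
        [ -e "$f" ] || continue          # glob matched nothing
        count=$((count + 1))
        if [ ! -s "$f" ]; then           # -s is true only for non-empty files
            echo "empty sketch file: $f" >&2
            empty=$((empty + 1))
        fi
    done
    echo "found $count sketch files ($empty empty), expected $expected"
    # Succeed only when the count matches and no file is empty.
    [ "$count" -eq "$expected" ] && [ "$empty" -eq 0 ]
}
```

The function returns a non-zero status on a mismatch, so it can gate a later step such as updating the `current` symlink.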

Integration with BBTools Ecosystem

Related Tools

Workflow Integration

The fetchNt pipeline is typically part of a larger database preparation workflow:

  1. Update taxonomy: fetchTaxonomy.sh
  2. Process NT database: fetchNt.sh
  3. Set up server (optional): startNtServerVM.sh
  4. Perform analyses: comparesketch.sh, sendsketch.sh
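
The first two steps can be chained so that the NT pipeline is only submitted when the taxonomy update succeeds. The driver below is a minimal sketch, not part of BBTools: `run_step`, `run_pipeline`, and the `DRY_RUN` variable are hypothetical, and it defaults to printing the commands rather than running them.

```shell
# Hypothetical driver for steps 1 and 2 of the workflow above.
# DRY_RUN=1 (the default here) only prints each command; DRY_RUN=0 runs it.
run_step() {
    echo "+ $*"
    [ "${DRY_RUN:-1}" = "1" ] || "$@"
}

run_pipeline() {
    # && stops the chain as soon as a step fails.
    run_step fetchTaxonomy.sh &&
        run_step sbatch fetchNt.sh
}

run_pipeline
```

Note that `sbatch` returns as soon as the job is queued, so the optional server setup and the analysis steps still have to wait for the job itself to finish.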

Algorithm Details

Taxonomic Sketch Strategy

The pipeline implements a sophisticated approach to taxonomic sketching:

Database Processing Strategy

The pipeline's processing order is optimized for efficiency:

  1. Streaming Download: Processes data while downloading to minimize disk usage
  2. Taxonomic Integration: Renames sequences with taxonomic information early
  3. Quality Filtering: Removes short sequences and problematic headers
  4. Taxonomic Sorting: Groups related sequences for efficient processing

Performance Optimizations