FetchNt Pipeline

Overview

The fetchNt pipeline automates the complete process of preparing the NCBI NT database for use with BBTools taxonomic identification tools. The NT database is NCBI's comprehensive collection of nucleotide sequences from all public databases, making it ideal for broad taxonomic identification tasks.

The pipeline performs four main operations:

Database Download and Processing: Downloads nt.gz and renames sequences with taxonomic information
Taxonomic Sorting: Sorts sequences by taxonomy to optimize memory usage during sketching
Blacklist Generation: Creates k-mer blacklists to filter overly common sequences
Sketch Creation: Generates multiple sketch files with one sketch per species for fast taxonomic queries

Important: This pipeline requires significant computational resources and time. It is designed to run on high-performance computing systems: the SLURM job requests up to 71 hours of wall time, and the sorting stage alone is allocated 96GB of memory. Ensure taxonomy data is updated before running this pipeline.

Prerequisites

System Requirements

High-performance computing system with substantial memory (96GB+ recommended)
Fast network connection for downloading large databases
Sufficient disk space (several TB for NT database and processed files)
BBTools suite with all dependencies
pigz for parallel compression/decompression
Updated BBTools taxonomy data

Required Setup

Taxonomy Data: Must be updated before running (use fetchTaxonomy.sh)
TAXPATH Configuration: Set to point to your taxonomy directory
Network Access: Requires access to NCBI FTP servers
Module Loading: pigz and other dependencies must be available
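
The setup items above can be checked before the job is submitted. The following is a minimal sketch and not part of fetchNt.sh: the helper name missing_tools is hypothetical, and the tool list is simply the set of commands that appear in this document.

```shell
# Print each command name that is not found on PATH (hypothetical helper).
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}

# Commands the pipeline invokes, as named in this document.
missing="$(missing_tools wget pigz gi2taxid.sh sortbyname.sh sketchblacklist.sh bbsketch.sh)"
if [ -n "$missing" ]; then
  # Word splitting is intentional here so the names print on one line.
  echo "Not found on PATH:" $missing >&2
fi
```

The check only warns; whether a missing tool should abort the run is left to the caller.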

SLURM Configuration

The script includes SLURM directives optimized for NERSC Genepool systems:

Job name: sketch_refseq
Queue: genepool
Account: gtrqc
Nodes: 1 (exclusive)
Architecture: haswell
Wall time: 71 hours
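
For orientation, the settings above would correspond to sbatch directives along the following lines. This is a sketch built from standard sbatch options. The exact directive lines in fetchNt.sh are not reproduced here, so the mapping of "Queue" to a partition and the log file pattern are assumptions.

```shell
#!/bin/bash
#SBATCH -J sketch_refseq
# Assumption: "Queue" maps to a SLURM partition.
#SBATCH -p genepool
#SBATCH -A gtrqc
#SBATCH -N 1
#SBATCH --exclusive
#SBATCH -C haswell
#SBATCH -t 71:00:00
# Assumed log pattern; %j expands to the job ID.
#SBATCH --output=log_%j.out
#SBATCH --error=log_%j.err

# SLURM_JOB_ID is set by SLURM inside a job; fall back when run directly.
job_id="${SLURM_JOB_ID:-none}"
echo "SLURM job ID: $job_id"
```

On other clusters the partition, account, and constraint names would need to be replaced with local values.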

Pipeline Stages

1. Environment Setup

The pipeline begins by setting up the environment and loading necessary modules:

set -e                    # Exit on any error
TAXPATH="auto"            # Automatic taxonomy path detection
# module load pigz        # Load parallel compression (uncomment if needed)

Configuration Note: For systems outside NERSC, modify the TAXPATH variable to point to your BBTools taxonomy directory, e.g., TAXPATH="/path/to/taxonomy_directory/"
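
One way to avoid editing the script for each system is to let an exported value take precedence over the default. This is a suggested variation, not the behavior of the script as shown, which assigns TAXPATH="auto" unconditionally.

```shell
# Use the caller's TAXPATH if it is set and non-empty, otherwise "auto".
TAXPATH="${TAXPATH:-auto}"

# A concrete path should be an existing directory; "auto" is left for BBTools to resolve.
if [ "$TAXPATH" != "auto" ] && [ ! -d "$TAXPATH" ]; then
  echo "TAXPATH does not point to a directory: $TAXPATH" >&2
fi
echo "Using taxpath=$TAXPATH"
```

With this form, export TAXPATH="/path/to/taxonomy_directory/" before launching the script would take effect.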

2. Database Download and Processing

Downloads the complete NCBI NT database and processes it with taxonomic renaming:

wget -q -O - ftp://ftp.ncbi.nih.gov/blast/db/FASTA/nt.gz | \
gi2taxid.sh -Xmx1g in=stdin.fa.gz out=renamed.fa.gz pigz=32 unpigz bgzip zl=8 \
server ow shrinknames maxbadheaders=5000 badheaders=badHeaders.txt taxpath=$TAXPATH

Key Parameters:

-Xmx1g: 1GB memory allocation for gi2taxid
pigz=32: Use 32 threads for parallel compression
bgzip: Use bgzip compression format
zl=8: Compression level 8
server: Look up taxonomy IDs through the BBTools taxonomy server instead of loading local taxonomy tables
shrinknames: Reduce header length to save memory
maxbadheaders=5000: Allow up to 5000 problematic headers
badheaders=badHeaders.txt: Log problematic headers to file
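
After this stage it can be useful to see how close the run came to the maxbadheaders limit. The following is a minimal sketch, assuming badHeaders.txt holds one problematic header per line (the file's exact format is not specified here). The helper name count_bad_headers is hypothetical.

```shell
# Count lines in the bad-header log; print 0 if the file is absent.
count_bad_headers() {
  if [ -f "$1" ]; then
    wc -l < "$1" | tr -d ' '
  else
    echo 0
  fi
}

limit=5000   # matches maxbadheaders=5000 in the command above
n="$(count_bad_headers badHeaders.txt)"
echo "bad headers logged: $n (limit $limit)"
```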

3. Taxonomic Sorting

Sorts sequences by taxonomy to optimize memory usage during downstream processing:

sortbyname.sh -Xmx96g in=renamed.fa.gz out=sorted.fa.gz ow taxa tree=auto \
fastawrap=1023 zl=9 pigz=32 minlen=60 bgzip unbgzip

Benefits of Taxonomic Sorting:

Enables sketches to be written to disk as soon as they're completed
Dramatically reduces peak memory usage during sketch generation
Improves overall pipeline efficiency
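
The reason sorted input helps can be shown with a toy stand-in. When records arrive grouped by taxon, a consumer can emit each group's result as soon as the key changes and only ever holds one group in memory. The sketch below is an illustration only, not how bbsketch.sh is implemented. The input format (taxon ID, a tab, a sequence length) and the name flush_on_change are assumptions made for the example.

```shell
# Read "taxid<TAB>length" lines already sorted by taxid.
# Emit "taxid<TAB>record count<TAB>total length" each time the taxid changes.
flush_on_change() {
  awk -F'\t' '
    NR > 1 && $1 != key { print key "\t" n "\t" total; n = 0; total = 0 }
    { key = $1; n++; total += $2 }
    END { if (NR > 0) print key "\t" n "\t" total }
  '
}

printf '562\t100\n562\t250\n9606\t80\n' | flush_on_change
# prints (tab-separated): 562 2 350, then 9606 1 80
```

With unsorted input the same consumer would have to keep every group open until the end of the file.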

Key Parameters:

-Xmx96g: 96GB memory allocation for sorting
taxa: Sort by taxonomic classification
tree=auto: Automatically detect taxonomy tree
fastawrap=1023: Wrap FASTA lines at 1023 characters
zl=9: Maximum compression level
minlen=60: Filter sequences shorter than 60bp
bgzip/unbgzip: Handle bgzip format properly

4. K-mer Blacklist Generation

Creates a blacklist of k-mers that occur in too many different genera, which helps improve specificity:

sketchblacklist.sh -Xmx31g in=sorted.fa.gz prepasses=1 tree=auto taxa \
taxlevel=genus ow out=blacklist_nt_genus_100.sketch mincount=120 k=32,24 taxpath=$TAXPATH

Blacklist Purpose:

Filters k-mers present in ≥120 different genera (the mincount=120 threshold, despite the "100" in the output file name)
Removes overly conserved sequences that provide little taxonomic resolution
Improves specificity and reduces false positive identifications
Reduces computational overhead by eliminating uninformative k-mers

Key Parameters:

-Xmx31g: 31GB memory allocation
prepasses=1: Use one preprocessing pass
taxlevel=genus: Count occurrences at genus level
mincount=120: Blacklist k-mers in ≥120 taxa (allowing some buffer above 100)
k=32,24: Generate blacklists for both k=32 and k=24
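
The counting idea behind the blacklist can be sketched on a tiny example: for each k-mer, count the distinct genera it appears in and keep those at or above a threshold. This is a toy illustration, not the implementation in sketchblacklist.sh. The input format (k-mer, a tab, a genus name) and the name shared_kmers are assumptions, and the threshold here is 2 rather than 120.

```shell
# Print k-mers that occur in at least $1 distinct genera.
# Input on stdin: tab-separated "kmer<TAB>genus" pairs, in any order.
shared_kmers() {
  awk -F'\t' -v min="$1" '
    !seen[$1, $2]++ { genera[$1]++ }   # count each (kmer, genus) pair once
    END { for (k in genera) if (genera[k] >= min) print k }
  ' | sort
}

printf 'ACGT\tEscherichia\nACGT\tSalmonella\nACGT\tEscherichia\nTTGA\tEscherichia\n' |
  shared_kmers 2
# prints ACGT
```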

5. Taxonomic Sketch Generation

Generates the final taxonomic sketches used for sequence identification:

bbsketch.sh -Xmx31g in=sorted.fa.gz out=taxa#.sketch mode=taxa tree=auto files=31 \
ow unpigz minsize=300 prefilter autosize blacklist=blacklist_nt_genus_100.sketch \
k=32,24 depth taxpath=$TAXPATH

Sketch Organization:

Creates 31 separate sketch files (taxa0.sketch through taxa30.sketch)
Each sketch file contains multiple species
One sketch per species for precise taxonomic identification
Multiple files enable faster loading on multicore systems

Key Parameters:

-Xmx31g: 31GB memory allocation
mode=taxa: Create taxonomic sketches
files=31: Split output into 31 files
minsize=300: Minimum sketch size of 300 hashes
prefilter: Use prefiltering for memory efficiency
autosize: Automatically determine sketch sizes
depth: Include depth information in sketches
k=32,24: Use both k=32 and k=24 for analysis
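
The # in out=taxa#.sketch stands for the file index, so files=31 yields the names listed under Sketch Organization. The loop below reproduces that list. The helper name expected_sketches is hypothetical.

```shell
# Print the expected sketch file names for a prefix ($1) and file count ($2).
expected_sketches() {
  i=0
  while [ "$i" -lt "$2" ]; do
    printf '%s%d.sketch\n' "$1" "$i"
    i=$((i + 1))
  done
}

# Show the first and last names of the 31-file set.
expected_sketches taxa 31 | sed -n '1p;$p'
# prints taxa0.sketch and taxa30.sketch
```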

Usage Examples

Basic Pipeline Execution

# 1. Ensure taxonomy is updated
fetchTaxonomy.sh

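# 2. Configure taxonomy path (if needed) by editing the TAXPATH line in fetchNt.sh.
#    The script assigns TAXPATH itself, so a value exported here would be overwritten.
#    TAXPATH="/path/to/your/taxonomy/"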

# 3. Run the complete pipeline
sbatch fetchNt.sh

# Or run directly (not recommended for full NT database):
bash fetchNt.sh

Using the Generated Sketches

Once the pipeline completes, you can use the generated sketches for taxonomic identification:

Basic Taxonomic Identification

# Compare contigs against NT sketches
comparesketch.sh in=contigs.fa k=32,24 tree=auto taxa#.sketch \
blacklist=blacklist_nt_genus_100.sketch

Using Default NT Path (NERSC Systems)

# Set up default path (NERSC systems only)
ln -s /path/to/new/sketches /global/projectb/sandbox/gaag/bbtools/nt/current

# Then use simplified syntax
comparesketch.sh in=contigs.fa nt tree=auto

Note: The simplified "nt" syntax automatically includes the correct sketch files, blacklist, and k-mer values, making it the preferred method when available.

Resource Requirements and Runtime

Computational Requirements

Stage	Memory	CPU Threads	Approximate Time
Download & Processing	1GB	32 (pigz)	2-6 hours
Taxonomic Sorting	96GB	32 (pigz)	8-12 hours
Blacklist Generation	31GB	Variable	4-8 hours
Sketch Generation	31GB	Variable	12-24 hours
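
Summing the per-stage estimates shows how they relate to the 71-hour wall-time limit. The figures below are taken directly from the table and the SLURM configuration.

```shell
# Per-stage estimates in hours, taken from the table above.
min_total=$((2 + 8 + 4 + 12))
max_total=$((6 + 12 + 8 + 24))
wall=71   # SLURM wall-time limit in hours

echo "estimated total: ${min_total}-${max_total} hours"
echo "headroom at the upper estimate: $((wall - max_total)) hours"
# prints "estimated total: 26-50 hours" and "headroom at the upper estimate: 21 hours"
```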

Disk Space Requirements

NT Database: ~100-200GB compressed
Renamed Database: ~150-300GB compressed
Sorted Database: ~150-300GB compressed
Sketch Files: ~1-10GB total
Blacklist Files: ~100MB-1GB
Temporary Files: Additional 50-100GB during processing
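
A quick check of free space against these estimates might look like the sketch below. It sums the upper bounds listed above and compares the total with the space available in the working directory. The helper name free_gb is hypothetical. The sum counts nt.gz itself even though the download is streamed, so it errs on the high side.

```shell
# Free space, in whole GB, on the filesystem holding the given directory.
free_gb() {
  df -Pk "$1" | awk 'NR == 2 { print int($4 / 1048576) }'
}

# Sum of the upper estimates listed above, in GB.
need_gb=$((200 + 300 + 300 + 10 + 1 + 100))
have_gb="$(free_gb .)"
if [ "$have_gb" -lt "$need_gb" ]; then
  echo "warning: ${have_gb} GB free, about ${need_gb} GB may be needed" >&2
fi
```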

Network Requirements

Stable connection to NCBI FTP servers
Ability to download 100-200GB of data
Bandwidth sufficient for large file transfers

Output Files

renamed.fa.gz - NT database with taxonomic renaming applied
sorted.fa.gz - Taxonomically sorted NT database
blacklist_nt_genus_100.sketch - K-mer blacklist for filtering common sequences
taxa0.sketch through taxa30.sketch - 31 taxonomic sketch files
badHeaders.txt - Log of problematic sequence headers encountered
log_*.out, log_*.err - SLURM job output and error logs
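
Once the job has finished, the presence of these outputs can be confirmed with a short loop. This is a sketch to be run in the pipeline's output directory. The helper name missing_or_empty is hypothetical, and badHeaders.txt is left out because it could legitimately be empty.

```shell
# Print the names of arguments that are missing or are empty files.
missing_or_empty() {
  for f in "$@"; do
    [ -s "$f" ] || printf '%s\n' "$f"
  done
}

# Build the list: three database files plus taxa0.sketch through taxa30.sketch.
set -- renamed.fa.gz sorted.fa.gz blacklist_nt_genus_100.sketch
i=0
while [ "$i" -lt 31 ]; do
  set -- "$@" "taxa$i.sketch"
  i=$((i + 1))
done
missing_or_empty "$@"
```

No output means every listed file exists and is non-empty.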

Configuration and Customization

Taxonomy Path Configuration

For systems outside NERSC, modify the TAXPATH variable:

# Default (auto-detection)
TAXPATH="auto"

# Custom path
TAXPATH="/path/to/taxonomy_directory/"

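# Environment variable (takes effect only if the script's own TAXPATH="auto"
# assignment is removed or changed to respect an existing value)
export TAXPATH="/custom/taxonomy/location/"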

Memory Optimization

Memory allocations can be adjusted based on available resources:

gi2taxid: Can use more than 1GB if available
sortbyname: Requires substantial memory (96GB recommended minimum)
sketchblacklist/bbsketch: Can be reduced to 16GB if necessary
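
Before settling on the -Xmx values, the node's installed memory can be compared with the largest request. The sketch below is Linux-specific and reads /proc/meminfo. The helper name mem_total_gb is hypothetical.

```shell
# Total RAM in whole GB, read from a /proc/meminfo-style file (Linux).
mem_total_gb() {
  awk '$1 == "MemTotal:" { print int($2 / 1048576) }' "${1:-/proc/meminfo}"
}

need_gb=96   # matches -Xmx96g for sortbyname.sh
if [ -r /proc/meminfo ]; then
  have_gb="$(mem_total_gb)"
  if [ "$have_gb" -lt "$need_gb" ]; then
    echo "warning: ${have_gb} GB RAM; sortbyname.sh requests ${need_gb} GB" >&2
  fi
fi
```

The Java heap is only part of a process's footprint, so a node with exactly 96GB would still be too small for -Xmx96g.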

Compression Settings

Compression parameters can be tuned for different priorities:

Higher compression (slower): zl=9
Faster processing (larger files): zl=6
Thread count: Adjust pigz parameter based on available cores
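
The pigz thread count can be derived from the machine rather than hard-coded. The following is a minimal sketch that caps the value at the 32 used in the commands above. The helper name pick_threads is hypothetical.

```shell
# Choose a pigz thread count: the number of online cores, capped at $1.
pick_threads() {
  cores="$(getconf _NPROCESSORS_ONLN 2>/dev/null || echo 1)"
  if [ "$cores" -gt "$1" ]; then
    echo "$1"
  else
    echo "$cores"
  fi
}

threads="$(pick_threads 32)"
echo "pigz=$threads"
```

The resulting value could then be passed as pigz=$threads in place of pigz=32.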

Troubleshooting

Common Issues

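Out of Memory: If a stage fails with a Java out-of-memory error, raise its -Xmx value; if the node lacks the requested RAM, lower -Xmx (see Memory Optimization) or use a system with more RAM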
Download Failures: Check network connectivity and NCBI server status
Taxonomy Errors: Ensure taxonomy data is current using fetchTaxonomy.sh
Disk Space: Monitor available space throughout pipeline execution

Performance Optimization

Use fast local storage (not network filesystems) when possible
Ensure adequate memory to avoid swapping
Monitor system resources during execution
Consider running stages separately for better control
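
Running stages separately is easier if a stage is skipped when its output already exists. The wrapper below is a hypothetical sketch and not part of fetchNt.sh. It treats any non-empty output file as complete, so a partial file left by a failed run should be deleted before rerunning.

```shell
# Run a stage only if its output file is missing or empty (hypothetical helper).
# Usage: run_stage OUTPUT COMMAND [ARGS...]
run_stage() {
  out="$1"
  shift
  if [ -s "$out" ]; then
    echo "skip: $out already exists"
  else
    "$@"
  fi
}

# Demonstration with a stand-in command; the second call is skipped.
demo_dir="$(mktemp -d)"
run_stage "$demo_dir/stage.out" sh -c "echo data > '$demo_dir/stage.out'"
run_stage "$demo_dir/stage.out" sh -c "echo data > '$demo_dir/stage.out'"
rm -rf "$demo_dir"
```

In the pipeline, the sorting stage could be wrapped as run_stage sorted.fa.gz sortbyname.sh followed by that stage's usual arguments. The download stage is a pipe and would need to be placed in a function first.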

Validation

After completion, validate the sketches:

# Test with a known sequence
comparesketch.sh in=test_sequence.fa taxa#.sketch blacklist=blacklist_nt_genus_100.sketch

# Check sketch file sizes and counts
ls -la taxa*.sketch

Integration with BBTools Ecosystem

Related Tools

fetchTaxonomy.sh: Updates taxonomy database (prerequisite)
comparesketch.sh: Uses generated sketches for identification
sendsketch.sh: Alternative sketch-based identification
gi2taxid.sh: Taxonomic renaming component
sortbyname.sh: Taxonomic sorting component
sketchblacklist.sh: Blacklist generation component
bbsketch.sh: Sketch generation component

Workflow Integration

The fetchNt pipeline is typically part of a larger database preparation workflow:

Update taxonomy: fetchTaxonomy.sh
Process NT database: fetchNt.sh
Set up server (optional): startNtServerVM.sh
Perform analyses: comparesketch.sh, sendsketch.sh

Algorithm Details

Taxonomic Sketch Strategy

The pipeline's approach to taxonomic sketching has four parts:

Species-Level Resolution: One sketch per species provides precise identification
Multi-K Strategy: Uses k=32 and k=24 for sensitivity/specificity balance
Blacklist Filtering: Removes k-mers present in ≥120 genera (mincount=120) to improve specificity
Memory Optimization: Taxonomic sorting enables streaming sketch generation

Database Processing Strategy

The pipeline's processing order is optimized for efficiency:

Streaming Download: Processes data while downloading to minimize disk usage
Taxonomic Integration: Renames sequences with taxonomic information early
Quality Filtering: Removes short sequences and problematic headers
Taxonomic Sorting: Groups related sequences for efficient processing

Performance Optimizations

Parallel Processing: Uses pigz for multi-threaded compression
Memory Management: Streaming operations where possible
I/O Optimization: bgzip format, whose block structure allows multithreaded decompression as well as compression
File Organization: Multiple sketch files for parallel loading