FetchNt Pipeline
Pipeline for fetching, processing, and sketching the NCBI NT (nucleotide) database. It downloads the NT database, renames sequences with taxonomic information, sorts them by taxonomy, builds a blacklist of overly common k-mers, and generates taxonomic sketches for high-speed sequence identification.
Overview
The fetchNt pipeline automates the complete process of preparing the NCBI NT database for use with BBTools taxonomic identification tools. The NT database is NCBI's comprehensive collection of nucleotide sequences from all public databases, making it ideal for broad taxonomic identification tasks.
The pipeline performs four main operations:
- Database Download and Processing: Downloads NT.gz and renames sequences with taxonomic information
- Taxonomic Sorting: Sorts sequences by taxonomy to optimize memory usage during sketching
- Blacklist Generation: Creates k-mer blacklists to filter overly common sequences
- Sketch Creation: Generates multiple sketch files with one sketch per species for fast taxonomic queries
Prerequisites
System Requirements
- High-performance computing system with substantial memory (96GB+ recommended; the sorting stage runs with -Xmx96g, so the node needs somewhat more physical RAM than that)
- Fast network connection for downloading large databases
- Sufficient disk space (several TB recommended; the estimates under Disk Space Requirements total under 1TB, but NT grows with every release)
- BBTools suite with all dependencies
- pigz for parallel compression/decompression
- Updated BBTools taxonomy data
Required Setup
- Taxonomy Data: Must be updated before running (use fetchTaxonomy.sh)
- TAXPATH Configuration: Set to point to your taxonomy directory
- Network Access: Requires access to NCBI FTP servers
- Module Loading: pigz and other dependencies must be available
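The setup items above can be checked before the job is submitted. The following is a minimal sketch and is not part of fetchNt.sh; the function name `missing_tools` and the variable `PIPELINE_TOOLS` are hypothetical, and the tool list is simply the set of commands this document shows the pipeline invoking.

```shell
# Hypothetical helper (not part of BBTools): print every tool from the
# argument list that cannot be found on PATH, one name per line.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Commands invoked by the pipeline stages described in this document.
PIPELINE_TOOLS="wget pigz gi2taxid.sh sortbyname.sh sketchblacklist.sh bbsketch.sh"

# Example: report anything missing before submitting the job.
# missing_tools $PIPELINE_TOOLS
```

An empty result means every listed command resolved on PATH; it does not confirm that the taxonomy data itself is current.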
SLURM Configuration
The script includes SLURM directives optimized for NERSC Genepool systems:
- Job name: sketch_refseq
- Queue: genepool
- Account: gtrqc
- Nodes: 1 (exclusive)
- Architecture: haswell
- Wall time: 71 hours
Pipeline Stages
1. Environment Setup
The pipeline begins by setting up the environment and loading necessary modules:
set -e # Exit on any error
TAXPATH="auto" # Automatic taxonomy path detection
# module load pigz # Load parallel compression (uncomment if needed)
# TAXPATH="/path/to/taxonomy_directory/" # Uncomment to override auto-detection
2. Database Download and Processing
Downloads the complete NCBI NT database and processes it with taxonomic renaming:
wget -q -O - ftp://ftp.ncbi.nih.gov/blast/db/FASTA/nt.gz | \
gi2taxid.sh -Xmx1g in=stdin.fa.gz out=renamed.fa.gz pigz=32 unpigz bgzip zl=8 \
server ow shrinknames maxbadheaders=5000 badheaders=badHeaders.txt taxpath=$TAXPATH
Key Parameters:
- -Xmx1g: 1GB memory allocation for gi2taxid
- pigz=32: Use 32 threads for parallel compression
- bgzip: Use bgzip compression format
- zl=8: Compression level 8
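- server: Look up taxonomy IDs through the taxonomy server instead of loading local lookup tables, which is why 1GB of memory is enough for this stage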
- shrinknames: Reduce header length to save memory
- maxbadheaders=5000: Allow up to 5000 problematic headers
- badheaders=badHeaders.txt: Log problematic headers to file
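After this stage it can be useful to see how close the run came to the maxbadheaders limit. The sketch below assumes that badHeaders.txt holds one rejected header per line, which this document does not confirm; the function names and the 80% warning threshold are hypothetical.

```shell
# Assumption: the badheaders= file holds one rejected header per line.
bad_header_count() {
    if [ -f "$1" ]; then
        wc -l < "$1" | tr -d ' '
    else
        echo 0
    fi
}

# Hypothetical check against a limit such as maxbadheaders=5000.
# $1 = file, $2 = limit; succeeds when the count reaches 80% of the limit.
near_header_limit() {
    [ "$(bad_header_count "$1")" -ge $(( $2 * 80 / 100 )) ]
}

# Example:
# near_header_limit badHeaders.txt 5000 && echo "bad header count is close to the limit" >&2
```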
3. Taxonomic Sorting
Sorts sequences by taxonomy to optimize memory usage during downstream processing:
sortbyname.sh -Xmx96g in=renamed.fa.gz out=sorted.fa.gz ow taxa tree=auto \
fastawrap=1023 zl=9 pigz=32 minlen=60 bgzip unbgzip
Benefits of Taxonomic Sorting:
- Enables sketches to be written to disk as soon as they're completed
- Dramatically reduces peak memory usage during sketch generation
- Improves overall pipeline efficiency
Key Parameters:
- -Xmx96g: 96GB memory allocation for sorting
- taxa: Sort by taxonomic classification
- tree=auto: Automatically detect taxonomy tree
- fastawrap=1023: Wrap FASTA lines at 1023 characters
- zl=9: Maximum compression level
- minlen=60: Filter sequences shorter than 60bp
- bgzip/unbgzip: Compress the output with bgzip and decompress the bgzip input with bgzip
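Because the sort requests a 96GB Java heap, it helps to confirm that the node can hold it before starting. This is a rough sketch under stated assumptions: the helper names are hypothetical, the 4GB default headroom for the JVM and operating system is a guess rather than a BBTools recommendation, and `total_mem_kb` works only on Linux.

```shell
# Hypothetical helper: succeed when a requested Java heap fits in physical
# memory with some headroom left over.
# $1 = requested heap in GB, $2 = total memory in kB, $3 = headroom in GB (default 4)
heap_fits() {
    headroom=${3:-4}
    [ $(( ($1 + headroom) * 1024 * 1024 )) -le "$2" ]
}

# Linux-only: read MemTotal (reported in kB) from /proc/meminfo.
total_mem_kb() {
    awk '/^MemTotal:/ {print $2}' /proc/meminfo
}

# Example:
# heap_fits 96 "$(total_mem_kb)" || echo "node too small for -Xmx96g" >&2
```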
4. K-mer Blacklist Generation
Creates a blacklist of k-mers that occur in too many different genera, which helps improve specificity:
sketchblacklist.sh -Xmx31g in=sorted.fa.gz prepasses=1 tree=auto taxa \
taxlevel=genus ow out=blacklist_nt_genus_100.sketch mincount=120 k=32,24 taxpath=$TAXPATH
Blacklist Purpose:
- Filters k-mers present in many different genera (the command above uses mincount=120; the "100" in the output file name is nominal)
- Removes overly conserved sequences that provide little taxonomic resolution
- Improves specificity and reduces false positive identifications
- Reduces computational overhead by eliminating uninformative k-mers
Key Parameters:
- -Xmx31g: 31GB memory allocation
- prepasses=1: Use one preprocessing pass
- taxlevel=genus: Count occurrences at genus level
- mincount=120: Blacklist k-mers found in ≥120 genera (a buffer above the nominal 100 in the output file name)
- k=32,24: Generate blacklists for both k=32 and k=24
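The counting rule behind the blacklist can be illustrated on toy data. This is only a conceptual sketch and not how sketchblacklist.sh is implemented; the function name `common_kmers` and the two-column input format are invented for the example.

```shell
# Toy illustration only. Input on stdin: lines of "<kmer> <genus>".
# Output: every k-mer seen in at least $1 distinct genera, sorted.
common_kmers() {
    awk -v min="$1" '
        !seen[$1 SUBSEP $2]++ { genera[$1]++ }   # count each (kmer, genus) pair once
        END { for (k in genera) if (genera[k] >= min) print k }
    ' | sort
}

# Example: AAAA occurs in three genera, CCCC in one, so only AAAA is reported.
# printf 'AAAA g1\nAAAA g2\nAAAA g3\nCCCC g1\n' | common_kmers 3
```

Repeated observations of the same k-mer within one genus do not raise its count, which is why a k-mer that is abundant in a single genus is not blacklisted.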
5. Taxonomic Sketch Generation
Generates the final taxonomic sketches used for sequence identification:
bbsketch.sh -Xmx31g in=sorted.fa.gz out=taxa#.sketch mode=taxa tree=auto files=31 \
ow unpigz minsize=300 prefilter autosize blacklist=blacklist_nt_genus_100.sketch \
k=32,24 depth taxpath=$TAXPATH
Sketch Organization:
- Creates 31 separate sketch files (taxa0.sketch through taxa30.sketch)
- Each sketch file contains multiple species
- One sketch per species for precise taxonomic identification
- Multiple files enable faster loading on multicore systems
Key Parameters:
- -Xmx31g: 31GB memory allocation
- mode=taxa: Create taxonomic sketches
- files=31: Split output into 31 files
- minsize=300: Minimum sketch size of 300 hashes
- prefilter: Use prefiltering for memory efficiency
- autosize: Automatically determine sketch sizes
- depth: Include depth information in sketches
- k=32,24: Use both k=32 and k=24 for analysis
Usage Examples
Basic Pipeline Execution
# 1. Ensure taxonomy is updated
fetchTaxonomy.sh
# 2. Configure taxonomy path (if needed)
export TAXPATH="/path/to/your/taxonomy/"
# 3. Run the complete pipeline
sbatch fetchNt.sh
# Or run directly (not recommended for full NT database):
bash fetchNt.sh
Using the Generated Sketches
Once the pipeline completes, you can use the generated sketches for taxonomic identification:
Basic Taxonomic Identification
# Compare contigs against NT sketches
comparesketch.sh in=contigs.fa k=32,24 tree=auto taxa#.sketch \
blacklist=blacklist_nt_genus_100.sketch
Using Default NT Path (NERSC Systems)
# Set up default path (NERSC systems only)
ln -s /path/to/new/sketches /global/projectb/sandbox/gaag/bbtools/nt/current
# Then use simplified syntax
comparesketch.sh in=contigs.fa nt tree=auto
Resource Requirements and Runtime
Computational Requirements
| Stage | Memory (-Xmx) | CPU Threads | Approximate Time |
|---|---|---|---|
| Download & Processing | 1GB | 32 (pigz) | 2-6 hours |
| Taxonomic Sorting | 96GB | 32 (pigz) | 8-12 hours |
| Blacklist Generation | 31GB | Variable | 4-8 hours |
| Sketch Generation | 31GB | Variable | 12-24 hours |
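Summing the per-stage estimates gives a total that can be compared with the 71-hour wall time listed under SLURM Configuration. The arithmetic below uses only the figures from the table; the variable names are illustrative.

```shell
# Per-stage estimates in hours, copied from the table: lower and upper bounds.
min_total=$(( 2 + 8 + 4 + 12 ))
max_total=$(( 6 + 12 + 8 + 24 ))
wall_time=71   # wall time from the SLURM configuration

echo "Estimated total: ${min_total}-${max_total} h (limit ${wall_time} h)"
if [ "$max_total" -le "$wall_time" ]; then
    echo "Upper estimate fits within the wall time"
fi
```

The estimates total 26 to 50 hours, so even the upper bound leaves about 21 hours of slack before the job would be cut off.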
Disk Space Requirements
- NT Database: ~100-200GB compressed (streamed from wget into gi2taxid.sh, so nt.gz itself is never written to disk)
- Renamed Database: ~150-300GB compressed
- Sorted Database: ~150-300GB compressed
- Sketch Files: ~1-10GB total
- Blacklist Files: ~100MB-1GB
- Temporary Files: Additional 50-100GB during processing
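A free-space check before each stage can catch a full disk early. The following is a minimal sketch; `has_free_gb` is a hypothetical helper, and the 1000GB threshold in the example is only an illustration based on the estimates above.

```shell
# Hypothetical helper: succeed when the filesystem holding directory $1 has
# at least $2 GB available. Relies on POSIX "df -Pk", whose fourth column is
# the available space in 1024-byte blocks.
has_free_gb() {
    avail_kb=$(df -Pk "$1" | awk 'NR == 2 {print $4}')
    [ "$avail_kb" -ge $(( $2 * 1024 * 1024 )) ]
}

# Example:
# has_free_gb . 1000 || echo "less than 1000GB free in the working directory" >&2
```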
Network Requirements
- Stable connection to NCBI FTP servers
- Ability to download 100-200GB of data
- Bandwidth sufficient for large file transfers
Output Files
- renamed.fa.gz - NT database with taxonomic renaming applied
- sorted.fa.gz - Taxonomically sorted NT database
- blacklist_nt_genus_100.sketch - K-mer blacklist for filtering common sequences
- taxa0.sketch through taxa30.sketch - 31 taxonomic sketch files
- badHeaders.txt - Log of problematic sequence headers encountered
- log_*.out, log_*.err - SLURM job output and error logs
Configuration and Customization
Taxonomy Path Configuration
For systems outside NERSC, modify the TAXPATH variable:
# Default (auto-detection)
TAXPATH="auto"
# Custom path
TAXPATH="/path/to/taxonomy_directory/"
# Environment variable
export TAXPATH="/custom/taxonomy/location/"
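Whichever form is used, a mistyped path is easier to catch before the first stage starts. This sketch assumes that a valid value is either the literal `auto` or an existing directory; the function name `valid_taxpath` is hypothetical.

```shell
# Hypothetical check: accept the literal value "auto" or an existing directory.
valid_taxpath() {
    [ "$1" = "auto" ] || [ -d "$1" ]
}

# Fall back to "auto" when TAXPATH is unset, then warn on a bad value.
TAXPATH="${TAXPATH:-auto}"
if ! valid_taxpath "$TAXPATH"; then
    echo "TAXPATH '$TAXPATH' is neither 'auto' nor an existing directory" >&2
fi
```

The check confirms only that the directory exists, not that it contains current taxonomy files.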
Memory Optimization
Memory allocations can be adjusted based on available resources:
- gi2taxid: Can use more than 1GB if available
- sortbyname: Requires substantial memory (96GB recommended minimum)
- sketchblacklist/bbsketch: Can be reduced to 16GB if necessary
Compression Settings
Compression parameters can be tuned for different priorities:
- Higher compression (slower): zl=9
- Faster processing (larger files): zl=6
- Thread count: Adjust pigz parameter based on available cores
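One way to choose the pigz thread count is to take the number of online cores and cap it at the value the pipeline uses by default. This is a sketch rather than BBTools behavior; `pigz_threads` is a hypothetical helper, and it falls back to 1 when `getconf` cannot report the core count.

```shell
# Hypothetical helper: use every online core, capped at $1 (default 32).
pigz_threads() {
    cap=${1:-32}
    cores=$(getconf _NPROCESSORS_ONLN 2>/dev/null || echo 1)
    if [ "$cores" -gt "$cap" ]; then
        echo "$cap"
    else
        echo "$cores"
    fi
}

# Example: build the parameter for a stage command.
# echo "pigz=$(pigz_threads 32)"
```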
Troubleshooting
Common Issues
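- Out of Memory: If Java reports an OutOfMemoryError, raise that stage's -Xmx value or use a system with more RAM; if the scheduler or OS kills the job, lower -Xmx so the heap fits within physical memory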
- Download Failures: Check network connectivity and NCBI server status
- Taxonomy Errors: Ensure taxonomy data is current using fetchTaxonomy.sh
- Disk Space: Monitor available space throughout pipeline execution
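Transient download failures can be handled by retrying the stage. The wrapper below is a generic sketch and is not part of fetchNt.sh; `retry` is a hypothetical helper, and the attempt count and delay in the example are arbitrary.

```shell
# Hypothetical helper: run a command up to $1 times, sleeping $2 seconds
# between attempts. Succeeds as soon as one attempt succeeds.
retry() {
    attempts=$1
    delay=$2
    shift 2
    n=1
    while ! "$@"; do
        [ "$n" -ge "$attempts" ] && return 1
        n=$(( n + 1 ))
        sleep "$delay"
    done
}

# Example: wrap the download stage in a function, then retry it.
# retry 3 600 download_stage
```

Because the download is streamed straight into gi2taxid.sh, each retry restarts the stage from the beginning rather than resuming a partial file.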
Performance Optimization
- Use fast local storage (not network filesystems) when possible
- Ensure adequate memory to avoid swapping
- Monitor system resources during execution
- Consider running stages separately for better control
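Running stages separately is easier when a restarted job can skip work that has already finished. The following is a minimal sketch of that idea; `run_stage` is a hypothetical wrapper, and the example command is abbreviated from the Taxonomic Sorting stage.

```shell
# Hypothetical wrapper: run a stage only when its output file is missing or
# empty, so a restarted job resumes at the first unfinished stage.
# $1 = expected output file; remaining arguments = the stage command.
run_stage() {
    out=$1
    shift
    if [ -s "$out" ]; then
        echo "skip: $out already exists"
        return 0
    fi
    "$@"
}

# Example:
# run_stage sorted.fa.gz sortbyname.sh -Xmx96g in=renamed.fa.gz out=sorted.fa.gz ow taxa tree=auto
```

A stage that crashed partway may leave a non-empty but incomplete output file, so partial outputs should be deleted before a rerun.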
Validation
After completion, validate the sketches:
# Test with a known sequence
comparesketch.sh in=test_sequence.fa taxa#.sketch blacklist=blacklist_nt_genus_100.sketch
# Check sketch file sizes and counts
ls -la taxa*.sketch
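Since files=31 should yield taxa0.sketch through taxa30.sketch, a scripted check can confirm that none is missing or empty. This is a sketch with a hypothetical helper name, `missing_sketches`; it checks only that the files exist and are non-empty, not that their contents are valid.

```shell
# Hypothetical check: print every expected sketch file (taxa0.sketch through
# taxa<N-1>.sketch) that is missing or empty in directory $1.
# $2 = number of files, defaulting to 31 to match files=31.
missing_sketches() {
    dir=$1
    count=${2:-31}
    i=0
    while [ "$i" -lt "$count" ]; do
        [ -s "$dir/taxa$i.sketch" ] || echo "taxa$i.sketch"
        i=$(( i + 1 ))
    done
}

# Example: no output means all 31 files are present and non-empty.
# missing_sketches . 31
```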
Integration with BBTools Ecosystem
Related Tools
- fetchTaxonomy.sh: Updates taxonomy database (prerequisite)
- comparesketch.sh: Uses generated sketches for identification
- sendsketch.sh: Alternative sketch-based identification
- gi2taxid.sh: Taxonomic renaming component
- sortbyname.sh: Taxonomic sorting component
- sketchblacklist.sh: Blacklist generation component
- bbsketch.sh: Sketch generation component
Workflow Integration
The fetchNt pipeline is typically part of a larger database preparation workflow:
- Update taxonomy: fetchTaxonomy.sh
- Process NT database: fetchNt.sh
- Set up server (optional): startNtServerVM.sh
- Perform analyses: comparesketch.sh, sendsketch.sh
Algorithm Details
Taxonomic Sketch Strategy
The pipeline combines four techniques for taxonomic sketching:
- Species-Level Resolution: One sketch per species provides precise identification
- Multi-K Strategy: Uses k=32 and k=24 for sensitivity/specificity balance
- Blacklist Filtering: Removes k-mers present in ≥120 genera (mincount=120; the file name uses a nominal 100) to improve specificity
- Memory Optimization: Taxonomic sorting enables streaming sketch generation
Database Processing Strategy
The pipeline's processing order is optimized for efficiency:
- Streaming Download: Processes data while downloading to minimize disk usage
- Taxonomic Integration: Renames sequences with taxonomic information early
- Quality Filtering: Removes short sequences and problematic headers
- Taxonomic Sorting: Groups related sequences for efficient processing
Performance Optimizations
- Parallel Processing: Uses pigz for multi-threaded compression
- Memory Management: Streaming operations where possible
- I/O Optimization: bgzip (blocked gzip) output, which remains gzip-compatible and can be decompressed in parallel
- File Organization: Multiple sketch files for parallel loading