fetchRefSeq.sh
Pipeline script for downloading complete RefSeq genome sequences from NCBI FTP servers and renaming them with taxonomic identification numbers. Designed for creating comprehensive reference databases for metagenomic analysis, phylogenetic studies, and sequence classification workflows.
Overview
fetchRefSeq.sh is a specialized pipeline script that automates the process of downloading all complete genome sequences from NCBI's RefSeq database and preprocessing them for use with BBTools taxonomy-aware applications. The script combines high-speed parallel downloading with taxonomic header renaming to create ready-to-use reference databases.
Key Features
- Complete RefSeq Download: Downloads all complete genomic sequences from NCBI RefSeq release
- Taxonomic Integration: Automatically renames sequence headers with NCBI taxonomy IDs
- Parallel Processing: Uses pigz for high-speed parallel compression during processing
- Error Handling: Tracks and reports problematic headers that cannot be taxonomically classified
- NERSC Optimized: Originally designed for NERSC supercomputing environment but adaptable to other systems
Prerequisites
Required Software
- pigz: Parallel gzip implementation for high-speed compression
- wget: For downloading files from NCBI FTP servers
- BBTools taxonomy data: Updated taxonomy database files
System Requirements
- Memory: Minimum 1GB RAM (configurable via gi2taxid.sh parameters)
- Storage: Substantial free space for complete RefSeq database (typically 100GB+)
- Network: Stable high-bandwidth connection to NCBI FTP servers
- Environment: Unix-like system with bash shell
Taxonomy Database Setup
Critical: The taxonomy server must be updated before running this script, or you must have local taxonomy data available. The script uses the TAXPATH variable to locate taxonomy files:
# Default setting (auto-detect)
TAXPATH="auto"
# Custom path for non-NERSC environments
TAXPATH="/path/to/taxonomy_directory/"
Usage
Basic Execution
# Run the complete RefSeq download and processing pipeline
./fetchRefSeq.sh
Environment Configuration
For systems outside of NERSC, modify the script to set the correct taxonomy path:
# Edit the TAXPATH variable in fetchRefSeq.sh
TAXPATH="/your/taxonomy/directory/"
# Ensure pigz is available in your PATH
module load pigz # On HPC systems
# or
export PATH=$PATH:/path/to/pigz # For custom installations
Pipeline Process
Download Phase
The script downloads all complete genomic FASTA files from NCBI RefSeq using a streaming approach:
wget -q -O - ftp://ftp.ncbi.nlm.nih.gov/refseq/release/complete/*genomic.fna.gz
Processing Phase
Downloaded sequences are immediately processed through gi2taxid.sh with the following parameters:
- Memory allocation: 1GB RAM (-Xmx1g)
- Input: Streamed from wget via stdin
- Output: renamed.fa.gz (taxonomically labeled sequences)
- Compression: 16-thread parallel gzip with compression level 9
- Error tracking: Up to 5000 bad headers logged to badHeaders.txt
- BGZip support: Uses bgzip format for better random access
Taxonomy Integration
The gi2taxid.sh component performs several critical functions:
- Header parsing: Extracts GI numbers and accession IDs from NCBI headers
- Taxonomy lookup: Maps identifiers to NCBI taxonomy IDs using local or server-based databases
- Header rewriting: Generates new headers with taxonomy information embedded
- Quality control: Separates sequences with successful vs. failed taxonomy mapping
Output Files
Primary Output
- renamed.fa.gz
- Main output file containing all RefSeq sequences with taxonomically-enhanced headers. Compressed using bgzip format for efficient random access and compatibility with genomics tools.
Error Tracking
- badHeaders.txt
- Log file containing sequence headers that could not be successfully mapped to taxonomy IDs. Used for quality control and troubleshooting taxonomy database issues.
Header Format
Successfully processed sequences receive headers in the format:
>tid|12345|original_header_information
# Where 12345 is the NCBI taxonomy ID
Configuration Options
Environment Variables
- TAXPATH
- Path to BBTools taxonomy database files. Set to "auto" for automatic detection on NERSC, or specify custom path for other environments.
gi2taxid.sh Parameters
The pipeline passes specific parameters to gi2taxid.sh for optimal performance:
- -Xmx1g
- Java memory allocation set to 1GB. Increase for better performance with large datasets or if memory errors occur.
- pigz=16
- Uses 16 parallel threads for compression. Adjust based on available CPU cores.
- unpigz
- Enables parallel decompression for input processing.
- zl=9
- Maximum compression level for output files. Reduces storage requirements at cost of processing time.
- server
- Enables server-based taxonomy lookup for accession-based identification when local files are unavailable.
- maxbadheaders=5000
- Maximum number of problematic headers to log before truncating error file. Prevents excessive disk usage from systematic failures.
- bgzip
- Outputs in bgzip format for improved compatibility with genomics analysis tools and random access capabilities.
Customization
Memory Adjustment
# For systems with limited memory, reduce allocation
wget -q -O - ftp://ftp.ncbi.nlm.nih.gov/refseq/release/complete/*genomic.fna.gz | \
gi2taxid.sh -Xmx512m in=stdin.fa.gz out=renamed.fa.gz pigz=8 unpigz zl=6 server ow maxbadheaders=5000 badheaders=badHeaders.txt bgzip
Thread Optimization
# Adjust pigz threads based on available CPU cores
wget -q -O - ftp://ftp.ncbi.nlm.nih.gov/refseq/release/complete/*genomic.fna.gz | \
gi2taxid.sh -Xmx1g in=stdin.fa.gz out=renamed.fa.gz pigz=32 unpigz zl=9 server ow maxbadheaders=5000 badheaders=badHeaders.txt bgzip
Partial Downloads
# Download specific taxonomic groups instead of complete RefSeq
wget -q -O - "ftp://ftp.ncbi.nlm.nih.gov/refseq/release/bacteria/*genomic.fna.gz" | \
gi2taxid.sh -Xmx1g in=stdin.fa.gz out=bacteria_renamed.fa.gz pigz=16 unpigz zl=9 server ow maxbadheaders=5000 badheaders=badHeaders.txt bgzip
Troubleshooting
Common Issues
Network Connectivity
- Timeout errors: NCBI FTP servers may have rate limiting or temporary unavailability
- Incomplete downloads: Large file transfers may be interrupted; consider resumable download tools
- DNS resolution: Ensure ftp.ncbi.nlm.nih.gov resolves correctly in your environment
Taxonomy Database Issues
- High bad header counts: Usually indicates outdated or missing taxonomy files
- Server connection failures: Check network access to taxonomy server when using server mode
- Memory errors: Increase -Xmx allocation if processing fails due to insufficient memory
System Resource Constraints
- Disk space: Complete RefSeq requires substantial storage; monitor available space
- CPU utilization: Adjust pigz thread count to avoid overwhelming system resources
- Memory pressure: Reduce Java heap size if system memory is limited
Diagnostic Commands
# Check taxonomy database availability
ls -la $TAXPATH
# Test pigz installation
pigz --version
# Verify network connectivity to NCBI
wget -q --spider ftp://ftp.ncbi.nlm.nih.gov/refseq/release/complete/
# Monitor download progress
tail -f badHeaders.txt
Performance Considerations
Download Speed
- Network bandwidth: Complete RefSeq download benefits from high-bandwidth connections
- NCBI server load: Download speeds vary with server utilization; consider off-peak hours
- Geographic location: Proximity to NCBI servers affects transfer rates
Processing Throughput
- CPU cores: Pigz compression scales well with additional CPU cores
- Memory allocation: Larger heap sizes improve taxonomy lookup performance
- Storage I/O: Fast storage (SSD) significantly improves compression and write performance
Optimization Strategies
- Parallel processing: Consider splitting downloads by taxonomic group for parallel execution
- Local caching: Download taxonomy databases locally to reduce server dependency
- Incremental updates: Use rsync or similar tools for incremental RefSeq updates rather than full downloads
Integration with BBTools
Downstream Applications
The taxonomically-labeled RefSeq database created by fetchRefSeq.sh serves as input for various BBTools applications:
- BBMap/BBSplit: Reference-based read mapping and separation
- Seal: Sequence matching and contamination detection
- SendSketch: Rapid taxonomic classification via k-mer sketching
- CompareSketch: Comparative genomic analysis and phylogenetic placement
- Taxonomy tools: Various classification and analysis workflows
Database Management
# Create searchable index for BBMap
bbmap.sh ref=renamed.fa.gz
# Generate k-mer sketches for rapid comparison
sketch.sh in=renamed.fa.gz out=refseq_sketches mode=sequence
# Extract specific taxonomic groups
filterbyname.sh in=renamed.fa.gz out=bacteria.fa.gz names=bacteria.txt include=t
Algorithm Details
Streaming Architecture
The pipeline employs a streaming architecture that minimizes disk I/O and memory usage:
- Pipeline processing: Data flows directly from wget through gi2taxid without intermediate files
- Memory efficiency: Only processes sequences currently in memory, not entire dataset
- Parallel compression: Pigz operates on compressed blocks while download continues
- Buffered I/O: Uses optimized buffer sizes for network and disk operations
Taxonomy Mapping Algorithm
The RenameGiToTaxid Java class implements several mapping strategies:
- GI number lookup: Direct mapping via GI-to-TaxID tables
- Accession parsing: Extraction and lookup of NCBI accession numbers
- Server fallback: Online taxonomy server queries for missing local mappings
- Error tracking: Systematic logging of unmappable sequences for quality control
Compression Strategy
The pipeline uses advanced compression techniques for optimal storage efficiency:
- BGZip format: Block-based compression allowing random access
- Parallel compression: Multi-threaded pigz for high-throughput processing
- Adaptive buffering: Dynamic buffer sizing based on system resources
- Quality preservation: Lossless compression maintains sequence integrity
Error Handling
Robust error handling ensures reliable operation even with problematic input data:
- Bad header tolerance: Configurable threshold for unmappable sequences
- Network resilience: Automatic retry logic for transient network failures
- Memory management: Automatic garbage collection and buffer management
- Process monitoring: Exit code reporting and error logging for automated workflows
Historical Context
This pipeline was originally developed for the NERSC supercomputing environment to facilitate large-scale metagenomic analysis projects. The design reflects the specific requirements of high-performance computing environments:
- Batch processing: Optimized for scheduled execution in job queue systems
- Resource efficiency: Minimal resource footprint for shared computing environments
- Network optimization: Designed for high-bandwidth, high-latency network environments
- Fault tolerance: Robust error handling for long-running, unsupervised execution
Related Tools
- gi2taxid.sh: Core component for taxonomic sequence renaming
- downloadRefSeq.sh: Alternative download script with different options
- updateTaxonomy.sh: Script for updating local taxonomy databases
- sketch.sh: K-mer sketching tool for downloaded sequences
- sendsketch.sh: Taxonomic classification using the created database
Support
For questions and support:
- Email: bbushnell@lbl.gov
- Documentation: bbmap.org