fetchRefSeq.sh - BBTools Pipeline

Overview

fetchRefSeq.sh is a specialized pipeline script that automates the process of downloading all complete genome sequences from NCBI's RefSeq database and preprocessing them for use with BBTools taxonomy-aware applications. The script combines high-speed parallel downloading with taxonomic header renaming to create ready-to-use reference databases.

Key Features

Complete RefSeq Download: Downloads all complete genomic sequences from NCBI RefSeq release
Taxonomic Integration: Automatically renames sequence headers with NCBI taxonomy IDs
Parallel Processing: Uses pigz for high-speed parallel compression during processing
Error Handling: Tracks and reports problematic headers that cannot be taxonomically classified
NERSC Optimized: Originally designed for NERSC supercomputing environment but adaptable to other systems

Prerequisites

Required Software

pigz: Parallel gzip implementation for high-speed compression
wget: For downloading files from NCBI FTP servers
BBTools taxonomy data: Updated taxonomy database files

System Requirements

Memory: Minimum 1GB RAM (configurable via gi2taxid.sh parameters)
Storage: Substantial free space for complete RefSeq database (typically 100GB+)
Network: Stable high-bandwidth connection to NCBI FTP servers
Environment: Unix-like system with bash shell

Taxonomy Database Setup

Critical: The taxonomy server must be updated before running this script, or you must have local taxonomy data available. The script uses the TAXPATH variable to locate taxonomy files:

# Default setting (auto-detect)
TAXPATH="auto"

# Custom path for non-NERSC environments
TAXPATH="/path/to/taxonomy_directory/"

Usage

Basic Execution

# Run the complete RefSeq download and processing pipeline
./fetchRefSeq.sh

Environment Configuration

For systems outside of NERSC, modify the script to set the correct taxonomy path:

# Edit the TAXPATH variable in fetchRefSeq.sh
TAXPATH="/your/taxonomy/directory/"

# Ensure pigz is available in your PATH
module load pigz  # On HPC systems
# or
export PATH=$PATH:/path/to/pigz  # For custom installations

Pipeline Process

Download Phase

The script downloads all complete genomic FASTA files from NCBI RefSeq using a streaming approach:

wget -q -O - ftp://ftp.ncbi.nlm.nih.gov/refseq/release/complete/*genomic.fna.gz

Processing Phase

Downloaded sequences are immediately processed through gi2taxid.sh with the following parameters:

Memory allocation: 1GB RAM (-Xmx1g)
Input: Streamed from wget via stdin
Output: renamed.fa.gz (taxonomically labeled sequences)
Compression: 16-thread parallel gzip with compression level 9
Error tracking: Up to 5000 bad headers logged to badHeaders.txt
BGZip support: Uses bgzip format for better random access

Taxonomy Integration

The gi2taxid.sh component performs several critical functions:

Header parsing: Extracts GI numbers and accession IDs from NCBI headers
Taxonomy lookup: Maps identifiers to NCBI taxonomy IDs using local or server-based databases
Header rewriting: Generates new headers with taxonomy information embedded
Quality control: Separates sequences with successful vs. failed taxonomy mapping

Output Files

Primary Output

renamed.fa.gz: Main output file containing all RefSeq sequences with taxonomically-enhanced headers. Compressed using bgzip format for efficient random access and compatibility with genomics tools.

Error Tracking

badHeaders.txt: Log file containing sequence headers that could not be successfully mapped to taxonomy IDs. Used for quality control and troubleshooting taxonomy database issues.

Header Format

Successfully processed sequences receive headers in the format:

>tid|12345|original_header_information
# Where 12345 is the NCBI taxonomy ID

Configuration Options

Environment Variables

TAXPATH: Path to BBTools taxonomy database files. Set to "auto" for automatic detection on NERSC, or specify custom path for other environments.

gi2taxid.sh Parameters

The pipeline passes specific parameters to gi2taxid.sh for optimal performance:

-Xmx1g: Java memory allocation set to 1GB. Increase for better performance with large datasets or if memory errors occur.
pigz=16: Uses 16 parallel threads for compression. Adjust based on available CPU cores.
unpigz: Enables parallel decompression for input processing.
zl=9: Maximum compression level for output files. Reduces storage requirements at cost of processing time.
server: Enables server-based taxonomy lookup for accession-based identification when local files are unavailable.
maxbadheaders=5000: Maximum number of problematic headers to log before truncating error file. Prevents excessive disk usage from systematic failures.
bgzip: Outputs in bgzip format for improved compatibility with genomics analysis tools and random access capabilities.

Customization

Memory Adjustment

# For systems with limited memory, reduce allocation
wget -q -O - ftp://ftp.ncbi.nlm.nih.gov/refseq/release/complete/*genomic.fna.gz | \
gi2taxid.sh -Xmx512m in=stdin.fa.gz out=renamed.fa.gz pigz=8 unpigz zl=6 server ow maxbadheaders=5000 badheaders=badHeaders.txt bgzip

Thread Optimization

# Adjust pigz threads based on available CPU cores
wget -q -O - ftp://ftp.ncbi.nlm.nih.gov/refseq/release/complete/*genomic.fna.gz | \
gi2taxid.sh -Xmx1g in=stdin.fa.gz out=renamed.fa.gz pigz=32 unpigz zl=9 server ow maxbadheaders=5000 badheaders=badHeaders.txt bgzip

Partial Downloads

# Download specific taxonomic groups instead of complete RefSeq
wget -q -O - "ftp://ftp.ncbi.nlm.nih.gov/refseq/release/bacteria/*genomic.fna.gz" | \
gi2taxid.sh -Xmx1g in=stdin.fa.gz out=bacteria_renamed.fa.gz pigz=16 unpigz zl=9 server ow maxbadheaders=5000 badheaders=badHeaders.txt bgzip

Troubleshooting

Common Issues

Network Connectivity

Timeout errors: NCBI FTP servers may have rate limiting or temporary unavailability
Incomplete downloads: Large file transfers may be interrupted; consider resumable download tools
DNS resolution: Ensure ftp.ncbi.nlm.nih.gov resolves correctly in your environment

Taxonomy Database Issues

High bad header counts: Usually indicates outdated or missing taxonomy files
Server connection failures: Check network access to taxonomy server when using server mode
Memory errors: Increase -Xmx allocation if processing fails due to insufficient memory

System Resource Constraints

Disk space: Complete RefSeq requires substantial storage; monitor available space
CPU utilization: Adjust pigz thread count to avoid overwhelming system resources
Memory pressure: Reduce Java heap size if system memory is limited

Diagnostic Commands

# Check taxonomy database availability
ls -la $TAXPATH

# Test pigz installation
pigz --version

# Verify network connectivity to NCBI
wget -q --spider ftp://ftp.ncbi.nlm.nih.gov/refseq/release/complete/

# Monitor download progress
tail -f badHeaders.txt

Performance Considerations

Download Speed

Network bandwidth: Complete RefSeq download benefits from high-bandwidth connections
NCBI server load: Download speeds vary with server utilization; consider off-peak hours
Geographic location: Proximity to NCBI servers affects transfer rates

Processing Throughput

CPU cores: Pigz compression scales well with additional CPU cores
Memory allocation: Larger heap sizes improve taxonomy lookup performance
Storage I/O: Fast storage (SSD) significantly improves compression and write performance

Optimization Strategies

Parallel processing: Consider splitting downloads by taxonomic group for parallel execution
Local caching: Download taxonomy databases locally to reduce server dependency
Incremental updates: Use rsync or similar tools for incremental RefSeq updates rather than full downloads

Integration with BBTools

Downstream Applications

The taxonomically-labeled RefSeq database created by fetchRefSeq.sh serves as input for various BBTools applications:

BBMap/BBSplit: Reference-based read mapping and separation
Seal: Sequence matching and contamination detection
SendSketch: Rapid taxonomic classification via k-mer sketching
CompareSketch: Comparative genomic analysis and phylogenetic placement
Taxonomy tools: Various classification and analysis workflows

Database Management

# Create searchable index for BBMap
bbmap.sh ref=renamed.fa.gz

# Generate k-mer sketches for rapid comparison
sketch.sh in=renamed.fa.gz out=refseq_sketches mode=sequence

# Extract specific taxonomic groups
filterbyname.sh in=renamed.fa.gz out=bacteria.fa.gz names=bacteria.txt include=t

Algorithm Details

Streaming Architecture

The pipeline employs a streaming architecture that minimizes disk I/O and memory usage:

Pipeline processing: Data flows directly from wget through gi2taxid without intermediate files
Memory efficiency: Only processes sequences currently in memory, not entire dataset
Parallel compression: Pigz operates on compressed blocks while download continues
Buffered I/O: Uses optimized buffer sizes for network and disk operations

Taxonomy Mapping Algorithm

The RenameGiToTaxid Java class implements several mapping strategies:

GI number lookup: Direct mapping via GI-to-TaxID tables
Accession parsing: Extraction and lookup of NCBI accession numbers
Server fallback: Online taxonomy server queries for missing local mappings
Error tracking: Systematic logging of unmappable sequences for quality control

Compression Strategy

The pipeline uses advanced compression techniques for optimal storage efficiency:

BGZip format: Block-based compression allowing random access
Parallel compression: Multi-threaded pigz for high-throughput processing
Adaptive buffering: Dynamic buffer sizing based on system resources
Quality preservation: Lossless compression maintains sequence integrity

Error Handling

Robust error handling ensures reliable operation even with problematic input data:

Bad header tolerance: Configurable threshold for unmappable sequences
Network resilience: Automatic retry logic for transient network failures
Memory management: Automatic garbage collection and buffer management
Process monitoring: Exit code reporting and error logging for automated workflows

Historical Context

This pipeline was originally developed for the NERSC supercomputing environment to facilitate large-scale metagenomic analysis projects. The design reflects the specific requirements of high-performance computing environments:

Batch processing: Optimized for scheduled execution in job queue systems
Resource efficiency: Minimal resource footprint for shared computing environments
Network optimization: Designed for high-bandwidth, high-latency network environments
Fault tolerance: Robust error handling for long-running, unsupervised execution

Related Tools

gi2taxid.sh: Core component for taxonomic sequence renaming
downloadRefSeq.sh: Alternative download script with different options
updateTaxonomy.sh: Script for updating local taxonomy databases
sketch.sh: K-mer sketching tool for downloaded sequences
sendsketch.sh: Taxonomic classification using the created database

Support

For questions and support:

Email: bbushnell@lbl.gov
Documentation: bbmap.org