fetchRefSeq.sh

Script: fetchRefSeq.sh
Source Directory: pipelines/fetch/
Author: Brian Bushnell

Pipeline script that downloads the complete set of RefSeq genome sequences from the NCBI FTP server and renames each sequence header with its NCBI taxonomy ID. Designed for building reference databases for metagenomic analysis, phylogenetic studies, and sequence classification workflows.

Overview

fetchRefSeq.sh is a specialized pipeline script that automates the process of downloading all complete genome sequences from NCBI's RefSeq database and preprocessing them for use with BBTools taxonomy-aware applications. The script combines high-speed parallel downloading with taxonomic header renaming to create ready-to-use reference databases.

Key Features

Prerequisites

Required Software

System Requirements

Taxonomy Database Setup

Critical: The taxonomy server must be updated before running this script, or you must have local taxonomy data available. The script uses the TAXPATH variable to locate taxonomy files:

# Default setting (auto-detect)
TAXPATH="auto"

# Custom path for non-NERSC environments
TAXPATH="/path/to/taxonomy_directory/"
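Before starting a long download, it can help to confirm that the configured value is usable. The following is a minimal sketch, not part of fetchRefSeq.sh: the function name check_taxpath is hypothetical, and it assumes only that TAXPATH is either the literal string "auto" or a directory path.

```shell
# Hypothetical helper (not part of fetchRefSeq.sh): validate a TAXPATH value.
# Returns 0 when the value is "auto" or names an existing directory.
check_taxpath() {
    local taxpath="$1"
    if [ "$taxpath" = "auto" ]; then
        return 0    # auto-detection is left to BBTools
    fi
    if [ -d "$taxpath" ]; then
        return 0
    fi
    echo "TAXPATH '$taxpath' is not a directory" >&2
    return 1
}
```

A call such as check_taxpath "$TAXPATH" || exit 1 placed before the download would stop early on a mistyped path. The sketch checks only that the directory exists, not that it holds valid taxonomy files.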

Usage

Basic Execution

# Run the complete RefSeq download and processing pipeline
./fetchRefSeq.sh

Environment Configuration

For systems outside of NERSC, modify the script to set the correct taxonomy path:

# Edit the TAXPATH variable in fetchRefSeq.sh
TAXPATH="/your/taxonomy/directory/"

# Ensure pigz is available in your PATH
module load pigz  # On HPC systems
# or
export PATH="$PATH:/path/to/pigz_directory"  # For custom installations (directory containing the pigz executable)

Pipeline Process

Download Phase

The script downloads all complete genomic FASTA files from NCBI RefSeq using a streaming approach:

wget -q -O - ftp://ftp.ncbi.nlm.nih.gov/refseq/release/complete/*genomic.fna.gz
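Streaming many .gz files through a single pipe works because of a property of the gzip format: gzip files concatenated byte-for-byte form one valid multi-member stream. The sketch below demonstrates this with two small local files. The function name and file names are illustrative only and do not appear in fetchRefSeq.sh.

```shell
# Hypothetical demo: two separately compressed FASTA files, concatenated
# byte-for-byte, decompress as a single stream.
concat_gzip_demo() {
    local dir
    dir=$(mktemp -d) || return 1
    printf '>seq1\nACGT\n' | gzip -c > "$dir/a.fna.gz"
    printf '>seq2\nTTGA\n' | gzip -c > "$dir/b.fna.gz"
    # cat stands in for wget writing successive files to stdout
    cat "$dir/a.fna.gz" "$dir/b.fna.gz" | gzip -dc
    rm -rf "$dir"
}
```

Running concat_gzip_demo prints both records in order (>seq1, ACGT, >seq2, TTGA), which is the behavior the download pipe depends on.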

Processing Phase

Downloaded sequences are immediately processed through gi2taxid.sh with the following parameters:

Taxonomy Integration

The gi2taxid.sh component performs several critical functions:

Output Files

Primary Output

renamed.fa.gz
Main output file containing all RefSeq sequences with taxonomically-enhanced headers. Compressed using bgzip format for efficient random access and compatibility with genomics tools.

Error Tracking

badHeaders.txt
Log file containing sequence headers that could not be successfully mapped to taxonomy IDs. Used for quality control and troubleshooting taxonomy database issues.
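For a quick quality-control check, the log can be summarized after a run. This is a rough sketch under two assumptions that the source does not confirm: that the log holds one header per line, and that a line count at or above the maxbadheaders value means the log may be incomplete. The function name bad_header_report is hypothetical.

```shell
# Hypothetical helper: summarize a bad-header log.
# $1 = log file (default badHeaders.txt)
# $2 = cap to compare against (default 5000, the script's maxbadheaders value)
# Assumes one header per line.
bad_header_report() {
    local file="${1:-badHeaders.txt}" cap="${2:-5000}" count
    if [ ! -f "$file" ]; then
        echo "0 unmapped headers (no $file found)"
        return 0
    fi
    count=$(wc -l < "$file" | tr -d ' ')
    echo "$count unmapped headers in $file"
    if [ "$count" -ge "$cap" ]; then
        echo "warning: log reached the cap of $cap; more headers may have failed" >&2
        return 2
    fi
    return 0
}
```

Called with no arguments, bad_header_report reads badHeaders.txt in the current directory and returns status 2 when the count reaches the cap.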

Header Format

Successfully processed sequences receive headers in the format:

>tid|12345|original_header_information
# Where 12345 is the NCBI taxonomy ID
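Because the taxonomy ID sits in the second pipe-delimited field, it can be read with standard tools. The sketch below assumes headers follow exactly the >tid|ID|... layout shown above. Both function names are hypothetical and are not part of BBTools.

```shell
# Hypothetical helper: print the taxonomy ID from one header line.
# Exits non-zero when the header does not start with ">tid|<digits>|".
taxid_from_header() {
    printf '%s\n' "$1" | awk -F'|' '
        $1 == ">tid" && $2 ~ /^[0-9]+$/ { print $2; found = 1 }
        END { exit !found }'
}

# Hypothetical helper: count sequences per taxonomy ID in a gzip- or
# bgzip-compressed FASTA file. Output: "<taxid><TAB><count>", sorted by ID.
count_by_taxid() {
    gzip -dc "$1" | awk -F'|' '
        /^>tid\|/ { n[$2]++ }
        END { for (t in n) print t "\t" n[t] }' | sort -n
}
```

For example, taxid_from_header '>tid|12345|original_header_information' prints 12345, and count_by_taxid renamed.fa.gz gives a per-taxon sequence tally. Note that the second function decompresses the whole file, which may take a long time on the full database.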

Configuration Options

Environment Variables

TAXPATH
Path to BBTools taxonomy database files. Set to "auto" for automatic detection on NERSC, or specify custom path for other environments.

gi2taxid.sh Parameters

The pipeline passes specific parameters to gi2taxid.sh for optimal performance:

-Xmx1g
Sets the maximum Java heap size to 1 GB. Increase it if out-of-memory errors occur.
pigz=16
Uses 16 parallel threads for compression. Adjust based on available CPU cores.
unpigz
Enables parallel decompression for input processing.
zl=9
Maximum compression level for output files. Reduces storage requirements at cost of processing time.
server
Enables server-based taxonomy lookup for accession-based identification when local files are unavailable.
maxbadheaders=5000
Maximum number of problematic headers to log before truncating error file. Prevents excessive disk usage from systematic failures.
bgzip
Outputs in bgzip format for improved compatibility with genomics analysis tools and random access capabilities.

Customization

Memory Adjustment

# For systems with limited memory, reduce allocation
wget -q -O - "ftp://ftp.ncbi.nlm.nih.gov/refseq/release/complete/*genomic.fna.gz" | \
gi2taxid.sh -Xmx512m in=stdin.fa.gz out=renamed.fa.gz pigz=8 unpigz zl=6 server ow maxbadheaders=5000 badheaders=badHeaders.txt bgzip

Thread Optimization

# Adjust pigz threads based on available CPU cores
wget -q -O - "ftp://ftp.ncbi.nlm.nih.gov/refseq/release/complete/*genomic.fna.gz" | \
gi2taxid.sh -Xmx1g in=stdin.fa.gz out=renamed.fa.gz pigz=32 unpigz zl=9 server ow maxbadheaders=5000 badheaders=badHeaders.txt bgzip
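Rather than hard-coding the thread count, it could be derived from the machine. The following is one possible sketch. The function name pick_pigz_threads is hypothetical, and the default cap of 16 simply mirrors the value the script passes to pigz.

```shell
# Hypothetical helper: choose a pigz thread count from the online CPU cores,
# capped at a maximum (default 16, the value used in the script).
pick_pigz_threads() {
    local max="${1:-16}" cores
    cores=$(getconf _NPROCESSORS_ONLN 2>/dev/null || echo 1)
    # Fall back to 1 if getconf returned something non-numeric
    [ "$cores" -ge 1 ] 2>/dev/null || cores=1
    if [ "$cores" -gt "$max" ]; then
        echo "$max"
    else
        echo "$cores"
    fi
}
```

The result can then be substituted into the command line, for instance pigz="$(pick_pigz_threads 32)". On shared HPC nodes the online core count may exceed what a job is actually allotted, so a scheduler-provided value may be the better source there.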

Partial Downloads

# Download specific taxonomic groups instead of complete RefSeq
wget -q -O - "ftp://ftp.ncbi.nlm.nih.gov/refseq/release/bacteria/*genomic.fna.gz" | \
gi2taxid.sh -Xmx1g in=stdin.fa.gz out=bacteria_renamed.fa.gz pigz=16 unpigz zl=9 server ow maxbadheaders=5000 badheaders=badHeaders.txt bgzip

Troubleshooting

Common Issues

Network Connectivity

Taxonomy Database Issues

System Resource Constraints

Diagnostic Commands

# Check taxonomy database availability (applies when TAXPATH is a directory, not "auto")
ls -la "$TAXPATH"

# Test pigz installation
pigz --version

# Verify network connectivity to NCBI
wget -q --spider ftp://ftp.ncbi.nlm.nih.gov/refseq/release/complete/

# Watch headers that fail taxonomy mapping as they are logged
tail -f badHeaders.txt
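The individual checks above can be folded into a single preflight step. This is a minimal sketch: the function name missing_commands is hypothetical, and the list of commands to pass it is an assumption based on the tools this page mentions.

```shell
# Hypothetical preflight helper: report which of the named commands are
# not found on PATH. Prints "all present" and returns 0 when none are missing.
missing_commands() {
    local cmd missing=""
    for cmd in "$@"; do
        command -v "$cmd" >/dev/null 2>&1 || missing="$missing $cmd"
    done
    if [ -n "$missing" ]; then
        echo "missing:$missing"
        return 1
    fi
    echo "all present"
}
```

A plausible call before launching the pipeline would be missing_commands wget pigz java gi2taxid.sh, which lists any tool that is not on PATH.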

Performance Considerations

Download Speed

Processing Throughput

Optimization Strategies

Integration with BBTools

Downstream Applications

The taxonomically-labeled RefSeq database created by fetchRefSeq.sh serves as input for various BBTools applications:

Database Management

# Create searchable index for BBMap
bbmap.sh ref=renamed.fa.gz

# Generate k-mer sketches for rapid comparison
sketch.sh in=renamed.fa.gz out=refseq_sketches mode=sequence

# Extract specific taxonomic groups
filterbyname.sh in=renamed.fa.gz out=bacteria.fa.gz names=bacteria.txt include=t
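The names file used in the last command has to come from somewhere. One way to build it is to select headers by taxonomy ID, as sketched below. This assumes the >tid|ID|... header layout described earlier and that filterbyname.sh expects one header per line without the leading ">", which should be verified against the tool's own usage text. The function name headers_for_taxids is hypothetical.

```shell
# Hypothetical helper: print headers (without the leading ">") whose taxonomy
# ID appears in a list.
# $1 = gzip- or bgzip-compressed FASTA with ">tid|ID|..." headers
# $2 = text file with one taxonomy ID per line
headers_for_taxids() {
    gzip -dc "$1" | awk -F'|' '
        NR == FNR { want[$1] = 1; next }               # first input: the ID list
        /^>tid\|/ && ($2 in want) { print substr($0, 2) }
    ' "$2" -
}
```

For instance, headers_for_taxids renamed.fa.gz bacteria_taxids.txt > bacteria.txt would produce the names file, where bacteria_taxids.txt is a user-supplied list of taxonomy IDs. The sketch matches only exact IDs and does not expand a higher-level taxon to its descendants.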

Algorithm Details

Streaming Architecture

The pipeline employs a streaming architecture that minimizes disk I/O and memory usage:

Taxonomy Mapping Algorithm

The RenameGiToTaxid Java class implements several mapping strategies:

Compression Strategy

The pipeline uses advanced compression techniques for optimal storage efficiency:

Error Handling

Robust error handling ensures reliable operation even with problematic input data:

Historical Context

This pipeline was originally developed for the NERSC supercomputing environment to facilitate large-scale metagenomic analysis projects. The design reflects the specific requirements of high-performance computing environments:

Related Tools

Support

For questions and support: