fetchProkByGenus.sh

Script: fetchProkByGenus.sh
Source Directory: pipelines/fetch/
Author: Brian Bushnell

Comprehensive pipeline for downloading prokaryotic genomes organized by genus from NCBI RefSeq. It downloads both archaea and bacteria, generates prokaryotic gene models for each domain and for the combined dataset, then extracts ribosomal and transfer RNA sequences for downstream analysis.

Overview

This pipeline automates the complete workflow for obtaining and preparing prokaryotic genomic data from NCBI RefSeq. It systematically downloads genomes organized by genus, creates gene models for computational gene calling, and extracts RNA sequences for classification purposes. The pipeline is designed to build comprehensive reference datasets for prokaryotic analysis.

Important: This pipeline involves downloading large amounts of data from NCBI and can take considerable time to complete. Ensure adequate network bandwidth, storage space, and processing time before starting.

Prerequisites

System Requirements

Network and Storage Considerations

Pipeline Stages

Stage 1: Download Script Generation

Creates download scripts for both archaea and bacteria from NCBI RefSeq:

# Generate archaea download script
time fetchproks.sh ftp://ftp.ncbi.nih.gov:21/genomes/refseq/archaea archaea.sh 1>fetchA.o 2>&1

# Generate bacteria download script  
time fetchproks.sh ftp://ftp.ncbi.nih.gov:21/genomes/refseq/bacteria bacteria.sh 1>fetchB.o 2>&1

The fetchproks.sh tool crawls NCBI's FTP site and generates shell scripts containing wget commands for downloading genomes and annotations. It selects the best assembly per genus based on contiguity metrics.
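
Before running a generated script, it can be worth confirming that the crawl actually produced download commands. The following is a minimal sketch, assuming only what is stated above (that the generated script contains wget commands). The helper name count_wget_lines is hypothetical and is not part of BBTools.

```shell
# Hypothetical helper (not part of BBTools): count the lines of a generated
# script that contain the word "wget".
count_wget_lines() {
    # grep -c prints the number of matching lines. It exits non-zero when
    # there are no matches, so "|| true" keeps this usable under "set -e".
    grep -c -w 'wget' "$1" || true
}

# Example, after Stage 1 has finished:
#   count_wget_lines archaea.sh
#   count_wget_lines bacteria.sh
```

A count of 0 suggests the crawl failed or was interrupted; fetchA.o and fetchB.o are the places to look in that case.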

Stage 2: Genome Download Execution

Executes the generated download scripts in organized directory structures:

# Setup and download archaea
mkdir archaea
cp archaea.sh archaea
cd archaea
sh archaea.sh
cd ..

# Setup and download bacteria
mkdir bacteria  
cp bacteria.sh bacteria
cd bacteria
sh bacteria.sh
cd ..

This stage creates separate directories for each domain and executes the download scripts, retrieving both FASTA (.fna.gz) and GFF (.gff.gz) files for each selected genome.
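
Large downloads are sometimes truncated, so a quick integrity pass over the compressed files may save time before Stage 3. This is a sketch only; list_corrupt_gz is a hypothetical helper, and it relies on nothing beyond the standard gzip -t test.

```shell
# Hypothetical helper: print the name of every .gz file in a directory that
# fails gzip's built-in integrity test (for example, a truncated download).
list_corrupt_gz() {
    dir=$1
    for f in "$dir"/*.gz; do
        [ -e "$f" ] || continue        # the glob matched nothing
        gzip -t "$f" 2>/dev/null || printf '%s\n' "$f"
    done
}

# Example, after Stage 2 has finished:
#   list_corrupt_gz archaea
#   list_corrupt_gz bacteria
```

Any file printed here can be deleted and fetched again before the gene models are built.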

Stage 3: Gene Model Generation

Analyzes downloaded genomes to create prokaryotic gene models (.pgm files):

# Generate archaea-specific gene model
time nice analyzegenes.sh archaea/*.fna.gz out=archaea.pgm -Xmx1g

# Generate bacteria-specific gene model
time nice analyzegenes.sh bacteria/*.fna.gz out=bacteria.pgm -Xmx1g

# Generate combined prokaryotic gene model
time nice analyzegenes.sh */*.fna.gz out=model.pgm -Xmx1g

The analyzegenes.sh tool processes genome and annotation files to extract gene characteristics, creating probabilistic models for computational gene calling. Three models are generated: domain-specific (archaea, bacteria) and combined (all prokaryotes).

Stage 4: RNA Sequence Processing

Extracts and processes ribosomal and transfer RNA sequences:

# Extract RNA sequences and generate kmer sets
cutRna.sh

The final stage runs the cutRna.sh pipeline to extract 16S, 23S, and 5S ribosomal RNAs and tRNAs from the downloaded genomes, then creates specialized kmer sets for sequence classification.

Algorithm Details

Genome Selection Strategy

The FetchProks Java class implements intelligent assembly selection using a multi-criteria approach:

Gene Model Training

The AnalyzeGenes class processes genome collections to extract statistical patterns:

Performance Optimization

The pipeline employs several optimization strategies:

Basic Usage

# Ensure adequate storage space (potentially terabytes)
df -h

# Run the complete pipeline
./fetchProkByGenus.sh

# Monitor progress (Stage 1 output is redirected to these logs)
tail -f fetchA.o  # Archaea download-script generation log
tail -f fetchB.o  # Bacteria download-script generation log
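
The df -h check above can be turned into a scripted guard. The sketch below is an illustration, not part of the pipeline: has_free_kb is a hypothetical helper, and the threshold is an assumption you would choose for your own run.

```shell
# Hypothetical helper: succeed only if the filesystem holding a directory has
# at least the requested number of kilobytes available.
has_free_kb() {
    dir=$1
    need_kb=$2
    # df -P prints one line per filesystem in POSIX format and -k reports
    # 1024-byte blocks. Column 4 of the second line is the available space.
    avail_kb=$(df -Pk "$dir" | awk 'NR == 2 { print $4 }')
    [ "$avail_kb" -ge "$need_kb" ]
}

# Example: refuse to start with less than 1 GiB free (arbitrary threshold).
#   has_free_kb . 1048576 || { echo "not enough free space" >&2; exit 1; }
```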

Configuration Options

While the main pipeline script uses default parameters, the underlying fetchproks.sh tool supports configuration:

fetchproks.sh Parameters

Memory Allocation

Output Files and Directory Structure

Generated Download Scripts

Downloaded Genome Data

archaea/
├── species1.fna.gz    # Genome FASTA files
├── species1.gff.gz    # Gene annotation files  
├── species2.fna.gz
├── species2.gff.gz
└── ...

bacteria/
├── species1.fna.gz
├── species1.gff.gz
├── species2.fna.gz  
├── species2.gff.gz
└── ...

Generated Models and Analysis Files

Monitoring and Troubleshooting

Progress Monitoring

# Monitor download script generation
tail -f fetchA.o
tail -f fetchB.o

# Check download progress (after scripts execute)
ls archaea/*.fna.gz | wc -l   # Count downloaded archaea genomes
ls bacteria/*.fna.gz | wc -l  # Count downloaded bacteria genomes

# Monitor gene model generation
# (Output appears directly in terminal)

Common Issues

Validation Steps

# Verify downloaded genome counts
find archaea/ -name "*.fna.gz" | wc -l
find bacteria/ -name "*.fna.gz" | wc -l

# Count annotation files (each count should equal the matching .fna.gz count above)
find archaea/ -name "*.gff.gz" | wc -l
find bacteria/ -name "*.gff.gz" | wc -l

# Verify gene models were created
ls -la *.pgm

# Check RNA extraction results
ls -la *.fa
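
Equal counts do not show which genome is missing its annotation. The sketch below lists the unpaired files directly, assuming the naming shown in the directory listing earlier (the same base name with .fna.gz and .gff.gz). list_unpaired_fna is a hypothetical helper.

```shell
# Hypothetical helper: print each genome FASTA in a directory that has no
# matching annotation file (same name with .gff.gz instead of .fna.gz).
list_unpaired_fna() {
    dir=$1
    for fna in "$dir"/*.fna.gz; do
        [ -e "$fna" ] || continue          # the glob matched nothing
        gff="${fna%.fna.gz}.gff.gz"
        [ -e "$gff" ] || printf '%s\n' "$fna"
    done
}

# Example:
#   list_unpaired_fna archaea
#   list_unpaired_fna bacteria
```

No output means every genome in the directory has an annotation file beside it.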

Pipeline Components

fetchproks.sh

Core download coordination tool that crawls NCBI RefSeq FTP directories:

analyzegenes.sh

Generates probabilistic gene models from genome collections:

cutRna.sh

Extracts and processes RNA sequences for classification:

Advanced Configuration

Customizing Download Parameters

For custom download behavior, modify the fetchproks.sh calls:

# Example: Download multiple species per genus
fetchproks.sh ftp://ftp.ncbi.nih.gov:21/genomes/refseq/bacteria bacteria.sh 3 true

# Example: Download all assemblies (not just best)
fetchproks.sh ftp://ftp.ncbi.nih.gov:21/genomes/refseq/archaea archaea.sh 0 false

Memory Scaling

For large datasets, increase memory allocation:

# Increase memory for gene analysis
analyzegenes.sh archaea/*.fna.gz out=archaea.pgm -Xmx4g
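
If the heap size should follow the machine rather than a fixed value, the flag can be computed. This is only a sketch: xmx_flag is a hypothetical helper, and both the memory figure and the percentage are inputs you supply, not values recommended by the pipeline.

```shell
# Hypothetical helper: turn a memory size in kilobytes and a percentage into
# a JVM heap flag expressed in megabytes.
xmx_flag() {
    total_kb=$1
    percent=$2
    # Integer arithmetic: kilobytes * percent / 100, then kilobytes to megabytes.
    printf '%s%sm\n' '-Xmx' "$(( total_kb * percent / 100 / 1024 ))"
}

# Example: half of an 8 GiB machine (8388608 KB).
#   analyzegenes.sh archaea/*.fna.gz out=archaea.pgm "$(xmx_flag 8388608 50)"
```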

Selective Processing

Run individual pipeline stages:

# Only generate download scripts (no execution)
fetchproks.sh ftp://ftp.ncbi.nih.gov:21/genomes/refseq/bacteria bacteria.sh

# Only process existing genomes for gene models
analyzegenes.sh existing_genomes/*.fna.gz out=custom.pgm

# Only extract RNA sequences  
cutRna.sh
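
When stages are rerun individually, it may help to skip work whose output is already up to date. The sketch below shows one way to do that with file timestamps. needs_rebuild is a hypothetical helper and is not something the pipeline itself provides.

```shell
# Hypothetical helper: succeed (exit 0) if the output file is missing or if
# any listed input file is newer than it; otherwise fail (exit 1).
needs_rebuild() {
    out=$1
    shift
    [ -e "$out" ] || return 0
    for input in "$@"; do
        # -nt ("newer than") is supported by the bash, dash, and ksh
        # test builtins.
        [ "$input" -nt "$out" ] && return 0
    done
    return 1
}

# Example: regenerate the archaea model only when genomes have changed.
#   if needs_rebuild archaea.pgm archaea/*.fna.gz; then
#       analyzegenes.sh archaea/*.fna.gz out=archaea.pgm -Xmx1g
#   fi
```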

Use Cases and Applications

Reference Database Construction

Downstream Analysis Integration

Performance Characteristics

Download Performance

Memory and Processing

Scalability Considerations

Integration with BBTools Ecosystem

Input to Other Tools

Reference Data Creation

Notes and Best Practices

Data Management

Network Considerations

Quality Control

Related Tools