fetchProkByGenus.sh - BBTools Pipeline

Overview

This pipeline automates the complete workflow for obtaining and preparing prokaryotic genomic data from NCBI RefSeq. It systematically downloads genomes organized by genus, creates gene models for computational gene calling, and extracts RNA sequences for classification purposes. The pipeline is designed to build comprehensive reference datasets for prokaryotic analysis.

Important: This pipeline involves downloading large amounts of data from NCBI and can take considerable time to complete. Ensure adequate network bandwidth, storage space, and processing time before starting.

Prerequisites

System Requirements

BBTools suite installed with all dependencies
Network connectivity to NCBI FTP servers
Sufficient storage space (potentially terabytes for complete datasets)
Java runtime with adequate memory (minimum 1GB, 2GB+ recommended)
Command-line tools: wget, gzip, basic Unix utilities
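
A quick preflight check can confirm that the command-line tools listed above are on the PATH before a long run starts. The sketch below is only an illustration: the helper name missing_tools is hypothetical and not part of BBTools, and the tool list in the usage comment simply mirrors the requirements above.

```shell
#!/bin/sh
# missing_tools: print each named command that cannot be found on the PATH.
# Returns 0 if every command was found, 1 otherwise.
# (Hypothetical helper for illustration; not part of BBTools.)
missing_tools() {
    status=0
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "$tool"
            status=1
        fi
    done
    return "$status"
}

# Example call, kept as a comment so that sourcing this file has no side effects:
#   missing_tools java wget gzip fetchproks.sh analyzegenes.sh cutRna.sh
```

An empty result means every listed tool was found; any printed name needs to be installed or added to the PATH first.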

Network and Storage Considerations

Bandwidth: RefSeq databases are large; ensure a stable internet connection
Storage: Plan for both compressed (.fna.gz, .gff.gz) and processed files
Time: Complete archaea and bacteria downloads can take days
NCBI Access: Follow NCBI usage guidelines and respect rate limits
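
Free space can be checked programmatically rather than by reading df output by eye. This is a minimal sketch using POSIX df; the helper names free_kb and has_free_gb are hypothetical, and the 1000 GB threshold in the usage comment is an assumed figure, not a documented requirement.

```shell
#!/bin/sh
# free_kb: print the available space, in 1024-byte blocks, on the
# filesystem that holds DIR. Relies on the POSIX "df -P" column layout,
# where the fourth column is the available space.
free_kb() {
    df -Pk "$1" | awk 'NR == 2 { print $4 }'
}

# has_free_gb: succeed only if DIR has at least NEED_GB gigabytes available.
has_free_gb() {
    dir=$1
    need_gb=$2
    avail_kb=$(free_kb "$dir")
    [ "$avail_kb" -ge $((need_gb * 1024 * 1024)) ]
}

# Example (assumed threshold of 1000 GB; adjust to the planned download):
#   has_free_gb . 1000 || echo "Not enough free space in $(pwd)" >&2
```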

Pipeline Stages

Stage 1: Download Script Generation

Creates download scripts for both archaea and bacteria from NCBI RefSeq:

# Generate archaea download script
time fetchproks.sh ftp://ftp.ncbi.nih.gov:21/genomes/refseq/archaea archaea.sh 1>fetchA.o 2>&1

# Generate bacteria download script
time fetchproks.sh ftp://ftp.ncbi.nih.gov:21/genomes/refseq/bacteria bacteria.sh 1>fetchB.o 2>&1

The fetchproks.sh tool crawls NCBI's FTP site and generates shell scripts containing wget commands for downloading genomes and annotations. It selects the best assembly per genus based on contiguity metrics.
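
Because the generated scripts can trigger a very large download, it may be worth checking their size before running them. The sketch below counts lines that mention wget; it assumes roughly one wget invocation per line, which the documentation does not confirm, and count_wget_lines is a hypothetical helper name.

```shell
#!/bin/sh
# count_wget_lines: print how many lines of SCRIPT mention wget.
# Assumption: the generated script issues about one wget call per line,
# so this number approximates the number of files to be fetched.
count_wget_lines() {
    grep -c 'wget' "$1"
}

# Example, after Stage 1 has finished:
#   count_wget_lines archaea.sh
#   count_wget_lines bacteria.sh
```

A count of zero usually means script generation failed, in which case fetchA.o or fetchB.o is the place to look.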

Stage 2: Genome Download Execution

Executes the generated download scripts in organized directory structures:

# Setup and download archaea
mkdir archaea
cp archaea.sh archaea
cd archaea
sh archaea.sh
cd ..

# Setup and download bacteria
mkdir bacteria
cp bacteria.sh bacteria
cd bacteria
sh bacteria.sh
cd ..

This stage creates separate directories for each domain and executes the download scripts, retrieving both FASTA (.fna.gz) and GFF (.gff.gz) files for each selected genome.
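
The commands above stop working cleanly on a rerun, because mkdir fails when the directory already exists and a failed cd would leave the shell in the wrong place. A more defensive variant might look like the following sketch; run_in_own_dir is a hypothetical helper, not something the pipeline provides.

```shell
#!/bin/sh
# run_in_own_dir: copy SCRIPT into a directory named after it (archaea.sh
# goes into archaea/) and run it there.
# - mkdir -p tolerates an existing directory, so reruns do not fail.
# - The subshell keeps the caller's working directory unchanged even if
#   the download script exits early.
run_in_own_dir() {
    script=$1
    dir=${script%.sh}
    mkdir -p "$dir" || return 1
    cp "$script" "$dir/" || return 1
    ( cd "$dir" && sh "$(basename "$script")" )
}

# Example:
#   run_in_own_dir archaea.sh
#   run_in_own_dir bacteria.sh
```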

Stage 3: Gene Model Generation

Analyzes downloaded genomes to create prokaryotic gene models (.pgm files):

# Generate archaea-specific gene model
time nice analyzegenes.sh archaea/*.fna.gz out=archaea.pgm -Xmx1g

# Generate bacteria-specific gene model
time nice analyzegenes.sh bacteria/*.fna.gz out=bacteria.pgm -Xmx1g

# Generate combined prokaryotic gene model
time nice analyzegenes.sh */*.fna.gz out=model.pgm -Xmx1g

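The analyzegenes.sh tool processes genome and annotation files to extract gene characteristics, creating probabilistic models for computational gene calling. Only the FASTA paths are given on the command line; the annotations come from the .gff.gz files downloaded alongside them. Three models are generated: two domain-specific (archaea, bacteria) and one combined (all prokaryotes).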
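
Since model training depends on each genome having its annotation, it can help to list genomes whose GFF file is missing before starting. The sketch below assumes the pairing is purely by file name (same stem, .fna.gz versus .gff.gz), as the download layout suggests; list_unpaired is a hypothetical helper.

```shell
#!/bin/sh
# list_unpaired: print every *.fna.gz in DIR that has no matching *.gff.gz.
# Assumption: a genome and its annotation share the same file name stem.
list_unpaired() {
    for fna in "$1"/*.fna.gz; do
        [ -e "$fna" ] || continue        # the glob matched nothing
        gff=${fna%.fna.gz}.gff.gz
        [ -e "$gff" ] || echo "$fna"
    done
}

# Example:
#   list_unpaired archaea
#   list_unpaired bacteria
```

Empty output means every genome has a same-named annotation file.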

Stage 4: RNA Sequence Processing

Extracts and processes ribosomal and transfer RNA sequences:

# Extract RNA sequences and generate kmer sets
cutRna.sh

The final stage runs the cutRna.sh pipeline to extract 16S, 23S, 5S ribosomal RNAs and tRNAs from the downloaded genomes, then creates specialized kmer sets for sequence classification.
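
RNA extraction can only find what the annotations describe, so a quick count of rRNA and tRNA features in a GFF file is a useful sanity check. The following is a minimal sketch that relies only on the GFF convention that column 3 holds the feature type; count_features is a hypothetical helper and the file name in the example is a placeholder.

```shell
#!/bin/sh
# count_features: print how many records in a gzipped GFF file have
# TYPE in column 3 (the feature type column). Comment lines starting
# with '#' have no third column, so they are never counted.
count_features() {
    gzip -dc "$1" |
        awk -F '\t' -v type="$2" '$3 == type { n++ } END { print n + 0 }'
}

# Example (placeholder file name):
#   count_features archaea/species1.gff.gz rRNA
#   count_features archaea/species1.gff.gz tRNA
```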

Algorithm Details

Genome Selection Strategy

The FetchProks Java class implements intelligent assembly selection using a multi-criteria approach:

Genus-Level Organization: Groups genomes by genus to ensure taxonomic diversity
Assembly Quality Ranking: Evaluates assemblies based on:
- Maximum contig length (primary metric)
- Total genome size
- Contig count (fewer is better)
- TaxID availability (annotated genomes preferred)
Reference Priority: Prefers reference genomes first, then the latest assembly versions, then all assembly versions
Multi-threaded Processing: Uses 7 parallel threads, with each thread handling complete genera to avoid race conditions
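
The ranking idea can be illustrated on a small table with standard Unix tools. This is a toy sketch, not the FetchProks implementation: the tab-separated input layout, the helper name best_per_genus, and the tie-break order after maximum contig length (larger total size, then fewer contigs) are all assumptions, and the TaxID criterion is left out.

```shell
#!/bin/sh
# best_per_genus: read tab-separated records on stdin, one per assembly:
#   genus <TAB> assembly <TAB> max_contig <TAB> total_size <TAB> contig_count
# and print the top-ranked record for each genus.
# Ranking (assumed order): longest max contig, then largest total size,
# then fewest contigs.
best_per_genus() {
    tab=$(printf '\t')
    # Sort by genus, then by the ranking keys; awk keeps the first
    # (best) line seen for each genus.
    LC_ALL=C sort -t "$tab" -k1,1 -k3,3nr -k4,4nr -k5,5n |
        awk -F '\t' '!seen[$1]++'
}

# Example (assemblies.tsv is a hypothetical input file):
#   best_per_genus < assemblies.tsv
```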

Gene Model Training

The AnalyzeGenes class processes genome collections to extract statistical patterns:

Codon Usage Analysis: Determines organism-specific codon preferences
Gene Length Distributions: Models typical gene lengths for different functional categories
Intergenic Region Patterns: Analyzes spacing between genes
GC Content Profiling: Captures base composition biases

Performance Optimization

The pipeline employs several optimization strategies:

Parallel Crawling: Multi-threaded FTP directory crawling with retry mechanisms (40 retries by default)
Memory Management: The pipeline runs each analyzegenes.sh step with -Xmx1g; raise this for larger genome collections
Error Handling: Robust retry logic for network failures and corrupted downloads
File Validation: Ensures paired .fna.gz and .gff.gz files are available before processing
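
The retry idea is easy to reuse for other flaky steps in a shell workflow. The function below is a generic retry pattern written for illustration; it is not the Java retry logic inside FetchProks, and the name retry and its arguments are hypothetical.

```shell
#!/bin/sh
# retry: run a command up to MAX times, sleeping DELAY seconds between
# attempts. Returns 0 as soon as the command succeeds, or 1 once MAX
# attempts have failed.
retry() {
    max=$1
    delay=$2
    shift 2
    attempt=1
    while :; do
        "$@" && return 0
        [ "$attempt" -ge "$max" ] && return 1
        attempt=$((attempt + 1))
        sleep "$delay"
    done
}

# Example: up to 40 attempts, 10 seconds apart ($url is a placeholder):
#   retry 40 10 wget -q "$url"
```

A fixed delay keeps the sketch short; a growing delay would be kinder to a busy server.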

Basic Usage

# Ensure adequate storage space (potentially terabytes)
df -h

# Run the complete pipeline
./fetchProkByGenus.sh

# Monitor download-script generation (fetchproks.sh output is redirected to these logs)
tail -f fetchA.o  # Archaea script generation log
tail -f fetchB.o  # Bacteria script generation log

Configuration Options

While the main pipeline script uses default parameters, the underlying fetchproks.sh tool supports configuration:

fetchproks.sh Parameters

<url> - NCBI RefSeq FTP URL (archaea or bacteria)
<outfile> - Output shell script name
<max species per genus> - Integer limiting species per genus (pipeline uses 1)
<use best> - Boolean for assembly quality selection (pipeline uses true)

Memory Allocation

Gene Analysis: Default 1GB per analyzegenes.sh call
FetchProks: Default 1GB for download coordination
Scaling: Increase memory allocation for very large genome collections

Output Files and Directory Structure

Generated Download Scripts

archaea.sh - Shell script with wget commands for archaea genomes
bacteria.sh - Shell script with wget commands for bacteria genomes
fetchA.o - Log from generating the archaea download script
fetchB.o - Log from generating the bacteria download script

Downloaded Genome Data

archaea/
├── species1.fna.gz    # Genome FASTA files
├── species1.gff.gz    # Gene annotation files
├── species2.fna.gz
├── species2.gff.gz
└── ...

bacteria/
├── species1.fna.gz
├── species1.gff.gz
├── species2.fna.gz
├── species2.gff.gz
└── ...

Generated Models and Analysis Files

archaea.pgm - Prokaryotic gene model trained on archaea genomes
bacteria.pgm - Prokaryotic gene model trained on bacteria genomes
model.pgm - Combined prokaryotic gene model trained on both archaea and bacteria genomes
RNA files generated by the cutRna.sh stage:
- 16S.fa, 23S.fa, 5S.fa - Extracted ribosomal RNA sequences
- tRNA.fa - Transfer RNA sequences
- *_15mers.fa, *_9mers.fa - Kmer sets for classification

Monitoring and Troubleshooting

Progress Monitoring

# Monitor download script generation
tail -f fetchA.o
tail -f fetchB.o

# Check download progress (after scripts execute)
find archaea/ -name "*.fna.gz" | wc -l   # Count downloaded archaea genomes
find bacteria/ -name "*.fna.gz" | wc -l  # Count downloaded bacteria genomes

# Monitor gene model generation
# (Output appears directly in terminal)

Common Issues

Network Timeouts: FetchProks retries failed FTP requests up to 40 times by default
Storage Full: Monitor disk usage; RefSeq can require terabytes
Memory Errors: Increase -Xmx values if analyzegenes.sh fails
FTP Access: NCBI may temporarily block high-volume access
Incomplete Downloads: Check download logs for wget failures

Validation Steps

# Verify downloaded genome counts
find archaea/ -name "*.fna.gz" | wc -l
find bacteria/ -name "*.fna.gz" | wc -l

# Check for paired files (should be equal counts)
find archaea/ -name "*.gff.gz" | wc -l
find bacteria/ -name "*.gff.gz" | wc -l

# Verify gene models were created
ls -la *.pgm

# Check RNA extraction results
ls -la *.fa
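
Listing the files shows that they exist but not that they contain sequences. Counting FASTA header lines gives a slightly stronger check. This is a minimal sketch; fasta_records and report_fasta are hypothetical helpers, and the file names in the example are the RNA outputs listed earlier in this document.

```shell
#!/bin/sh
# fasta_records: print the number of sequences in FILE, counted as
# header lines that start with '>'.
fasta_records() {
    grep -c '^>' "$1"
}

# report_fasta: print "file <TAB> record count" for each existing file given.
report_fasta() {
    for f in "$@"; do
        [ -f "$f" ] || continue
        printf '%s\t%s\n' "$f" "$(fasta_records "$f")"
    done
}

# Example:
#   report_fasta 16S.fa 23S.fa 5S.fa tRNA.fa
```

A count of 0 for any of these files suggests the extraction step did not find the corresponding features.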

Pipeline Components

fetchproks.sh

Core download coordination tool that crawls NCBI RefSeq FTP directories:

Input: NCBI RefSeq FTP URL, output script name
Algorithm: Multi-threaded directory crawling with assembly quality assessment
Selection Criteria: Best assembly per genus based on contig metrics
Output: Shell script with optimized wget commands

analyzegenes.sh

Generates probabilistic gene models from genome collections:

Input: FASTA genome files and optional GFF annotations
Algorithm: Statistical analysis of gene patterns, codon usage, and genome organization
Output: .pgm files for use with the CallGenes tool
Memory: The pipeline passes -Xmx1g for model training; increase it for larger collections

cutRna.sh

Extracts and processes RNA sequences for classification:

Targets: 16S, 23S, 5S ribosomal RNAs and tRNAs
Method: GFF annotation-based sequence extraction
Output: FASTA files and specialized kmer sets
Purpose: Reference datasets for RNA identification and classification

Advanced Configuration

Customizing Download Parameters

For custom download behavior, modify the fetchproks.sh calls:

# Example: Download multiple species per genus
fetchproks.sh ftp://ftp.ncbi.nih.gov:21/genomes/refseq/bacteria bacteria.sh 3 true

# Example: Download all assemblies (not just best)
fetchproks.sh ftp://ftp.ncbi.nih.gov:21/genomes/refseq/archaea archaea.sh 0 false

Memory Scaling

For large datasets, increase memory allocation:

# Increase memory for gene analysis
analyzegenes.sh archaea/*.fna.gz out=archaea.pgm -Xmx4g

Selective Processing

Run individual pipeline stages:

# Only generate download scripts (no execution)
fetchproks.sh ftp://ftp.ncbi.nih.gov:21/genomes/refseq/bacteria bacteria.sh

# Only process existing genomes for gene models
analyzegenes.sh existing_genomes/*.fna.gz out=custom.pgm

# Only extract RNA sequences
cutRna.sh

Use Cases and Applications

Reference Database Construction

Taxonomic Classification: Build comprehensive prokaryotic reference for metagenomic analysis
Gene Prediction: Train domain-specific models for improved gene calling accuracy
Comparative Genomics: Create standardized datasets for cross-genome analysis
Phylogenetic Studies: Obtain representative genomes across prokaryotic diversity

Downstream Analysis Integration

CallGenes: Use generated .pgm models for prokaryotic gene prediction
SendSketch: Compare unknown sequences against downloaded reference genomes
BBMap/BBSplit: Use genomes as references for read mapping and contamination detection
Taxonomy Assignment: Use extracted RNA sequences for taxonomic classification

Performance Characteristics

Download Performance

Parallelization: 7-thread processing for optimal FTP utilization
Network Efficiency: Intelligent retry logic handles temporary failures
Selection Algorithm: O(n log n) sorting by assembly quality metrics
Storage Pattern: Genus-based organization prevents filename conflicts

Memory and Processing

FetchProks: 1GB default, scales with directory listing size
AnalyzeGenes: 1GB in this pipeline (-Xmx1g); requirements depend on genome collection size
RNA Processing: Variable, depends on annotation density
Disk I/O: Optimized for compressed file processing

Scalability Considerations

Genome Count: Handles thousands of genomes efficiently
File Size: Individual prokaryotic genomes are small (typically a few megabases), so the total collection size matters more than any single file
Network Robustness: Designed for unreliable network conditions
Resource Management: Conservative memory allocation prevents system overload

Integration with BBTools Ecosystem

Input to Other Tools

callgenes.sh: Uses generated .pgm models for gene prediction
sendsketch.sh: Can use downloaded genomes as custom reference database
bbmap.sh: Downloaded genomes serve as mapping references
bbsplit.sh: Use for contamination detection against prokaryotic references

Reference Data Creation

Taxonomy Databases: Provides taxonomically organized genome collections
Gene Calling Models: Domain-specific and combined prokaryotic models
RNA Classification: Kmer sets for ribosomal and transfer RNA identification
Contamination Detection: Reference datasets for identifying prokaryotic contamination

Notes and Best Practices

Data Management

Storage Planning: Complete RefSeq bacteria can exceed 500GB compressed
Update Frequency: NCBI RefSeq is updated regularly; plan for periodic re-downloads
Backup Strategy: Consider backing up generated models and processed datasets
Cleanup: Remove intermediate files after successful model generation

Network Considerations

NCBI Courtesy: Avoid excessive parallel connections; respect server resources
Resume Capability: Pipeline can be restarted; existing files are typically skipped
Bandwidth Management: Schedule downloads during off-peak hours if possible
Mirror Usage: Consider using NCBI mirrors for international access

Quality Control

File Integrity: Verify downloaded files are complete and uncorrupted
Annotation Quality: Ensure GFF files contain required feature types (rRNA, tRNA)
Model Validation: Test generated .pgm files with known sequences
Taxonomic Coverage: Verify adequate genus representation in final datasets
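
The file integrity check can be done with gzip's built-in test mode, which detects truncated or corrupted archives without writing any output files. A minimal sketch follows; list_corrupt_gz is a hypothetical helper name.

```shell
#!/bin/sh
# list_corrupt_gz: print every *.gz file directly inside DIR that fails
# gzip's integrity test (gzip -t), for example because a download was
# interrupted part-way through.
list_corrupt_gz() {
    for f in "$1"/*.gz; do
        [ -e "$f" ] || continue          # the glob matched nothing
        gzip -t "$f" 2>/dev/null || echo "$f"
    done
}

# Example:
#   list_corrupt_gz archaea
#   list_corrupt_gz bacteria
```

Any file printed here is a candidate for re-download before gene model generation or RNA extraction is repeated.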

Related Tools

fetchproks.sh - Core genome download script generator
analyzegenes.sh - Prokaryotic gene model generator
cutRna.sh - RNA sequence extraction and kmer set creation
callgenes.sh - Gene prediction using generated models
gi2taxid.sh - Taxonomic ID processing for sequence headers
cutgff.sh - GFF-based sequence extraction
kmerfilterset.sh - Kmer set generation and filtering