fetchProkByGenus.sh
Comprehensive pipeline for downloading prokaryotic genomes organized by genus from NCBI RefSeq databases. Downloads both archaea and bacteria, generates a prokaryotic gene model for each domain and one for the combined dataset, then extracts ribosomal and transfer RNA sequences for downstream analysis.
Overview
This pipeline automates the complete workflow for obtaining and preparing prokaryotic genomic data from NCBI RefSeq. It systematically downloads genomes organized by genus, creates gene models for computational gene calling, and extracts RNA sequences for classification purposes. The pipeline is designed to build comprehensive reference datasets for prokaryotic analysis.
Prerequisites
System Requirements
- BBTools suite installed with all dependencies
- Network connectivity to NCBI FTP servers
- Sufficient storage space (potentially terabytes for complete datasets)
- Java runtime with adequate memory (minimum 1GB, 2GB+ recommended)
- Command-line tools: wget, gzip, basic Unix utilities
Network and Storage Considerations
- Bandwidth: RefSeq databases are large; ensure a stable internet connection
- Storage: Plan for both compressed (.fna.gz, .gff.gz) and processed files
- Time: Complete archaea and bacteria downloads can take days
- NCBI Access: Follow NCBI usage guidelines and rate limits
Pipeline Stages
Stage 1: Download Script Generation
Creates download scripts for both archaea and bacteria from NCBI RefSeq:
# Generate archaea download script
time fetchproks.sh ftp://ftp.ncbi.nih.gov:21/genomes/refseq/archaea archaea.sh 1>fetchA.o 2>&1
# Generate bacteria download script
time fetchproks.sh ftp://ftp.ncbi.nih.gov:21/genomes/refseq/bacteria bacteria.sh 1>fetchB.o 2>&1
The fetchproks.sh tool crawls NCBI's FTP site and generates shell scripts containing wget commands for downloading genomes and annotations. It intelligently selects the best assembly per genus based on contiguity metrics.
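Before running a generated script, it can help to confirm that it actually contains download commands. The following is a minimal sketch only: `count_wget_lines` is a hypothetical helper, and it assumes each download appears on a line that begins with `wget`, which may not match the exact layout fetchproks.sh emits.

```shell
#!/bin/sh
# Hypothetical helper (not part of BBTools): count the lines in a generated
# script that start with "wget".
# Assumption: the generated script has one wget command per line.
count_wget_lines() {
    # grep -c prints the number of matching lines; "|| true" hides grep's
    # non-zero exit status when the count is 0.
    grep -c '^[[:space:]]*wget' "$1" || true
}

# Example (not run here):
#   count_wget_lines archaea.sh
```

A count of zero suggests that script generation failed and that fetchA.o or fetchB.o should be inspected before moving on to Stage 2.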
Stage 2: Genome Download Execution
Executes the generated download scripts in organized directory structures:
# Setup and download archaea
mkdir archaea
cp archaea.sh archaea
cd archaea
sh archaea.sh
cd ..
# Setup and download bacteria
mkdir bacteria
cp bacteria.sh bacteria
cd bacteria
sh bacteria.sh
cd ..
This stage creates separate directories for each domain and executes the download scripts, retrieving both FASTA (.fna.gz) and GFF (.gff.gz) files for each selected genome.
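The mkdir/cp/cd/sh/cd sequence can also be written so that a failed `cd` never causes the download script to run in the wrong directory. This is a sketch under that goal, not how the pipeline script itself is written; `run_in_dir` is a hypothetical name.

```shell
#!/bin/sh
# Hypothetical helper (not part of BBTools): copy a download script into its
# own directory and run it there.
# The subshell leaves the caller's working directory unchanged, and the "&&"
# chain stops before running the script if any setup step fails.
run_in_dir() {
    dir=$1
    script=$2
    mkdir -p "$dir" && cp "$script" "$dir/" &&
        ( cd "$dir" && sh "./$(basename "$script")" )
}

# Example (not run here):
#   run_in_dir archaea archaea.sh
#   run_in_dir bacteria bacteria.sh
```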
Stage 3: Gene Model Generation
Analyzes downloaded genomes to create prokaryotic gene models (.pgm files):
# Generate archaea-specific gene model
time nice analyzegenes.sh archaea/*.fna.gz out=archaea.pgm -Xmx1g
# Generate bacteria-specific gene model
time nice analyzegenes.sh bacteria/*.fna.gz out=bacteria.pgm -Xmx1g
# Generate combined prokaryotic gene model
time nice analyzegenes.sh */*.fna.gz out=model.pgm -Xmx1g
The analyzegenes.sh tool processes genome and annotation files to extract gene characteristics, creating probabilistic models for computational gene calling. Three models are generated: domain-specific (archaea, bacteria) and combined (all prokaryotes).
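If a download directory is empty, a glob such as `archaea/*.fna.gz` is passed to the tool as a literal string. A small guard can catch this before model generation starts. The sketch below is illustrative; `has_fasta` is a hypothetical helper.

```shell
#!/bin/sh
# Hypothetical helper (not part of BBTools): succeed only if a directory
# holds at least one .fna.gz file.
has_fasta() {
    for f in "$1"/*.fna.gz; do
        # If the glob matched nothing, $f is the literal pattern and -e fails.
        [ -e "$f" ] && return 0
    done
    return 1
}

# Example (not run here):
#   has_fasta archaea && analyzegenes.sh archaea/*.fna.gz out=archaea.pgm -Xmx1g
```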
Stage 4: RNA Sequence Processing
Extracts and processes ribosomal and transfer RNA sequences:
# Extract RNA sequences and generate kmer sets
cutRna.sh
The final stage runs the cutRna.sh pipeline to extract 16S, 23S, 5S ribosomal RNAs and tRNAs from the downloaded genomes, then creates specialized kmer sets for sequence classification.
Algorithm Details
Genome Selection Strategy
The FetchProks Java class implements intelligent assembly selection using a multi-criteria approach:
- Genus-Level Organization: Groups genomes by genus to ensure taxonomic diversity
- Assembly Quality Ranking: Evaluates assemblies based on:
  - Maximum contig length (primary metric)
  - Total genome size
  - Contig count (fewer is better)
  - TaxID availability (annotated genomes preferred)
- Reference Priority: Prefers reference genomes first, then latest assembly versions, then all assembly versions
- Multi-threaded Processing: Uses 7 parallel threads, with each thread handling complete genera to avoid race conditions
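The ranking idea above can be illustrated with a multi-key sort. This is a sketch, not the FetchProks implementation: the tab-separated input layout (genus, assembly, max contig length, total size, contig count) and the name `best_per_genus` are assumptions made for the example, and TaxID availability and reference priority are left out.

```shell
#!/bin/sh
# Hypothetical illustration (not the FetchProks code): keep one assembly per genus.
# Assumed input on stdin, tab-separated:
#   genus  assembly  max_contig_len  total_size  contig_count
best_per_genus() {
    tab=$(printf '\t')
    # Sort by genus, then longest contig (descending), then total size
    # (descending), then contig count (ascending, fewer is better).
    sort -t "$tab" -k1,1 -k3,3nr -k4,4nr -k5,5n |
        # Keep only the first (best-ranked) line seen for each genus.
        awk -F '\t' '!seen[$1]++'
}

# Example (not run here):
#   best_per_genus < assemblies.tsv
```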
Gene Model Training
The AnalyzeGenes class processes genome collections to extract statistical patterns:
- Codon Usage Analysis: Determines organism-specific codon preferences
- Gene Length Distributions: Models typical gene lengths for different functional categories
- Intergenic Region Patterns: Analyzes spacing between genes
- GC Content Profiling: Captures base composition biases
Performance Optimization
The pipeline employs several optimization strategies:
- Parallel Downloads: Multi-threaded FTP access with retry mechanisms (40 retries default)
- Memory Management: Each analyzegenes.sh call in the pipeline runs with -Xmx1g; the tool's own default allocation is listed as 2GB
- Error Handling: Robust retry logic for network failures and corrupted downloads
- File Validation: Ensures paired .fna.gz and .gff.gz files are available before processing
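For readers who want similar retry behavior around their own commands, a generic shell wrapper is one option. The sketch below is not the retry logic inside FetchProks; `retry` and the `RETRY_DELAY` variable are hypothetical names introduced for the example.

```shell
#!/bin/sh
# Hypothetical helper (not part of BBTools): run a command up to N times,
# pausing between attempts. Returns 0 on the first success, 1 if every
# attempt fails.
retry() {
    max=$1
    shift
    attempt=1
    while ! "$@"; do
        if [ "$attempt" -ge "$max" ]; then
            return 1
        fi
        attempt=$((attempt + 1))
        # RETRY_DELAY is an assumed variable: seconds between attempts (default 1).
        sleep "${RETRY_DELAY:-1}"
    done
    return 0
}

# Example (not run here), mirroring the 40-retry figure above:
#   retry 40 wget -q "$url"
```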
Basic Usage
# Ensure adequate storage space (potentially terabytes)
df -h
# Run the complete pipeline
./fetchProkByGenus.sh
# Monitor progress (download logs are redirected)
tail -f fetchA.o # Archaea download progress
tail -f fetchB.o # Bacteria download progress
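The `df -h` check can be turned into a scripted guard that refuses to start when too little space is free. This is a sketch; `has_free_kb` is a hypothetical helper and the threshold in the usage line is only a placeholder, not a measured requirement.

```shell
#!/bin/sh
# Hypothetical helper (not part of BBTools): succeed if the filesystem that
# holds TARGET has at least NEED_KB kilobytes available.
has_free_kb() {
    need_kb=$1
    target=${2:-.}
    # df -P prints a single POSIX-format data line per filesystem;
    # with -k, column 4 is the available space in 1 KB blocks.
    avail_kb=$(df -Pk "$target" | awk 'NR == 2 { print $4 }')
    [ "$avail_kb" -ge "$need_kb" ]
}

# Example (not run here; 1073741824 KB is 1 TB, a placeholder threshold):
#   has_free_kb 1073741824 . && ./fetchProkByGenus.sh
```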
Configuration Options
While the main pipeline script uses default parameters, the underlying fetchproks.sh tool supports configuration:
fetchproks.sh Parameters
- <url> - NCBI RefSeq FTP URL (archaea or bacteria)
- <outfile> - Output shell script name
- <max species per genus> - Integer limiting species per genus (pipeline uses 1)
- <use best> - Boolean for assembly quality selection (pipeline uses true)
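The positional order of these parameters can be made explicit with a small command builder that only prints the command line. This is a sketch; `fetch_cmd` is a hypothetical helper, and its defaults simply mirror the values the pipeline is stated to use.

```shell
#!/bin/sh
# Hypothetical helper (not part of BBTools): print the fetchproks.sh command
# line for a RefSeq domain without running it.
# Argument order follows the parameter list:
#   url, outfile, max species per genus, use best
fetch_cmd() {
    domain=$1
    max_species=${2:-1}    # value the pipeline is described as using
    use_best=${3:-true}    # value the pipeline is described as using
    echo "fetchproks.sh ftp://ftp.ncbi.nih.gov:21/genomes/refseq/$domain $domain.sh $max_species $use_best"
}

# Example (not run here):
#   fetch_cmd bacteria 3 true
```

Printing the command rather than executing it makes it easy to review the arguments before starting a long crawl.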
Memory Allocation
- Gene Analysis: Default 1GB per analyzegenes.sh call
- FetchProks: Default 1GB for download coordination
- Scaling: Increase memory allocation for very large genome collections
Output Files and Directory Structure
Generated Download Scripts
- archaea.sh - Shell script with wget commands for archaea genomes
- bacteria.sh - Shell script with wget commands for bacteria genomes
- fetchA.o - Archaea download generation log
- fetchB.o - Bacteria download generation log
Downloaded Genome Data
archaea/
├── species1.fna.gz # Genome FASTA files
├── species1.gff.gz # Gene annotation files
├── species2.fna.gz
├── species2.gff.gz
└── ...
bacteria/
├── species1.fna.gz
├── species1.gff.gz
├── species2.fna.gz
├── species2.gff.gz
└── ...
Generated Models and Analysis Files
- archaea.pgm - Prokaryotic gene model trained on archaea genomes
- bacteria.pgm - Prokaryotic gene model trained on bacteria genomes
- model.pgm - Combined prokaryotic gene model for all domains
- RNA Files: Generated by cutRna.sh stage:
  - 16S.fa, 23S.fa, 5S.fa - Extracted ribosomal RNA sequences
  - tRNA.fa - Transfer RNA sequences
  - *_15mers.fa, *_9mers.fa - Kmer sets for classification
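A quick way to confirm that a run produced the named outputs is to list whichever ones are missing or empty. The sketch below is illustrative; `missing_outputs` is a hypothetical helper, and it checks only the fixed file names listed in this section, not the kmer sets, whose names are given as patterns.

```shell
#!/bin/sh
# Hypothetical helper (not part of BBTools): print each expected output file
# that is missing or empty in the given directory (default: current directory).
missing_outputs() {
    dir=${1:-.}
    for f in archaea.pgm bacteria.pgm model.pgm 16S.fa 23S.fa 5S.fa tRNA.fa; do
        # -s is true only for files that exist and are non-empty.
        [ -s "$dir/$f" ] || echo "$f"
    done
}

# Example (not run here):
#   missing_outputs .
```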
Monitoring and Troubleshooting
Progress Monitoring
# Monitor download script generation
tail -f fetchA.o
tail -f fetchB.o
# Check download progress (after scripts execute)
ls archaea/*.fna.gz | wc -l # Count downloaded archaea genomes
ls bacteria/*.fna.gz | wc -l # Count downloaded bacteria genomes
# Monitor gene model generation
# (Output appears directly in terminal)
Common Issues
- Network Timeouts: FetchProks includes 40-retry logic for failed downloads
- Storage Full: Monitor disk usage; RefSeq can require terabytes
- Memory Errors: Increase -Xmx values if analyzegenes.sh fails
- FTP Access: NCBI may temporarily block high-volume access
- Incomplete Downloads: Check download logs for wget failures
Validation Steps
# Verify downloaded genome counts
find archaea/ -name "*.fna.gz" | wc -l
find bacteria/ -name "*.fna.gz" | wc -l
# Check for paired files (should be equal counts)
find archaea/ -name "*.gff.gz" | wc -l
find bacteria/ -name "*.gff.gz" | wc -l
# Verify gene models were created
ls -la *.pgm
# Check RNA extraction results
ls -la *.fa
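Equal counts do not guarantee that every genome has its own annotation file, so a per-file check can be more precise. This is a sketch; `unpaired_fasta` is a hypothetical helper, and it assumes paired files share the same name up to the extension, as in the directory layout shown earlier.

```shell
#!/bin/sh
# Hypothetical helper (not part of BBTools): print every .fna.gz file in a
# directory that has no matching .gff.gz file.
# Assumption: a pair is name.fna.gz and name.gff.gz in the same directory.
unpaired_fasta() {
    for fna in "$1"/*.fna.gz; do
        [ -e "$fna" ] || continue            # glob matched nothing
        gff="${fna%.fna.gz}.gff.gz"          # swap the extension
        [ -e "$gff" ] || echo "$fna"
    done
}

# Example (not run here):
#   unpaired_fasta archaea
#   unpaired_fasta bacteria
```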
Pipeline Components
fetchproks.sh
Core download coordination tool that crawls NCBI RefSeq FTP directories:
- Input: NCBI RefSeq FTP URL, output script name
- Algorithm: Multi-threaded directory crawling with assembly quality assessment
- Selection Criteria: Best assembly per genus based on contig metrics
- Output: Shell script with optimized wget commands
analyzegenes.sh
Generates probabilistic gene models from genome collections:
- Input: FASTA genome files and optional GFF annotations
- Algorithm: Statistical analysis of gene patterns, codon usage, and genome organization
- Output: Binary .pgm files for use with CallGenes tool
- Memory: Default 2GB allocation for model training
cutRna.sh
Extracts and processes RNA sequences for classification:
- Targets: 16S, 23S, 5S ribosomal RNAs and tRNAs
- Method: GFF annotation-based sequence extraction
- Output: FASTA files and specialized kmer sets
- Purpose: Reference datasets for RNA identification and classification
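Because extraction is annotation-based, it can be useful to see how many rRNA and tRNA features a GFF file actually contains. The sketch below is an illustration and not how cutRna.sh works internally; `count_rna_features` is a hypothetical helper that relies on the GFF3 convention of storing the feature type in column 3.

```shell
#!/bin/sh
# Hypothetical helper (not part of BBTools): count rRNA and tRNA features in
# GFF text read from stdin.
# In GFF3, column 3 is the feature type and lines starting with "#" are
# directives or comments.
count_rna_features() {
    awk -F '\t' '
        /^#/         { next }
        $3 == "rRNA" { r++ }
        $3 == "tRNA" { t++ }
        END          { printf "rRNA\t%d\ntRNA\t%d\n", r, t }
    '
}

# Example (not run here):
#   gzip -dc archaea/species1.gff.gz | count_rna_features
```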
Advanced Configuration
Customizing Download Parameters
For custom download behavior, modify the fetchproks.sh calls:
# Example: Download multiple species per genus
fetchproks.sh ftp://ftp.ncbi.nih.gov:21/genomes/refseq/bacteria bacteria.sh 3 true
# Example: Download all assemblies (not just best)
fetchproks.sh ftp://ftp.ncbi.nih.gov:21/genomes/refseq/archaea archaea.sh 0 false
Memory Scaling
For large datasets, increase memory allocation:
# Increase memory for gene analysis
analyzegenes.sh archaea/*.fna.gz out=archaea.pgm -Xmx4g
Selective Processing
Run individual pipeline stages:
# Only generate download scripts (no execution)
fetchproks.sh ftp://ftp.ncbi.nih.gov:21/genomes/refseq/bacteria bacteria.sh
# Only process existing genomes for gene models
analyzegenes.sh existing_genomes/*.fna.gz out=custom.pgm
# Only extract RNA sequences
cutRna.sh
Use Cases and Applications
Reference Database Construction
- Taxonomic Classification: Build comprehensive prokaryotic reference for metagenomic analysis
- Gene Prediction: Train domain-specific models for improved gene calling accuracy
- Comparative Genomics: Create standardized datasets for cross-genome analysis
- Phylogenetic Studies: Obtain representative genomes across prokaryotic diversity
Downstream Analysis Integration
- CallGenes: Use generated .pgm models for prokaryotic gene prediction
- SendSketch: Compare unknown sequences against downloaded reference genomes
- BBMap/BBSplit: Use genomes as references for read mapping and contamination detection
- Taxonomy Assignment: Use extracted RNA sequences for taxonomic classification
Performance Characteristics
Download Performance
- Parallelization: 7-thread processing for optimal FTP utilization
- Network Efficiency: Intelligent retry logic handles temporary failures
- Selection Algorithm: O(n log n) sorting by assembly quality metrics
- Storage Pattern: Genus-based organization prevents filename conflicts
Memory and Processing
- FetchProks: 1GB default, scales with directory listing size
- AnalyzeGenes: 1-2GB default, depends on genome collection size
- RNA Processing: Variable, depends on annotation density
- Disk I/O: Optimized for compressed file processing
Scalability Considerations
- Genome Count: Handles thousands of genomes efficiently
- File Size: Processes multi-gigabyte genomes without issues
- Network Robustness: Designed for unreliable network conditions
- Resource Management: Conservative memory allocation prevents system overload
Integration with BBTools Ecosystem
Input to Other Tools
- callgenes.sh: Uses generated .pgm models for gene prediction
- sendsketch.sh: Can use downloaded genomes as custom reference database
- bbmap.sh: Downloaded genomes serve as mapping references
- bbsplit.sh: Use for contamination detection against prokaryotic references
Reference Data Creation
- Taxonomy Databases: Provides taxonomically organized genome collections
- Gene Calling Models: Domain-specific and combined prokaryotic models
- RNA Classification: Kmer sets for ribosomal and transfer RNA identification
- Contamination Detection: Reference datasets for identifying prokaryotic contamination
Notes and Best Practices
Data Management
- Storage Planning: Complete RefSeq bacteria can exceed 500GB compressed
- Update Frequency: NCBI RefSeq updates monthly; plan for periodic re-downloads
- Backup Strategy: Consider backing up generated models and processed datasets
- Cleanup: Remove intermediate files after successful model generation
Network Considerations
- NCBI Courtesy: Avoid excessive parallel connections; respect server resources
- Resume Capability: Pipeline can be restarted; existing files are typically skipped
- Bandwidth Management: Schedule downloads during off-peak hours if possible
- Mirror Usage: Consider using NCBI mirrors for international access
Quality Control
- File Integrity: Verify downloaded files are complete and uncorrupted
- Annotation Quality: Ensure GFF files contain required feature types (rRNA, tRNA)
- Model Validation: Test generated .pgm files with known sequences
- Taxonomic Coverage: Verify adequate genus representation in final datasets
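The file integrity check can be scripted with gzip's built-in test mode. This is a minimal sketch; `corrupt_gz` is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper (not part of BBTools): print every .gz file in a
# directory that fails gzip's integrity test (truncated or corrupted).
corrupt_gz() {
    for f in "$1"/*.gz; do
        [ -e "$f" ] || continue              # glob matched nothing
        # gzip -t decompresses to nowhere and returns non-zero on errors.
        gzip -t "$f" 2>/dev/null || echo "$f"
    done
}

# Example (not run here):
#   corrupt_gz archaea
#   corrupt_gz bacteria
```

Files reported by this check are candidates for deletion and re-download.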
Related Tools
- fetchproks.sh - Core genome download script generator
- analyzegenes.sh - Prokaryotic gene model generator
- cutRna.sh - RNA sequence extraction and kmer set creation
- callgenes.sh - Gene prediction using generated models
- gi2taxid.sh - Taxonomic ID processing for sequence headers
- cutgff.sh - GFF-based sequence extraction
- kmerfilterset.sh - Kmer set generation and filtering