FetchProks
Writes a shell script to download one genome assembly and GFF per genus or species from NCBI. Attempts to select the best assembly on the basis of contiguity. Directories are crawled over FTP with ServerTools.listDirectory(), and work is distributed across threads by genus.
Basic Usage
fetchproks.sh <url> <outfile> <max species per genus: int> <use best: t/f>
This tool generates a shell script containing wget commands to download genome assemblies and GFF annotation files from NCBI's FTP servers.
Parameters
FetchProks uses positional arguments instead of traditional parameter flags:
- <url>: Base FTP URL from NCBI genomes directory (e.g., ftp://ftp.ncbi.nih.gov/genomes/refseq/bacteria/)
- <outfile>: Output shell script filename that will contain the wget commands
- <max species per genus>: Integer limiting the number of species to download per genus. Use 0 for no limit, 1 for one species per genus, etc.
- <use best>: Boolean (t/f or true/false) determining whether to select the best assembly based on quality metrics. When true, analyzes all assemblies and picks the one with the longest contigs.
Examples
Download Bacterial Genomes
fetchproks.sh ftp://ftp.ncbi.nih.gov/genomes/refseq/bacteria/ bacteria.sh 2 true
Downloads up to 2 species per genus from bacterial RefSeq, selecting the best assembly for each species based on contiguity metrics.
Download All Archaea
fetchproks.sh ftp://ftp.ncbi.nih.gov/genomes/refseq/archaea/ archaea.sh 0 true
Downloads all archaeal species (no genus limit) with best assembly selection enabled.
Download Viral Genomes
fetchproks.sh ftp://ftp.ncbi.nih.gov/genomes/refseq/viral/ viral.sh 0 true
Downloads all viral genomes, selecting the best assembly for each species.
Download Fungal Genomes
fetchproks.sh ftp://ftp.ncbi.nih.gov/genomes/refseq/fungi/ fungi.sh 0 true
Downloads all fungal genomes with best assembly selection.
Download Plant Genomes
fetchproks.sh ftp://ftp.ncbi.nih.gov/genomes/refseq/plant/ plant.sh 0 true
Downloads all plant genomes, selecting the highest quality assembly for each species.
Download Vertebrate Genomes
fetchproks.sh ftp://ftp.ncbi.nih.gov/genomes/refseq/vertebrate_mammalian/ vertebrate_mammalian.sh 0 true
fetchproks.sh ftp://ftp.ncbi.nih.gov/genomes/refseq/vertebrate_other/ vertebrate_other.sh 0 true
Downloads vertebrate genomes from both mammalian and other vertebrate categories.
Algorithm Details
Multi-threaded Processing
FetchProks uses a fixed 7-thread architecture with genus-based work distribution. Thread assignment uses (genus.hashCode()&Integer.MAX_VALUE)%threads to ensure all species within a genus are processed by the same thread, maintaining consistency in species selection and preventing race conditions in per-genus counting. Each ProcessThread maintains its own private HashMap<String, Integer> for genus tracking, eliminating synchronization overhead during processing. The genus extraction method normalizes "Candidatus_" prefixes using substring operations before hash-based thread assignment.
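The thread assignment expression above can be tried in isolation. This is a minimal sketch, not the tool's code: the class name GenusThreadAssignment and the method threadFor are hypothetical, and only the masked-hash expression comes from the text.

```java
// Hypothetical helper; only the hash expression is taken from the description above.
class GenusThreadAssignment {

    /** Maps a genus to a thread index in [0, threads). */
    static int threadFor(String genus, int threads) {
        // Masking with Integer.MAX_VALUE clears the sign bit, so the
        // remainder is never negative even when hashCode() is negative.
        return (genus.hashCode() & Integer.MAX_VALUE) % threads;
    }

    public static void main(String[] args) {
        int threads = 7;
        String[] genera = {"Escherichia", "Bacillus", "Pelagibacter", "Escherichia"};
        for (String genus : genera) {
            // The same genus always prints the same thread index.
            System.out.println(genus + " -> thread " + threadFor(genus, threads));
        }
    }
}
```

Because the index depends only on the genus string, every species of a genus lands on the same thread, which is what allows each thread to keep a private per-genus counter.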
Assembly Quality Assessment
When "use best" is enabled, the tool performs assembly quality analysis using the Stats compareTo() method:
- Assembly Report Parsing: Downloads _assembly_report.txt files via ServerTools.readFTPFile() and parses tab-delimited lines extracting length from split[8] field
- Quality Metrics: Calculates total assembly size, maximum contig length via Tools.max(max, len), contig count, and taxonomic ID from "# Taxid:" header lines
- Ranking Algorithm: The Stats.compareTo() method implements a strict hierarchical comparison:
- Primary criterion: Valid taxonomic ID presence (assemblies with taxID > 0 always rank higher than those with taxID < 1)
- Size filter: Rejects assemblies with a >2x size difference using the comparisons size>2*b.size and size<2*b.size (prevents selection of misassembled or incomplete assemblies)
- Contiguity preference: Selects the assembly with the longest maximum contig length
- Final tiebreaker: Prefers assemblies with fewer total contigs (b.contigs - contigs comparison)
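The parsing and ranking steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not the tool's Stats class: the class name AssemblyStats and the parse method are hypothetical, and the size check is written as a symmetric "more than twice as large" test, which is one reading of the prose and differs from the literal comparisons quoted above.

```java
import java.util.List;

// Illustrative stand-in for the Stats class; names and layout are assumptions.
class AssemblyStats implements Comparable<AssemblyStats> {
    long size = 0;    // total assembly length
    long max = 0;     // longest sequence length
    int contigs = 0;  // number of sequences counted
    int taxID = -1;   // taxonomic ID, or -1 if no "# Taxid:" line was seen

    /** Builds stats from the lines of an assembly report. */
    static AssemblyStats parse(List<String> lines) {
        AssemblyStats s = new AssemblyStats();
        for (String line : lines) {
            if (line.startsWith("# Taxid:")) {
                s.taxID = Integer.parseInt(line.substring("# Taxid:".length()).trim());
            } else if (!line.startsWith("#") && !line.isEmpty()) {
                String[] split = line.split("\t");
                if (split.length <= 8) { continue; }
                long len;
                try {
                    len = Long.parseLong(split[8]); // length field
                } catch (NumberFormatException e) {
                    continue; // skip rows without a numeric length
                }
                s.size += len;
                s.max = Math.max(s.max, len);
                s.contigs++;
            }
        }
        return s;
    }

    /** A positive result means this assembly is preferred over b. */
    @Override
    public int compareTo(AssemblyStats b) {
        // 1. A valid taxID always wins over a missing one.
        if (taxID > 0 && b.taxID < 1) { return 1; }
        if (taxID < 1 && b.taxID > 0) { return -1; }
        // 2. Assumed symmetric 2x size check (an interpretation, see lead-in).
        if (size > 2 * b.size) { return 1; }
        if (b.size > 2 * size) { return -1; }
        // 3. Longer maximum contig is preferred.
        if (max != b.max) { return Long.compare(max, b.max); }
        // 4. Fewer contigs is preferred.
        return b.contigs - contigs;
    }
}
```

With this ordering, picking the best candidate is a matter of keeping whichever AssemblyStats compares highest while iterating over a species' assemblies.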
File Selection Strategy
The tool implements a hierarchical search for the best assemblies in each species directory:
- Reference assemblies first: Searches "reference" subdirectories
- Latest versions: Falls back to "latest_assembly_versions" if no reference found
- All versions: Uses "all_assembly_versions" as final fallback
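The fallback order above amounts to trying each subdirectory in turn and stopping at the first one that yields results. The sketch below illustrates that pattern only; AssemblyDirSearch and findAssemblies are hypothetical names, and the lister function is a stand-in for a real FTP directory listing call.

```java
import java.util.Collections;
import java.util.List;
import java.util.function.Function;

// Hypothetical sketch; the lister parameter stands in for an FTP listing call.
class AssemblyDirSearch {
    // Subdirectory names in the order described above.
    static final String[] SEARCH_ORDER = {
        "reference", "latest_assembly_versions", "all_assembly_versions"
    };

    /** Returns the entries of the first non-empty subdirectory, or an empty list. */
    static List<String> findAssemblies(String speciesDir,
                                       Function<String, List<String>> lister) {
        for (String sub : SEARCH_ORDER) {
            List<String> found = lister.apply(speciesDir + "/" + sub);
            if (found != null && !found.isEmpty()) {
                return found; // stop at the highest-priority directory with content
            }
        }
        return Collections.emptyList();
    }
}
```

Passing the listing call in as a function keeps the fallback logic testable without network access.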
Output Generation
Creates wget commands for paired genomic.fna.gz and genomic.gff.gz files, skipping any "_from_genomic" variants. Three output modes are controlled by boolean flags:
- Sequence Renaming (renameSequences=true): Pipes wget output through gi2taxid.sh with "deleteinvalid zl=9 server -Xmx1g" parameters to add taxonomic information to sequence headers
- File Renaming (renameFiles=true): Uses wget -O to redirect downloads to species-named files instead of preserving accession-based filenames
- TaxID Integration (tidInFilename=true): Prefixes output filenames with "tid_[taxID]_" pattern for taxonomic database organization
All wget command lines are written through a shared, synchronized TextStreamWriter so that output from concurrent threads is not interleaved or corrupted.
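As a rough illustration of the file filtering and naming rules above, the sketch below builds output names and wget lines. Everything here is hypothetical (the class OutputNaming, its methods, and the exact command layout); the tool's real command strings may differ, and only the "_from_genomic" skip, the wget -O redirect, and the tid_[taxID]_ prefix come from the text.

```java
// Hypothetical sketch of the naming rules; not the tool's actual output code.
class OutputNaming {

    /** True for genomic.fna.gz / genomic.gff.gz files that are not "_from_genomic" variants. */
    static boolean isWanted(String file) {
        if (file.contains("_from_genomic")) { return false; }
        return file.endsWith("genomic.fna.gz") || file.endsWith("genomic.gff.gz");
    }

    /** Species-based file name, optionally prefixed with tid_[taxID]_. */
    static String outputName(String species, int taxID, String extension,
                             boolean tidInFilename) {
        String prefix = (tidInFilename && taxID > 0) ? "tid_" + taxID + "_" : "";
        return prefix + species + extension;
    }

    /** One wget line; with renameFiles, -O writes to outName instead of the remote name. */
    static String wgetLine(String url, String outName, boolean renameFiles) {
        return renameFiles ? "wget -O " + outName + " " + url : "wget " + url;
    }
}
```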
Genus Management
Implements genus-level filtering to prevent over-representation of prolific genera using thread-local HashMap operations:
- Tracks species count per genus using the seen(genus, seen) and put(genus, found, seen) methods across all ProcessThread instances
- Enforces the maximum species per genus limit using the conditional maxSpeciesPerGenus<1 || count<maxSpeciesPerGenus
- Handles "Candidatus" prefix normalization using name.substring("Candidatus_".length()) before genus extraction
- Genus extraction uses name.indexOf('_') and name.substring(0, under) to isolate the genus component
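The genus extraction and per-genus limit can be combined into a small sketch. The class GenusLimiter and its accept method are hypothetical and do not mirror the tool's seen/put helpers; the substring, indexOf, and limit expressions follow the ones listed above.

```java
import java.util.HashMap;

// Hypothetical sketch; one instance would belong to a single worker thread.
class GenusLimiter {
    private final HashMap<String, Integer> seen = new HashMap<>();
    private final int maxSpeciesPerGenus;

    GenusLimiter(int maxSpeciesPerGenus) {
        this.maxSpeciesPerGenus = maxSpeciesPerGenus;
    }

    /** Strips a leading "Candidatus_" and returns the text before the first underscore. */
    static String genusOf(String name) {
        if (name.startsWith("Candidatus_")) {
            name = name.substring("Candidatus_".length());
        }
        int under = name.indexOf('_');
        return under < 0 ? name : name.substring(0, under);
    }

    /** Returns true and counts the species if its genus is still under the limit. */
    boolean accept(String speciesName) {
        String genus = genusOf(speciesName);
        int count = seen.getOrDefault(genus, 0);
        // A limit below 1 means "no limit".
        if (maxSpeciesPerGenus < 1 || count < maxSpeciesPerGenus) {
            seen.put(genus, count + 1);
            return true;
        }
        return false;
    }
}
```

For example, with a limit of 2, the third Escherichia species offered to accept is turned away while the first species of any other genus is still accepted.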
Error Handling and Retries
Network resilience is implemented with configurable retry logic (retries=40 by default). For assembly report downloads, the tool catches exceptions and sleeps before retrying: Thread.sleep(Tools.mid(10000, i*1000, 1000)). The delay grows linearly with the attempt number i and is clamped between 1 and 10 seconds. Directory listing operations in ServerTools.listDirectory() use the same retry mechanism to handle temporary FTP server unavailability.
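The retry loop with a clamped delay can be sketched as follows. This is an assumption-laden illustration: RetrySketch, Sleeper, delayMillis, and withRetries are hypothetical names, and the clamp is written with Math.max/Math.min on the assumption that Tools.mid returns the middle of its three arguments.

```java
import java.util.concurrent.Callable;

// Hypothetical sketch of retry-with-clamped-delay; not the tool's code.
class RetrySketch {

    /** Pluggable sleep so the loop can be exercised without real waiting. */
    interface Sleeper {
        void sleep(long millis) throws InterruptedException;
    }

    /** i*1000 ms clamped to the range [1000, 10000]. */
    static long delayMillis(int attempt) {
        return Math.max(1000L, Math.min(10000L, attempt * 1000L));
    }

    /** Runs task, retrying up to 'retries' more times after a failure. */
    static <T> T withRetries(Callable<T> task, int retries, Sleeper sleeper)
            throws Exception {
        Exception last = null;
        for (int i = 0; i <= retries; i++) {
            try {
                return task.call();
            } catch (Exception e) {
                last = e;
                if (i < retries) {
                    sleeper.sleep(delayMillis(i)); // wait longer on later attempts
                }
            }
        }
        throw last != null ? last : new IllegalArgumentException("retries must be >= 0");
    }
}
```

In real use the sleeper would be Thread::sleep; a no-op sleeper is convenient in tests.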
Memory and Performance
Uses 1GB of memory by default (-Xmx1g), allocated through the calcXmx() function. This is sufficient for processing ServerTools.listDirectory() results and assembly statistics. The 7-thread architecture processes directory structures in parallel, with each thread maintaining a separate HashMap instance for genus tracking, which avoids contention during concurrent operations.
Notes
- Organellar genomes: Mitochondrial, plasmid, and plastid genomes require different handling; use the gbff2gff tool for those instead
- Network requirements: Requires stable internet connection to NCBI FTP servers
- Output execution: The generated shell script must be executed separately to actually download the files
- File formats: Downloads compressed files (.gz) to save bandwidth and storage space
- Directory structure: NCBI's directory structure can change; tool may need updates for new organizational schemes
Common Use Cases
- Phylogenetic studies: Download representative genomes for comparative genomics
- Database construction: Build local genome databases for BLAST or other analyses
- Taxonomic surveys: Obtain genomes across taxonomic groups for diversity studies
- Reference genome collection: Download high-quality reference genomes for specific clades
- Annotation pipeline setup: Gather genomes and annotations for comparative annotation projects
Support
For questions and support:
- Email: bbushnell@lbl.gov
- Documentation: bbmap.org