FetchProks

Basic Usage

fetchproks.sh <url> <outfile> <max species per genus: int> <use best: t/f>

This tool generates a shell script containing wget commands to download genome assemblies and GFF annotation files from NCBI's FTP servers.

Parameters

FetchProks takes positional arguments rather than named parameter flags, so the order below matters:

<url>: Base FTP URL from NCBI genomes directory (e.g., ftp://ftp.ncbi.nih.gov/genomes/refseq/bacteria/)
<outfile>: Output shell script filename that will contain the wget commands
<max species per genus>: Integer limiting the number of species to download per genus. Use 0 for no limit, 1 for one species per genus, etc.
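<use best>: Boolean (t/f or true/false) determining whether to select the best assembly based on quality metrics. When true, the tool compares all candidate assemblies for a species and picks the top-ranked one; the ranking considers taxonomic ID presence, total size, longest contig, and contig count (see Assembly Quality Assessment).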

Examples

Download Bacterial Genomes

fetchproks.sh ftp://ftp.ncbi.nih.gov/genomes/refseq/bacteria/ bacteria.sh 2 true

Downloads up to 2 species per genus from bacterial RefSeq, selecting the best assembly for each species based on contiguity metrics.

Download All Archaea

fetchproks.sh ftp://ftp.ncbi.nih.gov/genomes/refseq/archaea/ archaea.sh 0 true

Downloads all archaeal species (no genus limit) with best assembly selection enabled.

Download Viral Genomes

fetchproks.sh ftp://ftp.ncbi.nih.gov/genomes/refseq/viral/ viral.sh 0 true

Downloads all viral genomes, selecting the best assembly for each species.

Download Fungal Genomes

fetchproks.sh ftp://ftp.ncbi.nih.gov/genomes/refseq/fungi/ fungi.sh 0 true

Downloads all fungal genomes with best assembly selection.

Download Plant Genomes

fetchproks.sh ftp://ftp.ncbi.nih.gov/genomes/refseq/plant/ plant.sh 0 true

Downloads all plant genomes, selecting the highest quality assembly for each species.

Download Vertebrate Genomes

fetchproks.sh ftp://ftp.ncbi.nih.gov/genomes/refseq/vertebrate_mammalian/ vertebrate_mammalian.sh 0 true
fetchproks.sh ftp://ftp.ncbi.nih.gov/genomes/refseq/vertebrate_other/ vertebrate_other.sh 0 true

Downloads vertebrate genomes from both mammalian and other vertebrate categories.

Algorithm Details

Multi-threaded Processing

FetchProks uses a fixed 7-thread architecture with genus-based work distribution. Each species is assigned to a thread by (genus.hashCode()&Integer.MAX_VALUE)%threads, so all species within a genus are processed by the same thread. This keeps species selection consistent and prevents race conditions in per-genus counting. Each ProcessThread maintains its own private HashMap<String, Integer> for genus tracking, so no synchronization is needed during processing. Before the hash is computed, the genus extraction method strips any "Candidatus_" prefix using substring operations.
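
The assignment rule above can be illustrated with a minimal sketch. The class and method names are hypothetical, and the handling of names without an underscore is an assumption; only the hash expression and the "Candidatus_" stripping come from the description.

```java
// Sketch only: GenusThreadSketch, genusOf and threadFor are hypothetical names.
class GenusThreadSketch {

    /** Strip a leading "Candidatus_" and return the text before the first underscore. */
    static String genusOf(String name) {
        final String prefix = "Candidatus_";
        if (name.startsWith(prefix)) {
            name = name.substring(prefix.length());
        }
        int under = name.indexOf('_');
        // Assumption: a name with no underscore is treated as the genus itself.
        return under < 0 ? name : name.substring(0, under);
    }

    /** Map a genus to a thread index in [0, threads). */
    static int threadFor(String genus, int threads) {
        // Masking with Integer.MAX_VALUE clears the sign bit, so the remainder is never negative.
        return (genus.hashCode() & Integer.MAX_VALUE) % threads;
    }

    public static void main(String[] args) {
        String[] names = {"Escherichia_coli", "Escherichia_albertii", "Candidatus_Pelagibacter_ubique"};
        for (String n : names) {
            String g = genusOf(n);
            System.out.println(n + " -> genus " + g + ", thread " + threadFor(g, 7));
        }
    }
}
```

Because the thread index depends only on the genus string, both Escherichia entries land on the same thread, which is what allows per-thread counting without locks.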

Assembly Quality Assessment

When "use best" is enabled, the tool performs assembly quality analysis using the Stats compareTo() method:

Assembly Report Parsing: Downloads _assembly_report.txt files via ServerTools.readFTPFile() and parses tab-delimited lines extracting length from split[8] field
Quality Metrics: Calculates total assembly size, maximum contig length via Tools.max(max, len), contig count, and taxonomic ID from "# Taxid:" header lines
Ranking Algorithm: The Stats.compareTo() method implements a strict hierarchical comparison:
- Primary criterion: Valid taxonomic ID presence (assemblies with taxID > 0 always rank higher than those with taxID < 1)
- Size filter: Rejects assemblies with >2x size difference using comparisons size>2*b.size and size<2*b.size (prevents selection of misassembled or incomplete assemblies)
- Contiguity preference: Selects assembly with longest maximum contig length
- Final tiebreaker: Prefers assemblies with fewer total contigs (b.contigs - contigs comparison)
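
The parsing and ranking steps above can be sketched roughly as follows. This is not the tool's source: the class name is hypothetical, skipping malformed rows is an assumption, and the size check uses a symmetric 2x test in which the much larger assembly ranks higher, which is also an assumption rather than a copy of the comparisons quoted above.

```java
import java.util.List;

// Sketch only: StatsSketch is a hypothetical stand-in for the Stats class described above.
class StatsSketch implements Comparable<StatsSketch> {
    long size;       // total assembly length
    long max;        // longest contig
    int contigs;     // number of contigs
    int taxID = -1;  // from the "# Taxid:" header, or -1 if absent

    /** Build stats from the lines of an _assembly_report.txt file. */
    static StatsSketch parse(List<String> lines) {
        StatsSketch s = new StatsSketch();
        for (String line : lines) {
            if (line.startsWith("# Taxid:")) {
                s.taxID = Integer.parseInt(line.substring("# Taxid:".length()).trim());
            } else if (!line.startsWith("#") && !line.isEmpty()) {
                String[] split = line.split("\t");
                if (split.length <= 8) { continue; } // assumption: skip malformed rows
                long len = Long.parseLong(split[8]); // sequence length column
                s.size += len;
                s.max = Math.max(s.max, len);
                s.contigs++;
            }
        }
        return s;
    }

    /** A positive result means this assembly ranks above b. */
    @Override
    public int compareTo(StatsSketch b) {
        if (taxID > 0 && b.taxID < 1) { return 1; }
        if (taxID < 1 && b.taxID > 0) { return -1; }
        // Assumed symmetric 2x size check; the direction (larger wins) is an assumption.
        if (size > 2 * b.size) { return 1; }
        if (2 * size < b.size) { return -1; }
        if (max != b.max) { return max > b.max ? 1 : -1; }
        return b.contigs - contigs; // fewer contigs ranks higher
    }
}
```

With this ordering, the best candidate for a species is simply the maximum of its parsed reports, for example via java.util.Collections.max over a list of such objects.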

File Selection Strategy

The tool implements a hierarchical search for the best assemblies in each species directory:

Reference assemblies first: Searches "reference" subdirectories
Latest versions: Falls back to "latest_assembly_versions" if no reference found
All versions: Uses "all_assembly_versions" as final fallback
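
The fallback order can be sketched as a simple loop. The lister argument is a hypothetical stand-in for a directory-listing call such as ServerTools.listDirectory(), whose real signature is not shown here; the class and method names are likewise illustrative.

```java
import java.util.Collections;
import java.util.List;
import java.util.function.Function;

// Sketch only: AssemblyDirSketch and findAssemblies are hypothetical names.
class AssemblyDirSketch {
    // Search order taken from the list above.
    static final String[] ORDER = {"reference", "latest_assembly_versions", "all_assembly_versions"};

    /** Return the listing of the first subdirectory that has entries, or an empty list. */
    static List<String> findAssemblies(String speciesUrl, Function<String, List<String>> lister) {
        for (String sub : ORDER) {
            List<String> found = lister.apply(speciesUrl + sub + "/");
            if (found != null && !found.isEmpty()) {
                return found; // stop at the first level that yields anything
            }
        }
        return Collections.emptyList();
    }
}
```

Passing the lister as a function keeps the sketch testable without network access; a map lookup can stand in for the FTP server.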

Output Generation

Creates wget commands for paired genomic.fna.gz and genomic.gff.gz files, skipping any "_from_genomic" variants. Three output modes are controlled by boolean flags:

Sequence Renaming (renameSequences=true): Pipes wget output through gi2taxid.sh with "deleteinvalid zl=9 server -Xmx1g" parameters to add taxonomic information to sequence headers
File Renaming (renameFiles=true): Uses wget -O to redirect downloads to species-named files instead of preserving accession-based filenames
TaxID Integration (tidInFilename=true): Prefixes output filenames with "tid_[taxID]_" pattern for taxonomic database organization

All writes of wget commands to the output script are synchronized through the TextStreamWriter, so lines produced by different threads cannot be interleaved or corrupted.
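
As a rough illustration of the file filter and the renaming modes, the sketch below builds one wget line per file. The exact command text the tool emits is not documented here, so the output format, the class and method names, and the choice to apply the "tid_" prefix only when files are renamed are all assumptions. The gi2taxid.sh pipe is omitted because its full command line is not given above.

```java
// Sketch only: WgetSketch, wanted and command are hypothetical names.
class WgetSketch {

    /** True for the paired genomic files, excluding "_from_genomic" variants. */
    static boolean wanted(String file) {
        if (file.contains("_from_genomic")) { return false; }
        return file.endsWith("genomic.fna.gz") || file.endsWith("genomic.gff.gz");
    }

    /** Build one wget line; with renameFiles the download goes to a species-named file via -O. */
    static String command(String url, String species, int taxID,
                          boolean renameFiles, boolean tidInFilename) {
        if (!renameFiles) {
            return "wget " + url; // keep the accession-based filename
        }
        String ext = url.endsWith(".gff.gz") ? ".gff.gz" : ".fna.gz";
        // Assumption: the taxID prefix is only meaningful when files are renamed.
        String prefix = tidInFilename ? "tid_" + taxID + "_" : "";
        return "wget -O " + prefix + species + ext + " " + url;
    }
}
```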

Genus Management

Implements genus-level filtering to prevent over-representation of prolific genera using thread-local HashMap operations:

Tracks species count per genus using seen(genus, seen) and put(genus, found, seen) methods within each ProcessThread instance
Enforces the maximum species per genus limit using the condition maxSpeciesPerGenus<1 || count<maxSpeciesPerGenus, so a limit below 1 means no limit
Handles "Candidatus" prefix normalization using name.substring("Candidatus_".length()) before genus extraction
Genus extraction uses name.indexOf('_') and name.substring(0, under) to isolate genus component
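
The per-genus limit can be sketched with a plain HashMap, one instance per thread. The class and its method names are hypothetical and do not mirror the seen() and put() signatures mentioned above; only the limit condition is taken from the description.

```java
import java.util.HashMap;

// Sketch only: GenusLimitSketch, allowed and record are hypothetical names.
class GenusLimitSketch {
    private final HashMap<String, Integer> seen = new HashMap<>();
    private final int maxSpeciesPerGenus;

    GenusLimitSketch(int maxSpeciesPerGenus) {
        this.maxSpeciesPerGenus = maxSpeciesPerGenus;
    }

    /** True if another species of this genus may still be taken. */
    boolean allowed(String genus) {
        int count = seen.getOrDefault(genus, 0);
        // A limit below 1 disables the check entirely.
        return maxSpeciesPerGenus < 1 || count < maxSpeciesPerGenus;
    }

    /** Record that one more species of this genus was accepted. */
    void record(String genus) {
        seen.merge(genus, 1, Integer::sum);
    }
}
```

Since each thread owns its own instance and a genus always maps to a single thread, the counts stay correct without any locking.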

Error Handling and Retries

Network resilience is implemented with configurable retry logic (retries=40 by default). For assembly report downloads, the tool catches exceptions and waits before retrying: Thread.sleep(Tools.mid(10000, i*1000, 1000)). Because Tools.mid returns the middle of its three arguments, the delay grows linearly with the attempt number i (not exponentially) and is clamped between 1 and 10 seconds. Directory listing operations in ServerTools.listDirectory() use the same retry mechanism to handle temporary FTP server unavailability.
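
A minimal sketch of such a retry loop is shown below. The class and method names are hypothetical, the clamp is written with Math.min and Math.max instead of Tools.mid, and returning null after the last failure is an assumption about error handling, not documented behavior.

```java
import java.util.concurrent.Callable;

// Sketch only: RetrySketch, delayMillis and withRetries are hypothetical names.
class RetrySketch {

    /** Delay before retry number `attempt`: attempt*1000 ms, clamped to 1-10 seconds. */
    static long delayMillis(int attempt) {
        return Math.max(1000L, Math.min(10000L, attempt * 1000L));
    }

    /** Run the task, retrying up to `retries` times after a failure. */
    static <T> T withRetries(Callable<T> task, int retries) {
        for (int i = 0; i <= retries; i++) {
            try {
                return task.call();
            } catch (Exception e) {
                if (i == retries) { break; } // no attempts left
                try {
                    Thread.sleep(delayMillis(i));
                } catch (InterruptedException ie) {
                    Thread.currentThread().interrupt(); // preserve the interrupt and give up
                    break;
                }
            }
        }
        return null; // assumption: callers treat null as "download failed"
    }
}
```

With the default of 40 retries, the wait reaches its 10-second ceiling from the tenth attempt onward, so a persistently failing download is bounded to a few minutes of waiting per file.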

Memory and Performance

Uses 1GB of memory by default (-Xmx1g), set through the calcXmx() function. This is sufficient for holding ServerTools.listDirectory() results and assembly statistics. The 7 threads process directory structures in parallel, each with its own HashMap instance for genus tracking, which avoids contention during concurrent operations.

Notes

Organellar genomes: Mitochondrial, plasmid, and plastid genomes require different handling; use the gbff2gff tool for those instead
Network requirements: Requires stable internet connection to NCBI FTP servers
Output execution: The generated shell script must be executed separately to actually download the files
File formats: Downloads compressed files (.gz) to save bandwidth and storage space
Directory structure: NCBI's directory structure can change; the tool may need updates to work with new organizational schemes

Common Use Cases

Phylogenetic studies: Download representative genomes for comparative genomics
Database construction: Build local genome databases for BLAST or other analyses
Taxonomic surveys: Obtain genomes across taxonomic groups for diversity studies
Reference genome collection: Download high-quality reference genomes for specific clades
Annotation pipeline setup: Gather genomes and annotations for comparative annotation projects

Support

For questions and support:

Email: bbushnell@lbl.gov
Documentation: bbmap.org