FetchProks

Script: fetchproks.sh Package: prok Class: FetchProks.java

Writes a shell script that downloads one genome assembly and its GFF annotation per genus or species from NCBI. When requested, it attempts to select the best assembly on the basis of contiguity. The NCBI FTP tree is crawled with ServerTools.listDirectory(), and the work is distributed across threads by genus.

Basic Usage

fetchproks.sh <url> <outfile> <max species per genus: int> <use best: t/f>

This tool generates a shell script containing wget commands to download genome assemblies and GFF annotation files from NCBI's FTP servers.

Parameters

FetchProks takes four positional arguments rather than named parameter flags:

<url>
Base FTP URL from NCBI genomes directory (e.g., ftp://ftp.ncbi.nih.gov/genomes/refseq/bacteria/)
<outfile>
Output shell script filename that will contain the wget commands
<max species per genus>
Integer limiting the number of species to download per genus. Use 0 for no limit, 1 for one species per genus, etc.
<use best>
Boolean (t/f or true/false) determining whether to select the best assembly based on quality metrics. When true, analyzes all assemblies and picks the one with the longest contigs.

Examples

Download Bacterial Genomes

fetchproks.sh ftp://ftp.ncbi.nih.gov/genomes/refseq/bacteria/ bacteria.sh 2 true

Downloads up to 2 species per genus from bacterial RefSeq, selecting the best assembly for each species based on contiguity metrics.

Download All Archaea

fetchproks.sh ftp://ftp.ncbi.nih.gov/genomes/refseq/archaea/ archaea.sh 0 true

Downloads all archaeal species (no genus limit) with best assembly selection enabled.

Download Viral Genomes

fetchproks.sh ftp://ftp.ncbi.nih.gov/genomes/refseq/viral/ viral.sh 0 true

Downloads all viral genomes, selecting the best assembly for each species.

Download Fungal Genomes

fetchproks.sh ftp://ftp.ncbi.nih.gov/genomes/refseq/fungi/ fungi.sh 0 true

Downloads all fungal genomes with best assembly selection.

Download Plant Genomes

fetchproks.sh ftp://ftp.ncbi.nih.gov/genomes/refseq/plant/ plant.sh 0 true

Downloads all plant genomes, selecting the highest quality assembly for each species.

Download Vertebrate Genomes

fetchproks.sh ftp://ftp.ncbi.nih.gov/genomes/refseq/vertebrate_mammalian/ vertebrate_mammalian.sh 0 true
fetchproks.sh ftp://ftp.ncbi.nih.gov/genomes/refseq/vertebrate_other/ vertebrate_other.sh 0 true

Downloads vertebrate genomes from both mammalian and other vertebrate categories.

Algorithm Details

Multi-threaded Processing

FetchProks uses a fixed pool of 7 threads with genus-based work distribution. Each species is assigned to a thread by (genus.hashCode()&Integer.MAX_VALUE)%threads, so all species within a genus are processed by the same thread. This keeps species selection consistent and prevents race conditions in per-genus counting. Each ProcessThread maintains its own private HashMap<String, Integer> for genus tracking, so no synchronization is needed while counting. Before the hash-based thread assignment, the genus extraction method normalizes "Candidatus_" prefixes using substring operations.
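
The thread-assignment formula can be illustrated with a minimal sketch. The class and method names below are hypothetical, and two details are assumptions not confirmed by this page: that normalization means dropping a leading "Candidatus_", and that the genus is the first underscore-delimited token of the species directory name. Only the hash expression itself is taken from the text above.

```java
final class GenusThreadSketch {

    // Hypothetical helper: drop a leading "Candidatus_" (assumed meaning of
    // "normalizes"), then take the first underscore-delimited token as the genus.
    static String genusOf(String speciesDir) {
        final String prefix = "Candidatus_";
        String s = speciesDir.startsWith(prefix) ? speciesDir.substring(prefix.length()) : speciesDir;
        int idx = s.indexOf('_');
        return idx < 0 ? s : s.substring(0, idx);
    }

    // Formula quoted in the text: clear the sign bit so the result of the
    // modulus is never negative, then map the genus to a thread index.
    static int threadFor(String genus, int threads) {
        return (genus.hashCode() & Integer.MAX_VALUE) % threads;
    }

    public static void main(String[] args) {
        int threads = 7;
        String[] dirs = {"Escherichia_coli", "Escherichia_albertii", "Candidatus_Pelagibacter_ubique"};
        for (String dir : dirs) {
            String genus = genusOf(dir);
            System.out.println(dir + " -> genus " + genus + " -> thread " + threadFor(genus, threads));
        }
    }
}
```

Because the thread index depends only on the genus string, both Escherichia entries land on the same thread, which is what allows per-genus counts to live in an unsynchronized, thread-private map.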

Assembly Quality Assessment

When "use best" is enabled, the tool compares the candidate assemblies with the compareTo() method of the Stats class and keeps the one that ranks best.
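
The fields and exact ordering used by Stats.compareTo() are not described on this page. The sketch below only illustrates the general pattern of ranking candidates through Comparable, assuming a single hypothetical metric (longest contig length). The class name and field are invented for illustration and are not the real Stats class.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical stand-in for an assembly statistics record; NOT the real Stats class.
final class AssemblyStatsSketch implements Comparable<AssemblyStatsSketch> {
    final String name;
    final long maxContig; // assumed metric: length of the longest contig

    AssemblyStatsSketch(String name, long maxContig) {
        this.name = name;
        this.maxContig = maxContig;
    }

    // Larger values rank higher, so Collections.max() returns the most contiguous assembly.
    @Override
    public int compareTo(AssemblyStatsSketch other) {
        return Long.compare(maxContig, other.maxContig);
    }

    // Picks the highest-ranked candidate from a non-empty list.
    static AssemblyStatsSketch best(List<AssemblyStatsSketch> candidates) {
        return Collections.max(candidates);
    }

    public static void main(String[] args) {
        List<AssemblyStatsSketch> candidates = Arrays.asList(
                new AssemblyStatsSketch("assembly_A", 120_000),
                new AssemblyStatsSketch("assembly_B", 4_600_000),
                new AssemblyStatsSketch("assembly_C", 950_000));
        System.out.println("best: " + best(candidates).name); // prints "best: assembly_B"
    }
}
```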

File Selection Strategy

The tool implements a hierarchical search for the best assemblies in each species directory:

  1. Reference assemblies first: Searches "reference" subdirectories
  2. Latest versions: Falls back to "latest_assembly_versions" if no reference found
  3. All versions: Uses "all_assembly_versions" as final fallback
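
The fallback order above can be sketched as a simple priority scan. This is a simplification under stated assumptions: it only checks whether a subdirectory name is present in a listing, whereas the tool may also fall through when a directory exists but holds no usable assembly. The class and method names are hypothetical.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.List;

final class AssemblyDirSketch {

    // Preference order taken from the list above.
    static final List<String> PRIORITY = Arrays.asList(
            "reference", "latest_assembly_versions", "all_assembly_versions");

    // Returns the most preferred subdirectory present in the listing,
    // or null when none of the three is available.
    static String pick(Collection<String> listing) {
        for (String name : PRIORITY) {
            if (listing.contains(name)) {
                return name;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        List<String> listing = Arrays.asList("all_assembly_versions", "latest_assembly_versions");
        System.out.println(pick(listing)); // prints "latest_assembly_versions"
    }
}
```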

Output Generation

Creates wget commands for paired genomic.fna.gz and genomic.gff.gz files, skipping any "_from_genomic" variants. Three output modes are controlled by boolean flags.

All wget commands are written through the TextStreamWriter with synchronization, which prevents output from concurrent threads from being interleaved or corrupted.
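
A minimal sketch of the file filtering described in this section follows. The names are hypothetical, the plain "wget <url>" command format is an assumption, and so is the rule that commands are emitted only when both files of the pair are present. The "_from_genomic" check matters because NCBI file names such as *_cds_from_genomic.fna.gz also end in "genomic.fna.gz".

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class WgetSketch {

    // Builds wget lines for the genomic FASTA/GFF pair in one assembly directory.
    // baseUrl is expected to end with '/'.
    static List<String> commands(String baseUrl, List<String> files) {
        String fna = null;
        String gff = null;
        for (String f : files) {
            if (f.contains("_from_genomic")) {
                continue; // skip cds/rna extracts, which share the same suffix
            }
            if (f.endsWith("genomic.fna.gz")) {
                fna = f;
            } else if (f.endsWith("genomic.gff.gz")) {
                gff = f;
            }
        }
        List<String> out = new ArrayList<>();
        if (fna != null && gff != null) { // assumed: emit only complete pairs
            out.add("wget " + baseUrl + fna);
            out.add("wget " + baseUrl + gff);
        }
        return out;
    }

    public static void main(String[] args) {
        // Placeholder file names and host, for illustration only.
        List<String> files = Arrays.asList(
                "EXAMPLE_cds_from_genomic.fna.gz",
                "EXAMPLE_genomic.fna.gz",
                "EXAMPLE_genomic.gff.gz");
        for (String line : commands("ftp://example.invalid/assembly/", files)) {
            System.out.println(line);
        }
    }
}
```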

Genus Management

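Genus-level filtering prevents over-representation of prolific genera. Each thread counts the species it has already accepted per genus in its own HashMap and skips further species once the <max species per genus> limit is reached.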
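
The per-genus counting can be sketched as follows. The class is hypothetical and would be held privately by one thread, so it needs no locking. The treatment of 0 as "no limit" follows the parameter description above. Everything else is an assumption about how the check might be written.

```java
import java.util.HashMap;

// Hypothetical per-thread limiter; one instance per worker thread, never shared.
final class GenusLimiterSketch {
    private final HashMap<String, Integer> seen = new HashMap<>();
    private final int maxPerGenus; // 0 means no limit

    GenusLimiterSketch(int maxPerGenus) {
        this.maxPerGenus = maxPerGenus;
    }

    // Returns true and records the species if the genus is still under its limit.
    boolean accept(String genus) {
        int count = seen.getOrDefault(genus, 0);
        if (maxPerGenus > 0 && count >= maxPerGenus) {
            return false;
        }
        seen.put(genus, count + 1);
        return true;
    }

    public static void main(String[] args) {
        GenusLimiterSketch limiter = new GenusLimiterSketch(2);
        System.out.println(limiter.accept("Bacillus")); // true
        System.out.println(limiter.accept("Bacillus")); // true
        System.out.println(limiter.accept("Bacillus")); // false, limit of 2 reached
    }
}
```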

Error Handling and Retries

Network resilience is implemented with configurable retry logic (retries=40 by default). For assembly report downloads, the tool catches exceptions and waits before retrying with a capped linear backoff: Thread.sleep(Tools.mid(10000, i*1000, 1000)), where i is the attempt number. The delay therefore grows by 1 second per attempt and is clamped to the range of 1 to 10 seconds. Directory listing operations in ServerTools.listDirectory() use the same retry mechanism to handle temporary FTP server unavailability.
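
The delay schedule implied by the quoted expression can be checked with a small sketch. The mid() method below is a local stand-in that returns the median of three values, which is how the clamping reads here. It is not the real Tools.mid, and whether attempts are counted from 0 or 1 is not stated on this page.

```java
final class RetryDelaySketch {

    // Median of three values; local stand-in for the Tools.mid call quoted above.
    static long mid(long a, long b, long c) {
        return Math.max(Math.min(a, b), Math.min(Math.max(a, b), c));
    }

    // Delay in milliseconds before the next retry: attempt*1000, clamped to [1000, 10000].
    static long delayMillis(int attempt) {
        return mid(10000, attempt * 1000L, 1000);
    }

    public static void main(String[] args) {
        System.out.println(delayMillis(0));  // 1000  (floor of 1 second)
        System.out.println(delayMillis(5));  // 5000  (grows linearly)
        System.out.println(delayMillis(40)); // 10000 (ceiling of 10 seconds)
    }
}
```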

Memory and Performance

Uses 1 GB of memory by default (-Xmx1g), allocated through the calcXmx() function. This is sufficient for holding ServerTools.listDirectory() results and assembly statistics. The 7 threads process directory structures in parallel, and each maintains a separate HashMap instance for genus tracking, which avoids contention on shared state during concurrent operations.

Notes

Common Use Cases
