Fetch Plastid Pipeline

Script: fetchPlastid.sh
Source Directory: pipelines/fetch/
Author: Brian Bushnell

Pipeline script that downloads all plastid genomes and annotations from NCBI RefSeq in FASTA and GenBank formats, then converts the GenBank annotations to GFF3.

Overview

This pipeline is designed for researchers working with plastid genomes (chloroplasts and other plastid organelles). It performs a complete download of all plastid genomes from NCBI's RefSeq database and converts the GenBank annotations to GFF3 format for downstream analysis. The pipeline is particularly useful for comparative plastid genomics, phylogenetic studies, and annotation projects.

Note: This pipeline downloads ALL plastid genomes from NCBI RefSeq, which can be a substantial amount of data (thousands of genomes). Ensure you have adequate disk space and network bandwidth before running.
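
A preflight check along these lines can catch a full disk before the download starts. This is a minimal sketch: has_free_space is a hypothetical helper (not part of BBTools), and the threshold you pass is your own estimate, since the size of the collection changes over time.

```shell
# Hypothetical helper: succeed only if DIR has at least MIN_GB gigabytes free.
has_free_space() {
    local dir="$1" min_gb="$2" avail_kb
    # df -Pk prints POSIX-format output in 1024-byte blocks;
    # column 4 of the second line is the available space.
    avail_kb=$(df -Pk "$dir" | awk 'NR==2 {print $4}')
    [ "$avail_kb" -ge $((min_gb * 1024 * 1024)) ]
}
```

For example, `has_free_space . 20 || echo "Not enough free space" >&2` warns when fewer than 20 GB are free in the current directory; the figure 20 is only a placeholder.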

Pipeline Steps

1. Download Plastid Genomes (FASTA)

wget -q -O - ftp://ftp.ncbi.nih.gov/genomes/refseq/plastid/*genomic.fna.gz > plastid.genomic.fna.gz

Downloads all plastid genome sequences in FASTA format from NCBI RefSeq. The wildcard (*) matches every file in the directory whose name ends in genomic.fna.gz, and because wget writes them all to standard output (-O -), they are concatenated into a single compressed file.
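
Plain concatenation works here because the gzip format allows several compressed members back to back in one file, and decompressors read them as a single stream. The sketch below demonstrates this with toy data only; demo_concat_gzip and the two tiny sequences are made up for illustration and involve no network access.

```shell
# Toy demonstration: two separately gzipped FASTA files, concatenated
# byte for byte, decompress as one continuous FASTA stream.
demo_concat_gzip() {
    local tmp
    tmp=$(mktemp -d)
    printf '>seqA\nACGT\n' | gzip -c > "$tmp/a.fna.gz"
    printf '>seqB\nGGCC\n' | gzip -c > "$tmp/b.fna.gz"
    # Analogous to several downloads being written to one output stream.
    cat "$tmp/a.fna.gz" "$tmp/b.fna.gz" > "$tmp/all.fna.gz"
    gzip -dc "$tmp/all.fna.gz"
    rm -rf "$tmp"
}
```

Calling `demo_concat_gzip` prints both records in order, which is why the combined plastid.genomic.fna.gz can be read directly without splitting it again.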

2. Download Plastid Annotations (GenBank)

wget -q -O - ftp://ftp.ncbi.nih.gov/genomes/refseq/plastid/*genomic.gbff.gz > plastid.genomic.gbff.gz

Downloads all plastid genome annotations in GenBank format. These files contain detailed feature annotations including genes, CDS, rRNA, tRNA, and other genomic elements.
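
Each record in a GenBank flat file begins with a LOCUS line, so the number of annotated genomes can be counted without parsing the file. The sketch below also compares that count with the number of FASTA headers, on the assumption that every GenBank record has a FASTA counterpart; both function names are hypothetical.

```shell
# Count GenBank records by their LOCUS lines.
count_gbff_records() {
    # grep -c exits non-zero on a count of 0, so "|| true" keeps the status clean.
    gzip -dc "$1" | grep -c '^LOCUS' || true
}

# Compare FASTA header count ($1) with GenBank record count ($2).
counts_match() {
    local n_fa n_gb
    n_fa=$(gzip -dc "$1" | grep -c '^>' || true)
    n_gb=$(count_gbff_records "$2")
    echo "fasta=$n_fa genbank=$n_gb"
    [ "$n_fa" -eq "$n_gb" ]
}
```

For example, `counts_match plastid.genomic.fna.gz plastid.genomic.gbff.gz` prints both counts and returns non-zero when they differ, which may indicate an interrupted download.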

3. Convert GenBank to GFF3

gbff2gff.sh plastid.genomic.gbff.gz plastid.genomic.gff.gz

Converts the GenBank annotations to GFF3 using the gbff2gff tool. GFF3 is a tab-delimited format accepted by most genomic analysis pipelines.
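
One quick way to inspect the converted file is to tally the feature types in column 3. This is a sketch that assumes the output follows the standard nine-column, tab-delimited GFF3 layout; count_gff_types is a hypothetical helper, not a BBTools command.

```shell
# Tally GFF3 feature types (column 3), most frequent first.
count_gff_types() {
    gzip -dc "$1" |
        # Skip comment/directive lines and anything without nine columns.
        awk -F'\t' '!/^#/ && NF >= 9 {n[$3]++} END {for (t in n) print n[t] "\t" t}' |
        sort -k1,1nr -k2,2
}
```

Running `count_gff_types plastid.genomic.gff.gz` prints one line per feature type with its count, which makes an empty or truncated conversion easy to spot.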

Usage

# Run the complete pipeline from the BBTools root directory
bash pipelines/fetch/fetchPlastid.sh

# Or change to the directory where the output should be written, then run it:
cd /path/to/output/directory
bash /path/to/BBTools/pipelines/fetch/fetchPlastid.sh

The pipeline requires no command-line arguments and downloads files to the current working directory.
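
Because the output location is simply the current directory, a small wrapper can keep each run in its own folder. The following is a sketch; run_fetch_in is a hypothetical helper, and the script path must be absolute because the wrapper changes directory before running it.

```shell
# Run a script inside OUTDIR without changing the caller's working directory.
run_fetch_in() {
    local script="$1" outdir="$2"
    mkdir -p "$outdir" || return 1
    # The subshell confines the cd to this one command.
    ( cd "$outdir" && bash "$script" )
}
```

For example, `run_fetch_in /path/to/BBTools/pipelines/fetch/fetchPlastid.sh plastid_run` writes all output files into plastid_run/.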

Prerequisites

System Requirements

Network Access

Output Files

The pipeline generates three primary output files in the current working directory:

plastid.genomic.fna.gz
Compressed FASTA file containing all plastid genome sequences from NCBI RefSeq. This includes chloroplast genomes from plants and algae, along with plastids of other eukaryotes.
plastid.genomic.gbff.gz
Compressed GenBank format file containing all plastid genome annotations. Includes detailed feature annotations, gene predictions, and metadata for each genome.
plastid.genomic.gff.gz
Compressed GFF3 format file converted from the GenBank annotations. This standardized format is compatible with most genomic analysis tools and visualization software.

Data Characteristics

Content Coverage

The NCBI RefSeq plastid collection includes:

Annotation Features

The GenBank files contain comprehensive annotations including:

Common Use Cases

Performance Considerations

Download Time

Download time depends on:

Storage Requirements

Processing Performance

The gbff2gff conversion step is typically fast (minutes) compared to the download phase. Memory usage is minimal for the conversion process.

Post-Processing Options

Data Filtering

After download, you may want to filter the data:

# Extract sequences whose header contains a given term (substring=name matches the term anywhere in the header)
filterbyname.sh in=plastid.genomic.fna.gz out=plant_chloroplasts.fna.gz names=chloroplast include=t substring=name

# Filter by size (remove very small or incomplete genomes)
reformat.sh in=plastid.genomic.fna.gz out=complete_plastids.fna.gz minlength=100000

Database Construction

Create searchable databases:

# Create BBMap index for mapping
bbmap.sh ref=plastid.genomic.fna.gz

# Create sketch database for rapid similarity search (consider mode=sequence for one sketch per genome instead of one for the whole file)
bbsketch.sh in=plastid.genomic.fna.gz out=plastid_sketches.sketch

Sequence Analysis

Analyze the downloaded genomes:

# Generate assembly statistics
stats.sh in=plastid.genomic.fna.gz

# Analyze GC content and composition
countgc.sh in=plastid.genomic.fna.gz

Troubleshooting

Network Issues

Storage Issues

Processing Issues

Related Tools

Notes

Technical Implementation

Download Strategy

The pipeline uses wget with specific optimizations:

File Format Handling

The pipeline handles two complementary data types:

Conversion Process

The gbff2gff.sh conversion implements selective feature extraction:

Data Sources and Content

NCBI RefSeq Plastid Collection

The source directory contains genomes from:

Genome Characteristics

Plastid genomes typically exhibit:

Integration with BBTools

Downstream Analysis

The downloaded data integrates seamlessly with BBTools workflows:

Mapping and Alignment

# Create reference index
bbmap.sh ref=plastid.genomic.fna.gz

# Map reads to plastid references
bbmap.sh in=reads.fq ref=plastid.genomic.fna.gz out=mapped.sam

Similarity Searching

# Create sketch database (consider mode=sequence for one sketch per genome instead of one for the whole file)
bbsketch.sh in=plastid.genomic.fna.gz out=plastid.sketch

# Query unknown sequences
comparesketch.sh in=unknown.fa ref=plastid.sketch

Taxonomic Classification

# Classify sequences against the nt database on the remote sketch server (this does not use the local plastid files)
sendsketch.sh in=query.fasta db=nt coverage=50

Annotation Analysis

The GFF3 files support various annotation workflows:

Research Applications

Phylogenetic Studies

Comparative Genomics

Functional Genomics

Environmental Studies

Quality Assurance

Data Verification

After running the pipeline, verify data integrity:

# Check file completeness
ls -lh plastid.genomic.*

# Test compression integrity
gzip -t plastid.genomic.*.gz

# Count sequences in FASTA (the file is gzipped, so decompress to a pipe first)
gzip -dc plastid.genomic.fna.gz | grep -c "^>"

# Verify GFF3 format
gzip -dc plastid.genomic.gff.gz | head -20
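
The individual checks can be wrapped into one function that reports a status per file and returns non-zero if anything fails. This is a sketch; verify_outputs is a hypothetical helper that assumes the three output file names listed in this document.

```shell
# Check that each expected output exists, is non-empty, and is a valid gzip file.
verify_outputs() {
    local dir="${1:-.}" ext f status=0
    for ext in fna gbff gff; do
        f="$dir/plastid.genomic.${ext}.gz"
        if [ ! -s "$f" ]; then
            echo "FAIL missing or empty: $f" >&2
            status=1
        elif ! gzip -t "$f" 2>/dev/null; then
            echo "FAIL corrupt gzip: $f" >&2
            status=1
        else
            echo "OK   $f"
        fi
    done
    return "$status"
}
```

Running `verify_outputs` with no argument checks the current directory; a truncated download typically shows up as a corrupt gzip failure.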

Expected Results

Successful pipeline execution should produce:

Maintenance and Updates

Regular Updates

Consider periodic re-runs to capture:

Version Tracking

Track download dates and versions:

# Add a timestamp to an output file (repeat for the .gbff.gz and .gff.gz files)
DATE=$(date +%Y%m%d)
mv plastid.genomic.fna.gz plastid.genomic.${DATE}.fna.gz
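
To keep the three outputs in step, the same date tag can be applied to all of them in one pass. This is a sketch; stamp_outputs is a hypothetical helper that operates on the current directory and skips any file that is not present.

```shell
# Rename all three outputs with one shared date tag (default: today).
stamp_outputs() {
    local date_tag="${1:-$(date +%Y%m%d)}" ext f
    for ext in fna gbff gff; do
        f="plastid.genomic.${ext}.gz"
        if [ -f "$f" ]; then
            mv "$f" "plastid.genomic.${date_tag}.${ext}.gz"
        fi
    done
    return 0
}
```

Calling `stamp_outputs` right after a run tags the files with the download date; passing an explicit tag, such as `stamp_outputs 20240101`, is useful when renaming an older download.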

Alternative Approaches

Selective Downloads

For more targeted downloads, consider using fetchproks.sh with specific parameters or manual wget commands for specific taxonomic groups.

Format Alternatives

If you prefer different annotation formats:

Support

For questions and support: