Fetch Plastid Pipeline

Overview

This pipeline is designed for researchers working with plastid genomes (chloroplasts and other plastid organelles). It performs a complete download of all plastid genomes from NCBI's RefSeq database and converts the GenBank annotations to GFF3 format for downstream analysis. The pipeline is particularly useful for comparative plastid genomics, phylogenetic studies, and annotation projects.

Note: This pipeline downloads ALL plastid genomes from NCBI RefSeq, which can be a substantial amount of data (thousands of genomes). Ensure you have adequate disk space and network bandwidth before running.

Pipeline Steps

1. Download Plastid Genomes (FASTA)

wget -q -O - ftp://ftp.ncbi.nih.gov/genomes/refseq/plastid/*genomic.fna.gz > plastid.genomic.fna.gz

Downloads all plastid genome sequences in FASTA format from NCBI RefSeq. The wildcard (*) captures all available plastid genomes, and the output is concatenated into a single compressed file.

2. Download Plastid Annotations (GenBank)

wget -q -O - ftp://ftp.ncbi.nih.gov/genomes/refseq/plastid/*genomic.gbff.gz > plastid.genomic.gbff.gz

Downloads all plastid genome annotations in GenBank format. These files contain detailed feature annotations including genes, CDS, rRNA, tRNA, and other genomic elements.

3. Convert GenBank to GFF3

gbff2gff.sh plastid.genomic.gbff.gz plastid.genomic.gff.gz

Converts the GenBank annotations to GFF3 using gbff2gff.sh. GFF3 is a tab-delimited annotation format that most genome browsers and analysis pipelines accept directly.

Usage

# Run the complete pipeline
bash pipelines/fetch/fetchPlastid.sh

# Or run it from the directory where the output should be written:
cd /path/to/output/directory
bash /path/to/BBTools/pipelines/fetch/fetchPlastid.sh

The pipeline requires no command-line arguments and downloads files to the current working directory.

Prerequisites

System Requirements

BBTools suite installed with gbff2gff.sh available
wget for downloading files from NCBI FTP
Internet connection with access to NCBI FTP servers
Sufficient disk space (several GB for all plastid genomes)
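
A quick preflight check can confirm that the required tools are installed before a long download starts. The following is a minimal sketch; the function name missing_tools is hypothetical and not part of BBTools, and it assumes gbff2gff.sh is expected on the PATH.

```shell
# Hypothetical helper (not part of BBTools): print each named tool that
# cannot be found on the PATH, one per line; return non-zero if any is missing.
missing_tools() {
    status=0
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            printf '%s\n' "$tool"
            status=1
        fi
    done
    return "$status"
}

# Example: check the tools this pipeline relies on.
# missing_tools wget gzip gbff2gff.sh || echo "install the tools listed above" >&2
```

If BBTools is installed but not on the PATH, the check reports gbff2gff.sh as missing even though the pipeline script may still find it by relative path.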

Network Access

Access to ftp.ncbi.nih.gov on port 21 (FTP)
Ability to download large files (multi-GB transfers)
Stable internet connection for extended download periods

Output Files

The pipeline generates three primary output files in the current working directory:

plastid.genomic.fna.gz: Compressed FASTA file containing all plastid genome sequences from NCBI RefSeq. This includes chloroplast genomes from plants and algae, as well as plastids from other eukaryotes such as apicomplexan parasites.
plastid.genomic.gbff.gz: Compressed GenBank format file containing all plastid genome annotations. Includes detailed feature annotations, gene predictions, and metadata for each genome.
plastid.genomic.gff.gz: Compressed GFF3 format file converted from the GenBank annotations. This standardized format is compatible with most genomic analysis tools and visualization software.

Data Characteristics

Content Coverage

The NCBI RefSeq plastid collection includes:

Chloroplast genomes: From land plants, green algae, red algae, and other photosynthetic organisms
Apicoplast genomes: From apicomplexan parasites (Plasmodium, Toxoplasma, etc.)
Other plastids: Various specialized plastid types from diverse organisms
Size range: Typically 120-200 kb for most chloroplasts, but can vary significantly

Annotation Features

The GenBank files contain comprehensive annotations including:

Protein-coding genes (photosystem components, ribosomal proteins, etc.)
rRNA genes (16S, 23S, 5S, and 4.5S ribosomal RNAs)
tRNA genes (complete set for plastid protein synthesis)
Pseudogenes and gene fragments
Introns and regulatory elements
Origin of replication regions

Common Use Cases

Comparative plastid genomics: Analyze genome structure and gene content across species
Phylogenetic reconstruction: Use plastid genes for plant and algal phylogenies
Reference database construction: Build local databases for plastid sequence identification
Genome annotation: Use well-annotated plastids as references for new genome annotation
Evolutionary studies: Study plastid genome evolution, gene loss, and rearrangements
Metagenomic analysis: Screen environmental samples for plastid sequences
Primer design: Design PCR primers for plastid-specific amplification

Performance Considerations

Download Time

Download time depends on:

Network speed: Several GB of data transfer required
NCBI server load: Peak usage times may slow downloads
Geographic location: Distance from NCBI servers affects transfer speed
Typical duration: 30 minutes to several hours depending on connection

Storage Requirements

Raw data: 2-4 GB for compressed files
Processed data: Additional space needed if files are decompressed
Temporary space: wget may use additional temporary storage during transfer
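
Free space can be checked before the download begins. This is a sketch under stated assumptions: the helper names free_kib and has_free_gib are hypothetical, and the 10 GiB threshold in the example is an assumed safety margin rather than a measured requirement.

```shell
# Hypothetical helper: print free space, in KiB, on the filesystem holding
# the given directory. 'df -P' selects the portable POSIX output layout,
# where column 4 of the second line is the available space.
free_kib() {
    df -Pk "$1" | awk 'NR == 2 { print $4 }'
}

# Hypothetical helper: succeed only if at least $2 GiB are free under $1.
has_free_gib() {
    avail=$(free_kib "$1")
    need=$(( $2 * 1024 * 1024 ))
    [ "$avail" -ge "$need" ]
}

# Example: require 10 GiB free in the current directory (assumed margin).
# has_free_gib . 10 || echo "not enough free space in $(pwd)" >&2
```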

Processing Performance

The gbff2gff conversion step is typically fast (minutes) compared to the download phase. Memory usage is minimal for the conversion process.

Post-Processing Options

Data Filtering

After download, you may want to filter the data:

# Extract sequences whose headers contain a given word (here, "chloroplast")
filterbyname.sh in=plastid.genomic.fna.gz out=chloroplasts.fna.gz names=chloroplast include=t substring=name

# Filter by size (remove very small or incomplete genomes)
reformat.sh in=plastid.genomic.fna.gz out=complete_plastids.fna.gz minlength=100000

Database Construction

Create searchable databases:

# Create BBMap index for mapping
bbmap.sh ref=plastid.genomic.fna.gz

# Create sketch database for rapid similarity search
bbsketch.sh in=plastid.genomic.fna.gz out=plastid_sketches.sketch

Sequence Analysis

Analyze the downloaded genomes:

# Generate assembly statistics
stats.sh in=plastid.genomic.fna.gz

# Analyze GC content and composition
countgc.sh in=plastid.genomic.fna.gz

Troubleshooting

Network Issues

Timeout errors: NCBI servers may be busy; retry during off-peak hours
Incomplete downloads: Check file sizes; re-run pipeline if files are truncated
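FTP access blocked: Some networks block FTP. NCBI serves the same directory tree over HTTPS, but wget expands wildcards only for FTP URLs, so each file name must then be given explicitly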
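
Transient network failures can be handled with a small retry wrapper. The sketch below is illustrative only; the function name retry and the attempt and delay values are assumptions, not part of the pipeline.

```shell
# Hypothetical helper: run a command up to $1 times, sleeping $2 seconds
# after the first failure and doubling the pause after each later failure.
retry() {
    attempts=$1
    delay=$2
    shift 2
    n=1
    while :; do
        "$@" && return 0
        [ "$n" -ge "$attempts" ] && return 1
        echo "attempt $n failed; retrying in ${delay}s" >&2
        sleep "$delay"
        delay=$(( delay * 2 ))
        n=$(( n + 1 ))
    done
}

# Example: up to 4 attempts, starting with a 60-second pause.
# retry 4 60 bash /path/to/BBTools/pipelines/fetch/fetchPlastid.sh
```

A wrapper like this only helps when the wrapped command reports failure through its exit status, which the pipeline script may not do for a failed download. Checking the results with gzip -t afterwards remains advisable.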

Storage Issues

Disk space: Monitor available space during download
Permission errors: Ensure write access to the target directory
File corruption: Verify file integrity with gzip -t for compressed files

Processing Issues

gbff2gff failures: Check that GenBank file downloaded completely
Memory errors: Increase JVM memory if processing very large files (BBTools scripts accept a Java-style flag such as -Xmx4g)
Format errors: NCBI occasionally updates file formats; check for tool updates

Related Tools

fetchproks.sh: Download prokaryotic genomes with quality selection
gbff2gff.sh: Convert GenBank to GFF3 format
gi2taxid.sh: Add taxonomic information to sequence headers
filterbyname.sh: Filter sequences by taxonomy or annotation
sendsketch.sh: Taxonomic identification of sequences

Notes

Data currency: Downloads reflect current NCBI RefSeq content, which updates regularly
File organization: All plastid genomes are concatenated into single files for convenience
Quality control: RefSeq data has undergone NCBI quality assessment
Licensing: NCBI data is in the public domain and freely redistributable
Citations: Consider citing NCBI RefSeq when publishing results from these data
Updates: Re-run pipeline periodically to get newly submitted genomes

Technical Implementation

Download Strategy

The pipeline uses wget with specific optimizations:

Quiet mode (-q): Suppresses wget verbose output for cleaner logs
Standard output (-O -): Pipes download directly to avoid temporary files
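Wildcard expansion: wget performs FTP globbing itself; it requests the directory listing and fetches every file whose name matches the * pattern. Quoting the URL keeps the local shell from trying to expand the pattern first
Concatenation: With -O -, every matched file is written to standard output in turn, so the redirect collects them into one file. Concatenated gzip members form a valid gzip stream, so the result decompresses as a single file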
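
The concatenation step relies on a property of the gzip format: several compressed members joined end to end still form a valid gzip file. A self-contained demonstration with throwaway files (the file names here are made up for illustration):

```shell
# Demonstration with throwaway files: two separately compressed FASTA
# records are joined byte-for-byte, and the result is still valid gzip.
workdir=$(mktemp -d)

printf '>seq1\nACGT\n' | gzip -c > "$workdir/a.fna.gz"
printf '>seq2\nTTGA\n' | gzip -c > "$workdir/b.fna.gz"

# Plain concatenation, analogous to streaming several files to one output.
cat "$workdir/a.fna.gz" "$workdir/b.fna.gz" > "$workdir/both.fna.gz"

# The integrity test passes, and decompression yields both records in order.
gzip -t "$workdir/both.fna.gz" && echo "integrity: ok"
record_count=$(gzip -dc "$workdir/both.fna.gz" | grep -c '^>')
echo "records: $record_count"   # prints "records: 2"

rm -rf "$workdir"
```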

File Format Handling

The pipeline downloads two complementary data types and derives a third:

FASTA (.fna.gz): Sequence data for alignment, analysis, and database construction
GenBank (.gbff.gz): Rich annotation data including gene boundaries, product names, and regulatory features
GFF3 conversion: Creates standardized annotation format compatible with genome browsers and analysis pipelines

Conversion Process

The gbff2gff.sh conversion implements selective feature extraction:

Focuses on biologically relevant features (genes, CDS, rRNA, tRNA)
Maintains coordinate accuracy from GenBank to GFF3
Preserves essential attributes like gene names and products
Filters out sequence data to keep annotations separate

Data Sources and Content

NCBI RefSeq Plastid Collection

The source directory contains genomes from:

Land plants: Angiosperms, gymnosperms, ferns, mosses, liverworts
Green algae: Charophytes, chlorophytes, and other green algal lineages
Red algae: Rhodophytes and related lineages
Other algae: Cryptophytes, haptophytes, heterokonts with secondary plastids
Apicomplexans: Malaria parasites and relatives with apicoplasts
Other organisms: Any organism with plastid-type organelles

Genome Characteristics

Plastid genomes typically exhibit:

Size range: 120-200 kb for most chloroplasts; apicoplasts are much smaller (around 35 kb)
Gene content: 100-120 genes including photosynthesis, transcription, and translation machinery
Structure: Usually circular, often with large inverted repeats
Organization: Conserved gene order in many lineages, with notable exceptions
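
These size figures can be checked against the downloaded data. The following is a minimal sketch; fasta_lengths is a hypothetical helper written in awk, not a BBTools command.

```shell
# Hypothetical helper: read FASTA on stdin and print "name<TAB>length" for
# each record, where name is the first word of the header line.
fasta_lengths() {
    awk '
        /^>/ {
            if (name != "") print name "\t" len
            name = substr($1, 2)
            len = 0
            next
        }
        { len += length($0) }
        END { if (name != "") print name "\t" len }
    '
}

# Example: list the ten shortest genomes in the download.
# gzip -dc plastid.genomic.fna.gz | fasta_lengths | sort -k2,2n | head
```

Sorting on the second column puts unusually small records first, which is a convenient way to spot partial genomes or apicoplasts.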

Integration with BBTools

Downstream Analysis

The downloaded files can be used directly as input to BBTools workflows:

Mapping and Alignment

# Create reference index
bbmap.sh ref=plastid.genomic.fna.gz

# Map reads to plastid references
bbmap.sh in=reads.fq ref=plastid.genomic.fna.gz out=mapped.sam

Similarity Searching

# Create sketch database
bbsketch.sh in=plastid.genomic.fna.gz out=plastid.sketch

# Query unknown sequences
comparesketch.sh in=unknown.fa ref=plastid.sketch

Taxonomic Classification

# Classify sequences against the remote nt sketch server (this does not use the local plastid files)
sendsketch.sh in=query.fasta db=nt coverage=50

Annotation Analysis

The GFF3 files support various annotation workflows:

Gene extraction: Extract specific gene types (rRNA, photosystem genes, etc.)
Comparative annotation: Compare gene content across species
Synteny analysis: Study gene order conservation
Functional analysis: Analyze gene functions and pathways
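
Gene extraction and gene-content comparison can start from simple column filters. The sketch below assumes the converted file is standard nine-column, tab-delimited GFF3; the helper names gff_type_counts and gff_select_type are hypothetical.

```shell
# Hypothetical helper: read GFF3 on stdin and print "type<TAB>count" for
# each feature type (column 3), skipping comment and directive lines.
gff_type_counts() {
    awk -F '\t' '
        /^#/ || NF < 9 { next }
        { count[$3]++ }
        END { for (t in count) print t "\t" count[t] }
    ' | sort
}

# Hypothetical helper: keep only features of one type, for example rRNA.
gff_select_type() {
    awk -F '\t' -v want="$1" '!/^#/ && $3 == want'
}

# Examples:
# gzip -dc plastid.genomic.gff.gz | gff_type_counts
# gzip -dc plastid.genomic.gff.gz | gff_select_type rRNA > rrna.gff
```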

Research Applications

Phylogenetic Studies

Construct large-scale plastid phylogenies
Analyze deep evolutionary relationships
Study plastid genome evolution across lineages
Investigate endosymbiotic evolution

Comparative Genomics

Compare plastid genome structures across species
Identify conserved and variable regions
Study gene loss and acquisition patterns
Analyze genome rearrangements

Functional Genomics

Annotate newly sequenced plastid genomes
Identify photosynthesis-related genes
Study plastid gene expression patterns
Analyze metabolic pathway evolution

Environmental Studies

Identify plastid sequences in environmental samples
Study photosynthetic community composition
Track plastid diversity in ecosystems
Analyze agricultural crop plastid variation

Quality Assurance

Data Verification

After running the pipeline, verify data integrity:

# Check file completeness
ls -lh plastid.genomic.*

# Test compression integrity
gzip -t plastid.genomic.*.gz

# Count sequences in FASTA
gzip -dc plastid.genomic.fna.gz | grep -c "^>"

# Verify GFF3 format
gzip -dc plastid.genomic.gff.gz | head -20

Expected Results

Successful pipeline execution should produce:

Three output files with reasonable sizes (GB range)
Valid compressed formats that pass integrity checks
Thousands of plastid genome sequences
Properly formatted GFF3 with standard headers
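
For unattended runs, the first two expectations can be turned into a pass/fail check. This is a sketch only; verify_outputs is a hypothetical helper, and it tests presence and gzip integrity, not biological content.

```shell
# Hypothetical helper: check that each expected output in directory $1
# exists, is non-empty, and passes the gzip integrity test. Prints one
# status line per file and returns non-zero if any check fails.
verify_outputs() {
    dir=$1
    failed=0
    for suffix in fna gbff gff; do
        f="$dir/plastid.genomic.$suffix.gz"
        if [ -s "$f" ] && gzip -t "$f" 2>/dev/null; then
            echo "ok      $f"
        else
            echo "FAILED  $f"
            failed=1
        fi
    done
    return "$failed"
}

# Example: verify the files in the current directory.
# verify_outputs . || echo "re-run the pipeline" >&2
```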

Maintenance and Updates

Regular Updates

Consider periodic re-runs to capture:

Newly submitted plastid genomes
Updated annotations from existing genomes
Corrections and improvements to RefSeq entries
New species and taxonomic groups

Version Tracking

Track download dates and versions:

# Add a date stamp to each output file
DATE=$(date +%Y%m%d)
for ext in fna gbff gff; do mv plastid.genomic.${ext}.gz plastid.genomic.${DATE}.${ext}.gz; done
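
Because RefSeq content changes between runs, it can also help to record what was downloaded and when. The sketch below writes a small manifest; write_manifest and the manifest file name are hypothetical, and cksum is used only because it is available on any POSIX system.

```shell
# Hypothetical helper: for each plastid.genomic.*.gz file in directory $1,
# print a tab-separated line: UTC date, CRC, size in bytes, file name.
# Reading from stdin makes cksum print just "CRC SIZE".
write_manifest() {
    dir=$1
    today=$(date -u +%Y-%m-%d)
    for f in "$dir"/plastid.genomic.*.gz; do
        [ -e "$f" ] || continue   # the pattern matched nothing
        cksum < "$f" | awk -v d="$today" -v n="$(basename "$f")" \
            '{ print d "\t" $1 "\t" $2 "\t" n }'
    done
}

# Example: append today's entries to a running log.
# write_manifest . >> plastid.manifest.tsv
```

Comparing CRC and size across runs shows whether a re-download actually picked up new data.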

Alternative Approaches

Selective Downloads

For more targeted downloads, consider manual wget commands with a narrower file pattern, followed by filtering for specific taxonomic groups. fetchproks.sh provides a comparable fetch pipeline for prokaryotic genomes.

Format Alternatives

If you prefer different annotation formats:

Keep GenBank format for detailed annotations
Convert to other formats using external tools
Extract specific features using BBTools filtering tools

Support

For questions and support:

Email: bbushnell@lbl.gov
Documentation: bbmap.org