Fetch Plastid Pipeline
Pipeline script to download all plastid genomes and annotations from NCBI RefSeq and convert them to GFF format. This streamlined pipeline fetches chloroplast/plastid sequences in both FASTA and GenBank formats, then converts the annotations to standardized GFF3 format.
Overview
This pipeline is designed for researchers working with plastid genomes (chloroplasts and other plastid organelles). It performs a complete download of all plastid genomes from NCBI's RefSeq database and converts the GenBank annotations to GFF3 format for downstream analysis. The pipeline is particularly useful for comparative plastid genomics, phylogenetic studies, and annotation projects.
Pipeline Steps
1. Download Plastid Genomes (FASTA)
wget -q -O - ftp://ftp.ncbi.nih.gov/genomes/refseq/plastid/*genomic.fna.gz > plastid.genomic.fna.gz
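Downloads all plastid genome sequences in FASTA format from NCBI RefSeq. wget expands the wildcard (*) against the FTP directory listing, fetches every matching file, and writes each one to standard output, so the redirect collects them into a single compressed file. A concatenation of gzip files is itself a valid gzip file, so no recompression is needed.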
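The claim that concatenated gzip files remain readable can be checked offline. The sketch below uses two tiny stand-in files (the names part1.fna.gz and part2.fna.gz are hypothetical, not NCBI data) and joins them byte for byte, which is the same thing the redirect does to the files wget streams.

```shell
# Two small gzip-compressed FASTA files standing in for separate downloads.
demo_dir=$(mktemp -d)
printf '>seq1\nACGT\n' | gzip -c > "$demo_dir/part1.fna.gz"
printf '>seq2\nGGCC\n' | gzip -c > "$demo_dir/part2.fna.gz"

# Byte-level concatenation of the compressed files.
cat "$demo_dir/part1.fna.gz" "$demo_dir/part2.fna.gz" > "$demo_dir/all.fna.gz"

# The combined file passes the integrity test and yields both records.
gzip -t "$demo_dir/all.fna.gz"
record_count=$(gzip -dc "$demo_dir/all.fna.gz" | grep -c '^>')
echo "records: $record_count"

rm -rf "$demo_dir"
```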
2. Download Plastid Annotations (GenBank)
wget -q -O - ftp://ftp.ncbi.nih.gov/genomes/refseq/plastid/*genomic.gbff.gz > plastid.genomic.gbff.gz
Downloads all plastid genome annotations in GenBank format. These files contain detailed feature annotations including genes, CDS, rRNA, tRNA, and other genomic elements.
3. Convert GenBank to GFF3
gbff2gff.sh plastid.genomic.gbff.gz plastid.genomic.gff.gz
Converts the GenBank annotations to GFF3 with gbff2gff.sh. GFF3 is a tab-delimited format accepted by most genome browsers and annotation-aware analysis pipelines.
Usage
# Run the complete pipeline
bash pipelines/fetch/fetchPlastid.sh
# Or run it from the directory where the files should be written:
cd /path/to/output/directory
bash /path/to/BBTools/pipelines/fetch/fetchPlastid.sh
The pipeline requires no command-line arguments and downloads files to the current working directory.
Prerequisites
System Requirements
- BBTools suite installed with gbff2gff.sh available
- wget for downloading files from NCBI FTP
- Internet connection with access to NCBI FTP servers
- Sufficient disk space (several GB for all plastid genomes)
Network Access
- Access to ftp.ncbi.nih.gov over FTP (control port 21, plus the passive-mode data ports the server assigns)
- Ability to download large files (multi-GB transfers)
- Stable internet connection for extended download periods
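A quick pre-flight check can catch a missing tool before any download starts. This is a minimal sketch; the function name missing_tools is hypothetical and not part of BBTools.

```shell
# Print the name of every listed command that is not on PATH.
# Returns non-zero if anything is missing.
missing_tools() {
    rc=0
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "$tool"
            rc=1
        fi
    done
    return "$rc"
}

# Typical call for this pipeline:
#   missing_tools wget gzip gbff2gff.sh || echo "install the tools listed above"
```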
Output Files
The pipeline generates three primary output files in the current working directory:
- plastid.genomic.fna.gz
- Compressed FASTA file containing all plastid genome sequences from NCBI RefSeq. This includes chloroplast genomes from plants and algae, plus plastids from other eukaryotes such as apicomplexans.
- plastid.genomic.gbff.gz
- Compressed GenBank format file containing all plastid genome annotations. Includes detailed feature annotations, gene predictions, and metadata for each genome.
- plastid.genomic.gff.gz
- Compressed GFF3 format file converted from the GenBank annotations. This standardized format is compatible with most genomic analysis tools and visualization software.
Data Characteristics
Content Coverage
The NCBI RefSeq plastid collection includes:
- Chloroplast genomes: From land plants, green algae, red algae, and other photosynthetic organisms
- Apicoplast genomes: From apicomplexan parasites (Plasmodium, Toxoplasma, etc.)
- Other plastids: Various specialized plastid types from diverse organisms
- Size range: Roughly 120-200 kb for most chloroplasts; apicoplasts are much smaller (about 35 kb), and some algal plastid genomes fall well outside this range
Annotation Features
The GenBank files contain comprehensive annotations including:
- Protein-coding genes (photosystem components, ribosomal proteins, etc.)
- rRNA genes (16S, 23S, 5S, and, in land plants, 4.5S ribosomal RNAs)
- tRNA genes (usually a set sufficient for plastid protein synthesis)
- Pseudogenes and gene fragments
- Introns and regulatory elements
- Origin of replication regions
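Once the GFF3 file exists, the feature types listed above can be tallied directly from column 3. The helper below is a sketch using only gzip, awk, and sort; count_feature_types is a hypothetical name, and the exact set of types depends on what gbff2gff.sh emits.

```shell
# Tally the feature types (column 3) in a gzip-compressed GFF3 file.
# Comment and directive lines start with "#" and are skipped.
count_feature_types() {
    gzip -dc "$1" |
        awk -F '\t' '!/^#/ && NF >= 9 { n[$3]++ } END { for (t in n) print n[t] "\t" t }' |
        sort -k1,1nr -k2,2
}

# Example: count_feature_types plastid.genomic.gff.gz
```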
Common Use Cases
- Comparative plastid genomics: Analyze genome structure and gene content across species
- Phylogenetic reconstruction: Use plastid genes for plant and algal phylogenies
- Reference database construction: Build local databases for plastid sequence identification
- Genome annotation: Use well-annotated plastids as references for new genome annotation
- Evolutionary studies: Study plastid genome evolution, gene loss, and rearrangements
- Metagenomic analysis: Screen environmental samples for plastid sequences
- Primer design: Design PCR primers for plastid-specific amplification
Performance Considerations
Download Time
Download time depends on:
- Network speed: Several GB of data transfer required
- NCBI server load: Peak usage times may slow downloads
- Geographic location: Distance from NCBI servers affects transfer speed
- Typical duration: 30 minutes to several hours depending on connection
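For a rough planning figure, transfer time is just size divided by sustained bandwidth. The sketch below assumes decimal units (1 GB = 8000 megabits) and ignores protocol overhead and server throttling; eta_minutes is a hypothetical helper name.

```shell
# Estimated minutes to move SIZE_GB gigabytes over a link sustaining MBIT megabits/s.
eta_minutes() {
    awk -v gb="$1" -v mbit="$2" 'BEGIN { printf "%.0f\n", gb * 8000 / mbit / 60 }'
}

eta_minutes 3 10     # 3 GB at 10 Mbit/s  -> 40
eta_minutes 3 100    # 3 GB at 100 Mbit/s -> 4
```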
Storage Requirements
- Raw data: 2-4 GB for compressed files
- Processed data: Additional space needed if files are decompressed
- Temporary space: wget may use additional temporary storage during transfer
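Free space can be checked before starting. This sketch relies on the POSIX output format of df (-P) with 1024-byte blocks (-k); has_free_space is a hypothetical helper, and the threshold is whatever margin you choose.

```shell
# Succeed only if the filesystem holding DIR has at least NEED_GB gigabytes free.
# In "df -Pk" output, line 2 column 4 is the available space in 1024-byte blocks.
has_free_space() {
    dir=$1
    need_gb=$2
    avail_kb=$(df -Pk "$dir" | awk 'NR == 2 { print $4 }')
    [ "$avail_kb" -ge $((need_gb * 1024 * 1024)) ]
}

# Example: has_free_space . 10 || echo "less than 10 GB free here"
```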
Processing Performance
The gbff2gff conversion step is typically fast (minutes) compared to the download phase. Memory usage is minimal for the conversion process.
Post-Processing Options
Data Filtering
After download, you may want to filter the data:
# Keep records whose header contains "chloroplast" (a name match, not a taxonomic query)
filterbyname.sh in=plastid.genomic.fna.gz out=plant_chloroplasts.fna.gz names=chloroplast substring=name include=t
# Filter by size (remove very small or incomplete genomes)
reformat.sh in=plastid.genomic.fna.gz out=complete_plastids.fna.gz minlength=100000
Database Construction
Create searchable databases:
# Create BBMap index for mapping
bbmap.sh ref=plastid.genomic.fna.gz
# Create sketch database for rapid similarity search
sketch.sh in=plastid.genomic.fna.gz out=plastid_sketches.sketch mode=sequence
Sequence Analysis
Analyze the downloaded genomes:
# Generate assembly statistics
stats.sh in=plastid.genomic.fna.gz
# Analyze GC content and composition
countgc.sh in=plastid.genomic.fna.gz
Troubleshooting
Network Issues
- Timeout errors: NCBI servers may be busy; retry during off-peak hours
- Incomplete downloads: Check file sizes; re-run pipeline if files are truncated
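- FTP access blocked: Some networks block FTP. NCBI serves the same directory over HTTPS (https://ftp.ncbi.nlm.nih.gov/genomes/refseq/plastid/), but wget wildcards work only with FTP URLs, so over HTTPS the files must be listed and fetched individually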
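Transient timeouts are often handled by simply retrying with a pause. The wrapper below is a generic sketch, not part of the pipeline; retry is a hypothetical function name, and the attempt count and delay are arbitrary choices.

```shell
# Run a command up to MAX times, sleeping DELAY seconds between failed attempts.
retry() {
    max=$1
    delay=$2
    shift 2
    attempt=1
    until "$@"; do
        if [ "$attempt" -ge "$max" ]; then
            echo "giving up after $attempt attempts: $*" >&2
            return 1
        fi
        attempt=$((attempt + 1))
        sleep "$delay"
    done
}

# Example: retry 3 60 bash pipelines/fetch/fetchPlastid.sh
```

Because the pipeline overwrites its output files on each run, retrying the whole script is safe but repeats any download that had already finished.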
Storage Issues
- Disk space: Monitor available space during download
- Permission errors: Ensure write access to the target directory
- File corruption: Verify file integrity with gzip -t for compressed files
Processing Issues
- gbff2gff failures: Check that GenBank file downloaded completely
- Memory errors: Increase the JVM heap if processing very large files (BBTools scripts generally accept an -Xmx flag, for example -Xmx4g)
- Format errors: NCBI occasionally updates file formats; check for tool updates
Related Tools
- fetchproks.sh: Download prokaryotic genomes with quality selection
- gbff2gff.sh: Convert GenBank to GFF3 format
- gi2taxid.sh: Add taxonomic information to sequence headers
- filterbyname.sh: Filter sequences by taxonomy or annotation
- sendsketch.sh: Taxonomic identification of sequences
Notes
- Data currency: Downloads reflect current NCBI RefSeq content, which updates regularly
- File organization: All plastid genomes are concatenated into single files for convenience
- Quality control: RefSeq data has undergone NCBI quality assessment
- Licensing: NCBI places no restrictions on the use or redistribution of its data, though some submitters may hold rights to records they deposited
- Citations: Consider citing NCBI RefSeq when publishing results from these data
- Updates: Re-run pipeline periodically to get newly submitted genomes
Technical Implementation
Download Strategy
The pipeline uses wget with specific optimizations:
- Quiet mode (-q): Suppresses wget verbose output for cleaner logs
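- Standard output (-O -): Streams each download to standard output instead of saving one local file per remote file
- Wildcard expansion: wget retrieves the FTP directory listing and matches the * wildcard against it on the client side; this globbing works only for FTP URLs
- Concatenation: Matching files are written to standard output one after another, so the shell redirect collects them into one multi-member gzip file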
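One weakness of redirecting straight into the final filename is that an interrupted transfer leaves a truncated file that looks complete. A more defensive variant, sketched below, writes to a temporary name and renames only after gzip -t passes. The function name fetch_gz and the .part suffix are hypothetical; the wget flags are the same ones the pipeline uses, and the URL is quoted so the local shell leaves the * for wget to expand.

```shell
# Download URL_PATTERN to OUT via a temporary file, replacing OUT only if the
# result is a valid gzip stream.
fetch_gz() {
    url_pattern=$1
    out=$2
    tmp="$out.part"
    if wget -q -O - "$url_pattern" > "$tmp" && gzip -t "$tmp" 2>/dev/null; then
        mv "$tmp" "$out"
    else
        rm -f "$tmp"
        echo "download failed or produced a corrupt file: $out" >&2
        return 1
    fi
}

# Example (same URL the pipeline uses):
#   fetch_gz 'ftp://ftp.ncbi.nih.gov/genomes/refseq/plastid/*genomic.fna.gz' plastid.genomic.fna.gz
```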
File Format Handling
The pipeline downloads two complementary formats and derives a third:
- FASTA (.fna.gz): Sequence data for alignment, analysis, and database construction
- GenBank (.gbff.gz): Rich annotation data including gene boundaries, product names, and regulatory features
- GFF3 conversion: Creates standardized annotation format compatible with genome browsers and analysis pipelines
Conversion Process
The gbff2gff.sh conversion implements selective feature extraction:
- Focuses on biologically relevant features (genes, CDS, rRNA, tRNA)
- Maintains coordinate accuracy from GenBank to GFF3
- Preserves essential attributes like gene names and products
- Filters out sequence data to keep annotations separate
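Coordinates carry over unchanged because GenBank locations and GFF3 columns 4 and 5 both use 1-based, inclusive positions; only the strand notation differs. The sketch below illustrates that mapping for the two simplest location forms. It is not the gbff2gff.sh implementation, the function name is hypothetical, and join() or partial (< and >) locations are not handled.

```shell
# Convert a simple GenBank location to "start<TAB>end<TAB>strand"
# (GFF3 columns 4, 5, and 7). Handles "A..B" and "complement(A..B)" only.
gb_location_to_gff() {
    loc=$1
    strand='+'
    case $loc in
        complement\(*\))
            strand='-'
            loc=${loc#complement\(}
            loc=${loc%\)}
            ;;
    esac
    printf '%s\t%s\t%s\n' "${loc%%..*}" "${loc##*..}" "$strand"
}

gb_location_to_gff '1..1062'                  # forward strand
gb_location_to_gff 'complement(2047..3597)'   # reverse strand
```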
Data Sources and Content
NCBI RefSeq Plastid Collection
The source directory contains genomes from:
- Land plants: Angiosperms, gymnosperms, ferns, mosses, liverworts
- Green algae: Charophytes, chlorophytes, and other green algal lineages
- Red algae: Rhodophytes and related lineages
- Other algae: Cryptophytes, haptophytes, heterokonts with secondary plastids
- Apicomplexans: Malaria parasites and relatives with apicoplasts
- Other organisms: Any organism with plastid-type organelles
Genome Characteristics
Plastid genomes typically exhibit:
- Size range: Roughly 120-200 kb for most chloroplasts, smaller (about 35 kb) for apicoplasts
- Gene content: 100-120 genes including photosynthesis, transcription, and translation machinery
- Structure: Usually circular, often with large inverted repeats
- Organization: Conserved gene order in many lineages, with notable exceptions
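The size distribution of the downloaded genomes can be inspected without any extra tools. This is a minimal sketch with gzip, awk, and sort; fasta_lengths is a hypothetical helper name.

```shell
# Print "length<TAB>header" for every record in a gzip-compressed FASTA file,
# longest first. Sequence lines are summed per record, so line wrapping is fine.
fasta_lengths() {
    gzip -dc "$1" |
        awk '
            /^>/ { if (name != "") print len "\t" name; name = substr($0, 2); len = 0; next }
            { len += length($0) }
            END { if (name != "") print len "\t" name }
        ' |
        sort -k1,1nr
}

# Example: fasta_lengths plastid.genomic.fna.gz | head
```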
Integration with BBTools
Downstream Analysis
The downloaded data integrates seamlessly with BBTools workflows:
Mapping and Alignment
# Create reference index
bbmap.sh ref=plastid.genomic.fna.gz
# Map reads to plastid references
bbmap.sh in=reads.fq ref=plastid.genomic.fna.gz out=mapped.sam
Similarity Searching
# Create sketch database
sketch.sh in=plastid.genomic.fna.gz out=plastid.sketch mode=sequence
# Query unknown sequences
comparesketch.sh in=unknown.fa ref=plastid.sketch
Taxonomic Classification
# Classify sequences against a remote sketch server (queries JGI-hosted references, not the local plastid files)
sendsketch.sh in=query.fasta db=nt coverage=50
Annotation Analysis
The GFF3 files support various annotation workflows:
- Gene extraction: Extract specific gene types (rRNA, photosystem genes, etc.)
- Comparative annotation: Compare gene content across species
- Synteny analysis: Study gene order conservation
- Functional analysis: Analyze gene functions and pathways
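Gene extraction by feature type needs nothing beyond a filter on GFF3 column 3. The helper below is a sketch; extract_features is a hypothetical name, and the type strings available depend on what the converter wrote.

```shell
# Print the GFF3 lines of one feature type (column 3) from a gzip-compressed file.
extract_features() {
    gzip -dc "$2" | awk -F '\t' -v type="$1" '!/^#/ && $3 == type'
}

# Example: extract_features rRNA plastid.genomic.gff.gz > plastid.rRNA.gff
```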
Research Applications
Phylogenetic Studies
- Construct large-scale plastid phylogenies
- Analyze deep evolutionary relationships
- Study plastid genome evolution across lineages
- Investigate endosymbiotic evolution
Comparative Genomics
- Compare plastid genome structures across species
- Identify conserved and variable regions
- Study gene loss and acquisition patterns
- Analyze genome rearrangements
Functional Genomics
- Annotate newly sequenced plastid genomes
- Identify photosynthesis-related genes
- Study plastid gene expression patterns
- Analyze metabolic pathway evolution
Environmental Studies
- Identify plastid sequences in environmental samples
- Study photosynthetic community composition
- Track plastid diversity in ecosystems
- Analyze agricultural crop plastid variation
Quality Assurance
Data Verification
After running the pipeline, verify data integrity:
# Check file completeness
ls -lh plastid.genomic.*
# Test compression integrity
gzip -t plastid.genomic.*.gz
# Count sequences in FASTA
gzip -dc plastid.genomic.fna.gz | grep -c "^>"
# Verify GFF3 format
gzip -dc plastid.genomic.gff.gz | head -20
Expected Results
Successful pipeline execution should produce:
- Three output files with reasonable sizes (GB range)
- Valid compressed formats that pass integrity checks
- Thousands of plastid genome sequences
- Properly formatted GFF3 with standard headers
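These checks can be bundled into one function that fails loudly, which is convenient in scheduled jobs. The sketch below tests only what is cheap to verify (presence, gzip integrity, at least one FASTA record); verify_outputs is a hypothetical name.

```shell
# Check the three pipeline outputs in DIR: each must exist, be non-empty, and
# pass "gzip -t"; the FASTA must contain at least one record.
verify_outputs() {
    dir=${1:-.}
    for f in plastid.genomic.fna.gz plastid.genomic.gbff.gz plastid.genomic.gff.gz; do
        if [ ! -s "$dir/$f" ] || ! gzip -t "$dir/$f" 2>/dev/null; then
            echo "FAIL: $f is missing, empty, or corrupt" >&2
            return 1
        fi
    done
    n=$(gzip -dc "$dir/plastid.genomic.fna.gz" | grep -c '^>' || true)
    if [ "$n" -eq 0 ]; then
        echo "FAIL: no FASTA records found" >&2
        return 1
    fi
    echo "OK: $n sequences"
}

# Example: verify_outputs /path/to/output/directory
```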
Maintenance and Updates
Regular Updates
Consider periodic re-runs to capture:
- Newly submitted plastid genomes
- Updated annotations from existing genomes
- Corrections and improvements to RefSeq entries
- New species and taxonomic groups
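To see what a re-run actually added, the sequence IDs of two snapshots can be compared. This sketch assumes both snapshots were kept as gzip-compressed FASTA files and treats the first word of each header as the ID; new_accessions and the file names in the example are hypothetical.

```shell
# List sequence IDs (first word of each FASTA header) present in NEW but not in OLD.
new_accessions() {
    old_ids=$(mktemp)
    new_ids=$(mktemp)
    gzip -dc "$1" | awk '/^>/ { print substr($1, 2) }' | sort -u > "$old_ids"
    gzip -dc "$2" | awk '/^>/ { print substr($1, 2) }' | sort -u > "$new_ids"
    # comm -13 keeps only the lines unique to the second (new) list.
    comm -13 "$old_ids" "$new_ids"
    rm -f "$old_ids" "$new_ids"
}

# Example: new_accessions old/plastid.genomic.fna.gz new/plastid.genomic.fna.gz
```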
Version Tracking
Track download dates and versions:
# Add timestamp to output files
DATE=$(date +%Y%m%d)
mv plastid.genomic.fna.gz plastid.genomic.${DATE}.fna.gz
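The same renaming can be applied to all three outputs so that the sequences and annotations from one run stay together. A minimal sketch, with stamp_outputs as a hypothetical function name:

```shell
# Rename the pipeline outputs in the current directory so each carries a date
# stamp, e.g. plastid.genomic.fna.gz -> plastid.genomic.20250101.fna.gz.
# With no argument, today's date is used.
stamp_outputs() {
    stamp=${1:-$(date +%Y%m%d)}
    for ext in fna gbff gff; do
        src="plastid.genomic.$ext.gz"
        if [ -e "$src" ]; then
            mv "$src" "plastid.genomic.$stamp.$ext.gz"
        fi
    done
}

# Example: stamp_outputs            (today's date)
#          stamp_outputs 20250101   (explicit stamp)
```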
Alternative Approaches
Selective Downloads
For more targeted downloads, consider using fetchproks.sh with specific parameters or manual wget commands for specific taxonomic groups.
Format Alternatives
If you prefer different annotation formats:
- Keep GenBank format for detailed annotations
- Convert to other formats using external tools
- Extract specific features using BBTools filtering tools
Support
For questions and support:
- Email: bbushnell@lbl.gov
- Documentation: bbmap.org