Fetch Plasmid Pipeline
Streamlined pipeline for downloading NCBI RefSeq plasmid genomes in both GenBank flat file format (GBFF) and FASTA nucleotide format (FNA), then converting the annotations to GFF3 format for downstream analysis and annotation studies.
Overview
The Fetch Plasmid pipeline provides a simple, automated approach to obtaining comprehensive plasmid sequence data from NCBI's RefSeq database. Plasmids are self-replicating genetic elements found primarily in bacteria and archaea that often carry important genes for antibiotic resistance, virulence factors, metabolic pathways, and horizontal gene transfer mechanisms.
This pipeline downloads all available RefSeq plasmid genomes in two formats: the complete genomic sequences (FASTA) for sequence analysis, and the annotated GenBank files (GBFF) for feature annotation studies. The GenBank annotations are then converted to the more widely used GFF3 format for compatibility with modern genomic analysis tools.
Prerequisites
System Requirements
- BBTools suite installed with gbff2gff.sh utility
- wget for FTP downloads
- Network connectivity to NCBI FTP servers
- Sufficient disk space for compressed genomic data (several GB recommended)
- Java runtime environment (for gbff2gff conversion)
Network Requirements
- Access to ftp://ftp.ncbi.nih.gov/genomes/refseq/plasmid/
- Stable internet connection for potentially large file transfers
- No proxy restrictions on FTP protocol
Pipeline Stages
1. GenBank Annotation Download
wget -q -O - ftp://ftp.ncbi.nih.gov/genomes/refseq/plasmid/*genomic.gbff.gz > plasmid.genomic.gbff.gz
Downloads all RefSeq plasmid genomes in GenBank flat file format, which contains:
- Sequence data: Complete nucleotide sequences for each plasmid
- Annotation features: Genes, CDS, tRNA, rRNA, and other genomic features
- Taxonomic information: Source organism and taxonomic classification
- Cross-references: Links to protein databases, publications, and other resources
- Metadata: Assembly information, submission details, and quality metrics
2. Nucleotide Sequence Download
wget -q -O - ftp://ftp.ncbi.nih.gov/genomes/refseq/plasmid/*genomic.fna.gz > plasmid.genomic.fna.gz
Downloads the same plasmid genomes in FASTA nucleotide format for sequence-only analysis:
- Clean sequences: Pure nucleotide data without annotation complexity
- Compatibility: Standard FASTA format for use with alignment and analysis tools
- Performance: Faster loading and processing for sequence-only operations
- Redundancy: Backup format in case annotation parsing fails
3. Annotation Format Conversion
gbff2gff.sh plasmid.genomic.gbff.gz plasmid.genomic.gff.gz
Converts GenBank flat file annotations to GFF3 format using BBTools' specialized parser:
- Selective extraction: Focuses on important feature types (CDS, tRNA, rRNA)
- Format standardization: Creates GFF3-compliant output with proper headers
- Coordinate precision: Maintains exact genomic coordinates for all features
- Attribute preservation: Retains essential feature attributes and cross-references
Basic Usage
# 1. Navigate to a working directory with sufficient space
cd /path/to/working/directory
# 2. Run the pipeline
bash fetchPlasmid.sh
# 3. Check the output file sizes (run from a second terminal to watch them grow during the download)
ls -lh plasmid.genomic.*
Output Files
Primary Downloads
- plasmid.genomic.gbff.gz - All RefSeq plasmid genomes in GenBank flat file format with complete annotations
- plasmid.genomic.fna.gz - All RefSeq plasmid genomes in FASTA nucleotide format
Processed Annotations
- plasmid.genomic.gff.gz - GFF3 format annotations generated from the GenBank files, containing the extracted CDS, tRNA, and rRNA features
File Content Details
GenBank Format (GBFF)
Contains comprehensive genomic information structured in blocks:
- LOCUS: Basic sequence information, length, molecule type
- DEFINITION: Descriptive title and organism information
- ACCESSION/VERSION: Unique identifiers and version numbers
- FEATURES: Detailed annotation of genes, proteins, and regulatory elements
- ORIGIN: Complete nucleotide sequence data
FASTA Format (FNA)
Simple sequence format with minimal headers:
- Headers: Sequence identifiers, organism names, and basic metadata
- Sequences: Clean nucleotide data (A, T, G, C, N)
- Compatibility: Universal format for sequence analysis tools
GFF3 Format
Standardized annotation format with nine tab-delimited columns:
- Seqid: Sequence identifier matching FASTA headers
- Source: "RefSeq" or annotation source
- Type: Feature type (gene, CDS, tRNA, rRNA, etc.)
- Start/End: Genomic coordinates (1-based, inclusive)
- Score/Strand/Phase: Feature properties
- Attributes: Additional feature information (gene names, products, etc.)
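As a rough illustration of the nine-column layout, the sketch below tallies the number of features and their combined length for each feature type. The function name `gff_type_summary` is hypothetical and not part of BBTools, and it assumes uncompressed GFF3 arrives on standard input.

```bash
# Hypothetical helper (not part of BBTools): summarize GFF3 records by type.
# Reads uncompressed GFF3 on stdin; prints "type<TAB>count<TAB>total_bp".
gff_type_summary() {
    awk -F'\t' '
        /^#/ || NF != 9 { next }          # skip header lines and rows without nine columns
        {
            count[$3]++                   # column 3 = feature type
            bases[$3] += $5 - $4 + 1      # columns 4 and 5 = 1-based inclusive start and end
        }
        END {
            for (t in count) printf "%s\t%d\t%d\n", t, count[t], bases[t]
        }
    ' | sort
}
```

A typical call would be `zcat plasmid.genomic.gff.gz | gff_type_summary`. The `+ 1` in the length calculation follows from the coordinates being 1-based and inclusive.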
Algorithm Details
Download Strategy
The pipeline uses efficient batch downloading from NCBI's FTP servers:
Dual Format Acquisition
Both complementary data formats are downloaded the same way, one after the other:
- Wildcard expansion: Uses * pattern to capture all plasmid files automatically
- Stream redirection: Redirects wget's standard output into a single file, so individual source files are never stored separately
- Quiet mode: Suppresses wget progress output for clean execution
- Compression preservation: Maintains gzip compression throughout transfer
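Streaming several compressed files into one output works because the gzip format permits multiple compressed members back to back in a single file. The following self-contained demonstration uses throwaway files to show this. The function name and file names are illustrative only.

```bash
# Demonstration: concatenated gzip files form one valid multi-member gzip stream.
# File names here are throwaway examples, not pipeline outputs.
demo_gzip_concat() {
    local dir
    dir=$(mktemp -d) || return 1
    printf '>seq1\nACGT\n' | gzip -c > "$dir/part1.fna.gz"
    printf '>seq2\nTTGA\n' | gzip -c > "$dir/part2.fna.gz"
    # Byte-level concatenation, comparable to wget writing several .gz files to stdout
    cat "$dir/part1.fna.gz" "$dir/part2.fna.gz" > "$dir/combined.fna.gz"
    # Test the combined archive, then decompress it to stdout
    gzip -t "$dir/combined.fna.gz" && gzip -dc "$dir/combined.fna.gz"
    local status=$?
    rm -rf "$dir"
    return "$status"
}
```

Calling `demo_gzip_concat` prints both FASTA records in order, which is why the combined download can be read with `zcat` without first splitting it.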
Format Complementarity
The two formats serve different analytical purposes:
- GBFF advantages: Rich annotation data, cross-references, detailed metadata
- FNA advantages: Faster parsing, universal compatibility, sequence-only focus
- GFF3 benefits: Standardized annotation format, tool compatibility, structured attributes
GenBank to GFF3 Conversion Algorithm
The gbff2gff.sh tool implements selective feature extraction:
Parsing Strategy
Block-based parsing of GenBank format:
- Locus identification: Extracts accession numbers and sequence regions
- Feature filtering: Selects only relevant feature types (CDS, tRNA, rRNA)
- Coordinate extraction: Parses genomic positions with join/complement handling
- Attribute mapping: Converts GenBank qualifiers to GFF3 attributes
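To make the join/complement handling concrete, the sketch below reduces a GenBank location string to an outer start, an outer end, and a strand. It is a simplified illustration and not the parser used by gbff2gff.sh. The function name `parse_location` is hypothetical, and the approach would misread locations that embed a remote accession (for example `J00194.1:100..202`), because every integer in the string is treated as a coordinate.

```bash
# Hypothetical illustration only; this is not the parser used by gbff2gff.sh.
# Reduces a GenBank location string to "start<TAB>end<TAB>strand",
# spanning the outermost coordinates of any join().
parse_location() {
    local loc="$1" strand="+"
    case "$loc" in
        "complement("*) strand="-" ;;     # an outer complement() means the minus strand
    esac
    # Pull out every integer, ignoring operators and the partial markers < and >
    printf '%s\n' "$loc" | grep -oE '[0-9]+' | sort -n |
        awk -v s="$strand" '
            NR == 1 { min = $1 }
                    { max = $1 }
            END     { if (NR) printf "%d\t%d\t%s\n", min, max, s }
        '
}
```

For example, `parse_location 'complement(join(20..30,1..10))'` yields start 1, end 30, and strand `-`. A full converter would also need to decide how to represent each joined segment rather than only the outer span.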
Quality Control
Built-in validation during conversion:
- Error detection: Skips malformed features that fail parsing
- Pseudogene filtering: Excludes features marked as pseudogenes in the GenBank annotations
- Format compliance: Ensures GFF3 specification adherence
- Header preservation: Maintains essential metadata in GFF3 headers
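Independently of the converter's own checks, the output can be given a basic structural sanity check. The sketch below is not a full GFF3 validator and is not part of BBTools. The function name `gff_basic_check` is hypothetical, and it only tests column count, numeric coordinates, coordinate order, and the strand symbol.

```bash
# Hypothetical sanity check (not a full GFF3 validator, and not part of BBTools).
# Reads uncompressed GFF3 on stdin, reports rows that break basic structural rules,
# and returns non-zero if any such row was found.
gff_basic_check() {
    awk -F'\t' '
        /^#/ || /^$/ { next }
        NF != 9                              { print "line " NR ": expected 9 columns, found " NF; bad++; next }
        $4 !~ /^[0-9]+$/ || $5 !~ /^[0-9]+$/ { print "line " NR ": non-numeric coordinates"; bad++; next }
        $4 + 0 > $5 + 0                      { print "line " NR ": start exceeds end"; bad++; next }
        $7 !~ /^[-+.?]$/                     { print "line " NR ": invalid strand " $7; bad++ }
        END { exit (bad > 0) }
    '
}
```

It could be run as `zcat plasmid.genomic.gff.gz | gff_basic_check`, with silence and a zero exit status indicating that no structural problems were detected.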
Performance Characteristics
- Memory usage: Minimal (1 GB default for conversion), stream-based processing
- Network efficiency: Single-connection, sequential downloads; because output is streamed to standard output, an interrupted run must be restarted rather than resumed
- Disk optimization: Maintains compression throughout pipeline
- Processing speed: Fast conversion due to selective feature extraction
- Scalability: Handles variable dataset sizes automatically
Plasmid Biology Context
Plasmid Characteristics
Plasmids are autonomous genetic elements with important biological roles:
- Self-replication: Independent replication machinery
- Horizontal transfer: Can move between bacterial cells
- Accessory functions: Often carry non-essential but advantageous genes
- Variable size: Range from about 1 kb to more than 1 Mb depending on gene content
- Copy number variation: Different plasmids maintain different cellular concentrations
Research Applications
Plasmid sequence analysis supports diverse research areas:
Antibiotic Resistance Studies
- Resistance gene identification: Detection of beta-lactamases, aminoglycoside modifying enzymes
- Resistance cassette analysis: Study of integron structures and gene arrangements
- Transmission tracking: Following resistance spread through bacterial populations
Virulence Factor Analysis
- Toxin genes: Identification of enterotoxins, cytotoxins, and other virulence factors
- Adhesion factors: Genes encoding surface proteins for host cell binding
- Immune evasion: Mechanisms for avoiding host immune responses
Metabolic Pathway Studies
- Degradation pathways: Genes for breaking down complex organic compounds
- Biosynthesis clusters: Secondary metabolite production genes
- Metal resistance: Heavy metal tolerance and detoxification systems
Integration with BBTools Workflow
Using Downloaded Plasmid Data
The generated files can be integrated into various BBTools analyses:
Sequence Analysis
# Create sketch database for plasmid identification
sketch.sh in=plasmid.genomic.fna.gz out=plasmid.sketch
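# Search query sequences against the local plasmid sketch
# (comparesketch.sh compares against local references; sendsketch.sh queries remote sketch servers)
comparesketch.sh in=query.fq.gz ref=plasmid.sketch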
# Map reads to plasmid sequences
bbmap.sh in=reads.fq.gz ref=plasmid.genomic.fna.gz
Annotation Analysis
# Extract CDS features from the compressed GFF3 (decompress first, then match column 3)
zcat plasmid.genomic.gff.gz | awk -F'\t' '$3 == "CDS"'
# Combine with sequence data for feature extraction
# (GFF3 coordinates can be used with sequence data for gene extraction)
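One way to prepare the coordinates for sequence extraction is to convert the CDS rows to BED, which many interval tools accept. The sketch below assumes uncompressed GFF3 on standard input. The function name `gff_cds_to_bed` is hypothetical and not part of BBTools.

```bash
# Hypothetical helper: convert CDS rows of a GFF3 stream to 6-column BED.
# BED intervals are 0-based and half-open, so the GFF3 start is decremented by one
# while the end is left unchanged.
gff_cds_to_bed() {
    awk -F'\t' -v OFS='\t' '
        !/^#/ && $3 == "CDS" { print $1, $4 - 1, $5, $3, ".", $7 }
    '
}
```

A possible invocation is `zcat plasmid.genomic.gff.gz | gff_cds_to_bed > plasmid.cds.bed`. The resulting file could then be passed, together with the decompressed FASTA, to an interval extraction tool such as `bedtools getfasta`.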
Comparative Genomics
# Remove duplicate and fully contained plasmid sequences
dedupe.sh in=plasmid.genomic.fna.gz out=plasmid.unique.fa.gz
# Cluster similar plasmids
cluster.sh in=plasmid.genomic.fna.gz out=plasmid.clusters.fa.gz
Database Preparation for Classification
# Build a BBMap reference index from the plasmid sequences
bbmap.sh ref=plasmid.genomic.fna.gz
# Create k-mer index for rapid screening
sketch.sh in=plasmid.genomic.fna.gz out=plasmid_kmers.sketch size=10000
File Format Details
GenBank Flat File Format (GBFF)
Comprehensive genomic annotation format developed by NCBI:
Structure
- LOCUS line: Sequence name, length, molecule type, division
- DEFINITION: Human-readable description of the sequence
- ACCESSION: Unique database identifier
- FEATURES table: Detailed annotation of genomic features
- ORIGIN section: Complete nucleotide sequence
Feature Types
Common plasmid features found in the GBFF files:
- gene: Gene boundaries and basic information
- CDS: Protein-coding sequences with translation details
- rep_origin: Replication origin sequences
- regulatory: Promoters, terminators, and control elements
- mobile_element: Transposons and insertion sequences
GFF3 Format Conversion
The gbff2gff.sh conversion creates standardized GFF3 output:
Selective Feature Extraction
Only processes relevant feature types for downstream analysis:
- CDS features: Protein-coding genes with complete coordinate and attribute information
- tRNA features: Transfer RNA genes critical for translation machinery
- rRNA features: Ribosomal RNA genes essential for protein synthesis
- Quality filtering: Excludes pseudogenes and malformed annotations
GFF3 Structure
Standard nine-column format with BBTools-specific enhancements:
- Seqid: Plasmid accession number matching FASTA headers
- Source: "RefSeq" indicating data source
- Type: Feature type (CDS, tRNA, rRNA)
- Start: Feature start position (1-based)
- End: Feature end position (inclusive)
- Score: Quality score (if available) or "."
- Strand: + or - for feature orientation
- Phase: Reading frame for CDS features
- Attributes: Key=value pairs with gene names, products, etc.
Data Analysis Applications
Antimicrobial Resistance Research
Plasmid data is crucial for understanding resistance mechanisms:
- Resistance gene cataloging: Comprehensive database of known resistance determinants
- Genetic context analysis: Study of genes surrounding resistance elements
- Horizontal transfer studies: Tracking resistance spread between bacterial species
- Epidemiological surveillance: Monitoring resistance gene prevalence and distribution
Plasmid Typing and Classification
Sequence analysis enables plasmid characterization:
- Incompatibility grouping: Classification based on replication machinery
- Host range determination: Prediction of bacterial hosts based on sequence features
- Evolutionary analysis: Phylogenetic studies of plasmid backbone sequences
- Mobility assessment: Identification of transfer and mobilization genes
Functional Genomics
Annotation data supports functional studies:
- Gene content analysis: Systematic cataloging of plasmid-encoded functions
- Operon structure: Organization of functionally related genes
- Regulatory element mapping: Promoters, terminators, and control sequences
- Comparative genomics: Cross-plasmid function and structure comparison
Technical Implementation
Download Methodology
Efficient bulk data acquisition from NCBI:
FTP Wildcard Processing
The wildcard approach (*genomic.gbff.gz) captures all plasmid files:
- Automatic discovery: No need to maintain file lists manually
- Complete coverage: Ensures all available RefSeq plasmids are included
- Future-proof: Automatically includes new plasmids added to RefSeq
- Consistent naming: Leverages NCBI's standardized file naming conventions
Stream Concatenation
Direct pipeline from download to single compressed file:
- Memory efficiency: No intermediate file storage during download
- Disk space optimization: Single compressed output file
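- Failure handling: The shell creates the output file before wget starts, so a failed transfer can leave a truncated file; check wget's exit status and the file size before proceeding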
- Network resilience: wget handles connection issues automatically
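Where an all-or-nothing result is wanted, one option is to stream into a temporary file and rename it only after the transfer and a gzip integrity test both succeed. The wrapper below is a sketch under that assumption. The names `fetch_format`, `BASE_URL`, and the `.part` suffix are illustrative and are not taken from fetchPlasmid.sh.

```bash
# Hypothetical wrapper (names are illustrative, not taken from fetchPlasmid.sh).
# Streams every matching file to a temporary name and renames it only on success,
# so a failed transfer does not leave a truncated plasmid.genomic.*.gz behind.
BASE_URL="ftp://ftp.ncbi.nih.gov/genomes/refseq/plasmid"

fetch_format() {
    local ext="$1"                          # "gbff" or "fna"
    local out="plasmid.genomic.${ext}.gz"
    local tmp="${out}.part"
    # The URL is quoted so that wget, not the shell, expands the * wildcard
    if wget -q -O - "${BASE_URL}/*genomic.${ext}.gz" > "$tmp" && gzip -t "$tmp"; then
        mv "$tmp" "$out"
    else
        rm -f "$tmp"
        echo "download of ${ext} failed" >&2
        return 1
    fi
}
```

With this in place, `fetch_format gbff && fetch_format fna` would run both downloads in sequence and stop at the first failure.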
Annotation Processing Algorithm
The conversion logic is implemented in GbffFile.java:
Block-Structured Parsing
GenBank format is processed as hierarchical blocks:
- Locus parsing: Extracts sequence identifiers and metadata
- Feature block detection: Identifies annotation sections by indentation
- Coordinate parsing: Handles complex location strings (joins, complements)
- Attribute extraction: Processes qualifier strings for feature properties
Feature Selection Logic
Selective extraction based on biological importance:
- Essential genes: CDS, tRNA, rRNA are critical for functional analysis
- Quality control: Pseudogenes are excluded to avoid false annotations
- Error handling: Malformed features are logged and skipped
- Format validation: Ensures output conforms to GFF3 specification
Performance Optimizations
- Stream processing: Downloads are written straight to disk and conversion reads its input as a stream, so no file is loaded into memory in full
- Compression maintenance: Keeps files compressed throughout pipeline
- Minimal memory footprint: 1 GB default memory allocation for conversion
- Error resilience: Continues processing despite individual feature parsing failures
Troubleshooting
Download Issues
- Network timeouts: NCBI FTP servers may experience high load
- Incomplete downloads: Check file sizes and retry if necessary
- Permission errors: Ensure write permissions in current directory
- Disk space: Monitor available space during large downloads
Conversion Problems
- Java memory errors: Increase -Xmx setting in gbff2gff.sh if needed
- Malformed GenBank: Some entries may have parsing issues, conversion continues
- Empty GFF output: Check that GenBank file contains supported feature types
- Encoding issues: Ensure proper character encoding for international organism names
File Verification
# Inspect the first lines of the download (gzip -t plasmid.genomic.gbff.gz tests the whole archive)
zcat plasmid.genomic.gbff.gz | head -n 20
# Verify GFF3 conversion results
zcat plasmid.genomic.gff.gz | head -n 10
# Count features by type
zcat plasmid.genomic.gff.gz | grep -v "^#" | cut -f3 | sort | uniq -c
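Because both downloads are meant to describe the same plasmids, comparing their record counts gives a quick cross-check. The sketch below assumes one `>` header per FASTA record and one `LOCUS` line per GenBank record. The function name `compare_record_counts` is hypothetical.

```bash
# Hypothetical consistency check: the FASTA and GenBank downloads should describe
# the same set of plasmids, so their record counts are expected to agree.
compare_record_counts() {
    local fna="$1" gbff="$2" n_fna n_gbff
    gzip -t "$fna" "$gbff" || return 2                 # stop early on a corrupt or truncated archive
    n_fna=$(gzip -dc "$fna" | grep -c '^>')            # one ">" header per FASTA record
    n_gbff=$(gzip -dc "$gbff" | grep -c '^LOCUS ')     # one LOCUS line per GenBank record
    printf 'fasta=%s genbank=%s\n' "$n_fna" "$n_gbff"
    [ "$n_fna" -eq "$n_gbff" ]                         # exit status reflects whether the counts match
}
```

It could be called as `compare_record_counts plasmid.genomic.fna.gz plasmid.genomic.gbff.gz`. A mismatch suggests that one of the transfers was incomplete and should be repeated.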
Advanced Usage
Selective Downloads
For targeted analysis, modify the wget commands:
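# Download specific taxonomic groups (example: Enterobacteriaceae)
# Note: the subdirectory below is illustrative; list the FTP directory first to confirm which paths exist
wget -q -O - ftp://ftp.ncbi.nih.gov/genomes/refseq/plasmid/bacteria/Enterobacteriaceae/*genomic.gbff.gz > enterobac_plasmids.gbff.gz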
Custom Processing Workflows
Integrate with other BBTools for specialized analysis:
Resistance Gene Screening
# After running fetchPlasmid.sh:
# Extract CDS sequences for resistance gene annotation
# (Would require additional tools to extract CDS from coordinates)
# Keep reads that share at least one 31-mer with a RefSeq plasmid sequence
bbduk.sh in=sample.fq.gz ref=plasmid.genomic.fna.gz k=31 maskmiddle=f outm=plasmid_matches.fq.gz
Taxonomic Assignment
# Create taxonomic database for plasmid host identification
taxonomy.sh tree=auto -Xmx8g in=plasmid.genomic.fna.gz gi=auto
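# Compare environmental plasmids against the downloaded references
# (comparesketch.sh is used for local references; sendsketch.sh queries remote sketch servers)
comparesketch.sh in=environmental_sample.fa.gz ref=plasmid.genomic.fna.gz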
Database Maintenance
Keep plasmid data current with periodic updates:
- Run the pipeline monthly or quarterly to capture new RefSeq releases
- Compare new downloads with previous versions to identify additions
- Update any downstream databases that depend on plasmid data
- Archive previous versions for reproducibility of published analyses
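Identifying additions between two snapshots can be done by comparing the sequence identifiers in their FASTA headers. The helpers below are a sketch and not part of BBTools. The function names and the snapshot file names in the usage note are hypothetical, and the identifier is assumed to be the first whitespace-delimited token after `>`.

```bash
# Hypothetical helpers for comparing two snapshots of the FASTA download.
# list_accessions prints the sorted, unique sequence identifiers in a gzipped FASTA.
list_accessions() {
    gzip -dc "$1" | awk '/^>/ { sub(/^>/, "", $1); print $1 }' | LC_ALL=C sort -u
}

# new_accessions OLD NEW prints identifiers present in NEW but absent from OLD.
new_accessions() {
    local old_ids new_ids
    old_ids=$(mktemp) && new_ids=$(mktemp) || return 1
    list_accessions "$1" > "$old_ids"
    list_accessions "$2" > "$new_ids"
    # -13 hides lines unique to OLD and lines common to both, leaving lines unique to NEW
    LC_ALL=C comm -13 "$old_ids" "$new_ids"
    rm -f "$old_ids" "$new_ids"
}
```

For example, `new_accessions archive/plasmid.genomic.fna.gz plasmid.genomic.fna.gz` would list the identifiers that appeared since the archived copy was made. Both sorting and `comm` run under the same `LC_ALL=C` locale so that their orderings agree.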
Notes and Considerations
- Data completeness: RefSeq plasmid collection represents well-characterized sequences but may not include all known plasmids
- Annotation quality: RefSeq annotations undergo quality control but may vary in completeness between entries
- File size growth: Plasmid databases grow continuously as new sequences are submitted
- Network dependency: Pipeline requires internet access to NCBI servers
- Compression efficiency: Files remain compressed to minimize storage requirements
- Format compatibility: Generated files work with standard genomic analysis software
- Update frequency: NCBI updates RefSeq regularly, consider periodic re-downloads
- Licensing: NCBI places no restrictions on the use or distribution of its data, though individual submitters may retain rights to the records they deposited
Related Tools and Pipelines
BBTools Utilities
- gbff2gff.sh: GenBank to GFF3 conversion (used in this pipeline)
- sketch.sh: Create k-mer databases for rapid sequence identification
- sendsketch.sh: Query sequences against sketch databases
- bbmap.sh: Align sequences to reference databases
- filterbyname.sh: Extract sequences based on identifiers
Related Fetch Pipelines
- fetchRefSeq.sh: Download complete RefSeq genomes
- fetchMito.sh: Mitochondrial genome download
- fetchPlastid.sh: Chloroplast genome download
- fetchTaxonomy.sh: NCBI taxonomic database download
- fetchSilva.sh: SILVA ribosomal RNA database preparation
External Dependencies
- wget: File transfer utility (standard on most Unix systems)
- Java: Required for gbff2gff conversion tool
- gzip/zcat: Compression utilities for file manipulation