Fetch Plasmid Pipeline

Overview

The Fetch Plasmid pipeline provides a simple, automated approach to obtaining comprehensive plasmid sequence data from NCBI's RefSeq database. Plasmids are self-replicating genetic elements found primarily in bacteria and archaea that often carry important genes for antibiotic resistance, virulence factors, metabolic pathways, and horizontal gene transfer mechanisms.

This pipeline downloads all available RefSeq plasmid genomes in two formats: the complete genomic sequences (FASTA) for sequence analysis, and the annotated GenBank files (GBFF) for feature annotation studies. The GenBank annotations are then converted to the more widely used GFF3 format for compatibility with modern genomic analysis tools.

Note: This pipeline downloads all RefSeq plasmid data, which can be substantial in size (hundreds of MB to several GB). Ensure adequate network bandwidth and disk space before running.

Prerequisites

System Requirements

BBTools suite installed with gbff2gff.sh utility
wget for FTP downloads
Network connectivity to NCBI FTP servers
Sufficient disk space for compressed genomic data (several GB recommended)
Java runtime environment (for gbff2gff conversion)

Network Requirements

Access to ftp://ftp.ncbi.nih.gov/genomes/refseq/plasmid/
Stable internet connection for potentially large file transfers
No proxy restrictions on FTP protocol
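
The requirements above can be checked before starting a long download. The following is a minimal preflight sketch, not part of fetchPlasmid.sh; the helper names check_tool and free_gb and the 10 GB threshold are assumptions chosen for illustration.

```shell
#!/bin/sh
# Report whether a required command is available on PATH.
# Returns 0 if found, 1 otherwise.
check_tool() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "ok:      $1"
        return 0
    fi
    echo "missing: $1"
    return 1
}

# Free space, in whole GB, on the filesystem holding the given directory.
# df -Pk prints POSIX-format output in 1024-byte blocks; column 4 is "Available".
free_gb() {
    df -Pk "$1" | awk 'NR == 2 { print int($4 / 1024 / 1024) }'
}

missing=0
for tool in wget java gbff2gff.sh gzip; do
    check_tool "$tool" || missing=$((missing + 1))
done

min_gb=10   # assumed threshold, adjust to the current size of the collection
avail_gb=$(free_gb .)
if [ -n "$avail_gb" ] && [ "$avail_gb" -lt "$min_gb" ]; then
    echo "warning: only ${avail_gb} GB free in $(pwd) (threshold: ${min_gb} GB)"
fi
echo "$missing required tool(s) missing"
```

The script only reports problems; it does not test FTP reachability, which is easiest to confirm by listing the plasmid directory with wget or a browser.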

Pipeline Stages

1. GenBank Annotation Download

wget -q -O - ftp://ftp.ncbi.nih.gov/genomes/refseq/plasmid/*genomic.gbff.gz > plasmid.genomic.gbff.gz

Downloads all RefSeq plasmid genomes in GenBank flat file format, which contains:

Sequence data: Complete nucleotide sequences for each plasmid
Annotation features: Genes, CDS, tRNA, rRNA, and other genomic features
Taxonomic information: Source organism and taxonomic classification
Cross-references: Links to protein databases, publications, and other resources
Metadata: Assembly information, submission details, and quality metrics

2. Nucleotide Sequence Download

wget -q -O - ftp://ftp.ncbi.nih.gov/genomes/refseq/plasmid/*genomic.fna.gz > plasmid.genomic.fna.gz

Downloads the same plasmid genomes in FASTA nucleotide format for sequence-only analysis:

Clean sequences: Pure nucleotide data without annotation complexity
Compatibility: Standard FASTA format for use with alignment and analysis tools
Performance: Faster loading and processing for sequence-only operations
Redundancy: Backup format in case annotation parsing fails

3. Annotation Format Conversion

gbff2gff.sh plasmid.genomic.gbff.gz plasmid.genomic.gff.gz

Converts GenBank flat file annotations to GFF3 format using BBTools' specialized parser:

Selective extraction: Focuses on important feature types (CDS, tRNA, rRNA)
Format standardization: Creates GFF3-compliant output with proper headers
Coordinate precision: Maintains exact genomic coordinates for all features
Attribute preservation: Retains essential feature attributes and cross-references

Basic Usage

# 1. Navigate to a working directory with sufficient space
cd /path/to/working/directory

# 2. Run the pipeline
bash fetchPlasmid.sh

# 3. Monitor download progress and file sizes
ls -lh plasmid.genomic.*

Warning: The pipeline downloads all RefSeq plasmid data, which can be several GB compressed. Ensure adequate disk space and be patient during the download process.

Output Files

Primary Downloads

plasmid.genomic.gbff.gz - All RefSeq plasmid genomes in GenBank flat file format with complete annotations
plasmid.genomic.fna.gz - All RefSeq plasmid genomes in FASTA nucleotide format

Processed Annotations

plasmid.genomic.gff.gz - GFF3 format annotations generated from GenBank files, containing essential features like genes, CDS, tRNA, and rRNA

File Content Details

GenBank Format (GBFF)

Contains comprehensive genomic information structured in blocks:

LOCUS: Basic sequence information, length, molecule type
DEFINITION: Descriptive title and organism information
ACCESSION/VERSION: Unique identifiers and version numbers
FEATURES: Detailed annotation of genes, proteins, and regulatory elements
ORIGIN: Complete nucleotide sequence data
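
Because every record starts with a LOCUS line and carries a VERSION line, these blocks are easy to inspect with standard text tools. The sketch below builds a tiny two-record sample (the accessions are made up) and counts records and lists versioned accessions; on real data, the same two pipelines would read plasmid.genomic.gbff.gz instead.

```shell
#!/bin/sh
# Build a small, hypothetical GenBank sample so the commands can be tried offline.
sample=$(mktemp)
cat > "$sample" <<'EOF'
LOCUS       NZ_EXAMPLE01              12 bp    DNA     circular CON 01-JAN-2000
DEFINITION  Hypothetical plasmid pEx1, complete sequence.
ACCESSION   NZ_EXAMPLE01
VERSION     NZ_EXAMPLE01.1
FEATURES             Location/Qualifiers
     source          1..12
ORIGIN
        1 acgtacgtac gt
//
LOCUS       NZ_EXAMPLE02               8 bp    DNA     circular CON 01-JAN-2000
DEFINITION  Hypothetical plasmid pEx2, complete sequence.
ACCESSION   NZ_EXAMPLE02
VERSION     NZ_EXAMPLE02.1
FEATURES             Location/Qualifiers
     source          1..8
ORIGIN
        1 ggccggcc
//
EOF
gzip -c "$sample" > "$sample.gz"

# One LOCUS line per record, so counting them counts plasmids.
record_count=$(gzip -dc "$sample.gz" | grep -c '^LOCUS')

# The VERSION line holds the versioned accession.
versions=$(gzip -dc "$sample.gz" | awk '$1 == "VERSION" { print $2 }')

echo "records: $record_count"
echo "$versions"
```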

FASTA Format (FNA)

Simple sequence format with minimal headers:

Headers: Sequence identifiers, organism names, and basic metadata
Sequences: Clean nucleotide data (A, T, G, C, N)
Compatibility: Universal format for sequence analysis tools

GFF3 Format

Standardized annotation format with nine tab-delimited columns:

Seqid: Sequence identifier matching FASTA headers
Source: "RefSeq" or annotation source
Type: Feature type (gene, CDS, tRNA, rRNA, etc.)
Start/End: Genomic coordinates (1-based, inclusive)
Score/Strand/Phase: Feature properties
Attributes: Additional feature information (gene names, products, etc.)
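
As a quick illustration of the column layout, the sketch below splits one made-up GFF3 line on tabs and derives the feature length from the 1-based, inclusive coordinates. The seqid and attribute values are hypothetical.

```shell
#!/bin/sh
# One hypothetical GFF3 record; columns are separated by tab characters.
gff_line=$(printf 'NZ_EXAMPLE01.1\tRefSeq\tCDS\t101\t400\t.\t+\t0\tID=cds0;product=hypothetical protein')

# Columns: 1 seqid, 3 type, 4 start, 5 end, 7 strand.
# With 1-based inclusive coordinates, length = end - start + 1.
summary=$(printf '%s\n' "$gff_line" |
    awk -F '\t' '{ printf "%s %s %s-%s strand=%s length=%d\n", $1, $3, $4, $5, $7, $5 - $4 + 1 }')

echo "$summary"
```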

Algorithm Details

Download Strategy

The pipeline uses efficient batch downloading from NCBI's FTP servers:

Parallel Format Acquisition

Back-to-back download of complementary data formats (the two wget commands run one after the other, not concurrently):

Wildcard expansion: Uses * pattern to capture all plasmid files automatically
Stream redirection: Direct piping to output files prevents intermediate storage
Quiet mode: Suppresses wget progress output for clean execution
Compression preservation: Maintains gzip compression throughout transfer
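
Streaming several .gz files into one output works because a gzip file may consist of multiple compressed members, and decompressors read them back as a single stream. The sketch below demonstrates this offline with two tiny made-up FASTA files standing in for the files matched by the wildcard.

```shell
#!/bin/sh
workdir=$(mktemp -d)

# Two independent gzip files, standing in for separate files on the FTP server.
printf '>plasmid_A\nACGT\n' | gzip -c > "$workdir/part1.fna.gz"
printf '>plasmid_B\nGGCC\n' | gzip -c > "$workdir/part2.fna.gz"

# Byte-level concatenation, which is what redirecting a multi-file
# download to a single output file amounts to.
cat "$workdir/part1.fna.gz" "$workdir/part2.fna.gz" > "$workdir/combined.fna.gz"

# The result is still a valid gzip stream containing both records.
gzip -t < "$workdir/combined.fna.gz" && echo "combined file passes gzip -t"
combined=$(gzip -dc < "$workdir/combined.fna.gz")
header_count=$(printf '%s\n' "$combined" | grep -c '^>')
echo "FASTA headers in combined file: $header_count"
```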

Format Complementarity

The two formats serve different analytical purposes:

GBFF advantages: Rich annotation data, cross-references, detailed metadata
FNA advantages: Faster parsing, universal compatibility, sequence-only focus
GFF3 benefits: Standardized annotation format, tool compatibility, structured attributes

GenBank to GFF3 Conversion Algorithm

The gbff2gff.sh tool implements selective feature extraction:

Parsing Strategy

Block-based parsing of GenBank format:

Locus identification: Extracts accession numbers and sequence regions
Feature filtering: Selects only relevant feature types (CDS, tRNA, rRNA)
Coordinate extraction: Parses genomic positions with join/complement handling
Attribute mapping: Converts GenBank qualifiers to GFF3 attributes
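
To make the coordinate-extraction step concrete, here is a simplified sketch of turning a GenBank location string into a start, end, and strand. It is an illustration only and not the code used by gbff2gff.sh. The function name parse_location is hypothetical, partial markers (< and >) are dropped, and a join() is collapsed to its outermost coordinates, which gives the wrong span for features that wrap around the origin of a circular plasmid.

```shell
#!/bin/sh
# parse_location LOCATION
# Prints "start end strand" for a GenBank location string.
parse_location() {
    loc=$1
    case "$loc" in
        *complement*) strand='-' ;;
        *)            strand='+' ;;
    esac
    # Turn every run of non-digits into a newline, leaving one integer
    # per line, then track the smallest and largest value.
    printf '%s' "$loc" | tr -cs '0-9' '\n' |
        awk -v s="$strand" '
            NF {
                if (min == "" || $1 + 0 < min + 0) min = $1
                if ($1 + 0 > max + 0) max = $1
            }
            END { print min, max, s }'
}

parse_location '1200..1850'
parse_location 'complement(join(100..200,300..450))'
parse_location '<1..>99'
```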

Quality Control

Built-in validation during conversion:

Error detection: Skips malformed features that fail parsing
Pseudogene filtering: Excludes pseudogenes marked in GenBank annotations
Format compliance: Ensures GFF3 specification adherence
Header preservation: Maintains essential metadata in GFF3 headers

Performance Characteristics

Memory usage: Minimal (1GB default for conversion), stream-based processing
Network efficiency: Single-connection downloads; because output is streamed to stdout with -O -, an interrupted transfer cannot be resumed and must be restarted
Disk optimization: Maintains compression throughout pipeline
Processing speed: Fast conversion due to selective feature extraction
Scalability: Handles variable dataset sizes automatically

Plasmid Biology Context

Plasmid Characteristics

Plasmids are autonomous genetic elements with important biological roles:

Self-replication: Independent replication machinery
Horizontal transfer: Can move between bacterial cells
Accessory functions: Often carry non-essential but advantageous genes
Variable size: Range from about 1 kb to more than 1 Mb depending on gene content
Copy number variation: Different plasmids maintain different cellular concentrations

Research Applications

Plasmid sequence analysis supports diverse research areas:

Antibiotic Resistance Studies

Resistance gene identification: Detection of beta-lactamases, aminoglycoside modifying enzymes
Resistance cassette analysis: Study of integron structures and gene arrangements
Transmission tracking: Following resistance spread through bacterial populations

Virulence Factor Analysis

Toxin genes: Identification of enterotoxins, cytotoxins, and other virulence factors
Adhesion factors: Genes encoding surface proteins for host cell binding
Immune evasion: Mechanisms for avoiding host immune responses

Metabolic Pathway Studies

Degradation pathways: Genes for breaking down complex organic compounds
Biosynthesis clusters: Secondary metabolite production genes
Metal resistance: Heavy metal tolerance and detoxification systems

Integration with BBTools Workflow

Using Downloaded Plasmid Data

The generated files can be integrated into various BBTools analyses:

Sequence Analysis

# Create sketch database for plasmid identification
sketch.sh in=plasmid.genomic.fna.gz out=plasmid.sketch

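# Search query sequences against the local plasmid sketch
# (comparesketch.sh compares against local references; sendsketch.sh queries a remote server)
comparesketch.sh in=query.fq.gz ref=plasmid.sketch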

# Map reads to plasmid sequences
bbmap.sh in=reads.fq.gz ref=plasmid.genomic.fna.gz

Annotation Analysis

# Extract specific feature types from GFF3
zcat plasmid.genomic.gff.gz | awk -F '\t' '$3 == "CDS"'

# Combine with sequence data for feature extraction
# (GFF3 coordinates can be used with sequence data for gene extraction)
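
One way to prepare coordinates for sequence extraction is to convert GFF3 records to BED, which many extraction tools accept. The sketch below is an illustration on made-up records, not a BBTools command. GFF3 is 1-based and inclusive while BED is 0-based and half-open, so only the start shifts by one.

```shell
#!/bin/sh
# Small hypothetical GFF3 file with one CDS and one tRNA.
gff=$(mktemp)
printf '##gff-version 3\n' > "$gff"
printf 'NZ_EXAMPLE01.1\tRefSeq\tCDS\t101\t400\t.\t+\t0\tID=cds0\n' >> "$gff"
printf 'NZ_EXAMPLE01.1\tRefSeq\ttRNA\t500\t575\t.\t-\t.\tID=rna0\n' >> "$gff"

# Skip comment lines, keep CDS rows, and emit BED6:
# chrom, start (0-based), end, name, score, strand.
bed=$(awk -F '\t' -v OFS='\t' \
    '!/^#/ && $3 == "CDS" { print $1, $4 - 1, $5, $9, ".", $7 }' "$gff")

printf '%s\n' "$bed"
```

For the real output, replace the temporary file with a stream from zcat plasmid.genomic.gff.gz.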

Comparative Genomics

# Remove duplicate plasmid sequences
dedupe.sh in=plasmid.genomic.fna.gz out=plasmid.unique.fa.gz

# Cluster similar plasmids
cluster.sh in=plasmid.genomic.fna.gz out=plasmid.clusters.fa.gz

Database Preparation for Classification

# Build a BBMap index of the plasmid sequences for later read mapping
bbmap.sh ref=plasmid.genomic.fna.gz

# Create k-mer index for rapid screening
sketch.sh in=plasmid.genomic.fna.gz out=plasmid_kmers.sketch size=10000

File Format Details

GenBank Flat File Format (GBFF)

Comprehensive genomic annotation format developed by NCBI:

Structure

LOCUS line: Sequence name, length, molecule type, division
DEFINITION: Human-readable description of the sequence
ACCESSION: Unique database identifier
FEATURES table: Detailed annotation of genomic features
ORIGIN section: Complete nucleotide sequence

Feature Types

Common plasmid features found in the GBFF files:

gene: Gene boundaries and basic information
CDS: Protein-coding sequences with translation details
rep_origin: Replication origin sequences
regulatory: Promoters, terminators, and control elements
mobile_element: Transposons and insertion sequences

GFF3 Format Conversion

The gbff2gff.sh conversion creates standardized GFF3 output:

Selective Feature Extraction

Only processes relevant feature types for downstream analysis:

CDS features: Protein-coding genes with complete coordinate and attribute information
tRNA features: Transfer RNA genes critical for translation machinery
rRNA features: Ribosomal RNA genes essential for protein synthesis
Quality filtering: Excludes pseudogenes and malformed annotations

GFF3 Structure

Standard nine-column format with BBTools-specific enhancements:

Seqid: Plasmid accession number matching FASTA headers
Source: "RefSeq" indicating data source
Type: Feature type (CDS, tRNA, rRNA)
Start: Feature start position (1-based)
End: Feature end position (inclusive)
Score: Quality score (if available) or "."
Strand: + or - for feature orientation
Phase: Reading frame for CDS features
Attributes: Key=value pairs with gene names, products, etc.
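
Since the Seqid column is expected to match the FASTA headers, a simple consistency check is to look for seqids in the GFF3 that have no FASTA record. The sketch below shows the idea on made-up files; the accessions are hypothetical and the second GFF3 row is deliberately orphaned.

```shell
#!/bin/sh
workdir=$(mktemp -d)

# Hypothetical FASTA with one record and GFF3 with two seqids.
printf '>NZ_EXAMPLE01.1 Hypothetical plasmid pEx1\nACGTACGT\n' > "$workdir/demo.fna"
printf 'NZ_EXAMPLE01.1\tRefSeq\tCDS\t1\t6\t.\t+\t0\tID=cds0\n' > "$workdir/demo.gff"
printf 'NZ_EXAMPLE09.1\tRefSeq\tCDS\t1\t6\t.\t+\t0\tID=cds1\n' >> "$workdir/demo.gff"

# First word of each FASTA header, without the leading ">".
awk '/^>/ { sub(/^>/, "", $1); print $1 }' "$workdir/demo.fna" | sort -u > "$workdir/fasta.ids"

# Column 1 of every non-comment GFF3 line.
awk -F '\t' '!/^#/ { print $1 }' "$workdir/demo.gff" | sort -u > "$workdir/gff.ids"

# comm -13 prints lines found only in the second file: seqids without a sequence.
orphans=$(comm -13 "$workdir/fasta.ids" "$workdir/gff.ids")
echo "seqids missing from FASTA: $orphans"
```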

Data Analysis Applications

Antimicrobial Resistance Research

Plasmid data is crucial for understanding resistance mechanisms:

Resistance gene cataloging: Comprehensive database of known resistance determinants
Genetic context analysis: Study of genes surrounding resistance elements
Horizontal transfer studies: Tracking resistance spread between bacterial species
Epidemiological surveillance: Monitoring resistance gene prevalence and distribution

Plasmid Typing and Classification

Sequence analysis enables plasmid characterization:

Incompatibility grouping: Classification based on replication machinery
Host range determination: Prediction of bacterial hosts based on sequence features
Evolutionary analysis: Phylogenetic studies of plasmid backbone sequences
Mobility assessment: Identification of transfer and mobilization genes

Functional Genomics

Annotation data supports functional studies:

Gene content analysis: Systematic cataloging of plasmid-encoded functions
Operon structure: Organization of functionally related genes
Regulatory element mapping: Promoters, terminators, and control sequences
Comparative genomics: Cross-plasmid function and structure comparison

Technical Implementation

Download Methodology

Efficient bulk data acquisition from NCBI:

FTP Wildcard Processing

The wildcard approach (*genomic.gbff.gz) captures all plasmid files:

Automatic discovery: No need to maintain file lists manually
Complete coverage: Ensures all available RefSeq plasmids are included
Future-proof: Automatically includes new plasmids added to RefSeq
Consistent naming: Leverages NCBI's standardized file naming conventions

Stream Concatenation

Direct pipeline from download to single compressed file:

Memory efficiency: No intermediate file storage during download
Disk space optimization: Single compressed output file
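Non-atomic output: The shell creates the output file before wget starts, so a failed download can leave an empty or truncated file; verify the result before use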
Network resilience: wget handles connection issues automatically
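
A common way to keep a failed transfer from leaving a misleading output file is to write to a temporary name and rename only after verification. The sketch below is a suggested wrapper, not part of fetchPlasmid.sh; the function name fetch_to and the .partial suffix are assumptions. It is demonstrated with local commands so it can be run offline.

```shell
#!/bin/sh
# fetch_to FILE CMD [ARGS...]
# Runs CMD with stdout captured in FILE.partial. The partial file replaces
# FILE only if CMD exits successfully and the data passes gzip -t.
fetch_to() {
    dest=$1
    shift
    tmp="$dest.partial"
    if "$@" > "$tmp" && gzip -t < "$tmp" 2>/dev/null; then
        mv "$tmp" "$dest"
    else
        rm -f "$tmp"
        echo "fetch failed, $dest not written" >&2
        return 1
    fi
}

demo_dir=$(mktemp -d)

# A producer that emits valid gzip data: the destination file appears.
fetch_to "$demo_dir/ok.gz" sh -c 'printf "data\n" | gzip -c'

# A producer that emits non-gzip data: nothing is left behind.
fetch_to "$demo_dir/bad.gz" sh -c 'printf "not gzip data"' ||
    echo "second call failed as expected"

# A producer that exits with an error: nothing is left behind either.
fetch_to "$demo_dir/fail.gz" false ||
    echo "third call failed as expected"
```

With the pipeline's own command, the call would look like fetch_to plasmid.genomic.gbff.gz wget -q -O - 'ftp://ftp.ncbi.nih.gov/genomes/refseq/plasmid/*genomic.gbff.gz'. Whether wget's exit status reflects a failure on just one of several matched files is worth confirming for the installed wget version.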

Annotation Processing Algorithm

The GbffFile.java implementation parses GenBank records as follows:

Block-Structured Parsing

GenBank format is processed as hierarchical blocks:

Locus parsing: Extracts sequence identifiers and metadata
Feature block detection: Identifies annotation sections by indentation
Coordinate parsing: Handles complex location strings (joins, complements)
Attribute extraction: Processes qualifier strings for feature properties

Feature Selection Logic

Selective extraction based on biological importance:

Essential genes: CDS, tRNA, rRNA are critical for functional analysis
Quality control: Pseudogenes are excluded to avoid false annotations
Error handling: Malformed features are logged and skipped
Format validation: Ensures output conforms to GFF3 specification
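
The selection rules above can be approximated in a few lines of awk, which may help when checking why a feature is absent from the GFF3 output. This is an independent sketch under simplifying assumptions, not the GbffFile.java logic: feature keys are recognized by their five-space indentation, only single-line locations are handled, and any qualifier beginning with /pseudo marks the feature as a pseudogene. The sample record is made up.

```shell
#!/bin/sh
sample=$(mktemp)
cat > "$sample" <<'EOF'
LOCUS       NZ_EXAMPLE01            1200 bp    DNA     circular CON 01-JAN-2000
FEATURES             Location/Qualifiers
     source          1..1200
                     /organism="Hypothetical bacterium"
     gene            100..400
                     /locus_tag="EX_0001"
     CDS             100..400
                     /locus_tag="EX_0001"
                     /product="hypothetical protein"
     CDS             complement(500..800)
                     /pseudo
     tRNA            900..975
                     /product="tRNA-Ala"
ORIGIN
//
EOF

features=$(awk '
    # Print the buffered feature if it is a kept type and not a pseudogene.
    function emit() {
        if (key != "" && !pseudo && (key == "CDS" || key == "tRNA" || key == "rRNA"))
            print key, loc
        key = ""
    }
    /^FEATURES/ { in_features = 1; next }
    /^ORIGIN/   { emit(); in_features = 0; next }
    # Five spaces followed by a non-space starts a new feature.
    in_features && /^     [^ ]/ { emit(); key = $1; loc = $2; pseudo = 0; next }
    # Qualifier lines are indented further; /pseudo flags the current feature.
    in_features && /^ +\/pseudo/ { pseudo = 1 }
    END { emit() }
' "$sample")

printf '%s\n' "$features"
```

The source and gene features are dropped because of their type, and the second CDS is dropped because of its /pseudo qualifier.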

Performance Optimizations

Stream processing: No full file loading; the converter reads the compressed GenBank file as a stream once the download has finished
Compression maintenance: Keeps files compressed throughout pipeline
Minimal memory footprint: 1GB default memory allocation for conversion
Error resilience: Continues processing despite individual feature parsing failures

Troubleshooting

Download Issues

Network timeouts: NCBI FTP servers may experience high load
Incomplete downloads: Check file sizes and retry if necessary
Permission errors: Ensure write permissions in current directory
Disk space: Monitor available space during large downloads

Conversion Problems

Java memory errors: Increase -Xmx setting in gbff2gff.sh if needed
Malformed GenBank: Some entries may fail to parse; conversion skips them and continues
Empty GFF output: Check that GenBank file contains supported feature types
Encoding issues: Ensure proper character encoding for international organism names

File Verification

# Preview the start of the downloaded file (gzip -t checks integrity of the whole archive)
zcat plasmid.genomic.gbff.gz | head -n 20

# Verify GFF3 conversion results
zcat plasmid.genomic.gff.gz | head -n 10

# Count features by type
zcat plasmid.genomic.gff.gz | grep -v "^#" | cut -f3 | sort | uniq -c

Advanced Usage

Selective Downloads

For targeted analysis, modify the wget commands:

# Download specific taxonomic groups (example: Enterobacteriaceae)
# The subdirectory below is illustrative; check the FTP directory layout before use
wget -q -O - ftp://ftp.ncbi.nih.gov/genomes/refseq/plasmid/bacteria/Enterobacteriaceae/*genomic.gbff.gz > enterobac_plasmids.gbff.gz

Custom Processing Workflows

Integrate with other BBTools for specialized analysis:

Resistance Gene Screening

# After running fetchPlasmid.sh:

# Extract CDS sequences for resistance gene annotation
# (Would require additional tools to extract CDS from coordinates)

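# Keep reads that share 31-mers with any RefSeq plasmid
# (screening for specific resistance genes would need a resistance-gene reference instead)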
bbduk.sh in=sample.fq.gz ref=plasmid.genomic.fna.gz k=31 maskmiddle=f outm=plasmid_matches.fq.gz

Taxonomic Assignment

# Create taxonomic database for plasmid host identification
taxonomy.sh tree=auto -Xmx8g in=plasmid.genomic.fna.gz gi=auto

# Compare environmental plasmids against the local plasmid reference
comparesketch.sh in=environmental_sample.fa.gz ref=plasmid.genomic.fna.gz

Database Maintenance

Keep plasmid data current with periodic updates:

Run the pipeline monthly or quarterly to capture new RefSeq releases
Compare new downloads with previous versions to identify additions
Update any downstream databases that depend on plasmid data
Archive previous versions for reproducibility of published analyses
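
Comparing two downloads can be reduced to comparing their sorted accession lists. The sketch below shows one way to do it with comm on two tiny made-up FASTA files; the file names, accessions, and the helper list_ids are hypothetical.

```shell
#!/bin/sh
workdir=$(mktemp -d)

# Hypothetical "previous" and "current" downloads.
printf '>NZ_EXAMPLE01.1 plasmid pEx1\nACGT\n>NZ_EXAMPLE02.1 plasmid pEx2\nGGCC\n' |
    gzip -c > "$workdir/old.fna.gz"
printf '>NZ_EXAMPLE01.1 plasmid pEx1\nACGT\n>NZ_EXAMPLE03.1 plasmid pEx3\nTTAA\n' |
    gzip -c > "$workdir/new.fna.gz"

# Print the sorted accession (first word of each header) of a gzipped FASTA file.
list_ids() {
    gzip -dc < "$1" | awk '/^>/ { sub(/^>/, "", $1); print $1 }' | sort
}

list_ids "$workdir/old.fna.gz" > "$workdir/old.ids"
list_ids "$workdir/new.fna.gz" > "$workdir/new.ids"

# comm -13: only in the new list; comm -23: only in the old list.
added=$(comm -13 "$workdir/old.ids" "$workdir/new.ids")
removed=$(comm -23 "$workdir/old.ids" "$workdir/new.ids")

echo "added:   $added"
echo "removed: $removed"
```

Because the accessions include version suffixes, a record whose sequence was revised shows up as one removal plus one addition.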

Notes and Considerations

Data completeness: RefSeq plasmid collection represents well-characterized sequences but may not include all known plasmids
Annotation quality: RefSeq annotations undergo quality control but may vary in completeness between entries
File size growth: Plasmid databases grow continuously as new sequences are submitted
Network dependency: Pipeline requires internet access to NCBI servers
Compression efficiency: Files remain compressed to minimize storage requirements
Format compatibility: Generated files work with standard genomic analysis software
Update frequency: NCBI updates RefSeq regularly; consider periodic re-downloads
Licensing: NCBI data is public domain, freely usable for research

Related Tools and Pipelines

BBTools Utilities

gbff2gff.sh: GenBank to GFF3 conversion (used in this pipeline)
sketch.sh: Create k-mer databases for rapid sequence identification
sendsketch.sh: Query sequences against sketch databases
bbmap.sh: Align sequences to reference databases
filterbyname.sh: Extract sequences based on identifiers

Related Fetch Pipelines

fetchRefSeq.sh: Download complete RefSeq genomes
fetchMito.sh: Mitochondrial genome download
fetchPlastid.sh: Chloroplast genome download
fetchTaxonomy.sh: NCBI taxonomic database download
fetchSilva.sh: SILVA ribosomal RNA database preparation

External Dependencies

wget: File transfer utility (standard on most Unix systems)
Java: Required for gbff2gff conversion tool
gzip/zcat: Compression utilities for file manipulation