Fetch Plasmid Pipeline

Script: fetchPlasmid.sh
Source Directory: pipelines/fetch/
Author: Brian Bushnell

Streamlined pipeline for downloading NCBI RefSeq plasmid genomes in both GenBank flat file format (GBFF) and FASTA nucleotide format (FNA), then converting the annotations to GFF3 format for downstream analysis and annotation studies.

Overview

The Fetch Plasmid pipeline provides a simple, automated approach to obtaining comprehensive plasmid sequence data from NCBI's RefSeq database. Plasmids are self-replicating genetic elements found primarily in bacteria and archaea that often carry important genes for antibiotic resistance, virulence factors, metabolic pathways, and horizontal gene transfer mechanisms.

This pipeline downloads all available RefSeq plasmid genomes in two formats: the complete genomic sequences (FASTA) for sequence analysis, and the annotated GenBank files (GBFF) for feature annotation studies. The GenBank annotations are then converted to the more widely used GFF3 format for compatibility with modern genomic analysis tools.

Note: This pipeline downloads all RefSeq plasmid data, which can be substantial in size (hundreds of MB to several GB). Ensure adequate network bandwidth and disk space before running.

Prerequisites

System Requirements

Network Requirements

Pipeline Stages

1. GenBank Annotation Download

wget -q -O - ftp://ftp.ncbi.nih.gov/genomes/refseq/plasmid/*genomic.gbff.gz > plasmid.genomic.gbff.gz

Downloads all RefSeq plasmid genomes in GenBank flat file format, which contains:

2. Nucleotide Sequence Download

wget -q -O - ftp://ftp.ncbi.nih.gov/genomes/refseq/plasmid/*genomic.fna.gz > plasmid.genomic.fna.gz

Downloads the same plasmid genomes in FASTA nucleotide format for sequence-only analysis:

3. Annotation Format Conversion

gbff2gff.sh plasmid.genomic.gbff.gz plasmid.genomic.gff.gz

Converts GenBank flat file annotations to GFF3 format using BBTools' specialized parser:

Basic Usage

# 1. Navigate to a working directory with sufficient space
cd /path/to/working/directory

# 2. Run the pipeline
bash fetchPlasmid.sh

# 3. Check the resulting files and their sizes
ls -lh plasmid.genomic.*

Warning: The pipeline downloads all RefSeq plasmid data, which can be several GB compressed. Ensure adequate disk space and allow time for the download to complete.

Output Files

Primary Downloads

Processed Annotations

File Content Details

GenBank Format (GBFF)

Contains comprehensive genomic information structured in blocks:

FASTA Format (FNA)

Simple sequence format with minimal headers:
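Because each FASTA record is a header line beginning with ">" followed by one or more sequence lines, a quick summary of the download can be computed with standard tools. The sketch below is illustrative only: fasta_summary is a hypothetical helper name, not part of BBTools, and it uses gzip -dc in place of zcat for portability.

```shell
#!/usr/bin/env bash
# Sketch: count records and total bases in a gzipped FASTA file.
# fasta_summary is a hypothetical helper, not a BBTools command.
fasta_summary() {
    local fasta_gz="$1"
    gzip -dc "$fasta_gz" | awk '
        /^>/ { records++; next }      # a header line starts a new record
        { bases += length($0) }       # every other line is sequence
        END { printf "records=%d bases=%d\n", records, bases }
    '
}
```

Usage, assuming the pipeline has already run in the current directory: fasta_summary plasmid.genomic.fna.gz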

GFF3 Format

Standardized annotation format with nine tab-delimited columns:

Algorithm Details

Download Strategy

The pipeline uses efficient batch downloading from NCBI's FTP servers:

Parallel Format Acquisition

Simultaneous download of complementary data formats:

Format Complementarity

The two formats serve different analytical purposes:

GenBank to GFF3 Conversion Algorithm

The gbff2gff.sh tool implements selective feature extraction:

Parsing Strategy

Block-based parsing of GenBank format:

Quality Control

Built-in validation during conversion:

Performance Characteristics

Plasmid Biology Context

Plasmid Characteristics

Plasmids are autonomous genetic elements with important biological roles:

Research Applications

Plasmid sequence analysis supports diverse research areas:

Antibiotic Resistance Studies

Virulence Factor Analysis

Metabolic Pathway Studies

Integration with BBTools Workflow

Using Downloaded Plasmid Data

The generated files can be integrated into various BBTools analyses:

Sequence Analysis

# Create sketch database for plasmid identification
sketch.sh in=plasmid.genomic.fna.gz out=plasmid.sketch

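# Compare query sequences against the local plasmid sketch
# (sendsketch.sh queries a remote sketch server; comparesketch.sh works with local references)
comparesketch.sh in=query.fq.gz ref=plasmid.sketch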

# Map reads to plasmid sequences
bbmap.sh in=reads.fq.gz ref=plasmid.genomic.fna.gz

Annotation Analysis

# Extract specific feature types from GFF3
zcat plasmid.genomic.gff.gz | grep "CDS"

# Combine with sequence data for feature extraction
# (GFF3 coordinates can be used with sequence data for gene extraction)
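
A plain text search for a feature name can also match attribute text, so matching exactly on the third GFF3 column is a tighter filter. The sketch below converts the selected features to BED-style coordinates (0-based start, end unchanged), which external tools such as bedtools getfasta accept for sequence extraction. gff_features_to_bed is a hypothetical helper name, and the six-column BED layout with a placeholder score of 0 is an assumption made for illustration.

```shell
#!/usr/bin/env bash
# Sketch: select one feature type from a gzipped GFF3 file and print BED6 lines.
# gff_features_to_bed is a hypothetical helper, not a BBTools command.
gff_features_to_bed() {
    local gff_gz="$1" type="$2"
    gzip -dc "$gff_gz" | awk -F'\t' -v OFS='\t' -v type="$type" '
        /^#/ { next }                                   # skip GFF3 directives and comments
        $3 == type { print $1, $4 - 1, $5, $3, 0, $7 }  # GFF3 is 1-based, BED start is 0-based
    '
}
```

Usage: gff_features_to_bed plasmid.genomic.gff.gz CDS > plasmid.cds.bed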

Comparative Genomics

# Remove duplicate plasmid sequences to build a non-redundant set
dedupe.sh in=plasmid.genomic.fna.gz out=plasmid.unique.fa.gz

# Cluster similar plasmids
cluster.sh in=plasmid.genomic.fna.gz out=plasmid.clusters.fa.gz

Database Preparation for Classification

# Build a BBMap index of the plasmid sequences for later read mapping
bbmap.sh ref=plasmid.genomic.fna.gz

# Create k-mer index for rapid screening
sketch.sh in=plasmid.genomic.fna.gz out=plasmid_kmers.sketch size=10000

File Format Details

GenBank Flat File Format (GBFF)

Comprehensive genomic annotation format developed by NCBI:

Structure

Feature Types

Common plasmid features found in the GBFF files:

GFF3 Format Conversion

The gbff2gff.sh conversion creates standardized GFF3 output:

Selective Feature Extraction

Only processes relevant feature types for downstream analysis:

GFF3 Structure

Standard nine-column format with BBTools-specific enhancements:

  1. Seqid: Plasmid accession number matching FASTA headers
  2. Source: "RefSeq" indicating data source
  3. Type: Feature type (CDS, tRNA, rRNA)
  4. Start: Feature start position (1-based)
  5. End: Feature end position (inclusive)
  6. Score: Quality score (if available) or "."
  7. Strand: + or - for feature orientation
  8. Phase: For CDS features, the number of bases (0, 1, or 2) to skip to reach the first complete codon; "." otherwise
  9. Attributes: Key=value pairs with gene names, products, etc.
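
The column layout above can be checked mechanically before the file is passed to downstream tools. The following sketch verifies that every feature line has nine tab-delimited columns and that start does not exceed end. gff_check_columns is a hypothetical helper name; the checks are a minimal illustration and do not cover the full GFF3 specification.

```shell
#!/usr/bin/env bash
# Sketch: basic structural check of a gzipped GFF3 file.
# gff_check_columns is a hypothetical helper, not a BBTools command.
# Returns 0 if all feature lines pass, 1 otherwise; problems go to stderr.
gff_check_columns() {
    local gff_gz="$1"
    gzip -dc "$gff_gz" | awk -F'\t' '
        /^#/ || NF == 0 { next }      # skip directives, comments, blank lines
        NF != 9 {
            bad++
            print "line " NR ": expected 9 columns, found " NF > "/dev/stderr"
            next
        }
        ($4 + 0) > ($5 + 0) {
            bad++
            print "line " NR ": start is greater than end" > "/dev/stderr"
        }
        END { exit (bad > 0 ? 1 : 0) }
    '
}
```

Usage: gff_check_columns plasmid.genomic.gff.gz && echo "GFF3 structure looks consistent"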

Data Analysis Applications

Antimicrobial Resistance Research

Plasmid data is crucial for understanding resistance mechanisms:

Plasmid Typing and Classification

Sequence analysis enables plasmid characterization:

Functional Genomics

Annotation data supports functional studies:

Technical Implementation

Download Methodology

Efficient bulk data acquisition from NCBI:

FTP Wildcard Processing

The wildcard approach (*genomic.gbff.gz) captures all plasmid files:

Stream Concatenation

Direct pipeline from download to single compressed file:
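
If the wildcard matches more than one remote file, the redirected output consists of several gzip members written back to back. The gzip format permits this: concatenated members decompress as a single continuous stream. The small local demonstration below uses throwaway files in a temporary directory; the file names are arbitrary.

```shell
#!/usr/bin/env bash
# Sketch: concatenated gzip files form one valid gzip stream.
workdir=$(mktemp -d)

# Two independently compressed pieces
printf 'record one\n' | gzip > "$workdir/part1.gz"
printf 'record two\n' | gzip > "$workdir/part2.gz"

# Byte-level concatenation, analogous to redirecting several downloads into one file
cat "$workdir/part1.gz" "$workdir/part2.gz" > "$workdir/combined.gz"

# The combined file passes an integrity test and decompresses to both records in order
gzip -t "$workdir/combined.gz" && echo "combined.gz is a valid gzip stream"
combined_text=$(gzip -dc "$workdir/combined.gz")
printf '%s\n' "$combined_text"
```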

Annotation Processing Algorithm

The GbffFile.java implementation uses sophisticated parsing:

Block-Structured Parsing

GenBank format is processed as hierarchical blocks:

Feature Selection Logic

Selective extraction based on biological importance:

Performance Optimizations

Troubleshooting

Download Issues

Conversion Problems

File Verification

# Check downloaded file integrity
zcat plasmid.genomic.gbff.gz | head -n 20

# Verify GFF3 conversion results
zcat plasmid.genomic.gff.gz | head -n 10

# Count features by type
zcat plasmid.genomic.gff.gz | grep -v "^#" | cut -f3 | sort | uniq -c
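
Because wget runs with -q and its output is redirected, a failed download can leave an empty or truncated file without any message. A check along the following lines can catch that before conversion. check_download is a hypothetical helper name used for illustration.

```shell
#!/usr/bin/env bash
# Sketch: confirm a downloaded file exists, is non-empty, and is intact gzip.
# check_download is a hypothetical helper, not a BBTools command.
check_download() {
    local file="$1"
    if [ ! -s "$file" ]; then
        echo "missing or empty: $file" >&2
        return 1
    fi
    if ! gzip -t "$file" 2>/dev/null; then
        echo "corrupt or not gzip: $file" >&2
        return 1
    fi
    echo "ok: $file"
}
```

Usage: check_download plasmid.genomic.gbff.gz && check_download plasmid.genomic.fna.gz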

Advanced Usage

Selective Downloads

For targeted analysis, modify the wget commands:

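# Download specific taxonomic groups (example: Enterobacteriaceae)
# The subdirectory below is illustrative; confirm that it exists on the server before use.
# Quoting the URL leaves the wildcard for wget to expand rather than the shell.
wget -q -O - "ftp://ftp.ncbi.nih.gov/genomes/refseq/plasmid/bacteria/Enterobacteriaceae/*genomic.gbff.gz" > enterobac_plasmids.gbff.gz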
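
An alternative that does not depend on the server's directory layout is to download the full set once and filter it locally. The sketch below keeps FASTA records whose header contains a given text, on the assumption that the headers include the organism name. filter_fasta_by_header is a hypothetical helper name, and the match is a literal, case-sensitive substring test.

```shell
#!/usr/bin/env bash
# Sketch: keep FASTA records whose header line contains a literal string.
# filter_fasta_by_header is a hypothetical helper, not a BBTools command.
filter_fasta_by_header() {
    local fasta_gz="$1" pattern="$2"
    gzip -dc "$fasta_gz" | awk -v pat="$pattern" '
        /^>/ { keep = (index($0, pat) > 0) }   # decide once per record, at its header
        keep                                   # print header and sequence lines of kept records
    '
}
```

Usage: filter_fasta_by_header plasmid.genomic.fna.gz "Escherichia coli" | gzip > ecoli_plasmids.fna.gz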

Custom Processing Workflows

Integrate with other BBTools for specialized analysis:

Resistance Gene Screening

# After running fetchPlasmid.sh:

# Extract CDS sequences for resistance gene annotation
# (Would require additional tools to extract CDS from coordinates)

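# Capture reads that share 31-mers with any plasmid sequence
# (to screen for specific resistance genes, supply a FASTA of those genes as ref instead)
bbduk.sh in=sample.fq.gz ref=plasmid.genomic.fna.gz k=31 maskmiddle=f outm=plasmid_matches.fq.gz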

Taxonomic Assignment

# Look up taxonomy for the plasmid sequence headers
# (assumes the BBTools taxonomy tree and lookup tables are already installed)
taxonomy.sh tree=auto -Xmx8g in=plasmid.genomic.fna.gz gi=auto

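# Compare environmental sequences against the plasmid set
# (sendsketch.sh queries a remote sketch server; comparesketch.sh works with local references)
comparesketch.sh in=environmental_sample.fa.gz ref=plasmid.genomic.fna.gz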

Database Maintenance

Keep plasmid data current with periodic updates:

  1. Run the pipeline monthly or quarterly to capture new RefSeq releases
  2. Compare new downloads with previous versions to identify additions
  3. Update any downstream databases that depend on plasmid data
  4. Archive previous versions for reproducibility of published analyses
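
Step 2 can be sketched by comparing the accession identifiers in two downloads. The helper names below (list_accessions, new_accessions) are hypothetical, and the sketch assumes the accession is the first whitespace-delimited token of each FASTA header.

```shell
#!/usr/bin/env bash
# Sketch: report accessions present in a new download but absent from an old one.
# list_accessions and new_accessions are hypothetical helpers, not BBTools commands.

# Print the sorted, unique first token of every FASTA header (without the ">").
list_accessions() {
    gzip -dc "$1" | awk '/^>/ { sub(/^>/, "", $1); print $1 }' | LC_ALL=C sort -u
}

# Print accessions found only in the second file.
new_accessions() {
    local old_gz="$1" new_gz="$2" tmp_old tmp_new
    tmp_old=$(mktemp)
    tmp_new=$(mktemp)
    list_accessions "$old_gz" > "$tmp_old"
    list_accessions "$new_gz" > "$tmp_new"
    LC_ALL=C comm -13 "$tmp_old" "$tmp_new"   # suppress lines unique to old and lines in both
    rm -f "$tmp_old" "$tmp_new"
}
```

Usage, with an archived copy from an earlier run (the archive path is an example): new_accessions archive/plasmid.genomic.fna.gz plasmid.genomic.fna.gz > added_accessions.txt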

Notes and Considerations

Related Tools and Pipelines

BBTools Utilities

Related Fetch Pipelines

External Dependencies