GBFF2GFF

Basic Usage

gbff2gff.sh <gbff file> <gff file>

This tool converts GenBank flat file format (GBFF) to GFF3 format. It extracts only a fixed set of feature types commonly used in genome annotation workflows (CDS, rRNA, tRNA, gene, ncRNA, and UTR features) and ignores everything else.

Parameters

This tool takes positional arguments only and has no tool-specific option flags.

Positional Arguments

<gbff file>: Input GenBank flat file (GBFF format). The file can contain one or more locus records with features to be extracted.
<gff file>: Output GFF3 file path. If not specified, defaults to "stdout.gff", which BBTools treats as standard output rather than a file on disk.

Memory Configuration

-Xmx1g: Default maximum memory allocation. Can be overridden using standard BBTools memory configuration options.

Examples

Basic Conversion

gbff2gff.sh genome.gbff genome.gff

Converts a GenBank flat file to GFF3 format, extracting annotated features.

Output to Standard Output

gbff2gff.sh genome.gbff

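Converts to GFF3 and writes the result to standard output. The default output name stdout.gff is interpreted by BBTools as the standard output stream, so redirect or pipe the output as needed.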

Processing Multiple Genomes

gbff2gff.sh bacterial_genome.gbff bacterial_annotation.gff
gbff2gff.sh viral_genome.gbff viral_annotation.gff

Each invocation converts one input file, so run the tool once per genome to produce separate GFF3 annotation files.
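For larger batches, the per-genome invocations can be generated by a small wrapper. The Python sketch below assumes gbff2gff.sh is on the PATH; the helper names (output_path_for, build_commands, convert_all) and the output naming rule are illustrative choices, not part of BBTools.

```python
import subprocess
from pathlib import Path


def output_path_for(gbff_path):
    """Derive an output name by swapping the final suffix for .gff (a naming assumption)."""
    return str(Path(gbff_path).with_suffix(".gff"))


def build_commands(gbff_paths):
    """Return one [script, input, output] argument list per input file."""
    return [["gbff2gff.sh", p, output_path_for(p)] for p in gbff_paths]


def convert_all(gbff_paths):
    """Run each conversion in turn; check=True raises if a run exits non-zero."""
    for cmd in build_commands(gbff_paths):
        subprocess.run(cmd, check=True)
```

Calling convert_all(["bacterial_genome.gbff", "viral_genome.gbff"]) would run the tool twice, once per file. Building the commands separately from running them makes it easy to print and inspect them first.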

Algorithm Details

The gbff2gff tool implements a streaming GenBank parser using the GbffFile.java class, which provides sequential processing of LOCUS records through ByteFile line-by-line reading. The parser uses a three-class architecture: GbffFile for file-level operations, GbffLocus for block-based parsing, and GbffFeature for coordinate extraction.
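The streaming idea can be illustrated in a few lines of Python. This is a simplified sketch and not the GbffFile implementation: iter_locus_blocks is a hypothetical name, and the code assumes that each record begins with a line starting with LOCUS and ends with a // line, as in standard GenBank flat files.

```python
def iter_locus_blocks(lines):
    """Yield one list of lines per LOCUS record, holding only one record at a time."""
    block = []
    for raw in lines:
        line = raw.rstrip("\n")
        if line.startswith("LOCUS") and block:
            # A new record started without a terminator; emit the previous one.
            yield block
            block = []
        if line.startswith("//"):
            # Record terminator: emit the finished record and reset.
            if block:
                yield block
            block = []
            continue
        if block or line.startswith("LOCUS"):
            # Lines before the first LOCUS line are ignored.
            block.append(line)
    if block:
        yield block
```

Because the function is a generator over an iterable of lines, it can be fed an open file handle directly and never needs the whole file in memory.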

Parsing Strategy

Parsing responsibilities are divided among the classes as follows:

Sequential LOCUS processing: GbffFile.nextLocus() method processes each LOCUS record independently using Tools.startsWith() for block detection
Block-structured parsing: GbffLocus.parseBlock() method uses specific parsing methods (parseLocus(), parseDefinition(), parseAccession(), parseVersion(), parseFeatures()) for standard GenBank sections
Feature-selective extraction: GbffLocus.parseFeatures() filters features using Tools.find() against the predefined featureTypes array from GbffFeature.typeStrings
Coordinate preservation: GbffFeature.parseStartStop() method extracts exact genomic coordinates using digit-by-digit parsing with Tools.isDigit()
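To illustrate digit-by-digit coordinate extraction, the Python sketch below scans a GenBank location string for runs of digits and reports the smallest and largest values as start and stop. It is not a port of GbffFeature.parseStartStop(): the function name is hypothetical, and collapsing join() locations to their outer bounds is an assumption made for the example (it would misreport features that wrap around the origin of a circular genome).

```python
def parse_start_stop(location):
    """Return (start, stop, strand) from a GenBank location string."""
    strand = "-" if "complement(" in location else "+"
    numbers = []
    current = None
    for ch in location:
        if "0" <= ch <= "9":
            # Accumulate the number one digit at a time.
            current = (current or 0) * 10 + (ord(ch) - ord("0"))
        elif current is not None:
            # A non-digit ends the current number.
            numbers.append(current)
            current = None
    if current is not None:
        numbers.append(current)
    if not numbers:
        raise ValueError("no coordinates in location: " + location)
    return min(numbers), max(numbers), strand
```

Partial-location markers such as < and > are skipped because they are not digits, so "<1..>50" is read as the range 1 to 50.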

Supported Feature Types

The tool extracts features based on the hardcoded GbffFeature.typeStrings array:

CDS: Protein-coding sequences processed via GbffFeature constructor with product and locus_tag extraction
rRNA: Ribosomal RNA genes with automatic subtype classification using setSubtype() method for 5S, 16S, 23S detection
tRNA: Transfer RNA genes with amino acid specificity preserved from GenBank qualifiers
gene: Gene boundaries with pseudogene detection using pseudo boolean flag
ncRNA: Non-coding RNA features processed through the same coordinate extraction pipeline
UTR regions: 5'UTR and 3'UTR converted to GFF3 standard names (five_prime_UTR, three_prime_UTR) via typeStringsGff array
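The type handling described above amounts to a whitelist lookup plus a rename for the UTR types. The following Python sketch is a hypothetical rendering: the two tuples mirror the feature list in this section, but the exact contents and order of GbffFeature.typeStrings and typeStringsGff are not confirmed here, and detecting the rRNA subtype by looking for a token in the product text is an assumption.

```python
# Parallel tuples: position i in GENBANK_TYPES maps to position i in GFF_TYPES.
GENBANK_TYPES = ("gene", "CDS", "rRNA", "tRNA", "ncRNA", "5'UTR", "3'UTR")
GFF_TYPES = ("gene", "CDS", "rRNA", "tRNA", "ncRNA", "five_prime_UTR", "three_prime_UTR")


def gff_type(genbank_type):
    """Return the GFF3 type name, or None if the type is not on the whitelist."""
    try:
        return GFF_TYPES[GENBANK_TYPES.index(genbank_type)]
    except ValueError:
        return None


def rrna_subtype(product):
    """Guess the rRNA subtype from a product string such as '16S ribosomal RNA'."""
    tokens = product.split()
    for name in ("5S", "16S", "23S"):
        # Whole-token match, so that '25S' is not mistaken for '5S'.
        if name in tokens:
            return name
    return None
```

A feature whose type returns None from gff_type would simply be skipped, which is how a whitelist keeps unsupported feature types out of the output.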

Feature Quality Control

The algorithm implements several quality control measures:

Coordinate validation: Features with invalid coordinates (stop < start) are automatically excluded
Pseudogene handling: Pseudogenes are properly classified and marked in the GFF3 output
Strand detection: Recognizes the complement() and join() location operators so that the strand is assigned correctly
Error recovery: Malformed features are logged but don't halt processing of valid features
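A minimal Python sketch of the coordinate check and the keep-going behavior, assuming features have already been parsed into simple records. The Feature tuple and its field names are illustrative, and counting the skipped features stands in for whatever logging the tool actually performs.

```python
from collections import namedtuple

# Illustrative record type; field names are assumptions, not GbffFeature fields.
Feature = namedtuple("Feature", "type start stop strand pseudo")


def filter_valid(features):
    """Return (kept_features, skipped_count), dropping features with stop < start."""
    kept, skipped = [], 0
    for f in features:
        if f.stop < f.start:
            # Invalid coordinates: exclude this feature but keep processing.
            skipped += 1
            continue
        kept.append(f)
    return kept, skipped
```

The pseudo flag is carried through untouched, so pseudogenes that pass the coordinate check remain available for marking in the output.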

GFF3 Output Format

The generated GFF3 files include:

Standard header: GFF version declaration and BBTools version information
Sequence regions: ##sequence-region declarations for each processed locus
Feature records: Tab-delimited feature annotations with attributes (product, locus_tag, subtype)
Attribute preservation: Key biological attributes (product names, locus tags) are maintained from the original GBFF

Performance Characteristics

The tool is optimized for:

Memory efficiency: The default 1 GB heap (-Xmx1g) is sufficient for most bacterial and viral genomes
Streaming processing: Processes large GBFF files without loading entire file into memory
Selective parsing: Skips non-essential blocks (ORIGIN sequences, REFERENCE data) to improve speed
Error tolerance: Continues processing even when individual features contain parsing errors

Use Cases

This tool is particularly useful for:

Converting NCBI RefSeq annotations to GFF3 for genome browsers
Preparing annotation data for comparative genomics pipelines
Extracting specific feature types from complex GenBank records
Quality control of genome annotations through format conversion

Output Format

The generated GFF3 files follow the GFF3 specification, with two BBTools-specific comment lines in the header: a tool version line and a column header line.

Header Structure

##gff-version 3
#BBTools [version] GbffToGff
#seqid	source	type	start	end	score	strand	phase	attributes

Feature Record Format

Each feature line contains nine tab-separated fields:

seqid: Sequence identifier (accession number from GBFF)
source: Always set to '.' (not specified)
type: Feature type (CDS, rRNA, tRNA, gene, etc.)
start: Start coordinate (1-based, inclusive)
end: End coordinate (1-based, inclusive)
score: Always set to '.' (not calculated)
strand: '+' or '-' for forward/reverse strand
phase: Always set to '.' (not calculated)
attributes: Key-value pairs (product, locus_tag, subtype)
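The nine-column layout can be sketched in Python as follows. The function name gff3_line is hypothetical, and the attribute encoding (key=value pairs joined by semicolons) follows the general GFF3 convention rather than confirmed gbff2gff output; escaping of reserved characters is not handled.

```python
def gff3_line(seqid, ftype, start, end, strand, attributes):
    """Build one nine-column GFF3 record; source, score, and phase are '.'."""
    # Skip attributes with empty values; fall back to '.' if none remain.
    attr_text = ";".join("%s=%s" % (k, v) for k, v in attributes.items() if v)
    fields = [seqid, ".", ftype, str(start), str(end), ".", strand, ".", attr_text or "."]
    return "\t".join(fields)
```

For example, gff3_line("NC_000913.3", "CDS", 190, 255, "+", {"product": "thr operon leader peptide", "locus_tag": "b0001"}) produces a single tab-separated line whose last column is product=thr operon leader peptide;locus_tag=b0001.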

Attribute Details

product: Protein or RNA product name from the original annotation
locus_tag: Systematic gene identifier when available
subtype: For rRNA genes, specifies the rRNA species (5S, 16S, 23S)
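Downstream scripts often need these attributes as a dictionary. The Python sketch below parses the ninth column under the assumption that it uses standard GFF3 key=value pairs separated by semicolons; parse_attributes is a hypothetical helper, and percent-decoding of escaped characters is omitted.

```python
def parse_attributes(column9):
    """Split a GFF3 attributes column into a dict; '.' or empty means no attributes."""
    if column9 in ("", "."):
        return {}
    result = {}
    for pair in column9.split(";"):
        if not pair:
            continue  # tolerate a trailing semicolon
        key, _, value = pair.partition("=")
        result[key.strip()] = value.strip()
    return result
```

With this helper, selecting all 16S rRNA features reduces to checking parse_attributes(fields[8]).get("subtype") == "16S" on each record.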

Technical Notes

File Format Support

Input files must be in standard GenBank flat file format (.gbff or .gb)
Compressed input files (.gz) are automatically detected and supported
Multi-locus files (multiple LOCUS records) are fully supported

Coordinate System

GenBank uses 1-based coordinates, which are preserved in GFF3 output
Feature coordinates include both endpoints (inclusive)
Strand information is automatically inferred from complement() annotations
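Tools that use 0-based, half-open intervals (BED files, Python slicing) need a small conversion from these 1-based inclusive coordinates. The helper names in this Python sketch are illustrative.

```python
def to_zero_based_half_open(start, end):
    """Convert 1-based inclusive (start, end) to 0-based half-open (start - 1, end)."""
    if start < 1 or end < start:
        raise ValueError("expected 1 <= start <= end")
    return start - 1, end


def feature_length(start, end):
    """Length of a 1-based inclusive interval; both endpoints count."""
    return end - start + 1
```

A feature at 190..255 therefore covers 66 bases and corresponds to the Python slice sequence[189:255].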

Limitations

Only processes feature types defined in the internal whitelist
Does not extract sequence data (ORIGIN sections are skipped)
Complex feature relationships (gene-CDS hierarchies) are flattened
GenBank qualifiers other than those mapped to output attributes (such as product and locus_tag) may be omitted

Support

For questions and support:

Email: bbushnell@lbl.gov
Documentation: bbmap.org