GBFF2GFF

Script: gbff2gff.sh
Package: gff
Class: GbffFile.java

Generates a GFF3 file from a GBFF file. Only a fixed set of feature types, chosen by the author, is converted.

Basic Usage

gbff2gff.sh <gbff file> <gff file>

This tool converts GenBank flat file format (GBFF) to GFF3 format, extracting only selected features that are commonly of interest for genome annotation workflows.

Parameters

This tool uses positional arguments and does not have traditional parameter flags.

Positional Arguments

<gbff file>
Input GenBank flat file (GBFF format). The file can contain one or more locus records with features to be extracted.
<gff file>
Output GFF3 file path. If not specified, defaults to "stdout.gff", which BBTools treats as standard output in GFF format.

Memory Configuration

-Xmx1g
Default maximum memory allocation. Can be overridden using standard BBTools memory configuration options.

Examples

Basic Conversion

gbff2gff.sh genome.gbff genome.gff

Converts a GenBank flat file to GFF3 format, extracting annotated features.

Output to Standard Output

gbff2gff.sh genome.gbff

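Converts to GFF3 and writes the result to standard output. The default output name "stdout.gff" is the BBTools convention for the standard output stream in GFF format, not a file in the current directory.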

Processing Multiple Genomes

gbff2gff.sh bacterial_genome.gbff bacterial_annotation.gff
gbff2gff.sh viral_genome.gbff viral_annotation.gff

Process multiple genome files separately to create individual GFF3 annotation files.
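When many genomes need converting, the per-file commands above can be generated rather than typed by hand. The following is a minimal Python sketch, assuming gbff2gff.sh is on the PATH and that each output should sit next to its input with a .gff extension. The helper names build_commands and convert_all are hypothetical and not part of BBTools.

```python
import subprocess
import sys
from pathlib import Path


def build_commands(gbff_paths, script="gbff2gff.sh"):
    """Return one [script, input, output] argument list per GBFF file."""
    commands = []
    for path in map(Path, gbff_paths):
        # Assumption: name the output after the input, e.g. genome.gbff -> genome.gff
        out_path = path.with_suffix(".gff")
        commands.append([script, str(path), str(out_path)])
    return commands


def convert_all(gbff_paths):
    """Run the conversions one after another, stopping at the first failure."""
    for command in build_commands(gbff_paths):
        subprocess.run(command, check=True)


if __name__ == "__main__":
    # Example: python convert_all.py bacterial_genome.gbff viral_genome.gbff
    convert_all(sys.argv[1:])
```

Each file is still converted by its own gbff2gff.sh call, so a failure in one genome does not corrupt the output of another.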

Algorithm Details

The gbff2gff tool is a streaming GenBank parser built on the GbffFile.java class, which reads the input line by line through ByteFile and processes LOCUS records sequentially. Parsing is split across three classes: GbffFile for file-level operations, GbffLocus for block-based parsing of each record, and GbffFeature for coordinate extraction.
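The record-at-a-time idea can be illustrated with a short Python sketch. This is not the Java implementation. It only relies on two facts about the GenBank flat file format: each record starts with a LOCUS line and ends with a line beginning with "//". The function name iter_locus_records and the sample text are made up for illustration.

```python
import io


def iter_locus_records(lines):
    """Yield each GenBank record as a list of lines, one record at a time."""
    record = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line.startswith("LOCUS"):
            if record:
                # Previous record had no "//" terminator; emit it anyway.
                yield record
            record = [line]
        elif record:
            if line.startswith("//"):
                # End of the current record; the terminator itself is dropped.
                yield record
                record = []
            else:
                record.append(line)
    if record:
        yield record


# Simplified, hypothetical input with two records.
sample = (
    "LOCUS       SEQ1  100 bp\n"
    "FEATURES             Location/Qualifiers\n"
    "//\n"
    "LOCUS       SEQ2  50 bp\n"
    "//\n"
)
names = [rec[0].split()[1] for rec in iter_locus_records(io.StringIO(sample))]
print(names)  # ['SEQ1', 'SEQ2']
```

Because the generator holds only one record in memory at a time, the same pattern works for multi-record files of any size.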

Parsing Strategy

The tool employs a multi-class parsing implementation:

Supported Feature Types

The tool extracts features based on the hardcoded GbffFeature.typeStrings array:

Feature Quality Control

The algorithm implements several quality control measures:

GFF3 Output Format

The generated GFF3 files include:

Performance Characteristics

The tool is optimized for:

Use Cases

This tool is particularly useful for:

Output Format

The generated GFF3 files follow standard specifications with BBTools-specific enhancements:

Header Structure

##gff-version 3
#BBTools [version] GbffToGff
#seqid	source	type	start	end	score	strand	phase	attributes

Feature Record Format

Each feature line contains nine tab-separated fields:

  1. seqid: Sequence identifier (accession number from GBFF)
  2. source: Always set to '.' (not specified)
  3. type: Feature type (CDS, rRNA, tRNA, gene, etc.)
  4. start: Start coordinate (1-based, inclusive)
  5. end: End coordinate (1-based, inclusive)
  6. score: Always set to '.' (not calculated)
  7. strand: '+' or '-' for forward/reverse strand
  8. phase: Always set to '.' (not calculated)
  9. attributes: Key-value pairs (product, locus_tag, subtype)
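For downstream scripts, the nine fields listed above can be read back with a few lines of Python. The sketch below assumes the attributes column uses the standard GFF3 "key=value;key=value" layout with percent-encoding; the exact attribute syntax written by gbff2gff.sh is not stated here, so treat that as an assumption. The function name parse_gff_line and the sample record (including its accession and locus tag) are hypothetical.

```python
from typing import NamedTuple
from urllib.parse import unquote


class GffFeature(NamedTuple):
    seqid: str
    source: str
    type: str
    start: int  # 1-based, inclusive
    end: int    # 1-based, inclusive
    score: str
    strand: str
    phase: str
    attributes: dict


def parse_gff_line(line):
    """Parse one GFF3 feature line; return None for blank or '#' lines."""
    line = line.rstrip("\r\n")
    if not line or line.startswith("#"):
        return None  # header, comment, or empty line
    fields = line.split("\t")
    if len(fields) != 9:
        raise ValueError(f"expected 9 tab-separated fields, got {len(fields)}")
    attributes = {}
    for pair in fields[8].split(";"):
        # Assumption: standard GFF3 key=value pairs separated by ';'
        if "=" in pair:
            key, value = pair.split("=", 1)
            attributes[key] = unquote(value)
    return GffFeature(fields[0], fields[1], fields[2],
                      int(fields[3]), int(fields[4]),
                      fields[5], fields[6], fields[7], attributes)


# Hypothetical record in the layout described above.
example = "NC_000000.1\t.\tCDS\t190\t255\t.\t+\t.\tproduct=hypothetical protein;locus_tag=ABC_0001"
feature = parse_gff_line(example)
length = feature.end - feature.start + 1  # both ends are inclusive
```

Since both coordinates are inclusive, the feature length is end - start + 1, which is 66 for the example record.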

Attribute Details

Technical Notes

File Format Support

Coordinate System

Limitations

Support

For questions and support: