GBFF2GFF
Generates a GFF3 from a GBFF. Only for features I care about though.
Basic Usage
gbff2gff.sh <gbff file> <gff file>
This tool converts GenBank flat file format (GBFF) to GFF3 format, extracting only selected features that are commonly of interest for genome annotation workflows.
Parameters
This tool uses positional arguments and does not have traditional parameter flags.
Positional Arguments
- <gbff file>: Input GenBank flat file (GBFF format). The file can contain one or more locus records with features to be extracted.
- <gff file>: Output GFF3 file path. If not specified, defaults to "stdout.gff".
Memory Configuration
- -Xmx1g: Default maximum memory allocation (1 GB). Can be overridden using standard BBTools memory configuration options.
Examples
Basic Conversion
gbff2gff.sh genome.gbff genome.gff
Converts a GenBank flat file to GFF3 format, extracting annotated features.
Output to Standard Output
gbff2gff.sh genome.gbff
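Converts to GFF3 with no output path given, so the output defaults to stdout.gff. By BBTools convention, an output name beginning with "stdout" is streamed to standard output (here in GFF format) rather than written to a file on disk.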
Processing Multiple Genomes
gbff2gff.sh bacterial_genome.gbff bacterial_annotation.gff
gbff2gff.sh viral_genome.gbff viral_annotation.gff
Process multiple genome files separately to create individual GFF3 annotation files.
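When many genomes need converting, the per-file commands can be generated rather than typed. The Python sketch below only builds the command lines and does not run them. The helper name build_commands and the rule of naming each output after its input are assumptions made for illustration; only the two documented positional arguments are used.

```python
from pathlib import Path
from typing import Iterable, List


def build_commands(gbff_paths: Iterable[str]) -> List[List[str]]:
    """Build one gbff2gff.sh command per input file (hypothetical helper)."""
    commands = []
    for gbff in gbff_paths:
        # Assumption: the output is named after the input, with a .gff suffix.
        gff = str(Path(gbff).with_suffix(".gff"))
        commands.append(["gbff2gff.sh", gbff, gff])
    return commands


commands = build_commands(["bacterial_genome.gbff", "viral_genome.gbff"])
for command in commands:
    print(" ".join(command))
```

Each command list could then be passed to subprocess.run(command, check=True), provided gbff2gff.sh is on the PATH.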
Algorithm Details
The gbff2gff tool implements a streaming GenBank parser using the GbffFile.java class, which provides sequential processing of LOCUS records through ByteFile line-by-line reading. The parser uses a three-class architecture: GbffFile for file-level operations, GbffLocus for block-based parsing, and GbffFeature for coordinate extraction.
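The idea of streaming LOCUS records line by line can be illustrated with a small Python sketch. It is not the GbffFile.java code: the function name iter_locus_records and the in-memory sample are assumptions, and the only behavior taken from the description above is that a new record begins at each line starting with LOCUS.

```python
import io
from typing import Iterator, List, TextIO


def iter_locus_records(handle: TextIO) -> Iterator[List[str]]:
    """Yield one list of lines per LOCUS record, reading line by line."""
    record: List[str] = []
    for line in handle:
        line = line.rstrip("\n")
        is_locus = line.startswith("LOCUS")
        if is_locus and record:
            # A new LOCUS line closes the previous record.
            yield record
            record = []
        if is_locus or record:
            # Lines before the first LOCUS line are ignored.
            record.append(line)
    if record:
        yield record


# Hypothetical two-record input, kept tiny for illustration.
sample = io.StringIO(
    "LOCUS       DEMO_1  100 bp\n"
    "DEFINITION  first.\n"
    "//\n"
    "LOCUS       DEMO_2  200 bp\n"
    "DEFINITION  second.\n"
    "//\n"
)
records = list(iter_locus_records(sample))
```

Because the generator holds only one record at a time, memory use depends on the largest record rather than on the size of the whole file.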
Parsing Strategy
The tool employs a multi-class parsing implementation:
- Sequential LOCUS processing: GbffFile.nextLocus() method processes each LOCUS record independently using Tools.startsWith() for block detection
- Block-structured parsing: GbffLocus.parseBlock() method uses specific parsing methods (parseLocus(), parseDefinition(), parseAccession(), parseVersion(), parseFeatures()) for standard GenBank sections
- Feature-selective extraction: GbffLocus.parseFeatures() filters features using Tools.find() against the predefined featureTypes array from GbffFeature.typeStrings
- Coordinate preservation: GbffFeature.parseStartStop() method extracts exact genomic coordinates using digit-by-digit parsing with Tools.isDigit()
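The digit-by-digit coordinate extraction described in the last item can be sketched in Python as follows. This is an illustration, not the GbffFeature.parseStartStop() source: the function name, the rule of spanning from the first to the last number in the location, and the strand test are all assumptions, and remote references such as an accession prefix before the coordinates are not handled.

```python
from typing import List, Tuple


def parse_start_stop(location: str) -> Tuple[int, int, str]:
    """Return (start, stop, strand) from a GenBank location string.

    Assumption: the feature spans the first to the last number found,
    so join(10..20,30..45) is reported as 10..45.
    """
    # Assumption: any complement() wrapper means the reverse strand.
    strand = "-" if "complement(" in location else "+"
    numbers: List[int] = []
    value = 0
    in_number = False
    for ch in location:
        if "0" <= ch <= "9":
            # Accumulate the current number one digit at a time.
            value = value * 10 + (ord(ch) - ord("0"))
            in_number = True
        elif in_number:
            numbers.append(value)
            value = 0
            in_number = False
    if in_number:
        numbers.append(value)
    if not numbers:
        raise ValueError("no coordinates in location: " + location)
    return numbers[0], numbers[-1], strand


print(parse_start_stop("complement(join(10..20,30..45))"))
print(parse_start_stop("<1..>200"))
```

Partial-feature markers such as < and > are simply skipped, because only digit characters contribute to a coordinate.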
Supported Feature Types
The tool extracts features based on the hardcoded GbffFeature.typeStrings array:
- CDS: Protein-coding sequences processed via GbffFeature constructor with product and locus_tag extraction
- rRNA: Ribosomal RNA genes with automatic subtype classification using setSubtype() method for 5S, 16S, 23S detection
- tRNA: Transfer RNA genes with amino acid specificity preserved from GenBank qualifiers
- gene: Gene boundaries with pseudogene detection using pseudo boolean flag
- ncRNA: Non-coding RNA features processed through the same coordinate extraction pipeline
- UTR regions: 5'UTR and 3'UTR converted to GFF3 standard names (five_prime_UTR, three_prime_UTR) via typeStringsGff array
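A minimal Python sketch of the type whitelist and the rRNA subtype step is shown below. The dictionary mirrors the feature types listed above, but the names GFF_TYPE_NAMES, gff_type, and rrna_subtype are hypothetical, and matching the subtype against the product text with a regular expression is an assumption about how setSubtype() might work, not a description of it.

```python
import re
from typing import Optional

# GenBank feature key -> GFF3 type name, following the list above.
GFF_TYPE_NAMES = {
    "CDS": "CDS",
    "rRNA": "rRNA",
    "tRNA": "tRNA",
    "gene": "gene",
    "ncRNA": "ncRNA",
    "5'UTR": "five_prime_UTR",
    "3'UTR": "three_prime_UTR",
}

# Match 5S, 16S, or 23S only when not preceded by a digit or a dot,
# so that "25S" or "5.8S" is not mistaken for "5S" or a similar label.
_SUBTYPE = re.compile(r"(?<![0-9.])(5S|16S|23S)\b")


def gff_type(genbank_type: str) -> Optional[str]:
    """Return the GFF3 type name, or None for types outside the whitelist."""
    return GFF_TYPE_NAMES.get(genbank_type)


def rrna_subtype(product: str) -> Optional[str]:
    """Guess the rRNA subtype from a product string (assumed approach)."""
    match = _SUBTYPE.search(product)
    return match.group(1) if match else None


print(gff_type("5'UTR"), rrna_subtype("16S ribosomal RNA"))
```

Features whose type maps to None would be dropped, which corresponds to the selective extraction described above.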
Feature Quality Control
The algorithm implements several quality control measures:
- Coordinate validation: Features with invalid coordinates (stop < start) are automatically excluded
- Pseudogene handling: Pseudogenes are properly classified and marked in the GFF3 output
- Strand detection: Automatically detects complement() and join() annotations for proper strand assignment
- Error recovery: Malformed features are logged but don't halt processing of valid features
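The coordinate check and the skip-and-continue behavior can be sketched together in Python. The function name keep_valid, the tuple layout, and the use of a regular expression to find coordinates are assumptions for illustration; the two rules taken from the list above are that features with stop < start are excluded and that a malformed feature does not stop the others from being processed.

```python
import re
from typing import List, Tuple

RawFeature = Tuple[str, str]          # (feature type, location string)
Feature = Tuple[str, int, int]        # (feature type, start, stop)
Skipped = Tuple[str, str, str]        # (feature type, location, reason)


def keep_valid(raw: List[RawFeature]) -> Tuple[List[Feature], List[Skipped]]:
    """Split features into kept and skipped lists (hypothetical helper)."""
    kept: List[Feature] = []
    skipped: List[Skipped] = []
    for ftype, location in raw:
        coords = [int(n) for n in re.findall(r"\d+", location)]
        if not coords:
            # Malformed location: record the problem and keep going.
            skipped.append((ftype, location, "no coordinates"))
            continue
        start, stop = coords[0], coords[-1]
        if stop < start:
            skipped.append((ftype, location, "stop < start"))
            continue
        kept.append((ftype, start, stop))
    return kept, skipped


kept, skipped = keep_valid([
    ("CDS", "10..50"),
    ("gene", "join(900..1000,1..40)"),   # wraps the origin: stop < start
    ("tRNA", "unknown"),                 # malformed
    ("rRNA", "complement(5..80)"),
])
```

Returning the skipped features with a reason, rather than raising an exception, is one way to log problems while still emitting every valid feature.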
GFF3 Output Format
The generated GFF3 files include:
- Standard header: GFF version declaration and BBTools version information
- Sequence regions: ##sequence-region declarations for each processed locus
- Feature records: Tab-delimited feature annotations with attributes (product, locus_tag, subtype)
- Attribute preservation: Key biological attributes (product names, locus tags) are maintained from the original GBFF
Performance Characteristics
The tool is optimized for:
- Memory efficiency: Default 1GB allocation suitable for most bacterial and viral genomes
- Streaming processing: Processes large GBFF files without loading entire file into memory
- Selective parsing: Skips non-essential blocks (ORIGIN sequences, REFERENCE data) to improve speed
- Error tolerance: Continues processing even when individual features contain parsing errors
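Skipping non-essential blocks can be illustrated with a short Python sketch. It relies on a property of the GenBank flat file layout: top-level keywords start in the first column, while their continuation lines are indented. The names SKIPPED_BLOCKS and drop_skipped_blocks are hypothetical, and the sketch is not taken from the tool's source.

```python
from typing import List

# Blocks named in the list above as non-essential.
SKIPPED_BLOCKS = ("ORIGIN", "REFERENCE")


def drop_skipped_blocks(lines: List[str]) -> List[str]:
    """Remove ORIGIN and REFERENCE blocks from one record's lines."""
    kept: List[str] = []
    skipping = False
    for line in lines:
        if line and not line[0].isspace():
            # A keyword in column one starts a new top-level block.
            skipping = line.startswith(SKIPPED_BLOCKS)
        if not skipping:
            kept.append(line)
    return kept


record = [
    "LOCUS       DEMO_1  9 bp",
    "REFERENCE   1",
    "  AUTHORS   Example,A.",
    "FEATURES             Location/Qualifiers",
    "     CDS             1..9",
    "ORIGIN",
    "        1 atgaaataa",
    "//",
]
trimmed = drop_skipped_blocks(record)
```

The record terminator // starts in column one and is not in the skip list, so it ends the ORIGIN block and is kept.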
Use Cases
This tool is particularly useful for:
- Converting NCBI RefSeq annotations to GFF3 for genome browsers
- Preparing annotation data for comparative genomics pipelines
- Extracting specific feature types from complex GenBank records
- Quality control of genome annotations through format conversion
Output Format
The generated GFF3 files follow the GFF3 specification, with two BBTools-specific comment lines in the header (a version line and a column-name line):
Header Structure
##gff-version 3
#BBTools [version] GbffToGff
#seqid source type start end score strand phase attributes
Feature Record Format
Each feature line contains nine tab-separated fields:
- seqid: Sequence identifier (accession number from GBFF)
- source: Always set to '.' (not specified)
- type: Feature type (CDS, rRNA, tRNA, gene, etc.)
- start: Start coordinate (1-based)
- end: End coordinate (inclusive)
- score: Always set to '.' (not calculated)
- strand: '+' or '-' for forward/reverse strand
- phase: Always set to '.' (not calculated)
- attributes: Key-value pairs (product, locus_tag, subtype)
Attribute Details
- product: Protein or RNA product name from the original annotation
- locus_tag: Systematic gene identifier when available
- subtype: For rRNA genes, specifies the rRNA species (5S, 16S, 23S)
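The nine-column layout and the attribute column can be sketched in Python as follows. The function names and the sample values are hypothetical. The fixed '.' columns follow the field list above, while the percent-escaping of reserved characters follows the GFF3 specification and is an assumption here, since this document does not say whether the tool escapes attribute values.

```python
from typing import Dict


def escape_attr(value: str) -> str:
    """Percent-escape characters that GFF3 reserves in column 9."""
    # "%" must be handled first so earlier escapes are not re-escaped.
    for ch in "%;=&,\t":
        value = value.replace(ch, "%{:02X}".format(ord(ch)))
    return value


def gff3_line(seqid: str, ftype: str, start: int, end: int,
              strand: str, attributes: Dict[str, str]) -> str:
    """Format one feature as a nine-column, tab-separated GFF3 line."""
    attr_text = ";".join(
        "{}={}".format(key, escape_attr(val))
        for key, val in attributes.items() if val
    ) or "."
    # source, score, and phase are fixed to "." as described above.
    return "\t".join([seqid, ".", ftype, str(start), str(end),
                      ".", strand, ".", attr_text])


line = gff3_line("DEMO_1", "rRNA", 100, 1641, "+", {
    "product": "16S ribosomal RNA",
    "locus_tag": "DEMO_0001",
    "subtype": "16S",
})
print(line)
```

Attributes with empty values are left out, and a feature with no attributes at all gets '.' in the last column.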
Technical Notes
File Format Support
- Input files must be in standard GenBank flat file format (.gbff or .gb)
- Compressed input files (.gz) are automatically detected and supported
- Multi-locus files (multiple LOCUS records) are fully supported
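One common way to support both plain and gzip-compressed input is to choose the reader from the file name. The Python sketch below shows that approach; how the tool itself detects compression is not stated here, so the extension check and the helper name open_gbff are assumptions.

```python
import gzip
import os
import tempfile
from typing import TextIO


def open_gbff(path: str) -> TextIO:
    """Open a GBFF file as text, decompressing .gz input on the fly."""
    if path.endswith(".gz"):
        return gzip.open(path, "rt")
    return open(path, "rt")


# Small round trip with a temporary compressed file (hypothetical content).
with tempfile.TemporaryDirectory() as tmp:
    gz_path = os.path.join(tmp, "demo.gbff.gz")
    with gzip.open(gz_path, "wt") as out:
        out.write("LOCUS       DEMO_1\n//\n")
    with open_gbff(gz_path) as handle:
        first_line = handle.readline().rstrip("\n")

print(first_line)
```

Downstream parsing code can then read lines from the returned handle without knowing whether the source was compressed.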
Coordinate System
- GenBank uses 1-based coordinates, which are preserved in GFF3 output
- Feature coordinates include both endpoints (inclusive)
- Strand information is automatically inferred from complement() annotations
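Because both endpoints are included, a feature's length is end - start + 1, and converting to a 0-based, half-open system such as BED only shifts the start. The short Python sketch below works through that arithmetic; the function names are hypothetical and the BED conversion is shown only for comparison, since the tool itself writes GFF3.

```python
from typing import Tuple


def feature_length(start: int, end: int) -> int:
    """Length of a 1-based, inclusive interval."""
    return end - start + 1


def to_bed_interval(start: int, end: int) -> Tuple[int, int]:
    """Convert 1-based inclusive coordinates to 0-based half-open."""
    return start - 1, end


print(feature_length(10, 50))    # 41 bases: positions 10 through 50
print(to_bed_interval(10, 50))   # (9, 50)
```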
Limitations
- Only processes feature types defined in the internal whitelist
- Does not extract sequence data (ORIGIN sections are skipped)
- Complex feature relationships (gene-CDS hierarchies) are flattened
- Some GenBank qualifiers may be omitted if not in the core attribute set
Support
For questions and support:
- Email: bbushnell@lbl.gov
- Documentation: bbmap.org