CompareGFF

Script: comparegff.sh | Package: gff | Class: CompareGff.java

Compares CDS, rRNA, and tRNA lines in GFF files to grade gene-calling accuracy. The statistical report includes true and false positives and signal-to-noise ratios.

Basic Usage

comparegff.sh in=<input gff> ref=<reference gff>

Compare a query GFF file against a reference GFF file to evaluate gene prediction accuracy. The tool focuses on CDS (coding sequences), rRNA (ribosomal RNA), and tRNA (transfer RNA) features.
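As a rough illustration of the input, the sketch below parses one GFF line and keeps only the three feature types the tool compares. It assumes the standard nine tab-separated GFF columns (seqid, source, type, start, end, score, strand, phase, attributes). The names GffRecord and parse_gff_line are hypothetical and are not part of BBTools.

```python
from typing import NamedTuple, Optional

# Feature types that the comparison considers, per the description above.
TARGET_TYPES = {"CDS", "rRNA", "tRNA"}


class GffRecord(NamedTuple):
    """Hypothetical minimal view of one GFF feature line."""
    seqid: str
    ftype: str
    start: int
    end: int
    strand: str


def parse_gff_line(line: str) -> Optional[GffRecord]:
    """Return a record for CDS/rRNA/tRNA lines, or None for anything else."""
    if not line.strip() or line.startswith("#"):
        return None  # blank line, comment, or directive such as ##gff-version
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 9:
        return None  # not a complete nine-column feature line
    seqid, _source, ftype, start, end, _score, strand = fields[:7]
    if ftype not in TARGET_TYPES:
        return None  # e.g. "gene" or "exon" lines are ignored in this sketch
    return GffRecord(seqid, ftype, int(start), int(end), strand)
```

For example, parse_gff_line("contig1\tprodigal\tCDS\t100\t400\t.\t+\t0\tID=gene1") yields a record for contig1 spanning 100 to 400 on the plus strand, while a line whose type column is "gene" yields None.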

Parameters

Parameters control input files and processing behavior for GFF comparison analysis.

Standard Parameters

in=<file>
Query GFF file to be evaluated. This is the gene prediction file that will be compared against the reference.
ref=<file>
Reference GFF file containing the "ground truth" gene annotations used for comparison.

Processing Parameters

lines=<number>
Maximum number of lines to process from the query GFF file. Default: unlimited (the entire file is processed). A value of -1 also means unlimited.
verbose=<boolean>
Enable verbose output showing detailed debugging information including hash map contents and individual line processing details. Default: false.

Examples

Basic Comparison

comparegff.sh in=predicted_genes.gff ref=reference_genes.gff

Compare predicted gene annotations against a reference annotation set to evaluate gene-calling accuracy.

Verbose Analysis

comparegff.sh in=predicted_genes.gff ref=reference_genes.gff verbose=true

Perform comparison with detailed debugging output showing internal data structures and processing steps.

Limited Processing

comparegff.sh in=large_prediction.gff ref=reference.gff lines=10000

Process only the first 10,000 lines of the query GFF file for quick testing or partial analysis.

Algorithm Details

Comparison Strategy

CompareGFF evaluates gene predictions with hash-based lookups: it loads the reference with GffLine.loadGffFile() and maps coordinates through StringNum keys. Only three feature types are considered, CDS (protein-coding genes), rRNA (ribosomal RNA), and tRNA (transfer RNA), selected by ProkObject.processType(). The comparison runs in two phases: reference processing, then query evaluation.

Reference Processing Phase

The tool first loads all reference GFF lines matching the target feature types (CDS, rRNA, tRNA) using GffLine.loadGffFile(ffref, "CDS,rRNA,tRNA", true) and constructs three HashMap data structures from them.
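A minimal sketch of the indexing idea follows. It is written in Python rather than the tool's Java and covers only one lookup table. It assumes the integer position in each key is the strand-aware stop coordinate (the 3' end of the feature); the text here says only that keys pair a sequence ID with an integer position. GffRecord, true_stop, and build_reference_index are hypothetical names.

```python
from collections import namedtuple

# Hypothetical minimal record; not the BBTools GffLine class.
GffRecord = namedtuple("GffRecord", "seqid ftype start end strand")


def true_stop(rec):
    """3' end of the feature: `end` on the plus strand, `start` on the minus strand."""
    return rec.start if rec.strand == "-" else rec.end


def build_reference_index(ref_records):
    """Map (seqid, strand-aware stop) -> reference record.

    Assumption: the stop coordinate is the lookup position.
    """
    index = {}
    for rec in ref_records:
        index[(rec.seqid, true_stop(rec))] = rec
    return index
```

With this layout, a plus-strand CDS at 100..400 on contig c1 is stored under ("c1", 400), and a minus-strand CDS at 500..900 under ("c1", 500).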

Query Evaluation Phase

Each query GFF line is passed to processLine(), which performs strand-aware, type-specific matching against the reference.
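The following sketch shows one way strand-aware, type-specific matching could work. It is not the processLine() implementation. It assumes a query counts as a true positive only when a reference feature shares its sequence ID, strand-aware stop coordinate, strand, and feature type. All names are hypothetical.

```python
from collections import Counter, namedtuple

# Hypothetical minimal record; not the BBTools GffLine class.
GffRecord = namedtuple("GffRecord", "seqid ftype start end strand")


def true_stop(rec):
    """3' end of the feature: `end` on the plus strand, `start` on the minus strand."""
    return rec.start if rec.strand == "-" else rec.end


def evaluate(ref_records, query_records):
    """Count query lines that do or do not match a reference feature."""
    # Index the reference once so each query costs a single dictionary lookup.
    index = {(r.seqid, true_stop(r)): r for r in ref_records}
    counts = Counter()
    for q in query_records:
        ref = index.get((q.seqid, true_stop(q)))
        if ref is not None and ref.strand == q.strand and ref.ftype == q.ftype:
            counts["true_positive"] += 1
        else:
            counts["false_positive"] += 1
    return counts
```

Under these assumptions a predicted CDS with the right stop but a different start still counts as a match, while a prediction at the same coordinates with the wrong feature type does not.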

Statistical Metrics

The tool calculates dual-perspective accuracy metrics, tracking each in a separate counter variable.
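This page does not give the formula behind the signal-to-noise ratio. The sketch below assumes a conventional decibel form, 10 * log10(TP / (FP + FN)), with false negatives taken as reference features that no query matched. The formula and the function names are illustrative assumptions, not the tool's definition.

```python
import math


def false_negatives(reference_count, true_positives):
    """Reference features that were never matched by a query (assumed definition)."""
    return reference_count - true_positives


def snr_db(true_pos, false_pos, false_neg):
    """Signal-to-noise ratio in decibels under the assumed formula."""
    noise = false_pos + false_neg
    if noise == 0:
        return math.inf   # no errors at all
    if true_pos == 0:
        return -math.inf  # nothing correct
    return 10.0 * math.log10(true_pos / noise)
```

For instance, 900 true positives against 60 false positives and 30 false negatives gives a ratio of 10 to 1, which is 10 dB on this scale.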

Performance Characteristics

Query evaluation uses HashMap.get(), giving O(1) average-case lookups. Memory usage scales linearly with the size of the reference annotation, which is held in lineMap. StringNum keys store each sequence ID and integer position as a pair, which is more memory-efficient than concatenating them into a string.
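To illustrate the composite-key idea, the sketch below compares a pair-style key with a concatenated string key. SeqPos is a hypothetical stand-in for a StringNum-like key, not the BBTools class. Besides avoiding a new string for every lookup, a pair key cannot collide the way delimiter-free concatenation can.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SeqPos:
    """Hypothetical pair key: sequence ID plus integer position.

    frozen=True makes instances immutable and hashable, so they work as dict keys.
    """
    seqid: str
    pos: int


def concat_key(seqid, pos):
    """Naive alternative: build a fresh string for each key."""
    return seqid + str(pos)


# Two different (seqid, pos) pairs that collapse to the same concatenated string.
ambiguous = concat_key("chr1", 23) == concat_key("chr12", 3)
distinct = SeqPos("chr1", 23) != SeqPos("chr12", 3)
```

Here both concatenated keys are "chr123", so ambiguous is True, while the two SeqPos keys remain distinct.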

Feature Type Processing

The tool calls BBTools' ProkObject.processType(gline.prokType()) to decide which prokaryotic features to process. The prokType() classification returned by each GffLine object determines whether a given RNA type or CDS feature is included, and this filtering is configurable.
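Configurable type filtering can be pictured as a small predicate applied to each feature before comparison. The sketch below is a generic illustration, not the ProkObject API. TypeFilter, process_type, and the set of enabled types are hypothetical.

```python
class TypeFilter:
    """Hypothetical filter holding the feature types enabled for processing."""

    def __init__(self, enabled=("CDS", "rRNA", "tRNA")):
        self.enabled = set(enabled)

    def process_type(self, ftype):
        """Return True if features of this type should be compared."""
        return ftype in self.enabled


def filter_types(feature_types, type_filter):
    """Keep only the feature types that the filter accepts, preserving order."""
    return [t for t in feature_types if type_filter.process_type(t)]
```

A filter built with enabled=("CDS",) would pass coding sequences and skip both RNA types, which is one way per-type processing could be switched on or off.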

Output Format

CompareGFF writes its detailed statistical reports through outstream.println() calls.

Support

For questions and support: