CompareGFF

Basic Usage

comparegff.sh in=<input gff> ref=<reference gff>

Compare a query GFF file against a reference GFF file to evaluate gene prediction accuracy. The tool focuses on CDS (coding sequences), rRNA (ribosomal RNA), and tRNA (transfer RNA) features.

Parameters

Parameters control input files and processing behavior for GFF comparison analysis.

Standard Parameters

in=<file>: Query GFF file to be evaluated. This is the gene prediction file that will be compared against the reference.
ref=<file>: Reference GFF file containing the "ground truth" gene annotations used for comparison.

Processing Parameters

lines=<number>: Maximum number of lines to process from the query GFF file. Default: unlimited (the entire file is processed). Setting -1 also selects unlimited processing.
verbose=<boolean>: Enable verbose output showing detailed debugging information including hash map contents and individual line processing details. Default: false.

Examples

Basic Comparison

comparegff.sh in=predicted_genes.gff ref=reference_genes.gff

Compare predicted gene annotations against a reference annotation set to evaluate gene-calling accuracy.

Verbose Analysis

comparegff.sh in=predicted_genes.gff ref=reference_genes.gff verbose=true

Perform comparison with detailed debugging output showing internal data structures and processing steps.

Limited Processing

comparegff.sh in=large_prediction.gff ref=reference.gff lines=10000

Process only the first 10,000 lines of the query GFF file for quick testing or partial analysis.

Algorithm Details

Comparison Strategy

CompareGFF evaluates gene predictions with hash-based lookups. Reference features are loaded with GffLine.loadGffFile() and indexed by StringNum keys (sequence ID plus coordinate). Only three feature types are considered: CDS (protein-coding genes), rRNA (ribosomal RNA), and tRNA (transfer RNA), with filtering through ProkObject.processType(). The comparison runs in two phases: reference processing, then query evaluation.

Reference Processing Phase

The tool first loads all reference GFF lines matching the target feature types (CDS, rRNA, tRNA) using GffLine.loadGffFile(ffref, "CDS,rRNA,tRNA", true) and constructs three HashMap data structures:

lineMap: HashMap<StringNum, GffLine> mapping sequence ID + stop position to reference features using StringNum(gline.seqid, stop) keys
startCountMap: HashMap<StringNum, Integer> initialized to zero for tracking correct start position matches
stopCountMap: HashMap<StringNum, Integer> initialized to zero for tracking correct stop position matches
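
The three maps described above can be sketched in a few lines of Java. This is a minimal illustration, not the tool's source: SeqPos and Feature are hypothetical stand-ins for StringNum and GffLine, and the add() method is assumed for the example.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

/** Sketch of the reference index; every class name here is hypothetical. */
final class RefIndexSketch {

    /** Stand-in for StringNum: a sequence ID paired with an integer position. */
    static final class SeqPos {
        final String seqid;
        final int pos;
        SeqPos(String seqid, int pos) { this.seqid = seqid; this.pos = pos; }
        @Override public boolean equals(Object o) {
            if (!(o instanceof SeqPos)) { return false; }
            SeqPos other = (SeqPos) o;
            return pos == other.pos && seqid.equals(other.seqid);
        }
        @Override public int hashCode() { return Objects.hash(seqid, pos); }
    }

    /** Stand-in for a reference GffLine, reduced to the fields used for matching. */
    static final class Feature {
        final String seqid, type;
        final char strand;
        final int trueStart, trueStop; // assumed to be supplied already strand-aware
        Feature(String seqid, String type, char strand, int trueStart, int trueStop) {
            this.seqid = seqid; this.type = type; this.strand = strand;
            this.trueStart = trueStart; this.trueStop = trueStop;
        }
    }

    final Map<SeqPos, Feature> lineMap = new HashMap<>();
    final Map<SeqPos, Integer> startCountMap = new HashMap<>();
    final Map<SeqPos, Integer> stopCountMap = new HashMap<>();

    /** Index one reference feature under (seqid, stop) and zero both counters. */
    void add(Feature f) {
        SeqPos key = new SeqPos(f.seqid, f.trueStop);
        lineMap.put(key, f);
        startCountMap.put(key, 0);
        stopCountMap.put(key, 0);
    }

    public static void main(String[] args) {
        RefIndexSketch index = new RefIndexSketch();
        index.add(new Feature("contig1", "CDS", '+', 100, 400));
        index.add(new Feature("contig1", "tRNA", '-', 900, 825));
        System.out.println(index.lineMap.size() + " reference features indexed");
    }
}
```

Because the key class defines equals() and hashCode() over both fields, a query that builds a new key with the same sequence ID and stop position retrieves the stored reference feature.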

Query Evaluation Phase

For each query GFF line processed through processLine(), the algorithm performs strand-aware, type-specific matching:

Position Matching: Uses StringNum(gline.seqid, stop) as primary key for lineMap.get() lookup
Strand Verification: Rejects the match if refline.strand != gline.strand
Type Verification: Rejects the match if !refline.type.equals(gline.type), so the feature types must be identical
Boundary Analysis: Separately evaluates start==refline.trueStart() and stop==refline.trueStop() positions
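
As a rough illustration of how the four checks above might combine, the sketch below classifies a single query feature. It is an assumption-laden simplification: Feature, Outcome, classify(), and the string key are hypothetical (the tool itself uses StringNum keys and counter variables rather than an enum).

```java
import java.util.HashMap;
import java.util.Map;

/** Sketch of per-line matching; names not quoted in the text above are hypothetical. */
final class MatchSketch {

    /** Simplified feature; trueStart/trueStop are assumed to be supplied by the caller. */
    static final class Feature {
        final String seqid, type;
        final char strand;
        final int trueStart, trueStop;
        Feature(String seqid, String type, char strand, int trueStart, int trueStop) {
            this.seqid = seqid; this.type = type; this.strand = strand;
            this.trueStart = trueStart; this.trueStop = trueStop;
        }
    }

    enum Outcome { NO_REFERENCE, STRAND_MISMATCH, TYPE_MISMATCH, STOP_ONLY, START_AND_STOP }

    /** Builds the lookup key; a plain string is used only to keep the sketch short. */
    static String key(String seqid, int stop) { return seqid + "\t" + stop; }

    static Outcome classify(Feature query, Map<String, Feature> lineMap) {
        // Position matching: look up the reference feature sharing seqid and stop.
        Feature ref = lineMap.get(key(query.seqid, query.trueStop));
        if (ref == null) { return Outcome.NO_REFERENCE; }
        // Strand verification.
        if (ref.strand != query.strand) { return Outcome.STRAND_MISMATCH; }
        // Type verification.
        if (!ref.type.equals(query.type)) { return Outcome.TYPE_MISMATCH; }
        // Boundary analysis: the stop matched via the key, so only the start is compared.
        return query.trueStart == ref.trueStart ? Outcome.START_AND_STOP : Outcome.STOP_ONLY;
    }

    public static void main(String[] args) {
        Map<String, Feature> lineMap = new HashMap<>();
        lineMap.put(key("contig1", 400), new Feature("contig1", "CDS", '+', 100, 400));
        Feature query = new Feature("contig1", "CDS", '+', 130, 400);
        System.out.println(classify(query, lineMap));
    }
}
```

A query with the right stop but a different start is reported as STOP_ONLY in this sketch, which mirrors the idea that start and stop accuracy are tallied separately.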

Statistical Metrics

The tool calculates dual-perspective accuracy metrics through separate counter variables:

True Positives: truePositiveStart, truePositiveStop (ref-relative) and truePositiveStart2, truePositiveStop2 (query-relative)
False Positives: falsePositiveStart, falsePositiveStop (ref-relative) and falsePositiveStart2, falsePositiveStop2 (query-relative)
False Negatives: falseNegativeStart, falseNegativeStop, calculated by iterating through the count maps and counting entries with values < 1 (reference features that no query line matched correctly)
Signal-to-Noise Ratio: 10 * Math.log10((truePositiveStart2+truePositiveStop2+0.1)/(falsePositiveStart2+falsePositiveStop2+0.1))
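
The signal-to-noise formula and the false-negative count can be tried out with a short, self-contained sketch. The class and method names (MetricsSketch, snr, countMissed) and the string-keyed map are assumptions made for this example; only the formula itself is taken from the text above.

```java
import java.util.HashMap;
import java.util.Map;

/** Sketch of the metric arithmetic; class and method names are hypothetical. */
final class MetricsSketch {

    /** 10*log10 of true positives over false positives; 0.1 is added to each side to avoid division by zero. */
    static double snr(long tpStart, long tpStop, long fpStart, long fpStop) {
        return 10 * Math.log10((tpStart + tpStop + 0.1) / (fpStart + fpStop + 0.1));
    }

    /** Counts reference entries whose match count is below 1, i.e. false negatives. */
    static int countMissed(Map<String, Integer> countMap) {
        int missed = 0;
        for (int count : countMap.values()) {
            if (count < 1) { missed++; }
        }
        return missed;
    }

    public static void main(String[] args) {
        Map<String, Integer> stopCountMap = new HashMap<>();
        stopCountMap.put("contig1:400", 1); // matched once
        stopCountMap.put("contig1:825", 0); // never matched
        System.out.println("false negative stops: " + countMissed(stopCountMap));
        System.out.println("SNR: " + snr(50, 50, 5, 5));
    }
}
```

With equal true-positive and false-positive totals the ratio is 1 and the result is 0; more true positives than false positives gives a positive value, and the reverse gives a negative one.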

Performance Characteristics

The algorithm uses HashMap.get() operations for O(1) average-case lookup complexity during query evaluation. Memory usage scales linearly with reference annotation size through lineMap storage, with the StringNum key system providing memory-efficient storage for sequence ID + integer position pairs compared to string concatenation.

Feature Type Processing

The tool calls BBTools' ProkObject.processType(gline.prokType()) to decide whether a feature should be processed. The decision is based on the prokType() classification returned by each GffLine object, which allows the different RNA types and CDS features to be enabled or disabled through configuration.

Output Format

CompareGFF produces detailed statistical reports through outstream.println() calls including:

Reference Statistics: Count and percentage of true/false positives and negatives relative to refCount using Tools.format("%.3f%%", value*100.0/refCount)
Query Statistics: Count and percentage of true/false positives relative to queryCount with similar formatting
Signal-to-Noise Ratio: Overall quality metric formatted using Tools.format("%.4f", calculatedSNR)
Processing Summary: Lines processed, bytes processed, and execution time via Tools.timeLinesBytesProcessed()
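
The two format strings mentioned above behave as follows. This sketch uses the standard String.format in place of Tools.format, on the assumption that the format strings are interpreted the same way; the class name, method names, and printed labels are hypothetical and do not reproduce the tool's actual report text.

```java
import java.util.Locale;

/** Sketch of the report formatting; String.format stands in for Tools.format. */
final class ReportSketch {

    /** Formats value as a percentage of total with three decimals, e.g. 90.000%. */
    static String percent(long value, long total) {
        return String.format(Locale.ROOT, "%.3f%%", value * 100.0 / total);
    }

    /** Formats a signal-to-noise value with four decimals. */
    static String snr(double value) {
        return String.format(Locale.ROOT, "%.4f", value);
    }

    public static void main(String[] args) {
        long refCount = 50;
        long truePositiveStop = 45;
        // Labels below are illustrative only.
        System.out.println("example true positive stops: " + truePositiveStop
                + " (" + percent(truePositiveStop, refCount) + ")");
        System.out.println("example SNR: " + snr(9.96113));
    }
}
```

Locale.ROOT is passed so that the decimal separator stays a period regardless of the system locale.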

Support

For questions and support:

Email: bbushnell@lbl.gov
Documentation: bbmap.org