CompareGFF
Compares CDS, rRNA, and tRNA lines in GFF files to grade gene-calling accuracy. Reports true positives, false positives, false negatives, and a signal-to-noise ratio.
Basic Usage
comparegff.sh in=<input gff> ref=<reference gff>
Compare a query GFF file against a reference GFF file to evaluate gene prediction accuracy. The tool focuses on CDS (coding sequences), rRNA (ribosomal RNA), and tRNA (transfer RNA) features.
Parameters
Parameters control input files and processing behavior for GFF comparison analysis.
Standard Parameters
- in=<file>
- Query GFF file to be evaluated. This is the gene prediction file that will be compared against the reference.
- ref=<file>
- Reference GFF file containing the "ground truth" gene annotations used for comparison.
Processing Parameters
- lines=<number>
- Maximum number of lines to process from the query GFF file. Default: unlimited (the entire file is processed). A value of -1 also means unlimited.
- verbose=<boolean>
- Enable verbose output showing detailed debugging information including hash map contents and individual line processing details. Default: false.
Examples
Basic Comparison
comparegff.sh in=predicted_genes.gff ref=reference_genes.gff
Compare predicted gene annotations against a reference annotation set to evaluate gene-calling accuracy.
Verbose Analysis
comparegff.sh in=predicted_genes.gff ref=reference_genes.gff verbose=true
Perform comparison with detailed debugging output showing internal data structures and processing steps.
Limited Processing
comparegff.sh in=large_prediction.gff ref=reference.gff lines=10000
Process only the first 10,000 lines of the query GFF file for quick testing or partial analysis.
Algorithm Details
Comparison Strategy
CompareGFF implements a hash-based evaluation of gene predictions. It loads the reference with GffLine.loadGffFile() and looks up coordinates through StringNum keys. Three feature types are compared: CDS (protein-coding genes), rRNA (ribosomal RNA), and tRNA (transfer RNA), filtered through ProkObject.processType(). The comparison runs in two phases: reference processing, then query evaluation.
Reference Processing Phase
The tool first loads all reference GFF lines matching the target feature types (CDS, rRNA, tRNA) using GffLine.loadGffFile(ffref, "CDS,rRNA,tRNA", true) and constructs three HashMap data structures:
- lineMap: HashMap<StringNum, GffLine> mapping sequence ID + stop position to reference features using StringNum(gline.seqid, stop) keys
- startCountMap: HashMap<StringNum, Integer> initialized to zero for tracking correct start position matches
- stopCountMap: HashMap<StringNum, Integer> initialized to zero for tracking correct stop position matches
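The map construction described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the BBTools source: SeqPos is a hypothetical stand-in for StringNum, Feature is a hypothetical stand-in for GffLine, and the start and stop fields are assumed to already hold the strand-corrected coordinates.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Objects;

class RefMapsSketch {

    /** Hypothetical stand-in for StringNum: a (sequence ID, position) key. */
    static final class SeqPos {
        final String seqid;
        final int pos;

        SeqPos(String seqid, int pos) {
            this.seqid = seqid;
            this.pos = pos;
        }

        @Override
        public boolean equals(Object o) {
            if (!(o instanceof SeqPos)) return false;
            SeqPos other = (SeqPos) o;
            return pos == other.pos && seqid.equals(other.seqid);
        }

        @Override
        public int hashCode() {
            return Objects.hash(seqid, pos);
        }
    }

    /** Hypothetical stand-in for GffLine, reduced to the fields used here. */
    static final class Feature {
        final String seqid;
        final String type;
        final int start;
        final int stop;
        final char strand;

        Feature(String seqid, String type, int start, int stop, char strand) {
            this.seqid = seqid;
            this.type = type;
            this.start = start;
            this.stop = stop;
            this.strand = strand;
        }
    }

    final Map<SeqPos, Feature> lineMap = new HashMap<>();
    final Map<SeqPos, Integer> startCountMap = new HashMap<>();
    final Map<SeqPos, Integer> stopCountMap = new HashMap<>();

    /** Indexes every reference feature by (seqid, stop) and zeroes its counters. */
    RefMapsSketch(List<Feature> referenceFeatures) {
        for (Feature f : referenceFeatures) {
            SeqPos key = new SeqPos(f.seqid, f.stop);
            lineMap.put(key, f);
            startCountMap.put(key, 0);
            stopCountMap.put(key, 0);
        }
    }

    public static void main(String[] args) {
        RefMapsSketch maps = new RefMapsSketch(Arrays.asList(
            new Feature("contig1", "CDS", 100, 400, '+'),
            new Feature("contig1", "tRNA", 900, 975, '+')));
        System.out.println(maps.lineMap.size() + " reference features indexed");
    }
}
```

In this sketch, two reference features that share a sequence ID and stop position would overwrite each other in lineMap, since the key carries no strand or type.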
Query Evaluation Phase
For each query GFF line processed through processLine(), the algorithm performs strand-aware, type-specific matching:
- Position Matching: Uses StringNum(gline.seqid, stop) as primary key for lineMap.get() lookup
- Strand Verification: Tests refline.strand != gline.strand; a strand difference counts as a mismatch
- Type Verification: Tests !refline.type.equals(gline.type); the feature types must be identical to match
- Boundary Analysis: Separately evaluates start==refline.trueStart() and stop==refline.trueStop() positions
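The matching steps above might look like the following sketch. It rests on assumptions that are not confirmed by this document: a missing key, a strand difference, and a type difference are all treated as "no match", and a matched stop is then classified by whether the start also agrees. SeqPos, Feature, Outcome, and classify() are hypothetical names introduced for illustration.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

class QueryMatchSketch {

    /** Hypothetical outcome labels for a single query feature. */
    enum Outcome { NO_MATCH, STOP_ONLY, START_AND_STOP }

    /** Hypothetical stand-in for StringNum: a (sequence ID, position) key. */
    static final class SeqPos {
        final String seqid;
        final int pos;

        SeqPos(String seqid, int pos) {
            this.seqid = seqid;
            this.pos = pos;
        }

        @Override
        public boolean equals(Object o) {
            if (!(o instanceof SeqPos)) return false;
            SeqPos other = (SeqPos) o;
            return pos == other.pos && seqid.equals(other.seqid);
        }

        @Override
        public int hashCode() {
            return Objects.hash(seqid, pos);
        }
    }

    /** Hypothetical stand-in for GffLine; start/stop are strand-corrected. */
    static final class Feature {
        final String seqid;
        final String type;
        final int start;
        final int stop;
        final char strand;

        Feature(String seqid, String type, int start, int stop, char strand) {
            this.seqid = seqid;
            this.type = type;
            this.start = start;
            this.stop = stop;
            this.strand = strand;
        }
    }

    /** Looks up the query by (seqid, stop), then checks strand, type, and start. */
    static Outcome classify(Feature query, Map<SeqPos, Feature> lineMap) {
        Feature ref = lineMap.get(new SeqPos(query.seqid, query.stop));
        if (ref == null || ref.strand != query.strand || !ref.type.equals(query.type)) {
            return Outcome.NO_MATCH;
        }
        // The stop already agrees because it was the lookup key.
        return query.start == ref.start ? Outcome.START_AND_STOP : Outcome.STOP_ONLY;
    }

    public static void main(String[] args) {
        Map<SeqPos, Feature> lineMap = new HashMap<>();
        Feature ref = new Feature("contig1", "CDS", 100, 400, '+');
        lineMap.put(new SeqPos(ref.seqid, ref.stop), ref);

        Feature shiftedStart = new Feature("contig1", "CDS", 130, 400, '+');
        System.out.println(classify(shiftedStart, lineMap));
    }
}
```

Keying on the stop position means a prediction with the right stop but a shifted start is still found in one lookup, so start and stop accuracy can be scored separately.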
Statistical Metrics
The tool tracks accuracy from two perspectives, relative to the reference and relative to the query, using separate counter variables:
- True Positives: truePositiveStart, truePositiveStop (ref-relative) and truePositiveStart2, truePositiveStop2 (query-relative)
- False Positives: falsePositiveStart, falsePositiveStop (ref-relative) and falsePositiveStart2, falsePositiveStop2 (query-relative)
- False Negatives: falseNegativeStart, falseNegativeStop calculated by iterating through count maps for values < 1
- Signal-to-Noise Ratio: 10 * Math.log10((truePositiveStart2+truePositiveStop2+0.1)/(falsePositiveStart2+falsePositiveStop2+0.1))
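The false-negative count and the signal-to-noise formula above can be reproduced in a few lines. This is a sketch: the method names are hypothetical, and plain String keys stand in for StringNum to keep the example short.

```java
import java.util.HashMap;
import java.util.Map;

class MetricsSketch {

    /** Counts reference entries that were never matched (count < 1). */
    static int countFalseNegatives(Map<String, Integer> countMap) {
        int falseNegatives = 0;
        for (int count : countMap.values()) {
            if (count < 1) {
                falseNegatives++;
            }
        }
        return falseNegatives;
    }

    /** SNR in decibels; the 0.1 pseudocounts keep the ratio finite at zero counts. */
    static double snr(long tpStart, long tpStop, long fpStart, long fpStop) {
        return 10 * Math.log10((tpStart + tpStop + 0.1) / (fpStart + fpStop + 0.1));
    }

    public static void main(String[] args) {
        Map<String, Integer> stopCountMap = new HashMap<>();
        stopCountMap.put("contig1:400", 1); // matched by a query feature
        stopCountMap.put("contig1:975", 0); // never matched: a false negative

        System.out.println("falseNegativeStop = " + countFalseNegatives(stopCountMap));
        System.out.println("SNR = " + snr(50, 50, 5, 5));
    }
}
```

Because the ratio is on a log scale, equal true-positive and false-positive totals give an SNR of 0, and each tenfold improvement in the ratio adds 10.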
Performance Characteristics
Query evaluation uses HashMap.get(), giving O(1) average-case lookups. Memory usage scales linearly with the size of the reference annotation, since every reference feature is stored in lineMap. StringNum keys hold a sequence ID and an integer position together, which is more memory-efficient than building a concatenated string for each key.
Feature Type Processing
Feature filtering is delegated to BBTools' ProkObject.processType(gline.prokType()). Each GffLine reports its classification through prokType(), and processType() decides whether that class (CDS or one of the RNA types) is processed. Which types are processed is configurable.
Output Format
CompareGFF produces detailed statistical reports through outstream.println() calls including:
- Reference Statistics: Count and percentage of true/false positives and negatives relative to refCount using Tools.format("%.3f%%", value*100.0/refCount)
- Query Statistics: Count and percentage of true/false positives relative to queryCount with similar formatting
- Signal-to-Noise Ratio: Overall quality metric formatted using Tools.format("%.4f", calculatedSNR)
- Processing Summary: Lines processed, bytes processed, and execution time via Tools.timeLinesBytesProcessed()
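The percentage and SNR formatting can be approximated with standard Java formatting. In this sketch, String.format stands in for Tools.format on the assumption that the latter behaves like a printf-style formatter; the method names are hypothetical.

```java
import java.util.Locale;

class ReportFormatSketch {

    /** Formats value as a percentage of total with three decimals, e.g. for refCount. */
    static String percent(long value, long total) {
        // Locale.ROOT keeps the decimal separator a '.' regardless of system locale.
        return String.format(Locale.ROOT, "%.3f%%", value * 100.0 / total);
    }

    /** Formats the signal-to-noise ratio with four decimals. */
    static String snr(double value) {
        return String.format(Locale.ROOT, "%.4f", value);
    }

    public static void main(String[] args) {
        long refCount = 50;
        long truePositiveStop = 45;
        System.out.println("True Positive Stop:\t" + truePositiveStop
            + "\t" + percent(truePositiveStop, refCount));
        System.out.println("SNR:\t" + snr(9.961127));
    }
}
```

The labels and tab layout in main() are illustrative only and are not taken from the tool's actual report.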
Support
For questions and support:
- Email: bbushnell@lbl.gov
- Documentation: bbmap.org