AnalyzeGenes
Generates a prokaryotic gene model (.pkm) for gene calling. Input is fasta and gff files. The .pkm file may be used by CallGenes.
Basic Usage
analyzegenes.sh in=x.fa gff=x.gff out=x.pgm
This tool analyzes paired prokaryotic fasta and GFF files to calculate patterns in coding and noncoding frames, start and stop sites, producing a prokaryotic gene model (PGM) file for use with CallGenes.
Parameters
Parameters control file input/output and gene analysis behavior.
File parameters
- in=<file>
- A fasta file or comma-delimited list of fasta files.
- gff=<file>
- A gff file or comma-delimited list. This is optional; if present, it must match the number of fasta files. If absent, a fasta file 'foo.fasta' will imply the presence of 'foo.gff'.
- out=<file>
- Output pgm file.
Analysis parameters
- verbose=f
- Print verbose output during processing.
- alignribo=t
- Load ribosomal consensus sequences for gene alignment and analysis.
- adjustendpoints=t
- Adjust gene start/stop endpoints during analysis.
Gene Model K-mer parameters
- kinnercds
- K-mer size for inner CDS regions.
- kstartcds
- K-mer size for CDS start sites.
- kstopcds
- K-mer size for CDS stop sites.
- kinnerrna
- K-mer size for inner RNA regions. Default: 6
- kstartrna
- K-mer size for RNA start sites. Default: 3
- kstoprna
- K-mer size for RNA stop sites. Default: 3
Gene Model Position Offset parameters
- startleftoffset
- Left offset for start codon analysis.
- startrightoffset
- Right offset for start codon analysis.
- stopleftoffset
- Left offset for stop codon analysis.
- stoprightoffset
- Right offset for stop codon analysis.
Gene Type Control parameters
- callcds=t
- Enable calling of CDS (protein-coding) genes.
- calltrna=t
- Enable calling of tRNA genes.
- call16s=t
- Enable calling of 16S rRNA genes.
- call23s=t
- Enable calling of 23S rRNA genes.
- call5s=t
- Enable calling of 5S rRNA genes.
- call18s=f
- Enable calling of 18S rRNA genes. Default: false
- callrna
- Enable calling of all RNA types (tRNA, 16S, 23S, 5S, 18S).
- cdsonly=f
- Call only CDS genes, disable all RNA calling.
- 16sonly=f
- Call only 16S rRNA genes, disable all other types.
- 23sonly=f
- Call only 23S rRNA genes, disable all other types.
- 5sonly=f
- Call only 5S rRNA genes, disable all other types.
- trnaonly=f
- Call only tRNA genes, disable all other types.
- 18sonly=f
- Call only 18S rRNA genes, disable all other types.
rRNA Identity Thresholds
- min16sid=0.62
- Minimum identity threshold for 16S rRNA gene calling.
- min23sid=0.60
- Minimum identity threshold for 23S rRNA gene calling.
- min5sid=0.60
- Minimum identity threshold for 5S rRNA gene calling.
- min18sid=0.60
- Minimum identity threshold for 18S rRNA gene calling.
rRNA Start/Stop Slop parameters
- 16sstartslop=200
- Start slop for 16S rRNA gene boundaries. Also ssustartslop.
- 16sstopslop=0
- Stop slop for 16S rRNA gene boundaries. Also ssustopslop.
- 23sstartslop=220
- Start slop for 23S rRNA gene boundaries. Also lsustartslop.
- 23sstopslop=0
- Stop slop for 23S rRNA gene boundaries. Also lsustopslop.
- 5sstartslop=50
- Start slop for 5S rRNA gene boundaries.
- 5sstopslop=50
- Stop slop for 5S rRNA gene boundaries.
Strand Processing
- plus=t
- Process the plus strand during gene analysis.
- minus=t
- Process the minus strand during gene analysis.
Sequence Loading Control
- align16s=t
- Load 16S consensus sequences for alignment. Also load16ssequence.
- align23s=t
- Load 23S consensus sequences for alignment. Also load23ssequence.
- align18s=t
- Load 18S consensus sequences for alignment. Also load18ssequence.
- align5s=t
- Load 5S consensus sequences for alignment. Also load5ssequence.
Long K-mer Control
- load16skmers=t
- Load 16S long k-mers for gene identification. Also loadssukmers.
- load23skmers=t
- Load 23S long k-mers for gene identification. Also loadlsukmers.
- load5skmers=t
- Load 5S long k-mers for gene identification.
- loadtrnakmers=t
- Load tRNA long k-mers for gene identification.
- longkmers
- Enable all long k-mer loading (16S, 23S, 5S, tRNA).
- klong16s=15
- K-mer length for 16S long k-mers. Also klongssu.
- klong23s=15
- K-mer length for 23S long k-mers. Also klonglsu.
- klong5s=15
- K-mer length for 5S long k-mers.
- klongtrna=15
- K-mer length for tRNA long k-mers.
Additional Model parameters
- addcdsonly=f
- Add only CDS genes to the model, exclude RNA types.
- normalize=f
- Normalize gene models during processing.
Examples
Basic Gene Model Generation
analyzegenes.sh in=genome.fasta gff=genome.gff out=genome.pgm
Generates a prokaryotic gene model from paired fasta and GFF files.
Multiple Input Files
analyzegenes.sh in=genome1.fa,genome2.fa gff=genome1.gff,genome2.gff out=combined.pgm
Processes multiple paired fasta/GFF files to create a combined gene model.
CDS-Only Gene Model
analyzegenes.sh in=genome.fa gff=genome.gff out=cds_only.pgm cdsonly=t
Creates a gene model containing only CDS (protein-coding) genes, excluding all RNA types.
Custom rRNA Thresholds
analyzegenes.sh in=genome.fa gff=genome.gff out=strict.pgm min16sid=0.8 min23sid=0.75
Uses stricter identity thresholds for 16S and 23S rRNA gene calling.
Algorithm Details
AnalyzeGenes creates prokaryotic gene models by analyzing paired fasta and GFF files to identify patterns in coding and non-coding sequences. The algorithm processes multiple file types and generates statistical models for gene calling.
Multi-threaded Processing
The tool uses FileThread workers with AtomicInteger-based work distribution. Thread count calculation: min(fnaList.size(), Shared.threads(), max(32, logical_processors/2)). Each FileThread processes file pairs via atomic getAndIncrement() to ensure thread-safe work allocation without contention.
Gene Type Analysis
The algorithm analyzes multiple gene types:
- CDS: Protein-coding sequences with start/stop codon analysis
- tRNA: Transfer RNA genes with specialized k-mer patterns
- 16S/18S rRNA: Small subunit ribosomal RNA (SSU)
- 23S rRNA: Large subunit ribosomal RNA (LSU)
- 5S rRNA: 5S ribosomal RNA with distinct patterns
K-mer Based Pattern Recognition
The tool employs different k-mer strategies for different gene regions:
- Inner regions: Variable k-mer sizes for CDS (customizable) and RNA (default k=6)
- Start sites: Short k-mers (default k=3 for RNA) to capture initiation patterns
- Stop sites: Short k-mers (default k=3 for RNA) for termination signals
- Long k-mers: 15-mer default for enhanced specificity in rRNA and tRNA identification
Consensus Sequence Alignment
When alignribo=true, the tool calls ProkObject.loadConsensusSequenceFromFile() to load reference sequences (r16SSequence, r23SSequence, r18SSequence, r5SSequence arrays). Identity thresholds (min16SIdentity, min23SIdentity, etc.) filter matches during alignment-based validation via processType() method calls.
Statistical Model Generation
The algorithm builds GeneModel objects containing StatsContainer instances for each gene type (statsCDS, statstRNA, stats16S, stats23S, stats5S, stats18S). Each container tracks lengthCount statistics, k-mer frequency patterns, and position-specific boundary characteristics. Models support addition/merging via GeneModel.add() method and normalization via multiplyBy() scaling.
Performance Characteristics
Memory usage scales with input file count and gene density. Statistics reporting via readsBasesGenesProcessed() calculates throughput metrics: files/sec (fpnano*1000000000), sequences/sec (rpnano*1000000), genes/sec (gpnano*1000000), bases/sec (bpnano*1000). ByteFile.FORCE_MODE_BF2 enabled for multi-threaded I/O when threads>2.
Output Format
The resulting .pgm (prokaryotic gene model) file contains binary-encoded statistical patterns suitable for use with CallGenes and other BBTools prokaryotic gene prediction tools.
Support
For questions and support:
- Email: bbushnell@lbl.gov
- Documentation: bbmap.org