CallGenes

Script: callgenes.sh
Package: prok
Class: CallGenes.java

Finds ORFs and calls genes in unspliced genomes, including bacteria, archaea, viruses, and mitochondria. Can also predict 16S, 18S, 23S, and 5S rRNA genes and tRNAs.

Basic Usage

callgenes.sh in=contigs.fa out=calls.gff outa=aminos.faa out16S=16S.fa

The only required parameter is in. All other parameters are optional and have default values.

Parameters

The parameters below are grouped by function.

File parameters

in=<file>
Input FASTA file; the only required parameter. Also accepts infna, fnain, fna, or ref.
out=<file>
Output GFF file for gene annotations.
outa=<file>
Amino acid output file (FASTA format). Also accepts outamino, aminoout, outaa, aaout, or amino.
out16s=<file>
16S rRNA output file (FASTA format). Also accepts 16sout.
out18s=<file>
18S rRNA output file (FASTA format). Also accepts 18sout.
model=<file>
A PGM model file, or a comma-delimited list of PGM files. If unspecified, the built-in default model is used; auto or default selects it explicitly. Also accepts pgm or gm.
stats=stderr
Stats output destination (may be stderr, stdout, a file, or null). Also accepts outstats.
hist=null
Gene length histogram output file. Also accepts outhist, lengthhist, lhist, or genehist.
compareto=
Optional reference GFF file to compare with the gene calls. Set compareto=auto to derive the reference file name from the input file name.
ingff=
Input GFF file for additional gene annotations. Also accepts gffin.

Formatting parameters

json=false
Print stats in JSON format. Also accepts json_out.
binlen=21
Histogram bin length. Also accepts binlength or histdiv.
bins=1000
Maximum number of histogram bins.
pz=f
(printzero) Print histogram lines with zero count.
ordered=false
Maintain output order matching input order (may reduce performance).
trd=f
(trimreaddescription) Set to true to trim read headers after the first whitespace. Necessary for IGV compatibility.
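
To make the interaction of binlen, bins, and pz concrete, here is a minimal Python sketch of a gene length histogram. It is not taken from the CallGenes source: the function name length_histogram is hypothetical, and it assumes that a length falls into bin length // binlen and that lengths beyond the last bin are clamped into it.

```python
from collections import Counter


def length_histogram(lengths, binlen=21, bins=1000, print_zero=False):
    """Return (bin_start, count) rows for a list of gene lengths.

    Assumptions (not confirmed by the CallGenes docs): integer-division
    binning, and overflow lengths clamped into the final bin.
    """
    counts = Counter(min(n // binlen, bins - 1) for n in lengths)
    last = max(counts) if counts else -1
    rows = []
    for b in range(last + 1):
        # With print_zero=False, empty bins are skipped (like pz=f).
        if counts[b] or print_zero:
            rows.append((b * binlen, counts[b]))
    return rows


for row in length_histogram([10, 20, 21, 90], print_zero=True):
    print(*row, sep="\t")
```

With print_zero=False the same call returns only the three non-empty bins.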

Gene calling parameters

minlen=60
Don't call genes shorter than this length. Also accepts minlength. The shell script documents a default of 60, but the Java class uses 80; set minlen explicitly if the exact cutoff matters.
maxoverlapss=80
Maximum overlap allowed between genes on the same strand. Also accepts overlapss, overlapsamestrand, moss, or maxOverlapSameStrand.
maxoverlapos=110
Maximum overlap allowed between genes on opposite strands. Also accepts overlapos, overlapoppositestrand, moos, or maxOverlapOppositeStrand.
minStartScore=-0.10
Minimum score required for start codons.
minStopScore=-0.5
Minimum score required for stop codons.
minInnerScore=0.02
Minimum score for inner kmers. Also accepts minKmerScore.
minOrfScore=50
Minimum overall ORF score. Higher values increase specificity but reduce sensitivity. Also accepts minScore.
minAvgScore=0.08
Minimum average kmer score across the ORF.
breakLimit
Maximum number of consecutive low-scoring kmers allowed before breaking an ORF.
clearcutoffs=false
Clear all score cutoffs to very permissive values. Also accepts clearfilters.
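
The two overlap limits can be illustrated with a short Python sketch. The Gene class and both function names are hypothetical, coordinates are assumed to be 1-based and inclusive as in GFF, and the limit is assumed to be inclusive; the actual check inside CallGenes may differ.

```python
from dataclasses import dataclass


@dataclass
class Gene:
    start: int   # 1-based, inclusive (assumed GFF-style coordinates)
    stop: int    # 1-based, inclusive
    strand: str  # "+" or "-"


def overlap_length(a, b):
    """Number of positions shared by two genes (0 if they do not touch)."""
    return max(0, min(a.stop, b.stop) - max(a.start, b.start) + 1)


def overlap_allowed(a, b, max_overlap_ss=80, max_overlap_os=110):
    """Apply the same-strand or opposite-strand limit, whichever fits."""
    limit = max_overlap_ss if a.strand == b.strand else max_overlap_os
    return overlap_length(a, b) <= limit


first = Gene(1, 300, "+")
print(overlap_allowed(first, Gene(201, 500, "+")))  # same strand, 100 shared positions
print(overlap_allowed(first, Gene(201, 500, "-")))  # opposite strand, 100 shared positions
```

Under the default limits, the 100-position overlap is rejected on the same strand and accepted on opposite strands.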

Processing parameters

merge=f
For paired reads, merge overlapping reads before calling genes.
ecco=f
Error-correct overlapping paired reads without merging.
passes=1
Number of passes for iterative model refinement. Also accepts 2pass or twopass for 2 passes.
translate=true
Set amino acid output mode to translation (default).
detranslate=f
Output canonical nucleotide sequences instead of amino acids. Also accepts retranslate.
recode=f
Re-encode nucleotide sequences over called genes using canonical codons, leaving non-coding regions unchanged.
plus=true
Process the plus strand.
minus=true
Process the minus strand.

rRNA detection parameters

setbias16s
Set detection bias for 16S rRNA genes.
setbias18s
Set detection bias for 18S rRNA genes.
setbias23s
Set detection bias for 23S rRNA genes.
setbias5s
Set detection bias for 5S rRNA genes.
setbiastRNA
Set detection bias for tRNA genes.
setbiasCDS
Set detection bias for protein-coding sequences.
min16SIdentity
Minimum identity threshold for 16S rRNA detection. Also accepts min16SId.
min18SIdentity
Minimum identity threshold for 18S rRNA detection. Also accepts min18SId.
min23SIdentity
Minimum identity threshold for 23S rRNA detection. Also accepts min23SId.
min5SIdentity
Minimum identity threshold for 5S rRNA detection. Also accepts min5SId.

rRNA alignment parameters

align16s=
Enable 16S sequence alignment. Also accepts load16SSequence.
align18s=
Enable 18S sequence alignment. Also accepts load18SSequence.
align23s=
Enable 23S sequence alignment. Also accepts load23SSequence.
align5s=
Enable 5S sequence alignment. Also accepts load5SSequence.
16sstartslop
Start position tolerance for 16S detection. Also accepts ssustartslop.
16sstopslop
Stop position tolerance for 16S detection. Also accepts ssustopslop.
23sstartslop
Start position tolerance for 23S detection. Also accepts lsustartslop.
23sstopslop
Stop position tolerance for 23S detection. Also accepts lsustopslop.
5sstartslop
Start position tolerance for 5S detection.
5sstopslop
Stop position tolerance for 5S detection.

Kmer parameters

load16skmers
Load kmers for 16S detection. Also accepts load18skmers or loadssukmers.
load23skmers
Load kmers for 23S detection. Also accepts load28skmers or loadlsukmers.
load5skmers
Load kmers for 5S detection.
loadtrnakmers
Load kmers for tRNA detection.
longkmers
Enable all long kmer loading for rRNA and tRNA detection.
klong16s
Kmer length for 16S detection. Also accepts klong18s or klongssu.
klong23s
Kmer length for 23S detection. Also accepts klong28s or klonglsu.
klong5s
Kmer length for 5S detection.
klongtrna
Kmer length for tRNA detection.
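
As a rough picture of what kmer-based detection involves, the following Python sketch measures how many of a query's kmers appear in a reference kmer set. This is generic kmer matching, not the CallGenes implementation; the function names and the example sequences are made up for illustration.

```python
def kmers(seq, k):
    """Return the set of all length-k substrings of seq."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def kmer_hit_fraction(query, reference_kmers, k):
    """Fraction of the query's distinct kmers found in the reference set."""
    query_kmers = kmers(query, k)
    if not query_kmers:
        return 0.0  # query shorter than k
    return len(query_kmers & reference_kmers) / len(query_kmers)


# Toy reference; a real run would load kmers from known rRNA or tRNA sequences.
reference = kmers("ACGTACGTAC", 4)
print(kmer_hit_fraction("ACGTAC", reference, 4))
print(kmer_hit_fraction("TTTTTT", reference, 4))
```

Longer kmers make chance matches less likely, which is one reason a tool might expose a separate kmer length per RNA type.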

Advanced scoring parameters

e1
ORF scoring parameter e1 (frame-specific penalty).
e2
ORF scoring parameter e2 (frame-specific penalty).
e3
ORF scoring parameter e3 (frame-specific penalty).
f1
ORF scoring parameter f1 (frame-specific bonus).
f2
ORF scoring parameter f2 (frame-specific bonus).
f3
ORF scoring parameter f3 (frame-specific bonus).
p0
Gene calling parameter p0 (probability adjustment).
p1
Gene calling parameter p1 (probability adjustment).
p2
Gene calling parameter p2 (probability adjustment).
p3
Gene calling parameter p3 (probability adjustment).
p4
Gene calling parameter p4 (probability adjustment).
p5
Gene calling parameter p5 (probability adjustment).
p6
Gene calling parameter p6 (probability adjustment).
q1
Gene calling parameter q1 (quality adjustment).
q2
Gene calling parameter q2 (quality adjustment).
q3
Gene calling parameter q3 (quality adjustment).
q4
Gene calling parameter q4 (quality adjustment).
q5
Gene calling parameter q5 (quality adjustment).
lookback
Lookback distance for both directions during gene calling.
lookbackplus
Lookback distance in the plus direction.
lookbackminus
Lookback distance in the minus direction.

Statistics parameters

verbose=false
Print detailed progress information.
extended=false
Print extended statistics. Also accepts extendedstats or verbosestats.

Examples

Basic Gene Calling

callgenes.sh in=genome.fasta out=genes.gff

Calls genes in a bacterial genome and outputs GFF annotations.

Full Analysis

callgenes.sh in=contigs.fa out=calls.gff outa=proteins.faa out16S=16S.fa stats=stats.txt

Calls genes, extracts proteins, identifies 16S rRNA, and saves statistics.

High Sensitivity Settings

callgenes.sh in=metagenome.fa out=genes.gff minlen=30 minOrfScore=30 clearcutoffs=true

Uses more permissive settings for calling genes in metagenomic data.

Multi-pass Analysis

callgenes.sh in=genome.fa out=genes.gff passes=2 extended=true

Uses iterative refinement with 2 passes and detailed statistics output.

rRNA and tRNA Detection

callgenes.sh in=genome.fa out=all.gff out16S=16S.fa out18S=18S.fa longkmers=true

Detects protein-coding and RNA genes, with long kmer loading enabled for rRNA and tRNA detection.

Algorithm Details

Gene Calling Strategy

CallGenes implements a probabilistic gene-calling algorithm using the GeneCaller.callGenes() method, with GeneModel providing k-mer frequency scoring. The algorithm combines multiple evidence sources through six-frame analysis, covering three reading frames on each strand.
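
For readers unfamiliar with six-frame analysis, here is a minimal Python sketch that enumerates candidate ORFs in all six frames. It is a textbook illustration, not the GeneCaller code: it recognizes only ATG as a start codon and applies no scoring, and all names are hypothetical.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")
STOPS = {"TAA", "TAG", "TGA"}


def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]


def orfs_in_frame(seq, frame):
    """Yield 0-based half-open (start, end) spans from ATG through a stop."""
    start = None
    for i in range(frame, len(seq) - 2, 3):
        codon = seq[i:i + 3]
        if start is None and codon == "ATG":
            start = i
        elif start is not None and codon in STOPS:
            yield start, i + 3
            start = None


def six_frame_orfs(seq):
    """Return (strand, start, end) spans in forward-strand coordinates."""
    seq = seq.upper()
    rc = reverse_complement(seq)
    n = len(seq)
    found = []
    for frame in range(3):
        for s, e in orfs_in_frame(seq, frame):
            found.append(("+", s, e))
        for s, e in orfs_in_frame(rc, frame):
            # Map reverse-complement coordinates back to the forward strand.
            found.append(("-", n - e, n - s))
    return found


print(six_frame_orfs("CCATGAAATAGCC"))
```

A real gene caller would then score each candidate, for example with k-mer frequencies, rather than accept every span.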

Dual Strategy for Sequence Processing

The algorithm employs different strategies based on sequence type:

RNA Gene Detection

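CallGenes includes specialized modules for non-coding RNA detection, covering 16S, 18S, 23S, and 5S rRNA genes and tRNAs. Detection is controlled by the rRNA detection, rRNA alignment, and kmer parameters described above.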

Multi-pass Refinement

The multi-pass feature enables iterative model improvement:

  1. Initial gene calling with default or provided model
  2. Model training on initial predictions
  3. Refined gene calling with improved model
  4. Optional additional passes for further refinement
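
The steps above can be sketched as a simple loop. This Python outline is an assumption about the control flow only; multi_pass, call_genes, and train_model are hypothetical names, and the demo callables are stand-ins that do no real gene calling.

```python
def multi_pass(sequence, initial_model, call_genes, train_model, passes=2):
    """Run `passes` rounds of calling, retraining the model between rounds."""
    model = initial_model
    genes = call_genes(sequence, model)        # step 1: initial calls
    for _ in range(passes - 1):
        model = train_model(sequence, genes)   # step 2: train on predictions
        genes = call_genes(sequence, model)    # steps 3 and 4: refined calls
    return genes, model


# Stand-in callables so the sketch runs; they only describe what happened.
def demo_call(sequence, model):
    return [f"calls made with {model}"]


def demo_train(sequence, genes):
    return "retrained model"


genes, model = multi_pass("ACGT", "default model", demo_call, demo_train, passes=2)
print(genes)
```

With passes=1 the loop body never runs, so the result is a single round of calling with the starting model.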

Performance Characteristics

Output Formats

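CallGenes produces multiple output formats: GFF gene annotations (out), amino acid FASTA (outa), 16S and 18S rRNA FASTA (out16s, out18s), statistics as text or JSON (stats, json), and a gene length histogram (hist).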
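
GFF output is tab-delimited with the feature type in the third column, so it is easy to summarize downstream. The Python sketch below counts features by type; the function name and the sample lines (including the source and type values) are invented for illustration and may not match what CallGenes actually writes.

```python
from collections import Counter


def count_feature_types(gff_lines):
    """Count GFF records by feature type (column 3), skipping comments."""
    counts = Counter()
    for line in gff_lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) < 9:
            continue  # not a complete 9-column GFF record
        counts[fields[2]] += 1
    return counts


# Made-up records; replace with open("calls.gff") for a real file.
sample = [
    "##gff-version 3",
    "contig1\texample\tCDS\t10\t400\t.\t+\t0\tID=1",
    "contig1\texample\tCDS\t450\t900\t.\t-\t0\tID=2",
    "contig1\texample\trRNA\t1000\t2500\t.\t+\t.\tID=3",
]
for feature_type, n in sorted(count_feature_types(sample).items()):
    print(feature_type, n, sep="\t")
```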
