CallGenes
Finds ORFs and calls genes in unspliced prokaryotes. This includes bacteria, archaea, viruses, and mitochondria. It can also predict 16S, 18S, 23S, and 5S rRNA genes and tRNAs.
Basic Usage
callgenes.sh in=contigs.fa out=calls.gff outa=aminos.faa out16S=16S.fa
The only required parameter is in. All other parameters are optional and have default values.
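For scripted runs, the key=value convention shown above is easy to assemble programmatically. The following Python sketch is only an illustration: the helper name build_callgenes_command is hypothetical, and it assumes callgenes.sh is on the PATH when the command is eventually run.

```python
def build_callgenes_command(infile, **params):
    """Build an argument list for callgenes.sh from key=value parameters.

    Hypothetical helper for illustration; it only formats arguments and
    does not check that the parameter names are valid.
    """
    args = ["callgenes.sh", f"in={infile}"]
    for key, value in params.items():
        if isinstance(value, bool):
            value = "true" if value else "false"  # booleans written as true/false
        args.append(f"{key}={value}")
    return args
```

The resulting list can be handed to subprocess.run(cmd, check=True) from the Python standard library.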
Parameters
Parameters are grouped by function.
File parameters
- in=<file>
- A fasta file; the only required parameter. Can also use infna, fnain, fna, or ref.
- out=<file>
- Output GFF file for gene annotations.
- outa=<file>
- Amino acid output file (FASTA format). Also accepts outamino, aminoout, outaa, aaout, or amino.
- out16s=<file>
- 16S rRNA output file (FASTA format). Also accepts 16sout.
- out18s=<file>
- 18S rRNA output file (FASTA format). Also accepts 18sout.
- model=<file>
- A PGM file or comma-delimited list of PGM files. If unspecified, or if set to auto or default, the built-in model is used. Also accepts pgm or gm.
- stats=stderr
- Stats output destination (may be stderr, stdout, a file, or null). Also accepts outstats.
- hist=null
- Gene length histogram output file. Also accepts outhist, lengthhist, lhist, or genehist.
- compareto=
- Optional reference GFF file to compare with the gene calls. 'auto' will name it based on the input file name.
- ingff=
- Input GFF file for additional gene annotations. Also accepts gffin.
Formatting parameters
- json=false
- Print stats in JSON format. Also accepts json_out.
- binlen=21
- Histogram bin length. Also accepts binlength or histdiv.
- bins=1000
- Maximum number of histogram bins.
- pz=f
- (printzero) Print histogram lines with zero count.
- ordered=false
- Maintain output order matching input order (may reduce performance).
- trd=f
- (trimreaddescription) Set to true to trim read headers after the first whitespace. Necessary for IGV compatibility.
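To show how binlen, bins, and pz interact, here is a minimal Python sketch of fixed-width binning of gene lengths. It is not CallGenes code: the function name gene_length_histogram and the choice to fold over-long genes into the last bin are assumptions made for the example.

```python
def gene_length_histogram(lengths, binlen=21, bins=1000, printzero=False):
    """Bin gene lengths into fixed-width bins (hypothetical helper)."""
    counts = [0] * bins
    for length in lengths:
        # Assumption: lengths past the last bin are folded into it.
        index = min(length // binlen, bins - 1)
        counts[index] += 1
    rows = []
    for index, count in enumerate(counts):
        if count > 0 or printzero:
            rows.append((index * binlen, count))  # (bin start, gene count)
    return rows
```

With printzero left at False, empty bins are omitted from the returned rows, which mirrors the documented effect of pz=f.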
Gene calling parameters
- minlen=60
- Don't call genes shorter than this length. Also accepts minlength. The shell script lists a default of 60, but the Java code uses 80.
- maxoverlapss=80
- Maximum overlap allowed between genes on the same strand. Also accepts overlapss, overlapsamestrand, moss, or maxOverlapSameStrand.
- maxoverlapos=110
- Maximum overlap allowed between genes on opposite strands. Also accepts overlapos, overlapoppositestrand, moos, or maxOverlapOppositeStrand.
- minStartScore=-0.10
- Minimum score required for start codons.
- minStopScore=-0.5
- Minimum score required for stop codons.
- minInnerScore=0.02
- Minimum score for inner kmers. Also accepts minKmerScore.
- minOrfScore=50
- Minimum overall ORF score. Higher values increase specificity but reduce sensitivity. Also accepts minScore.
- minAvgScore=0.08
- Minimum average kmer score across the ORF.
- breakLimit
- Maximum number of consecutive low-scoring kmers allowed before breaking an ORF.
- clearcutoffs=false
- Clear all score cutoffs to very permissive values. Also accepts clearfilters.
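The two overlap limits can be pictured with a short Python sketch. It is illustrative only: the Orf record, the function names, and the rule that a pair is rejected when its overlap exceeds the limit are assumptions, not the CallGenes implementation.

```python
from collections import namedtuple

# Hypothetical ORF record: 0-based start, exclusive stop, strand '+' or '-'.
Orf = namedtuple("Orf", ["start", "stop", "strand"])

def overlap_length(a, b):
    """Number of bases shared by two intervals (0 if they are disjoint)."""
    return max(0, min(a.stop, b.stop) - max(a.start, b.start))

def compatible(a, b, maxoverlapss=80, maxoverlapos=110):
    """True if the pair's overlap is within the limit for its strand combination."""
    limit = maxoverlapss if a.strand == b.strand else maxoverlapos
    return overlap_length(a, b) <= limit
```

Under these assumptions, two genes sharing 100 bases would be rejected on the same strand but accepted on opposite strands with the default limits.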
Processing parameters
- merge=f
- For paired reads, merge overlapping reads before calling genes.
- ecco=f
- Error-correct overlapping paired reads without merging.
- passes=1
- Number of passes for iterative model refinement. Also accepts 2pass or twopass for 2 passes.
- translate=true
- Set amino acid output mode to translation (default).
- detranslate=f
- Output canonical nucleotide sequences instead of amino acids. Also accepts retranslate.
- recode=f
- Re-encode nucleotide sequences over called genes using canonical codons, leaving non-coding regions unchanged.
- plus=true
- Process the plus strand.
- minus=true
- Process the minus strand.
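For readers unfamiliar with the translate mode, the following Python sketch shows plain codon-by-codon translation with the standard genetic code. It is a simplified illustration, not the CallGenes translator: the names CODON_TABLE and translate are hypothetical, start codons get no special treatment, and the detranslate and recode modes are not covered.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code in TCAG order; '*' marks stop codons.
AMINOS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINOS)}

def translate(orf):
    """Translate a nucleotide ORF codon by codon; unknown codons become 'X'."""
    orf = orf.upper()
    return "".join(CODON_TABLE.get(orf[i:i + 3], "X")
                   for i in range(0, len(orf) - 2, 3))
```

Trailing bases that do not form a complete codon are ignored.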
rRNA detection parameters
- setbias16s
- Set detection bias for 16S rRNA genes.
- setbias18s
- Set detection bias for 18S rRNA genes.
- setbias23s
- Set detection bias for 23S rRNA genes.
- setbias5s
- Set detection bias for 5S rRNA genes.
- setbiastRNA
- Set detection bias for tRNA genes.
- setbiasCDS
- Set detection bias for protein-coding sequences.
- min16SIdentity
- Minimum identity threshold for 16S rRNA detection. Also accepts min16SId.
- min18SIdentity
- Minimum identity threshold for 18S rRNA detection. Also accepts min18SId.
- min23SIdentity
- Minimum identity threshold for 23S rRNA detection. Also accepts min23SId.
- min5SIdentity
- Minimum identity threshold for 5S rRNA detection. Also accepts min5SId.
rRNA alignment parameters
- align16s=
- Enable 16S sequence alignment. Also accepts load16SSequence.
- align18s=
- Enable 18S sequence alignment. Also accepts load18SSequence.
- align23s=
- Enable 23S sequence alignment. Also accepts load23SSequence.
- align5s=
- Enable 5S sequence alignment. Also accepts load5SSequence.
- 16sstartslop
- Start position tolerance for 16S detection. Also accepts ssustartslop.
- 16sstopslop
- Stop position tolerance for 16S detection. Also accepts ssustopslop.
- 23sstartslop
- Start position tolerance for 23S detection. Also accepts lsustartslop.
- 23sstopslop
- Stop position tolerance for 23S detection. Also accepts lsustopslop.
- 5sstartslop
- Start position tolerance for 5S detection.
- 5sstopslop
- Stop position tolerance for 5S detection.
Kmer parameters
- load16skmers
- Load kmers for 16S detection. Also accepts load18skmers or loadssukmers.
- load23skmers
- Load kmers for 23S detection. Also accepts load28skmers or loadlsukmers.
- load5skmers
- Load kmers for 5S detection.
- loadtrnakmers
- Load kmers for tRNA detection.
- longkmers
- Enable loading of all long kmers for rRNA and tRNA detection.
- klong16s
- Kmer length for 16S detection. Also accepts klong18s or klongssu.
- klong23s
- Kmer length for 23S detection. Also accepts klong28s or klonglsu.
- klong5s
- Kmer length for 5S detection.
- klongtrna
- Kmer length for tRNA detection.
Advanced scoring parameters
- e1
- ORF scoring parameter e1 (frame-specific penalty).
- e2
- ORF scoring parameter e2 (frame-specific penalty).
- e3
- ORF scoring parameter e3 (frame-specific penalty).
- f1
- ORF scoring parameter f1 (frame-specific bonus).
- f2
- ORF scoring parameter f2 (frame-specific bonus).
- f3
- ORF scoring parameter f3 (frame-specific bonus).
- p0
- Gene calling parameter p0 (probability adjustment).
- p1
- Gene calling parameter p1 (probability adjustment).
- p2
- Gene calling parameter p2 (probability adjustment).
- p3
- Gene calling parameter p3 (probability adjustment).
- p4
- Gene calling parameter p4 (probability adjustment).
- p5
- Gene calling parameter p5 (probability adjustment).
- p6
- Gene calling parameter p6 (probability adjustment).
- q1
- Gene calling parameter q1 (quality adjustment).
- q2
- Gene calling parameter q2 (quality adjustment).
- q3
- Gene calling parameter q3 (quality adjustment).
- q4
- Gene calling parameter q4 (quality adjustment).
- q5
- Gene calling parameter q5 (quality adjustment).
- lookback
- Lookback distance for both directions during gene calling.
- lookbackplus
- Lookback distance in the plus direction.
- lookbackminus
- Lookback distance in the minus direction.
Statistics parameters
- verbose=false
- Print detailed progress information.
- extended=false
- Print extended statistics. Also accepts extendedstats or verbosestats.
Examples
Basic Gene Calling
callgenes.sh in=genome.fasta out=genes.gff
Calls genes in a bacterial genome and outputs GFF annotations.
Full Analysis
callgenes.sh in=contigs.fa out=calls.gff outa=proteins.faa out16S=16S.fa stats=stats.txt
Calls genes, extracts proteins, identifies 16S rRNA, and saves statistics.
High Sensitivity Settings
callgenes.sh in=metagenome.fa out=genes.gff minlen=30 minOrfScore=30 clearcutoffs=true
Uses more permissive settings for calling genes in metagenomic data.
Multi-pass Analysis
callgenes.sh in=genome.fa out=genes.gff passes=2 extended=true
Uses iterative refinement with 2 passes and detailed statistics output.
rRNA and tRNA Detection
callgenes.sh in=genome.fa out=all.gff out16S=16S.fa out18S=18S.fa longkmers=true
Detects protein-coding genes and RNA genes, with long kmer loading enabled for rRNA and tRNA detection.
Algorithm Details
Gene Calling Strategy
CallGenes implements a probabilistic gene-calling algorithm using the GeneCaller.callGenes() method, with GeneModel providing k-mer frequency scoring. It analyzes all six reading frames and combines several sources of evidence:
- K-mer-based scoring: Uses probabilistic gene models (PGM files) with StatsContainer and FrameStats classes to score potential ORFs based on k-mer frequency patterns
- Start/stop codon analysis: Evaluates start and stop signals using minStartScore and minStopScore thresholds in context-dependent scoring
- Frame-specific penalties: Applies different scoring based on reading frame to account for codon usage bias
- Overlap resolution: Resolves overlapping ORFs using the configurable limits maxoverlapss (same strand) and maxoverlapos (opposite strands)
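The six-frame idea can be sketched in a few lines of Python. This is a naive ORF scan, not the CallGenes algorithm: it has no k-mer model or scoring, the start codon set (ATG, GTG, TTG) is an assumption typical of prokaryotic gene finders, and the function names are hypothetical. Only the length filter corresponds to a documented parameter (minlen).

```python
STARTS = {"ATG", "GTG", "TTG"}   # assumed start codons
STOPS = {"TAA", "TAG", "TGA"}

def reverse_complement(seq):
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]

def find_orfs(seq, minlen=60):
    """Return (strand, frame, start, stop) for ORFs of at least minlen bases.

    Coordinates are 0-based, stop-exclusive, relative to the scanned strand.
    """
    orfs = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon in STARTS:
                    start = i                      # first start codon opens the ORF
                elif start is not None and codon in STOPS:
                    if i + 3 - start >= minlen:
                        orfs.append((strand, frame, start, i + 3))
                    start = None                   # reset after each stop
    return orfs
```

A real gene caller would score each candidate and choose among alternative starts rather than accept the first start codon in frame.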
Dual Strategy for Sequence Processing
The algorithm employs different strategies based on sequence type:
- Single sequences: Direct ORF calling with strand-specific processing
- Paired reads: Optional merging of overlapping pairs with BBMerge (merge) or error correction by overlap without merging (ecco) before gene calling
RNA Gene Detection
CallGenes includes specialized modules for non-coding RNA detection:
- rRNA detection: Supports 16S, 18S, 23S, and 5S rRNA identification using kmer matching and sequence alignment
- tRNA detection: Identifies transfer RNA genes with adjustable sensitivity parameters
- Configurable thresholds: Identity and bias parameters allow fine-tuning for different organisms
Multi-pass Refinement
The multi-pass feature enables iterative model improvement:
- Initial gene calling with default or provided model
- Model training on initial predictions
- Refined gene calling with improved model
- Optional additional passes for further refinement
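The steps above amount to a simple loop that alternates calling and training. The Python sketch below shows only that control flow under stated assumptions: call_genes and train_model are hypothetical callables supplied by the caller, and nothing here reflects how CallGenes represents models or genes internally.

```python
def multipass_call(sequences, model, call_genes, train_model, passes=2):
    """Alternate gene calling and model training (hypothetical callables).

    call_genes(sequences, model) -> genes
    train_model(sequences, genes) -> model
    """
    genes = call_genes(sequences, model)        # initial calls with the starting model
    for _ in range(passes - 1):
        model = train_model(sequences, genes)   # retrain on the current predictions
        genes = call_genes(sequences, model)    # call again with the refined model
    return genes, model
```

With passes=1 the loop body never runs, so the function reduces to a single call with the starting model.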
Performance Characteristics
- Memory usage: Scales with genome size and model complexity; the default heap is typically 6 GB
- Threading: Automatic parallelization across available CPU cores
- I/O optimization: Supports compressed input files and ordered/unordered output modes
- Scalability: Handles single genomes to large metagenomic assemblies
Output Formats
CallGenes produces multiple output formats:
- GFF3: Standard genome annotation format with detailed gene features
- FASTA: Amino acid sequences for protein-coding genes
- Statistics: Performance metrics in text or JSON format
- Histograms: Gene length distributions for quality assessment
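Because the annotations are written as GFF3, they can be summarized with a few lines of Python. The sketch below relies only on the general GFF3 layout (nine tab-separated columns, feature type in the third); the function name count_feature_types is hypothetical, and the feature type names that CallGenes actually writes are not assumed.

```python
from collections import Counter

def count_feature_types(gff_lines):
    """Count GFF3 features by the type column (third tab-separated field)."""
    counts = Counter()
    for line in gff_lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue                      # skip blank lines, directives, and comments
        fields = line.split("\t")
        if len(fields) >= 9:              # GFF3 feature lines have nine columns
            counts[fields[2]] += 1
    return counts
```

For example, passing an open file handle for calls.gff returns a Counter keyed by feature type.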
Support
For questions and support:
- Email: bbushnell@lbl.gov
- Documentation: bbmap.org