cutRna.sh
Extracts ribosomal RNA (16S, 23S, 5S) and tRNA sequences from annotated genomes with paired FASTA (.fna.gz) and GFF (.gff.gz) files. Creates specialized kmer sets for sequence classification where every sequence contains at least one kmer from the set. Used by CallGenes pipeline for RNA identification.
Purpose
This pipeline extracts specific RNA types from annotated genomes to generate kmer sets for sequence classification. The resulting kmer sets enable detection of sequences that lack matches - indicating they are likely not of the target RNA type. While CallGenes uses this pipeline for general RNA detection, Silva database is preferred for 16S and 23S ribosomal RNA identification.
Prerequisites
- Annotated genome files in paired format:
- FASTA files (.fna.gz) containing genome sequences
- GFF files (.gff.gz) containing gene annotations
- Files organized in subdirectories with matching names
- GFF annotations must include rRNA and tRNA feature types
- Sufficient disk space for intermediate and output files
Pipeline Stages
Stage 1: RNA Sequence Extraction
Uses cutgff.sh to extract different RNA types from annotated genomes:
# Extract 23S ribosomal RNA
cutgff.sh fastawrap=10k types=rRNA out=23S.fa attributes=23S */*.fna.gz
# Extract 16S ribosomal RNA
cutgff.sh fastawrap=10k types=rRNA out=16S.fa attributes=16S */*.fna.gz
# Extract 5S ribosomal RNA
cutgff.sh fastawrap=10k types=rRNA out=5S.fa attributes=5S */*.fna.gz
# Extract tRNA sequences
cutgff.sh fastawrap=10k types=tRNA out=tRNA.fa */*.fna.gz
Parameters:
fastawrap=10k
- Wraps FASTA sequences at 10,000 characters per linetypes=rRNA/tRNA
- Specifies feature type to extract from GFFattributes=
- Filters by specific RNA subtype (16S, 23S, 5S)*/*.fna.gz
- Processes all .fna.gz files in subdirectories
Stage 2: Kmer Set Generation
Creates unique kmer sets from extracted RNA sequences using different kmer sizes optimized for each RNA type:
# Generate 15-mers for large ribosomal RNAs
kmerfilterset.sh in=23S.fa k=15 out=23S_15mers.fa ow minkpp=1 maxkpp=1 rcomp=f
kmerfilterset.sh in=16S.fa k=15 out=16S_15mers.fa ow minkpp=1 maxkpp=1 rcomp=f
# Generate 9-mers for small RNA types
kmerfilterset.sh in=5S.fa k=9 out=5S_9mers.fa ow minkpp=1 maxkpp=1 rcomp=f
kmerfilterset.sh in=tRNA.fa k=9 out=tRNA_9mers.fa ow minkpp=1 maxkpp=1 rcomp=f
Parameters:
k=15/9
- Kmer size: 15 for large rRNAs, 9 for small RNAsminkpp=1 maxkpp=1
- Exactly one kmer per positionrcomp=f
- Disables reverse complement considerationow
- Overwrites existing output files
Output Files
The pipeline generates the following output files:
Extracted RNA Sequences
23S.fa
- Large ribosomal subunit RNA sequences16S.fa
- Small ribosomal subunit RNA sequences5S.fa
- 5S ribosomal RNA sequencestRNA.fa
- Transfer RNA sequences
Kmer Sets
23S_15mers.fa
- 15-mer set for 23S rRNA classification16S_15mers.fa
- 15-mer set for 16S rRNA classification5S_9mers.fa
- 9-mer set for 5S rRNA classificationtRNA_9mers.fa
- 9-mer set for tRNA classification
Usage Example
# Prepare directory structure with genome files
mkdir -p genomes/genome1 genomes/genome2
# Place paired .fna.gz and .gff.gz files in each subdirectory
# Run the pipeline
./cutRna.sh
# Use generated kmer sets for classification
bbduk.sh in=query.fa ref=16S_15mers.fa k=15 stats=stats.txt
Implementation Notes
- Kmer Size Selection: 15-mers provide specificity for large rRNAs (16S, 23S), while 9-mers are sufficient for shorter sequences (5S, tRNA)
- Classification Logic: Sequences sharing kmers with the set are likely of that RNA type; sequences with no kmer matches are probably not
- Silva Integration: While this pipeline creates rRNA kmer sets, Silva database is recommended for 16S and 23S classification in production
- File Organization: Input files must follow the pattern */*.fna.gz with corresponding .gff.gz files
- Memory Requirements: Moderate memory usage depending on genome collection size
Related Tools
cutgff.sh
- Extracts sequences based on GFF annotationskmerfilterset.sh
- Generates filtered kmer sets from sequencesCallGenes
- Uses generated kmer sets for gene callingbbduk.sh
- Can use kmer sets for sequence filtering/classification