cutRna.sh

Script: cutRna.sh Author: Brian Bushnell Last Updated: August 8, 2019

Extracts ribosomal RNA (16S, 23S, 5S) and tRNA sequences from annotated genomes with paired FASTA (.fna.gz) and GFF (.gff.gz) files. Creates specialized kmer sets for sequence classification where every sequence contains at least one kmer from the set. Used by CallGenes pipeline for RNA identification.

Purpose

This pipeline extracts specific RNA types from annotated genomes to generate kmer sets for sequence classification. The resulting kmer sets enable detection of sequences that lack matches - indicating they are likely not of the target RNA type. While CallGenes uses this pipeline for general RNA detection, Silva database is preferred for 16S and 23S ribosomal RNA identification.

Prerequisites

Annotated genome files in paired format:
- FASTA files (.fna.gz) containing genome sequences
- GFF files (.gff.gz) containing gene annotations
Files organized in subdirectories with matching names
GFF annotations must include rRNA and tRNA feature types
Sufficient disk space for intermediate and output files

Pipeline Stages

Stage 1: RNA Sequence Extraction

Uses cutgff.sh to extract different RNA types from annotated genomes:

# Extract 23S ribosomal RNA
cutgff.sh fastawrap=10k types=rRNA out=23S.fa attributes=23S */*.fna.gz

# Extract 16S ribosomal RNA  
cutgff.sh fastawrap=10k types=rRNA out=16S.fa attributes=16S */*.fna.gz

# Extract 5S ribosomal RNA
cutgff.sh fastawrap=10k types=rRNA out=5S.fa attributes=5S */*.fna.gz

# Extract tRNA sequences
cutgff.sh fastawrap=10k types=tRNA out=tRNA.fa */*.fna.gz

Parameters:

fastawrap=10k - Wraps FASTA sequences at 10,000 characters per line
types=rRNA/tRNA - Specifies feature type to extract from GFF
attributes= - Filters by specific RNA subtype (16S, 23S, 5S)
*/*.fna.gz - Processes all .fna.gz files in subdirectories

Stage 2: Kmer Set Generation

Creates unique kmer sets from extracted RNA sequences using different kmer sizes optimized for each RNA type:

# Generate 15-mers for large ribosomal RNAs
kmerfilterset.sh in=23S.fa k=15 out=23S_15mers.fa ow minkpp=1 maxkpp=1 rcomp=f
kmerfilterset.sh in=16S.fa k=15 out=16S_15mers.fa ow minkpp=1 maxkpp=1 rcomp=f

# Generate 9-mers for small RNA types
kmerfilterset.sh in=5S.fa k=9 out=5S_9mers.fa ow minkpp=1 maxkpp=1 rcomp=f
kmerfilterset.sh in=tRNA.fa k=9 out=tRNA_9mers.fa ow minkpp=1 maxkpp=1 rcomp=f

Parameters:

k=15/9 - Kmer size: 15 for large rRNAs, 9 for small RNAs
minkpp=1 maxkpp=1 - Exactly one kmer per position
rcomp=f - Disables reverse complement consideration
ow - Overwrites existing output files

Output Files

The pipeline generates the following output files:

Extracted RNA Sequences

23S.fa - Large ribosomal subunit RNA sequences
16S.fa - Small ribosomal subunit RNA sequences
5S.fa - 5S ribosomal RNA sequences
tRNA.fa - Transfer RNA sequences

Kmer Sets

23S_15mers.fa - 15-mer set for 23S rRNA classification
16S_15mers.fa - 15-mer set for 16S rRNA classification
5S_9mers.fa - 9-mer set for 5S rRNA classification
tRNA_9mers.fa - 9-mer set for tRNA classification

Usage Example

# Prepare directory structure with genome files
mkdir -p genomes/genome1 genomes/genome2
# Place paired .fna.gz and .gff.gz files in each subdirectory

# Run the pipeline
./cutRna.sh

# Use generated kmer sets for classification
bbduk.sh in=query.fa ref=16S_15mers.fa k=15 stats=stats.txt

Implementation Notes

Kmer Size Selection: 15-mers provide specificity for large rRNAs (16S, 23S), while 9-mers are sufficient for shorter sequences (5S, tRNA)
Classification Logic: Sequences sharing kmers with the set are likely of that RNA type; sequences with no kmer matches are probably not
Silva Integration: While this pipeline creates rRNA kmer sets, Silva database is recommended for 16S and 23S classification in production
File Organization: Input files must follow the pattern */*.fna.gz with corresponding .gff.gz files
Memory Requirements: Moderate memory usage depending on genome collection size

Related Tools

cutgff.sh - Extracts sequences based on GFF annotations
kmerfilterset.sh - Generates filtered kmer sets from sequences
CallGenes - Uses generated kmer sets for gene calling
bbduk.sh - Can use kmer sets for sequence filtering/classification