cutRna.sh

Script: cutRna.sh Author: Brian Bushnell Last Updated: August 8, 2019

Extracts ribosomal RNA (16S, 23S, 5S) and tRNA sequences from annotated genomes with paired FASTA (.fna.gz) and GFF (.gff.gz) files. Creates specialized kmer sets for sequence classification where every sequence contains at least one kmer from the set. Used by CallGenes pipeline for RNA identification.

Purpose

This pipeline extracts specific RNA types from annotated genomes to generate kmer sets for sequence classification. The resulting kmer sets enable detection of sequences that lack matches - indicating they are likely not of the target RNA type. While CallGenes uses this pipeline for general RNA detection, Silva database is preferred for 16S and 23S ribosomal RNA identification.

Prerequisites

Pipeline Stages

Stage 1: RNA Sequence Extraction

Uses cutgff.sh to extract different RNA types from annotated genomes:

# Extract 23S ribosomal RNA
cutgff.sh fastawrap=10k types=rRNA out=23S.fa attributes=23S */*.fna.gz

# Extract 16S ribosomal RNA  
cutgff.sh fastawrap=10k types=rRNA out=16S.fa attributes=16S */*.fna.gz

# Extract 5S ribosomal RNA
cutgff.sh fastawrap=10k types=rRNA out=5S.fa attributes=5S */*.fna.gz

# Extract tRNA sequences
cutgff.sh fastawrap=10k types=tRNA out=tRNA.fa */*.fna.gz

Parameters:

Stage 2: Kmer Set Generation

Creates unique kmer sets from extracted RNA sequences using different kmer sizes optimized for each RNA type:

# Generate 15-mers for large ribosomal RNAs
kmerfilterset.sh in=23S.fa k=15 out=23S_15mers.fa ow minkpp=1 maxkpp=1 rcomp=f
kmerfilterset.sh in=16S.fa k=15 out=16S_15mers.fa ow minkpp=1 maxkpp=1 rcomp=f

# Generate 9-mers for small RNA types
kmerfilterset.sh in=5S.fa k=9 out=5S_9mers.fa ow minkpp=1 maxkpp=1 rcomp=f
kmerfilterset.sh in=tRNA.fa k=9 out=tRNA_9mers.fa ow minkpp=1 maxkpp=1 rcomp=f

Parameters:

Output Files

The pipeline generates the following output files:

Extracted RNA Sequences

Kmer Sets

Usage Example

# Prepare directory structure with genome files
mkdir -p genomes/genome1 genomes/genome2
# Place paired .fna.gz and .gff.gz files in each subdirectory

# Run the pipeline
./cutRna.sh

# Use generated kmer sets for classification
bbduk.sh in=query.fa ref=16S_15mers.fa k=15 stats=stats.txt

Implementation Notes

Related Tools