SplitRibo

Script: splitribo.sh Package: prok Class: SplitRibo.java

Splits a file of various rRNAs into one file per type (16S, 18S, 5S, 23S).

Basic Usage

splitribo.sh in=<file,file> out=<pattern>

SplitRibo takes ribosomal RNA sequences from mixed databases (such as Silva) and separates them into individual files based on their type: 16S, 18S, 5S, 23S, and specialized mitochondrial/plastid variants.

Parameters

Parameters are organized according to their function in the ribosomal RNA classification and separation process.

Standard parameters

in=<file>
Input file containing mixed ribosomal RNA sequences. Can be FASTA or FASTQ format, compressed or uncompressed.
out=<pattern>
Output file pattern, such as out_#.fa. The # symbol is required and is replaced by the type name; for example, type 16S yields out_16S.fa. Each ribosomal RNA type is written to a separate file following this pattern.
overwrite=f
(ow) Set to false to force the program to abort rather than overwrite an existing file. Default: false.
ziplevel=9
(zl) Set to 1 (lowest) through 9 (max) to change compression level; lower compression is faster. Default: 9 (maximum compression).
types=16S,18S,5S,23S,m16S,m18S,p16S
Align to these sequences. Fewer types is faster. m16S and m18S are mitochondrial; p16S is plastid (chloroplast). Default includes all major ribosomal RNA types.

Processing parameters

minid=0.59
Ignore alignments whose identity to a consensus sequence is lower than this. Sequences with alignment identity below this threshold will be classified as "Other". Default: 0.59 (59% identity).
refineid=0.70
Refine score by aligning to clade-specific consensus if the best alignment to a universal consensus is below this. This enables more accurate classification by using specialized consensus sequences for borderline matches. Default: 0.70 (70% identity).

Java Parameters

-Xmx
This will set Java's memory usage, overriding autodetection. -Xmx20g will specify 20 gigs of RAM, and -Xmx200m will specify 200 megs. The max is typically 85% of physical memory. Default: 4g.
-eoom
This flag will cause the process to exit if an out-of-memory exception occurs. Requires Java 8u92+.
-da
Disable assertions.

Examples

Basic rRNA Separation

splitribo.sh in=mixed_rrna.fasta out=separated_#.fasta

Separates mixed ribosomal RNA sequences into individual files: separated_16S.fasta, separated_18S.fasta, separated_5S.fasta, separated_23S.fasta, separated_m16S.fasta, separated_m18S.fasta, separated_p16S.fasta, and separated_Other.fasta for unclassified sequences.

Custom Type Selection

splitribo.sh in=silva_database.fasta out=classified_#.fa types=16S,18S

Only classifies sequences as 16S or 18S ribosomal RNA, ignoring other types for faster processing.

Strict Classification

splitribo.sh in=rrna_seqs.fq out=strict_#.fastq minid=0.75 refineid=0.85

Uses stricter identity thresholds for classification, requiring higher similarity to consensus sequences before assignment to a specific ribosomal RNA type.

High-throughput Processing

splitribo.sh in=large_dataset.fasta.gz out=output_#.fasta.gz -Xmx32g overwrite=t

Processes a large compressed dataset with increased memory allocation and allows overwriting of existing output files.

Algorithm Details

Classification Strategy

SplitRibo employs a two-stage alignment approach for accurate ribosomal RNA classification:

Stage 1: Universal Consensus Alignment

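Each input sequence is first aligned against the universal consensus sequence for each ribosomal RNA type, using an IDAligner obtained from aligner.Factory.makeIDAligner(). Alignments with identity below minid (default 0.59) are ignored, and a sequence with no alignment at or above that threshold is classified as "Other".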
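
The selection step can be sketched as follows. This is a minimal illustration, not the BBTools source: the class StageOneSketch and the method bestType are hypothetical, the identities are assumed to have been computed already, and choosing the type with the highest identity is an assumption based on the parameter descriptions above.

```java
import java.util.Map;

/** Hypothetical helper for illustration; not part of BBTools. */
final class StageOneSketch {

    /**
     * Returns the type whose consensus gave the highest identity,
     * or "Other" when even the best identity is below minid.
     */
    static String bestType(Map<String, Double> identityByType, double minid) {
        String best = "Other";
        double bestId = -1.0;
        for (Map.Entry<String, Double> e : identityByType.entrySet()) {
            if (e.getValue() > bestId) {
                bestId = e.getValue();
                best = e.getKey();
            }
        }
        // Alignments below minid are ignored, so the read falls through to "Other".
        return bestId < minid ? "Other" : best;
    }
}
```

With identities of 0.91 for 16S and 0.62 for 18S and minid=0.59, this sketch returns 16S; raising minid above 0.91 would send the same read to Other.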

Stage 2: Clade-Specific Refinement

For sequences whose best identity is below the refineid threshold (default 0.70), or whose best match is plastid p16S, the algorithm refines the score by aligning to clade-specific consensus sequences.
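
The condition that triggers refinement can be written as a single predicate. The sketch below only restates the rule in the sentence above; the class RefineSketch and the method needsRefinement are hypothetical names, and how the refined score is then combined with the Stage 1 score is not shown because the text does not specify it.

```java
/** Hypothetical helper for illustration; not part of BBTools. */
final class RefineSketch {

    /**
     * True when the Stage 1 result should be re-scored against
     * clade-specific consensus sequences.
     */
    static boolean needsRefinement(String bestType, double bestIdentity, double refineid) {
        // Borderline matches and plastid 16S hits both go through Stage 2.
        return bestIdentity < refineid || "p16S".equals(bestType);
    }
}
```

Under this rule a 16S hit at 0.65 identity is refined with the default refineid=0.70, a 16S hit at 0.90 is not, and a p16S hit is refined regardless of identity.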

Consensus Sequence Management

The tool loads consensus sequences through ProkObject.loadConsensusSequenceType() with automatic conflict removal.

Performance Characteristics

SplitRibo implements multi-threaded processing using the ThreadWaiter.startAndWait() framework.
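
The general start-then-wait pattern looks like the sketch below, written with plain java.lang.Thread. It is not the ThreadWaiter implementation: the class StartAndWaitSketch, its methods, and the strided work split are assumptions chosen only to show the idea of starting all workers, joining them, and merging their per-thread counts.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical illustration of a start-and-wait pattern; not the BBTools code. */
final class StartAndWaitSketch {

    /** Starts every worker, then blocks until all of them have finished. */
    static void startAndWait(List<Thread> workers) {
        for (Thread t : workers) {
            t.start();
        }
        for (Thread t : workers) {
            try {
                t.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("Interrupted while waiting for workers", e);
            }
        }
    }

    /** Processes `items` work units on `threads` threads and returns the total handled. */
    static long countInParallel(int items, int threads) {
        AtomicLong processed = new AtomicLong();
        List<Thread> workers = new ArrayList<>();
        for (int i = 0; i < threads; i++) {
            final int offset = i;
            workers.add(new Thread(() -> {
                long local = 0;
                // Each thread takes every threads-th item, starting at its own offset.
                for (int j = offset; j < items; j += threads) {
                    local++; // stand-in for classifying one sequence
                }
                processed.addAndGet(local); // merge the per-thread count once at the end
            }));
        }
        startAndWait(workers);
        return processed.get();
    }
}
```

Keeping a local counter per thread and merging once at the end avoids contention on the shared total while the workers run.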

Output Organization

The tool generates each output filename with outPattern.replaceFirst("#", type) and routes every sequence to the file for its assigned type.
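
The substitution itself is a standard java.lang.String call. The sketch below wraps it in a hypothetical class OutputNameSketch to show which filenames a given pattern produces; the helper names are illustrative and are not taken from SplitRibo.java.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper for illustration; not part of BBTools. */
final class OutputNameSketch {

    /** Replaces the first '#' in the pattern with the type name. */
    static String fileFor(String outPattern, String type) {
        return outPattern.replaceFirst("#", type);
    }

    /** Expands the pattern for every requested type plus the "Other" bin. */
    static List<String> filesFor(String outPattern, String[] types) {
        List<String> names = new ArrayList<>();
        names.add(fileFor(outPattern, "Other"));
        for (String type : types) {
            names.add(fileFor(outPattern, type));
        }
        return names;
    }
}
```

For example, fileFor("out_#.fa", "16S") gives out_16S.fa. Because replaceFirst treats its second argument as a regex replacement string, a type name containing $ or a backslash would need escaping; the type names used by this tool contain neither.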

Quality Control Features

The validateParams() and checkFileExistence() methods validate the parameters and the input and output files.
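
Two checks follow directly from the parameter descriptions: the out pattern must contain #, and an existing output file must not be replaced when overwrite is false. The sketch below shows one way such checks could look. It is an assumption-laden illustration, not the body of validateParams() or checkFileExistence(); the class ValidationSketch and the method checkOutput are hypothetical.

```java
import java.io.File;

/** Hypothetical helper for illustration; not part of BBTools. */
final class ValidationSketch {

    /**
     * Rejects an out pattern without '#', and refuses to continue when a
     * target file already exists and overwriting is disabled.
     */
    static void checkOutput(String outPattern, String[] types, boolean overwrite) {
        if (outPattern == null || !outPattern.contains("#")) {
            throw new IllegalArgumentException("out pattern must contain '#': " + outPattern);
        }
        for (String type : types) {
            File f = new File(outPattern.replaceFirst("#", type));
            if (f.exists() && !overwrite) {
                throw new IllegalStateException("File exists and overwrite=f: " + f);
            }
        }
    }
}
```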

Output Format

File Naming Convention

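Output files follow the pattern specified in the out parameter: the # symbol is replaced by the type name, so out_#.fa produces out_16S.fa, out_18S.fa, and so on, plus an Other file for unclassified sequences.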

Statistics Output

Statistics are generated by the Tools.timeReadsBasesProcessed() and Tools.readsBasesOut() methods and written to outstream, followed by a table of per-type counts:

Time:                         1.234 seconds.
Reads Processed:         10000    0.01 Mreads
Bases Processed:    5000000    5.00 Mbases

Type        Count
Other          15
16S          8234
18S          1456
5S            123
23S           156
m16S           12
m18S            3
p16S            1
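
A table with this layout can be produced with a fixed-width format string. The sketch below is illustrative only: the class CountTableSketch is hypothetical, and the column widths (8 and 9) are an assumption chosen to approximate the sample above rather than values taken from the source code.

```java
import java.util.Map;

/** Hypothetical helper for illustration; not part of BBTools. */
final class CountTableSketch {

    /** Formats one row: type left-aligned in 8 columns, count right-aligned in 9. */
    static String row(String type, long count) {
        return String.format("%-8s%9d", type, count);
    }

    /** Builds the header plus one row per type, in the map's iteration order. */
    static String table(Map<String, Long> counts) {
        StringBuilder sb = new StringBuilder();
        sb.append(String.format("%-8s%9s", "Type", "Count")).append('\n');
        for (Map.Entry<String, Long> e : counts.entrySet()) {
            sb.append(row(e.getKey(), e.getValue())).append('\n');
        }
        return sb.toString();
    }
}
```

Passing a LinkedHashMap keeps the rows in insertion order, which is how the sample lists Other first and the requested types after it.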

Support

For questions and support: