SplitRibo

Basic Usage

splitribo.sh in=<file,file> out=<pattern>

SplitRibo takes ribosomal RNA sequences from mixed databases (such as Silva) and separates them into individual files based on their type: 16S, 18S, 5S, 23S, and specialized mitochondrial/plastid variants.

Parameters

Parameters are organized according to their function in the ribosomal RNA classification and separation process.

Standard parameters

in=<file>: Input file containing mixed ribosomal RNA sequences. Can be FASTA or FASTQ format, compressed or uncompressed.
out=<pattern>: Output file pattern, such as out_#.fa. The # symbol is required and will be substituted by the type name, such as 16S, to make out_16S.fa, for example. Each ribosomal RNA type will be written to a separate file following this pattern.
overwrite=f: (ow) Set to false to force the program to abort rather than overwrite an existing file. Default: false.
ziplevel=9: (zl) Set to 1 (lowest) through 9 (max) to change compression level; lower compression is faster. Default: 9 (maximum compression).
types=16S,18S,5S,23S,m16S,m18S,p16S: Align to these sequences. Fewer types is faster. m16S and m18S are mitochondrial; p16S is plastid (chloroplast). Default includes all major ribosomal RNA types.

Processing parameters

minid=0.59: Ignore alignments to a consensus sequence with identity lower than this. Sequences whose best alignment identity falls below this threshold are classified as "Other". Default: 0.59 (59% identity).
refineid=0.70: Refine score by aligning to clade-specific consensus if the best alignment to a universal consensus is below this. This enables more accurate classification by using specialized consensus sequences for borderline matches. Default: 0.70 (70% identity).

Java Parameters

-Xmx: This will set Java's memory usage, overriding autodetection. -Xmx20g will specify 20 gigs of RAM, and -Xmx200m will specify 200 megs. The max is typically 85% of physical memory. Default: 4g.
-eoom: This flag will cause the process to exit if an out-of-memory exception occurs. Requires Java 8u92+.
-da: Disable assertions.
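
As a rough illustration of the -Xmx sizes described above, the following Python sketch converts a flag such as -Xmx20g into bytes and checks it against the roughly 85% ceiling. The function names are hypothetical and are not part of SplitRibo; the only facts assumed are that the JVM treats the k, m, and g suffixes as binary multiples.

```python
import re

# Binary multipliers the JVM applies to -Xmx suffixes.
_UNITS = {"": 1, "k": 1024, "m": 1024 ** 2, "g": 1024 ** 3}


def xmx_to_bytes(flag: str) -> int:
    """Convert a flag such as '-Xmx20g' or '-Xmx200m' to a byte count."""
    match = re.fullmatch(r"-Xmx(\d+)([kmg]?)", flag, flags=re.IGNORECASE)
    if match is None:
        raise ValueError(f"not a valid -Xmx flag: {flag!r}")
    number, unit = match.groups()
    return int(number) * _UNITS[unit.lower()]


def fits_in_memory(flag: str, physical_bytes: int, fraction: float = 0.85) -> bool:
    """Hypothetical check against the ~85% of physical memory noted above."""
    return xmx_to_bytes(flag) <= physical_bytes * fraction
```

On a machine with 32 GiB of RAM, this check would accept -Xmx20g and reject -Xmx32g.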

Examples

Basic rRNA Separation

splitribo.sh in=mixed_rrna.fasta out=separated_#.fasta

Separates mixed ribosomal RNA sequences into individual files: separated_16S.fasta, separated_18S.fasta, separated_5S.fasta, separated_23S.fasta, separated_m16S.fasta, separated_m18S.fasta, separated_p16S.fasta, and separated_Other.fasta for unclassified sequences.

Custom Type Selection

splitribo.sh in=silva_database.fasta out=classified_#.fa types=16S,18S

Only classifies sequences as 16S or 18S ribosomal RNA, ignoring other types for faster processing.

Strict Classification

splitribo.sh in=rrna_seqs.fq out=strict_#.fastq minid=0.75 refineid=0.85

Uses stricter identity thresholds for classification, requiring higher similarity to consensus sequences before assignment to a specific ribosomal RNA type.

High-throughput Processing

splitribo.sh in=large_dataset.fasta.gz out=output_#.fasta.gz -Xmx32g overwrite=t

Processes a large compressed dataset with increased memory allocation and allows overwriting of existing output files.

Algorithm Details

Classification Strategy

SplitRibo employs a two-stage alignment approach for accurate ribosomal RNA classification:

Stage 1: Universal Consensus Alignment

Each input sequence is first aligned against universal consensus sequences for each ribosomal RNA type using IDAligner from aligner.Factory.makeIDAligner(). The algorithm:

Calculates alignment identity for each sequence against all target types using ssa.align(r.bases, ref.bases)
Selects the best-matching type if identity exceeds the minid threshold (default 0.59)
Uses only the first consensus sequence of each type (index 0, i.e. the half-open index range 0 to 1) for the initial broad classification
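
The Stage 1 selection can be sketched in Python as follows. This is a minimal illustration, not SplitRibo's code: toy_identity is a hypothetical ungapped stand-in for the real aligner, and the dictionary of universal consensus sequences is invented for the example. Only the rule itself (pick the best type, fall back to "Other" below minid) comes from the description above.

```python
def toy_identity(query: str, ref: str) -> float:
    """Ungapped positional identity; a stand-in for a real aligner."""
    longest = max(len(query), len(ref))
    if longest == 0:
        return 0.0
    matches = sum(1 for a, b in zip(query, ref) if a == b)
    return matches / longest


def classify_stage1(query: str, universal: dict, min_id: float = 0.59):
    """Return (type_name, identity); 'Other' if nothing reaches min_id."""
    best_type, best_id = "Other", 0.0
    for type_name, ref in universal.items():
        identity = toy_identity(query, ref)
        if identity > best_id:
            best_type, best_id = type_name, identity
    if best_id < min_id:
        # Below the threshold, the sequence is left unclassified.
        return "Other", best_id
    return best_type, best_id
```

A real aligner tolerates insertions and deletions, so actual identities will differ from this positional count.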

Stage 2: Clade-Specific Refinement

For sequences whose best identity is below the refineid threshold (default 0.70), or whose best match is plastid p16S, the algorithm performs refinement:

Aligns against the remaining consensus sequences (index 1 up to, but not including, refs.length) for clade-specific comparison
Iterates through consensusSequences[type] array containing multiple reference sequences per type
Automatically triggers refinement for p16S_index matches to distinguish chloroplast from bacterial 16S
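
The refinement trigger and the two index ranges can be sketched as below. This is an illustrative Python outline under stated assumptions: the identity function is passed in by the caller, the consensus dictionary layout (universal consensus at index 0, clade-specific ones after it) mirrors the description above, and keeping the higher of the Stage 1 and Stage 2 scores is an assumption rather than documented behavior.

```python
def needs_refinement(best_type: str, best_id: float, refine_id: float = 0.70) -> bool:
    """Refine borderline matches, and always refine p16S matches."""
    return best_id < refine_id or best_type == "p16S"


def best_in_range(query, refs, start, stop, identity) -> float:
    """Best identity over refs[start:stop] (half-open range)."""
    return max((identity(query, ref) for ref in refs[start:stop]), default=0.0)


def classify(query, consensus, identity, min_id=0.59, refine_id=0.70):
    # Stage 1: only index 0 (the universal consensus) of each type.
    scores = {t: best_in_range(query, refs, 0, 1, identity)
              for t, refs in consensus.items()}
    best_type = max(scores, key=scores.get)
    if needs_refinement(best_type, scores[best_type], refine_id):
        # Stage 2: indices 1 .. len(refs)-1 (clade-specific consensuses).
        for t, refs in consensus.items():
            refined = best_in_range(query, refs, 1, len(refs), identity)
            scores[t] = max(scores[t], refined)  # assumption: keep the better score
        best_type = max(scores, key=scores.get)
    best_id = scores[best_type]
    return (best_type, best_id) if best_id >= min_id else ("Other", best_id)
```

Skipping Stage 2 for confident matches is what makes the lower refineid settings faster: most reads are aligned to only one consensus per type.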

Consensus Sequence Management

The tool loads consensus sequences through ProkObject.loadConsensusSequenceType() with automatic conflict removal:

Loads sequences into consensusSequences[numTypes] Read[][] array structure
Calculates m16S_index, m18S_index, p16S_index using Tools.find() for specialized handling
Strips mitochondrial sequences from 16S/18S when stripM16S/stripM18S flags are true
Strips plastid sequences from 16S when stripP16S flag is true to reduce cross-contamination
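
A small Python sketch of the index lookup and conflict removal described above. Both helpers are hypothetical: find only assumes that a Tools.find-style lookup returns -1 for a missing name, and strip_conflicts removes exact duplicates purely for illustration, since the actual stripping criteria are not described here. The type list places "Other" at index 0, consistent with type 0 being "Other".

```python
def find(name: str, names: list) -> int:
    """Index of name in names, or -1 if absent (assumed lookup behavior)."""
    try:
        return names.index(name)
    except ValueError:
        return -1


def strip_conflicts(target_refs: list, conflict_refs: list) -> list:
    """Drop target consensus entries that also appear in a conflicting set."""
    conflicts = set(conflict_refs)
    return [ref for ref in target_refs if ref not in conflicts]


# Assumed layout: type 0 is "Other", followed by the requested types.
TYPES = ["Other", "16S", "18S", "5S", "23S", "m16S", "m18S", "p16S"]
m16s_index = find("m16S", TYPES)
p16s_index = find("p16S", TYPES)
```

If a type is left out of the types= list, its index would be -1 here and the specialized handling for it would simply never fire.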

Performance Characteristics

SplitRibo implements multi-threaded processing using the ThreadWaiter.startAndWait() framework:

Memory Usage: The default is z="-Xmx4g"; calcXmx() uses a freeRam calculation to auto-size memory when -Xmx is not given
Threading: Spawns Shared.threads() ProcessThread instances processing ConcurrentReadInputStream
I/O Implementation: Uses ConcurrentReadOutputStream[] array with configurable buffer sizes via Tools.mid(2, 16, (Shared.threads()*2)/3)
Format Support: FileFormat.testInput() handles FASTA/FASTQ detection with compression via ReadWrite.USE_PIGZ
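
The buffer expression above amounts to clamping two thirds of the thread count into the range 2 to 16. The Python sketch below assumes that Tools.mid returns the median of its three arguments and that the division is integer division, as it would be for Java ints; buffer_count is a hypothetical name.

```python
def mid(x: int, y: int, z: int) -> int:
    """Median of three values, which clamps z into [x, y] when x <= y."""
    return sorted((x, y, z))[1]


def buffer_count(threads: int) -> int:
    """Two thirds of the thread count, kept between 2 and 16."""
    return mid(2, 16, (threads * 2) // 3)
```

Under these assumptions a 12-thread run would get 8, while very small and very large thread counts are pinned to 2 and 16 respectively.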

Output Organization

The tool generates each output filename with outPattern.replaceFirst("#", type) and routes sequences accordingly:

Sequences are distributed to the out[type] ArrayList<Read> lists according to the processRead() return value
processRead() returns type 0 ("Other") for sequences whose best identity is below minID
Original Read objects are preserved with r.obj=bestID for score tracking
Statistics accumulation via Tools.add(readsOut, pt.readsOutT) across all ProcessThread instances
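
The filename substitution and per-type routing can be pictured with this short Python sketch. The function names and the (sequence_id, type_name) tuples are hypothetical; only the first-occurrence replacement of # with the type name is taken from the description above.

```python
def output_name(pattern: str, type_name: str) -> str:
    """Replace the first '#' in the pattern with the type name."""
    return pattern.replace("#", type_name, 1)


def route(reads, pattern: str) -> dict:
    """Group (sequence_id, type_name) pairs by their output filename."""
    buckets = {}
    for seq_id, type_name in reads:
        buckets.setdefault(output_name(pattern, type_name), []).append(seq_id)
    return buckets
```

For example, the pattern out_#.fa and the type 16S give out_16S.fa, and reads classified as "Other" end up in out_Other.fa.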

Quality Control Features

The validateParams() and checkFileExistence() methods enforce data integrity:

Validates outPattern.contains("#") requirement during checkFileExistence() execution
Tests file accessibility via Tools.testInputFiles() and Tools.testOutputFiles() before processing
Implements Tools.testForDuplicateFiles() to prevent I/O conflicts
Generates statistics via Tools.timeReadsBasesProcessed() and per-type counting in readsOut[] arrays
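
A minimal Python sketch of the kind of pre-flight checks listed above: the # requirement and duplicate-file detection. validate_outputs is a hypothetical helper, and raising ValueError is an illustrative choice, not the tool's actual error handling.

```python
def validate_outputs(pattern: str, types: list, input_paths: list) -> list:
    """Return the expanded output names, or raise if the setup is unsafe."""
    if "#" not in pattern:
        raise ValueError("output pattern must contain '#'")
    outputs = [pattern.replace("#", t, 1) for t in types]
    all_paths = list(input_paths) + outputs
    # Any path used twice (input vs. output, or two outputs) is a conflict.
    duplicates = sorted({p for p in all_paths if all_paths.count(p) > 1})
    if duplicates:
        raise ValueError(f"duplicate files: {duplicates}")
    return outputs
```

Running such checks before any reads are processed means a bad pattern fails immediately rather than after a long alignment run.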

Output Format

File Naming Convention

Output files follow the pattern specified in the out parameter:

out_16S.fasta - Bacterial/archaeal small subunit rRNA
out_18S.fasta - Eukaryotic small subunit rRNA
out_5S.fasta - 5S rRNA (a small rRNA component of the large ribosomal subunit)
out_23S.fasta - Bacterial/archaeal large subunit rRNA
out_m16S.fasta - Mitochondrial small subunit rRNA
out_m18S.fasta - Mitochondrial large subunit rRNA
out_p16S.fasta - Plastid (chloroplast) small subunit rRNA
out_Other.fasta - Unclassified sequences

Statistics Output

Statistics are generated by the Tools.timeReadsBasesProcessed() and Tools.readsBasesOut() methods and written to outstream:

Time:                         1.234 seconds.
Reads Processed:         10000   8.10k reads/sec
Bases Processed:       5000000   4.05m bases/sec

Type        Count
Other          15
16S          8234
18S          1456
5S            123
23S           156
m16S           12
m18S            3
p16S            1
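
For downstream scripting, the per-type table can be read back with a few lines of Python. This is a sketch that assumes the two-column whitespace-separated layout shown above; parse_type_counts is a hypothetical helper, not something shipped with the tool.

```python
def parse_type_counts(text: str) -> dict:
    """Collect 'Type Count' rows, skipping the header and unrelated lines."""
    counts = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) == 2 and fields[1].isdigit():
            counts[fields[0]] = int(fields[1])
    return counts


SAMPLE = """Type        Count
Other          15
16S          8234
18S          1456
5S            123
23S           156
m16S           12
m18S            3
p16S            1
"""

counts = parse_type_counts(SAMPLE)
total = sum(counts.values())  # should equal the number of reads processed
```

Comparing the total against the "Reads Processed" figure is a quick sanity check that every read was written to exactly one output.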

Support

For questions and support:

Email: bbushnell@lbl.gov
Documentation: bbmap.org