SplitRibo
Splits a file of various rRNAs into one file per type (16S, 18S, 5S, 23S).
Basic Usage
splitribo.sh in=<file,file> out=<pattern>
SplitRibo takes ribosomal RNA sequences from mixed databases (such as Silva) and separates them into individual files based on their type: 16S, 18S, 5S, 23S, and specialized mitochondrial/plastid variants.
Parameters
Parameters are organized according to their function in the ribosomal RNA classification and separation process.
Standard parameters
- in=<file>
- Input file containing mixed ribosomal RNA sequences. Can be FASTA or FASTQ format, compressed or uncompressed.
- out=<pattern>
- Output file pattern, such as out_#.fa. The # symbol is required and will be substituted by the type name, such as 16S, to make out_16S.fa, for example. Each ribosomal RNA type will be written to a separate file following this pattern.
- overwrite=f
- (ow) Set to false to force the program to abort rather than overwrite an existing file. Default: false.
- ziplevel=9
- (zl) Set to 1 (lowest) through 9 (max) to change compression level; lower compression is faster. Default: 9 (maximum compression).
- types=16S,18S,5S,23S,m16S,m18S,p16S
- Align to these sequences. Fewer types is faster. m16S and m18S are mitochondrial; p16S is plastid (chloroplast). Default includes all major ribosomal RNA types.
Processing parameters
- minid=0.59
- Ignore alignments to a consensus sequence with identity lower than this. Sequences whose best alignment identity falls below this threshold are classified as "Other". Default: 0.59 (59% identity).
- refineid=0.70
- Refine score by aligning to clade-specific consensus if the best alignment to a universal consensus is below this. This enables more accurate classification by using specialized consensus sequences for borderline matches. Default: 0.70 (70% identity).
Java Parameters
- -Xmx
- This will set Java's memory usage, overriding autodetection. -Xmx20g will specify 20 gigs of RAM, and -Xmx200m will specify 200 megs. The max is typically 85% of physical memory. Default: 4g.
- -eoom
- This flag will cause the process to exit if an out-of-memory exception occurs. Requires Java 8u92+.
- -da
- Disable assertions.
Examples
Basic rRNA Separation
splitribo.sh in=mixed_rrna.fasta out=separated_#.fasta
Separates mixed ribosomal RNA sequences into individual files: separated_16S.fasta, separated_18S.fasta, separated_5S.fasta, separated_23S.fasta, separated_m16S.fasta, separated_m18S.fasta, separated_p16S.fasta, and separated_Other.fasta for unclassified sequences.
Custom Type Selection
splitribo.sh in=silva_database.fasta out=classified_#.fa types=16S,18S
Only classifies sequences as 16S or 18S ribosomal RNA, ignoring other types for faster processing.
Strict Classification
splitribo.sh in=rrna_seqs.fq out=strict_#.fastq minid=0.75 refineid=0.85
Uses stricter identity thresholds for classification, requiring higher similarity to consensus sequences before assignment to a specific ribosomal RNA type.
High-throughput Processing
splitribo.sh in=large_dataset.fasta.gz out=output_#.fasta.gz -Xmx32g overwrite=t
Processes a large compressed dataset with increased memory allocation and allows overwriting of existing output files.
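When SplitRibo is driven from a script, the command lines above can be assembled from the documented key=value parameters. The following is a minimal sketch only: the helper name build_splitribo_command is hypothetical, and it uses just the parameters described on this page (in, out, types, minid, refineid, overwrite).

```python
def build_splitribo_command(in_path, out_pattern, types=None,
                            minid=None, refineid=None, overwrite=None):
    """Assemble an argv list for splitribo.sh (hypothetical helper).

    Only parameters documented on this page are emitted; anything left
    as None is omitted so the tool falls back to its own default.
    """
    cmd = ["splitribo.sh", f"in={in_path}", f"out={out_pattern}"]
    if types is not None:
        cmd.append("types=" + ",".join(types))
    if minid is not None:
        cmd.append(f"minid={minid}")
    if refineid is not None:
        cmd.append(f"refineid={refineid}")
    if overwrite is not None:
        cmd.append("overwrite=" + ("t" if overwrite else "f"))
    return cmd


# Rebuild the "Strict Classification" example from above.
cmd = build_splitribo_command("rrna_seqs.fq", "strict_#.fastq",
                              minid=0.75, refineid=0.85)
print(" ".join(cmd))
```

The resulting list can be passed to subprocess.run(cmd, check=True) on a system where splitribo.sh is on the PATH.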
Algorithm Details
Classification Strategy
SplitRibo employs a two-stage alignment approach for accurate ribosomal RNA classification:
Stage 1: Universal Consensus Alignment
Each input sequence is first aligned against universal consensus sequences for each ribosomal RNA type using IDAligner from aligner.Factory.makeIDAligner(). The algorithm:
- Calculates alignment identity for each sequence against all target types using ssa.align(r.bases, ref.bases)
- Selects the best-matching type if identity exceeds the minid threshold (default 0.59)
- Processes only the first consensus sequence of each type (index range 0 to 1, end-exclusive, so index 0 only) for initial broad classification
Stage 2: Clade-Specific Refinement
For sequences whose best identity is below the refineid threshold (default 0.70), or whose best match is plastid p16S, the algorithm performs refinement:
- Processes all consensus sequences (index 1 through refs.length) for comprehensive clade-specific alignment
- Iterates through consensusSequences[type] array containing multiple reference sequences per type
- Automatically triggers refinement for p16S_index matches to distinguish chloroplast from bacterial 16S
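The two stages can be sketched as follows. This is an illustration under stated assumptions, not SplitRibo's actual code: toy_identity is a hypothetical stand-in for the real aligner (it only counts matching positions), and the consensus dictionary holds invented toy sequences in which element 0 plays the role of the universal consensus and later elements the clade-specific ones.

```python
def toy_identity(query, ref):
    """Hypothetical stand-in for the aligner: ungapped fraction of matches."""
    n = max(len(query), len(ref))
    if n == 0:
        return 0.0
    return sum(1 for a, b in zip(query, ref) if a == b) / n


def classify(query, consensus, minid=0.59, refineid=0.70,
             always_refine=("p16S",)):
    """Return (type_name, identity); 'Other' if nothing reaches minid."""
    best_type, best_id = "Other", 0.0
    # Stage 1: compare against the universal consensus (element 0) only.
    for name, refs in consensus.items():
        ident = toy_identity(query, refs[0])
        if ident > best_id:
            best_type, best_id = name, ident
    # Stage 2: refine borderline hits, and always refine p16S hits.
    if best_id < refineid or best_type in always_refine:
        for name, refs in consensus.items():
            for ref in refs[1:]:
                ident = toy_identity(query, ref)
                if ident > best_id:
                    best_type, best_id = name, ident
    if best_id < minid:
        return "Other", best_id
    return best_type, best_id


# Toy references, invented for this example only.
consensus = {
    "16S": ["ACGTACGTAC", "ACGTTCGTAC"],
    "18S": ["TTGGCCAATT", "TTGGCCAAGG"],
}
print(classify("ACGTTCGTAC", consensus))  # strong stage-1 hit, no refinement
print(classify("TTGGAAAAGG", consensus))  # borderline hit, improved by refinement
print(classify("GGGGGGGGGG", consensus))  # never reaches minid
```

In this toy run the first query is accepted at stage 1, the second starts at 0.6 identity and is lifted to 0.8 by a clade-specific reference, and the third falls through to "Other".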
Consensus Sequence Management
The tool loads consensus sequences through ProkObject.loadConsensusSequenceType() with automatic conflict removal:
- Loads sequences into consensusSequences[numTypes] Read[][] array structure
- Calculates m16S_index, m18S_index, p16S_index using Tools.find() for specialized handling
- Strips mitochondrial sequences from 16S/18S when stripM16S/stripM18S flags are true
- Strips plastid sequences from 16S when stripP16S flag is true to reduce cross-contamination
Performance Characteristics
SplitRibo implements multi-threaded processing using the ThreadWaiter.startAndWait() framework:
- Memory Usage: Default z="-Xmx4g" via calcXmx() with freeRam calculation for auto-sizing
- Threading: Spawns Shared.threads() ProcessThread instances processing ConcurrentReadInputStream
- I/O Implementation: Uses ConcurrentReadOutputStream[] array with configurable buffer sizes via Tools.mid(2, 16, (Shared.threads()*2)/3)
- Format Support: FileFormat.testInput() handles FASTA/FASTQ detection with compression via ReadWrite.USE_PIGZ
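The buffer expression above is easier to read with concrete numbers. The sketch below assumes that Tools.mid returns the median of its three arguments, which makes the expression a clamp of (threads*2)/3 to the range 2 through 16; the function names here are illustrative and not part of BBTools.

```python
def mid(x, y, z):
    """Median of three values (assumed behaviour of Tools.mid)."""
    return sorted((x, y, z))[1]


def output_buffer_count(threads):
    """Mirror of Tools.mid(2, 16, (threads*2)/3) with integer division."""
    return mid(2, 16, (threads * 2) // 3)


for t in (1, 8, 64):
    print(t, output_buffer_count(t))
# 1 2
# 8 5
# 64 16
```

Under that assumption, small thread counts are raised to the floor of 2, and large thread counts are capped at 16.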
Output Organization
The tool routes sequences using outPattern.replaceFirst("#", type) filename generation:
- Sequences are distributed to out[type] ArrayList<Read> arrays by processRead() return value
- Sequences with identity < minID return type 0 ("Other") in processRead() logic
- Original Read objects are preserved with r.obj=bestID for score tracking
- Statistics accumulation via Tools.add(readsOut, pt.readsOutT) across all ProcessThread instances
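As a rough illustration of the routing described above, the sketch below expands the output pattern and groups reads by destination file. It is a simplified model: reads are represented by plain ID strings rather than Read objects, and the function names are hypothetical.

```python
from collections import defaultdict


def output_name(pattern, type_name):
    """Replace the first '#' in the pattern with the type name."""
    return pattern.replace("#", type_name, 1)


def route(classified, pattern):
    """Group (read_id, type_name) pairs by destination filename."""
    buckets = defaultdict(list)
    for read_id, type_name in classified:
        buckets[output_name(pattern, type_name)].append(read_id)
    return dict(buckets)


demo = route([("r1", "16S"), ("r2", "Other"), ("r3", "16S")], "out_#.fa")
print(demo)
# {'out_16S.fa': ['r1', 'r3'], 'out_Other.fa': ['r2']}
```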
Quality Control Features
The validateParams() and checkFileExistence() methods enforce data integrity:
- Validates outPattern.contains("#") requirement during checkFileExistence() execution
- Tests file accessibility via Tools.testInputFiles() and Tools.testOutputFiles() before processing
- Implements Tools.testForDuplicateFiles() to prevent I/O conflicts
- Generates statistics via Tools.timeReadsBasesProcessed() and per-type counting in readsOut[] arrays
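The checks listed above can be approximated before launching a run. The following sketch is an assumption-laden approximation rather than the tool's own logic: validate_outputs is a hypothetical helper that enforces the '#' requirement and rejects any path that appears more than once among the inputs and the expanded outputs.

```python
def validate_outputs(pattern, types, in_paths):
    """Return expanded output names, or raise ValueError on a conflict."""
    if "#" not in pattern:
        raise ValueError("output pattern must contain '#'")
    names = [pattern.replace("#", t, 1) for t in types]
    all_paths = list(in_paths) + names
    # Any path listed twice would be read and written, or written twice.
    dupes = sorted({p for p in all_paths if all_paths.count(p) > 1})
    if dupes:
        raise ValueError("duplicate files: " + ", ".join(dupes))
    return names


print(validate_outputs("out_#.fa", ["Other", "16S", "18S"], ["mixed.fa"]))
# ['out_Other.fa', 'out_16S.fa', 'out_18S.fa']
```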
Output Format
File Naming Convention
Output files follow the pattern specified in the out parameter:
out_16S.fasta
- Bacterial/archaeal small subunit rRNA
out_18S.fasta
- Eukaryotic small subunit rRNA
out_5S.fasta
- Small ribosomal RNA subunit
out_23S.fasta
- Bacterial/archaeal large subunit rRNA
out_m16S.fasta
- Mitochondrial small subunit rRNA
out_m18S.fasta
- Mitochondrial large subunit rRNA
out_p16S.fasta
- Plastid (chloroplast) small subunit rRNA
out_Other.fasta
- Unclassified sequences
Statistics Output
Statistics are generated by the Tools.timeReadsBasesProcessed() and Tools.readsBasesOut() methods and written to outstream:
Time: 1.234 seconds.
Reads Processed: 10000 10.00 Mreads
Bases Processed: 5000000 5.00 Mbases
Type Count
Other 15
16S 8234
18S 1456
5S 123
23S 156
m16S 12
m18S 3
p16S 1
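For downstream scripts, the per-type table can be turned into a dictionary. This is a sketch under the assumption that the table is whitespace-separated and introduced by a "Type Count" header, as in the sample above; the exact layout of real output may differ, and parse_type_counts is a hypothetical helper.

```python
def parse_type_counts(text):
    """Extract {type: count} from the 'Type Count' table in the report."""
    counts = {}
    in_table = False
    for line in text.splitlines():
        fields = line.split()
        if fields == ["Type", "Count"]:
            in_table = True  # rows after this header are table entries
            continue
        if in_table and len(fields) == 2 and fields[1].isdigit():
            counts[fields[0]] = int(fields[1])
    return counts


# Sample report copied from the example output above.
sample = """Time: 1.234 seconds.
Type Count
Other 15
16S 8234
18S 1456
5S 123
23S 156
m16S 12
m18S 3
p16S 1
"""
counts = parse_type_counts(sample)
print(counts["16S"], sum(counts.values()))
# 8234 10000
```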
Support
For questions and support:
- Email: bbushnell@lbl.gov
- Documentation: bbmap.org