MergeRibo

Script: mergeribo.sh Package: prok Class: MergeRibo.java

Merges files of SSU sequences to keep one per taxID. By default, a consensus is generated per TaxID, then the sequence best matching that consensus is used: First, all sequences per TaxID are aligned to a reference consensus. Second, the best-matching sequence is used as a seed, and all other sequences for that TaxID are aligned to the seed to generate a new consensus. Third, in 'consensus' mode, that consensus is simply output. In 'best' mode (default), all sequences are aligned again to the new consensus, and the best-matching is output.

Basic Usage

mergeribo.sh in=<file,file> out=<file>

MergeRibo processes ribosomal RNA sequences (16S or 18S) to produce a single representative sequence per taxonomic ID. This is particularly useful for creating non-redundant databases or consensus sequences from multiple SSU rRNA sequences.

Parameters

Parameters are organized by their function in the sequence merging process. The tool follows a multi-step algorithm: alignment to reference consensus, seed selection, individual consensus generation, and best sequence selection.

Standard parameters

in=<file,file>: Comma-delimited list of input FASTA files containing SSU sequences. Each sequence must have a taxonomic ID in the header.
out=<file>: Output file containing one representative sequence per taxID. Format is FASTA.
out2=<file>: Read 2 output if reads are in two files. Not typically used for SSU sequences.
overwrite=f: (ow) Set to false to force the program to abort rather than overwrite an existing file.
showspeed=t: (ss) Set to 'f' to suppress display of processing speed statistics during execution.
ziplevel=2: (zl) Set to 1 (lowest) through 9 (max) to change compression level; lower compression is faster.
fastawrap=70: FASTA line wrap length. 4000 is recommended to minimize filesize for long sequences.

Processing parameters

alt=<file>: Lower priority data file. Only used if there is no SSU sequence associated with a TaxID from the primary input files.
best=t: Output the best representative sequence per taxID after consensus generation. This is the default mode.
consensus=f: Output a consensus sequence per taxID instead of the best input sequence. Mutually exclusive with best=t.
fast=f: Output the best sequence based on alignment to global consensus (the seed) rather than generating individual consensus sequences. Faster but potentially less accurate.
minid=0.62: Ignore sequences with identity lower than this threshold when aligned to the global consensus (16S or 18S reference).
maxns=-1: Ignore sequences with more than this many N bases, if non-negative. Set to -1 to disable filtering.
minlen=1: Ignore sequences shorter than this length in bases.
maxlen=4000: Ignore sequences longer than this length in bases. Typical full-length 16S is ~1500bp.
16S=t: Align sequences to 16S rRNA consensus to pick the seed sequence. Mutually exclusive with 18S.
18S=f: Align sequences to 18S rRNA consensus to pick the seed sequence. Mutually exclusive with 16S.
level=: If specified with a taxonomic term like 'species' or 'genus', nodes will be promoted to that level minimum before consensus generation. Requires taxonomy tree.
dada2=f: Output headers in DADA2 format with complete taxonomic lineages instead of simple sequence IDs.

Java Parameters

-Xmx: This will set Java's memory usage, overriding autodetection. -Xmx20g will specify 20 gigs of RAM, and -Xmx200m will specify 200 megs. The max is typically 85% of physical memory.
-eoom: This flag will cause the process to exit if an out-of-memory exception occurs. Requires Java 8u92+.
-da: Disable assertions for slightly improved performance in production runs.

Examples

Basic SSU Merging

mergeribo.sh in=ssu_sequences.fa out=merged_ssu.fa

Merge SSU sequences from a single file, producing one representative 16S sequence per taxonomic ID.

Multiple Input Files

mergeribo.sh in=file1.fa,file2.fa,file3.fa out=consensus.fa minid=0.8

Process multiple input files with higher identity threshold for more stringent filtering.

18S Processing

mergeribo.sh in=euk_ssu.fa out=merged_18s.fa 18S=t 16S=f

Process 18S rRNA sequences from eukaryotes instead of bacterial 16S.

Consensus Mode with Quality Filtering

mergeribo.sh in=raw_ssu.fa out=consensus_ssu.fa consensus=t best=f minlen=1200 maxlen=1600 maxns=10

Generate consensus sequences instead of best representatives, with length filtering for full-length 16S and N-base filtering.

Species-Level Grouping with DADA2 Output

mergeribo.sh in=amplicons.fa out=species_reps.fa level=species dada2=t

Group sequences at species level and output with DADA2-compatible taxonomic headers.

Algorithm Details

Multi-Stage Consensus Generation

MergeRibo implements a three-stage algorithm for selecting representative sequences per taxonomic ID:

Global Alignment Phase: All input sequences are aligned against a reference consensus (16S or 18S) using SingleStateAlignerFlat2. Sequences below the minid threshold are filtered out.
Seed Selection: The best-scoring sequence per taxID becomes the seed, based on a composite score combining sequence length and alignment identity.
Consensus Generation: All sequences for each taxID are aligned to their respective seed to generate a BaseGraph consensus using configurable parameters (MAF_sub=0.251, trimDepthFraction=0.3).

Scoring Algorithm

The tool uses a length-weighted identity score implemented in the score(int len, float identity) method: score = lengthMult(length) × identity. The lengthMult(int len) method calculates a penalty ratio as min(len, idealLength) / max(len, idealLength, 1), where idealLength corresponds to the loaded consensus sequence length (16S or 18S). This penalizes sequences that deviate significantly from the expected ribosomal gene length.

Memory Management

The implementation uses several memory optimization strategies:

ConcurrentLinkedQueue<ArrayList<Ribo>> for thread-safe parallel processing of taxID groups
HashMap<Integer, ArrayList<Ribo>> with initial capacity of 100,000 for taxID-to-sequences mapping
ListComparator sorts lists by size (largest first) in makeQueue() method to minimize load balancing issues
Dual-pass processing: first pass populates listMap via alignment, second pass processes via pickBest()
Default memory allocation configured as z="-Xmx4g" and z2="-Xms4g" with automatic scaling via calcXmx() method based on available RAM using freeRam 4000m 42
IntHashSet seenTaxID with initial capacity of 1,000,000 for tracking processed taxonomic IDs

Consensus vs Best Mode

In best mode (default), the final output is the input sequence that best matches the generated consensus. In consensus mode, the actual consensus sequence is output. Best mode preserves original sequence quality scores and annotations while consensus mode may produce sequences with higher accuracy in variable regions.

Fast Mode Optimization

When fast=t, sequences are ranked by their alignment to the global consensus (seed) rather than generating individual consensus sequences per taxID. This significantly reduces computation time but may produce less accurate representatives for highly divergent sequences within a taxID.

Taxonomic Level Processing

When the level parameter is specified, the processRead() method calls tree.getIdAtLevelExtended(key, taxLevelE) to promote sequences to the specified taxonomic level before consensus generation. This uses the TaxTree implementation to traverse taxonomic hierarchies and requires a taxonomy tree to be loaded via TaxTree.loadTaxTree().

Consensus Sequence Loading

Reference consensus sequences are loaded via ProkObject.loadConsensusSequenceType() for either "16S" or "18S" types. The consensus sequences are stored as byte arrays (consensus16S, consensus18S) and used by SingleStateAlignerFlat2 for initial sequence filtering and scoring.

Support

For questions and support:

Email: bbushnell@lbl.gov
Documentation: bbmap.org