CompareSSU

Basic Usage

comparessu.sh in=<input file> out=<output file>

Input may be fasta or fastq, compressed or uncompressed. Sequences must have taxonomic IDs (taxID) in their headers for proper comparison.

Parameters

Parameters are organized into three main categories: standard file I/O and processing options, sequence filtering parameters, and Java runtime settings.

Standard parameters

in=<file>: Input sequences. Must be fasta or fastq format, can be compressed or uncompressed.
out=<file>: Output data file containing comparison results with identity scores.
t=: Set the number of threads; defaults to the number of logical processors available on the system.
overwrite=f: (ow) Set to false to force the program to abort rather than overwrite an existing file.
showspeed=t: (ss) Set to 'f' to suppress display of processing speed statistics during execution.
ziplevel=2: (zl) Set to 1 (lowest) through 9 (max) to change compression level; lower compression is faster.
reads=-1: If positive, quit after processing this many input sequences. Default -1 processes all sequences.

Processing parameters

ata=f: Do an all-to-all comparison. When false (default), each sequence is compared to only one other randomly selected sequence per taxonomic level, which is much faster for large datasets.
minlen=0: Ignore sequences shorter than this length threshold. Useful for filtering out very short sequences that may not provide meaningful comparisons.
maxlen=BIG: Ignore sequences longer than this length threshold. Default is effectively unlimited (Integer.MAX_VALUE).
maxns=-1: If positive, ignore sequences with more than this many N bases. Default -1 accepts sequences regardless of N content.
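
The filters above amount to a simple predicate applied to each sequence before comparison. The following is an illustrative Python sketch, not the tool's Java code. The function name passes_filters is hypothetical, and two behaviors are assumptions: any negative maxns disables the N filter (based on the default of -1), and lowercase n counts as an N base.

```python
def passes_filters(seq, minlen=0, maxlen=2**31 - 1, maxns=-1):
    """Return True if seq survives the length and N-content filters.

    maxlen defaults to Integer.MAX_VALUE, mirroring maxlen=BIG.
    """
    length = len(seq)
    if length < minlen or length > maxlen:
        return False
    if maxns >= 0:  # assumption: negative values disable the N filter
        n_count = sum(1 for base in seq.upper() if base == "N")
        if n_count > maxns:
            return False
    return True


# Mirrors minlen=500 maxns=10 on a toy 610 bp sequence with 10 Ns.
print(passes_filters("N" * 10 + "A" * 600, minlen=500, maxns=10))
```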

Java parameters

-Xmx: This will set Java's memory usage, overriding autodetection. -Xmx20g will specify 20 gigs of RAM, and -Xmx200m will specify 200 megs. The max is typically 85% of physical memory.
-eoom: This flag will cause the process to exit if an out-of-memory exception occurs. Requires Java 8u92+.
-da: Disable assertions.

Examples

Basic SSU Comparison

comparessu.sh in=ssu_sequences.fasta out=identity_results.txt

Compare SSU sequences in fractional mode (default), where each sequence is compared to one randomly selected sequence per taxonomic level.

All-to-All Comparison

comparessu.sh in=ssu_sequences.fasta out=all_comparisons.txt ata=t

Perform exhaustive all-to-all comparison of all SSU sequences. This is computationally intensive and should be used with smaller datasets.

Filtered Comparison

comparessu.sh in=ssu_sequences.fasta out=filtered_results.txt minlen=500 maxns=10

Compare only sequences that are at least 500bp long and contain no more than 10 N bases.

Multi-threaded Processing

comparessu.sh in=large_ssu_dataset.fasta out=results.txt t=16

Use 16 threads for faster processing of large SSU datasets.

Algorithm Details

CompareSSU implements a taxonomy-aware comparison designed specifically for small subunit (SSU) ribosomal RNA sequences. Sequences must be annotated with taxonomic IDs (taxID) in their headers so that each pairwise identity can be assigned to a taxonomic level.

Core Algorithm

The comparison process uses a multi-threaded approach where each thread processes sequences independently:

Sequence Loading: SSU sequences are loaded from SSUMap.r16SMap HashMap<Integer, byte[]> and stored as Read objects with numeric taxonomic IDs
Taxonomic Tree Construction: Uses TaxTree.loadTaxTree() to load the taxonomic tree, with TaxNode objects tracking the taxonomic hierarchy
Alignment Strategy: Uses SketchObject.align() method for pairwise sequence alignment, computing float identity scores between byte arrays
Comparison Modes: Supports both fractional (ata=false, default) and all-to-all (ata=true) comparison strategies with AtomicInteger coordination
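
One plausible reading of the AtomicInteger coordination is a shared cursor from which each thread claims the next query index, so that every query is processed exactly once. The Python sketch below illustrates that pattern under this assumption. It is not the tool's code: run_workers and compare are hypothetical names, and a lock-guarded integer stands in for AtomicInteger.

```python
import threading


def run_workers(queries, compare, threads=4):
    """Process every query exactly once across several worker threads."""
    next_index = 0            # shared cursor, standing in for AtomicInteger
    lock = threading.Lock()
    results = []

    def worker():
        nonlocal next_index
        local = []            # per-thread results, merged at the end
        while True:
            with lock:        # atomic get-and-increment
                i = next_index
                next_index += 1
            if i >= len(queries):
                break
            local.append((i, compare(queries[i])))
        with lock:
            results.extend(local)

    pool = [threading.Thread(target=worker) for _ in range(threads)]
    for t in pool:
        t.start()
    for t in pool:
        t.join()
    return sorted(results)    # sort by query index for a stable order


# Toy usage: the "comparison" here is just the sequence length.
print(run_workers(["ACGT", "AC", "A"], len, threads=3))
```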

Comparison Strategy

In fractional mode (ata=false), the algorithm calls Collections.shuffle() on the sequence list every 5 queries (querysProcessedT%5==0) so that sampling stays representative across taxonomic levels. For each query, tree.commonAncestor(qid, rid) finds the ancestor shared with a candidate reference, and a bit mask check, (mask&seen)==0, ensures the query is compared only once per taxonomic level. This reduces complexity from O(n²) to O(n).
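
The once-per-level logic can be sketched in Python on a toy taxonomy. This is a simplified illustration, not the tool's implementation. The tree, the level numbers (genus=2, life=8), the function names, and the constant align function are all hypothetical stand-ins for TaxTree, TaxNode levels, and SketchObject.align().

```python
import random

# Toy taxonomy: three sequences in genus 10, one in genus 20, root 1.
PARENT = {100: 10, 101: 10, 102: 10, 200: 20, 10: 1, 20: 1, 1: None}
LEVEL_OF = {10: 2, 20: 2, 1: 8}   # hypothetical level numbers


def common_ancestor(a, b, parent):
    """Return the lowest node that is an ancestor of both a and b."""
    ancestors = set()
    while a is not None:
        ancestors.add(a)
        a = parent.get(a)
    while b is not None and b not in ancestors:
        b = parent.get(b)
    return b


def fractional_compare(qid, ref_ids, parent, level_of, align):
    """Align qid against at most one reference per taxonomic level."""
    seen = 0                               # bit i set = level i already used
    hits = []
    for rid in ref_ids:
        if rid == qid:
            continue
        anc = common_ancestor(qid, rid, parent)
        if anc is None:
            continue
        mask = 1 << level_of[anc]
        if (mask & seen) == 0:             # first reference at this level
            seen |= mask
            hits.append((level_of[anc], align(qid, rid), qid, rid))
    return hits


def run_fractional(ids, parent, level_of, align, seed=0):
    """Process all queries, reshuffling the reference list every 5 queries."""
    rng = random.Random(seed)
    refs = list(ids)
    out = []
    for n, qid in enumerate(ids):
        if n % 5 == 0:
            rng.shuffle(refs)
        out.extend(fractional_compare(qid, refs, parent, level_of, align))
    return out


for hit in run_fractional([100, 101, 102, 200], PARENT, LEVEL_OF,
                          lambda q, r: 1.0):
    print(hit)
```

In this toy run, each sequence in genus 10 yields one genus-level hit and one life-level hit, even though two genus-level references are available.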

In all-to-all mode (ata=true), every sequence is compared to every other sequence using nested loops, providing complete pairwise identity information but at O(n²) computational cost.
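
To get a feel for the difference in cost, the following Python sketch estimates the number of alignments in each mode. It rests on two assumptions: all-to-all mode aligns ordered pairs and skips self-comparisons, and fractional mode performs at most one alignment per query for each of the nine taxonomic levels the tool reports. The function name alignment_counts is hypothetical.

```python
def alignment_counts(n, levels=9):
    """Estimate alignments for n sequences in each comparison mode."""
    all_to_all = n * (n - 1)       # assumption: ordered pairs, no self-pairs
    fractional_max = n * levels    # upper bound: one alignment per level
    return all_to_all, fractional_max


ata, frac = alignment_counts(100_000)
print(f"all-to-all: {ata:,} alignments; fractional: at most {frac:,}")
```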

Output Format

The output contains tab-delimited results with columns:

Level: Taxonomic level (strain, species, genus, family, order, class, phylum, superkingdom, life)
Identity: Sequence identity score (0.0 to 1.0)
QueryID: Taxonomic ID of the query sequence
RefID: Taxonomic ID of the reference sequence
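
A file with these four columns is straightforward to summarize downstream. The Python sketch below computes the mean identity per taxonomic level. It is a minimal example under stated assumptions: the function name is hypothetical, the sample rows are made up for illustration, and because the source does not say whether the file has a header line, the parser skips any line that starts with '#', has fewer than four fields, or has a non-numeric identity.

```python
import csv
from collections import defaultdict


def mean_identity_by_level(lines):
    """Return {level: mean identity} from tab-delimited result lines."""
    totals = defaultdict(lambda: [0.0, 0])     # level -> [sum, count]
    for row in csv.reader(lines, delimiter="\t"):
        if len(row) < 4 or row[0].startswith("#"):
            continue
        try:
            identity = float(row[1])
        except ValueError:                     # e.g. an unmarked header line
            continue
        totals[row[0]][0] += identity
        totals[row[0]][1] += 1
    return {level: s / c for level, (s, c) in totals.items()}


# Made-up sample rows: Level, Identity, QueryID, RefID.
sample = [
    "#Level\tIdentity\tQueryID\tRefID",
    "species\t0.99\t1001\t1002",
    "species\t0.97\t1001\t1003",
    "genus\t0.90\t1001\t2001",
]
print(mean_identity_by_level(sample))
```

To run it on a real result file, pass an open file handle, for example `with open("identity_results.txt") as f: mean_identity_by_level(f)`.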

Performance Characteristics

The tool includes several performance optimizations:

Multi-threading: ProcessThread instances are held in an ArrayList<ProcessThread> and coordinated through ThreadWaiter
Memory Management: Uses ByteStreamWriter for output buffering and ArrayList<Read> listCopy for thread-local sequence access
Random Sampling: Fractional comparison mode reduces computational complexity from O(n²) to O(n) using bit masks and ancestor checking
Sequence Filtering: Length and N-content filters using r.countNocalls()<=maxns eliminate low-quality sequences before comparison

Default memory is determined by the calcmem.sh freeRam calculation (4GB, 84% of available RAM) and can be overridden with the -Xmx parameter for larger datasets.

Support

For questions and support:

Email: bbushnell@lbl.gov
Documentation: bbmap.org