CompareSSU

Script: comparessu.sh Package: sketch Class: CompareSSU.java

Aligns SSU (small subunit ribosomal RNA) sequences to each other and reports their identity. Input sequences must be annotated with a taxID in their headers.

Basic Usage

comparessu.sh in=<input file> out=<output file>

Input may be fasta or fastq, compressed or uncompressed. Sequences must have taxonomic IDs (taxID) in their headers for proper comparison.
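Because the comparison depends on taxIDs in the headers, it can help to check an input file before running the tool. The following is a minimal Python sketch, not part of CompareSSU itself. It assumes headers carry the ID in the form tid|<number>, a convention used elsewhere in BBTools; the function names are hypothetical, and the pattern should be adjusted if your headers are annotated differently.

```python
import re

# Assumption: the taxID is written as "tid|<number>" in the header.
# This is not stated in this document; change the pattern if needed.
TAXID_PATTERN = re.compile(r"tid\|(\d+)")


def parse_taxid(header):
    """Return the taxID embedded in a header line, or None if absent."""
    match = TAXID_PATTERN.search(header)
    return int(match.group(1)) if match else None


def headers_missing_taxid(fasta_lines):
    """Collect FASTA header lines (starting with '>') that lack a taxID."""
    missing = []
    for line in fasta_lines:
        if line.startswith(">") and parse_taxid(line) is None:
            missing.append(line.rstrip("\n"))
    return missing
```

Passing the lines of an uncompressed fasta file to headers_missing_taxid() returns the headers that would need annotation first.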

Parameters

Parameters fall into three groups: standard file I/O and runtime options, processing parameters that control the comparison mode and sequence filtering, and Java runtime settings.

Standard parameters

in=<file>
Input sequences. Must be fasta or fastq format, can be compressed or uncompressed.
out=<file>
Output data file containing comparison results with identity scores.
t=
Set the number of threads; the default is the number of logical processors available on the system.
overwrite=f
(ow) Set to false to force the program to abort rather than overwrite an existing file.
showspeed=t
(ss) Set to 'f' to suppress display of processing speed statistics during execution.
ziplevel=2
(zl) Set to 1 (lowest) through 9 (max) to change compression level; lower compression is faster.
reads=-1
If positive, quit after processing this many input sequences. Default -1 processes all sequences.

Processing parameters

ata=f
Do an all-to-all comparison. When false (default), each sequence will only be compared to one other randomly-selected sequence per taxonomic level, which is much faster for large datasets.
minlen=0
Ignore sequences shorter than this length threshold. Useful for filtering out very short sequences that may not provide meaningful comparisons.
maxlen=BIG
Ignore sequences longer than this length threshold. Default is effectively unlimited (Integer.MAX_VALUE).
maxns=-1
If positive, ignore sequences with more than this many N bases. Default -1 accepts sequences regardless of N content.
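The three filters above can be read as a single predicate applied to each sequence. The Python sketch below illustrates that reading; it is not the tool's Java code. The function name is hypothetical, and two details are assumptions: lowercase n is counted as an N base, and maxns=0 is treated as "no N bases allowed" (only negative values disable the filter).

```python
def passes_filters(seq, minlen=0, maxlen=2**31 - 1, maxns=-1):
    """Return True if a sequence would be kept under minlen/maxlen/maxns.

    Defaults mirror the documented ones: no minimum length, an
    effectively unlimited maximum (Integer.MAX_VALUE), and no N limit.
    """
    length = len(seq)
    if length < minlen or length > maxlen:
        return False
    if maxns >= 0:  # assumption: a negative value disables the N filter
        n_count = sum(1 for base in seq if base in "Nn")
        if n_count > maxns:
            return False
    return True
```

For example, passes_filters(seq, minlen=500, maxns=10) corresponds to the settings minlen=500 maxns=10 on the command line.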

Java Parameters

-Xmx
This will set Java's memory usage, overriding autodetection. -Xmx20g will specify 20 gigs of RAM, and -Xmx200m will specify 200 megs. The max is typically 85% of physical memory.
-eoom
This flag will cause the process to exit if an out-of-memory exception occurs. Requires Java 8u92+.
-da
Disable assertions.

Examples

Basic SSU Comparison

comparessu.sh in=ssu_sequences.fasta out=identity_results.txt

Compare SSU sequences in fractional mode (default), where each sequence is compared to one randomly selected sequence per taxonomic level.

All-to-All Comparison

comparessu.sh in=ssu_sequences.fasta out=all_comparisons.txt ata=t

Perform exhaustive all-to-all comparison of all SSU sequences. This is computationally intensive and should be used with smaller datasets.

Filtered Comparison

comparessu.sh in=ssu_sequences.fasta out=filtered_results.txt minlen=500 maxns=10

Compare only sequences that are at least 500bp long and contain no more than 10 N bases.

Multi-threaded Processing

comparessu.sh in=large_ssu_dataset.fasta out=results.txt t=16

Use 16 threads for faster processing of large SSU datasets.

Algorithm Details

CompareSSU implements a taxonomic-aware alignment algorithm specifically designed for Small Subunit ribosomal RNA sequences. The tool requires sequences to be annotated with taxonomic IDs (taxID) in their headers for proper phylogenetic comparison.

Core Algorithm

The comparison process is multi-threaded, with each thread processing sequences independently.

Comparison Strategy

In fractional mode (ata=false), the sequence list is reshuffled with Collections.shuffle() every 5 queries (querysProcessedT%5==0) so that sampling stays representative across taxonomic levels. For each query and candidate reference, tree.commonAncestor(qid, rid) finds the shared ancestor, and a bit-mask check, (mask&seen)==0, ensures the query is aligned only once per taxonomic level. This reduces the number of alignments from O(n²) to roughly O(n).

In all-to-all mode (ata=true), every sequence is compared to every other sequence using nested loops, providing complete pairwise identity information but at O(n²) computational cost.
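The fractional strategy can be illustrated with a small Python sketch. This is a simplified model under stated assumptions, not the CompareSSU source: the taxonomy (PARENT, LEVEL), the function names, and the choice to use the level number of the common ancestor as the bit position are all hypothetical. It keeps the two ideas described above: reshuffling the pool every 5 queries and using a seen bit mask to accept at most one partner per taxonomic level.

```python
import random

# Hypothetical toy taxonomy: child -> parent (the root points to itself),
# plus a level number per node (0=species, 1=genus, 2=family).
PARENT = {1: 10, 2: 10, 3: 11, 10: 100, 11: 100, 100: 100}
LEVEL = {1: 0, 2: 0, 3: 0, 10: 1, 11: 1, 100: 2}


def ancestors(node):
    """Return the node followed by all of its ancestors up to the root."""
    chain = [node]
    while PARENT[node] != node:
        node = PARENT[node]
        chain.append(node)
    return chain


def common_ancestor(a, b):
    """Return the lowest node shared by the lineages of a and b."""
    lineage_a = set(ancestors(a))
    for node in ancestors(b):
        if node in lineage_a:
            return node
    raise ValueError("nodes share no ancestor")


def fractional_pairs(seqs, rng):
    """Pick at most one partner per taxonomic level for every query.

    seqs is a list of (name, taxid) tuples; the result is a list of
    (query_name, reference_name, level) tuples to be aligned.
    """
    pool = list(seqs)
    pairs = []
    for processed, (qname, qid) in enumerate(seqs):
        if processed % 5 == 0:  # reshuffle every 5 queries
            rng.shuffle(pool)
        seen = 0  # one bit per taxonomic level already covered
        for rname, rid in pool:
            if rname == qname:
                continue
            level = LEVEL[common_ancestor(qid, rid)]
            mask = 1 << level
            if (mask & seen) == 0:
                seen |= mask
                pairs.append((qname, rname, level))
    return pairs
```

Each query yields at most as many alignments as there are taxonomic levels, regardless of how many sequences are in the pool, which is where the saving over all-to-all mode comes from.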

Output Format

The output contains tab-delimited comparison results, including identity scores.
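Since the column layout is not listed here, a downstream script can load the file generically and inspect the fields before relying on any of them. The Python sketch below does only that. The function name is hypothetical, and skipping lines that start with '#' is an assumption about how header or comment lines might be marked.

```python
def read_results(lines):
    """Split tab-delimited result lines into lists of string fields."""
    rows = []
    for line in lines:
        line = line.rstrip("\n")
        # Assumption: blank lines and '#'-prefixed lines carry no data.
        if not line or line.startswith("#"):
            continue
        rows.append(line.split("\t"))
    return rows
```

Calling read_results() on an open file handle for the out= file returns one list of fields per result line; no column meanings are assumed.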

Performance Characteristics

The tool includes several performance optimizations:

By default, memory is set through the freeRam calculation in calcmem.sh, which uses a 4GB default and allocates 84% of available RAM. Use the -Xmx parameter to override this for larger datasets.

Support

For questions and support: