CompareSSU
Aligns SSUs to each other and reports identity. This requires sequences annotated with a taxID in their header.
Basic Usage
comparessu.sh in=<input file> out=<output file>
Input may be fasta or fastq, compressed or uncompressed. Sequences must have taxonomic IDs (taxID) in their headers for proper comparison.
Parameters
Parameters fall into three groups: standard file I/O and runtime options, processing parameters (comparison mode and sequence filters), and Java runtime settings.
Standard parameters
- in=<file>
- Input sequences. Must be fasta or fastq format, can be compressed or uncompressed.
- out=<file>
- Output data file containing comparison results with identity scores.
- t=
- Set the number of threads; the default is the number of logical processors available on the system.
- overwrite=f
- (ow) Set to false to force the program to abort rather than overwrite an existing file.
- showspeed=t
- (ss) Set to 'f' to suppress display of processing speed statistics during execution.
- ziplevel=2
- (zl) Set to 1 (lowest) through 9 (max) to change compression level; lower compression is faster.
- reads=-1
- If positive, quit after processing this many input sequences. Default -1 processes all sequences.
Processing parameters
- ata=f
- Do an all-to-all comparison. When false (default), each sequence will only be compared to one other randomly-selected sequence per taxonomic level, which is much faster for large datasets.
- minlen=0
- Ignore sequences shorter than this length threshold. Useful for filtering out very short sequences that may not provide meaningful comparisons.
- maxlen=BIG
- Ignore sequences longer than this length threshold. Default is effectively unlimited (Integer.MAX_VALUE).
- maxns=-1
- If positive, ignore sequences with more than this many N bases. Default -1 accepts sequences regardless of N content.
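The filters above can be pictured as a single predicate applied to each sequence before comparison. The following is a minimal sketch, not the tool's own code: the class and method names are hypothetical, sequences are plain strings rather than Read objects, and a negative maxns disables the N check to match the documented default of -1. How the tool treats maxns=0 is not stated here; this sketch reads it as "allow no Ns".

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative length and N-content filter; hypothetical names, not CompareSSU code. */
final class SequenceFilterSketch {

    /** Counts N/n bases in a sequence. */
    static int countNs(String bases) {
        int n = 0;
        for (int i = 0; i < bases.length(); i++) {
            char c = bases.charAt(i);
            if (c == 'N' || c == 'n') { n++; }
        }
        return n;
    }

    /** True if the sequence passes minlen, maxlen, and maxns; a negative maxns disables the N check. */
    static boolean passes(String bases, int minlen, int maxlen, int maxns) {
        int len = bases.length();
        if (len < minlen || len > maxlen) { return false; }
        return maxns < 0 || countNs(bases) <= maxns;
    }

    public static void main(String[] args) {
        List<String> input = List.of("ACGTACGT", "ACG", "ACNNNNGT");
        List<String> kept = new ArrayList<>();
        for (String s : input) {
            // minlen=5, maxlen unlimited, maxns=2
            if (passes(s, 5, Integer.MAX_VALUE, 2)) { kept.add(s); }
        }
        System.out.println(kept); // prints [ACGTACGT]
    }
}
```

With minlen=5 and maxns=2, the 3-base sequence is dropped for length and the sequence with four Ns is dropped for N content.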
Java Parameters
- -Xmx
- This will set Java's memory usage, overriding autodetection. -Xmx20g will specify 20 gigs of RAM, and -Xmx200m will specify 200 megs. The max is typically 85% of physical memory.
- -eoom
- This flag will cause the process to exit if an out-of-memory exception occurs. Requires Java 8u92+.
- -da
- Disable assertions.
Examples
Basic SSU Comparison
comparessu.sh in=ssu_sequences.fasta out=identity_results.txt
Compare SSU sequences in fractional mode (default), where each sequence is compared to one randomly selected sequence per taxonomic level.
All-to-All Comparison
comparessu.sh in=ssu_sequences.fasta out=all_comparisons.txt ata=t
Perform exhaustive all-to-all comparison of all SSU sequences. This is computationally intensive and should be used with smaller datasets.
Filtered Comparison
comparessu.sh in=ssu_sequences.fasta out=filtered_results.txt minlen=500 maxns=10
Compare only sequences that are at least 500bp long and contain no more than 10 N bases.
Multi-threaded Processing
comparessu.sh in=large_ssu_dataset.fasta out=results.txt t=16
Use 16 threads for faster processing of large SSU datasets.
Algorithm Details
CompareSSU implements a taxonomy-aware alignment procedure designed for small subunit (SSU) ribosomal RNA sequences. The taxonomic ID (taxID) in each sequence header places the sequence in the taxonomic tree, which is why it is required for comparison.
Core Algorithm
The comparison runs in multiple threads, each of which processes query sequences independently. The main components are:
- Sequence Loading: SSU sequences are loaded from the SSUMap.r16SMap map (a HashMap<Integer, byte[]>) and stored as Read objects with numeric taxonomic IDs
- Taxonomic Tree Construction: Uses TaxTree.loadTaxTree() to load the taxonomy, with TaxNode objects representing each node's position in the hierarchy
- Alignment Strategy: Uses the SketchObject.align() method for pairwise sequence alignment, which computes a float identity score between two byte arrays
- Comparison Modes: Supports both fractional (ata=false, default) and all-to-all (ata=true) comparison strategies, with an AtomicInteger used for coordination between threads
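One common way an AtomicInteger coordinates worker threads is as a shared "next query index" counter, so that every query is claimed by exactly one thread. Whether CompareSSU uses its AtomicInteger in exactly this way is an assumption; the sketch below only illustrates the pattern, with hypothetical names and a counter increment standing in for the alignment work.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

/** Illustrative work distribution with a shared counter; hypothetical, not CompareSSU code. */
final class SharedCounterSketch {

    /** Each worker claims the next unprocessed query index until none remain; returns per-worker counts. */
    static int[] run(int queries, int threads) {
        AtomicInteger next = new AtomicInteger(0);
        int[] processed = new int[threads]; // slot t is written only by worker t
        List<Thread> workers = new ArrayList<>();
        for (int t = 0; t < threads; t++) {
            final int id = t;
            Thread w = new Thread(() -> {
                // getAndIncrement hands each index to exactly one worker.
                for (int q = next.getAndIncrement(); q < queries; q = next.getAndIncrement()) {
                    processed[id]++; // stand-in for aligning query q
                }
            });
            workers.add(w);
            w.start();
        }
        for (Thread w : workers) {
            try {
                w.join(); // join makes each worker's writes visible to the caller
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("interrupted while waiting for workers", e);
            }
        }
        return processed;
    }

    public static void main(String[] args) {
        int total = 0;
        for (int c : run(1000, 4)) { total += c; }
        System.out.println(total); // prints 1000
    }
}
```

The split between workers varies from run to run, but the total always equals the number of queries because no index is handed out twice.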
Comparison Strategy
In fractional mode (ata=false), the algorithm calls Collections.shuffle() on the sequence list every 5 queries (querysProcessedT%5==0) so that the reference chosen at each level varies between queries. For each candidate reference, tree.commonAncestor(qid, rid) finds the shared ancestor of query and reference, and a bit mask check, (mask&seen)==0, ensures the query is aligned only once per taxonomic level. Because the number of levels is fixed, this reduces the number of alignments from O(n²) to O(n).
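The once-per-level check can be illustrated with a toy taxonomy. This is a minimal sketch under stated assumptions: the tree, the level numbering (0=species, 1=genus, 2=family), and all names are hypothetical, the commonAncestor here is a simple stand-in for tree.commonAncestor, and the reference list is kept in a fixed order (no shuffle) so the result is reproducible.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Illustrative once-per-level selection with a bit mask; hypothetical, not CompareSSU code. */
final class LevelMaskSketch {

    /** Toy taxonomy: node id -> parent id (the root is its own parent). */
    static final Map<Integer, Integer> PARENT = new HashMap<>();
    /** Toy level number per node: 0=species, 1=genus, 2=family. */
    static final Map<Integer, Integer> LEVEL = new HashMap<>();

    static {
        add(1, 1, 2);                       // family 1 (root of the toy tree)
        add(10, 1, 1); add(11, 1, 1);       // genera 10 and 11
        add(100, 10, 0); add(101, 10, 0); add(102, 10, 0); // species in genus 10
        add(110, 11, 0); add(111, 11, 0);   // species in genus 11
    }

    static void add(int id, int parent, int level) {
        PARENT.put(id, parent);
        LEVEL.put(id, level);
    }

    /** Climbs from whichever node is lower in the tree until both meet. */
    static int commonAncestor(int a, int b) {
        while (a != b) {
            if (LEVEL.get(a) <= LEVEL.get(b)) { a = PARENT.get(a); }
            else { b = PARENT.get(b); }
        }
        return a;
    }

    /** Returns the first reference encountered at each ancestor level for this query. */
    static List<Integer> pickRefs(int query, List<Integer> refs) {
        List<Integer> chosen = new ArrayList<>();
        long seen = 0; // one bit per taxonomic level already compared
        for (int ref : refs) {
            if (ref == query) { continue; }
            long mask = 1L << LEVEL.get(commonAncestor(query, ref));
            if ((mask & seen) == 0) { // level not yet covered for this query
                seen |= mask;
                chosen.add(ref); // the alignment would happen here
            }
        }
        return chosen;
    }

    public static void main(String[] args) {
        System.out.println(pickRefs(100, List.of(101, 102, 110, 111))); // prints [101, 110]
    }
}
```

Species 101 covers the genus level and 110 covers the family level; 102 and 111 are skipped because their levels are already marked in the mask.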
In all-to-all mode (ata=true), every sequence is compared to every other sequence using nested loops, providing complete pairwise identity information but at O(n²) computational cost.
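For contrast with fractional mode, the nested-loop structure can be sketched as follows. The identity function is a deliberately simple stand-in (matching positions divided by the longer length), not the SketchObject.align() alignment, and all names are hypothetical. This sketch visits each unordered pair once; whether the tool also compares each pair in both directions is not stated here.

```java
import java.nio.charset.StandardCharsets;
import java.util.List;

/** Illustrative all-to-all loop; hypothetical, not CompareSSU code. */
final class AllToAllSketch {

    /** Stand-in identity: matching positions divided by the longer length. Not an alignment. */
    static float identity(byte[] a, byte[] b) {
        int shorter = Math.min(a.length, b.length);
        int longer = Math.max(a.length, b.length);
        if (longer == 0) { return 0f; }
        int matches = 0;
        for (int i = 0; i < shorter; i++) {
            if (a[i] == b[i]) { matches++; }
        }
        return matches / (float) longer;
    }

    /** Compares every unordered pair once and returns the number of comparisons made. */
    static int compareAll(List<byte[]> seqs) {
        int comparisons = 0;
        for (int i = 0; i < seqs.size(); i++) {
            for (int j = i + 1; j < seqs.size(); j++) {
                identity(seqs.get(i), seqs.get(j)); // result would be written to the output
                comparisons++;
            }
        }
        return comparisons;
    }

    static byte[] bytes(String s) {
        return s.getBytes(StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) {
        List<byte[]> seqs = List.of(bytes("ACGT"), bytes("ACGA"), bytes("TTTT"), bytes("ACG"));
        System.out.println(compareAll(seqs));                       // prints 6
        System.out.println(identity(bytes("ACGT"), bytes("ACGA"))); // prints 0.75
    }
}
```

Four sequences already need six comparisons, and the count grows as n(n-1)/2, which is why this mode suits smaller datasets.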
Output Format
The output contains tab-delimited results with columns:
- Level: Taxonomic level (strain, species, genus, family, order, class, phylum, superkingdom, life)
- Identity: Sequence identity score (0.0 to 1.0)
- QueryID: Taxonomic ID of the query sequence
- RefID: Taxonomic ID of the reference sequence
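A downstream script might read these four columns back as shown below. This is a minimal sketch assuming exactly four tab-separated fields per data line in the order listed above; the class name, the four-decimal identity formatting, and the example line are made up for illustration, and any header line in the real output would need to be skipped first.

```java
import java.util.Locale;

/** Illustrative parser for one result line; hypothetical, not CompareSSU code. */
final class ResultLineSketch {
    final String level;
    final float identity;
    final int queryId;
    final int refId;

    ResultLineSketch(String level, float identity, int queryId, int refId) {
        this.level = level;
        this.identity = identity;
        this.queryId = queryId;
        this.refId = refId;
    }

    /** Splits a line on tabs and converts the fields; rejects lines without exactly 4 fields. */
    static ResultLineSketch parse(String line) {
        String[] f = line.split("\t");
        if (f.length != 4) {
            throw new IllegalArgumentException("expected 4 tab-separated fields, got " + f.length);
        }
        return new ResultLineSketch(f[0], Float.parseFloat(f[1]),
                Integer.parseInt(f[2]), Integer.parseInt(f[3]));
    }

    /** Writes the record back as a tab-delimited line (4 decimal places is an assumption). */
    String format() {
        return String.format(Locale.ROOT, "%s\t%.4f\t%d\t%d", level, identity, queryId, refId);
    }

    public static void main(String[] args) {
        ResultLineSketch r = parse("family\t0.9731\t562\t622"); // made-up example line
        System.out.println(r.level + " " + r.queryId + " " + r.refId); // prints family 562 622
        System.out.println(r.format());
    }
}
```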
Performance Characteristics
The tool includes several performance optimizations:
- Multi-threading: ProcessThread instances are held in an ArrayList<ProcessThread> and coordinated with ThreadWaiter
- Memory Management: Output is buffered through ByteStreamWriter, and each thread works on its own ArrayList<Read> copy (listCopy) of the sequence list
- Random Sampling: Fractional comparison mode reduces the number of alignments from O(n²) to O(n) using bit masks and ancestor checking
- Sequence Filtering: Length and N-content filters (for example, r.countNocalls()<=maxns) remove low-quality sequences before comparison
By default, the launch script sets the Java heap size through the freeRam calculation in calcmem.sh, using a 4GB default and 84% of available RAM. Use the -Xmx parameter to override this for larger datasets.
Support
For questions and support:
- Email: bbushnell@lbl.gov
- Documentation: bbmap.org