AddSSU

Basic Usage

addssu.sh in=a.sketch out=b.sketch 16S=16S.fa 18S=18S.fa

This tool modifies existing sketch files by adding, removing, or replacing small subunit ribosomal RNA (SSU) sequences, streaming the input line by line. Input sketches must contain "#SZ:" header lines with taxonomic metadata. SSU fasta files must use sequence names that start with "tid|#", where # is the NCBI taxonomic identifier. Internally, each sketch header is collected into a SketchHeader object (an inner class that stores the header fields in an ArrayList<String>), matched to SSU sequences by the TaxID that parseTaxID() extracts, updated as a unit by processHeader(), and rewritten through a ByteBuilder by toBytes().
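To illustrate the "tid|#" naming convention, here is a minimal Python sketch that pulls the taxID out of a fasta header. It is not the tool's own (Java) parser, and the function name taxid_from_fasta_header is hypothetical.

```python
import re

# Matches an optional '>' followed by 'tid|' and a run of digits.
_TID_PREFIX = re.compile(r"^>?tid\|(\d+)")


def taxid_from_fasta_header(header):
    """Return the taxID from a header such as '>tid|562|...', or None if absent."""
    match = _TID_PREFIX.match(header)
    return int(match.group(1)) if match else None
```

A header beginning with ">tid|562|" yields 562. A header that has not been renamed yields None, so such a record could not be matched to a sketch by TaxID.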

Parameters

Parameters are organized into standard input/output files, additional SSU files, processing control flags, and Java runtime settings.

Standard parameters

in=<file>: Input sketch file. Required parameter specifying the sketch file to be processed.
out=<file>: Output sketch file. Required parameter specifying where to write the modified sketch.

Additional file parameters (optional)

16S=<file>: A fasta file of 16S sequences. These should be renamed so that they start with tid|# where # is the taxID. Should not contain organelle rRNA. Can be set to "auto" to use the default 16S database.
18S=<file>: A fasta file of 18S sequences. These should be renamed so that they start with tid|# where # is the taxID. Should not contain organelle rRNA. Can be set to "auto" to use the default 18S database.
tree=auto: Path to TaxTree, if performing prokaryote/eukaryote-specific operations. Set to "auto" to use the default taxonomy tree. Required for taxonomy-aware processing flags.

Processing parameters

preferSSUMap=f: When true, prefer SSU sequences from the SSU map over existing sequences in sketches for all organisms. Default: false.
preferSSUMapEuks=f: When true, prefer SSU sequences from the SSU map over existing sequences in sketches, but only for eukaryotic organisms. Requires a taxonomy tree. Default: false.
preferSSUMapProks=f: When true, prefer SSU sequences from the SSU map over existing sequences in sketches, but only for prokaryotic organisms. Requires a taxonomy tree. Default: false.
SSUMapOnly=f: When true, use only SSU sequences from the SSU map, replacing all existing SSU sequences in sketches for all organisms. Default: false.
SSUMapOnlyEuks=f: When true, use only SSU sequences from the SSU map for eukaryotic organisms, clearing existing SSU sequences first. Requires a taxonomy tree. Default: false.
SSUMapOnlyProks=f: When true, use only SSU sequences from the SSU map for prokaryotic organisms, clearing existing SSU sequences first. Requires a taxonomy tree. Default: false.
clear16S=f: When true, remove all existing 16S sequences from all sketches. Default: false.
clear18S=f: When true, remove all existing 18S sequences from all sketches. Default: false.
clear16SEuks=f: When true, remove existing 16S sequences from eukaryotic organisms only. Requires a taxonomy tree. Default: false.
clear18SEuks=f: When true, remove existing 18S sequences from eukaryotic organisms only. Requires a taxonomy tree. Default: false.
clear16SProks=f: When true, remove existing 16S sequences from prokaryotic organisms only. Requires a taxonomy tree. Default: false.
clear18SProks=f: When true, remove existing 18S sequences from prokaryotic organisms only. Requires a taxonomy tree. Default: false.
clearAll=f: When true, remove all existing SSU sequences (both 16S and 18S) from all sketches. Equivalent to setting both clear16S=t and clear18S=t. Default: false.
lines=<number>: Maximum number of lines to process from the input sketch file. Set to -1 or omit for unlimited processing. Default: unlimited.
verbose=f: Enable verbose output for debugging and detailed processing information. Shows SSU map loading details and per-sketch processing information. Default: false.
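
The way the clear flags combine can be sketched in Python as below. This only restates the flag descriptions above and is not the tool's Java code; the function apply_clear_flags and the is_prok/is_euk arguments (which stand in for taxonomy-tree lookups) are hypothetical.

```python
def apply_clear_flags(r16s, r18s, flags, is_prok=False, is_euk=False):
    """Return (r16s, r18s) after applying clear* flags; cleared values become None.

    flags maps parameter names from the list above to booleans.
    """
    def on(name):
        return bool(flags.get(name, False))

    # clearAll is treated as clear16S plus clear18S.
    clear16 = (on("clearAll") or on("clear16S")
               or (is_prok and on("clear16SProks"))
               or (is_euk and on("clear16SEuks")))
    clear18 = (on("clearAll") or on("clear18S")
               or (is_prok and on("clear18SProks"))
               or (is_euk and on("clear18SEuks")))
    return (None if clear16 else r16s, None if clear18 else r18s)
```

The Proks and Euks variants have no effect unless the organism's domain is known, which is why they require a taxonomy tree.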

Java Parameters

-Xmx: This will set Java's memory usage, overriding autodetection. -Xmx20g will specify 20 gigs of RAM, and -Xmx200m will specify 200 megs. The max is typically 85% of physical memory.
-da: Disable assertions.

Examples

Basic SSU Addition

addssu.sh in=input.sketch out=output.sketch 16S=bacterial_16S.fa 18S=eukaryotic_18S.fa

Add 16S and 18S sequences to sketches that don't already have them, using the provided fasta files.

Replace All SSU Sequences

addssu.sh in=input.sketch out=output.sketch 16S=auto 18S=auto preferSSUMap=t

Replace existing SSU sequences with those from the default SSU database for all organisms.

Clear SSU from Prokaryotes Only

addssu.sh in=input.sketch out=output.sketch tree=auto clear16SProks=t clear18SProks=t

Remove all SSU sequences from prokaryotic organisms while preserving eukaryotic SSU sequences.

Eukaryote-Specific SSU Replacement

addssu.sh in=input.sketch out=output.sketch 18S=euk_18S.fa tree=auto preferSSUMapEuks=t

Replace 18S sequences in eukaryotic organisms only, using the provided fasta file and preferring map sequences over existing ones.

Verbose Processing with Line Limit

addssu.sh in=large.sketch out=sample.sketch 16S=auto lines=1000 verbose=t

Process only the first 1000 lines of a large sketch file with detailed output for debugging.

Algorithm Details

AddSSU is a single-pass streaming processor. The processInner() method reads the input sketch one line at a time with ByteFile.nextLine() and writes results through a ByteStreamWriter. Header lines (those starting with '#') are distinguished from other lines by a byte-level check and collected in a SketchHeader object, so that each sketch's metadata is updated and rewritten as a unit.

Core Processing Architecture

Each input line passes through the following stages:

Line Classification: Input lines are classified as header lines (starting with '#') or sequence data using byte-level pattern matching
Header State Machine: SketchHeader objects are created when "#SZ:" lines are encountered, accumulating subsequent "#16S:" and "#18S:" lines via the addLine() method
Atomic Processing: Complete headers are buffered using the SketchHeader class and processed atomically via processHeader() before output generation
ByteBuilder Output: Headers are reconstructed using the ByteBuilder class through the toBytes() method with tab-separated field formatting
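
The header grouping described above can be sketched in Python as follows. This is an illustration, not the tool's Java implementation; the function name group_sketch_lines is hypothetical, and the sketch assumes that any line other than a "#16S:" or "#18S:" line ends the header that is currently being collected.

```python
def group_sketch_lines(lines):
    """Yield ('header', [lines]) for each buffered header and ('data', line) otherwise."""
    pending = None  # header lines collected since the last "#SZ:" line
    for line in lines:
        if line.startswith("#SZ:"):
            if pending is not None:
                yield ("header", pending)
            pending = [line]
        elif pending is not None and line.startswith(("#16S:", "#18S:")):
            pending.append(line)
        else:
            # First non-header line: the buffered header is complete.
            if pending is not None:
                yield ("header", pending)
                pending = None
            yield ("data", line)
    if pending is not None:
        yield ("header", pending)
```

Because a header is only emitted once it is complete, any SSU changes can be applied to the whole header before a single byte of it is written.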

Taxonomy-Aware Processing

Organism-specific processing uses TaxTree integration with TaxID-based lookups:

TaxID Extraction: The parseTaxID() method splits the header on tabs with Tools.tabPattern.split(), scans the fields for an "ID:" or "TAXID:" prefix, and converts the value with Integer.parseInt()
Prokaryote Detection: TaxTree.isProkaryote(header.tid), guarded by the check tid>0 && tid<SketchObject.minFakeID, identifies bacterial and archaeal organisms for conditional processing
Eukaryote Detection: TaxTree.isEukaryote(header.tid), guarded by the same check, identifies eukaryotic organisms for conditional processing
Fake ID Filtering: Only TaxIDs that satisfy tid>0 && tid<SketchObject.minFakeID are passed to the tree, which excludes missing and synthetic identifiers
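
A rough Python equivalent of the TaxID extraction and range check is shown below. The names parse_taxid and is_real_taxid are hypothetical, the return value of -1 for a missing ID is an assumption, and MIN_FAKE_ID is a placeholder rather than the actual value of SketchObject.minFakeID.

```python
# Placeholder threshold for illustration only; the real constant lives in SketchObject.
MIN_FAKE_ID = 1_900_000_000


def parse_taxid(header_line):
    """Return the TaxID from a tab-separated header line, or -1 if none is found."""
    for field in header_line.split("\t"):
        for prefix in ("ID:", "TAXID:"):
            if field.startswith(prefix):
                try:
                    return int(field[len(prefix):])
                except ValueError:
                    return -1
    return -1


def is_real_taxid(tid, min_fake_id=MIN_FAKE_ID):
    """True when tid is positive and below the synthetic-ID threshold."""
    return 0 < tid < min_fake_id
```

Only IDs that pass is_real_taxid would then be worth handing to a taxonomy lookup.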

SSU Map Integration

The SSUMap class provides HashMap-based access to curated rRNA sequences with automatic loading:

Lazy Loading: SSUMap.load(outstream) is called once from processInner() (line 324 of the source), initializing the r16SMap and r18SMap HashMaps with synchronized access
HashMap Retrieval: SSUMap.r16SMap.get(header.tid) and SSUMap.r18SMap.get(header.tid) look up sequences by TaxID in O(1) expected time using Integer keys
Preference Logic: The preferMap boolean controls whether a map sequence overrides the existing header.r16S or header.r18S byte array
Auto File Resolution: A value of "auto" is resolved to a path by the static methods TaxTree.default16SFile() and TaxTree.default18SFile() (lines 164-165 of the source)
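
The choice between an existing sequence and a map sequence might look like the following Python sketch. It is inferred from the parameter descriptions (add when missing, override when preferred, discard the existing sequence in map-only mode) and is not copied from the Java source; choose_ssu is a hypothetical name.

```python
def choose_ssu(existing, from_map, prefer_map=False, map_only=False):
    """Pick the SSU sequence to keep for one sketch; None means no sequence."""
    if map_only:
        # Existing sequence is discarded; only the map value (possibly None) survives.
        return from_map
    if from_map is not None and (existing is None or prefer_map):
        return from_map
    return existing


# A dict keyed by TaxID stands in for SSUMap.r16SMap in this sketch.
ssu_map = {562: b"ACGT"}
chosen = choose_ssu(existing=None, from_map=ssu_map.get(562))
```

With the defaults, a sketch that already carries a sequence keeps it, which matches the behavior of the basic usage example.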

Memory Management Strategy

Processing uses streaming architecture with controlled memory allocation:

Single-Pass Processing: ByteFile.nextLine() provides streaming input reading one line at a time without loading entire sketch files into memory
Header Buffering: SketchHeader instances temporarily buffer complete headers using ArrayList<String> fields collection initialized with line.length()+2 capacity
ByteBuilder Efficiency: Output reconstruction uses ByteBuilder(1000) constructor with pre-allocated 1000-byte capacity for tab-separated string concatenation
Line Limiting: maxLines parameter (default Long.MAX_VALUE) enables early termination via break statement when linesProcessed>=maxLines
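
The streaming loop with a line limit can be illustrated in Python as below. The function copy_lines is hypothetical and uses in-memory streams in place of ByteFile and ByteStreamWriter; treating None or a negative limit as unlimited is an assumption based on the lines parameter description.

```python
import io


def copy_lines(src, dst, max_lines=None):
    """Copy lines from src to dst one at a time, stopping after max_lines lines."""
    unlimited = max_lines is None or max_lines < 0
    processed = 0
    for line in src:  # reads one line at a time; the whole file is never held in memory
        if not unlimited and processed >= max_lines:
            break
        dst.write(line)
        processed += 1
    return processed


# In-memory streams stand in for the input and output sketch files.
source = io.StringIO("l1\nl2\nl3\n")
sink = io.StringIO()
copied = copy_lines(source, sink, max_lines=2)
```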

Header Reconstruction Algorithm

The toBytes() method rebuilds each sketch header in its original format:

Field Preservation: Original header fields are maintained in ArrayList iteration order using bb.tab().append(fields.get(i)) with tab separator insertion
SSU Length Calculation: The 16S and 18S length fields are computed from r16S.length and r18S.length and appended in the form "16S:" + length and "18S:" + length
Sequence Line Generation: "#16S:" and "#18S:" prefixed lines are appended using ByteBuilder.nl().append("#16S:").append(r16S) method chaining
Format Compatibility: Output maintains exact sketch file format using '#' prefix, tab separation, and newline generation for downstream tool compatibility
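
A minimal Python sketch of this reconstruction is shown below. The function header_to_bytes is hypothetical; it assumes that the fields list holds the original header fields without any earlier "16S:" or "18S:" length fields, and it leaves the trailing newline to the caller.

```python
def header_to_bytes(fields, r16s=None, r18s=None):
    """Rebuild a header: tab-joined fields, SSU length fields, then SSU lines."""
    parts = list(fields)
    if r16s is not None:
        parts.append("16S:%d" % len(r16s))
    if r18s is not None:
        parts.append("18S:%d" % len(r18s))
    out = "\t".join(parts).encode("ascii")
    # Each sequence goes on its own '#'-prefixed line after the main header line.
    if r16s is not None:
        out += b"\n#16S:" + r16s
    if r18s is not None:
        out += b"\n#18S:" + r18s
    return out
```

For instance, the fields "#SZ:3" and "ID:562" with a four-base 16S sequence produce a first line ending in "16S:4", followed by a "#16S:" line carrying the sequence.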

Processing Statistics Tracking

The tool keeps counters that summarize each run:

Input Tracking: The long counters r16Sin and r18Sin increment each time addLine() parses an existing SSU sequence
Map Addition Tracking: The long counters r16SfromMap and r18SfromMap increment each time a sequence is added from an SSUMap lookup
Output Verification: The r16Sout and r18Sout counters record how many output headers carry each SSU type, adding (header.r16S==null ? 0 : 1) and its 18S counterpart per header
Byte/Line Accounting: bytesProcessed, linesProcessed, bytesOut, and linesOut provide throughput metrics; each line adds line.length+1 bytes, the extra byte accounting for the newline
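
The bookkeeping above can be sketched in Python as follows. The class RunStats and its method names are hypothetical; only the counter semantics come from the description above.

```python
class RunStats:
    """Counters for one run, mirroring the roles of the counters described above."""

    def __init__(self):
        self.lines_processed = 0
        self.bytes_processed = 0
        self.r16s_in = 0
        self.r18s_in = 0
        self.r16s_from_map = 0
        self.r18s_from_map = 0
        self.r16s_out = 0
        self.r18s_out = 0

    def saw_input_line(self, line):
        """Account for one input line (bytes, without its newline)."""
        self.lines_processed += 1
        self.bytes_processed += len(line) + 1  # +1 for the newline
        if line.startswith(b"#16S:"):
            self.r16s_in += 1
        elif line.startswith(b"#18S:"):
            self.r18s_in += 1

    def added_from_map(self, is_16s):
        """Record that a sequence was taken from the SSU map."""
        if is_16s:
            self.r16s_from_map += 1
        else:
            self.r18s_from_map += 1

    def wrote_header(self, r16s, r18s):
        """Count which SSU types the written header carries (None-safe)."""
        self.r16s_out += 0 if r16s is None else 1
        self.r18s_out += 0 if r18s is None else 1
```

Comparing the "in", "from map", and "out" counters after a run gives a quick check that the flags had the intended effect.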

Support

For questions and support:

Email: bbushnell@lbl.gov
Documentation: bbmap.org
Guide: BBSketchGuide.txt