AddSSU
Adds, removes, or replaces the SSU sequences in existing sketches. Both the sketches and the SSU fasta files must be annotated with TaxIDs.
Basic Usage
addssu.sh in=a.sketch out=b.sketch 16S=16S.fa 18S=18S.fa
This tool modifies existing sketch files by adding, removing, or replacing Small Subunit ribosomal RNA (SSU) sequences, streaming the input line by line. Input sketches must contain "#SZ:" header lines with taxonomic metadata, and SSU fasta files must use sequence identifiers in the "tid|#" format, where # is the NCBI taxonomic identifier. Internally, each sketch header is buffered in the SketchHeader inner class, which stores the header fields in an ArrayList<String>. The processHeader() method then handles each header as a unit, matching SSU sequences by the TaxID that parseTaxID() extracts, and toBytes() rebuilds the header text with a ByteBuilder.
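To make the "tid|#" naming convention concrete, here is a minimal Java sketch that reads the TaxID from a fasta header. The class and method names (TidHeaderExample, parseTid) are hypothetical and not part of BBTools. The sketch assumes the TaxID is the run of digits immediately after "tid|".

```java
/** Hypothetical helper, not part of BBTools. */
class TidHeaderExample {

    /** Returns the TaxID from a "tid|#" header, or -1 if the header is not in that format. */
    static int parseTid(String header) {
        // Tolerate the leading '>' of a fasta header line.
        String s = header.startsWith(">") ? header.substring(1) : header;
        if (!s.startsWith("tid|")) { return -1; }
        int start = 4; // first character after "tid|"
        int end = start;
        while (end < s.length() && s.charAt(end) >= '0' && s.charAt(end) <= '9') { end++; }
        if (end == start) { return -1; } // no digits after the prefix
        try {
            return Integer.parseInt(s.substring(start, end));
        } catch (NumberFormatException e) {
            return -1; // digit run too long for an int
        }
    }

    public static void main(String[] args) {
        System.out.println(parseTid(">tid|562|Escherichia coli 16S rRNA")); // 562
        System.out.println(parseTid(">NR_024570.1 Escherichia coli"));      // -1
    }
}
```

A check like this can be run over a fasta file before passing it to addssu.sh, to spot sequences that were not renamed.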
Parameters
Parameters are organized into standard input/output files, additional SSU files, processing control flags, and Java runtime settings.
Standard parameters
- in=<file>
- Input sketch file. Required parameter specifying the sketch file to be processed.
- out=<file>
- Output sketch file. Required parameter specifying where to write the modified sketch.
Additional file parameters (optional)
- 16S=<file>
- A fasta file of 16S sequences. These should be renamed so that they start with tid|# where # is the taxID. Should not contain organelle rRNA. Can be set to "auto" to use the default 16S database.
- 18S=<file>
- A fasta file of 18S sequences. These should be renamed so that they start with tid|# where # is the taxID. Should not contain organelle rRNA. Can be set to "auto" to use the default 18S database.
- tree=auto
- Path to TaxTree, if performing prokaryote/eukaryote-specific operations. Set to "auto" to use the default taxonomy tree. Required for taxonomy-aware processing flags.
Processing parameters
- preferSSUMap=f
- When true, prefer SSU sequences from the SSU map over existing sequences in sketches for all organisms. Default: false.
- preferSSUMapEuks=f
- When true, prefer SSU sequences from the SSU map over existing sequences in sketches, but only for eukaryotic organisms. Requires a taxonomy tree. Default: false.
- preferSSUMapProks=f
- When true, prefer SSU sequences from the SSU map over existing sequences in sketches, but only for prokaryotic organisms. Requires a taxonomy tree. Default: false.
- SSUMapOnly=f
- When true, use only SSU sequences from the SSU map, replacing all existing SSU sequences in sketches for all organisms. Default: false.
- SSUMapOnlyEuks=f
- When true, use only SSU sequences from the SSU map for eukaryotic organisms, clearing existing SSU sequences first. Requires a taxonomy tree. Default: false.
- SSUMapOnlyProks=f
- When true, use only SSU sequences from the SSU map for prokaryotic organisms, clearing existing SSU sequences first. Requires a taxonomy tree. Default: false.
- clear16S=f
- When true, remove all existing 16S sequences from all sketches. Default: false.
- clear18S=f
- When true, remove all existing 18S sequences from all sketches. Default: false.
- clear16SEuks=f
- When true, remove existing 16S sequences from eukaryotic organisms only. Requires a taxonomy tree. Default: false.
- clear18SEuks=f
- When true, remove existing 18S sequences from eukaryotic organisms only. Requires a taxonomy tree. Default: false.
- clear16SProks=f
- When true, remove existing 16S sequences from prokaryotic organisms only. Requires a taxonomy tree. Default: false.
- clear18SProks=f
- When true, remove existing 18S sequences from prokaryotic organisms only. Requires a taxonomy tree. Default: false.
- clearAll=f
- When true, remove all existing SSU sequences (both 16S and 18S) from all sketches. Equivalent to setting both clear16S=t and clear18S=t. Default: false.
- lines=<number>
- Maximum number of lines to process from the input sketch file. Set to -1 or omit for unlimited processing. Default: unlimited.
- verbose=f
- Enable verbose output for debugging and detailed processing information. Shows SSU map loading details and per-sketch processing information. Default: false.
Java Parameters
- -Xmx
- This will set Java's memory usage, overriding autodetection. -Xmx20g will specify 20 gigs of RAM, and -Xmx200m will specify 200 megs. The max is typically 85% of physical memory.
- -da
- Disable assertions.
Examples
Basic SSU Addition
addssu.sh in=input.sketch out=output.sketch 16S=bacterial_16S.fa 18S=eukaryotic_18S.fa
Adds 16S and 18S sequences to sketches that don't already have them, using the provided fasta files.
Replace All SSU Sequences
addssu.sh in=input.sketch out=output.sketch 16S=auto 18S=auto preferSSUMap=t
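Prefers sequences from the default SSU databases over those already in the sketches, for all organisms. An existing sequence is replaced when the SSU map has an entry for the sketch's TaxID.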
Clear SSU from Prokaryotes Only
addssu.sh in=input.sketch out=output.sketch tree=auto clear16SProks=t clear18SProks=t
Remove all SSU sequences from prokaryotic organisms while preserving eukaryotic SSU sequences.
Eukaryote-Specific SSU Replacement
addssu.sh in=input.sketch out=output.sketch 18S=euk_18S.fa tree=auto preferSSUMapEuks=t
Replace 18S sequences in eukaryotic organisms only, using the provided fasta file and preferring map sequences over existing ones.
Verbose Processing with Line Limit
addssu.sh in=large.sketch out=sample.sketch 16S=auto lines=1000 verbose=t
Process only the first 1000 lines of a large sketch file with detailed output for debugging.
Algorithm Details
AddSSU is a line-by-line streaming sketch processor. Its processInner() method reads input with ByteFile.nextLine() and writes output through a ByteStreamWriter. A state-machine parser separates header lines (starting with '#') from sequence data by inspecting the leading bytes of each line, and the SketchHeader inner class tracks the current header so that each sketch's metadata is rebuilt as a unit.
Core Processing Architecture
The algorithm employs a line-by-line streaming processor with header state management:
- Line Classification: Input lines are classified as header lines (starting with '#') or sequence data using byte-level pattern matching
- Header State Machine: SketchHeader objects are created when "#SZ:" lines are encountered, accumulating subsequent "#16S:" and "#18S:" lines via the addLine() method
- Atomic Processing: Complete headers are buffered using the SketchHeader class and processed atomically via processHeader() before output generation
- ByteBuilder Output: Headers are reconstructed using the ByteBuilder class through the toBytes() method with tab-separated field formatting
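The header grouping described above can be sketched in Java as follows. This is a simplified illustration, not the BBTools implementation: it uses String instead of byte arrays, the names HeaderGroupingExample, Header, and collectHeaders are hypothetical, and it assumes a header is complete once the first non-'#' line (or the next "#SZ:" line) is seen.

```java
import java.util.ArrayList;
import java.util.List;

/** Simplified illustration of header grouping; not the BBTools implementation. */
class HeaderGroupingExample {

    /** One buffered sketch header: the "#SZ:" line plus optional SSU lines. */
    static class Header {
        final String szLine;
        String r16S; // null when the sketch has no 16S line
        String r18S; // null when the sketch has no 18S line

        Header(String szLine) { this.szLine = szLine; }

        /** Accumulates a "#16S:" or "#18S:" line into this header. */
        void addLine(String line) {
            if (line.startsWith("#16S:")) { r16S = line.substring(5); }
            else if (line.startsWith("#18S:")) { r18S = line.substring(5); }
            else { throw new IllegalArgumentException("Unexpected header line: " + line); }
        }
    }

    /** Groups input lines into complete headers, returned in input order. */
    static List<Header> collectHeaders(List<String> lines) {
        List<Header> done = new ArrayList<>();
        Header open = null; // header currently being accumulated
        for (String line : lines) {
            if (line.startsWith("#SZ:")) {
                if (open != null) { done.add(open); } // previous header had no data lines
                open = new Header(line);
            } else if (line.startsWith("#") && open != null) {
                open.addLine(line);
            } else if (open != null) {
                done.add(open); // first data line closes the header
                open = null;
            }
        }
        if (open != null) { done.add(open); }
        return done;
    }

    public static void main(String[] args) {
        List<String> lines = List.of(
            "#SZ:1000\tID:562", "#16S:ACGT", "data1", "data2",
            "#SZ:2000\tID:9606", "#18S:GGCC", "data3");
        System.out.println(collectHeaders(lines).size()); // 2
    }
}
```

Buffering the whole header before acting on it is what allows SSU lines to be added, replaced, or dropped before anything is written.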
Taxonomy-Aware Processing
Organism-specific processing uses TaxTree integration with TaxID-based lookups:
- TaxID Extraction: parseTaxID() method uses Tools.tabPattern.split() to parse header fields, scanning for "ID:" or "TAXID:" prefixes with Integer.parseInt() conversion
- Prokaryote Detection: TaxTree.isProkaryote(header.tid), guarded by the check tid>0 && tid<SketchObject.minFakeID, identifies bacterial and archaeal organisms for the prokaryote-specific flags
- Eukaryote Detection: TaxTree.isEukaryote(header.tid), guarded by the same check, identifies eukaryotic organisms for the eukaryote-specific flags
- Fake ID Filtering: TaxIDs at or above the SketchObject.minFakeID threshold are treated as synthetic identifiers and excluded
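As a rough illustration of the TaxID extraction and validity check, the Java sketch below splits a header on tabs and looks for an "ID:" or "TAXID:" field. The class and method names are hypothetical, the example header fields are invented, and the fake-ID threshold is passed in as a parameter because the actual value of SketchObject.minFakeID is not stated here.

```java
import java.util.regex.Pattern;

/** Illustrative only; names are hypothetical and not taken from BBTools. */
class TaxIdParsingExample {

    private static final Pattern TAB = Pattern.compile("\t");

    /**
     * Returns the TaxID from the first "TAXID:" or "ID:" field, or -1 if none is present.
     * A non-numeric value makes Integer.parseInt throw NumberFormatException.
     */
    static int parseTaxId(String headerLine) {
        for (String field : TAB.split(headerLine)) {
            if (field.startsWith("TAXID:")) { return Integer.parseInt(field.substring(6)); }
            if (field.startsWith("ID:")) { return Integer.parseInt(field.substring(3)); }
        }
        return -1;
    }

    /** True when tid is positive and below the threshold reserved for synthetic IDs. */
    static boolean isRealTaxId(int tid, int minFakeId) {
        return tid > 0 && tid < minFakeId;
    }

    public static void main(String[] args) {
        int tid = parseTaxId("#SZ:10000\tID:562\tNM:Escherichia coli");
        // The threshold 1000000 is an arbitrary value chosen for this demo.
        System.out.println(tid + " real=" + isRealTaxId(tid, 1000000)); // 562 real=true
    }
}
```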
SSU Map Integration
The SSUMap class provides HashMap-based access to curated rRNA sequences, loaded automatically on first use:
- Lazy Loading: SSUMap.load(outstream) is called once from processInner() (line 324), initializing the r16SMap and r18SMap HashMaps with synchronized access
- HashMap Retrieval: SSUMap.r16SMap.get(header.tid) and SSUMap.r18SMap.get(header.tid) provide O(1) TaxID-based sequence lookup using Integer keys
- Preference Logic: The preferMap boolean controls whether map sequences replace the existing header.r16S and header.r18S byte arrays
- Auto File Resolution: A value of "auto" is resolved to a path by the static methods TaxTree.default16SFile() and TaxTree.default18SFile() (lines 164-165)
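The interaction between an existing sequence, a map entry, and the prefer/map-only flags can be sketched as a single decision function. This is an interpretation of the parameter descriptions above, not code from AddSSU; SsuPreferenceExample and choose are hypothetical names, and the assumption is that map-only mode discards the existing sequence even when the map has no entry.

```java
import java.util.HashMap;
import java.util.Map;

/** Interpretation of the documented flags; not the actual AddSSU logic. */
class SsuPreferenceExample {

    /**
     * Picks the sequence to keep for one sketch.
     * existing: sequence already in the sketch, or null.
     * fromMap:  sequence found in the SSU map for this TaxID, or null.
     */
    static byte[] choose(byte[] existing, byte[] fromMap, boolean preferMap, boolean mapOnly) {
        if (mapOnly) { return fromMap; }           // existing is cleared first (assumed)
        if (fromMap == null) { return existing; }  // nothing to add or replace with
        if (existing == null || preferMap) { return fromMap; }
        return existing;                           // default: keep what the sketch has
    }

    public static void main(String[] args) {
        Map<Integer, byte[]> r16SMap = new HashMap<>();
        r16SMap.put(562, "ACGTACGT".getBytes());

        byte[] existing = "TTTT".getBytes();
        byte[] kept = choose(existing, r16SMap.get(562), false, false);
        System.out.println(new String(kept)); // TTTT (existing wins by default)

        byte[] preferred = choose(existing, r16SMap.get(562), true, false);
        System.out.println(new String(preferred)); // ACGTACGT
    }
}
```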
Memory Management Strategy
Processing uses streaming architecture with controlled memory allocation:
- Single-Pass Processing: ByteFile.nextLine() provides streaming input reading one line at a time without loading entire sketch files into memory
- Header Buffering: Each SketchHeader instance temporarily buffers one complete header in an ArrayList<String> named fields, initialized with a capacity of line.length()+2
- ByteBuilder Efficiency: Output reconstruction uses ByteBuilder(1000), pre-allocating 1000 bytes for assembling the tab-separated output
- Line Limiting: maxLines parameter (default Long.MAX_VALUE) enables early termination via break statement when linesProcessed>=maxLines
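A minimal Java sketch of streaming input with a line limit is shown below. It uses the standard BufferedReader rather than ByteFile, and the names LineLimitExample and countLines are hypothetical. It assumes that a negative limit means unlimited, as in the lines parameter, and counts each line as its length plus one byte for the newline.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

/** Illustrative only; uses java.io instead of the BBTools ByteFile class. */
class LineLimitExample {

    /** Reads up to maxLines lines and returns {linesProcessed, bytesProcessed}. */
    static long[] countLines(Reader source, long maxLines) {
        // Treat a negative limit as unlimited (assumed mapping to Long.MAX_VALUE).
        long limit = maxLines < 0 ? Long.MAX_VALUE : maxLines;
        long lines = 0, bytes = 0;
        try (BufferedReader in = new BufferedReader(source)) {
            String line;
            // Only one line is held in memory at a time.
            while (lines < limit && (line = in.readLine()) != null) {
                lines++;
                bytes += line.length() + 1; // +1 for the newline, assuming one byte per char
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return new long[] {lines, bytes};
    }

    public static void main(String[] args) {
        long[] r = countLines(new StringReader("#SZ:3\n11\n22\n33\n"), 2);
        System.out.println(r[0] + " lines, " + r[1] + " bytes"); // 2 lines, 9 bytes
    }
}
```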
Header Reconstruction Algorithm
The toBytes() method reconstructs sketch headers with precise formatting preservation:
- Field Preservation: Original header fields are maintained in ArrayList iteration order using bb.tab().append(fields.get(i)) with tab separator insertion
- SSU Length Calculation: The 16S and 18S length fields are computed from r16S.length and r18S.length and appended in the form "16S:" followed by the length (and likewise for 18S)
- Sequence Line Generation: "#16S:" and "#18S:" prefixed lines are appended using ByteBuilder.nl().append("#16S:").append(r16S) method chaining
- Format Compatibility: Output maintains exact sketch file format using '#' prefix, tab separation, and newline generation for downstream tool compatibility
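The reconstruction steps above can be sketched with a StringBuilder standing in for ByteBuilder. The names HeaderRebuildExample and rebuild are hypothetical. The sketch assumes the fields list already starts with the "#SZ:" field, contains no stale SSU length fields, and that no trailing newline is emitted.

```java
import java.util.List;

/** Illustrative only; StringBuilder stands in for the BBTools ByteBuilder. */
class HeaderRebuildExample {

    /** Rebuilds a header: tab-separated fields, SSU length fields, then SSU lines. */
    static String rebuild(List<String> fields, String r16S, String r18S) {
        StringBuilder sb = new StringBuilder(1000);
        // Original fields, in order, separated by tabs.
        for (int i = 0; i < fields.size(); i++) {
            if (i > 0) { sb.append('\t'); }
            sb.append(fields.get(i));
        }
        // Length fields are recomputed from the sequences actually present.
        if (r16S != null) { sb.append('\t').append("16S:").append(r16S.length()); }
        if (r18S != null) { sb.append('\t').append("18S:").append(r18S.length()); }
        // Each sequence goes on its own '#'-prefixed line.
        if (r16S != null) { sb.append('\n').append("#16S:").append(r16S); }
        if (r18S != null) { sb.append('\n').append("#18S:").append(r18S); }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(rebuild(List.of("#SZ:1000", "ID:562"), "ACGT", null));
    }
}
```

Recomputing the length fields at output time keeps them consistent with whichever sequence was finally chosen.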
Processing Statistics Tracking
The algorithm maintains comprehensive processing counters for operational monitoring:
- Input Tracking: r16Sin and r18Sin long counters increment via r16Sin++ and r18Sin++ when parsing existing SSU sequences from addLine() method
- Map Addition Tracking: r16SfromMap and r18SfromMap long counters increment via r16SfromMap++ and r18SfromMap++ when sequences are added from SSUMap HashMap lookups
- Output Verification: r16Sout and r18Sout long counters track final SSU content using ternary operators (header.r16S==null ? 0 : 1) for null-safe counting
- Byte/Line Accounting: bytesProcessed, linesProcessed, bytesOut, and linesOut provide throughput metrics; byte totals are accumulated with += using line.length+1 per line
Support
For questions and support:
- Email: bbushnell@lbl.gov
- Documentation: bbmap.org
- Guide: BBSketchGuide.txt