AddSSU

Script: addssu.sh | Package: sketch | Class: AddSSU.java

Adds, removes, or replaces SSU sequences in existing sketches. Sketches and SSU fasta files must be annotated with TaxIDs.

Basic Usage

addssu.sh in=a.sketch out=b.sketch 16S=16S.fa 18S=18S.fa

This tool modifies existing sketch files by adding, removing, or replacing small subunit ribosomal RNA (SSU) sequences, streaming the input line by line. Input sketches must contain "#SZ:" header lines with taxonomic metadata, and sequence identifiers in the SSU fasta files must use the "tid|#" format, where # is the NCBI taxonomic identifier. Internally, each sketch header is collected in the SketchHeader inner class, which stores its fields in an ArrayList<String> and rebuilds the header through a ByteBuilder in its toBytes() method. The processHeader() method handles each header as a unit, matching it to SSU sequences by the TaxID obtained from parseTaxID().
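The "tid|#" naming requirement can be illustrated with a short Python sketch. This is not the tool's Java code: the function name parse_tax_id is only borrowed from the parseTaxID() method mentioned above, and the assumption that the number ends at the first non-digit character (another "|", whitespace, or end of line) is mine.

```python
def parse_tax_id(name: str) -> int:
    """Return the TaxID from a name such as '>tid|562|...' or -1 if absent.

    Hypothetical helper for illustration; not the AddSSU implementation.
    """
    if name.startswith(">"):      # tolerate a leading fasta '>' marker
        name = name[1:]
    prefix = "tid|"
    if not name.startswith(prefix):
        return -1
    digits = []
    for ch in name[len(prefix):]:
        if "0" <= ch <= "9":      # collect ASCII digits only
            digits.append(ch)
        else:
            break                 # assumption: the number ends at the first non-digit
    return int("".join(digits)) if digits else -1
```

A name that does not begin with "tid|" yields -1 here, which is a convenient way to spot fasta records that still need renaming before they are passed to 16S= or 18S=.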

Parameters

Parameters are organized into standard input/output files, additional SSU files, processing control flags, and Java runtime settings.

Standard parameters

in=<file>
Input sketch file. Required parameter specifying the sketch file to be processed.
out=<file>
Output sketch file. Required parameter specifying where to write the modified sketch.

Additional file parameters (optional)

16S=<file>
A fasta file of 16S sequences. These should be renamed so that they start with tid|# where # is the taxID. Should not contain organelle rRNA. Can be set to "auto" to use the default 16S database.
18S=<file>
A fasta file of 18S sequences. These should be renamed so that they start with tid|# where # is the taxID. Should not contain organelle rRNA. Can be set to "auto" to use the default 18S database.
tree=auto
Path to TaxTree, if performing prokaryote/eukaryote-specific operations. Set to "auto" to use the default taxonomy tree. Required for taxonomy-aware processing flags.

Processing parameters

preferSSUMap=f
When true, prefer SSU sequences from the SSU map over existing sequences in sketches for all organisms. Default: false.
preferSSUMapEuks=f
When true, prefer SSU sequences from the SSU map over existing sequences in sketches, but only for eukaryotic organisms. Requires a taxonomy tree. Default: false.
preferSSUMapProks=f
When true, prefer SSU sequences from the SSU map over existing sequences in sketches, but only for prokaryotic organisms. Requires a taxonomy tree. Default: false.
SSUMapOnly=f
When true, use only SSU sequences from the SSU map, replacing all existing SSU sequences in sketches for all organisms. Default: false.
SSUMapOnlyEuks=f
When true, use only SSU sequences from the SSU map for eukaryotic organisms, clearing existing SSU sequences first. Requires a taxonomy tree. Default: false.
SSUMapOnlyProks=f
When true, use only SSU sequences from the SSU map for prokaryotic organisms, clearing existing SSU sequences first. Requires a taxonomy tree. Default: false.
clear16S=f
When true, remove all existing 16S sequences from all sketches. Default: false.
clear18S=f
When true, remove all existing 18S sequences from all sketches. Default: false.
clear16SEuks=f
When true, remove existing 16S sequences from eukaryotic organisms only. Requires a taxonomy tree. Default: false.
clear18SEuks=f
When true, remove existing 18S sequences from eukaryotic organisms only. Requires a taxonomy tree. Default: false.
clear16SProks=f
When true, remove existing 16S sequences from prokaryotic organisms only. Requires a taxonomy tree. Default: false.
clear18SProks=f
When true, remove existing 18S sequences from prokaryotic organisms only. Requires a taxonomy tree. Default: false.
clearAll=f
When true, remove all existing SSU sequences (both 16S and 18S) from all sketches. Equivalent to setting both clear16S=t and clear18S=t. Default: false.
lines=<number>
Maximum number of lines to process from the input sketch file. Set to -1 or omit for unlimited processing. Default: unlimited.
verbose=f
Enable verbose output for debugging and detailed processing information. Shows SSU map loading details and per-sketch processing information. Default: false.
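How the default, preferSSUMap, and SSUMapOnly behaviors differ can be sketched in Python. This is a reading of the parameter descriptions above, not the tool's source. The function name choose_ssu and its arguments are hypothetical, and keeping the existing sequence when preferSSUMap is set but the map has no entry is an assumption.

```python
from typing import Optional


def choose_ssu(existing: Optional[str], from_map: Optional[str],
               prefer_map: bool = False, map_only: bool = False) -> Optional[str]:
    """Pick the SSU sequence a sketch ends up with (illustrative only)."""
    if map_only:
        # SSUMapOnly: the existing sequence is cleared first, so only the
        # map can supply one (possibly none at all).
        return from_map
    if prefer_map and from_map is not None:
        # preferSSUMap: the map sequence wins over the existing one.
        return from_map
    if existing is not None:
        # Default: sketches that already have a sequence keep it.
        return existing
    # Default: the map fills in a sequence only where none was present.
    return from_map
```

The Euks and Proks variants would apply the same choice, restricted to organisms the taxonomy tree places in the corresponding group.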

Java Parameters

-Xmx
This will set Java's memory usage, overriding autodetection. -Xmx20g will specify 20 gigs of RAM, and -Xmx200m will specify 200 megs. The max is typically 85% of physical memory.
-da
Disable assertions.

Examples

Basic SSU Addition

addssu.sh in=input.sketch out=output.sketch 16S=bacterial_16S.fa 18S=eukaryotic_18S.fa

Adds 16S and 18S sequences to sketches that don't already have them, using the provided fasta files.

Replace All SSU Sequences

addssu.sh in=input.sketch out=output.sketch 16S=auto 18S=auto preferSSUMap=t

Prefers sequences from the default 16S and 18S databases over SSU sequences already present in the sketches, for all organisms.

Clear SSU from Prokaryotes Only

addssu.sh in=input.sketch out=output.sketch tree=auto clear16SProks=t clear18SProks=t

Remove all SSU sequences from prokaryotic organisms while preserving eukaryotic SSU sequences.

Eukaryote-Specific SSU Replacement

addssu.sh in=input.sketch out=output.sketch 18S=euk_18S.fa tree=auto preferSSUMapEuks=t

Replace 18S sequences in eukaryotic organisms only, using the provided fasta file and preferring map sequences over existing ones.

Verbose Processing with Line Limit

addssu.sh in=large.sketch out=sample.sketch 16S=auto lines=1000 verbose=t

Process only the first 1000 lines of a large sketch file with detailed output for debugging.

Algorithm Details

AddSSU streams the sketch file line by line: the processInner() method reads input with ByteFile.nextLine() and writes output through a ByteStreamWriter. A state-machine parser distinguishes header lines (starting with '#') from sequence data by byte-level pattern matching, and the SketchHeader inner class tracks header state so that each sketch's metadata is reconstructed as a unit.
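The header/data distinction can be sketched in Python as a small generator. This is a simplified illustration, not the Java implementation. It assumes that a line starting with "#SZ:" opens a new sketch and that any other line starting with '#' belongs to the current header. The name group_sketches is hypothetical.

```python
from typing import Iterable, Iterator, List, Tuple


def group_sketches(lines: Iterable[str]) -> Iterator[Tuple[List[str], List[str]]]:
    """Yield (header_lines, data_lines) for each sketch, one pass, in order."""
    header: List[str] = []
    data: List[str] = []
    for raw in lines:
        line = raw.rstrip("\n")
        if line.startswith("#SZ:"):
            # Assumption: "#SZ:" marks the start of the next sketch,
            # so whatever was collected so far is emitted first.
            if header or data:
                yield header, data
            header, data = [line], []
        elif line.startswith("#"):
            header.append(line)   # additional header line of the current sketch
        else:
            data.append(line)     # anything else is treated as sketch data
    if header or data:
        yield header, data        # emit the final sketch
```

Because only one sketch is held at a time, memory use does not grow with the size of the input file, which is the point of processing the file as a stream.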

Core Processing Architecture

The algorithm employs a line-by-line streaming processor with header state management.

Taxonomy-Aware Processing

Organism-specific processing relies on the TaxTree, using TaxID-based lookups.
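As an illustration of how the organism-specific clear flags might be scoped, the Python sketch below decides which SSU types to remove from one sketch. The flag names come from the parameter list. The function ssu_to_clear and the "euk"/"prok" labels are hypothetical stand-ins for whatever the TaxTree lookup returns, and nothing here is taken from the tool's source.

```python
from typing import Optional, Set, Tuple


def ssu_to_clear(enabled: Set[str], domain: Optional[str]) -> Tuple[bool, bool]:
    """Return (clear_16s, clear_18s) for one sketch.

    enabled: names of the flags set to true on the command line.
    domain:  "euk", "prok", or None when the organism's group is unknown
             (for example, when no taxonomy tree was loaded).
    """
    clear_16s = (
        "clearAll" in enabled
        or "clear16S" in enabled
        or (domain == "euk" and "clear16SEuks" in enabled)
        or (domain == "prok" and "clear16SProks" in enabled)
    )
    clear_18s = (
        "clearAll" in enabled
        or "clear18S" in enabled
        or (domain == "euk" and "clear18SEuks" in enabled)
        or (domain == "prok" and "clear18SProks" in enabled)
    )
    return clear_16s, clear_18s
```

The flags without a Euks or Proks suffix never consult the domain, which is consistent with only the suffixed flags requiring a taxonomy tree.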

SSU Map Integration

The SSUMap class provides HashMap-based access to curated rRNA sequences and loads them automatically.
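A TaxID-keyed map of this kind can be sketched in Python as a dictionary built from fasta records named "tid|#". This is an assumption-laden illustration rather than the SSUMap class itself: the function load_ssu_map is hypothetical, records without a "tid|" name are skipped, and when a TaxID appears more than once the first sequence is kept, which may not match the tool's behavior.

```python
import re
from typing import Dict, Iterable, List, Optional

# Matches a fasta header such as ">tid|562|anything" and captures the number.
_TID_HEADER = re.compile(r">tid\|([0-9]+)")


def load_ssu_map(fasta_lines: Iterable[str]) -> Dict[int, str]:
    """Build {TaxID: sequence} from fasta text lines (illustrative only)."""
    chunks: Dict[int, List[str]] = {}
    current: Optional[int] = None          # TaxID of the record being read
    for raw in fasta_lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            match = _TID_HEADER.match(line)
            current = int(match.group(1)) if match else None
            if current is not None and current in chunks:
                current = None             # assumption: first sequence per TaxID wins
            elif current is not None:
                chunks[current] = []
        elif current is not None:
            chunks[current].append(line)   # sequences may span several lines
    return {tax_id: "".join(parts) for tax_id, parts in chunks.items()}
```

Looking up a sketch's TaxID in the resulting dictionary then takes constant time on average, which is the property a HashMap provides.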

Memory Management Strategy

Processing uses a streaming architecture with controlled memory allocation.

Header Reconstruction Algorithm

The toBytes() method reconstructs sketch headers while preserving their formatting.

Processing Statistics Tracking

The algorithm maintains processing counters for operational monitoring.

Support

For questions and support: