SubSketch

Script: subsketch.sh Package: sketch Class: SubSketch.java

Shrinks sketches to a smaller length. Output sketches are either cut to a fixed size (size=, with autosize=f) or autosized based on genome size.

Basic Usage

subsketch.sh in=file.sketch out=sub.sketch size=1000 autosize=f

For bulk operations with multiple files:

subsketch.sh in=big#.sketch out=small#.sketch sizemult=0.5

Parameters

The parameters below control sketch resizing, input and output handling, and distribution of output across files. They are grouped as in the shell script.

Standard parameters

in=<file>
Input sketch file containing one or more sketches. Accepts a comma-separated list of files, or a # symbol for numbered file sets.
out=<file>
Output sketch file. A # symbol in the name produces numbered output files when combined with the files parameter.
size=10000
Size of sketches to generate, if autosize=f. This sets the target number of kmers in the output sketch.
autosize=t
Autosize sketches based on genome size. When true, the sketch size is chosen automatically from the estimated genome size, and the size parameter is not used.
sizemult=1
Adjust default sketch autosize by this factor. Multiplier applied to automatically calculated sketch sizes when autosize=t.
blacklist=
Apply a blacklist to the sketch before resizing. Removes specified kmers from sketches before size adjustment.
files=31
If the output filename contains a # symbol, spread the output across this many files, replacing the # with a number. Useful for distributing large sketch collections.

Java Parameters

-Xmx
This will set Java's memory usage, overriding autodetection. -Xmx20g will specify 20 gigs of RAM, and -Xmx200m will specify 200 megs. The max is typically 85% of physical memory.
-eoom
This flag will cause the process to exit if an out-of-memory exception occurs. Requires Java 8u92+.
-da
Disable assertions.
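
The Java flags above go on the same command line as the key=value parameters. The following is a minimal sketch rather than a prescribed invocation: the file names and the 8 GB heap are placeholders, and the command is only run if subsketch.sh is on the PATH and the input file exists. Otherwise it is printed.

```shell
#!/bin/sh
# Placeholder file names; substitute real sketch files.
IN=large.sketch
OUT=small.sketch

# 8 GB heap (arbitrary choice), exit on out-of-memory, fixed 1000-kmer sketches.
CMD="subsketch.sh -Xmx8g -eoom in=$IN out=$OUT size=1000 autosize=f"

if command -v subsketch.sh >/dev/null 2>&1 && [ -f "$IN" ]; then
    # Tool and input are available: run the command.
    $CMD
else
    # Otherwise just show what would be run.
    echo "$CMD"
fi
```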

Examples

Fixed Size Sketch Reduction

subsketch.sh in=large.sketch out=small.sketch size=1000 autosize=f

Reduces all sketches in large.sketch to at most 1000 kmers each.

Autosize with Multiplier

subsketch.sh in=genomes.sketch out=reduced.sketch autosize=t sizemult=0.5

Automatically sizes sketches based on genome characteristics but uses half the default size.
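
To make the effect of sizemult concrete, the arithmetic can be sketched as below. The autosize value of 8000 is purely hypothetical, since the real value depends on the genome, and the rounding shown is an assumption for illustration, not a statement of how the tool rounds.

```shell
#!/bin/sh
# Hypothetical autosize result for one genome; not produced by the tool.
AUTO_SIZE=8000
SIZEMULT=0.5

# Scale the autosize value and round to the nearest integer (assumed rounding).
SCALED=$(awk -v n="$AUTO_SIZE" -v m="$SIZEMULT" 'BEGIN { printf "%d", n * m + 0.5 }')

echo "autosize=$AUTO_SIZE sizemult=$SIZEMULT -> about $SCALED kmers"
```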

Bulk Processing with File Distribution

subsketch.sh in=big#.sketch out=small#.sketch files=10 sizemult=0.3

Processes multiple numbered input files (big1.sketch, big2.sketch, etc.) and distributes output across 10 files.
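
Before running a bulk job it can help to preview which file names a # pattern stands for. The helper below is a hypothetical convenience function, not part of BBTools. The starting index is an assumption and is passed explicitly, so compare the result with the files actually present or produced.

```shell
#!/bin/sh
# expand_pattern PATTERN COUNT [START]
# Prints COUNT file names, replacing the first '#' in PATTERN with
# START, START+1, ... (START defaults to 0; this default is an assumption).
expand_pattern() {
    pattern=$1
    count=$2
    start=${3:-0}
    i=0
    while [ "$i" -lt "$count" ]; do
        printf '%s\n' "$pattern" | sed "s/#/$((start + i))/"
        i=$((i + 1))
    done
}

# Preview three output names, numbered from 0.
expand_pattern 'small#.sketch' 3 0
```

Pass 1 as the third argument if the files are numbered from 1.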

Blacklist Application

subsketch.sh in=contaminated.sketch out=clean.sketch blacklist=contaminants.txt size=5000 autosize=f

Removes blacklisted kmers before reducing sketch size to 5000.

Algorithm Details

Sketch Resizing Strategy

SubSketch implements a sequential processing approach with three distinct phases:

Autosize Algorithm

When autosize=t, the tool calculates optimal sketch size using:

Multi-file Distribution

For large sketch collections, the tool supports distribution across multiple output files via the processInner() method with ByteStreamWriter array:

Memory Management

The implementation manages memory through specific mechanisms:

Data Preservation

During sketch processing, the tool preserves specific data elements:

Performance Considerations

Memory Usage

Memory requirements scale with sketch collection size. Default memory allocation is 4GB (-Xmx4g), but large collections may require more:

Processing Speed

Performance factors include:

Scalability

For very large datasets:

Technical Notes

Compatibility

SubSketch works with sketch files generated by other BBTools sketch utilities including:

File Format

Processes standard BBTools sketch format files:

Error Handling

The tool includes robust error handling for:

Support

For questions and support: