SubSketch
Shrinks sketches to a smaller fixed length. This tool generates smaller sketches from input sketches, either by resizing to a fixed size or using autosize based on genome characteristics.
Basic Usage
subsketch.sh in=file.sketch out=sub.sketch size=1000 autosize=f
For bulk operations with multiple files:
subsketch.sh in=big#.sketch out=small#.sketch sizemult=0.5
Parameters
Parameters control sketch resizing, input and output handling, and distribution of output across files. They are grouped below by function.
Standard parameters
- in=<file>
- Input sketch file containing one or more sketches. Can be a comma-separated list of files, or use the # symbol for numbered file sets.
- out=<file>
- Output sketch file. Can use the # symbol for numbered output files when combined with the files parameter.
- size=10000
- Size of sketches to generate, if autosize=f. This sets the target number of kmers in the output sketch and has no effect while autosize=t (the default).
- autosize=t
- Autosize sketches based on genome size. When true, the tool automatically determines optimal sketch size based on estimated genome characteristics.
- sizemult=1
- Adjust default sketch autosize by this factor. Multiplier applied to automatically calculated sketch sizes when autosize=t.
- blacklist=
- Apply a blacklist to the sketch before resizing. Removes specified kmers from sketches before size adjustment.
- files=31
- If the output filename contains a # symbol, spread the output across this many files, replacing the # with a number. Useful for distributing large sketch collections.
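The # substitution described for the files parameter can be illustrated with a small helper. This is a sketch only: the class HashPattern and its method are hypothetical, and starting the numbering at 0 is an assumption that should be checked against the file names the tool actually writes.

```java
import java.util.ArrayList;
import java.util.List;

final class HashPattern {
    /**
     * Expands a pattern such as "small#.sketch" into one name per output file.
     * Assumption: numbering starts at 0 (not confirmed by this document).
     */
    static List<String> expand(String pattern, int files) {
        if (files < 1) throw new IllegalArgumentException("files must be >= 1");
        List<String> names = new ArrayList<>();
        if (!pattern.contains("#")) {
            // No placeholder, so everything goes to a single output file.
            names.add(pattern);
            return names;
        }
        for (int i = 0; i < files; i++) {
            names.add(pattern.replace("#", Integer.toString(i)));
        }
        return names;
    }

    public static void main(String[] args) {
        for (String name : expand("small#.sketch", 3)) {
            System.out.println(name);
        }
    }
}
```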
Java Parameters
- -Xmx
- This will set Java's memory usage, overriding autodetection. -Xmx20g will specify 20 gigs of RAM, and -Xmx200m will specify 200 megs. The max is typically 85% of physical memory.
- -eoom
- This flag will cause the process to exit if an out-of-memory exception occurs. Requires Java 8u92+.
- -da
- Disable assertions.
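To make the -Xmx suffixes concrete, the following sketch converts a value such as 20g or 200m into bytes and prints the heap limit of the running JVM. HeapFlag is a hypothetical helper written for this page, not part of BBTools; it relies only on the JVM convention that k, m, and g are binary multiples.

```java
final class HeapFlag {
    /** Converts the value part of -Xmx (for example "20g" or "200m") to bytes. */
    static long toBytes(String value) {
        String v = value.trim().toLowerCase();
        char unit = v.charAt(v.length() - 1);
        long mult;
        switch (unit) {
            case 'k': mult = 1L << 10; break;
            case 'm': mult = 1L << 20; break;
            case 'g': mult = 1L << 30; break;
            default:  return Long.parseLong(v); // no suffix: plain byte count
        }
        return Long.parseLong(v.substring(0, v.length() - 1)) * mult;
    }

    public static void main(String[] args) {
        System.out.println("-Xmx20g  = " + toBytes("20g") + " bytes");
        System.out.println("-Xmx200m = " + toBytes("200m") + " bytes");
        // Heap limit the current JVM was actually started with.
        System.out.println("maxMemory = " + Runtime.getRuntime().maxMemory());
    }
}
```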
Examples
Fixed Size Sketch Reduction
subsketch.sh in=large.sketch out=small.sketch size=1000 autosize=f
Reduces every sketch in large.sketch that has more than 1000 kmers to 1000 kmers. Sketches that are already at or below that size are left unchanged.
Autosize with Multiplier
subsketch.sh in=genomes.sketch out=reduced.sketch autosize=t sizemult=0.5
Automatically sizes sketches based on genome characteristics but uses half the default size.
Bulk Processing with File Distribution
subsketch.sh in=big#.sketch out=small#.sketch files=10 sizemult=0.3
Processes the numbered input files that match big#.sketch, scales the autosized sketches to 0.3 times the default size, and distributes output across 10 files.
Blacklist Application
subsketch.sh in=contaminated.sketch out=clean.sketch blacklist=contaminants.txt size=5000 autosize=f
Removes blacklisted kmers, then reduces each sketch to at most 5000 kmers. autosize=f is required for size=5000 to take effect.
Algorithm Details
Sketch Resizing Strategy
SubSketch implements a sequential processing approach with three distinct phases:
- Blacklist Application: If specified, calls sk.applyBlacklist() method to remove blacklisted kmers before size calculation
- Size Calculation: Determines target size via toSketchSize() method using either fixed targetSketchSize parameter or autosize calculation based on genomeSizeBases, genomeSizeKmers, and genomeSizeEstimate
- Conditional Resizing: Only applies sk.resize(target) method when sk.length() > target, preserving sketches that are already smaller than target size
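The three phases above can be sketched for a single sketch as follows. This is an illustration under stated assumptions, not the BBTools source: ResizeSketchDemo is a hypothetical class, a sketch is modeled as a plain long[] of keys, the target is passed in already computed, and shrinking is modeled as keeping the first target keys.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

final class ResizeSketchDemo {
    /**
     * Applies blacklist removal and conditional resizing to one sketch.
     * Assumption: keys are ordered so that the ones to keep come first.
     */
    static long[] process(long[] keys, Set<Long> blacklist, int target) {
        // Phase 1: drop blacklisted keys, but only if a blacklist was supplied.
        long[] kept = keys;
        if (blacklist != null && !blacklist.isEmpty()) {
            kept = Arrays.stream(keys).filter(k -> !blacklist.contains(k)).toArray();
        }
        // Phase 2: size calculation is represented by the 'target' argument.
        // Phase 3: shrink only when the sketch is longer than the target.
        if (kept.length > target) {
            kept = Arrays.copyOf(kept, target);
        }
        return kept;
    }

    public static void main(String[] args) {
        Set<Long> blacklist = new HashSet<>(Arrays.asList(2L));
        long[] out = process(new long[]{1, 2, 3, 4, 5}, blacklist, 3);
        System.out.println(Arrays.toString(out)); // prints [1, 3, 4]
    }
}
```

Because the blacklist runs first, the length test in the last phase sees the sketch after blacklisted keys are gone, which matches the ordering described above.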
Autosize Algorithm
When autosize=t, the tool calculates optimal sketch size using:
- Genome size in bases (genomeSizeBases)
- Genome size in kmers (genomeSizeKmers)
- Genome size estimate (genomeSizeEstimate)
- Size multiplier (sizemult) for fine-tuning
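How the fixed size, autosize, and sizemult settings interact can be sketched as below. The actual autosize formula is not given in this document, so it is passed in as a caller-supplied rule; TargetSize, choose, and the constant rule used in main are all placeholders for illustration.

```java
import java.util.function.LongToIntFunction;

final class TargetSize {
    /**
     * Picks the target sketch size.
     * autosizeRule is a hypothetical stand-in for the tool's own calculation.
     */
    static int choose(boolean autosize, int fixedSize, double sizeMult,
                      long genomeSize, LongToIntFunction autosizeRule) {
        if (!autosize) {
            return fixedSize; // size= applies only when autosize=f
        }
        int base = autosizeRule.applyAsInt(genomeSize);
        // sizemult scales the autosized value; never return less than 1.
        return Math.max(1, (int) Math.round(base * sizeMult));
    }

    public static void main(String[] args) {
        LongToIntFunction placeholderRule = g -> 10000; // not the real formula
        System.out.println(choose(false, 1000, 0.5, 5_000_000L, placeholderRule));
        System.out.println(choose(true, 1000, 0.5, 5_000_000L, placeholderRule));
    }
}
```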
Multi-file Distribution
For large sketch collections, the tool supports distribution across multiple output files via the processInner() method with ByteStreamWriter array:
- Uses modulo operation sk.sketchID % files to assign sketches to specific output files
- Updates file names in sketch metadata using sk.setFname(bsw.fname) to reflect new locations
- Processes sketches sequentially through single-threaded toBytes() serialization
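The modulo assignment can be illustrated in isolation. In this sketch, ModuloDistribution is a hypothetical class, sketch IDs are assumed to be non-negative integers, and each bucket stands in for one output file.

```java
import java.util.ArrayList;
import java.util.List;

final class ModuloDistribution {
    /** Groups sketch IDs by output file index using id % files. */
    static List<List<Integer>> assign(int[] sketchIds, int files) {
        List<List<Integer>> buckets = new ArrayList<>();
        for (int i = 0; i < files; i++) {
            buckets.add(new ArrayList<>());
        }
        for (int id : sketchIds) {
            // Same rule as sk.sketchID % files; assumes id >= 0.
            buckets.get(id % files).add(id);
        }
        return buckets;
    }

    public static void main(String[] args) {
        List<List<Integer>> buckets = assign(new int[]{0, 1, 2, 3, 4, 5, 6}, 3);
        for (int i = 0; i < buckets.size(); i++) {
            System.out.println("file " + i + ": " + buckets.get(i));
        }
    }
}
```

With consecutive IDs, this rule spreads sketches nearly evenly, so output files end up with similar sketch counts.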
Memory Management
The implementation manages memory through specific mechanisms:
- Single ByteBuilder instance reused via bb.clear() method for all sketch serializations
- Sketch loading through the tool.loadSketches_MT() method, after which sketches are processed one at a time
- Conditional multithreading controlled by allowMultithreadedFastq flag and buffer length capping via Shared.capBufferLen(40)
- Minimum sketch size filtering: a sketch is written only when sk.length() >= minSketchSize, so shorter sketches are skipped
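The buffer reuse and minimum-size filter can be sketched together. This is an illustration only: BufferReuse is a hypothetical class, a StringBuilder stands in for ByteBuilder (with setLength(0) playing the role of bb.clear()), and the one-key-per-line output is a made-up format, not the real sketch encoding.

```java
final class BufferReuse {
    /** Serializes all sketches that pass the size filter, reusing one buffer. */
    static String writeAll(long[][] sketches, int minSketchSize) {
        StringBuilder out = new StringBuilder(); // stands in for the output file
        StringBuilder bb = new StringBuilder();  // single buffer for every sketch
        for (long[] sk : sketches) {
            if (sk.length < minSketchSize) {
                continue; // undersized sketches are not written
            }
            bb.setLength(0); // reset the buffer instead of allocating a new one
            for (long key : sk) {
                bb.append(key).append('\n');
            }
            out.append(bb);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        long[][] sketches = {{1, 2, 3}, {4}, {5, 6}};
        System.out.print(writeAll(sketches, 2));
    }
}
```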
Data Preservation
During sketch processing, the tool handles sketch data as follows:
- Carries the genomeSizeBases, genomeSizeKmers, and genomeSizeEstimate metadata fields through toBytes() serialization
- Keeps each sketch's sketchID field, which the file distribution calculation depends on
- Updates fname metadata using the setFname() method when distributing across multiple files
- Writes a sketch only if it passes the minSketchSize check (sk.length() >= minSketchSize), which filters out undersized sketches
Performance Considerations
Memory Usage
Memory requirements scale with sketch collection size. Default memory allocation is 4GB (-Xmx4g), but large collections may require more:
- Small collections (<1000 sketches): 2-4GB sufficient
- Medium collections (1000-10000 sketches): 8-16GB recommended
- Large collections (>10000 sketches): Consider file distribution
Processing Speed
Performance factors include:
- Number of sketches in collection
- Original sketch sizes vs target sizes
- Blacklist complexity
- File I/O patterns (single vs distributed output)
Scalability
For very large datasets:
- Use file distribution (files parameter) to manage memory
- Consider batch processing for extremely large collections
- Monitor blacklist application performance
- Adjust Java heap size based on collection characteristics
Technical Notes
Compatibility
SubSketch works with the sketch files used by other BBTools sketch utilities, including:
- sketch.sh - Primary sketch generation tool
- sendsketch.sh - Database querying tool
- comparesketch.sh - Pairwise comparison tool
File Format
Processes standard BBTools sketch format files:
- Binary sketch files (.sketch extension)
- Compressed sketch files supported
- Multiple sketches per file supported
- Preserves all sketch metadata
Error Handling
The tool handles the following error conditions:
- Invalid sketch files
- Insufficient memory conditions
- File system errors
- Malformed parameters
Support
For questions and support:
- Email: bbushnell@lbl.gov
- Documentation: bbmap.org
- Guide: Read bbtools/docs/guides/BBSketchGuide.txt for comprehensive sketch usage information