MergeSketch

Script: mergesketch.sh Package: sketch Class: MergeSketch.java

Merges multiple sketches into a single sketch using union operations on sketch data structures. Supports both explicit file lists and wildcard patterns for input sketches.

Basic Usage

mergesketch.sh in=a.sketch,b.sketch out=c.sketch
mergesketch.sh *.sketch out=merged.sketch

Parameters

Parameters are grouped by their function in the sketch merging process and match the usage() text of the shell script.

Standard parameters

in=<file>
Input sketches or fasta files; may be a comma-delimited list. The in= prefix is optional when using wildcards, allowing patterns like *.sketch to be used directly.
out=<file>
Output sketch file. The merged sketch will contain the union of all k-mers from input sketches.
amino=f
Use amino acid mode for sketch generation. Set to true (t) when working with protein sequences instead of nucleotides.

Sketch-making parameters

mode=single
Processing mode for fasta input files. Options: 'single' generates one sketch per file, 'sequence' generates one sketch per individual sequence within files.
autosize=t
Produce an output sketch of whatever size the union happens to be, without restriction. When false, size limits are applied.
size=
Restrict output sketch to this upper bound of size. Limits the number of k-mers retained in the final merged sketch.
k=32,24
K-mer length for sketch generation, range 1-32. Multiple values can be specified. Default uses both 32-mers and 24-mers.
keyfraction=0.2
Only consider this upper fraction of keyspace when generating sketches. Reduces sketch size by sampling k-mers.
minkeycount=1
Ignore k-mers that occur fewer times than this threshold. Values over 1 can be used with raw reads to avoid error k-mers.
depth=f
Retain k-mer counts if available in the input sketches. When true, preserves count information during merging.
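
The keyfraction and minkeycount filters described above can be illustrated with a small stand-alone sketch. It is not BBTools code: the class KeyFilterSketch, its methods, and the assumption that hashes are non-negative 63-bit values with the "upper fraction" meaning the largest values are all hypothetical, chosen only to show the idea.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class KeyFilterSketch {

    // Hypothetical helper: smallest hash still inside the upper keyFraction
    // of an assumed non-negative 63-bit keyspace.
    public static long minAcceptedHash(double keyFraction) {
        return (long) ((1.0 - keyFraction) * Long.MAX_VALUE);
    }

    // Keep hashes in the upper fraction that were seen at least minKeyCount times.
    public static List<Long> filter(Map<Long, Integer> hashCounts,
                                    double keyFraction, int minKeyCount) {
        long threshold = minAcceptedHash(keyFraction);
        List<Long> kept = new ArrayList<>();
        for (Map.Entry<Long, Integer> e : hashCounts.entrySet()) {
            if (e.getKey() >= threshold && e.getValue() >= minKeyCount) {
                kept.add(e.getKey());
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        Map<Long, Integer> counts = new TreeMap<>();
        counts.put(Long.MAX_VALUE - 5, 3); // high hash, seen 3 times: kept
        counts.put(Long.MAX_VALUE - 9, 1); // high hash, seen once: dropped at minKeyCount=2
        counts.put(1000L, 7);              // low hash: outside the upper 20% of keyspace
        System.out.println(filter(counts, 0.2, 2));
    }
}
```

Raising minKeyCount above 1 removes k-mers seen only once, which is the case the minkeycount description recommends for raw reads.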

Metadata parameters

If blank, the values of the first input sketch will be used as defaults.

taxid=-1
Set the NCBI taxonomic identifier for the merged sketch. Default -1 indicates no taxonomic assignment.
imgid=-1
Set the IMG (Integrated Microbial Genomes) database identifier. Default -1 indicates no IMG assignment.
spid=-1
Set the JGI sequencing project identifier. Default -1 indicates no project assignment.
name=
Set the taxonomic name (taxname) for the merged sketch. Overrides the name from the first input sketch.
name0=
Set name0 field, normally derived from the first sequence header. Overrides the value from the first input sketch.
fname=
Set filename field, normally derived from the original file name. Overrides the value from the first input sketch.
meta_=
Set arbitrary metadata fields using the format meta_FieldName=Value. For example, meta_Month=March adds a Month field with value March.

Java Parameters

-Xmx
Set Java's maximum memory usage, overriding autodetection. Examples: -Xmx20g specifies 20 GB of RAM, -Xmx200m specifies 200 MB. The maximum is typically 85% of available physical memory.
-eoom
Exit on out-of-memory exceptions rather than attempting recovery. Requires Java 8u92 or later.
-da
Disable Java assertions for potentially improved performance in production environments.

Examples

Basic Sketch Merging

mergesketch.sh in=sketch1.sketch,sketch2.sketch out=combined.sketch

Merges two sketches into a single output sketch, preserving metadata from the first sketch.

Wildcard Pattern Merging

mergesketch.sh *.sketch out=all_merged.sketch

Merges all sketch files in the current directory using wildcard expansion.

Size-Limited Merging with Metadata

mergesketch.sh *.sketch out=limited.sketch size=10000 name=MergedDataset taxid=12345

Merges sketches with size limitation and custom metadata assignment.

Amino Acid Mode Merging

mergesketch.sh in=protein1.sketch,protein2.sketch out=proteins.sketch amino=t

Merges protein sketches using amino acid k-mers instead of nucleotide k-mers.

Count-Preserving Merge

mergesketch.sh *.sketch out=counts.sketch depth=t autosize=f size=50000

Merges sketches while preserving k-mer count information and applying size restrictions.

Algorithm Details

Union Operation: MergeSketch merges sketches by calling SketchHeap.add() to combine the k-mer sets of the input sketches. It first loads the inputs with the multithreaded SketchTool.loadSketches_MT(), then creates a SketchHeap with a configurable size and optional count tracking.
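
As a rough illustration of a size-bounded union, the following stand-alone sketch merges lists of hash values into a fixed-capacity heap. It does not reproduce SketchHeap: the class BoundedUnionSketch is hypothetical, and keeping the largest hashes is an assumption made for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.PriorityQueue;
import java.util.Set;

public class BoundedUnionSketch {
    private final int capacity;
    // Min-heap: the smallest retained hash sits on top and is evicted first.
    private final PriorityQueue<Long> heap = new PriorityQueue<>();
    private final Set<Long> members = new HashSet<>();

    public BoundedUnionSketch(int capacity) {
        this.capacity = capacity;
    }

    // Add one hash; duplicates are ignored, so repeated adds form a set union.
    public void add(long hash) {
        if (capacity <= 0 || members.contains(hash)) {
            return;
        }
        if (heap.size() < capacity) {
            heap.add(hash);
            members.add(hash);
        } else if (hash > heap.peek()) {
            members.remove(heap.poll()); // evict the smallest retained hash
            heap.add(hash);
            members.add(hash);
        }
    }

    public void addAll(List<Long> sketch) {
        for (long h : sketch) {
            add(h);
        }
    }

    public List<Long> toSortedList() {
        List<Long> out = new ArrayList<>(heap);
        Collections.sort(out);
        return out;
    }

    public static void main(String[] args) {
        BoundedUnionSketch merged = new BoundedUnionSketch(4);
        merged.addAll(Arrays.asList(10L, 20L, 30L));
        merged.addAll(Arrays.asList(20L, 40L, 50L));
        System.out.println(merged.toSortedList());
    }
}
```

Each insertion costs O(log n) in the heap size, and memory is bounded by the capacity rather than by the total number of input hashes.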

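Size Calculation: Output sketch size is determined by the Sketch.AUTOSIZE flag and a Tools.min() comparison. With autosize enabled, the size is the sum of the input sketch lengths; otherwise it is Tools.min(targetSketchSize, sum).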
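
A minimal sketch of this size rule, reconstructed from the description in this document rather than copied from MergeSketch.java, might look as follows. The class OutputSizeSketch and its method are hypothetical, and Math.min stands in for Tools.min().

```java
public class OutputSizeSketch {

    // Sum the input sketch lengths, then cap the result at targetSketchSize
    // unless autosize is enabled.
    public static int outputSize(boolean autosize, int targetSketchSize, int[] inputLengths) {
        int sum = 0;
        for (int len : inputLengths) {
            sum += len;
        }
        return autosize ? sum : Math.min(targetSketchSize, sum);
    }

    public static void main(String[] args) {
        int[] lengths = {8000, 9000};
        System.out.println(outputSize(true, 10000, lengths));  // autosize: uncapped sum
        System.out.println(outputSize(false, 10000, lengths)); // capped at the target size
    }
}
```

Under this reading, size= only has an effect when autosize is off, which matches the Count-Preserving Merge example that passes autosize=f together with size=50000.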

Metadata Processing: Metadata preservation follows first-sketch priority with override capability. The algorithm uses inSketches.get(0).meta as baseline, then ArrayList.addAll(outMeta) for user overrides. Custom metadata parsing uses startsWith("meta_") string matching and substring() extraction with colon delimiter.
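
The metadata handling can be sketched as below. This is an illustration under assumptions, not the MergeSketch.java source: the class MetaMergeSketch and its methods are hypothetical, and storing each entry as "Field:Value" is a guess at what the colon delimiter refers to.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class MetaMergeSketch {

    // Turn "meta_Field=Value" into "Field:Value"; return null for anything else.
    public static String parseMetaArg(String arg) {
        if (!arg.startsWith("meta_")) {
            return null;
        }
        int eq = arg.indexOf('=');
        if (eq < 0) {
            return null;
        }
        String field = arg.substring("meta_".length(), eq);
        String value = arg.substring(eq + 1);
        if (field.isEmpty() || value.isEmpty()) {
            return null;
        }
        return field + ":" + value;
    }

    // Start from the first sketch's metadata, then append the user-supplied entries.
    public static List<String> mergeMeta(List<String> firstSketchMeta, List<String> userMeta) {
        List<String> merged = new ArrayList<>();
        if (firstSketchMeta != null) {
            merged.addAll(firstSketchMeta);
        }
        if (userMeta != null) {
            merged.addAll(userMeta);
        }
        return merged;
    }

    public static void main(String[] args) {
        List<String> baseline = Arrays.asList("Source:lab");
        List<String> user = Arrays.asList(parseMetaArg("meta_Month=March"));
        System.out.println(mergeMeta(baseline, user));
    }
}
```

This sketch only appends user entries after the baseline; whether the real tool replaces an existing entry with the same field name is not stated here.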

K-mer Count Integration: Count preservation is controlled by SketchHeap constructor's trackCounts parameter. When enabled, the heap maintains k-mer frequencies during union operations, allowing quantitative analysis through the merged sketch's count data.
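
To make the effect of count tracking concrete, here is a small stand-alone sketch that merges hash-to-count maps. It is hypothetical and not based on the SketchHeap source: the class CountMergeSketch, the trackCounts argument as used here, and the choice to sum counts across inputs are assumptions for illustration.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.Arrays;

public class CountMergeSketch {

    // Union of the input keys; counts are summed when trackCounts is true,
    // otherwise every retained key records a presence count of 1.
    public static Map<Long, Integer> merge(List<Map<Long, Integer>> inputs, boolean trackCounts) {
        Map<Long, Integer> out = new TreeMap<>();
        for (Map<Long, Integer> in : inputs) {
            for (Map.Entry<Long, Integer> e : in.entrySet()) {
                if (trackCounts) {
                    out.merge(e.getKey(), e.getValue(), Integer::sum);
                } else {
                    out.put(e.getKey(), 1);
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        Map<Long, Integer> a = new HashMap<>();
        a.put(10L, 2);
        a.put(20L, 1);
        Map<Long, Integer> b = new HashMap<>();
        b.put(20L, 4);
        b.put(30L, 1);
        System.out.println(merge(Arrays.asList(a, b), true));
        System.out.println(merge(Arrays.asList(a, b), false));
    }
}
```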

Memory Optimization: The implementation uses efficient data structures for large-scale merging operations. Memory usage scales with the size of the output sketch rather than the sum of input sketches, making it suitable for merging many large sketches.

File Format Compatibility: The tool can merge both existing sketch files and generate new sketches from FASTA input, providing flexibility in workflow integration.

Performance Considerations

Memory Usage: Memory requirements depend on the final sketch size rather than the number or size of input sketches. The default allocation is 4 GB, set in calcXmx(), which calls freeRam 3200m 84 (an 84% memory calculation); use the -Xmx parameter to override it.

Processing Speed: Merging performance scales linearly with input count through SketchHeap.add() calls. Heap operations use priority queue algorithms with O(log n) insertion complexity per k-mer.

Output Size Prediction: The final sketch size cannot exceed the sum of the input sketch lengths (accumulated as sum+=sk.length()), which is an upper bound on the number of unique k-mers in the union. It may be smaller when a size restriction applies via Tools.min(targetSketchSize, sum).

Related Tools

MergeSketch is part of the BBSketch suite of tools:

For detailed information about sketch-based analysis, consult the BBSketchGuide.txt documentation.

Support

For questions and support: