MergeSketch
Merges multiple sketches into a single sketch by taking the union of their k-mers. Accepts either an explicit file list or a wildcard pattern for the input sketches.
Basic Usage
mergesketch.sh in=a.sketch,b.sketch out=c.sketch
mergesketch.sh *.sketch out=merged.sketch
Parameters
Parameters are grouped by function. Names and default values follow the usage() text of the shell script.
Standard parameters
- in=<file>
- Input sketches or fasta files; may be a comma-delimited list. The in= prefix is optional, so a shell wildcard such as *.sketch can be given directly and expanded by the shell.
- out=<file>
- Output sketch file. The merged sketch will contain the union of all k-mers from input sketches.
- amino=f
- Use amino acid mode for sketch generation. Set to true (t) when working with protein sequences instead of nucleotides.
Sketch-making parameters
- mode=single
- Processing mode for fasta input files. Options: 'single' generates one sketch per file, 'sequence' generates one sketch per individual sequence within files.
- autosize=t
- Produce an output sketch of whatever size the union happens to be, without restriction. When false, the output is capped by the size= limit.
- size=
- Restrict output sketch to this upper bound of size. Limits the number of k-mers retained in the final merged sketch.
- k=32,24
- K-mer length for sketch generation, range 1-32. Multiple values can be specified. Default uses both 32-mers and 24-mers.
- keyfraction=0.2
- Only consider this upper fraction of keyspace when generating sketches. Reduces sketch size by sampling k-mers.
- minkeycount=1
- Ignore k-mers that occur fewer times than this threshold. Values over 1 can be used with raw reads to avoid error k-mers.
- depth=f
- Retain k-mer counts if available in the input sketches. When true, preserves count information during merging.
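The effect of minkeycount can be illustrated with a small, self-contained sketch. It is not BBTools code: the class MinKeyCountDemo and its method are hypothetical, and it counts plain k-mer strings, whereas the real tool works on hashed k-mers. It only shows why a threshold above 1 drops k-mers introduced by a sequencing error in raw reads.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeSet;

// Hypothetical illustration of a minkeycount-style filter (not the BBTools implementation).
final class MinKeyCountDemo {

    /** Returns the k-mers that occur at least minKeyCount times across all reads. */
    static TreeSet<String> keptKmers(String[] reads, int k, int minKeyCount) {
        Map<String, Integer> counts = new HashMap<>();
        for (String read : reads) {
            for (int i = 0; i + k <= read.length(); i++) {
                counts.merge(read.substring(i, i + k), 1, Integer::sum);
            }
        }
        TreeSet<String> kept = new TreeSet<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            if (e.getValue() >= minKeyCount) {
                kept.add(e.getKey());
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // The third read differs from the others by one base, as a read error might.
        String[] reads = {"ACGTAC", "ACGTAC", "ACGTTC"};
        System.out.println(keptKmers(reads, 4, 1)); // [ACGT, CGTA, CGTT, GTAC, GTTC]
        System.out.println(keptKmers(reads, 4, 2)); // [ACGT, CGTA, GTAC]
    }
}
```

With a threshold of 2, the two k-mers seen only in the divergent read (CGTT and GTTC) are discarded.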
Metadata parameters
If blank, the values of the first input sketch will be used as defaults.
- taxid=-1
- Set the NCBI taxonomic identifier for the merged sketch. Default -1 indicates no taxonomic assignment.
- imgid=-1
- Set the IMG (Integrated Microbial Genomes) database identifier. Default -1 indicates no IMG assignment.
- spid=-1
- Set the JGI sequencing project identifier. Default -1 indicates no project assignment.
- name=
- Set the taxonomic name (taxname) for the merged sketch. Overrides the name from the first input sketch.
- name0=
- Set name0 field, normally derived from the first sequence header. Overrides the value from the first input sketch.
- fname=
- Set filename field, normally derived from the original file name. Overrides the value from the first input sketch.
- meta_=
- Set arbitrary metadata fields using the format meta_FieldName=Value. For example, meta_Month=March adds a Month field with value March.
Java Parameters
- -Xmx
- Set Java's maximum memory usage, overriding autodetection. Examples: -Xmx20g specifies 20 GB of RAM, -Xmx200m specifies 200 MB. The maximum is typically 85% of available physical memory.
- -eoom
- Exit on out-of-memory exceptions rather than attempting recovery. Requires Java 8u92 or later.
- -da
- Disable Java assertions for potentially improved performance in production environments.
Examples
Basic Sketch Merging
mergesketch.sh in=sketch1.sketch,sketch2.sketch out=combined.sketch
Merges two sketches into a single output sketch, preserving metadata from the first sketch.
Wildcard Pattern Merging
mergesketch.sh *.sketch out=all_merged.sketch
Merges all sketch files in the current directory using wildcard expansion.
Size-Limited Merging with Metadata
mergesketch.sh *.sketch out=limited.sketch size=10000 name=MergedDataset taxid=12345
Merges sketches with size limitation and custom metadata assignment.
Amino Acid Mode Merging
mergesketch.sh in=protein1.sketch,protein2.sketch out=proteins.sketch amino=t
Merges protein sketches using amino acid k-mers instead of nucleotide k-mers.
Count-Preserving Merge
mergesketch.sh *.sketch out=counts.sketch depth=t autosize=f size=50000
Merges sketches while preserving k-mer count information and applying size restrictions.
Algorithm Details
Union Operation: MergeSketch merges by calling SketchHeap.add() to combine the k-mer sets of the input sketches. It loads the inputs with the multithreaded SketchTool.loadSketches_MT(), then creates a SketchHeap with a configurable size and optional count tracking.
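A size-bounded union of this kind can be sketched with the standard java.util.PriorityQueue. The class BoundedUnionDemo below is a hypothetical stand-in, not the SketchHeap API. It also assumes that the largest hash codes are the ones retained when the heap is full; the real tool's ordering may differ.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.PriorityQueue;
import java.util.Set;

// Hypothetical stand-in for a bounded union heap (not the BBTools SketchHeap).
final class BoundedUnionDemo {
    private final int capacity;
    // Min-heap: the smallest retained code is on top and is the first to be evicted.
    private final PriorityQueue<Long> heap = new PriorityQueue<>();
    private final Set<Long> present = new HashSet<>();

    BoundedUnionDemo(int capacity) {
        if (capacity < 1) {
            throw new IllegalArgumentException("capacity must be positive");
        }
        this.capacity = capacity;
    }

    /** Adds one hash code; duplicates are ignored, and the smallest code is evicted when full. */
    void add(long code) {
        if (present.contains(code)) {
            return; // union semantics: a code shared by several inputs is stored once
        }
        if (heap.size() < capacity) {
            heap.add(code);
            present.add(code);
        } else if (code > heap.peek()) {
            present.remove(heap.poll());
            heap.add(code);
            present.add(code);
        }
    }

    void addAll(long[] sketch) {
        for (long code : sketch) {
            add(code);
        }
    }

    long[] toSortedArray() {
        long[] out = new long[heap.size()];
        int i = 0;
        for (long code : heap) {
            out[i++] = code;
        }
        Arrays.sort(out);
        return out;
    }

    public static void main(String[] args) {
        BoundedUnionDemo merged = new BoundedUnionDemo(4);
        merged.addAll(new long[]{10, 40, 70});
        merged.addAll(new long[]{40, 55, 90});
        System.out.println(Arrays.toString(merged.toSortedArray())); // [40, 55, 70, 90]
    }
}
```

The two inputs hold five distinct codes between them; with a capacity of 4 the smallest one (10) is evicted, and the shared code 40 appears once.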
Size Calculation: Output sketch size is determined by the Sketch.AUTOSIZE flag and a Tools.min() comparison:
- Autosize mode (autosize=t): Size equals the sum of the input sketch lengths, accumulated in the process() method as sum+=sk.length(). This sum is an upper bound on the union, because a k-mer shared by several inputs is counted once per input.
- Fixed size mode (autosize=f): Size is Tools.min(Sketch.targetSketchSize, sum), so the heap is never allocated larger than the combined input length.
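The two modes reduce to a few lines of arithmetic. The sketch below is illustrative only: OutputSizeDemo is a hypothetical class, the input lengths are passed as a plain int array rather than Sketch objects, and Math.min stands in for the BBTools helper Tools.min().

```java
// Hypothetical illustration of the size rule described above (not BBTools code).
final class OutputSizeDemo {

    /** Returns the capacity chosen for the merged sketch. */
    static int outputSize(boolean autosize, int targetSketchSize, int[] inputLengths) {
        int sum = 0;
        for (int length : inputLengths) {
            sum += length; // mirrors sum+=sk.length()
        }
        // Math.min is used here in place of Tools.min().
        return autosize ? sum : Math.min(targetSketchSize, sum);
    }

    public static void main(String[] args) {
        int[] lengths = {8000, 5000, 12000};
        System.out.println(outputSize(true, 10000, lengths));  // 25000
        System.out.println(outputSize(false, 10000, lengths)); // 10000
        System.out.println(outputSize(false, 50000, lengths)); // 25000
    }
}
```

The last call shows that a limit larger than the combined input length has no effect: the capacity falls back to the sum.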
Metadata Processing: Metadata follows first-sketch priority with user overrides. The algorithm takes inSketches.get(0).meta as the baseline, then applies user-supplied fields with ArrayList.addAll(outMeta). Custom metadata arguments are recognized by startsWith("meta_"); the field name is extracted with substring() and paired with its value using a colon delimiter.
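A minimal sketch of this flow follows. It is an approximation under stated assumptions: MetaOverrideDemo and its methods are hypothetical names, and the stored form "Field:Value" is inferred from the colon delimiter mentioned above rather than taken from the BBTools source.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration of meta_ argument handling (not the BBTools implementation).
final class MetaOverrideDemo {

    /** Converts "meta_Field=Value" into "Field:Value"; returns null for any other argument. */
    static String parseMetaArg(String arg) {
        int eq = arg.indexOf('=');
        if (!arg.startsWith("meta_") || eq < 0) {
            return null;
        }
        String field = arg.substring("meta_".length(), eq);
        String value = arg.substring(eq + 1);
        return field + ":" + value; // assumed storage format
    }

    /** Starts from the first sketch's metadata, then appends user-supplied fields. */
    static List<String> mergedMeta(List<String> firstSketchMeta, String[] args) {
        List<String> outMeta = new ArrayList<>();
        for (String arg : args) {
            String parsed = parseMetaArg(arg);
            if (parsed != null) {
                outMeta.add(parsed);
            }
        }
        List<String> result = new ArrayList<>(firstSketchMeta);
        result.addAll(outMeta);
        return result;
    }

    public static void main(String[] args) {
        List<String> first = Arrays.asList("Source:soil");
        String[] cli = {"meta_Month=March", "out=c.sketch"};
        System.out.println(mergedMeta(first, cli)); // [Source:soil, Month:March]
    }
}
```

Note that addAll() appends; this sketch does not remove an existing field with the same name, and whether the real tool does so is not stated here.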
K-mer Count Integration: Count preservation is controlled by SketchHeap constructor's trackCounts parameter. When enabled, the heap maintains k-mer frequencies during union operations, allowing quantitative analysis through the merged sketch's count data.
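The difference between a plain union and a count-preserving union can be shown with a small map-based sketch. CountMergeDemo is a hypothetical class, and summing the counts of a k-mer that appears in several inputs is an assumption about how frequencies are combined, not a statement of SketchHeap's exact behavior.

```java
import java.util.TreeMap;

// Hypothetical illustration of count tracking during a union (not BBTools code).
final class CountMergeDemo {

    /**
     * Unions the hash codes of several sketches.
     * With trackCounts, counts of shared codes are summed (an assumption);
     * without it, every code is simply recorded once with a placeholder count of 1.
     */
    static TreeMap<Long, Integer> merge(boolean trackCounts, long[][] codes, int[][] counts) {
        TreeMap<Long, Integer> merged = new TreeMap<>();
        for (int s = 0; s < codes.length; s++) {
            for (int i = 0; i < codes[s].length; i++) {
                long code = codes[s][i];
                if (trackCounts) {
                    merged.merge(code, counts[s][i], Integer::sum);
                } else {
                    merged.put(code, 1);
                }
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        long[][] codes = {{40, 70}, {40, 90}};
        int[][] counts = {{3, 1}, {2, 5}};
        System.out.println(merge(true, codes, counts));  // {40=5, 70=1, 90=5}
        System.out.println(merge(false, codes, counts)); // {40=1, 70=1, 90=1}
    }
}
```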
Memory Optimization: The merge heap is allocated at the output sketch size. With autosize=f its footprint is therefore bounded by the size limit regardless of how many sketches are merged, which makes the tool suitable for merging many large sketches; with autosize=t the capacity grows with the combined input length.
File Format Compatibility: The tool can merge both existing sketch files and generate new sketches from FASTA input, providing flexibility in workflow integration.
Performance Considerations
Memory Usage: Memory requirements depend on the final sketch size rather than the number or size of input sketches. The launcher script defaults to 4 GB; its calcXmx() function then autodetects memory by calling freeRam 3200m 84 (84% of free RAM). An explicit -Xmx parameter overrides this.
Processing Speed: Merging performance scales linearly with input count through SketchHeap.add() calls. Heap operations use priority queue algorithms with O(log n) insertion complexity per k-mer.
Output Size Prediction: The final sketch size cannot exceed the number of unique k-mers in the union of the inputs, which is itself at most the summed input length (sum+=sk.length()). It is smaller still when a size restriction applies via Tools.min(targetSketchSize, sum).
Related Tools
MergeSketch is part of the BBSketch suite of tools:
- sketch.sh: Generate sketches from sequence files
- comparesketch.sh: Compare sketches for similarity analysis
- sendsketch.sh: Query sketches against online databases
For detailed information about sketch-based analysis, consult the BBSketchGuide.txt documentation.
Support
For questions and support:
- Email: bbushnell@lbl.gov
- Documentation: bbmap.org
- Guides: BBSketchGuide.txt