SketchBlacklist2

Script: sketchblacklist2.sh
Package: sketch
Class: BlacklistMaker2.java

Creates a blacklist sketch from common kmers, meaning kmers that occur in at least X different sketches or taxa. BlacklistMaker2 builds blacklists from sketches rather than from sequences. It is advisable to make the input sketches larger than normal (e.g. sizemult=2), because new kmers will be introduced into the final sketches to replace the blacklisted ones.

Basic Usage

sketchblacklist2.sh ref=<sketch files> out=<sketch file>
sketchblacklist2.sh *.sketch out=<sketch file>
sketchblacklist2.sh ref=taxa#.sketch out=<sketch file>

Parameters

Parameters are organized by their function in the blacklist creation process. All parameters from the shell script are documented here.

Standard parameters

ref=<file>
Sketch files to process. Can specify multiple files using wildcards or pattern matching. These are the input sketches from which common kmers will be identified for blacklisting.
out=<file>
Output filename for the resulting blacklist sketch. This sketch will contain the kmers that should be excluded from future analyses due to their high frequency across taxa.
mintaxcount=20
Retain keys occurring in at least this many taxa. Kmers found in fewer taxonomic units are discarded. Higher values yield a smaller blacklist limited to the most widespread kmers.
length=300000
Retain at most this many keys in the final blacklist sketch. When more keys qualify than the limit allows, those with the highest counts are kept.
k=32,24
Kmer lengths, 1-32. Comma-separated list of kmer sizes to use for blacklist creation. Multiple kmer lengths allow for different levels of specificity in identifying common sequences.
mode=taxa
Counting mode for kmer frequency calculation. Options:
  • sequence: Count kmers once per sketch (treats each sketch as a single unit)
  • taxa: Count kmers once per taxonomic unit (merges sketches from same taxa, requires taxonomy files)
Taxa mode is recommended when sketches represent different strains or isolates of the same species.
name=
Set the blacklist sketch name. This name will be embedded in the output sketch metadata for identification purposes. If not specified, defaults to "blacklist".
delta=t
Delta-compress sketches. Enables delta compression to reduce file size by storing differences between consecutive kmers rather than absolute values. Default: true.
a48=t
Encode sketches as ASCII-48 rather than hex. Uses ASCII-48 encoding instead of hexadecimal for sketch storage, which can improve compatibility with some text-processing tools. Default: true.
amino=f
Amino-acid mode. When enabled, processes protein sequences instead of nucleotide sequences, using amino acid kmers for blacklist creation. Default: false.
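
The delta and a48 options can be pictured with a small shell sketch. It is illustrative only and is not the BBTools implementation: the function names are hypothetical, the keys are assumed to be sorted in ascending order, and the A48 scheme shown (6 bits per character, each digit offset by ASCII 48) is an assumption about the encoding rather than something stated in this document.

```shell
# Toy illustration; delta_encode and a48_encode are hypothetical helpers.

# Delta-encode ascending integers read from stdin: each output value is the
# difference from the previous key (the first key is emitted unchanged).
delta_encode() {
    local prev=0 key
    while read -r key; do
        printf '%s\n' "$(( key - prev ))"
        prev=$key
    done
}

# Assumed A48 scheme: split the value into 6-bit digits, most significant
# first, and write each digit as the character with code (digit + 48).
a48_encode() {
    local value=$1 out="" digit
    if [ "$value" -eq 0 ]; then printf '0\n'; return; fi
    while [ "$value" -gt 0 ]; do
        digit=$(( (value & 63) + 48 ))
        out="$(printf "\\$(printf '%03o' "$digit")")$out"
        value=$(( value >> 6 ))
    done
    printf '%s\n' "$out"
}

# Three ascending keys become one full value plus two small differences.
printf '%s\n' 1000000 1000300 1004396 | delta_encode   # 1000000, 300, 4096

# The same number needs 4 characters under this scheme and 5 in hex.
a48_encode 1000000        # 3d90
printf '%x\n' 1000000     # f4240
```

Small differences need fewer characters than full keys, which is why the two options are shown together here.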

Taxonomy-specific parameters

tree=
Specify a taxtree file. Required for taxa mode operation. On Genepool systems, use 'auto' to automatically locate the default taxonomy tree. The tree file contains the hierarchical taxonomic relationships needed for taxonomic grouping.
gi=
Specify a gitable file. Maps GI numbers to taxonomic IDs. On Genepool systems, use 'auto' to automatically locate the default GI table. Used in conjunction with taxonomy tree for sequence classification.
accession=
Specify one or more comma-delimited NCBI accession to taxid files. These files map sequence accession numbers to taxonomic IDs. On Genepool systems, use 'auto' to automatically locate default accession files. Multiple files can be specified separated by commas.
taxlevel=subspecies
Taxa hits below this rank will be promoted and merged with others. Sequences assigned to taxonomic levels below the specified level will be promoted to this level for grouping purposes. This prevents over-fragmentation at very specific taxonomic levels. Default: subspecies.
tossjunk=t
For taxa mode, discard taxonomically uninformative sequences. When enabled, removes sequences that lack proper taxonomic classification, including:
  • Sequences with no taxid assigned
  • Sequences with tax level NO_RANK
  • Sequences with parent taxid of LIFE (too broad to be informative)
This helps ensure the blacklist contains only well-characterized sequences. Default: true.
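
The combined effect of tossjunk and taxlevel can be sketched with awk. This is a toy model under stated assumptions, not the tool's code: the function name promote_taxa, the three-column tree format, the short rank list, the use of 0 for "no taxid", and the literal parent name LIFE are all hypothetical.

```shell
# Toy illustration; promote_taxa and its file formats are hypothetical.
# $1    = tree file with "taxid parent_taxid rank" per line
# $2    = taxlevel to promote to
# stdin = one taxid per sequence (0 means no taxid assigned)
promote_taxa() {
    awk -v target="$2" '
        BEGIN {
            # Ranks from most specific to least specific (shortened list).
            n = split("strain subspecies species genus", order, " ")
            for (i = 1; i <= n; i++) level[order[i]] = i
        }
        NR == FNR { parent[$1] = $2; rank[$1] = $3; next }   # load the tree
        {
            id = $1
            # tossjunk: skip unknown taxids, no_rank nodes, children of LIFE
            if (!(id in rank) || rank[id] == "no_rank" || parent[id] == "LIFE") next
            # taxlevel: climb while the node ranks below the target level
            while ((parent[id] in rank) && level[rank[id]] < level[target])
                id = parent[id]
            print $1, id
        }' "$1" -
}

tree=$(mktemp)
cat > "$tree" <<'EOF'
10 LIFE no_rank
20 10 genus
30 20 species
40 30 subspecies
50 40 strain
60 LIFE species
EOF

# Taxids 0, 10 and 60 are discarded; 50 and 40 are promoted to species 30.
printf '%s\n' 50 40 30 0 10 60 | promote_taxa "$tree" species
rm -f "$tree"
```

Each output line pairs the original taxid with the taxid it is grouped under, so the three surviving inputs all map to 30.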

Java Parameters

-Xmx
Sets Java's maximum heap size, overriding autodetection (e.g., -Xmx20g for 20 gigabytes, -Xmx200m for 200 megabytes). The maximum recommended value is typically 85% of physical memory. The default is computed by the script's freeRam(4000m, 84) call, reported here as roughly 31GB; the actual value depends on the memory available on the machine.
-eoom
This flag will cause the process to exit if an out-of-memory exception occurs. Provides cleaner termination when memory is exhausted rather than hanging or producing corrupted output. Requires Java 8u92 or later.
-da
Disable assertions. Removes runtime assertion checks to improve performance in production environments. Should only be used when the tool has been thoroughly tested with the specific dataset and parameters.

Examples

Basic Blacklist Creation

# Create blacklist from all sketches in directory
sketchblacklist2.sh *.sketch out=common_kmers.sketch

# Create blacklist with specific parameters
sketchblacklist2.sh ref=bacterial_sketches.sketch out=bacterial_blacklist.sketch \
    mintaxcount=50 length=100000

Creates a blacklist sketch from multiple input sketches, retaining kmers that appear in at least 50 taxa and limiting the result to the 100,000 most common kmers.
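
The way mintaxcount and length interact can be illustrated with standard shell tools. This is a minimal sketch, not how BlacklistMaker2 works internally: the function name select_keys and the two-column "key count" input are assumptions made for the example.

```shell
# Toy illustration; select_keys and its input format are hypothetical.
# stdin = "key taxa_count" per line
# $1    = mintaxcount, $2 = length
select_keys() {
    awk -v min="$1" '$2 >= min' |    # drop keys seen in too few taxa
        sort -k2,2nr -k1,1 |         # most widespread keys first, ties by key
        head -n "$2"                 # retain at most "length" keys
}

# Three keys pass mintaxcount=50, but length=2 keeps only the top two.
printf '%s\n' 'KMER_A 120' 'KMER_B 49' 'KMER_C 75' 'KMER_D 50' |
    select_keys 50 2
```

Here KMER_B is removed by the threshold, while KMER_D passes the threshold but is dropped by the length cap because two keys have higher counts.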

Taxonomy-Aware Blacklist

# Create taxonomically-informed blacklist
sketchblacklist2.sh ref=ncbi_sketches.sketch out=taxonomy_blacklist.sketch \
    mode=taxa tree=auto accession=auto \
    mintaxcount=25 taxlevel=species tossjunk=t

Uses taxonomic information to group sketches by species before identifying common kmers, requiring presence in at least 25 species.
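
The difference between the two counting modes can be shown on a tiny data set. This is a sketch under assumptions: count_keys and the "sketch_id taxid key" input format are hypothetical and stand in for the real sketch files and taxonomy lookups.

```shell
# Toy illustration; count_keys and its input format are hypothetical.
# stdin = "sketch_id taxid key" per line
# $1    = counting mode: "sequence" or "taxa"
count_keys() {
    local col=1                              # sequence mode: group by sketch
    if [ "$1" = taxa ]; then col=2; fi       # taxa mode: group by taxid
    awk -v c="$col" '{ print $c, $3 }' |
        sort -u |                            # count each key once per group
        awk '{ n[$2]++ } END { for (k in n) print k, n[k] }' |
        sort
}

# Sketches s1 and s2 belong to the same taxon (100).
sketches='s1 100 KMER_A
s2 100 KMER_A
s3 200 KMER_A
s3 200 KMER_B'

printf '%s\n' "$sketches" | count_keys sequence   # KMER_A 3, KMER_B 1
printf '%s\n' "$sketches" | count_keys taxa       # KMER_A 2, KMER_B 1
```

In taxa mode the two sketches from taxon 100 contribute a single count for KMER_A, which is why closely related strains do not inflate the totals.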

High-Memory Processing

# Process large sketch collections with increased memory
sketchblacklist2.sh ref=large_collection/*.sketch out=large_blacklist.sketch \
    length=500000 mintaxcount=100 -Xmx64g -eoom

Processes a large collection of sketches with 64GB memory allocation and automatic termination on out-of-memory conditions.

Custom Kmer Lengths

# Use dual kmer lengths for comprehensive blacklisting
sketchblacklist2.sh ref=input_sketches.sketch out=multi_k_blacklist.sketch \
    k=32,16 mintaxcount=30 name="Multi-kmer_Blacklist"

Creates a blacklist using dual kmer lengths (32 and 16) with a custom name for the output sketch.

Algorithm Details

BlacklistMaker2 identifies commonly occurring kmers across large sketch collections with a multithreaded design in which the counts are held in an array of HashMaps split 63 ways.
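
One plausible reading of "63 ways" is that each key is assigned to one of 63 maps by a modulo operation, so that threads working on different keys seldom touch the same map. The sketch below shows only that assignment step. The formula and the names WAYS and shard_of are assumptions for illustration and are not taken from the BlacklistMaker2 source.

```shell
# Toy illustration; the shard formula is an assumption, not the tool's code.
WAYS=63

# Map a signed 64-bit key to a shard index in the range 0..62.
# The sign bit is cleared first so that negative keys give a valid index.
shard_of() {
    printf '%s\n' "$(( ($1 & 0x7FFFFFFFFFFFFFFF) % WAYS ))"
}

for key in 12345 12408 -7; do
    printf '%s -> shard %s\n' "$key" "$(shard_of "$key")"
done
```

Keys 12345 and 12408 differ by exactly 63 and therefore land in the same shard, while the negative key still maps to a valid index.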

Core Algorithm

Processing Strategy

Output Generation

Performance Characteristics

Implementation Details

Notes and Recommendations

Related Tools

Support

For questions and support: