SketchBlacklist2

Basic Usage

sketchblacklist2.sh ref=<sketch files> out=<sketch file>

sketchblacklist2.sh *.sketch out=<sketch file>

sketchblacklist2.sh ref=taxa#.sketch out=<sketch file>

Parameters

Parameters are organized by their function in the blacklist creation process. All parameters from the shell script are documented here.

Standard parameters

ref=<file>

Sketch files to process. Multiple files can be specified with wildcards (*.sketch) or a # pattern (taxa#.sketch). These are the input sketches from which common kmers will be identified for blacklisting.

out=<file>

Output filename for the resulting blacklist sketch. This sketch will contain the kmers that should be excluded from future analyses due to their high frequency across taxa.

mintaxcount=20

Retain keys occurring in at least this many taxa. Kmers that appear in fewer than this number of different taxonomic units will be discarded. Higher values create more restrictive blacklists with only the most common kmers.

length=300000

Retain at most this many keys in the final blacklist sketch. When more keys qualify than the limit allows, those with the highest counts are kept.

k=32,24

Kmer lengths, 1-32. Comma-separated list of kmer sizes to use for blacklist creation. Multiple kmer lengths allow for different levels of specificity in identifying common sequences.

mode=taxa

Counting mode for kmer frequency calculation. Options:

sequence: Count kmers once per sketch (treats each sketch as a single unit)
taxa: Count kmers once per taxonomic unit (merges sketches from same taxa, requires taxonomy files)

Taxa mode is recommended when the input contains multiple strains or isolates of the same species, since sequence mode would count a kmer once for each of them and overstate how widespread it is.
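
The difference between the two modes can be illustrated with a small Python sketch. This is not the tool's Java implementation: the function name count_keys, the (taxid, keys) representation of a sketch, and the sample taxids are assumptions made for illustration only.

```python
from collections import defaultdict

def count_keys(sketches, mode="taxa"):
    """Count how many units contain each key.

    sketches: list of (taxid, set_of_keys) pairs (hypothetical representation).
    mode="sequence": every sketch is its own unit.
    mode="taxa": sketches sharing a taxid are merged into one unit.
    """
    units = defaultdict(set)  # key -> identifiers of the units containing it
    for index, (taxid, keys) in enumerate(sketches):
        unit = index if mode == "sequence" else taxid
        for key in keys:
            units[key].add(unit)
    return {key: len(ids) for key, ids in units.items()}

# Two strains of one species (taxid 562) plus one unrelated genome (taxid 1280).
sketches = [(562, {10, 20}), (562, {10, 30}), (1280, {10, 40})]
by_sequence = count_keys(sketches, mode="sequence")  # key 10 is counted 3 times
by_taxa = count_keys(sketches, mode="taxa")          # key 10 is counted 2 times
```

In this toy input, key 10 reaches a count of 3 in sequence mode only because one species was sketched twice; taxa mode reports 2.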

name=

Set the blacklist sketch name. This name will be embedded in the output sketch metadata for identification purposes. If not specified, defaults to "blacklist".

delta=t

Delta-compress sketches. Enables delta compression to reduce file size by storing differences between consecutive kmers rather than absolute values. Default: true.
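
As a rough illustration of the idea, the following Python sketch delta-encodes a sorted list of keys and restores it. The function names are hypothetical and the tool's actual on-disk format is not reproduced here; only the principle of storing differences is shown.

```python
def delta_encode(sorted_keys):
    """Replace each key with its difference from the previous key."""
    deltas, previous = [], 0
    for key in sorted_keys:
        deltas.append(key - previous)
        previous = key
    return deltas

def delta_decode(deltas):
    """Rebuild the absolute keys by accumulating the differences."""
    keys, total = [], 0
    for delta in deltas:
        total += delta
        keys.append(total)
    return keys

keys = [1000000, 1000007, 1000020, 1000500]
deltas = delta_encode(keys)  # [1000000, 7, 13, 480]
```

After the first entry, the differences are much shorter than the absolute values, which is why sorted keys compress well this way.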

a48=t

Encode sketches as ASCII-48 rather than hex. Uses ASCII-48 encoding instead of hexadecimal for sketch storage, which reduces file size. Default: true.

amino=f

Amino-acid mode. When enabled, processes protein sequences instead of nucleotide sequences, using amino acid kmers for blacklist creation. Default: false.

Taxonomy-specific parameters

tree=

Specify a taxtree file. Required for taxa mode operation. On Genepool systems, use 'auto' to automatically locate the default taxonomy tree. The tree file contains the hierarchical taxonomic relationships needed for taxonomic grouping.

gi=

Specify a gitable file. Maps GI numbers to taxonomic IDs. On Genepool systems, use 'auto' to automatically locate the default GI table. Used in conjunction with taxonomy tree for sequence classification.

accession=

Specify one or more comma-delimited NCBI accession to taxid files. These files map sequence accession numbers to taxonomic IDs. On Genepool systems, use 'auto' to automatically locate default accession files. Multiple files can be specified separated by commas.

taxlevel=subspecies

Taxa hits below this rank will be promoted to it and merged with others at that rank. This prevents over-fragmentation at very specific taxonomic levels. Default: subspecies.
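
Promotion can be pictured as walking up the tree until the requested rank is reached. The Python sketch below assumes a toy tree stored as a dict and a simplified rank ordering; both are hypothetical, and the real taxtree defines its own levels and node layout.

```python
# Hypothetical rank ordering, lowest to highest.
RANKS = ["strain", "subspecies", "species", "genus", "family"]

def promote(taxid, tree, taxlevel="subspecies"):
    """Walk up the tree until the node's rank is at or above taxlevel.

    tree maps taxid -> (parent_taxid, rank); the root is its own parent.
    """
    target = RANKS.index(taxlevel)
    while RANKS.index(tree[taxid][1]) < target:
        parent = tree[taxid][0]
        if parent == taxid:  # reached the root without finding the level
            break
        taxid = parent
    return taxid

# Toy tree: two strains under one subspecies, under one species.
tree = {
    1: (1, "family"),
    2: (1, "genus"),
    3: (2, "species"),
    4: (3, "subspecies"),
    5: (4, "strain"),
    6: (4, "strain"),
}
```

With taxlevel=subspecies, both strains (5 and 6) map to node 4 and are counted as one unit; with taxlevel=species they map to node 3.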

tossjunk=t

For taxa mode, discard taxonomically uninformative sequences. When enabled, removes sequences that lack proper taxonomic classification, including:

Sequences with no taxid assigned
Sequences with tax level NO_RANK
Sequences with parent taxid of LIFE (too broad to be informative)

This helps ensure the blacklist contains only well-characterized sequences. Default: true.
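
The three conditions can be expressed as a single predicate. This Python sketch assumes a toy tree stored as a dict and a placeholder taxid for LIFE; the names is_junk and LIFE and the sample ranks are illustrative, not taken from the tool.

```python
LIFE = 1  # hypothetical taxid standing in for the LIFE root node

def is_junk(taxid, tree):
    """Return True for the three junk cases listed above.

    tree maps taxid -> (parent_taxid, rank); taxid is None when unassigned.
    """
    if taxid is None:
        return True            # no taxid assigned
    parent, rank = tree[taxid]
    if rank == "NO_RANK":
        return True            # tax level NO_RANK
    return parent == LIFE      # parent taxid is LIFE

tree = {
    1: (1, "NO_RANK"),         # LIFE itself
    2: (1, "superkingdom"),    # direct child of LIFE
    3: (2, "species"),
    4: (3, "NO_RANK"),
}
kept = [t for t in (None, 2, 3, 4) if not is_junk(t, tree)]  # [3]
```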

Java Parameters

-Xmx: Sets Java's maximum heap size, overriding autodetection (e.g., -Xmx20g for 20 gigabytes, -Xmx200m for 200 megabytes). The maximum recommended value is typically 85% of physical memory. Without this flag, the script sizes the heap with its freeRam(4000m, 84) calculation, which results in an allocation of about 31GB.
-eoom: This flag will cause the process to exit if an out-of-memory exception occurs. Provides cleaner termination when memory is exhausted rather than hanging or producing corrupted output. Requires Java 8u92 or later.
-da: Disable assertions. Removes runtime assertion checks to improve performance in production environments. Should only be used when the tool has been thoroughly tested with the specific dataset and parameters.
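
For choosing an explicit -Xmx value, the 85% guideline is simple arithmetic. The Python helper below is a hypothetical convenience, not part of the tool; it only turns a physical memory size into a flag string.

```python
def xmx_flag(physical_gb, percent=85):
    """Build an -Xmx flag from a percentage of physical memory (whole gigabytes)."""
    heap_gb = physical_gb * percent // 100  # integer math avoids float rounding
    return f"-Xmx{heap_gb}g"

flag = xmx_flag(64)  # "-Xmx54g" on a machine with 64 GB of RAM
```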

Examples

Basic Blacklist Creation

# Create blacklist from all sketches in directory
sketchblacklist2.sh *.sketch out=common_kmers.sketch

# Create blacklist with specific parameters
sketchblacklist2.sh ref=bacterial_sketches.sketch out=bacterial_blacklist.sketch \
    mintaxcount=50 length=100000

The first command builds a blacklist from every .sketch file in the current directory using default settings. The second retains only kmers that appear in at least 50 taxa and keeps at most the 100,000 most common of them.

Taxonomy-Aware Blacklist

# Create taxonomically-informed blacklist
sketchblacklist2.sh ref=ncbi_sketches.sketch out=taxonomy_blacklist.sketch \
    mode=taxa tree=auto accession=auto \
    mintaxcount=25 taxlevel=species tossjunk=t

Uses taxonomic information to group sketches by species before identifying common kmers, requiring presence in at least 25 species.

High-Memory Processing

# Process large sketch collections with increased memory
sketchblacklist2.sh ref=large_collection/*.sketch out=large_blacklist.sketch \
    length=500000 mintaxcount=100 -Xmx64g -eoom

Processes a large collection of sketches with 64GB memory allocation and automatic termination on out-of-memory conditions.

Custom Kmer Lengths

# Use multiple kmer lengths for comprehensive blacklisting
sketchblacklist2.sh ref=input_sketches.sketch out=multi_k_blacklist.sketch \
    k=16,24,32 mintaxcount=30 name="Multi-kmer_Blacklist"

Creates a blacklist using multiple kmer lengths (16, 24, 32) with a custom name for the output sketch.

Algorithm Details

BlacklistMaker2 identifies kmers that occur commonly across large sketch collections. It is multi-threaded and stores its counts in an array of HashMaps partitioned 63 ways.

Core Algorithm

Sketch Processing: Reads existing sketches rather than raw sequences, significantly reducing I/O and processing time compared to sequence-based approaches
Multi-threaded Architecture: Distributes kmer counting across available CPU cores, with each thread processing different portions of the sketch index
Hash-Based Storage: Uses an array of HashMaps partitioned 63 ways, with IntListCompressor storing the list of taxonomic IDs seen for each kmer
Taxonomic Integration: When in taxa mode, promotes taxonomic IDs to specified levels and merges counts from sequences belonging to the same taxonomic unit

Processing Strategy

Index-Based Processing: Iterates through the sketch index table arrays, processing all kmers from loaded sketches
Count Aggregation: For each kmer, accumulates occurrences across sketches or taxonomic units based on the selected mode
Filtering and Prioritization: Retains only kmers meeting the minimum count threshold, then selects top kmers by frequency up to the specified length limit
Quality Control: In taxa mode with tossjunk=true, automatically excludes taxonomically uninformative sequences
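
The filtering and prioritization step can be sketched in Python as follows. The function name select_blacklist and the tie-breaking rule are assumptions; the parameter names mirror mintaxcount and length from the parameter list.

```python
def select_blacklist(counts, mintaxcount=20, length=300000):
    """Keep keys seen in at least mintaxcount units, then the `length` most frequent.

    counts maps key -> number of taxa (or sketches) containing it.
    """
    kept = [(key, n) for key, n in counts.items() if n >= mintaxcount]
    # Highest count first; ties broken by key so the result is deterministic
    # (the tie-breaking rule is an assumption, not documented behavior).
    kept.sort(key=lambda item: (-item[1], item[0]))
    return sorted(key for key, _ in kept[:length])

counts = {101: 45, 102: 3, 103: 27, 104: 20, 105: 19}
selected = select_blacklist(counts, mintaxcount=20, length=2)  # [101, 103]
```

Key 104 meets the threshold but is dropped because only two slots are available and two keys have higher counts.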

Output Generation

Sorted Output: Final kmer list is sorted for consistent output and optimal sketch compression
Metadata Inclusion: Output sketch includes processing parameters (minTaxCount, taxLevel) for reproducibility
Compression Options: Supports delta compression and ASCII-48 encoding for size optimization
Histogram Output: Optional histogram showing distribution of kmer occurrence counts
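
A histogram of this kind maps each occurrence count to the number of kmers that have it. The Python sketch below shows the computation only; the function name is hypothetical and the tool's actual histogram file format is not assumed.

```python
from collections import Counter

def occurrence_histogram(counts):
    """Map each occurrence count to the number of keys that have that count."""
    return dict(sorted(Counter(counts.values()).items()))

counts = {1: 2, 2: 2, 3: 5, 4: 1}  # key -> number of taxa containing it
histogram = occurrence_histogram(counts)  # {1: 1, 2: 2, 5: 1}
```

Such a distribution is useful when choosing mintaxcount, since it shows how many kmers would survive a given threshold.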

Performance Characteristics

Memory Usage: Default heap allocation of about 31GB, calculated with the freeRam(4000m, 84) method and configurable via -Xmx
Thread Management: When 32 or more threads are detected, caps the thread count at half of the available cores to limit memory use on hyperthreaded systems
I/O Configuration: Uses ByteFile.FORCE_MODE_BF2 for parallel I/O when thread count >2
Processing Architecture: ProcessThread distributes work across sketch index table arrays (i+=threads stride pattern)
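
The thread cap and the stride pattern can be sketched in Python as below. Both function names are hypothetical, and the real ProcessThread operates on the sketch index tables rather than on plain integer ranges.

```python
def capped_threads(detected):
    """Halve the thread count when 32 or more threads are detected."""
    return detected // 2 if detected >= 32 else detected

def table_indices(thread_id, threads, table_count):
    """Indices of the index-table arrays handled by one worker (i += threads)."""
    return list(range(thread_id, table_count, threads))

# With 4 workers and 10 tables, every table is visited by exactly one worker.
assignments = [table_indices(t, 4, 10) for t in range(4)]
# [[0, 4, 8], [1, 5, 9], [2, 6], [3, 7]]
```

The stride pattern needs no shared work queue: each worker derives its own indices from its thread id.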

Implementation Details

Hash Distribution: Uses 63-way hash distribution (key%ways) for HashMap array partitioning
Taxonomic ID Storage: IntListCompressor stores taxonomic ID lists per kmer with synchronized access
Thread Safety: Uses AtomicInteger nextUnknown for unknown taxonomic ID assignment (starting from minFakeID)
Key Processing: Converts keys using Long.MAX_VALUE-key0 transformation and applies minTaxCount filtering
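
The partitioning and key transformation described above can be sketched in Python. The helper names are hypothetical, plain dicts and lists stand in for the HashMap array and IntListCompressor, and the point at which the real code applies the Long.MAX_VALUE transformation is not stated in this document, so the ordering shown is an assumption.

```python
LONG_MAX = 2**63 - 1  # value of Java's Long.MAX_VALUE
WAYS = 63

def to_key(key0):
    """Apply the Long.MAX_VALUE - key0 transformation."""
    return LONG_MAX - key0

def way_for(key):
    """Pick the map that owns a key (key % ways)."""
    return key % WAYS

maps = [dict() for _ in range(WAYS)]  # one map per way: key -> list of taxids

def add(key0, taxid):
    """Record that taxid contains key0; returns the transformed key."""
    key = to_key(key0)
    maps[way_for(key)].setdefault(key, []).append(taxid)
    return key

stored = add(12345, 562)
```

Because a key always lands in the same map, each map can be guarded independently, which keeps contention between threads low.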

Notes and Recommendations

Input Sketch Size: Use larger than normal input sketches (e.g., sizemult=2) since blacklisted kmers will be replaced with new ones in final sketches
Memory Requirements: Default 31GB allocation may need adjustment for large collections; monitor usage and adjust -Xmx accordingly
Taxonomic Files: Ensure taxonomy files (tree, gi, accession) are current and consistent when using taxa mode
Parameter Tuning: Adjust mintaxcount and length parameters based on your specific use case and dataset characteristics
Quality Control: Enable tossjunk=t when working with mixed-quality databases to improve blacklist accuracy

Related Tools

sketch.sh: Create input sketches from raw sequences
comparesketch.sh: Compare sketches using blacklists
sendsketch.sh: Submit sketches to remote databases
sketchblacklist.sh: Original blacklist creator (sequence-based)

Support

For questions and support:

Email: bbushnell@lbl.gov
Documentation: bbmap.org
Guide: BBSketchGuide.txt