SketchBlacklist

Basic Usage

sketchblacklist.sh in=<fasta file> out=<sketch file>

Parameters

Parameters are organized according to their function in blacklist generation. The tool supports both sequence-based and taxonomy-aware modes for identifying common kmers across datasets.

Standard parameters

in=<file>: A fasta file containing one or more sequences. Input sequences will be processed to identify common kmers based on the specified mode and threshold.
out=<file>: Output filename for the blacklist sketch file. The sketch will contain kmers that occur frequently enough to warrant blacklisting in downstream analyses.
mintaxcount=100: Sketch kmers occurring in at least this many taxa (or sequences, in sequence mode). This threshold determines which kmers are common enough to include in the blacklist. Higher values produce smaller, more stringent blacklists.
k=31: Kmer length, 1-32. To maximize sensitivity and specificity, dual kmer lengths may be used: k=31,24. Longer kmers provide higher specificity but may miss some common sequences.
mode=sequence: Counting mode. Possible modes are sequence (count each kmer once per sequence) and taxa (count each kmer once per taxonomic unit). The taxa mode requires taxonomic information and creates more biologically meaningful blacklists.
name=: Set the blacklist sketch name. This name will be embedded in the sketch file for identification purposes. If not specified, the output filename will be used.
delta=t: Delta-compress sketches. This stores only the differences between consecutive hash values rather than the full values, which reduces sketch size.
a48=t: Encode sketches as ASCII-48 rather than hex. This encoding uses printable ASCII characters starting at character 48 ('0') for better text compatibility.
amino=f: Amino-acid mode. When set to true, processes protein sequences using amino acid kmers instead of nucleotide kmers. Kmer length constraints apply differently in amino acid mode.
entropy=0.66: Ignore sequences with entropy below this value. Low-entropy sequences (like homopolymer runs) are excluded from kmer counting as they are typically uninformative for blacklisting purposes.
keyfraction=0.16: Smaller values reduce blacklist size by ignoring a fraction of the key space. Range: 0.0001-0.5. This parameter allows subsampling of the kmer space to manage memory usage and blacklist size.
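
The delta compression described for delta=t can be illustrated with a minimal Python sketch. It assumes the hash values are already sorted in ascending order. The function names are hypothetical and are not part of BBTools, and the actual on-disk encoding may differ.

```python
def delta_encode(sorted_hashes):
    """Replace each value with its difference from the previous value."""
    deltas = []
    previous = 0
    for value in sorted_hashes:
        deltas.append(value - previous)
        previous = value
    return deltas


def delta_decode(deltas):
    """Rebuild the original values with a running sum."""
    values = []
    running = 0
    for delta in deltas:
        running += delta
        values.append(running)
    return values


hashes = [1000003, 1000050, 1000912, 1004000]
encoded = delta_encode(hashes)   # [1000003, 47, 862, 3088]
decoded = delta_decode(encoded)  # same as hashes
```

After the first entry, the encoded values are much smaller than the raw hashes, which is why they take fewer characters to write out.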

Taxonomy-specific parameters

tree=: Specify a taxtree file. On Genepool, use 'auto'. The taxonomic tree is required for taxa-aware blacklist generation and determines how sequences are grouped taxonomically.
gi=: Specify a gitable file. On Genepool, use 'auto'. The GI table maps GenBank identifiers to taxonomic IDs, enabling taxonomic classification of sequences.
accession=: Specify one or more comma-delimited NCBI accession-to-taxid files. On Genepool, use 'auto'. These files provide alternative mappings from sequence accessions to taxonomic identifiers.
taxlevel=subspecies: Taxa hits below this rank will be promoted and merged with others. This parameter controls the taxonomic resolution for grouping sequences, with higher levels creating broader taxonomic groups.
prefilter=t: Use a bloom filter to ignore low-count kmers. The prefilter uses KCountArray with 2-bit cells and 2 hash functions to screen out rare kmers before the main counting phase, reducing memory requirements for HashMap storage.
prepasses=2: Number of prefilter passes. Multiple passes allow progressive refinement of kmer counts, with each pass using updated filterMax thresholds from previous pass estimates.
prehashes=2: Number of prefilter hashes. More hash functions reduce false positives in the KCountArray bloom filter but increase computational overhead. The default of 2 hashes balances accuracy with processing speed.
prebits=-1: Manually override number of prefilter cell bits. When set to -1, the optimal number of bits per cell is automatically determined based on available memory and expected kmer counts.
tossjunk=t: For taxa mode, discard taxonomically uninformative sequences. This includes sequences with no taxid, with tax level NO_RANK, or with a parent taxid of LIFE. This helps focus the blacklist on meaningful taxonomic groups.
silva=f: Parse headers using Silva or semicolon-delimited syntax. When enabled, uses Silva database header format for taxonomic parsing instead of standard NCBI format.

Java Parameters

-Xmx: This will set Java's memory usage, overriding autodetection. -Xmx20g will specify 20 gigs of RAM, and -Xmx200m will specify 200 megs. The max is typically 85% of physical memory. Blacklist generation can be memory-intensive for large datasets.
-eoom: This flag will cause the process to exit if an out-of-memory exception occurs. Requires Java 8u92+. Recommended for production use to avoid corrupted output files.
-da: Disable assertions. Can provide minor performance improvements in production environments, but assertions help catch errors during development.

Examples

Basic Blacklist Creation

sketchblacklist.sh in=reference_genomes.fasta out=blacklist.sketch mintaxcount=50

Creates a blacklist sketch from reference genomes, including kmers that appear in at least 50 sequences.

Taxonomy-Aware Blacklist

sketchblacklist.sh in=ncbi_genomes.fasta out=tax_blacklist.sketch mode=taxa tree=auto gi=auto taxlevel=species mintaxcount=10

Creates a taxonomy-aware blacklist using NCBI taxonomic information, grouping sequences by species and including kmers found in at least 10 different species.

Protein Blacklist with Custom Parameters

sketchblacklist.sh in=proteins.fasta out=protein_blacklist.sketch amino=t k=15 entropy=0.5 mintaxcount=25

Creates a blacklist from protein sequences using 15-mer amino acid kmers, with relaxed entropy filtering and a moderate frequency threshold.

Memory-Optimized Large Dataset

sketchblacklist.sh in=large_dataset.fasta out=blacklist.sketch prefilter=t prepasses=3 keyfraction=0.1 -Xmx50g

Processes a large dataset with prefiltering and reduced key space sampling to manage memory usage.

Algorithm Details

SketchBlacklist implements a multi-pass kmer counting strategy via the BlacklistMaker class. The algorithm operates in two main phases: optional prefiltering using KCountArray data structures and final kmer counting using HashMap storage.

Prefiltering Strategy

When enabled, prefiltering uses KCountArray.makePrefilter_inner() with configurable cell bits (2 bits by default, overridable with the prebits parameter) and multiple hash functions (prehashes parameter, default 2). The prefilter allocates memory based on the filterMemory() calculation, which uses prefilterFraction=0.2 of available system memory. The cell count is computed as precells = (filterMemory * 8) / cbits, and prefiltering proceeds only if at least 100,000 cells are available.
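
The cell formula and the counting filter can be sketched in Python as follows. This is an illustration under stated assumptions, not the KCountArray implementation: the class name, the BLAKE2-based hashing, and the one-byte-per-cell storage (a real filter would pack 2-bit cells) are all stand-ins chosen for clarity.

```python
import hashlib

MIN_CELLS = 100_000  # minimum cell count stated in the text


def prefilter_cells(filter_memory_bytes, cbits=2):
    """precells = (filterMemory * 8) / cbits"""
    return (filter_memory_bytes * 8) // cbits


class CountingFilter:
    """Hypothetical stand-in for a counting bloom filter with saturating cells."""

    def __init__(self, cells, cbits=2, hashes=2):
        self.cells = cells
        self.max_count = (1 << cbits) - 1  # 2-bit cells saturate at 3
        self.hashes = hashes
        self.counts = bytearray(cells)     # one byte per cell, for clarity only

    def _slots(self, kmer):
        # Assumed hashing scheme: one BLAKE2 digest per seed.
        for seed in range(self.hashes):
            data = bytes([seed]) + kmer.to_bytes(8, "little")
            digest = hashlib.blake2b(data, digest_size=8).digest()
            yield int.from_bytes(digest, "little") % self.cells

    def increment(self, kmer):
        for slot in self._slots(kmer):
            if self.counts[slot] < self.max_count:
                self.counts[slot] += 1

    def estimate(self, kmer):
        # The minimum over all slots is an upper bound on the true count
        # (until saturation), so rare kmers can be screened out safely.
        return min(self.counts[slot] for slot in self._slots(kmer))


cells = prefilter_cells(50_000)  # 50 kB of filter memory -> 200,000 cells
if cells >= MIN_CELLS:
    prefilter = CountingFilter(cells)
    for _ in range(5):
        prefilter.increment(42)
```

Because a cell can only overcount, a kmer whose estimate is below the threshold cannot reach it in the exact counting phase either, as long as the threshold does not exceed the cell's saturation value.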

Multi-Pass Architecture

The makePrefilter_inner() method supports recursive multi-pass processing (prepasses parameter, default 2). Memory allocation alternates between a low and a high fraction derived from prefilterFraction=0.2: odd passes use 0.2 of available memory, and even passes use 0.8. When autoPasses is enabled and estimateUniqueKmers() returns fewer than 1,000,000 kmers, the algorithm terminates early at the current pass.
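
The alternating allocation and the early-exit rule can be written out as a small Python sketch. The function names are hypothetical, and the sketch assumes passes are numbered from 1; the source does not state the numbering.

```python
PREFILTER_FRACTION = 0.2
AUTO_PASS_CUTOFF = 1_000_000  # unique-kmer estimate below which passes stop


def pass_memory_fraction(pass_number, prefilter_fraction=PREFILTER_FRACTION):
    """Odd passes get the low fraction, even passes the complement.

    Assumes 1-based pass numbering.
    """
    if pass_number % 2 == 1:
        return prefilter_fraction
    return 1.0 - prefilter_fraction


def should_stop_early(estimated_unique_kmers, auto_passes=True,
                      cutoff=AUTO_PASS_CUTOFF):
    """Stop at the current pass when few unique kmers remain."""
    return auto_passes and estimated_unique_kmers < cutoff


fractions = [pass_memory_fraction(p) for p in (1, 2, 3)]
```

With these definitions, passes 1 and 3 use the 0.2 fraction and pass 2 uses the 0.8 fraction.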

Taxonomic Awareness

The algorithm supports three processing modes via the mode parameter: PER_SEQUENCE (uses r1.numericID masked with Integer.MAX_VALUE), PER_TAXA (uses TaxTree.parseNodeFromHeader() with taxonomic level promotion), and PER_IMG (Integrated Microbial Genomes mode). Taxa below taxLevel are promoted by parent traversal: while tn.level < taxLevel, the node is replaced by its parent (tn.pid) until the target taxonomic level is reached.
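
The promotion loop can be sketched in Python as below. The TaxNode class, the rank numbers, and the tiny tree are hypothetical and only mirror the id, pid, and level fields named in the text. The root guard (pid equal to id) is an added assumption to keep the loop from running forever on a self-parented root.

```python
from dataclasses import dataclass

# Hypothetical rank numbering: a larger number means a broader rank.
SUBSPECIES, SPECIES, GENUS, ROOT = 1, 2, 3, 9


@dataclass
class TaxNode:
    id: int
    pid: int    # parent taxid
    level: int  # rank of this node


def promote(node, nodes_by_id, tax_level):
    """Walk up the tree until the node is at or above tax_level."""
    while node.level < tax_level and node.pid != node.id:
        node = nodes_by_id[node.pid]
    return node


tree = {
    1: TaxNode(1, 1, ROOT),
    10: TaxNode(10, 1, GENUS),
    20: TaxNode(20, 10, SPECIES),
    30: TaxNode(30, 20, SUBSPECIES),
    31: TaxNode(31, 20, SUBSPECIES),
}

# Both subspecies collapse onto species node 20 when taxlevel is species.
merged = {promote(tree[t], tree, SPECIES).id for t in (30, 31)}
```

After promotion, a kmer seen in both subspecies counts as one taxon rather than two, which is how taxlevel affects the mintaxcount threshold.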

Kmer Processing and Storage

Kmer storage uses an array of maps, HashMap<Long, IntListCompressor>[], with ways=63 partitions. Kmers are assigned to a partition by hash % ways. Each IntListCompressor stores the taxonomic IDs in which a kmer occurs. To support multi-threading, synchronization occurs at both the map level (for HashMap access) and the IntListCompressor level (for list modification).
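
A simplified Python sketch of the partitioned storage is shown below. It is not the BlacklistMaker code: the class name is hypothetical, a Python set stands in for IntListCompressor, and a single lock per partition replaces the two synchronization levels described above.

```python
import threading

WAYS = 63  # number of partitions, as stated in the text


class PartitionedKmerTable:
    """Hypothetical illustration of hash % ways partitioning."""

    def __init__(self, ways=WAYS):
        self.ways = ways
        self.maps = [dict() for _ in range(ways)]
        self.locks = [threading.Lock() for _ in range(ways)]

    def add(self, kmer_hash, tax_id):
        way = kmer_hash % self.ways
        # Only the partition that owns this kmer is locked, so threads
        # working on other partitions are not blocked.
        with self.locks[way]:
            self.maps[way].setdefault(kmer_hash, set()).add(tax_id)

    def taxa(self, kmer_hash):
        way = kmer_hash % self.ways
        with self.locks[way]:
            return set(self.maps[way].get(kmer_hash, ()))


table = PartitionedKmerTable()
table.add(126, 5)  # 126 % 63 == 0
table.add(126, 7)
table.add(126, 5)  # repeated taxid is stored once
table.add(64, 5)   # 64 % 63 == 1
```

Splitting the table this way reduces lock contention, since two threads only wait on each other when their kmers land in the same partition.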

Memory Management

Memory management relies on the calcMemory() method, which calls Shared.memAvailableAdvanced() to determine available memory. Threads are capped at Shared.threads()/2 on hyperthreaded systems to prevent memory overload. Buffer allocation uses Tools.max(4, Shared.threads()+1), so buffering scales with the thread count.

Output Generation

Final blacklist generation uses the toArray() method, which collects the kmers for which IntList.size() >= minTaxCount. The results are converted to Sketch objects via toSketch(), using the hashArrayToSketchArray() transformation. The minTaxCount and taxLevel parameters are embedded in the sketch as metadata so that downstream tools can identify how the blacklist was built.
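
The selection step can be sketched in Python as follows. The function name and the sorted output are assumptions made for illustration, and the input is a plain dictionary mapping each kmer hash to the set of taxa in which it was seen.

```python
def blacklist_hashes(kmer_to_taxa, min_tax_count=100):
    """Keep kmers seen in at least min_tax_count distinct taxa."""
    return sorted(
        kmer for kmer, taxa in kmer_to_taxa.items()
        if len(taxa) >= min_tax_count
    )


counts = {
    11: {1, 2, 3},  # seen in three taxa
    22: {1},        # seen in one taxon
    33: {4, 5},     # seen in two taxa
}
selected = blacklist_hashes(counts, min_tax_count=2)  # [11, 33]
```

Raising min_tax_count to 3 in this example would keep only kmer 11, which shows how higher thresholds shrink the blacklist.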

Support

For questions and support:

Email: bbushnell@lbl.gov
Documentation: bbmap.org
Guide: BBSketchGuide.txt