SketchBlacklist2
Creates a blacklist sketch from common kmers, meaning those that occur in at least a minimum number of different sketches or taxa (set by mintaxcount). BlacklistMaker2 makes blacklists from sketches rather than sequences. It is advisable to make the input sketches larger than normal (e.g. sizemult=2), because new kmers will be introduced in the final sketches to replace the blacklisted kmers.
Basic Usage
sketchblacklist2.sh ref=<sketch files> out=<sketch file>
sketchblacklist2.sh *.sketch out=<sketch file>
sketchblacklist2.sh ref=taxa#.sketch out=<sketch file>
Parameters
Parameters are organized by their function in the blacklist creation process. All parameters from the shell script are documented here.
Standard parameters
- ref=<file>
- Sketch files to process. These are the input sketches from which common kmers will be identified for blacklisting. Multiple files can be given with a shell wildcard and no ref= prefix (e.g. *.sketch) or with a # pattern (e.g. ref=taxa#.sketch), as shown under Basic Usage.
- out=<file>
- Output filename for the resulting blacklist sketch. This sketch will contain the kmers that should be excluded from future analyses due to their high frequency across taxa.
- mintaxcount=20
- Retain keys occurring in at least this many taxa. Kmers that appear in fewer than this number of different taxonomic units will be discarded. Higher values create more restrictive blacklists with only the most common kmers.
- length=300000
- Retain at most this many keys (prioritizing high count). The maximum number of kmers to include in the final blacklist sketch. The most frequent kmers are prioritized when the limit is reached.
- k=32,24
- Kmer lengths, 1-32. Comma-separated list of kmer sizes to use for blacklist creation. Multiple kmer lengths allow for different levels of specificity in identifying common sequences.
- mode=taxa
- Counting mode for kmer frequency calculation. Options:
- sequence: Count kmers once per sketch (treats each sketch as a single unit)
- taxa: Count kmers once per taxonomic unit (merges sketches from same taxa, requires taxonomy files)
- name=
- Set the blacklist sketch name. This name will be embedded in the output sketch metadata for identification purposes. If not specified, defaults to "blacklist".
- delta=t
- Delta-compress sketches. Enables delta compression to reduce file size by storing differences between consecutive kmers rather than absolute values. Default: true.
- a48=t
- Encode sketches as ASCII-48 rather than hex. Uses ASCII-48 encoding instead of hexadecimal for sketch storage, which can improve compatibility with some text-processing tools. Default: true.
- amino=f
- Amino-acid mode. When enabled, processes protein sequences instead of nucleotide sequences, using amino acid kmers for blacklist creation. Default: false.
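The idea behind the delta=t option above can be illustrated with a minimal Python sketch. It assumes the sketch keys are available as a sorted list of integers. The function names are hypothetical, and this is not the tool's actual file format, only the general technique of storing differences between consecutive keys.

```python
def delta_encode(sorted_keys):
    """Store each key as its difference from the previous key."""
    deltas = []
    previous = 0
    for key in sorted_keys:
        deltas.append(key - previous)
        previous = key
    return deltas


def delta_decode(deltas):
    """Rebuild the absolute keys by summing the differences."""
    keys = []
    running = 0
    for delta in deltas:
        running += delta
        keys.append(running)
    return keys


# Hypothetical key values, already sorted.
keys = [1000003, 1000010, 1000042, 1000100]
encoded = delta_encode(keys)  # first value is absolute, the rest are gaps
```

For sorted keys, each gap is never larger than the key it replaces and is usually much smaller, so the encoded values take fewer characters to write.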
Taxonomy-specific parameters
- tree=
- Specify a taxtree file. Required for taxa mode operation. On Genepool systems, use 'auto' to automatically locate the default taxonomy tree. The tree file contains the hierarchical taxonomic relationships needed for taxonomic grouping.
- gi=
- Specify a gitable file. Maps GI numbers to taxonomic IDs. On Genepool systems, use 'auto' to automatically locate the default GI table. Used in conjunction with taxonomy tree for sequence classification.
- accession=
- Specify one or more comma-delimited NCBI accession to taxid files. These files map sequence accession numbers to taxonomic IDs. On Genepool systems, use 'auto' to automatically locate default accession files. Multiple files can be specified separated by commas.
- taxlevel=subspecies
- Taxa hits below this rank will be promoted and merged with others. Sequences assigned to taxonomic levels below the specified level will be promoted to this level for grouping purposes. This prevents over-fragmentation at very specific taxonomic levels. Default: subspecies.
- tossjunk=t
- For taxa mode, discard taxonomically uninformative sequences. When enabled, removes sequences that lack proper taxonomic classification, including:
- Sequences with no taxid assigned
- Sequences with tax level NO_RANK
- Sequences with parent taxid of LIFE (too broad to be informative)
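The three tossjunk criteria listed above can be expressed as a single predicate. The following is a minimal sketch, not the tool's code: the TaxNode class, the LIFE_TAXID constant, and all taxid values are hypothetical stand-ins chosen for illustration.

```python
from dataclasses import dataclass
from typing import Optional

LIFE_TAXID = 1  # assumption: placeholder ID for the root "LIFE" node


@dataclass
class TaxNode:
    taxid: Optional[int]         # None when no taxid was assigned
    rank: str                    # e.g. "species", "genus", "NO_RANK"
    parent_taxid: Optional[int]


def is_junk(node):
    """Return True if the node matches any of the three discard criteria."""
    if node is None or node.taxid is None:
        return True   # no taxid assigned
    if node.rank == "NO_RANK":
        return True   # rank carries no taxonomic information
    if node.parent_taxid == LIFE_TAXID:
        return True   # sits directly under LIFE, too broad
    return False


# Hypothetical nodes: only the first one is informative.
nodes = [
    TaxNode(10, "species", 100),
    TaxNode(None, "species", 100),
    TaxNode(20, "NO_RANK", 100),
    TaxNode(30, "genus", LIFE_TAXID),
]
kept = [n for n in nodes if not is_junk(n)]
```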
Java Parameters
- -Xmx
- This will set Java's memory usage, overriding autodetection. Specify maximum heap size (e.g., -Xmx20g for 20 gigabytes, -Xmx200m for 200 megabytes). The maximum recommended value is typically 85% of physical memory. Default calculation uses freeRam(4000m, 84) method resulting in ~31GB allocation.
- -eoom
- This flag will cause the process to exit if an out-of-memory exception occurs. Provides cleaner termination when memory is exhausted rather than hanging or producing corrupted output. Requires Java 8u92 or later.
- -da
- Disable assertions. Removes runtime assertion checks to improve performance in production environments. Should only be used when the tool has been thoroughly tested with the specific dataset and parameters.
Examples
Basic Blacklist Creation
# Create blacklist from all sketches in directory
sketchblacklist2.sh *.sketch out=common_kmers.sketch
# Create blacklist with specific parameters
sketchblacklist2.sh ref=bacterial_sketches.sketch out=bacterial_blacklist.sketch \
mintaxcount=50 length=100000
The second command creates a blacklist from the sketches in bacterial_sketches.sketch, retaining kmers that appear in at least 50 taxa and keeping at most the 100,000 most common kmers.
Taxonomy-Aware Blacklist
# Create taxonomically-informed blacklist
sketchblacklist2.sh ref=ncbi_sketches.sketch out=taxonomy_blacklist.sketch \
mode=taxa tree=auto accession=auto \
mintaxcount=25 taxlevel=species tossjunk=t
Uses taxonomic information to group sketches by species before identifying common kmers, requiring presence in at least 25 species.
High-Memory Processing
# Process large sketch collections with increased memory
sketchblacklist2.sh large_collection/*.sketch out=large_blacklist.sketch \
length=500000 mintaxcount=100 -Xmx64g -eoom
Processes a large collection of sketches with 64GB memory allocation and automatic termination on out-of-memory conditions.
Custom Kmer Lengths
# Use multiple kmer lengths for comprehensive blacklisting
sketchblacklist2.sh ref=input_sketches.sketch out=multi_k_blacklist.sketch \
k=16,24,32 mintaxcount=30 name="Multi-kmer_Blacklist"
Creates a blacklist using multiple kmer lengths (16, 24, 32) with a custom name for the output sketch.
Algorithm Details
BlacklistMaker2 identifies commonly occurring kmers across large sketch collections using multiple threads and an array of HashMaps partitioned 63 ways:
Core Algorithm
- Sketch Processing: Reads existing sketches rather than raw sequences, significantly reducing I/O and processing time compared to sequence-based approaches
- Multi-threaded Architecture: Distributes kmer counting across available CPU cores, with each thread processing different portions of the sketch index
- Hash-Based Storage: Uses an array of HashMaps partitioned 63 ways, with IntListCompressor holding the per-kmer ID lists from which counts are derived
- Taxonomic Integration: When in taxa mode, promotes taxonomic IDs to specified levels and merges counts from sequences belonging to the same taxonomic unit
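The taxonomic promotion and merging step described above can be sketched in Python as follows. This is an illustration under stated assumptions, not the tool's implementation: the rank ordering is simplified, the tree is a hypothetical dictionary of taxid to (rank, parent taxid), and the function names are invented for the example.

```python
# Simplified rank ordering, lowest first (assumption for illustration).
RANK_LEVEL = {"strain": 0, "subspecies": 1, "species": 2, "genus": 3}


def promote(taxid, tree, taxlevel):
    """Walk up the tree until the node's rank is at or above taxlevel."""
    target = RANK_LEVEL[taxlevel]
    rank, parent = tree[taxid]
    while RANK_LEVEL[rank] < target and parent is not None:
        taxid = parent
        rank, parent = tree[taxid]
    return taxid


def merge_by_taxon(sketches, tree, taxlevel):
    """Union the kmers of all sketches that promote to the same taxon."""
    merged = {}
    for taxid, kmers in sketches:
        merged.setdefault(promote(taxid, tree, taxlevel), set()).update(kmers)
    return merged


# Hypothetical tree: taxid -> (rank, parent taxid)
tree = {
    100: ("genus", None),
    10: ("species", 100),
    11: ("species", 100),
    1: ("strain", 10),
    2: ("strain", 10),
    3: ("subspecies", 11),
}
# Each sketch is (taxid, set of kmer keys).
sketches = [(1, {5, 6}), (2, {5, 7}), (3, {5})]
merged = merge_by_taxon(sketches, tree, "species")
```

With taxlevel set to species, the two strain sketches collapse into their shared species, so a kmer present in both is counted once for that taxon rather than twice.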
Processing Strategy
- Index-Based Processing: Iterates through the sketch index table arrays, processing all kmers from loaded sketches
- Count Aggregation: For each kmer, accumulates occurrences across sketches or taxonomic units based on the selected mode
- Filtering and Prioritization: Retains only kmers meeting the minimum count threshold, then selects top kmers by frequency up to the specified length limit
- Quality Control: In taxa mode with tossjunk=true, automatically excludes taxonomically uninformative sequences
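The count aggregation, mintaxcount filtering, and length cap described above can be summarized in a short Python sketch. It assumes each input unit is a set of kmer keys (one per sketch in sequence mode, one per merged taxon in taxa mode). The function name and the tie-breaking rule are assumptions made for the example, not documented behavior.

```python
from collections import Counter


def make_blacklist(units, mintaxcount, length):
    """Return the sorted keys of the most widely shared kmers.

    units: iterable of kmer collections, one per sketch or taxon.
    """
    counts = Counter()
    for kmers in units:
        counts.update(set(kmers))  # each unit counts a kmer at most once

    # Keep only kmers seen in at least mintaxcount units.
    kept = [(key, n) for key, n in counts.items() if n >= mintaxcount]

    # Highest count first; ties broken by key (an assumption).
    kept.sort(key=lambda item: (-item[1], item[0]))

    # Cap at `length` keys, then sort the survivors for output.
    return sorted(key for key, _ in kept[:length])


# Hypothetical input: four units with small integer keys.
units = [{1, 2, 3}, {2, 3, 4}, {3, 4, 5}, {3}]
top_two = make_blacklist(units, mintaxcount=2, length=2)
uncapped = make_blacklist(units, mintaxcount=2, length=10)
```

Here key 3 appears in four units and keys 2 and 4 in two each, so a cap of two keeps 3 and one of the tied keys, while a larger cap keeps all three.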
Output Generation
- Sorted Output: Final kmer list is sorted for consistent output and optimal sketch compression
- Metadata Inclusion: Output sketch includes processing parameters (minTaxCount, taxLevel) for reproducibility
- Compression Options: Supports delta compression and ASCII-48 encoding for size optimization
- Histogram Output: Optional histogram showing distribution of kmer occurrence counts
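The histogram mentioned above summarizes how many kmers occur at each count. A minimal Python sketch of that tally follows. The input mapping and function name are hypothetical, and the tool's actual histogram format is not shown in this document.

```python
from collections import Counter


def count_histogram(kmer_counts):
    """Map each occurrence count to the number of kmers having it."""
    histogram = Counter(kmer_counts.values())
    return sorted(histogram.items())


# Hypothetical per-kmer counts: key -> number of sketches or taxa.
kmer_counts = {1: 1, 2: 2, 3: 4, 4: 2, 5: 1}
histogram = count_histogram(kmer_counts)
```

A histogram of this kind can help when choosing mintaxcount, since it shows how many kmers would survive a given threshold.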
Performance Characteristics
- Memory Usage: Default heap allocation of about 31GB, calculated by the freeRam(4000m, 84) method and configurable via -Xmx
- Thread Management: When 32 or more threads are detected, caps the thread count at half of the available cores to limit memory use under hyperthreading
- I/O Configuration: Uses ByteFile.FORCE_MODE_BF2 for parallel I/O when thread count >2
- Processing Architecture: ProcessThread distributes work across sketch index table arrays (i+=threads stride pattern)
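The thread cap and the i+=threads stride pattern described above can be sketched in Python. Both functions are hypothetical illustrations of the stated rules. The exact rounding and the way the real tool assigns work to ProcessThread instances are assumptions.

```python
def capped_threads(detected):
    """Halve the thread count when 32 or more threads are detected."""
    return detected // 2 if detected >= 32 else detected


def strided_indices(thread_id, threads, table_count):
    """Indices handled by one thread: thread_id, thread_id+threads, ..."""
    return list(range(thread_id, table_count, threads))


threads = capped_threads(64)
# Work split for a hypothetical index with 10 table arrays and 4 threads.
assignments = [strided_indices(t, 4, 10) for t in range(4)]
```

With a stride equal to the thread count, every table array is visited by exactly one thread and no coordination is needed to divide the work.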
Implementation Details
- Hash Distribution: Uses 63-way hash distribution (key%ways) for HashMap array partitioning
- Taxonomic ID Storage: IntListCompressor stores taxonomic ID lists per kmer with synchronized access
- Thread Safety: Uses AtomicInteger nextUnknown for unknown taxonomic ID assignment (starting from minFakeID)
- Key Processing: Converts keys using Long.MAX_VALUE-key0 transformation and applies minTaxCount filtering
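The partitioning and key handling described above can be approximated in Python. This is a sketch under several assumptions: keys are non-negative integers, plain sets stand in for IntListCompressor, a lock per partition stands in for synchronized access, and the class and method names are invented. The direction of the Long.MAX_VALUE transformation is taken from the description above and not verified against the source code.

```python
import threading

WAYS = 63                # number of hash partitions
LONG_MAX = 2**63 - 1     # value of Java's Long.MAX_VALUE


class PartitionedTaxLists:
    """Per-kmer taxid sets spread across WAYS independently locked maps."""

    def __init__(self, ways=WAYS):
        self.ways = ways
        self.maps = [dict() for _ in range(ways)]
        self.locks = [threading.Lock() for _ in range(ways)]

    def add(self, key, taxid):
        way = key % self.ways          # choose the partition by key % ways
        with self.locks[way]:          # only this partition is locked
            self.maps[way].setdefault(key, set()).add(taxid)

    def keys_with_min_taxa(self, mintaxcount):
        """Transformed keys seen in at least mintaxcount distinct taxa."""
        out = []
        for table in self.maps:
            for key, taxa in table.items():
                if len(taxa) >= mintaxcount:
                    out.append(LONG_MAX - key)
        return sorted(out)


store = PartitionedTaxLists()
store.add(5, 1)
store.add(5, 2)
store.add(5, 2)    # duplicate taxid, not counted twice
store.add(68, 1)   # 68 % 63 == 5, so it shares a partition with key 5
result = store.keys_with_min_taxa(2)
```

Splitting the table into many partitions means two threads contend for a lock only when their keys fall in the same partition, which is the usual motivation for this kind of design.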
Notes and Recommendations
- Input Sketch Size: Use larger than normal input sketches (e.g., sizemult=2) since blacklisted kmers will be replaced with new ones in final sketches
- Memory Requirements: Default 31GB allocation may need adjustment for large collections; monitor usage and adjust -Xmx accordingly
- Taxonomic Files: Ensure taxonomy files (tree, gi, accession) are current and consistent when using taxa mode
- Parameter Tuning: Adjust mintaxcount and length parameters based on your specific use case and dataset characteristics
- Quality Control: Keep tossjunk=t (the default) when working with mixed-quality databases to improve blacklist accuracy
Related Tools
- sketch.sh: Create input sketches from raw sequences
- comparesketch.sh: Compare sketches using blacklists
- sendsketch.sh: Submit sketches to remote databases
- sketchblacklist.sh: Original blacklist creator (sequence-based)
Support
For questions and support:
- Email: bbushnell@lbl.gov
- Documentation: bbmap.org
- Guide: BBSketchGuide.txt