BBMask

Overview

BBMask performs three types of masking. Each is optional, and they can be used individually or in combination:

Low-entropy (complexity) - Identifies repetitive or low-complexity sequences using Shannon entropy calculation
Tandem-repeated kmers - Masks exact k-mer repeats above specified frequency thresholds
SAM-file coverage - Masks regions based on read mapping coverage from alignment files

BBMask loads all sequences into memory to allow multiple masking operations and requires approximately 1 byte per base for FASTA input.

Basic Usage

bbmask.sh in=<file> out=<file> sam=<file,file,...file>

Input may be stdin or a FASTA or FASTQ file, raw or gzipped. SAM files are optional; sam= accepts a comma-delimited list, and the mapped regions in those files will be masked. SAM files may also be given as bare arguments without sam=, so a glob such as *.sam works. If you pipe via stdin/stdout, include the file type in the name; e.g. for gzipped FASTA input, set in=stdin.fa.gz

Parameters

BBMask organizes parameters into logical groups based on their function in the masking process. The tool supports three main masking strategies: repeat detection, low-complexity/entropy masking, and coverage-based masking from SAM files.

Input parameters

in=<file>: Input sequences to mask. 'in=stdin.fa' will pipe from standard in.
sam=<file,file>: Comma-delimited list of sam files. Optional. Their mapped coordinates will be masked.
touppercase=f: (tuc) Change all letters to upper-case.
interleaved=auto: (int) If true, forces fastq input to be paired and interleaved.
qin=auto: ASCII offset for input quality. May be 33 (Sanger), 64 (Illumina), or auto.

Output parameters

out=<file>: Write masked sequences here. 'out=stdout.fa' will pipe to standard out.
overwrite=t: (ow) Set to false to force the program to abort rather than overwrite an existing file.
ziplevel=2: (zl) Set to 1 (lowest) through 9 (max) to change compression level; lower compression is faster.
fastawrap=70: Length of lines in fasta output.
qout=auto: ASCII offset for output quality. May be 33 (Sanger), 64 (Illumina), or auto (same as input).

Processing parameters

threads=auto: (t) Set number of threads to use; default is number of logical processors.
maskrepeats=f: (mr) Mask areas covered by exact repeat kmers.
kr=5: Kmer size to use for repeat detection (1-15). Use minkr and maxkr to sweep a range of kmers.
minlen=40: Minimum length of repeat area to mask.
mincount=4: Minimum number of repeats to mask.
masklowentropy=t: (mle) Mask areas with low complexity by calculating entropy over a window for a fixed kmer size.
ke=5: Kmer size to use for entropy calculation (1-15). Use minke and maxke to sweep a range. Large ke uses more memory.
window=80: (w) Window size for entropy calculation.
entropy=0.70: (e) Mask windows with entropy under this value (0-1). 0.0001 will mask only homopolymers and 1 will mask everything.
lowercase=f: (lc) Convert masked bases to lower case. Default is to convert them to N.
split=f: Split into unmasked pieces and discard masked pieces.

Coverage parameters (only relevant if sam files are specified)

mincov=-1: If nonnegative, mask bases with coverage below this value.
maxcov=-1: If nonnegative, mask bases with coverage above this value.
delcov=t: Include deletions when calculating coverage.

Note: If neither mincov nor maxcov is set, all covered bases will be masked.

Other parameters

pigz=t: Use pigz to compress. If argument is a number, that will set the number of pigz threads.
unpigz=t: Use pigz to decompress.

Java Parameters

-Xmx: This will set Java's memory usage, overriding autodetection. -Xmx20g will specify 20 gigs of RAM, and -Xmx200m will specify 200 megs. The max is typically 85% of physical memory.
-eoom: This flag will cause the process to exit if an out-of-memory exception occurs. Requires Java 8u92+.
-da: Disable assertions.

Entropy Threshold Guidelines

Entropy is calculated as the Shannon entropy of the kmers in a window and ranges from 0 to 1. As a threshold, 0 masks nothing and 1 masks everything. It can be challenging to determine the optimal threshold, so here are some reference points:

Entropy Threshold	E. coli Genome	Human Genome
0.7 (default)	107 bases masked	~0.7% masked
0.9	~8 kbp masked	~7% masked

Examples

Basic Entropy Masking

bbmask.sh in=ref.fa out=masked.fa entropy=0.7

Masks low-entropy regions. Because entropy=0.7 is already the default, this is equivalent to running with default parameters. This is the most common usage for general sequence masking.

Contamination Removal Workflow

# Step 1: Shred contaminant genome into overlapping fragments
shred.sh in=contaminant.fa out=shredded.fa length=80 minlength=70 overlap=40

# Step 2: Map fragments to target genome
bbmap.sh ref=target.fa in=shredded.fa outm=mapped.sam minid=0.85 maxindel=2

# Step 3: Mask similar regions in target genome
bbmask.sh in=target.fa out=masked.fa entropy=0.7 sam=mapped.sam

Complete workflow for masking sequences in genome A that are similar to those in genome B, plus low-entropy sequences. This approach is used at JGI for vertebrate contamination removal with zero false-positives.
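
Step 1 of the workflow can be pictured with a small sketch. The Python below is an illustration only, not the shred.sh implementation: the function name shred is hypothetical, and the handling of the final short fragment is an assumption. It cuts a sequence into pieces of the given length, starting a new piece every (length - overlap) bases, and drops pieces shorter than minlength.

```python
def shred(seq, length=80, overlap=40, minlength=70):
    """Cut seq into overlapping fragments (hypothetical helper, not shred.sh itself).

    Returns a list of (start, fragment) tuples. Consecutive fragments start
    (length - overlap) bases apart; fragments shorter than minlength are dropped.
    """
    if overlap >= length:
        raise ValueError("overlap must be smaller than length")
    step = length - overlap
    fragments = []
    for start in range(0, len(seq), step):
        fragment = seq[start:start + length]
        if len(fragment) >= minlength:
            fragments.append((start, fragment))
        # Assumption: stop once a fragment has reached the end of the sequence.
        if start + length >= len(seq):
            break
    return fragments


if __name__ == "__main__":
    contaminant = "ACGT" * 50  # 200 bases
    for start, fragment in shred(contaminant):
        print(start, len(fragment))
```

With length=80 and overlap=40, every base away from the sequence ends is covered by two fragments, so a contaminant region that straddles one fragment boundary still falls wholly inside the neighbouring fragment.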

Repeat Masking

bbmask.sh in=assembly.fasta out=masked.fasta mr=t kr=7 minlen=50 mincount=3

Enables repeat masking with 7-mers, requiring at least 3 repeats and minimum length of 50bp to mask.

Coverage-Based Masking

bbmask.sh in=assembly.fasta out=masked.fasta sam=reads.sam mincov=5 maxcov=100

Masks regions with coverage outside the range 5-100x using alignment data from reads.sam.

Combined Masking Strategies

bbmask.sh in=assembly.fasta out=masked.fasta mr=t mle=t sam=reads1.sam,reads2.sam

Applies all three masking strategies: repeat masking, low-entropy masking, and coverage masking from multiple SAM files. Because neither mincov nor maxcov is set, every base covered by an alignment is masked.

Soft Masking

bbmask.sh in=assembly.fasta out=soft_masked.fasta lc=t entropy=0.6

Converts masked bases to lowercase instead of 'N', useful for downstream analysis that distinguishes hard vs soft masking.

Algorithm Details

Three-Strategy Masking Approach

BBMask implements three independent masking strategies that can be used individually or in combination:

1. Low-Entropy/Complexity Masking (masklowentropy=t)

Identifies low-complexity regions using Shannon entropy calculation over sliding windows via maskLowEntropy() and EntropyTracker methods:

Entropy Mode (default): Calculates Shannon entropy via EntropyTracker.passes() over sliding windows (default 80bp) using k-mers (default k=5)
Complexity Mode: Counts unique k-mers using short[] arrays (kmerspace = 1<<(2*k)) and masks regions below complexity threshold
Uses EntropyTracker objects initialized with EntropyTracker(k, windowT, false, cutoff, true) for multi-threaded calculation
Entropy cutoff of 0.70 processed through EntropyTracker.windowBases() method calls
MaskLowEntropyThread workers poll from ArrayBlockingQueue<Read> for parallel processing
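
The entropy mode described above can be sketched in a few lines of Python. This is a simplified illustration, not the EntropyTracker code: the function names are hypothetical, each window is recounted from scratch rather than updated incrementally, and the normalization (dividing by the largest entropy the window could have) is an assumption chosen so that values fall between 0 and 1.

```python
import math
from collections import Counter


def window_entropy(window, k=5):
    """Normalized Shannon entropy (0..1) of the k-mers in one window."""
    kmers = [window[i:i + k] for i in range(len(window) - k + 1)]
    total = len(kmers)
    if total < 2:
        return 0.0
    counts = Counter(kmers)
    h = -sum((c / total) * math.log2(c / total) for c in counts.values())
    # Assumed normalization: the maximum is reached when every k-mer is distinct,
    # limited by both the k-mer space (4^k) and the number of k-mers in the window.
    h_max = math.log2(min(4 ** k, total))
    return h / h_max


def mask_low_entropy(seq, k=5, window=80, cutoff=0.70):
    """Replace with 'N' every base that lies in a window whose entropy is below cutoff."""
    masked = [False] * len(seq)
    last_start = max(1, len(seq) - window + 1)
    for start in range(last_start):
        if window_entropy(seq[start:start + window], k) < cutoff:
            for i in range(start, min(len(seq), start + window)):
                masked[i] = True
    return "".join("N" if m else base for base, m in zip(seq, masked))


if __name__ == "__main__":
    print(mask_low_entropy("A" * 100))                 # homopolymer: fully masked
    print(window_entropy("ACGT" * 20, k=1))            # all four bases equally common
```

A homopolymer window contains a single distinct k-mer, so its entropy is 0 and it is masked at any positive cutoff. This matches the parameter note that a very small threshold masks only homopolymers.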

2. Tandem-Repeated K-mer Masking (maskrepeats=t)

Uses exact k-mer matching via getInitialKey() and repeatLength() methods to identify repetitive sequences:

Scans sequences with sliding k-mer windows (default k=5) using maskRepeats() with bit-packed representation
Identifies k-mers that occur multiple times using kmer comparison against initial key
Marks regions containing repeated k-mers for masking using BitSet.set() operations
Only masks regions meeting minimum length (minlen=40) and count (mincount=4) thresholds
Uses bit masks (mask = ~((-1)<<(2*k))) for k≤15 via Dedupe.baseToNumber array indexing
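
The repeat strategy can be illustrated with the following Python sketch. It is an approximation built from the description above, not the BBMask source: the function names are hypothetical, and the rule that a run counts as "mincount repeats" when it holds at least mincount back-to-back copies of the k-mer is an assumption. It does reuse the bit mask ~((-1)<<(2*k)) and the ((kmer<<2)|n)&mask update quoted in this document.

```python
BASE_TO_NUMBER = {"A": 0, "C": 1, "G": 2, "T": 3}


def pack_kmer(seq, start, k):
    """2-bit pack seq[start:start+k]; None if it runs off the end or hits a non-ACGT base."""
    if start + k > len(seq):
        return None
    mask = ~((-1) << (2 * k))
    kmer = 0
    for base in seq[start:start + k].upper():
        n = BASE_TO_NUMBER.get(base)
        if n is None:
            return None
        kmer = ((kmer << 2) | n) & mask
    return kmer


def repeat_length(seq, start, k):
    """Bases covered by back-to-back copies of the k-mer that begins at start."""
    key = pack_kmer(seq, start, k)
    if key is None:
        return 0
    pos = start + k
    while pack_kmer(seq, pos, k) == key:
        pos += k
    return pos - start


def mask_tandem_repeats(seq, k=5, minlen=40, mincount=4):
    """Hard-mask tandem runs that meet both the length and the copy-count thresholds."""
    masked = [False] * len(seq)
    for start in range(len(seq) - k + 1):
        length = repeat_length(seq, start, k)
        if length >= minlen and length // k >= mincount:
            for i in range(start, start + length):
                masked[i] = True
    return "".join("N" if m else base for base, m in zip(seq, masked))


if __name__ == "__main__":
    demo = "ACGTACGGTCA" + "ACGTT" * 10 + "GGATCCATGA"
    print(mask_tandem_repeats(demo))
```

Packing each base into 2 bits lets two k-mers be compared as single integers, which is why k is limited to 15: a 15-mer needs 30 bits and still fits in a 32-bit Java int.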

3. SAM-File Coverage Masking

Masks regions based on read mapping coverage from SAM files via maskSam_MT() and MaskSamThread processing:

Parses SAM alignment coordinates using SamLine.start() and SamLine.stop() to build coverage arrays
Supports CoverageArray2 (16-bit) and CoverageArray3 (32-bit) based on bits32 flag and coverage thresholds
Masks high-coverage regions via setHighCoverage() and low-coverage via setLowCoverage() BitSet operations
Includes optional deletion coverage via fillRanges() with includeDeletionCoverage parameter
Uses ConcurrentHashMap<String, CoverageArray> for thread-safe coverage tracking across scaffolds
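
As a rough illustration of coverage-based masking, the Python sketch below builds per-base coverage from alignment intervals and then applies the mincov/maxcov rules from the Coverage parameters section. It is not the CoverageArray implementation: it uses a simple difference array instead, the function names are hypothetical, the (start, stop) tuples are assumed to be 0-based inclusive reference coordinates already extracted from SAM records, and deletion handling (delcov) is left out.

```python
def coverage_from_alignments(length, alignments):
    """Per-base coverage for one scaffold from (start, stop) inclusive intervals."""
    diff = [0] * (length + 1)
    for start, stop in alignments:
        start = max(start, 0)
        stop = min(stop, length - 1)
        if start > stop:
            continue
        diff[start] += 1       # coverage rises at the first aligned base
        diff[stop + 1] -= 1    # and falls just past the last aligned base
    coverage = []
    running = 0
    for delta in diff[:length]:
        running += delta
        coverage.append(running)
    return coverage


def coverage_mask(coverage, mincov=-1, maxcov=-1):
    """True where a base should be masked.

    With neither bound set, every covered base is masked. Otherwise a base is
    masked when its coverage is below mincov or above maxcov.
    """
    if mincov < 0 and maxcov < 0:
        return [c > 0 for c in coverage]
    return [(mincov >= 0 and c < mincov) or (maxcov >= 0 and c > maxcov)
            for c in coverage]


if __name__ == "__main__":
    cov = coverage_from_alignments(10, [(0, 4), (2, 6), (2, 3)])
    print(cov)
    print(coverage_mask(cov))
    print(coverage_mask(cov, mincov=1, maxcov=2))
```

The first call to coverage_mask corresponds to the contamination workflow, where any region hit by a shredded contaminant fragment is masked. The second corresponds to filtering an assembly by a coverage range.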

Performance Optimizations

BitSet Data Structure: Uses Java BitSet with .set(), .get(), and .cardinality() operations for memory-efficient masking coordinate storage
Multi-threading: MaskRepeatThread and MaskLowEntropyThread classes for parallel processing with Shared.threads() allocation
Memory Management: Dynamic CoverageArray selection via bits32=(mincov>=Character.MAX_VALUE || maxcov>=Character.MAX_VALUE)
Streaming I/O: ConcurrentReadInputStream.getReadInputStream() and ConcurrentReadOutputStream.getStream() for efficient file processing
K-mer Optimization: Bit-packed representation using ((kmer<<2)|n)&mask operations for k≤15 reduces memory footprint

Masking Output Options

Hard Masking (default): Replaces masked bases with 'N' via maskRead() method setting bases[i]='N'
Soft Masking (lowercase=t): Converts masked bases to lowercase via Tools.toLowerCase() for downstream analysis
Split Mode (split=t): Removes masked regions via splitFromBitsets() using KillSwitch.copyOfRange() to output unmasked fragments
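
The three output options differ only in how a per-base mask is applied to the sequence. The Python sketch below shows that step in isolation. The function name apply_mask and the list-of-booleans mask are illustrative assumptions standing in for the BitSet and the Java methods named above.

```python
def apply_mask(seq, masked, mode="hard"):
    """Apply a per-base boolean mask to seq.

    mode="hard"  -> masked bases become 'N' (the default behaviour)
    mode="soft"  -> masked bases become lower case (lowercase=t)
    mode="split" -> return the unmasked pieces as a list (split=t)
    """
    if len(seq) != len(masked):
        raise ValueError("sequence and mask must have the same length")
    if mode == "hard":
        return "".join("N" if m else base for base, m in zip(seq, masked))
    if mode == "soft":
        return "".join(base.lower() if m else base for base, m in zip(seq, masked))
    if mode == "split":
        pieces, current = [], []
        for base, m in zip(seq, masked):
            if m:
                if current:
                    pieces.append("".join(current))
                    current = []
            else:
                current.append(base)
        if current:
            pieces.append("".join(current))
        return pieces
    raise ValueError("unknown mode: " + mode)


if __name__ == "__main__":
    mask = [False, False, True, True, False, False, False, True]
    for mode in ("hard", "soft", "split"):
        print(mode, apply_mask("ACGTACGT", mask, mode))
```

Soft masking keeps the original bases recoverable, while hard masking and splitting both discard them.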

Memory Requirements

BBMask requires approximately 1 byte per base for FASTA input. Memory usage scales with:

Input sequence length (BitSet allocation via new BitSet(len))
K-mer size for entropy/repeat detection (kmerspace = (1<<(2*k)) array allocation)
Number of SAM files and coverage depth (ConcurrentHashMap<String, CoverageArray> sizing)
Heap size: the launch script defaults to z="-Xmx1g" and autodetects available memory via its freeRam 3200m 84 calculation, which is typically sufficient for most genomes; use -Xmx to override it
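
To get a feel for how these factors combine, here is a back-of-the-envelope estimate in Python. It is a rough sketch under stated assumptions, not a measurement of BBMask: it assumes 1 byte per base for the sequence, 1 bit per base for the mask, and one array of 2-byte counters of size 4^k per thread. It ignores JVM overhead, coverage arrays, and I/O buffers. The function name is hypothetical.

```python
def estimate_memory_bytes(total_bases, k=5, threads=1):
    """Rough lower-bound memory estimate under the assumptions stated above."""
    sequence = total_bases                    # ~1 byte per base
    bitset = (total_bases + 7) // 8           # 1 bit per base, rounded up to bytes
    kmer_counts = threads * 2 * (1 << (2 * k))  # assumed: one short[4^k] per thread
    return {
        "sequence": sequence,
        "bitset": bitset,
        "kmer_counts": kmer_counts,
        "total": sequence + bitset + kmer_counts,
    }


if __name__ == "__main__":
    print(estimate_memory_bytes(4_600_000))          # bacterial-sized genome, k=5
    print(estimate_memory_bytes(0, k=15)["total"])   # k-mer array alone at k=15
```

Under these assumptions the k-mer array is negligible at k=5 (2 KB) but reaches 2 GiB per array at k=15, which is consistent with the note that a large ke uses more memory.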

Real-World Applications

JGI Contamination Removal Pipeline

Masked references of various vertebrate organisms were prepared using BBMask for removing contaminant reads from libraries with zero false-positives. The process involved:

Using BBMask with default settings on target genomes
Shredding all Mycocosm and Phytozome genomes into 80bp overlapping pieces
Mapping these fragments to vertebrate genomes and using resulting SAM files for masking
Manual verification to distinguish contamination from legitimate sequence similarity
Conservative masking approach: sequences shared between human and fungi are masked if they also align to fish (evolutionary bottleneck principle)

Ribosomal RNA Masking

BBMask can effectively mask ribosomal sequences using databases like Silva, since rRNA sequences are highly conserved across species and can cause false-positive alignments in contamination detection.

Related Tools

For filtering low-entropy sequences rather than masking them, see the BBDuk documentation and BBDuk Guide.

Support

For questions and support:

Email: bbushnell@lbl.gov
Documentation: bbmap.org
Guide: BBMask Guide