BBMask
Masks sequences that are low-complexity, contain repeated kmers, or are covered by mapped reads. Designed as a replacement for tools like Dust, which are slow and do not work well for preventing false-positive matches in highly conserved or low-complexity regions of genomes. By default, masks low-entropy regions using window=80 and entropy=0.70.
Overview
BBMask performs three types of masking, all of which are optional and can be used individually or in combination:
- Low-entropy (complexity) - Identifies repetitive or low-complexity sequences using Shannon entropy calculation
- Tandem-repeated kmers - Masks exact k-mer repeats above specified frequency thresholds
- SAM-file coverage - Masks regions based on read mapping coverage from alignment files
BBMask loads all sequences into memory to allow multiple masking operations and requires approximately 1 byte per base for FASTA input.
Basic Usage
bbmask.sh in=<file> out=<file> sam=<file,file,...file>
Input may be stdin or a fasta or fastq file, raw or gzipped. SAM files are optional; if given, sam= takes a comma-delimited list of SAM files whose mapped regions will be masked. SAM files may also be passed as bare arguments without sam=, so a glob such as *.sam works. If you pipe via stdin/stdout, include the file type in the name; e.g., for gzipped fasta input, set in=stdin.fa.gz
Parameters
BBMask organizes parameters into logical groups based on their function in the masking process. The tool supports three main masking strategies: repeat detection, low-complexity/entropy masking, and coverage-based masking from SAM files.
Input parameters
- in=<file>
- Input sequences to mask. 'in=stdin.fa' will pipe from standard in.
- sam=<file,file>
- Comma-delimited list of sam files. Optional. Their mapped coordinates will be masked.
- touppercase=f
- (tuc) Change all letters to upper-case.
- interleaved=auto
- (int) If true, forces fastq input to be paired and interleaved.
- qin=auto
- ASCII offset for input quality. May be 33 (Sanger), 64 (Illumina), or auto.
Output parameters
- out=<file>
- Write masked sequences here. 'out=stdout.fa' will pipe to standard out.
- overwrite=t
- (ow) Set to false to force the program to abort rather than overwrite an existing file.
- ziplevel=2
- (zl) Set to 1 (lowest) through 9 (max) to change compression level; lower compression is faster.
- fastawrap=70
- Length of lines in fasta output.
- qout=auto
- ASCII offset for output quality. May be 33 (Sanger), 64 (Illumina), or auto (same as input).
Processing parameters
- threads=auto
- (t) Set number of threads to use; default is number of logical processors.
- maskrepeats=f
- (mr) Mask areas covered by exact repeat kmers.
- kr=5
- Kmer size to use for repeat detection (1-15). Use minkr and maxkr to sweep a range of kmers.
- minlen=40
- Minimum length of repeat area to mask.
- mincount=4
- Minimum number of repeats to mask.
- masklowentropy=t
- (mle) Mask areas with low complexity by calculating entropy over a window for a fixed kmer size.
- ke=5
- Kmer size to use for entropy calculation (1-15). Use minke and maxke to sweep a range. Large ke uses more memory.
- window=80
- (w) Window size for entropy calculation.
- entropy=0.70
- (e) Mask windows with entropy under this value (0-1). 0.0001 will mask only homopolymers and 1 will mask everything.
- lowercase=f
- (lc) Convert masked bases to lower case. Default is to convert them to N.
- split=f
- Split into unmasked pieces and discard masked pieces.
Coverage parameters (only relevant if sam files are specified)
- mincov=-1
- If nonnegative, mask bases with coverage below this value (the lower end of the allowed coverage range).
- maxcov=-1
- If nonnegative, mask bases with coverage above this value (the upper end of the allowed coverage range).
- delcov=t
- Include deletions when calculating coverage.
Note: If neither mincov nor maxcov is set, all covered bases will be masked.
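The coverage rules above can be sketched in a few lines of Python. This is only an illustration of the documented behavior, not BBMask's code: the function name `coverage_mask` is hypothetical, and treating the bounds as inclusive (a base at exactly mincov or maxcov is kept) is an assumption.

```python
def coverage_mask(coverage, mincov=-1, maxcov=-1):
    """Return one boolean per base: True where the base would be masked.

    `coverage` is a list of per-base read depths. Inclusive bounds are
    an assumption made for this sketch.
    """
    if mincov < 0 and maxcov < 0:
        # Neither bound set: mask every base covered by at least one read.
        return [depth > 0 for depth in coverage]
    mask = []
    for depth in coverage:
        too_low = mincov >= 0 and depth < mincov
        too_high = maxcov >= 0 and depth > maxcov
        mask.append(too_low or too_high)
    return mask


print(coverage_mask([0, 3, 7, 200]))                        # no bounds set
print(coverage_mask([0, 3, 7, 200], mincov=5, maxcov=100))  # keep 5..100
```

With no bounds, only the uncovered base stays unmasked; with mincov=5 and maxcov=100, only the base at depth 7 stays unmasked.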
Other parameters
- pigz=t
- Use pigz to compress. If argument is a number, that will set the number of pigz threads.
- unpigz=t
- Use pigz to decompress.
Java Parameters
- -Xmx
- This will set Java's memory usage, overriding autodetection. -Xmx20g will specify 20 gigs of RAM, and -Xmx200m will specify 200 megs. The max is typically 85% of physical memory.
- -eoom
- This flag will cause the process to exit if an out-of-memory exception occurs. Requires Java 8u92+.
- -da
- Disable assertions.
Entropy Threshold Guidelines
Entropy is calculated as the Shannon entropy of the kmers in a window and ranges from 0 to 1. A threshold of 0 masks nothing and a threshold of 1 masks everything. Because the optimal threshold can be hard to determine, here are some reference points:
Entropy Threshold | E. coli Genome | Human Genome |
---|---|---|
0.7 (default) | 107 bases masked | ~0.7% masked |
0.9 | ~8 kbp masked | ~7% masked |
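To get a feel for what an entropy value means, the sketch below computes a normalized Shannon entropy for the kmers of a single window. It is a minimal illustration and not the EntropyTracker implementation: the function name `window_entropy` is hypothetical, and the normalization (dividing by the log of the smaller of the kmer count and 4^k) is an assumption, so it should not be expected to reproduce the numbers in the table.

```python
import math
from collections import Counter


def window_entropy(window, k=5):
    """Normalized Shannon entropy (0..1) of the k-mers in one window."""
    kmers = [window[i:i + k] for i in range(len(window) - k + 1)]
    total = len(kmers)
    if total < 2:
        return 0.0
    counts = Counter(kmers)
    # Shannon entropy: sum of -p * ln(p) over the observed k-mers.
    h = sum(-(c / total) * math.log(c / total) for c in counts.values())
    # Assumed normalization: the highest entropy this window could reach.
    max_h = math.log(min(total, 4 ** k))
    return h / max_h


print(window_entropy("A" * 80))    # homopolymer
print(window_entropy("AC" * 40))   # dinucleotide repeat
```

A homopolymer window scores 0, and a simple dinucleotide repeat scores far below the default 0.70 cutoff, so both would be masked under default settings.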
Examples
Basic Entropy Masking
bbmask.sh in=ref.fa out=masked.fa entropy=0.7
Masks low-entropy regions using default parameters. This is the most common usage for general sequence masking.
Contamination Removal Workflow
# Step 1: Shred contaminant genome into overlapping fragments
shred.sh in=contaminant.fa out=shredded.fa length=80 minlength=70 overlap=40
# Step 2: Map fragments to target genome
bbmap.sh ref=target.fa in=shredded.fa outm=mapped.sam minid=0.85 maxindel=2
# Step 3: Mask similar regions in target genome
bbmask.sh in=target.fa out=masked.fa entropy=0.7 sam=mapped.sam
Complete workflow for masking sequences in genome A that are similar to those in genome B, plus low-entropy sequences. This approach is used at JGI for vertebrate contamination removal with zero false-positives.
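Step 1 of the workflow cuts the contaminant genome into overlapping pieces. The sketch below shows one way such shredding could work for the parameters used above; it is not shred.sh's implementation. The function name `shred`, the step size of length minus overlap, and the rule of dropping a trailing piece shorter than minlength are all assumptions made for illustration.

```python
def shred(seq, length=80, minlength=70, overlap=40):
    """Cut seq into overlapping pieces; returns (start, piece) tuples."""
    step = length - overlap
    if step <= 0:
        raise ValueError("overlap must be smaller than length")
    pieces = []
    for start in range(0, len(seq), step):
        piece = seq[start:start + length]
        # Assumed rule: discard a final piece shorter than minlength.
        if len(piece) >= minlength:
            pieces.append((start, piece))
        if start + length >= len(seq):
            break  # this piece already reached the end of the sequence
    return pieces


for start, piece in shred("ACGT" * 50):
    print(start, len(piece))
```

With length=80 and overlap=40, consecutive pieces start 40 bases apart, so every base away from the sequence ends is covered by two pieces.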
Repeat Masking
bbmask.sh in=assembly.fasta out=masked.fasta mr=t kr=7 minlen=50 mincount=3
Enables repeat masking with 7-mers, requiring at least 3 repeats and minimum length of 50bp to mask.
Coverage-Based Masking
bbmask.sh in=assembly.fasta out=masked.fasta sam=reads.sam mincov=5 maxcov=100
Masks regions with coverage outside the range 5-100x using alignment data from reads.sam.
Combined Masking Strategies
bbmask.sh in=assembly.fasta out=masked.fasta mr=t mle=t sam=reads1.sam,reads2.sam
Applies all three masking strategies: repeat masking, low-entropy masking, and coverage masking from multiple SAM files.
Soft Masking
bbmask.sh in=assembly.fasta out=soft_masked.fasta lc=t entropy=0.6
Converts masked bases to lowercase instead of 'N', useful for downstream analysis that distinguishes hard vs soft masking.
Algorithm Details
Three-Strategy Masking Approach
BBMask implements three independent masking strategies that can be used individually or in combination:
1. Low-Entropy/Complexity Masking (masklowentropy=t)
Identifies low-complexity regions using Shannon entropy calculation over sliding windows via maskLowEntropy() and EntropyTracker methods:
- Entropy Mode (default): Calculates Shannon entropy via EntropyTracker.passes() over sliding windows (default 80bp) using k-mers (default k=5)
- Complexity Mode: Counts unique k-mers using short[] arrays (kmerspace = 1<<(2*k)) and masks regions below complexity threshold
- Uses EntropyTracker objects initialized with EntropyTracker(k, windowT, false, cutoff, true) for multi-threaded calculation
- Entropy cutoff of 0.70 processed through EntropyTracker.windowBases() method calls
- MaskLowEntropyThread workers poll from ArrayBlockingQueue<Read> for parallel processing
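The complexity mode described above can be illustrated with a sliding window that keeps a count array of size kmerspace and tracks how many distinct k-mers the window currently holds. This is a simplified, single-threaded sketch, not BBMask's code: the name `mask_low_complexity` is hypothetical, the score (distinct k-mers divided by k-mers per window) is an assumed definition of complexity, and the input is assumed to contain only A, C, G, and T.

```python
BASE_TO_NUMBER = {"A": 0, "C": 1, "G": 2, "T": 3}


def mask_low_complexity(seq, k=5, window=80, cutoff=0.7):
    """Return one boolean per base: True if any window covering it fails."""
    seq = seq.upper()
    n = len(seq)
    masked = [False] * n
    kmers_per_window = window - k + 1
    if kmers_per_window < 1 or n < window:
        return masked
    # Encode every k-mer as an integer, 2 bits per base.
    codes = []
    for i in range(n - k + 1):
        code = 0
        for ch in seq[i:i + k]:
            code = (code << 2) | BASE_TO_NUMBER[ch]
        codes.append(code)
    counts = [0] * (1 << (2 * k))  # kmerspace entries
    unique = 0
    for j, code in enumerate(codes):
        if counts[code] == 0:
            unique += 1
        counts[code] += 1
        if j >= kmers_per_window:  # drop the k-mer leaving the window
            old = codes[j - kmers_per_window]
            counts[old] -= 1
            if counts[old] == 0:
                unique -= 1
        if j >= kmers_per_window - 1:  # window is full: score it
            start = j - kmers_per_window + 1
            if unique / kmers_per_window < cutoff:
                for p in range(start, start + window):
                    masked[p] = True
    return masked


print(all(mask_low_complexity("A" * 100)))
```

Updating the counts incrementally means each window costs constant work for scoring, rather than recounting every k-mer from scratch.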
2. Tandem-Repeated K-mer Masking (maskrepeats=t)
Uses exact k-mer matching via getInitialKey() and repeatLength() methods to identify repetitive sequences:
- Scans sequences with sliding k-mer windows (default k=5) using maskRepeats() with bit-packed representation
- Identifies k-mers that occur multiple times using kmer comparison against initial key
- Marks regions containing repeated k-mers for masking using BitSet.set() operations
- Only masks regions meeting minimum length (minlen=40) and count (mincount=4) thresholds
- Encodes bases as 2-bit numbers via the Dedupe.baseToNumber lookup array and keeps only the most recent k bases with a bit mask (mask = ~((-1)<<(2*k))), which supports k≤15
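The tandem-repeat idea can be sketched as follows: starting from an initial k-mer, measure how far the sequence keeps repeating with period k, then mask the run if it meets both thresholds. This is an illustration under assumptions, not the getInitialKey()/repeatLength() code: the function name `mask_tandem_repeats` is hypothetical, it compares bases directly instead of bit-packed k-mers, and counting repeats as run length divided by k is an assumption.

```python
def mask_tandem_repeats(seq, k=5, minlen=40, mincount=4):
    """Return one boolean per base: True inside qualifying tandem repeats."""
    n = len(seq)
    masked = [False] * n
    i = 0
    while i + k <= n:
        # Extend while each base equals the base k positions earlier,
        # i.e. while the initial k-mer keeps repeating in tandem.
        j = i + k
        while j < n and seq[j] == seq[j - k]:
            j += 1
        run = j - i
        if run >= minlen and run // k >= mincount:
            for p in range(i, j):
                masked[p] = True
            i = j  # continue after the masked run
        else:
            i += 1
    return masked


seq = "GATTACA" + "ACGTG" * 10 + "TCCAT"
mask = mask_tandem_repeats(seq)
print("".join("N" if m else b for b, m in zip(seq, mask)))
```

In this example only the 50-base run of ACGTG units is masked; the flanking bases are left untouched.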
3. SAM-File Coverage Masking
Masks regions based on read mapping coverage from SAM files via maskSam_MT() and MaskSamThread processing:
- Parses SAM alignment coordinates using SamLine.start() and SamLine.stop() to build coverage arrays
- Supports CoverageArray2 (16-bit) and CoverageArray3 (32-bit) based on bits32 flag and coverage thresholds
- Masks high-coverage regions via setHighCoverage() and low-coverage via setLowCoverage() BitSet operations
- Includes optional deletion coverage via fillRanges() with includeDeletionCoverage parameter
- Uses ConcurrentHashMap<String, CoverageArray> for thread-safe coverage tracking across scaffolds
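Building a per-base coverage array from alignment coordinates can be sketched with a difference array. This is a simplified illustration, not the CoverageArray implementation: the name `build_coverage` is hypothetical, coordinates are assumed to be 0-based and inclusive, and each alignment is treated as one contiguous interval, so CIGAR details such as deletions are ignored.

```python
def build_coverage(length, alignments):
    """Per-base depth for a scaffold of `length` bases.

    `alignments` is an iterable of (start, stop) pairs, assumed here to
    be 0-based and inclusive.
    """
    diff = [0] * (length + 1)
    for start, stop in alignments:
        start = max(start, 0)
        stop = min(stop, length - 1)
        if start > stop:
            continue  # alignment lies entirely outside the scaffold
        diff[start] += 1      # depth rises at the first aligned base
        diff[stop + 1] -= 1   # and falls just past the last one
    coverage = []
    depth = 0
    for delta in diff[:length]:
        depth += delta
        coverage.append(depth)
    return coverage


print(build_coverage(10, [(0, 4), (3, 7)]))
```

The resulting array is what the mincov and maxcov thresholds would then be applied to, base by base.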
Performance Optimizations
- BitSet Data Structure: Uses Java BitSet with .set(), .get(), and .cardinality() operations for memory-efficient masking coordinate storage
- Multi-threading: MaskRepeatThread and MaskLowEntropyThread classes for parallel processing with Shared.threads() allocation
- Memory Management: Dynamic CoverageArray selection via bits32=(mincov>=Character.MAX_VALUE || maxcov>=Character.MAX_VALUE)
- Streaming I/O: ConcurrentReadInputStream.getReadInputStream() and ConcurrentReadOutputStream.getStream() for efficient file processing
- K-mer Optimization: Bit-packed representation using ((kmer<<2)|n)&mask operations for k≤15 reduces memory footprint
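The bit-packed k-mer update mentioned above, ((kmer<<2)|n)&mask, can be tried out directly. The sketch below is illustrative only: the name `rolling_kmers` and the dictionary standing in for a base-to-number lookup are hypothetical, and resetting on any non-ACGT base is an assumption about how ambiguous bases might be handled.

```python
BASE_TO_NUMBER = {"A": 0, "C": 1, "G": 2, "T": 3}


def rolling_kmers(seq, k=5):
    """Yield-free version: return the 2-bit packed value of every k-mer."""
    mask = ~((-1) << (2 * k))  # keeps the lowest 2*k bits
    kmer = 0
    valid = 0  # consecutive A/C/G/T bases seen so far
    packed = []
    for ch in seq.upper():
        n = BASE_TO_NUMBER.get(ch, -1)
        if n < 0:
            # Assumed handling: an ambiguous base restarts the k-mer.
            kmer = 0
            valid = 0
            continue
        kmer = ((kmer << 2) | n) & mask
        valid += 1
        if valid >= k:
            packed.append(kmer)
    return packed


print(rolling_kmers("ACGT", k=2))   # AC, CG, GT as integers
```

Each new base costs one shift, one OR, and one AND, and a k-mer with k≤15 fits in at most 30 bits, which is why it can serve directly as an array index.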
Masking Output Options
- Hard Masking (default): Replaces masked bases with 'N' via maskRead() method setting bases[i]='N'
- Soft Masking (lowercase=t): Converts masked bases to lowercase via Tools.toLowerCase() for downstream analysis
- Split Mode (split=t): Removes masked regions via splitFromBitsets() using KillSwitch.copyOfRange() to output unmasked fragments
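The three output modes differ only in how a per-base mask is applied to the sequence. The sketch below shows the effect of each; it is not the maskRead() or splitFromBitsets() code, and the function name `apply_mask` and its `mode` argument are hypothetical.

```python
def apply_mask(seq, masked, mode="hard"):
    """Apply a per-base boolean mask in one of three ways."""
    if mode == "hard":
        # Replace masked bases with N.
        return "".join("N" if m else b for b, m in zip(seq, masked))
    if mode == "soft":
        # Lowercase masked bases, leave the rest as-is.
        return "".join(b.lower() if m else b for b, m in zip(seq, masked))
    if mode == "split":
        # Keep only the unmasked stretches, as separate pieces.
        pieces, current = [], []
        for b, m in zip(seq, masked):
            if m:
                if current:
                    pieces.append("".join(current))
                    current = []
            else:
                current.append(b)
        if current:
            pieces.append("".join(current))
        return pieces
    raise ValueError("mode must be 'hard', 'soft', or 'split'")


seq = "ACGTACGT"
mask = [False, False, True, True, False, False, False, True]
print(apply_mask(seq, mask, "hard"))
print(apply_mask(seq, mask, "soft"))
print(apply_mask(seq, mask, "split"))
```

Soft masking is the only mode that preserves the original bases, which matters when a downstream tool needs to read through masked regions.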
Memory Requirements
BBMask requires approximately 1 byte per base for FASTA input. Memory usage scales with:
- Input sequence length (BitSet allocation via new BitSet(len))
- K-mer size for entropy/repeat detection (kmerspace = (1<<(2*k)) array allocation)
- Number of SAM files and coverage depth (ConcurrentHashMap<String, CoverageArray> sizing)
- Heap size: the launch script starts from z="-Xmx1g" and adjusts it through its freeRam 3200m 84 calculation, which is typically sufficient for most genomes; use -Xmx to override it when it is not
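A rough back-of-envelope estimate can be assembled from the components listed above. The sketch below is only that: the function name `estimate_memory_bytes` is hypothetical, and it assumes one mask bit per base, one short (2-byte) count array of kmerspace entries per thread, and 2 or 4 bytes per base for coverage when SAM files are used. Real usage will be higher because of JVM and buffering overhead.

```python
def estimate_memory_bytes(total_bases, k=5, threads=1,
                          use_sam=False, bits32=False):
    """Very rough lower-bound estimate of memory needed, in bytes."""
    sequence = total_bases                      # ~1 byte per base
    mask_bits = total_bases // 8 + 1            # 1 bit per base
    kmer_counts = threads * (1 << (2 * k)) * 2  # assumed per-thread short[]
    coverage = 0
    if use_sam:
        coverage = total_bases * (4 if bits32 else 2)
    return sequence + mask_bits + kmer_counts + coverage


genome = 4_600_000  # roughly the size of an E. coli genome
print(estimate_memory_bytes(genome) / 1e6, "MB without SAM files")
print(estimate_memory_bytes(genome, use_sam=True) / 1e6, "MB with SAM files")
```

Under these assumptions the sequence itself dominates for small k, while coverage arrays roughly triple the footprint when SAM files are supplied.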
Real-World Applications
JGI Contamination Removal Pipeline
Masked references of various vertebrate organisms were prepared using BBMask for removing contaminant reads from libraries with zero false-positives. The process involved:
- Using BBMask with default settings on target genomes
- Shredding all Mycocosm and Phytozome genomes into 80bp overlapping pieces
- Mapping these fragments to vertebrate genomes and using resulting SAM files for masking
- Manual verification to distinguish contamination from legitimate sequence similarity
- Conservative masking approach: sequences shared between human and fungi are masked if they also align to fish (evolutionary bottleneck principle)
Ribosomal RNA Masking
BBMask can mask ribosomal sequences using databases like Silva. This is useful because rRNA sequences are highly conserved across species and can cause false-positive alignments in contamination detection.
Related Tools
For filtering low-entropy sequences rather than masking them, see the BBDuk documentation and BBDuk Guide.
Support
For questions and support:
- Email: bbushnell@lbl.gov
- Documentation: bbmap.org
- Guide: BBMask Guide