BBMask

Script: bbmask.sh Package: jgi Class: BBMask.java

Masks sequences that are low-complexity, contain repeat kmers, or are covered by mapped reads. Designed as a replacement for tools like Dust, which are slow and do not work well for preventing false-positive matches in highly conserved or low-complexity regions of genomes. By default, masks using entropy with window=80 and entropy=0.70.

Overview

BBMask performs three types of masking. All are optional and can be used individually or in combination:

  1. Low-entropy (complexity) - Identifies repetitive or low-complexity sequences using Shannon entropy calculation
  2. Tandem-repeated kmers - Masks exact k-mer repeats above specified frequency thresholds
  3. SAM-file coverage - Masks regions based on read mapping coverage from alignment files

BBMask loads all sequences into memory to allow multiple masking operations and requires approximately 1 byte per base for FASTA input.

Basic Usage

bbmask.sh in=<file> out=<file> sam=<file,file,...file>

Input may be stdin or a fasta or fastq file, raw or gzipped. SAM files are optional; sam= accepts a comma-delimited list of SAM files whose mapped regions will be masked. SAM files may also be given as arguments without sam=, so you can use *.sam, for example. If you pipe via stdin/stdout, please include the file type; e.g. for gzipped fasta input, set in=stdin.fa.gz

Parameters

BBMask organizes parameters into logical groups based on their function in the masking process. The tool supports three main masking strategies: repeat detection, low-complexity/entropy masking, and coverage-based masking from SAM files.

Input parameters

in=<file>
Input sequences to mask. 'in=stdin.fa' will pipe from standard in.
sam=<file,file>
Comma-delimited list of sam files. Optional. Their mapped coordinates will be masked.
touppercase=f
(tuc) Change all letters to upper-case.
interleaved=auto
(int) If true, forces fastq input to be paired and interleaved.
qin=auto
ASCII offset for input quality. May be 33 (Sanger), 64 (Illumina), or auto.

Output parameters

out=<file>
Write masked sequences here. 'out=stdout.fa' will pipe to standard out.
overwrite=t
(ow) Set to false to force the program to abort rather than overwrite an existing file.
ziplevel=2
(zl) Set to 1 (lowest) through 9 (max) to change compression level; lower compression is faster.
fastawrap=70
Length of lines in fasta output.
qout=auto
ASCII offset for output quality. May be 33 (Sanger), 64 (Illumina), or auto (same as input).

Processing parameters

threads=auto
(t) Set number of threads to use; default is number of logical processors.
maskrepeats=f
(mr) Mask areas covered by exact repeat kmers.
kr=5
Kmer size to use for repeat detection (1-15). Use minkr and maxkr to sweep a range of kmers.
minlen=40
Minimum length of repeat area to mask.
mincount=4
Minimum number of repeats to mask.
masklowentropy=t
(mle) Mask areas with low complexity by calculating entropy over a window for a fixed kmer size.
ke=5
Kmer size to use for entropy calculation (1-15). Use minke and maxke to sweep a range. Large ke uses more memory.
window=80
(w) Window size for entropy calculation.
entropy=0.70
(e) Mask windows with entropy under this value (0-1). 0.0001 will mask only homopolymers and 1 will mask everything.
lowercase=f
(lc) Convert masked bases to lower case. Default is to convert them to N.
split=f
Split into unmasked pieces and discard masked pieces.

Coverage parameters (only relevant if sam files are specified)

mincov=-1
If nonnegative, mask bases with coverage below this value (lower bound of the allowed range).
maxcov=-1
If nonnegative, mask bases with coverage above this value (upper bound of the allowed range).
delcov=t
Include deletions when calculating coverage.

Note: If neither mincov nor maxcov is set, all covered bases will be masked.

Other parameters

pigz=t
Use pigz to compress. If argument is a number, that will set the number of pigz threads.
unpigz=t
Use pigz to decompress.

Java Parameters

-Xmx
This will set Java's memory usage, overriding autodetection. -Xmx20g will specify 20 gigs of RAM, and -Xmx200m will specify 200 megs. The max is typically 85% of physical memory.
-eoom
This flag will cause the process to exit if an out-of-memory exception occurs. Requires Java 8u92+.
-da
Disable assertions.

Entropy Threshold Guidelines

Entropy is calculated as the Shannon entropy of kmers in a window and ranges from 0 to 1. A threshold of 0 masks nothing and a threshold of 1 masks everything. It can be challenging to determine the optimal threshold, so here are reference points:

Entropy Threshold    E. coli Genome      Human Genome
0.7 (default)        107 bases masked    ~0.7% masked
0.9                  ~8 kbp masked       ~7% masked

Examples

Basic Entropy Masking

bbmask.sh in=ref.fa out=masked.fa entropy=0.7

Masks low-entropy regions using default parameters. This is the most common usage for general sequence masking.

Contamination Removal Workflow

# Step 1: Shred contaminant genome into overlapping fragments
shred.sh in=contaminant.fa out=shredded.fa length=80 minlength=70 overlap=40

# Step 2: Map fragments to target genome
bbmap.sh ref=target.fa in=shredded.fa outm=mapped.sam minid=0.85 maxindel=2

# Step 3: Mask similar regions in target genome
bbmask.sh in=target.fa out=masked.fa entropy=0.7 sam=mapped.sam

Complete workflow for masking sequences in genome A that are similar to those in genome B, plus low-entropy sequences. This approach is used at JGI for vertebrate contamination removal with zero false-positives.

Repeat Masking

bbmask.sh in=assembly.fasta out=masked.fasta mr=t kr=7 minlen=50 mincount=3

Enables repeat masking with 7-mers, requiring at least 3 repeats and a minimum repeat length of 50 bp before a region is masked.

Coverage-Based Masking

bbmask.sh in=assembly.fasta out=masked.fasta sam=reads.sam mincov=5 maxcov=100

Masks regions with coverage outside the range 5-100x using alignment data from reads.sam.

Combined Masking Strategies

bbmask.sh in=assembly.fasta out=masked.fasta mr=t mle=t sam=reads1.sam,reads2.sam

Applies all three masking strategies: repeat masking, low-entropy masking, and coverage masking from multiple SAM files.

Soft Masking

bbmask.sh in=assembly.fasta out=soft_masked.fasta lc=t entropy=0.6

Converts masked bases to lowercase instead of 'N', useful for downstream analysis that distinguishes hard vs soft masking.

Algorithm Details

Three-Strategy Masking Approach

BBMask implements three independent masking strategies that can be used individually or in combination:

1. Low-Entropy/Complexity Masking (masklowentropy=t)

Identifies low-complexity regions using Shannon entropy calculation over sliding windows via maskLowEntropy() and EntropyTracker methods.
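
As a rough illustration of window-based entropy masking, the Python sketch below computes a normalized Shannon entropy over the kmers in each window and masks any window that falls under the cutoff. It is not BBMask's code. The function names window_entropy and mask_low_entropy are hypothetical, and normalizing by the log of min(kmers in window, 4^k) is an assumption made so that values land in the 0-1 range used by the entropy parameter. Each window is recomputed from scratch for clarity rather than speed.

```python
import math
from collections import Counter


def window_entropy(seq, k=5):
    """Normalized Shannon entropy of the k-mers in seq, from 0 to 1."""
    n = len(seq) - k + 1  # number of k-mers in the window
    if n <= 1:
        return 0.0  # too short to measure; treated as low complexity here
    counts = Counter(seq[i:i + k] for i in range(n))
    h = sum(-(c / n) * math.log(c / n) for c in counts.values())
    # Assumed normalization: the most diverse window possible scores 1.
    h_max = math.log(min(n, 4 ** k))
    return h / h_max


def mask_low_entropy(seq, k=5, window=80, cutoff=0.70, mask_char="N"):
    """Mask every base lying in at least one window with entropy < cutoff."""
    upper = seq.upper()
    flagged = bytearray(len(seq))  # 1 = base should be masked
    last_start = max(0, len(seq) - window)
    for start in range(last_start + 1):
        end = min(len(seq), start + window)
        if window_entropy(upper[start:end], k) < cutoff:
            flagged[start:end] = b"\x01" * (end - start)
    return "".join(mask_char if f else base for base, f in zip(seq, flagged))


# A homopolymer has a single distinct k-mer, so every window is masked.
print(mask_low_entropy("A" * 100))
```

The defaults k=5, window=80, and cutoff=0.70 mirror the documented ke, window, and entropy defaults. Lowering the cutoff masks less, and raising it masks more.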

2. Tandem-Repeated K-mer Masking (maskrepeats=t)

Uses exact k-mer matching via getInitialKey() and repeatLength() methods to identify repetitive sequences.
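
The idea of masking back-to-back kmer repeats can be sketched in Python as follows. This is a simplified reading of the kr, mincount, and minlen parameters, not a port of getInitialKey() or repeatLength(). The function name mask_tandem_repeats is hypothetical, and treating mincount as the number of consecutive copies of the kmer is an assumption.

```python
def mask_tandem_repeats(seq, k=5, min_count=4, min_len=40, mask_char="N"):
    """Mask runs where the k-mer at a position repeats back-to-back."""
    upper = seq.upper()
    flagged = bytearray(len(seq))  # 1 = base should be masked
    i = 0
    while i + k <= len(upper):
        unit = upper[i:i + k]
        j = i + k
        # Extend the run while the next k bases repeat the same unit exactly.
        while upper[j:j + k] == unit:
            j += k
        copies = (j - i) // k
        if copies >= min_count and j - i >= min_len:
            flagged[i:j] = b"\x01" * (j - i)
            i = j  # resume scanning after the masked run
        else:
            i += 1
    return "".join(mask_char if f else base for base, f in zip(seq, flagged))


# Ten copies of ACGTG (50 bp) between two non-repetitive flanks.
print(mask_tandem_repeats("TTGCA" + "ACGTG" * 10 + "CCATG"))
```

Both thresholds must be met before a run is masked, so a short kmer repeated only a few times is left alone.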

3. SAM-File Coverage Masking

Masks regions based on read mapping coverage from SAM files via maskSam_MT() and MaskSamThread processing.
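
A minimal Python sketch of coverage-based masking is shown below. To stay self-contained it takes alignments as 0-based half-open (start, end) intervals instead of parsing SAM records, and it does not model delcov. The function name mask_by_coverage is hypothetical, and treating mincov and maxcov as lower and upper bounds of an allowed range is an interpretation of the parameter descriptions, not a statement about the Java implementation.

```python
def mask_by_coverage(seq, alignments, mincov=-1, maxcov=-1, mask_char="N"):
    """Mask bases by read coverage.

    alignments: iterable of (start, end) intervals, 0-based, half-open.
    With neither bound set, every covered base is masked; otherwise bases
    whose coverage falls outside the allowed range are masked.
    """
    coverage = [0] * len(seq)
    for start, end in alignments:
        for i in range(max(0, start), min(len(seq), end)):
            coverage[i] += 1

    out = []
    for base, cov in zip(seq, coverage):
        if mincov < 0 and maxcov < 0:
            masked = cov > 0
        else:
            too_low = mincov >= 0 and cov < mincov
            too_high = maxcov >= 0 and cov > maxcov
            masked = too_low or too_high
        out.append(mask_char if masked else base)
    return "".join(out)


# One read covering positions 2-4 of a 10 bp sequence, no bounds set.
print(mask_by_coverage("ACGTACGTAC", [(2, 5)]))
```

With no bounds, the sketch matches the contamination workflow, where any region hit by a mapped fragment is masked. With bounds, it matches the coverage-range example.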

Performance Optimizations

Masking Output Options
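
The lowercase and split parameters described under Processing parameters control how masked bases appear in the output: converted to N by default, converted to lower case, or dropped so that only unmasked pieces remain. The Python sketch below illustrates the three styles on a sequence with per-base mask flags. The function name render_masked is hypothetical, and giving split priority over lowercase is an assumption.

```python
def render_masked(seq, flagged, lowercase=False, split=False):
    """Apply one output style to a sequence and its per-base mask flags.

    Returns a list of unmasked pieces when split is true, else a string.
    """
    if split:
        pieces, current = [], []
        for base, f in zip(seq, flagged):
            if f:
                # A masked base ends the current unmasked piece.
                if current:
                    pieces.append("".join(current))
                    current = []
            else:
                current.append(base)
        if current:
            pieces.append("".join(current))
        return pieces
    if lowercase:
        return "".join(b.lower() if f else b for b, f in zip(seq, flagged))
    return "".join("N" if f else b for b, f in zip(seq, flagged))


flags = [0, 0, 1, 1, 0, 0]
print(render_masked("ACGTAC", flags))                  # hard masking
print(render_masked("ACGTAC", flags, lowercase=True))  # soft masking
print(render_masked("ACGTAC", flags, split=True))      # unmasked pieces only
```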

Memory Requirements

BBMask requires approximately 1 byte per base for FASTA input, because all sequences are loaded into memory to allow multiple masking operations.
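
The 1 byte per base figure can be turned into a rough -Xmx value with a short calculation. In the Python sketch below, the function name suggest_xmx and the 1.5x headroom factor are assumptions chosen for illustration, not BBTools recommendations.

```python
import math


def suggest_xmx(total_bases, bytes_per_base=1.0, headroom=1.5):
    """Return a -Xmx flag sized for total_bases of FASTA input.

    headroom is an assumed safety factor for JVM and working overhead.
    """
    needed_bytes = total_bases * bytes_per_base * headroom
    gigs = max(1, math.ceil(needed_bytes / 2 ** 30))
    return f"-Xmx{gigs}g"


# Roughly human-sized input (3.1 Gbp).
print(suggest_xmx(3_100_000_000))
```

Fastq input also carries quality values, so it would need more than this estimate.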

Real-World Applications

JGI Contamination Removal Pipeline

Masked references of various vertebrate organisms were prepared using BBMask for removing contaminant reads from libraries with zero false-positives. The process followed the shred, map, and mask steps shown in the Contamination Removal Workflow example.

Ribosomal RNA Masking

BBMask can effectively mask ribosomal sequences using databases like Silva, since rRNA sequences are highly conserved across species and can cause false-positive alignments in contamination detection.

Related Tools

For filtering low-entropy sequences rather than masking them, see the BBDuk documentation and BBDuk Guide.

Support

For questions and support: