BBDuk

Script: bbduk.sh Package: jgi Class: BBDuk.java

Decontamination Using Kmers. Kmer-based sequence trimming, filtering, and masking tool that combines most common data-quality-related operations into a single pass: quality trimming, adapter removal, contaminant filtering, sequence masking, GC filtering, length filtering, entropy filtering, format conversion, histogram generation, and various other operations.

Basic Usage

bbduk.sh in=<input file> out=<output file> ref=<contaminant files>

Input may be stdin or a fasta or fastq file, compressed or uncompressed. If you pipe via stdin/stdout, please include the file type; e.g. for gzipped fasta input, set in=stdin.fa.gz

Memory Management

BBDuk autodetects available memory and uses approximately half by default. Memory requirements are linearly proportional to reference size: only processing large references, or using high values of hdist or edist, require substantial memory. E. coli (4.6MB) uses ~140MB with default settings; hdist=1 expands to ~15GB due to storing 94× more kmers (427 million vs 4.5 million).

Memory Recommendations

  • Small references (adapters, bacterial genomes): -Xmx1g
  • Large references (mammalian genomes): Use ~85% of physical memory (e.g., -Xmx27g on 32GB machine)
  • High edit distance: Memory scales as (3×K)^hdist for hamming distance, (8×K)^edist for edit distance

Threading Architecture

Processing occurs in two phases: Phase 1 loads reference kmers using 7 fixed threads (typically under 1 second for small references). Phase 2 autodetects and uses all available processors unless restricted with the 't' parameter. BBDuk scales well across multiple cores unless on a shared node.

Common Use Cases

Typical Preprocessing Pipeline Order

For complete read preprocessing, the recommended multi-pass approach follows this order:

  1. Force trim modulo: ftm=5 removes inaccurate extra bases (151bp → 150bp)
  2. Adapter trimming: ktrim=r k=23 mink=11 hdist=1 tpe tbo
  3. Quality trimming: qtrim=r trimq=10 (uses Phred algorithm)
  4. Contaminant filtering: Remove PhiX, rRNA, or custom references
  5. Length/quality filtering: minlen=50 maq=20

Built-in References

BBDuk includes commonly-used reference sequences accessible by keyword:

Parameters

BBDuk parameters are organized by function. Note: if ktrim, kmask, and ksplit are unset, the default behavior is kfilter. All kmer processing modes are mutually exclusive. Reads only get sent to 'outm' based on kmer matches in kfilter mode.

Input parameters

in=<file>
Main input. in=stdin.fq will pipe from stdin.
in2=<file>
Input for 2nd read of pairs in a different file.
ref=<file,file>
Comma-delimited list of reference files. In addition to filenames, you may also use the keywords: adapters, artifacts, phix, lambda, pjet, mtst, kapa
literal=<seq,seq>
Comma-delimited list of literal reference sequences. Polymers are also allowed with the 'poly' prefix; for example, 'literal=ATGGT,polyGC' will add both ATGGT and GCGCGCGCGCGCGCGCGCGCGCGCGCGCGCGCGCGC - 32+ of them, enough replicates to ensure that all kmers are present.
touppercase=f
(tuc) Change all bases upper-case.
interleaved=auto
(int) t/f overrides interleaved autodetection. Must be set manually when streaming fastq input.
qin=auto
Input quality offset: 33 (Sanger), 64, or auto.
reads=-1
If positive, quit after processing X reads or pairs.
copyundefined=f
(cu) Process non-AGCT IUPAC reference bases by making all possible unambiguous copies. Intended for short motifs or adapter barcodes, as time/memory use is exponential.
samplerate=1
Set lower to only process a fraction of input reads.
samref=<file>
Optional reference fasta for processing sam files.

Output parameters

out=<file>
(outnonmatch) Write reads here that do not contain kmers matching the database. 'out=stdout.fq' will pipe to standard out.
out2=<file>
(outnonmatch2) Use this to write 2nd read of pairs to a different file.
outm=<file>
(outmatch) Write reads here that fail filters. In default kfilter mode, this means any read with a matching kmer. In any mode, it also includes reads that fail filters such as minlength, mingc, maxgc, entropy, etc. In other words, it includes all reads that do not go to 'out'.
outm2=<file>
(outmatch2) Use this to write 2nd read of pairs to a different file.
outs=<file>
(outsingle) Use this to write singleton reads whose mate was trimmed shorter than minlen.
stats=<file>
Write statistics about which contamininants were detected.
refstats=<file>
Write statistics on a per-reference-file basis.
rpkm=<file>
Write RPKM for each reference sequence (for RNA-seq).
dump=<file>
Dump kmer tables to a file, in fasta format.
duk=<file>
Write statistics in duk's format. *DEPRECATED*
nzo=t
Only write statistics about ref sequences with nonzero hits.
overwrite=t
(ow) Grant permission to overwrite files.
showspeed=t
(ss) 'f' suppresses display of processing speed.
ziplevel=2
(zl) Compression level; 1 (min) through 9 (max).
fastawrap=70
Length of lines in fasta output.
qout=auto
Output quality offset: 33 (Sanger), 64, or auto.
statscolumns=3
(cols) Number of columns for stats output, 3 or 5. 5 includes base counts.
rename=f
Rename reads to indicate which sequences they matched.
refnames=f
Use names of reference files rather than scaffold IDs.
trd=f
Truncate read and ref names at the first whitespace.
ordered=f
Set to true to output reads in same order as input.
maxbasesout=-1
If positive, quit after writing approximately this many bases to out (outu/outnonmatch).
maxbasesoutm=-1
If positive, quit after writing approximately this many bases to outm (outmatch).
json=f
Print to screen in json format.

Histogram output parameters

bhist=<file>
Base composition histogram by position.
qhist=<file>
Quality histogram by position.
qchist=<file>
Count of bases with each quality value.
aqhist=<file>
Histogram of average read quality.
bqhist=<file>
Quality histogram designed for box plots.
lhist=<file>
Read length histogram.
phist=<file>
Polymer length histogram.
gchist=<file>
Read GC content histogram.
enthist=<file>
Read entropy histogram.
ihist=<file>
Insert size histogram, for paired reads in mapped sam.
gcbins=100
Number gchist bins. Set to 'auto' to use read length.
maxhistlen=6000
Set an upper bound for histogram lengths; higher uses more memory. The default is 6000 for some histograms and 80000 for others.

Histogram parameters for mapped sam/bam files only

histbefore=t
Calculate histograms from reads before processing.
ehist=<file>
Errors-per-read histogram.
qahist=<file>
Quality accuracy histogram of error rates versus quality score.
indelhist=<file>
Indel length histogram.
mhist=<file>
Histogram of match, sub, del, and ins rates by position.
idhist=<file>
Histogram of read count versus percent identity.
idbins=100
Number idhist bins. Set to 'auto' to use read length.
varfile=<file>
Ignore substitution errors listed in this file when calculating error rates. Can be generated with CallVariants.
vcf=<file>
Ignore substitution errors listed in this VCF file when calculating error rates.
ignorevcfindels=t
Also ignore indels listed in the VCF.

Processing parameters

k=27
Kmer length used for finding contaminants. Contaminants shorter than k will not be found. k must be at least 1. For k>31, BBDuk uses emulated long kmers by requiring consecutive 31-mer matches.
rcomp=t
Look for reverse-complements of kmers in addition to forward kmers. Set rcomp=f to match only forward orientation if exact directional matching is required.
maskmiddle=t
(mm) Treat the middle base of a kmer as a wildcard, to increase sensitivity in the presence of errors. This may also be set to a number, e.g. mm=3, to mask that many bp. Default mm=t corresponds to mm=1 for odd-length kmers and mm=2 for even-length kmers, while mm=f equals mm=0. Disabled when using mink or kbig>k.
minkmerhits=1
(mkh) Reads need at least this many matching kmers to be considered as matching the reference.
minkmerfraction=0.0
(mkf) A reads needs at least this fraction of its total kmers to hit a ref, in order to be considered a match. If this and minkmerhits are set, the greater is used.
mincovfraction=0.0
(mcf) A reads needs at least this fraction of its total bases to be covered by ref kmers to be considered a match. If specified, mcf overrides mkh and mkf.
hammingdistance=0
(hdist) Maximum Hamming distance for ref kmers (subs only). Memory use scales as (3×K)^hdist: E. coli with k=31, hdist=1 uses ~15GB vs 140MB at hdist=0.
qhdist=0
Hamming distance for query kmers; impacts speed, not memory. Alternative to hdist for memory-constrained systems - mutates read kmers instead of reference kmers.
editdistance=0
(edist) Maximum edit distance from ref kmers (subs and indels). Memory use scales as (8×K)^edist. Use qhdist instead of edist/hdist if insufficient memory available.
hammingdistance2=0
(hdist2) Sets hdist for short kmers, when using mink.
qhdist2=0
Sets qhdist for short kmers, when using mink.
editdistance2=0
(edist2) Sets edist for short kmers, when using mink.
forbidn=f
(fn) Forbids matching of read kmers containing N. By default, these will match a reference 'A' if hdist>0 or edist>0, to increase sensitivity.
removeifeitherbad=t
(rieb) Paired reads get sent to 'outmatch' if either is match (or either is trimmed shorter than minlen). Set to false to require both.
trimfailures=f
Instead of discarding failed reads, trim them to 1bp. This makes the statistics a bit odd.
findbestmatch=f
(fbm) If multiple matches, associate read with sequence sharing most kmers. Reduces speed.
skipr1=f
Don't do kmer-based operations on read 1.
skipr2=f
Don't do kmer-based operations on read 2.
ecco=f
For overlapping paired reads only. Performs error-correction with BBMerge prior to kmer operations.
recalibrate=f
(recal) Recalibrate quality scores. Requires calibration matrices generated by CalcTrueQuality.
sam=<file,file>
If recalibration is desired, and matrices have not already been generated, BBDuk will create them from the sam file.
amino=f
Run in amino acid mode. Some features have not been tested, but kmer-matching works fine. Maximum k is 12.

Speed and Memory parameters

threads=auto
(t) Set number of threads to use; default is number of logical processors.
prealloc=f
Preallocate memory in table. Allows faster table loading and reduced memory usage for large references.
monitor=f
Kill this process if it crashes. monitor=600,0.01 would kill after 600 seconds under 1% usage.
minrskip=1
(mns) Force minimal skip interval when indexing reference kmers. 1 means use all, 2 means use every other kmer, etc.
maxrskip=1
(mxs) Restrict maximal skip interval when indexing reference kmers. Normally all are used for scaffolds<100kb, but with longer scaffolds, up to maxrskip-1 are skipped.
rskip=
Set both minrskip and maxrskip to the same value. If not set, rskip will vary based on sequence length.
qskip=1
Skip query kmers to increase speed. 1 means use all.
speed=0
Ignore this fraction of kmer space (0-15 out of 16) in both reads and reference. Increases speed and reduces memory.

Trimming/Filtering/Masking parameters

ktrim=f
Trim reads to remove bases matching reference kmers, plus all bases to the left or right. Values: f (don't trim), r (trim to the right), l (trim to the left)
ktrimtips=0
Set this to a positive number to perform ktrim on both ends, examining only the outermost X bases.
kmask=
Replace bases matching ref kmers with another symbol. Allows any non-whitespace character, and processes short kmers on both ends if mink is set. 'kmask=lc' will convert masked bases to lowercase.
maskfullycovered=f
(mfc) Only mask bases that are fully covered by kmers.
ksplit=f
For single-ended reads only. Reads will be split into pairs around the kmer. If the kmer is at the end of the read, it will be trimmed instead. Singletons will go to out, and pairs will go to outm. Do not use ksplit with other operations such as quality-trimming or filtering.
mink=0
Look for shorter kmers at read tips down to this length, when k-trimming or masking. 0 means disabled. Essential for adapter trimming when adapter remnants are shorter than k. Enabling this will disable maskmiddle.
qtrim=f
Trim read ends to remove bases with quality below trimq. Performed AFTER looking for kmers. Uses Phred algorithm for more accurate trimming than naive approaches. Values: rl (trim both ends), f (neither end), r (right end only), l (left end only), w (sliding window).
trimq=6
Regions with average quality BELOW this will be trimmed, if qtrim is set to something other than f. Can be a floating-point number like 7.3.
quantize
Bin quality scores to reduce file size. quantize=2 will eliminate all odd quality scores, while quantize=0,10,37 will only allow quality scores of 0, 10, or 37.
trimclip=f
Trim soft-clipped bases from sam files.
minlength=10
(ml) Reads shorter than this after trimming will be discarded. Pairs will be discarded if both are shorter.
mlf=0
(minlengthfraction) Reads shorter than this fraction of original length after trimming will be discarded.
maxlength=
Reads longer than this after trimming will be discarded.
minavgquality=0
(maq) Reads with average quality (after trimming) below this will be discarded.
maqb=0
If positive, calculate maq from this many initial bases.
minbasequality=0
(mbq) Reads with any base below this quality (after trimming) will be discarded.
maxns=-1
If non-negative, reads with more Ns than this (after trimming) will be discarded.
mcb=0
(minconsecutivebases) Discard reads without at least this many consecutive called bases.
ottm=f
(outputtrimmedtomatch) Output reads trimmed to shorter than minlength to outm rather than discarding.
tp=0
(trimpad) Trim this much extra around matching kmers.
tbo=f
(trimbyoverlap) Trim adapters based on where paired reads overlap. Uses BBMerge internally and does not require known adapter sequences.
strictoverlap=t
Adjust sensitivity for trimbyoverlap mode.
minoverlap=14
Require this many bases of overlap for detection.
mininsert=40
Require insert size of at least this for overlap. Should be reduced to 16 for small RNA sequencing.
tpe=f
(trimpairsevenly) When kmer right-trimming, trim both reads to the minimum length of either. Useful when adapter kmer detected in only one read of a pair.
forcetrimleft=0
(ftl) If positive, trim bases to the left of this position (exclusive, 0-based).
forcetrimright=0
(ftr) If positive, trim bases to the right of this position (exclusive, 0-based).
forcetrimright2=0
(ftr2) If positive, trim this many bases on the right end.
forcetrimmod=0
(ftm) If positive, right-trim length to be equal to zero, modulo this number. Use ftm=5 to remove inaccurate extra bases from 151bp reads (converts to 150bp), common in Illumina runs.
restrictleft=0
If positive, only look for kmer matches in the leftmost X bases.
restrictright=0
If positive, only look for kmer matches in the rightmost X bases.
mingc=0
Discard reads with GC content below this.
maxgc=1
Discard reads with GC content above this.
gcpairs=t
Use average GC of paired reads. Also affects gchist.
tossjunk=f
Discard reads with invalid characters as bases.
swift=f
Trim Swift sequences: Trailing C/T/N R1, leading G/A/N R2.

Header-parsing parameters - these require Illumina headers

chastityfilter=f
(cf) Discard reads with id containing ' 1:Y:' or ' 2:Y:'.
barcodefilter=f
Remove reads with unexpected barcodes if barcodes is set, or barcodes containing 'N' otherwise. A barcode must be the last part of the read header. Values: t: Remove reads with bad barcodes. f: Ignore barcodes. crash: Crash upon encountering bad barcodes.
barcodes=
Comma-delimited list of barcodes or files of barcodes.
xmin=-1
If positive, discard reads with a lesser X coordinate.
ymin=-1
If positive, discard reads with a lesser Y coordinate.
xmax=-1
If positive, discard reads with a greater X coordinate.
ymax=-1
If positive, discard reads with a greater Y coordinate.

Polymer trimming parameters

trimpolya=0
If greater than 0, trim poly-A or poly-T tails of at least this length on either end of reads.
trimpolygleft=0
If greater than 0, trim poly-G prefixes of at least this length on the left end of reads. Does not trim poly-C.
trimpolygright=0
If greater than 0, trim poly-G tails of at least this length on the right end of reads. Does not trim poly-C.
trimpolyg=0
This sets both left and right at once.
filterpolyg=0
If greater than 0, remove reads with a poly-G prefix of at least this length (on the left).

Polymer tracking parameters

pratio=base,base
'pratio=G,C' will print the ratio of G to C polymers.
plen=20
Length of homopolymers to count.

Entropy/Complexity parameters

entropy=-1
Set between 0 and 1 to filter reads with entropy below that value. Higher is more stringent.
entropywindow=50
Calculate entropy using a sliding window of this length.
entropyk=5
Calculate entropy using kmers of this length.
minbasefrequency=0
Discard reads with a minimum base frequency below this.
entropytrim=f
Values: f: (false) Do not entropy-trim. r: (right) Trim low entropy on the right end only. l: (left) Trim low entropy on the left end only. rl: (both) Trim low entropy on both ends.
entropymask=f
Values: f: (filter) Discard low-entropy sequences. t: (true) Mask low-entropy parts of sequences with N. lc: Change low-entropy parts of sequences to lowercase.
entropymark=f
Mark each base with its entropy value. This is on a scale of 0-41 and is reported as quality scores, so the output should be fastq or fasta+qual.

Cardinality estimation parameters

cardinality=f
(loglog) Count unique kmers using the LogLog algorithm. Provides accurate cardinality estimation using minimal memory, with no upper bound on kmer length.
cardinalityout=f
(loglogout) Count unique kmers in output reads.
loglogk=31
Use this kmer length for counting.
loglogbuckets=2048
Use this many buckets for counting.
khist=<file>
Kmer frequency histogram; plots number of kmers versus kmer depth. This is approximate.
khistout=<file>
Kmer frequency histogram for output reads.

Side Channel Parameters

sideout=<file>
Output for aligned reads.
sideref=phix
Reference for side-channel alignment; must be a single sequence and virtually repeat-free at selected k.
sidek1=17
Kmer length for seeding alignment to reference.
sidek2=13
Kmer length for seeding alignment of unaligned reads with an aligned mate.
sideminid1=0.66
Minimum identity to accept individual alignments.
sideminid2=0.58
Minimum identity for aligning reads with aligned mates.
sidemm1=1
Middle mask length for sidek1.
sidemm2=1
Middle mask length for sidek2.

Java Parameters

-Xmx
This will set Java's memory usage, overriding autodetection. -Xmx20g will specify 20 gigs of RAM, and -Xmx200m will specify 200 megs. The max is typically 85% of physical memory.
-eoom
This flag will cause the process to exit if an out-of-memory exception occurs. Requires Java 8u92+.
-da
Disable assertions.

Examples

Standard Adapter Trimming

# High-quality data (recommended settings)
bbduk.sh in=reads.fq out=clean.fq ref=adapters ktrim=r k=23 mink=11 hdist=1 tpe tbo

# Low-quality data (more sensitive settings)
bbduk.sh in=reads.fq out=clean.fq ref=adapters ktrim=r k=21 mink=11 hdist=2

Key parameters: ktrim=r (right-trimming for 3' adapters), mink=11 (shorter kmers at read ends), tpe (trim pairs evenly), tbo (trim by overlap detection using BBMerge).

Quality Trimming

# Quality trimming using Phred algorithm
bbduk.sh in=reads.fq out=clean.fq qtrim=r trimq=10

# Both ends quality trimming with length filtering
bbduk.sh in=reads.fq out=clean.fq qtrim=rl trimq=10 minlen=50

Note: Quality trimming uses the Phred algorithm, which is more accurate than naive trimming approaches. qtrim is performed AFTER kmer operations.

Force Trimming

# Remove inaccurate extra base (151bp → 150bp)
bbduk.sh in=reads.fq out=clean.fq ftm=5

# Trim specific positions
bbduk.sh in=reads.fq out=clean.fq ftl=10 ftr=139

ftm explanation: Illumina runs are usually multiples of 5bp (50, 75, 100), but sometimes generate extra bases (151bp) that are inaccurate with poorly calibrated quality scores.

Contamination Filtering

# Remove PhiX contamination with statistics
bbduk.sh in=reads.fq out=unmatched.fq outm=matched.fq ref=phix k=31 hdist=1 stats=stats.txt

# Multi-reference filtering using built-in keywords
bbduk.sh in=reads.fq out=clean.fq ref=adapters,phix,artifacts k=27

Built-in references: Use keywords adapters, phix, lambda, artifacts, etc. instead of file paths. outm captures contaminated reads; stats provides per-contaminant detection summary.

Full Preprocessing Pipeline

# Full preprocessing following recommended order
bbduk.sh in1=raw_R1.fastq in2=raw_R2.fastq \
         out1=clean_R1.fastq out2=clean_R2.fastq \
         outm1=contaminated_R1.fastq outm2=contaminated_R2.fastq \
         ref=adapters,phix ktrim=r k=27 mink=11 hdist=1 \
         qtrim=rl trimq=10 ftm=5 minlen=50 maq=20 \
         tpe tbo stats=contamination_stats.txt

Processing order: Force trimming (ftm) → Adapter trimming (ktrim) → Quality trimming (qtrim) → Length/quality filtering (minlen, maq). Multiple operations combined in single pass.

RNA-seq Processing

# RNA-seq with poly-A trimming and rRNA removal
bbduk.sh in=rnaseq.fastq out=clean.fastq ref=ribokmers.fa.gz \
         k=31 trimpolya=10 minlen=25 qtrim=r trimq=8

# Small RNA sequencing
bbduk.sh in=smallrna.fastq out=clean.fastq ref=adapters \
         ktrim=r k=19 mink=8 mininsert=16 minlen=16

Small RNA note: Reduce mininsert to 16 and use shorter k values, as small RNA libraries have shorter inserts than typical fragment libraries.

Memory-Constrained Processing

# Use query distance for large references with limited memory
bbduk.sh -Xmx8g in=reads.fq out=clean.fq ref=large_genome.fa qhdist=1 k=31

# Reduce memory usage with skip patterns
bbduk.sh in=reads.fq out=clean.fq ref=large_genome.fa rskip=3 k=27

Memory alternatives: qhdist mutates read kmers (speed penalty, constant memory) vs hdist mutates reference kmers (memory scales exponentially). rskip reduces sensitivity but saves memory.

Output Stream Usage

# Three-way splitting: pass pairs, fail pairs, pass singletons
bbduk.sh in=pairs.fq out=pass_pairs.fq outm=fail_pairs.fq \
         outs=pass_singletons.fq ref=contam.fasta

# Generate multiple histograms during processing
bbduk.sh in=reads.fq out=clean.fq bhist=base_hist.txt \
         qhist=qual_hist.txt gchist=gc_hist.txt lhist=len_hist.txt

Stream logic: out (passes all filters), outm (fails any filter), outs (singleton when mate fails). Histograms generated during processing with minimal overhead.

Algorithm Details

Kmer Processing Modes

BBDuk operates in one of four mutually exclusive kmer modes (only one kmer-based operation per pass):

Kmer Table Implementation

BBDuk uses AbstractKmerTable data structures for kmer storage and lookup, with automatic selection based on reference characteristics:

Edit Distance Processing

Hamming and edit distance processing expands the kmer space through systematic mutation enumeration:

Sensitivity Enhancement

Multiple features increase matching sensitivity in the presence of sequencing errors:

Performance Optimization Strategies

BBDuk provides multiple acceleration options, though all reduce sensitivity as a trade-off:

Internal Order of Operations

BBDuk processes operations in this fixed sequence (reads discarded as soon as length drops below minlen):

  1. Histograms and cardinality estimation (on raw reads)
  2. Header filtering (chastity, location, barcode)
  3. Quality recalibration
  4. GC filtering
  5. Force trimming (ftl/ftr/ftm)
  6. Kmer operations (ktrim/kmask/kfilter - mutually exclusive)
  7. Trim by overlap (tbo)
  8. Poly-A trimming
  9. Entropy masking
  10. Quality trimming (qtrim) and soft-clip trimming
  11. Average quality filtering (maq) and maxNs filtering
  12. Entropy filtering

Workflow Integration

Paired Read Processing

BBDuk automatically detects paired reads based on read names and maintains pair relationships throughout processing:

Multi-Reference Processing

When using multiple reference files, BBDuk associates each kmer with the first reference containing it:

Pipeline Integration

BBDuk serves as a central preprocessing tool in typical sequencing workflows:

Support

For questions and support: