BBTools Resources

This page provides download links for BBTools resource files that are too large to include in the main distribution. These include ribosomal databases, taxonomy trees, contamination filtering references, and sketching databases.

Most files are hosted at NERSC (National Energy Research Scientific Computing Center). Click any filename to download directly.

🧬 Ribosomal Resources

Essential ribosomal RNA databases for metatranscriptome filtering and taxonomic identification.

📥 ribokmers.fa.gz (9.2 MB)

Essential for metatranscriptome research.

The Problem: Raw metatranscriptome samples can contain 90-99% rRNA. This bloats files, wastes sequencing capacity, and can cause RNA-seq assemblers to crash. Although wet-lab ribodepletion kits remove the majority of rRNA molecules, 50% of the remaining reads can still be rRNA.

Why Alignment Won't Work: Metatranscriptomes do not come with known ribosomal sequences that would allow alignment-based removal. Digital ribodepletion requires a different approach.

The Solution: riboKmers.fa.gz was designed using iterative passes of Silva through KCompress to find a minimal covering set of 31-mers. This captures >99.9% of 2x150bp read pairs synthetically generated from Silva SSU+LSU ribosomal sequences by prioritizing only the k-mers that, at each pass, capture the most reads.
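The greedy idea described above (at each pass, keep the k-mer that captures the most reads not yet captured) can be sketched in a few lines of Python. This is a toy illustration of greedy set cover, not KCompress or makeRiboKmers.sh; the function names, the small k, and the example reads are all assumptions made for illustration.

```python
def kmers(seq, k):
    """Return the set of all k-mers in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def greedy_kmer_cover(reads, k, target_fraction=1.0):
    """Greedily pick k-mers until target_fraction of reads contain a picked k-mer.

    Hypothetical helper for illustration only; the real pipeline is not shown
    in this document and may differ in every detail.
    """
    read_kmers = [kmers(r, k) for r in reads]
    # Only reads long enough to contain a k-mer can ever be captured.
    uncovered = {i for i, ks in enumerate(read_kmers) if ks}
    coverable = len(uncovered)
    chosen = []
    while uncovered and (coverable - len(uncovered)) < target_fraction * coverable:
        # Count, for each k-mer, how many still-uncovered reads contain it.
        counts = {}
        for i in uncovered:
            for km in read_kmers[i]:
                counts[km] = counts.get(km, 0) + 1
        # Keep the k-mer that captures the most uncovered reads
        # (ties broken alphabetically so the result is deterministic).
        best = min(counts, key=lambda km: (-counts[km], km))
        chosen.append(best)
        uncovered = {i for i in uncovered if best not in read_kmers[i]}
    return chosen


example_reads = ["ACGTAC", "CGTACG", "TTTTTT"]
print(greedy_kmer_cover(example_reads, k=4))
```

On these three toy reads the first pass picks CGTA, which is shared by two reads, and a second pass is needed only for the unrelated third read. Lowering target_fraction stops the selection earlier, which is the trade-off between set size and capture rate that the paragraph above describes.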

Why Small is Better: At only about 9 MB, riboKmers minimizes the risk of coincidental sequence collisions with coding DNA while maintaining >99.9% rRNA capture efficiency.

Usage: Use with BBDuk for digital ribodepletion:

bbduk.sh in=reads.fq out=clean.fq ref=riboKmers.fa.gz k=31
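To make the idea of k-mer-based filtering concrete, here is a minimal Python sketch of what a k-mer matcher does conceptually: a read is set aside if it shares at least one k-mer with the reference set. It is not BBDuk's implementation; the function names are hypothetical, the example uses k=5 instead of 31 to stay readable, and matching reverse complements is included as an assumption about typical k-mer filters.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse-complement a DNA string made of A, C, G, T."""
    return seq.translate(_COMPLEMENT)[::-1]


def build_kmer_set(references, k):
    """Collect every k-mer, plus its reverse complement, from the references."""
    kmer_set = set()
    for ref in references:
        for i in range(len(ref) - k + 1):
            kmer = ref[i:i + k]
            kmer_set.add(kmer)
            kmer_set.add(reverse_complement(kmer))
    return kmer_set


def split_reads(reads, kmer_set, k):
    """Return (clean, matched): reads with no shared k-mer, and reads with one."""
    clean, matched = [], []
    for read in reads:
        hit = any(read[i:i + k] in kmer_set for i in range(len(read) - k + 1))
        (matched if hit else clean).append(read)
    return clean, matched


# Toy data (assumed for illustration): one short "rRNA" reference, three reads.
ribo_kmers = build_kmer_set(["ACGGTCATTG"], k=5)
clean, matched = split_reads(["TTACGGTAA", "CCACCGTCC", "GGGGGGGGG"], ribo_kmers, k=5)
print("clean:", clean)
print("matched:", matched)
```

The second toy read matches only through the reverse complement of a reference k-mer, which is why both orientations are stored in the set.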

Created using: bbmap/pipelines/makeRiboKmers.sh

📥 all_prok_16S_best_taxsorted.fa.gz (22 MB)

A new reference standard for prokaryotic SSU sequences.

The Problem with Existing Databases: Public SSU databases contain massive redundancy - the same organism can have 30+ sequences ranging from partial fragments to N-filled sequences. Which one represents the species? Aligning queries against highly redundant, incomplete databases containing partial and low-quality sequences is very slow, and yields inconclusive results.

The Canonical Selection Solution: Each organism's 16S sequences are processed through a sophisticated algorithm:

All of an organism's sequences are aligned to create a weighted traversal graph (HMM-like).
A consensus sequence is generated from the most probable path.
The best real biological sequence is selected using weighted scoring.
This naturally favors complete, high-quality sequences while avoiding pseudogenes and fragments.
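A heavily simplified sketch of the consensus-then-select idea follows. It assumes the sequences are already aligned to equal length with '-' for gaps, replaces the weighted traversal graph with a plain per-column majority vote, and scores each real sequence by how many columns agree with the consensus. All of these are assumptions for illustration, and the function names are hypothetical; the actual algorithm is more sophisticated than this.

```python
from collections import Counter


def column_consensus(aligned):
    """Majority base per alignment column, ignoring gap characters ('-')."""
    consensus = []
    for column in zip(*aligned):
        bases = Counter(b for b in column if b != "-")
        consensus.append(bases.most_common(1)[0][0] if bases else "-")
    return "".join(consensus)


def pick_representative(aligned):
    """Return (consensus, best real sequence with gaps removed)."""
    consensus = column_consensus(aligned)

    def score(seq):
        # One point per column where a real base agrees with the consensus;
        # gaps earn nothing, so fragments score low.
        return sum(1 for a, b in zip(seq, consensus) if a == b and a != "-")

    best = max(aligned, key=score)
    return consensus, best.replace("-", "")


# Toy alignment (assumed): three full-length variants and one fragment.
rows = ["ACGTTCGT", "TCGAACGT", "--GTAC--", "ACCTACGA"]
consensus, representative = pick_representative(rows)
print(consensus, representative)
```

In this toy input the consensus ACGTACGT is not identical to any input row, so the function returns the closest real sequence, ACGTTCGT, rather than the synthetic consensus. The fragment scores lowest because its gap columns contribute nothing.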

The Result: One high-quality, complete 16S sequence per organism. Dramatically more compact than Silva 138 - at 30 MB versus 688 MB - while covering 76,225 NCBI TaxIDs instead of 48,000. BBDB combines curated sequences from Silva with gene-called sequences from RefSeq genomes; 37% of sequences come from RefSeq.

Quality Features: No N's. No U's. TaxID-labeled headers. One representative sequence per organism. Tax-sorted for optimal gzip compression.
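If you want to confirm the "No N's, No U's" properties on a downloaded copy, a small check like the following may help. It is only a sketch: the parser and function names are hypothetical, the FASTA text is passed in as a string for simplicity, and the header format is deliberately not interpreted because it is not specified here.

```python
def parse_fasta(text):
    """Parse FASTA text into a list of (header, sequence) tuples."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def find_violations(records):
    """Return headers of records whose sequence contains N or U (any case)."""
    return [h for h, s in records if set(s.upper()) & {"N", "U"}]


# Toy input (assumed): record "b" has an N and record "c" has a U.
example = ">a\nACGT\nACGT\n>b\nACGNT\n>c\nACGU\n"
print(find_violations(parse_fasta(example)))
```

For a real file you would read the decompressed text first, for example with Python's gzip module, and expect an empty list if the stated quality features hold.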

Impact: Dramatically fewer alignments required, minimal disk space, zero ambiguity about which sequence to use. Transforms taxonomic classification from navigating redundancy to obtaining direct, conclusive answers.

📥 all_euk_18S_best_taxsorted.fa.gz (7.7 MB)

Eukaryotic SSU sequences, cleanly separated from organellar 16S.

The Eukaryote Problem: Both 16S and 18S are small subunit (SSU) rRNA. Eukaryotes have their own nuclear 18S, plus mitochondrial 16S. Plants add chloroplast 16S. Which represents the species? Mixed databases make finding the true eukaryotic SSU nearly impossible.

The Solution: Split databases. Prokaryotic 16S goes in one file, eukaryotic 18S in another. Organellar 16S sequences are excluded from the eukaryotic database. When both 16S and 18S are found for the same organism during gene-calling, the 16S is discarded and the 18S is considered representative.
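The precedence rule in the last sentence can be written down as a tiny function. This is an illustrative sketch only: the tuple layout, the gene labels "16S" and "18S", and the function name are assumptions, not the data structures used by the actual gene-calling pipeline.

```python
def representative_ssu(calls):
    """Pick one SSU call per organism, letting 18S override 16S.

    calls: iterable of (organism, gene, sequence) tuples, where gene is
    "16S" or "18S" (a hypothetical layout chosen for this example).
    """
    chosen = {}
    for organism, gene, seq in calls:
        current = chosen.get(organism)
        # Keep the first call seen, but replace a 16S with an 18S when both exist.
        if current is None or (gene == "18S" and current[0] == "16S"):
            chosen[organism] = (gene, seq)
    return chosen


calls = [
    ("plant", "16S", "AAA"),      # e.g. an organellar 16S
    ("plant", "18S", "CCC"),      # nuclear 18S for the same organism
    ("bacterium", "16S", "GGG"),  # only a 16S was found
]
print(representative_ssu(calls))
```

Here the plant ends up represented by its 18S, while the organism with only a 16S call keeps it and would be routed to the prokaryotic file.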

The Result: One high-quality, complete 18S sequence per eukaryotic organism. Same canonical selection process as the prokaryotic database - consensus generation followed by best-sequence selection using weighted HMM-like scoring.

Quality Features: No N's. No U's. TaxID-labeled headers. Nuclear 18S only (no mito/chloroplast contamination). Tax-sorted for optimal gzip compression.

Usage: Integrated into BBTools programs like SendSketch and QuickClade for taxonomic classification. Provides exact SSU ANI (Average Nucleotide Identity) calculations via glocal alignment, complementing pentamer-frequency methods.
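For readers unfamiliar with the term, "glocal" alignment aligns the query end to end (global in the query) while letting it start and stop anywhere in the reference (local in the reference). The sketch below shows one simple way to compute an identity value from such an alignment. It is not the aligner used by BBTools: the scoring values, the definition of identity as matches divided by alignment columns, and the function name are all assumptions for illustration.

```python
def glocal_identity(query, ref, match=1, mismatch=-1, gap=-1):
    """Identity of the best alignment of the whole query to any part of ref."""
    n, m = len(query), len(ref)

    def sub(i, j):
        return match if query[i - 1] == ref[j - 1] else mismatch

    # score[i][j]: best score for query[:i] aligned so that it ends at ref[:j].
    # Row 0 is all zeros, so skipping a reference prefix is free.
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + sub(i, j),  # match or mismatch
                score[i - 1][j] + gap,            # query base against a gap
                score[i][j - 1] + gap,            # reference base against a gap
            )

    # Skipping a reference suffix is free too: start from the best last-row cell.
    i, j = n, max(range(m + 1), key=lambda c: score[n][c])
    matches = columns = 0
    while i > 0:
        if j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            if query[i - 1] == ref[j - 1]:
                matches += 1
            i, j = i - 1, j - 1
        elif score[i][j] == score[i - 1][j] + gap:
            i -= 1
        else:
            j -= 1
        columns += 1
    return matches / columns if columns else 0.0


print(glocal_identity("ACGT", "TTACGTTT"))
print(glocal_identity("ACGT", "TTACCTTT"))
```

The first call finds the query embedded exactly in the reference; the second aligns it over a single mismatch, so three of four columns match. Flanking reference bases do not count against the identity in either case, which is the point of the glocal formulation.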

📊 Taxonomy

📥 tree.taxtree.gz

Complete NCBI taxonomic tree with all numeric TaxIDs and node names. Used by programs that require taxonomy. Should be placed in the /resources/ directory.

Created using: bbmap/pipelines/fetchTaxonomy.sh

Note: Needs to be updated periodically as NCBI taxonomy changes.

🧹 Contamination Filtering References

Masked reference genomes for removing contaminant reads from sequencing data.

📥 hg19_masked.fa.gz

Masked version of human reference genome (hg19) for removing human contaminant reads.

📥 mouse_masked.fa.gz

Masked version of mouse reference genome for removing mouse contaminant reads.

📥 cat_masked.fa.gz

Masked version of cat reference genome for removing cat contaminant reads.

📥 dog_masked.fa.gz

Masked version of dog reference genome for removing dog contaminant reads.

📥 fusedEPmasked2.fa.gz

Masked version of common bacterial contaminants for removing bacterial contamination from eukaryotic samples.

📥 fusedERPBBmasked2.fa.gz

Masked version of common bacterial contaminants for removing bacterial contamination from bacterial samples. Same organisms as above, but more heavily masked for sequences shared with other non-contaminant bacteria.

🔍 Sketching Databases

Pre-computed BBSketch databases for rapid taxonomic identification and genome comparison.

📥 refseqA48_with_ribo.spectra.gz (~240 MB)

RefSeq spectra database with ribosomal sequences. Contains trimer, tetramer, and pentamer counts of RefSeq genomes. Used by QuickClade and programs that use it indirectly (like GradeBins).
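As a rough picture of what trimer, tetramer, and pentamer counts are, the following sketch tallies overlapping k-mers for k = 3, 4, and 5. It says nothing about the actual .spectra.gz file format, and it makes simplifying assumptions that may not match the real tool: k-mers are counted on one strand only (no reverse-complement merging) and windows containing anything other than A, C, G, T are skipped. The function names are hypothetical.

```python
from collections import Counter


def kmer_counts(seq, k):
    """Count overlapping k-mers in seq, skipping windows with non-ACGT symbols."""
    seq = seq.upper()
    counts = Counter()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if set(kmer) <= set("ACGT"):
            counts[kmer] += 1
    return counts


def spectra(seq, ks=(3, 4, 5)):
    """Return {k: Counter of k-mers} for each requested k."""
    return {k: kmer_counts(seq, k) for k in ks}


profile = spectra("ACGTACGT")
for k, counts in profile.items():
    print(k, dict(counts))
```

Composition profiles like these are compact and fixed in size regardless of genome length, which is what makes them convenient for fast comparisons.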

Installation: Can be dropped into the /resources/ directory.

📥 refseqSketch.tar

BBSketches of RefSeq genomes, one sketch per TaxID. Untar before use.

📥 prokProtSketch.tar

BBSketches in protein space of RefSeq prokaryotes, one sketch per TaxID. Untar before use.

📥 ntSketch.tar

BBSketches of NCBI nt database, one sketch per TaxID. Untar before use.

📥 silvaSeqSketch.tar

BBSketches of Silva database, one sketch per sequence. Untar before use.

📥 silvaTaxaSketch.tar

BBSketches of Silva database, one sketch per TaxID. Untar before use.

⚙️ RQCFilter Data

📥 RQCFilterData.tar (10+ GB)

Complete data dependencies for RQCFilter. Required for using rqcfilter2.sh outside of JGI systems.

Warning: Very large file. Only download if you need full RQCFilter functionality.

📍 File Locations

NERSC Portal: https://portal.nersc.gov/dna/microbial/assembly/bushnell/

SourceForge: https://sourceforge.net/projects/bbmap/files/Resources/

Silva License: Some ribosomal data is derived from Silva. See SILVA_LICENSE.txt for details.

Need help? Have questions about which resources to use?

Ask on GitHub