Showing all 254 tools

📌 Universal BBTools Parameters

Most BBTools share common parameters for memory, threading, I/O, and more.

View Universal Parameters Guide →

🎯 Core Alignment & Mapping Tools

BBMap
Splice-aware read alignment for DNA/RNA sequencing data. Handles very long indels in Illumina-length reads.
BBMapSkimmer
Alignment tool for finding and retaining multiple mapping sites in long PacBio reads for repetitive sequence analysis.
BBWrap
Batch alignment tool that maps multiple read files to the same reference without redundant index reloading.
MapPacBio
BBMap variant for aligning noisy PacBio long reads, tolerating sequencing error rates of 10-15%.
BBRealign
Realigns mapped reads to improve alignment accuracy for variant detection and assembly quality.
Seal
Alignment-free sequence assignment using k-mer matching. Useful for RNA-seq, contamination detection, abundance profiles, demultiplexing.

🔬 Alignment Algorithms

BandedAligner
Pairwise sequence alignment tool that aligns query sequences to reference sequences, outputting identity and alignment positions.
BandedPlusAligner
Tool for calculating exact Average Nucleotide Identity (ANI) between two sequences.
CrossCutAligner
Sequence alignment tool that calculates exact identity scores by processing alignment matrices using antidiagonal techniques.
DriftingAligner
Aligns DNA sequences using an adaptive dynamic programming algorithm that calculates sequence identity and alignment coordinates.
DriftingPlusAligner
Sequence alignment tool that aligns sequences using an adaptive banded dynamic programming algorithm for high-identity comparisons.
GlocalAligner
Aligns query sequences to reference sequences by exploring entire alignment matrices and outputting alignment statistics.
MicroAlign
Aligns reads to small, single-contig references like PhiX, plasmids, or viral genomes using specialized indexing.
MSA
Aligns query sequences to references, identifying the best match position for each reference sequence.
QuabbleAligner
Sequence alignment tool that calculates nucleotide identity and alignment positions without full matrix traceback.
QuantumAligner
Pairwise sequence alignment tool that uses sparse matrix exploration to determine identity, start, and stop locations.
WavefrontAligner
Java implementation of the Wavefront Aligner (Marco-Sola et al.) for sequence alignment visualization and educational use.
WobbleAligner
Performs sequence alignment for analysis of high-identity genomic sequences.
WobblePlusAligner
Aligns nucleotide sequences with high identity, calculating match statistics and supporting performance benchmarking.

🔍 Quality Control & Preprocessing

BBDuk
Quality trimming, adapter removal, and contamination filtering for sequencing reads using k-mer matching.
BBMerge
Merges paired sequencing reads by detecting overlapping regions, with read joining, quality trimming, and adapter removal capabilities.
BBMerge-Auto
Automates memory management for read merging and error correction, supporting Tadpole-based read extension and kmer-based error correction without manual memory tuning.
Clumpify
Sequence preprocessing tool that groups similar reads together for compression, error correction, and optical duplicate removal.
Dedupe
Removes redundant sequences by identifying exact, near, and partial matches across input files.
Dedupe2
Removes duplicate sequences with clustering and overlap detection, supporting unlimited kmer prefix/suffix mapping for duplicate identification.
DedupeByMapping
Removes duplicate mapped reads by identifying and filtering redundant reads based on their mapping coordinates.
Repair
Restores paired-end sequencing reads after processing that may have disrupted read pairing, separating properly paired reads from singletons.
ReadQC
Quality control pipeline for sequencing data that generates HTML reports for evaluating FASTQ file quality.
RQCFilter
Sequencing read quality control pipeline that removes adapters, filters contaminants, trims low-quality bases, and detects microbial/vertebrate sequences.
RQCFilter2
Sequencing data quality control pipeline that removes technical artifacts, contaminants, and low-quality reads to prepare raw sequencing data for downstream analysis.

🧬 Assembly & Error Correction

Tadpole
Sequence analysis tool that uses k-mer frequency to assemble contigs, extend sequences, and error-correct sequencing reads.
BBNorm
Normalizes sequencing read depths by identifying and filtering k-mers to produce a consistent representation of genomic or metagenomic data.
BBCMS
Error corrects reads and/or filters by depth, storing kmer counts in a count-min sketch (a Bloom filter variant). This uses a fixed amount of memory. The error-correction algorithm is taken from T...
TadPipe
Multi-stage assembly pipeline that preprocesses sequencing reads through seven phases for genome assembly using long kmers and error correction.
TadWrapper
Automates genome assembly by systematically testing multiple kmer lengths to generate assemblies.
Consensus
Generates consensus sequences by integrating evidence from aligned reads for assembly polishing, ribosomal sequence reconstruction, and error correction.

📊 Statistics & Analysis

Stats
Calculates assembly statistics like scaffold count, N50/L50, GC content, and gap percentage with constant 120MB memory usage.
Stats3
In progress. Generates some assembly stats for multiple files.
StatsWrapper
Runs stats.sh on multiple assemblies to produce one output line per file.
BBStats
BBTools-specific statistics and analysis tool.
ReadLength
Generates read length statistics for sequencing datasets, including distribution, median, mode, and variance for quality assessment.
CountGC
Analyzes nucleotide composition by calculating GC content across FASTA and FASTQ files, supporting multiple output formats.
CountDuplicates
Identifies and quantifies duplicate sequences in sequencing data to detect library preparation issues and track read redundancy.
TetramerFreq
Analyzes DNA sequence composition using sliding windows across genomes, counting and normalizing DNA k-mer frequencies.
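
As a rough illustration of the assembly statistics listed for Stats and CountGC above, the sketch below computes N50/L50 and GC fraction from plain Python values. It is not the stats.sh implementation, and the function names are hypothetical. It assumes the common convention in which N50 is a length and L50 is a count; tools differ on which label means which, so check the tool's own output description.

```python
def n50_l50(lengths):
    """Return (N50, L50) for a list of scaffold lengths.

    Assumed convention: N50 is the length of the scaffold at which the
    running total (largest first) reaches half the assembly size, and
    L50 is how many scaffolds it took to get there.
    """
    if not lengths:
        raise ValueError("need at least one scaffold length")
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        if running >= half:
            return length, count


def gc_fraction(sequence):
    """GC fraction over unambiguous bases only (N and other symbols are ignored)."""
    seq = sequence.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    return gc / (gc + at) if gc + at else 0.0
```

For example, `n50_l50([10, 8, 6, 4, 2])` walks the sorted lengths until the running total reaches 15 of 30 bases, which happens at the second scaffold.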

🔤 Kmer Tools & Analysis

KHist
Generates kmer depth histograms from sequencing data, showing genome complexity, error rates, and unique sequence distributions.
KmerCountExact
Kmer counting tool for generating frequency histograms and genome size estimates by analyzing unique k-length subsequences.
KmerCountMulti
Estimates sequence complexity by counting unique kmers simultaneously across multiple lengths to quantify genetic diversity and sequence composition.
KmerCoverage
Annotates reads with their kmer depth to quantify sequence coverage and filter low-quality reads based on kmer frequency and base quality.
KmerFilterSet
Creates compact kmer filter sets that guarantee every input sequence contains at least one kmer for sequence filtering and matching.
KmerLimit
Generates a representative subset of sequencing reads by limiting output to a specific number of unique kmers for diverse sampling.
KmerLimit2
Uniformly subsamples sequencing reads to achieve a target number of unique k-mers for consistent dataset representation.
KmerPosition
Analyzes sequencing reads to identify positions of reference sequence kmers, showing positional biases, contamination patterns, and sequence enrichment.
CommonKmers
Analyzes sequence composition by identifying and ranking the most frequent short DNA/RNA sequence motifs for genomic patterns and quality assessment.
KCompress
Compresses sequence data into unique kmers for memory-efficient analysis by filtering, counting, and assembling sequence fragments.
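
The kmer depth histograms described for KHist and KmerCountExact can be pictured with a small in-memory sketch. This is an illustration of the general idea only, not how the BBTools programs count kmers: the function names are hypothetical, and a plain `Counter` is practical only for small inputs. Each kmer is stored in canonical form (the lexicographically smaller of the kmer and its reverse complement), which is an assumption made here so that both strands count together.

```python
from collections import Counter

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical(kmer):
    """Return the smaller of a kmer and its reverse complement."""
    reverse_complement = kmer.translate(_COMPLEMENT)[::-1]
    return min(kmer, reverse_complement)


def count_kmers(reads, k):
    """Count canonical kmers across reads, skipping kmers with non-ACGT symbols."""
    counts = Counter()
    for read in reads:
        read = read.upper()
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            if set(kmer) <= set("ACGT"):
                counts[canonical(kmer)] += 1
    return counts


def depth_histogram(counts):
    """Map each depth to the number of distinct kmers observed at that depth."""
    return Counter(counts.values())
```

In a histogram built this way, distinct kmers that appear only once or twice in deep sequencing data are often errors, while the main peak tracks the sequencing depth of the genome.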

🦠 Taxonomic Analysis & Classification

SendSketch
Identifies unknown sequences by comparing them to online reference databases across multiple taxonomic domains for taxonomic characterization.
Sketch
Creates compact, taxonomically-aware sequence representations using the MinHash algorithm for genomic comparison and sequence classification.
BBSketch
Generates compact MinHash genomic sketches for comparing genome sequences, similarity analysis, taxonomic identification, and contamination detection.
CompareSketch
Compares genomic sequences across multiple samples using MinHash-based k-mer matching for genome similarity assessment, taxonomic annotation, and contamination detection.
MergeSketch
Combines multiple genomic or proteomic sketches into a single sketch for union of k-mer sets across multiple sequence datasets.
SubSketch
Reduces genome sketches to a smaller size while preserving key genomic information for computational analysis and storage.
SketchBlacklist
Identifies and creates a sketch-based blacklist of kmers that frequently occur across multiple sequences or taxa for genomic filtering and contamination detection.
SketchBlacklist2
Generates genomic blacklist sketches by identifying commonly occurring k-mers across multiple sequence sketches for filtering repetitive or non-specific sequences.
AnalyzeSketchResults
Analyzes genome sketch results to generate taxonomic assessments, computing identity metrics, accuracy statistics, and correlations across taxonomic levels.
SummarizeSketch
Summarizes BBSketch results, transforming sketch comparisons into concise reports with taxonomic insights and contamination detection.
Taxonomy
Identifies and classifies organisms by name, ID, or sequence header using NCBI taxonomy databases, supporting multi-format input and taxonomic resolution.
TaxServer
HTTP server that translates NCBI taxonomy and provides sketch-based sequence comparison services for taxonomic identification and phylogenetic analysis.
TaxSize
Quantifies sequence diversity and abundance across taxonomic groups by calculating sequence lengths and counts for each taxonomic node.
TaxTree
Converts NCBI taxonomy files into a compact binary tree for taxonomic queries and hierarchical analysis across BBTools programs.
GI2TaxID
Annotates biological sequences by converting sequence headers with GI numbers, accessions, or organism names to their corresponding NCBI Taxonomy IDs for FASTA and GFF files.
GI2Ancestors
Traces taxonomic lineages by finding common ancestors for biological sequences using NCBI GI numbers for phylogenetic and evolutionary analysis.
GITable
Creates a memory-efficient lookup table for converting legacy NCBI GI numbers to taxonomy IDs for research reproducibility and legacy workflows.
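
Several tools in this group are described as MinHash-based. The sketch below shows the generic bottom-k MinHash technique: keep the smallest kmer hashes as a sketch, then estimate Jaccard similarity from two sketches. It is a minimal illustration, not the BBSketch implementation; the kmer length, sketch size, BLAKE2b hash, and all function names are assumptions chosen for the example.

```python
import hashlib


def kmer_hashes(sequence, k):
    """Hash every kmer of a sequence to a 64-bit integer."""
    seq = sequence.upper()
    return {
        int.from_bytes(
            hashlib.blake2b(seq[i:i + k].encode(), digest_size=8).digest(), "big"
        )
        for i in range(len(seq) - k + 1)
    }


def sketch(sequence, k=5, size=64):
    """Bottom-k sketch: the `size` smallest kmer hashes, sorted."""
    return sorted(kmer_hashes(sequence, k))[:size]


def jaccard_estimate(sketch_a, sketch_b, size=64):
    """Estimate Jaccard similarity from two bottom-k sketches."""
    union = sorted(set(sketch_a) | set(sketch_b))[:size]
    if not union:
        return 0.0
    shared = set(sketch_a) & set(sketch_b)
    return sum(1 for h in union if h in shared) / len(union)
```

Because only a fixed number of hashes is kept per sequence, comparing two sketches costs the same regardless of genome size, which is what makes sketch-based comparison against large reference collections feasible.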

🧹 Contamination Detection & Removal

Decontaminate
Decontaminates multiplexed assemblies via normalization and mapping.
RemoveHuman
Removes all reads that map to the human genome with at least 95% identity after quality trimming. Removes approximately 98.6% of human 2x150bp reads, with zero false-positives to non-animals.
RemoveHuman2
Removes all reads that map to the human genome with at least 88% identity after quality trimming. This is more aggressive than removehuman.sh and uses an unmasked human genome reference. It removes...
RemoveCatDogMouseHuman
Removes all reads that map to the cat, dog, mouse, or human genome with at least 95% identity after quality trimming. Removes approximately 98.6% of human 2x150bp reads, with zero false-positives t...
RemoveMicrobes
Removes all reads that map to selected common microbial contaminant genomes. Removes approximately 98.5% of common contaminant reads, with zero false-positives to non-bacteria. NOTE! This program u...
CrossContaminate
Generates synthetic cross-contaminated files from clean files. Intended for use with synthetic reads generated by SynthMDA or RandomReads.
SummarizeContam
Summarizes monthly contam files into a single file. This is for internal JGI use.

🔄 File Format Conversion & Manipulation

Reformat
Converts, trims, filters, and processes sequencing reads across multiple formats (FASTQ, FASTA, SAM, BAM) with quality control, format conversion, and sampling options.
ReformatPB
Processes PacBio sequencing data with Zero Mode Waveguide (ZMW) awareness for filtering, sampling, and quality control of single-molecule real-time sequencing reads.
Cat
Concatenates multiple genomic sequence files into a single output, supporting various formats and compression levels.
Unzip
File compression and decompression utility for multiple formats including gzip, bzip2, and other compressed files.
VCF2GFF
Converts genomic variant data between VCF and GFF3 formats for genome annotation, comparative genomics, and data visualization.
GBFF2GFF
Converts GenBank flat files (GBFF) to GFF3 format, extracting genomic feature annotations like CDS, rRNA, tRNA, and gene coordinates for genome analysis and visualization.
Phylip2FASTA
Converts interleaved PHYLIP sequence files to standard FASTA format for phylogenetic data analysis.
CG2Illumina
Converts BGI/Complete Genomics read headers to standard Illumina format for compatibility with bioinformatics pipelines that expect Illumina-style sequencing read headers.
MatrixToColumns
Transforms paired identity matrices into a two-column format for correlation analysis, visualization, and comparison of matrix entries across different datasets.
TextFile
Extracts and processes line ranges from text files, supporting file, stdin, and compressed input sources.

🎛️ Filtering Tools

FilterByName
Filters sequence reads by name for data subset extraction, contamination removal, and quality control across FASTA, FASTQ, and SAM formats with configurable matching strategies.
FilterBySequence
Filters DNA/RNA sequences by including or excluding sequences based on exact or approximate matches to reference sequences, for contamination removal and target enrichment.
FilterByTaxa
Filters DNA/RNA sequences by taxonomic criteria to extract or exclude sequences from specific organisms, taxonomic levels, or groups using NCBI taxonomy identifiers.
FilterByTile
Removes low-quality sequencing data from Illumina flowcells by assessing micro-tile regions across multiple quality metrics to filter out unreliable reads before downstream analysis.
FilterByCoverage
Removes low-quality, poorly assembled, or contaminated contigs from genome assemblies by filtering based on coverage depth, read support, and coverage consistency.
FilterBarcodes
Filters and validates multiplexed sequencing reads by barcode quality, removing low-quality samples to prevent downstream analysis errors, and generates quality assessment metrics.
FilterLines
Filters text files by including or excluding lines based on matching criteria for text data extraction and cleanup across various scientific and data processing workflows.
FilterQC
Sequencing data preprocessing pipeline that removes adapters, contaminants, and low-quality reads to prepare raw sequencing data for downstream genomic analysis.
FilterSAM
Removes unreliable reads with unsupported variants from SAM/BAM files by filtering out likely sequencing errors using variant evidence.
FilterSilva
Cleans Silva database sequences by removing misclassified bacteria, eukaryotic organellar sequences, and taxonomically ambiguous entries to improve microbial sequence dataset accuracy.
FilterSubs
SAM/BAM file filtering tool for identifying reads with substitution errors to diagnose sequencing quality score calibration issues and detect base-calling anomalies.
FilterVCF
Filters variant call files (VCFs) by position, type, and quality to isolate high-confidence genomic variants for downstream genetic analysis and research.
FilterAssemblySummary
Filters NCBI assembly summaries by taxonomy using taxonomic trees and ID-based filtering across multiple taxonomic levels and hierarchical promotion.
NetFilter
Scores sequences using a neural network. Multithreaded with filtering options for sequence classification.
PolyFilter
Removes sequencing artifacts by detecting and filtering artificial homopolymers using analysis of read entropy, depth, quality, and polymeric content to improve read quality for downstream genomic analysis.
PostFilter
Filters genome assemblies by removing low-quality, suspicious contigs through coverage analysis, reducing misassembly rates and improving assembly reliability.
EstherFilter
BLASTs queries against reference, and filters out hits with scores less than 'cutoff'. The score is taken from column 12 of the BLAST output. The specific BLAST command is: blastall -p blastn -i ...

🧪 Variant Calling & Analysis

CallVariants
Genetic variant detection tool that identifies single nucleotide variants, insertions, and deletions from aligned sequencing reads, with filtering and multi-sample support.
CallVariants2
Multi-sample variant discovery tool for processing multiple genome samples simultaneously, generating population-level variant calling with independent variant detection and unified genotype reporting.
ApplyVariants
Mutates reference genomes by applying genetic variants, creating consensus sequences while resolving variant interactions and filtering by genomic coverage.
CompareVCF
Identifies unique, common, and shared genetic variants across multiple VCF files through set operations like subtraction, union, and intersection for comparative genomic analysis.
Pileup
Calculates coverage statistics for genomic data to analyze read depth, distribution, and mapping characteristics across scaffolds and individual bases in SAM/BAM files.
Pileup2
Multi-threaded coverage analysis tool for simultaneous processing of multiple SAM/BAM files, supporting genomic data exploration across various sequencing experiments.
CalcTrueQuality
Calculates actual base-call accuracy from mapped sequencing reads and generates quality score recalibration matrices to improve confidence in sequencing data quality.

🔧 SAM/BAM Processing

SplitSAM
Separates SAM alignment files into three files: plus-mapped reads, minus-mapped reads, and unmapped reads for strand-specific downstream analysis.
SplitSAM4Way
Categorizes SAM alignment file reads into plus-strand, minus-strand, chimeric, and unmapped reads for mapping analysis and downstream processing.
SplitSAM6Way
Categorizes paired-end sequencing reads by mapping status and strand orientation for genomic analysis and library quality assessment.
MergeSAM
Concatenates multiple SAM alignment files into a single file, preserving the first file's header while handling genomic data merging scenarios.
StreamSAM
Converts SAM/BAM to FASTQ with multi-threaded filtering for read extraction through selection criteria.
SAMToROC
Generates Receiver Operating Characteristic (ROC) curves for mapping accuracy by analyzing synthetic read alignments, providing performance metrics for genomic mapping tools.
GradeSAM
Validates mapping accuracy of synthetic read alignments by comparing predicted positions against known true positions to quantify mapper performance across different genomic mapping tools.

✂️ Sequence Manipulation

BBSplit
Maps reads to multiple reference sequences, separating and assigning reads to specific references while handling ambiguous mappings.
SplitByTaxa
Separates biological sequences into distinct files based on their taxonomic classifications to isolate and analyze specific taxonomic groups from metagenomic datasets.
SplitNextera
Separates Nextera long mate pair (LMP) sequencing libraries into distinct read categories: long mate pairs, fragments, unknown pairs, and singletons by detecting and processing adapter sequences.
SplitRibo
Separates mixed ribosomal RNA sequences into distinct files by type (16S, 18S, 5S, 23S) from databases for taxonomic and phylogenetic analysis.
BBSplitPairs
Filters paired-end sequencing reads by length and optionally quality, separating valid read pairs from singletons and discarding reads too short to be useful after trimming.
Shred
Breaks large genomic sequences into smaller, potentially overlapping fragments for downstream analysis, supporting length distributions and quality score handling.
Mutate
Generates mutant genome variants with control over substitution, insertion, and deletion rates for testing, simulation, and variant caller evaluation.
KMutate
Generates kmer variant sets with controlled mutations for sequence analysis and filtering of barcodes, oligos, and genomic studies.
CutPrimers
Extracts specific genomic regions between primer sequences in sequencing data for amplicon analysis by removing or preserving primer-flanked sequences from PCR products or metagenomic samples.
CutGFF
Extracts, filters, and validates specific genomic features from genome files using GFF annotation files for sequence selection with quality control and taxonomic integration.
TrimContigs
Removes low-coverage regions from genome assemblies, breaking or trimming contigs to retain only well-supported genomic sequences with high confidence.
Translate6Frames
Translates nucleotide sequences into all 6 possible protein reading frames, or converts amino acids back to canonical nucleotides, supporting both single and paired-end sequencing data formats.
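
The six-frame translation that Translate6Frames is described as performing can be sketched as three offsets on each strand. The example below is a minimal illustration under stated assumptions, not the tool's code: it uses the standard genetic code only, writes stop codons as `*`, writes codons containing ambiguous bases as `X`, and uses hypothetical function names and frame labels.

```python
from itertools import product

# Standard genetic code, with codons enumerated in TCAG order.
_BASES = "TCAG"
_AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(_BASES, repeat=3), _AMINO)
}
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(sequence):
    """Reverse-complement a DNA sequence (non-ACGT symbols pass through)."""
    return sequence.upper().translate(_COMPLEMENT)[::-1]


def translate(sequence):
    """Translate complete codons from position 0; trailing bases are dropped."""
    seq = sequence.upper()
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3)
    )


def six_frames(sequence):
    """Return translations keyed '+1'..'+3' (forward) and '-1'..'-3' (reverse)."""
    frames = {}
    for strand, seq in (("+", sequence.upper()), ("-", reverse_complement(sequence))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames
```

Calling `six_frames("ATGGCC")` translates `ATGGCC` and its reverse complement `GGCCAT`, each at offsets 0, 1, and 2.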

📋 Sorting & Organization

BBSort
Read sorting tool that organizes sequencing data by name, length, quality, position, or taxonomy, with memory management for large datasets.
SortByName
Read sorting utility that organizes sequencing data by multiple criteria (name, length, quality, position) while handling large datasets with memory management.
MergeSorted
Merges and sorts partial sorting results across multiple temporary files, supporting sorting strategies for genomic data processing when initial sorting was interrupted.
Shuffle
Randomly reorders sequence reads while preserving read pairing for reproducible sampling and supporting sorting strategies like name, coordinate, or sequence-based organization.
Shuffle2
Randomly reorders sequencing reads while preserving paired-end relationships, with external memory management for handling large datasets beyond RAM limitations.
Partition
Splits a sequence file evenly into multiple files.

🎭 Read Simulation & Generation

RandomReads
Generates synthetic genomic reads for benchmarking bioinformatics tools, testing analysis pipelines, and simulating sequencing scenarios like single-cell, metagenomic, and long-read datasets.
RandomReadsMG
Generates synthetic metagenomic sequencing reads with coverage, error profiles, and taxonomic diversity to benchmark and validate bioinformatics analysis tools across multiple sequencing platforms.
RandomGenome
Generates random genome or protein sequences for testing bioinformatics tools, benchmarking algorithms, and creating controlled datasets for scientific research and method development.
BBFakeReads
Generates synthetic read pairs by extracting ends from input contigs or sequences for simulation of mate-pair or paired-end sequencing libraries for testing and validation purposes.
MakeChimeras
Generates synthetic chimeric sequences from nonchimeric reads to create controlled test datasets for validating sequencing analysis tools, especially for PacBio read processing.
MakeContaminatedGenomes
Generates synthetic chimeric genomes with contamination to create test datasets for bioinformatics tools, metagenomics research, and horizontal gene transfer simulations.
MakePolymers
Generates synthetic genomic sequences with enumerated k-mers for tool testing, algorithm validation, and low-complexity sequence creation for bioinformatics research.
SynthMDA
Simulates single-cell genomic sequencing data by generating synthetic reads with the uneven coverage distribution caused by Multiple Displacement Amplification (MDA) techniques.

🏷️ Barcode & Demultiplexing

DemuxByName
Demultiplexes sequencing reads into multiple output files by parsing read names, barcodes, tiles, or headers for sample separation in high-throughput sequencing data.
MuxByName
Combines multiple sequencing files into a single file by adding source-specific prefixes to read names, preserving read metadata while consolidating multiple samples or datasets.
CountBarcodes
Identifies, counts, and validates unique barcodes in sequencing reads, providing quality control metrics for demultiplexing and experimental integrity verification.
CountBarcodes2
Analyzes and quantifies barcode frequencies in sequencing reads for sample tracking, cross-contamination detection, and error-tolerant barcode assignment across sequencing workflows.
MergeBarcodes
Concatenates barcodes and quality onto read names for demultiplexing workflows.
RemoveBadBarcodes
Filters sequencing reads by removing entries with invalid barcode characters for clean data in downstream bioinformatics processing.
RemoveSmartbell
Removes Smart Bell adapters from PacBio sequencing reads using alignment algorithms to detect and split or mask adapter sequences while preserving read quality.
NovaDemux
Demultiplexes sequencer reads into multiple files based on barcodes, using statistical analysis to accurately assign reads to sample libraries while minimizing crosstalk and errors in sequencing data.
DemuxServer
Starts a multi-threaded HTTP server for probabilistic barcode demultiplexing using maximum likelihood algorithms for genomic data assignment.

🏷️ Renaming & Header Manipulation

Rename
Renames reads to <prefix>_<number>, where you specify the prefix and the numbers are ordered. Supports multiple renaming modes including coordinate-based, insert-size based, and custom trimming oper...
BBRename
bbrename.sh is an alias for rename.sh. This tool renames reads to the <prefix>_<number> format or applies other renaming modes.
RenameByMapping
Renames genomic contigs by appending coverage and taxonomic information from SAM/BAM mapping files for annotation of metagenome assemblies in downstream bioinformatics analysis.
RenameBySketch
Renames genome assembly or metagenome files with their taxonomic ID for file organization and metagenome binning validation using MinHash sketches.
RenameIMG
Renames Integrated Microbial Genomes (IMG) sequence records by prefixing headers with taxonomic and IMG identifiers for sequence tracking and taxonomic annotation in genomic datasets.
RenameRef
Converts reference sequence names across genomics file formats (SAM, BAM, FASTA, VCF, GFF) for standardization of genome reference naming conventions between different databases and research contexts.
ReplaceHeaders
Replaces sequence read headers in FASTA/FASTQ files for header management across bioinformatics workflows like renaming, tracking, or standardizing sequence metadata.

🔬 Specialized Analysis Tools

CallGenes
Finds ORFs and calls genes in unspliced prokaryotes. This includes bacteria, archaea, viruses, and mitochondria. Can also predict 16S, 18S, 23S, 5S, and tRNAs.
CallPeaks
Calls peaks from a 2-column (x, y) tab-delimited histogram. Designed primarily for analyzing k-mer frequency histograms to estimate genome characteristics including size, ploidy, heterozygosity rat...
CheckStrand
Estimates RNA-seq library strandedness by analyzing k-mers, stop codons, poly-A tails, and optional reference sequences without requiring full read alignment.
Consect
Generates the conservative consensus of multiple error-correction tools. Corrections will be accepted only if all tools agree. This tool is designed for substitutions only, not indel corrections.
FindRepeats
Identifies and characterizes repetitive genomic sequences using k-mer analysis for detection of duplications, transposons, and structural variations without complex alignment.
IceCreamFinder
Finds PacBio reads containing inverted repeats. These are candidate triangle reads (ice cream cones). Either ice cream cones only, or all inverted repeats, can be filtered.
IceCreamGrader
Counts the rate of triangle reads in a file generated by IceCreamMaker with custom headers.
IceCreamMaker
Generates synthetic PacBio reads to mimic the chimeric inverted repeats from 'triangle reads', aka 'ice cream cones' - reads missing one adapter.
LilyPad
Uses mapped paired reads to generate scaffolds from contigs. Designed for use with ordinary paired-end Illumina libraries.
BBCRISPRFinder
Finds interspersed repeats contained within sequences; specifically, only information within a sequence is used. This is based on the repeat-spacer model of CRISPRs. Designed for short reads, but...

🎭 Masking & Modification

BBMask
BBMask masks low-complexity and repetitive sequences in genomic data using entropy, repeat detection, and coverage-based strategies for downstream bioinformatics analysis.
AdjustHomopolymers
Corrects and simulates sequencing errors by expanding or shrinking homopolymer runs in DNA sequences for assembly and error correction in genomic data.
IndelFree
A sequence alignment tool for small query sets against large references, specializing in exact matching without insertions or deletions (indels) for tasks like CRISPR spacer mapping and primer validation.
FixGaps
Corrects scaffold assembly gaps by using paired read insert size information to estimate and resize incorrectly-sized N-character regions for genome assembly accuracy.

🧬 Ribosomal RNA Tools

AddSSU
Adds, removes, or replaces SSU sequence of existing sketches. Sketches and SSU FASTA files must be annotated with TaxIDs.
MergeRibo
Consolidates multiple SSU (16S/18S) rRNA sequences from different sources into a single, representative sequence per taxonomic ID for non-redundant taxonomic reference databases in microbial and ecological research.
CompareSSU
Compares SSU ribosomal RNA sequences across taxonomic levels, revealing evolutionary relationships and sequence identities using multi-threaded sequence alignment.
ReduceSilva
Simplifies large taxonomic databases by reducing Silva sequence entries to a single representative per taxonomic group for phylogenetic analysis and reference database creation.

🗄️ Database Tools

FetchProks
Writes a shell script to download one genome assembly and GFF per genus or species from NCBI. Attempts to select the best assembly on the basis of contiguity.
AnalyzeAccession
Analyzes storage of taxonomic mapping files by examining accession identifier patterns and calculating compression strategies.
ShrinkAccession
Reduces size and improves loading speed of large taxonomic mapping files by removing unnecessary columns from accession2taxid tables for storage and processing of genomic identifier databases.
FungalRelease
Reformats a fungal assembly for release. Also creates contig and AGP files.
CladeLoader
Loads FASTA files with TID-labeled contigs to produce Clade record output with kmer frequencies and taxonomic analysis.
ExplodeTree
Constructs a directory and file tree of sequences corresponding to a taxonomic tree.

📏 Coverage & Depth Analysis

SummarizeCoverage
Summarizes coverage information from basecov files created by pileup.sh. They should be named like 'sample1_basecov.txt' but other naming styles are fine too.
TileDump
Tile processing tool for Illumina sequencing data that filters, analyzes, and refines flow cell tiles for downstream genomic analysis accuracy by detecting and removing low-quality or problematic sequencing tiles.

📈 Plotting & Visualization

PlotFlowcell
Identifies and analyzes low-quality regions in sequencing flowcells, filtering problematic reads to improve overall data quality and research reliability.
PlotGC
Analyzes genomic sequences by calculating GC content across fixed-length intervals, allowing researchers to identify compositional variations, sequence quality, and potential genomic anomalies.
PlotHist
Generates detailed histograms from numeric data files, converting multi-column datasets into individual frequency distribution files for statistical analysis and visualization.
PlotReadPosition
Extracts and analyzes positional and barcode data from Illumina sequencing reads, helping researchers detect spatial biases, validate barcode quality, and diagnose potential sequencing errors.
VisualizeAlignment
Transforms complex text-based alignment exploration maps from bioinformatics aligners into visual bitmap images, allowing researchers to quickly interpret and analyze alignment scoring and patterns.

🔗 Hi-C & Special Formats

ProcessHi-C
Identifies and trims junction sites in Hi-C mapped reads, characterizing chromatin interaction breakpoints and junction motifs for genomic structural analysis.
ProcessFrag
Transforms raw BBMerge script output into a standardized, tab-delimited format for data analysis, facilitating performance metrics extraction from bioinformatics tool comparisons.

โš–๏ธ Comparison & Analysis

CompareGFF
Compare gene prediction files to evaluate annotation accuracy for CDS, rRNA, and tRNA features, calculating true/false positives and providing statistical analysis of genomic annotation quality.
CompareLabels
Compares delimited labels in read headers to count how many match. The 'unknown' label is a special case. The original goal was to measure the differences between demultiplexing methods. Labels c...
CountSharedLines
Quantifies shared lines between text file sets, allowing content comparison, data validation, and set intersection analysis across genomics, research, and text processing domains.
IDMatrix
Generates an all-to-all sequence similarity matrix, revealing genetic relationships and divergence across multiple sequences with configurable alignment parameters.
IDTree
Converts sequence identity matrices into phylogenetic trees, allowing researchers to visualize and analyze evolutionary relationships between sequences through hierarchical clustering.
AllToAll
Computes sequence similarity by generating a complete identity matrix that reveals pairwise relationships across all input sequences, allowing comparative genomic analysis.
AlignRandom
Calculates random DNA sequence alignment identity distributions to establish statistical null models for comparing biological sequence similarity across different sequence lengths.
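
To make the all-to-all identity matrix described for IDMatrix and AllToAll concrete, here is a minimal Python sketch. It is not how those tools work internally: they use real alignment, whereas this example assumes a much simpler ungapped, position-by-position identity. The function names `identity` and `identity_matrix` are hypothetical.

```python
def identity(a, b):
    """Fraction of matching positions over the longer length (ungapped; a simplification)."""
    if not a and not b:
        return 1.0
    matches = sum(x == y for x, y in zip(a, b))
    return matches / max(len(a), len(b))


def identity_matrix(seqs):
    """Build a symmetric matrix of pairwise identities with 1.0 on the diagonal."""
    n = len(seqs)
    matrix = [[1.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            # Each pair is compared once and mirrored across the diagonal.
            matrix[i][j] = matrix[j][i] = identity(seqs[i], seqs[j])
    return matrix


if __name__ == "__main__":
    for row in identity_matrix(["ACGT", "ACGA", "TTTT"]):
        print("\t".join(f"{value:.2f}" for value in row))
```

A matrix like this is the kind of input that a tree-building step such as IDTree would then cluster.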

📦 Binning & Clustering

QuickBin
Bins contigs using coverage and kmer frequencies for metagenome assembly analysis. Supports multiple sam files for improved accuracy and uses neural networks for binning decisions.
MakeQuickBinVector
Generates machine learning training vectors for QuickBin genomic binning by extracting and comparing contig features like tetranucleotide frequencies, coverage depths, and GC content to classify metagenomic contigs.
GradeBins
Systematically assesses metagenome bin quality by calculating completeness and contamination, providing a standardized multi-tier ranking system for genomic bin reliability across different taxonomic groups.
CrossBlock
CrossBlock is an alias for decontaminate.sh - a tool for removing contaminants and normalizing coverage by cross-blocking.
SummarizeCrossBlock
Summarizes CrossBlock results. Used for testing and validating CrossBlock.
QuickClade
Assigns taxonomy to query sequences by comparing kmer frequencies to those in a reference database. Developed for taxonomic assignment of metagenomic bins, but it can also run on a per-sequence bas...
Representative
Condenses large genomic datasets by generating a minimal representative set of taxa, selecting centroids that capture the diversity of all input sequences with customizable filtering.

🔀 Merging Tools

GradeMerge
Evaluates the accuracy of read merging by comparing merged synthetic reads against their known insert sizes, helping researchers validate their read merging algorithms and quality.
SummarizeMerge
Summarizes the output of GradeMerge for comparing read-merging performance.
MergeOTUs
Consolidates coverage statistics for identical Operational Taxonomic Units (OTUs), supporting metagenomic analysis by merging fragmented sequence data into unified taxonomic summaries.
MergePGM
Merges .pgm files used by prokaryotic gene calling tools. Supports weighted merging with normalization and multiplier options.
TagAndMerge
Consolidates and standardizes sequencing reads from demultiplexed samples by extracting barcodes, merging files, and preparing data for downstream genomic analysis and method comparison.

🔗 Adapter & Primer Tools

AddAdapters
Simulates adapter contamination in sequencing reads for benchmarking adapter removal tools and evaluating bioinformatics trimming methods.

🌸 Bloom Filter Tools

BloomFilter
Filters sequencing reads by matching k-mers against a reference, allowing contamination removal, depth filtering, and error correction with memory-efficient probabilistic matching.
BloomFilterParser
Extracts and tabulates detailed metrics from bloomfilter.sh verbose output, supporting reproducibility of bloom filter research results.
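
The "memory-efficient probabilistic matching" in the BloomFilter entry refers to the general Bloom filter data structure, which can be sketched as follows. This is a generic textbook version under stated assumptions, not the code behind bloomfilter.sh: the class layout, the bit-array size, the number of hash functions, and the use of salted BLAKE2b hashes are all illustrative choices.

```python
import hashlib


class BloomFilter:
    """A small Bloom filter: no false negatives, a tunable rate of false positives."""

    def __init__(self, bits=1 << 16, hashes=3):
        self.bits = bits
        self.hashes = hashes
        self.array = bytearray((bits + 7) // 8)

    def _positions(self, item):
        # Derive several independent bit positions by salting one hash function.
        for seed in range(self.hashes):
            digest = hashlib.blake2b(
                item.encode(), digest_size=8, salt=bytes([seed])
            ).digest()
            yield int.from_bytes(digest, "big") % self.bits

    def add(self, item):
        for pos in self._positions(item):
            self.array[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(self.array[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item))


def kmers(seq, k):
    """Yield every overlapping k-mer of seq."""
    return (seq[i:i + k] for i in range(len(seq) - k + 1))


if __name__ == "__main__":
    reference = BloomFilter()
    for kmer in kmers("ACGTACGTTGCA", 5):
        reference.add(kmer)
    print("ACGTA" in reference)  # a stored k-mer is always reported as present
```

A read could then be flagged as matching the reference when one or more of its k-mers is found in the filter; the threshold is a design choice.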

🧪 Testing & Benchmarking

TestAligners
Benchmarks multiple sequence alignment algorithms, comparing performance metrics like speed, accuracy, and computational efficiency across various alignment strategies.
TestAligners2
Benchmarks multiple sequence alignment algorithms by generating random sequences with controlled nucleotide identity to compare performance across diverse evolutionary similarity levels.
TestFilesystem
Benchmarks filesystem performance for scientific computing by measuring I/O speed, metadata operations, and directory listing to optimize storage systems for large-scale data analysis.
TestFormat
Identifies and characterizes bioinformatics file formats, detecting compression, quality encoding, interleaving, and read length across multiple file types.
TestFormat2
Sequence file analyzer that determines file format, quality metrics, base composition, and statistical characteristics to support bioinformatics pipeline planning and quality control.
DiskBench
Measures disk I/O performance through multithreaded read/write tests, allowing researchers and system administrators to quantify storage system capabilities and bottlenecks.
ProcessSpeed
Converts Linux time command output to decimal seconds for performance analysis and benchmarking of computational workflows.
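
The conversion that the ProcessSpeed entry describes, from `time` output to decimal seconds, might look like the sketch below. It assumes the bash-style `real 1m23.456s` layout and also accepts a colon-separated form such as `1:02:03.5`; the exact input formats ProcessSpeed accepts are not stated here, and the function names `to_seconds` and `parse_time_output` are hypothetical.

```python
import re


def to_seconds(stamp):
    """Convert '1m23.456s' or a colon form like '1:02:03.5' to decimal seconds."""
    stamp = stamp.strip()
    match = re.fullmatch(r"(?:(\d+)h)?(?:(\d+)m)?(\d+(?:\.\d+)?)s", stamp)
    if match:
        hours, minutes, seconds = match.groups()
        return int(hours or 0) * 3600 + int(minutes or 0) * 60 + float(seconds)
    # Fall back to colon-separated fields, most significant first.
    total = 0.0
    for part in stamp.split(":"):
        total = total * 60 + float(part)
    return total


def parse_time_output(text):
    """Map the 'real', 'user', and 'sys' lines of time output to seconds."""
    result = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) == 2 and fields[0] in ("real", "user", "sys"):
            result[fields[0]] = to_seconds(fields[1])
    return result


if __name__ == "__main__":
    print(parse_time_output("real\t1m2.5s\nuser\t0m1.25s\nsys\t0m0.5s\n"))
```

Input that matches neither layout raises `ValueError` from `float`, which keeps malformed lines from silently turning into zeros.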

💾 Memory & Resource Tools

CalcMem
Calculates system memory for BBTools scripts, parsing Java memory parameters and detecting available RAM across diverse computational environments.
MemDetect
Automatically detects and allocates Java memory across different computing environments, handling system constraints and supporting HPC schedulers with platform-specific memory estimation.
LogLog
Estimates unique kmer count in sequencing data using the LogLog algorithm, allowing genomic diversity assessment with low memory overhead.
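
For readers unfamiliar with the LogLog algorithm named in the entry above, the following Python sketch shows the basic estimator. It is a plain textbook version, not the BBTools implementation: the hash function (BLAKE2b), the register count, and the function name `loglog_estimate` are assumptions, and the estimate is only reasonable when the number of distinct items is well above the number of registers.

```python
import hashlib


def loglog_estimate(items, p=10):
    """Estimate the number of distinct items using 2**p LogLog registers."""
    m = 1 << p
    width = 64 - p  # bits left over after choosing a register
    registers = [0] * m
    for item in items:
        h = int.from_bytes(hashlib.blake2b(item.encode(), digest_size=8).digest(), "big")
        bucket = h >> width              # top p bits pick the register
        rest = h & ((1 << width) - 1)    # remaining bits
        rank = width - rest.bit_length() + 1  # 1-based position of the first 1-bit
        if rank > registers[bucket]:
            registers[bucket] = rank
    alpha = 0.39701  # asymptotic bias-correction constant for LogLog
    return alpha * m * 2 ** (sum(registers) / m)


if __name__ == "__main__":
    # Synthetic labels stand in for k-mers; every label appears twice.
    data = [f"kmer{i}" for i in range(50000)] * 2
    print(round(loglog_estimate(data)))
```

Memory use depends only on the register count, not on the input size, which is why this family of estimators suits very large k-mer sets. The expected relative error shrinks roughly with the square root of the number of registers.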

🎯 Quality Assessment

KapaStats
Detects cross-contamination between sequencing plate wells by analyzing Kapa adapter sequences, helping researchers identify and quantify unintended molecular tag mixing in genomic libraries.
SummarizeQuast
Consolidates multiple Quast genome assembly reports, allowing comparative statistical analysis and visualization of assembly metrics across different genomic datasets.
SummarizeScafStats
Consolidates BBMap scaffold statistics across multiple sequencing libraries to detect cross-contamination and quantify read mapping across different organism scaffolds.
SummarizeSeal
Processes Seal mapping stats to generate contamination summaries across multiple libraries, allowing cross-organism and cross-sample contamination analysis.
ScoreSequence
Applies neural network scoring to biological sequences, allowing sequence filtering, annotation, and quality assessment using machine learning-based sequence characterization.

🔧 Specialized Utilities

GetReads
Extracts specific reads by their numeric ID from sequencing files, allowing subsampling and selection of reads for downstream genomic analysis.
PickSubset
Selects a diverse subset of genomic files based on pairwise sequence similarity, reducing redundancy in large genomic datasets without using taxonomic information.
LoadReads
Diagnostic tool for measuring memory consumption of sequencing datasets, providing detailed insights into read storage efficiency and performance characteristics for bioinformatics workflows.
KeepBestCopy
Filters ribosomal gene sequences to retain the highest-quality representative copy per taxonomic identifier, improving downstream genomic and metagenomic analyses.
CopyFile
Provides file copying and recompression capabilities for scientific data, allowing format conversion, compression benchmarking, and consistent file processing across bioinformatics workflows.
PrintTime
Provides a lightweight file-based timing utility for measuring execution intervals and performance checkpoints in bioinformatics workflows and shell scripts.
InvertKey
Reconstructs original k-mer sequences from genomic sketch hash values, allowing sequence identification and reverse lookup in genomic datasets.
Unicode2ASCII
Attempts to convert unicode and control characters to printable ASCII, with significant limitations in character mapping and information preservation.
ReduceColumns
Extracts and reduces specific columns from tab-delimited files, allowing data subsetting for machine learning, bioinformatics, and data preprocessing workflows.
SeqToVec
Transforms biological sequences into machine learning-ready vectors using one-hot encoding or k-mer frequency spectra for computational genomics and predictive modeling.
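
The two encodings named in the SeqToVec entry, one-hot encoding and k-mer frequency spectra, can be illustrated with a short Python sketch. The output layout here (A, C, G, T order, all zeros for unknown bases, lexicographic k-mer order, normalized counts) is an assumption made for the example and may differ from what SeqToVec actually emits; the function names are hypothetical.

```python
from itertools import product

BASES = "ACGT"


def one_hot(seq):
    """Encode each base as 4 values in A, C, G, T order; unknown bases become all zeros."""
    vec = []
    for base in seq.upper():
        vec.extend(1.0 if base == b else 0.0 for b in BASES)
    return vec


def kmer_spectrum(seq, k=2):
    """Return normalized counts of every possible k-mer, in lexicographic order."""
    keys = ["".join(p) for p in product(BASES, repeat=k)]
    counts = dict.fromkeys(keys, 0)
    seq = seq.upper()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if kmer in counts:  # skip k-mers containing N or other symbols
            counts[kmer] += 1
    total = sum(counts.values())
    return [counts[key] / total if total else 0.0 for key in keys]


if __name__ == "__main__":
    print(one_hot("AC"))
    print(kmer_spectrum("ACGTAC", k=2))
```

One-hot vectors grow with sequence length, while a k-mer spectrum always has 4**k entries, so the spectrum is the natural choice when inputs vary in length.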

🤖 Machine Learning & AI

Train
Trains and evaluates multi-layer neural networks for binary and multi-class machine learning tasks, supporting custom architectures, activation functions, and adaptive training strategies.
RunHMM
Processes HMMER search output files, extracting and organizing protein hit details by parsing 23 fields per line to generate concise, length-filtered protein summaries.

โš™๏ธ System & Configuration

BBVersion
Version checking utility for BBTools bioinformatics suite, allowing tracking of software version in genomic research pipelines and reproducible computational analyses.
JavaSetup
Configures Java runtime environment for BBTools, managing memory allocation, performance settings, and path optimization across different computing platforms.
WebCheck
Analyzes web server log files to generate performance metrics, tracking response times, status codes, and identifying potential server reliability issues.

🔨 Development & Special Tools

BBEst
Analyzes EST mapping efficiency by processing SAM files, categorizing expressed sequence tag capture across assemblies and quantifying mapping quality with detailed intron and scaffold analysis.
BBCountUnique
Quantifies sequence library complexity by analyzing kmer uniqueness, helping detect PCR duplication, sequencing bias, and improve genomic data collection.
Fuse
Concatenates genomic sequences or paired-end reads into longer fragments, supporting flexible padding and length control for sequence assembly and analysis workflows.
A_Sample_MT
A template for creating multi-threaded read processing tools, providing a robust framework for developing parallel sequence data manipulation applications in BBTools.
AnalyzeGenes
Generates prokaryotic gene models by analyzing fasta and GFF files to identify coding and non-coding gene sequences with multi-type recognition.