All BBTools - Complete Tool List

BBMap

Splice-aware read alignment for DNA/RNA sequencing data. Handles very long indels in Illumina-length reads.

BBMapSkimmer

Alignment tool for finding and retaining multiple mapping sites in long PacBio reads for repetitive sequence analysis.

BBWrap

Batch alignment tool that maps multiple read files to the same reference without redundant index reloading.

MapPacBio

BBMap variant tuned for noisy PacBio long reads, tolerating sequencing error rates of 10-15%.

BBRealign

Realigns mapped reads to improve alignment accuracy for variant detection and assembly quality.

Seal

Alignment-free sequence assignment using k-mer matching. Useful for RNA-seq, contamination detection, abundance profiles, demultiplexing.

BandedAligner

Pairwise sequence alignment tool that aligns query sequences to reference sequences, outputting identity and alignment positions.

BandedPlusAligner

Tool for calculating exact Average Nucleotide Identity (ANI) between two sequences.

CrossCutAligner

Sequence alignment tool that calculates exact identity scores by processing alignment matrices using antidiagonal techniques.

DriftingAligner

Aligns DNA sequences using an adaptive dynamic programming algorithm that calculates sequence identity and alignment coordinates.

DriftingPlusAligner

Sequence alignment tool that aligns sequences using an adaptive banded dynamic programming algorithm for high-identity comparisons.

GlocalAligner

Aligns query sequences to reference sequences by exploring entire alignment matrices and outputting alignment statistics.

MicroAlign

Aligns reads to small, single-contig references like PhiX, plasmids, or viral genomes using specialized indexing.

MSA

Aligns query sequences to references, identifying the best match position for each reference sequence.

Parallelogram

Converts a parallelogram-shaped alignment visualization to a rectangle for processing by visualizealignment.sh.

QuabbleAligner

Sequence alignment tool that calculates nucleotide identity and alignment positions without full matrix traceback.

QuantumAligner

Pairwise sequence alignment tool that uses sparse matrix exploration to determine identity, start, and stop locations.

ScrabbleAligner

Aligns sequences using adaptive banding with dynamic bandwidth adjustment. Calculates exact ANI without traceback, with band center drifting toward highest scores.

SmithWaterman

Aligns a query sequence to a reference using Smith-Waterman algorithm for optimal local alignment.

WavefrontAligner

Java implementation of Wavefront Aligner (Marco-Sola et al.) for sequence alignment visualization and educational use.

WaveFrontAlignerViz

Aligns a query sequence to a reference using WaveFront alignment algorithm and optionally generates state space exploration maps for visualization.

WobbleAligner

Performs sequence alignment for analysis of high-identity genomic sequences.

WobblePlusAligner

Aligns nucleotide sequences with high identity, calculating match statistics and supporting performance benchmarking.

XDropHAligner

Aligns a query sequence to a reference using XDropHAligner. Outputs identity, rstart, and rstop positions. Optionally prints a state space exploration map for visualization.

BBDuk

Quality trimming, adapter removal, and contamination filtering for sequencing reads using k-mer matching.

BBMerge

Merges paired sequencing reads by detecting overlapping regions, with read joining, quality trimming, and adapter removal capabilities.

BBMerge-Auto

Automates memory management for read merging and error correction, supporting Tadpole-based read extension and kmer-based error correction without manual memory tuning.

Clumpify

Sequence preprocessing tool that groups similar reads together for compression, error correction, and optical duplicate removal.

Dedupe

Removes redundant sequences by identifying exact, near, and partial matches across input files.

Dedupe2

Removes duplicate sequences with clustering and overlap detection, supporting unlimited kmer prefix/suffix mapping for duplicate identification.

DedupeByMapping

Removes duplicate mapped reads by identifying and filtering redundant reads based on their mapping coordinates.

Repair

Restores paired-end sequencing reads after processing that may have disrupted read pairing, separating properly paired reads from singletons.

ReadQC

Quality control pipeline for sequencing data that generates HTML reports for evaluating fastq file quality.

RQCFilter

Sequencing read quality control pipeline that removes adapters, filters contaminants, trims low-quality bases, and detects microbial/vertebrate sequences.

RQCFilter2

Sequencing data quality control pipeline that removes technical artifacts, contaminants, and low-quality reads to prepare raw sequencing data for downstream analysis.

Tadpole

Sequence analysis tool that uses k-mer frequency to assemble contigs, extend sequences, and error-correct sequencing reads.

BBNorm

Normalizes sequencing read depths by identifying and filtering k-mers to produce a consistent representation of genomic or metagenomic data.

BBCMS

Error corrects reads and/or filters by depth, storing kmer counts in a count-min sketch (a Bloom filter variant). This uses a fixed amount of memory. The error-correction algorithm is taken from T...
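
As an illustration of the count-min sketch idea mentioned above, here is a minimal Python sketch. It is not BBCMS code: the class name, the `width` and `depth` defaults, and the use of salted BLAKE2b hashes are all assumptions made for the example. It only shows why memory stays fixed and why a reported count can be too high but never too low.

```python
import hashlib


class CountMinSketch:
    """Fixed-memory approximate counter (illustrative, not the BBCMS implementation)."""

    def __init__(self, width=1024, depth=3):
        # Memory is width * depth counters, no matter how many k-mers are added.
        self.width = width
        self.depth = depth
        self.rows = [[0] * width for _ in range(depth)]

    def _slots(self, kmer):
        # One hash per row, made different by salting with the row index.
        for row in range(self.depth):
            digest = hashlib.blake2b(kmer.encode(), salt=bytes([row])).digest()
            yield row, int.from_bytes(digest[:8], "big") % self.width

    def add(self, kmer):
        for row, col in self._slots(kmer):
            self.rows[row][col] += 1

    def count(self, kmer):
        # Collisions can only inflate a counter, so the minimum over the
        # rows is the tightest available estimate.
        return min(self.rows[row][col] for row, col in self._slots(kmer))


def add_read_kmers(sketch, read, k):
    """Add every k-mer of one read to the sketch."""
    for i in range(len(read) - k + 1):
        sketch.add(read[i:i + k])
```

For example, after `add_read_kmers(sketch, "ACGTACGT", 4)` the estimate `sketch.count("ACGT")` is at least 2, because that 4-mer occurs twice in the read.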

TadPipe

Multi-stage assembly pipeline that preprocesses sequencing reads through seven phases for genome assembly using long kmers and error correction.

TadWrapper

Automates genome assembly by systematically testing multiple kmer lengths to generate assemblies.

Reassemble

Assembles multiple genomes individually using Tadpole and concatenates the results. Prevents chimeric contigs from co-assembly, ideal for evaluating metagenomic binning tools with ground-truth labeled datasets.

Consensus

Generates consensus sequences by integrating evidence from aligned reads for assembly polishing, ribosomal sequence reconstruction, and error correction.

Stats

Calculates assembly statistics like scaffold count, N50/L50, GC content, and gap percentage with constant 120MB memory usage.
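
To make the N50/L50 statistic concrete, here is a small Python sketch. It assumes the common convention in which N50 is a length and L50 is a count; individual tools, stats.sh included, may attach the two labels differently, so check the tool's own output legend. The function name is hypothetical.

```python
def n50_l50(lengths):
    """Return (N50, L50) for a list of scaffold or contig lengths.

    Assumed convention: N50 is the length of the shortest sequence in the
    smallest set of longest sequences covering at least half of all bases;
    L50 is the number of sequences in that set.
    """
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        if running >= half:
            return length, count
    raise ValueError("no sequences given")
```

With lengths `[10, 8, 6, 4, 2]` the total is 30 bases, and the two longest sequences already cover 18 of them, so the function returns `(8, 2)`.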

Stats3

In progress. Generates some assembly stats for multiple files.

StatsWrapper

Runs stats.sh on multiple assemblies to produce one output line per file.

BBStats

BBTools-specific statistics and analysis tool.

CloudPlot

Visualizes 3D compositional metrics (GC, HH, CAGA) as 2D scatter plots with color encoding. Supports TSV input or automatic FASTA metric calculation.

ReadLength

Generates read length statistics for sequencing datasets, including distribution, median, mode, and variance for quality assessment.

ScalarIntervals

Calculates compositional scalar metrics from nucleotide sequence data using sliding windows or global analysis.

Scalars

Calculates compositional scalar metrics from nucleotide sequence data. Computes GC-independent metrics (HH, CAGA, strandedness, etc.) either globally or using a sliding window.

CountDuplicates

Identifies and quantifies duplicate sequences in sequencing data to detect library preparation issues and track read redundancy.

CountGC

Analyzes nucleotide composition by calculating GC content across FASTA and FASTQ files, supporting multiple output formats.
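
The underlying GC calculation can be sketched in a few lines of Python. This is only an illustration, not the tool's code: the function name is hypothetical, and excluding ambiguous bases such as N from the denominator is an assumption of this example.

```python
def gc_content(seq):
    """Fraction of G and C among the defined bases (A, C, G, T) of a sequence."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    defined = gc + at
    # Assumption: ambiguous bases (N and others) are left out of the denominator.
    return gc / defined if defined else 0.0
```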

TetramerFreq

Analyzes DNA sequence composition using sliding windows across genomes, counting and normalizing DNA k-mer frequencies.

KHist

Generates kmer depth histograms from sequencing data, showing genome complexity, error rates, and unique sequence distributions.

KmerCountExact

Kmer counting tool for generating frequency histograms and genome size estimates by analyzing unique k-length subsequences.
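
A minimal Python sketch of exact k-mer counting and the resulting frequency histogram follows. It is an assumption-laden illustration, not the tool's algorithm: the function names are hypothetical, k-mers are merged with their reverse complements (canonical form), and k-mers containing non-ACGT characters are skipped.

```python
from collections import Counter

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical(kmer):
    """Return the lexicographically smaller of a k-mer and its reverse complement."""
    reverse_complement = kmer.translate(_COMPLEMENT)[::-1]
    return min(kmer, reverse_complement)


def count_canonical_kmers(seqs, k):
    """Count every canonical k-mer across the given sequences."""
    counts = Counter()
    for seq in seqs:
        seq = seq.upper()
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            if set(kmer) <= set("ACGT"):  # skip k-mers containing N or other symbols
                counts[canonical(kmer)] += 1
    return counts


def depth_histogram(counts):
    """Map each depth to the number of distinct k-mers observed at that depth."""
    return Counter(counts.values())
```

For the single sequence `"ACGTT"` with `k=2`, the 2-mers AC and GT collapse to the same canonical k-mer, giving one k-mer at depth 2 and two at depth 1.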

KmerCountMulti

Estimates sequence complexity by counting unique kmers simultaneously across multiple lengths to quantify genetic diversity and sequence composition.

KmerCountShort

Counts the number of unique kmers in a file and prints a FASTA or TSV file containing all kmers and their counts. Supports K=1 to 15, though values above 8 should use KmerCountExact.

KmerCoverage

Annotates reads with their kmer depth to quantify sequence coverage and filter low-quality reads based on kmer frequency and base quality.

KmerFilterSet

Creates compact kmer filter sets that guarantee every input sequence contains at least one kmer for sequence filtering and matching.

KmerLimit

Generates a representative subset of sequencing reads by limiting output to a specific number of unique kmers for diverse sampling.

KmerLimit2

Uniformly subsamples sequencing reads to achieve a target number of unique k-mers for consistent dataset representation.

KmerPosition

Analyzes sequencing reads to identify positions of reference sequence kmers, showing positional biases, contamination patterns, and sequence enrichment.

CommonKmers

Analyzes sequence composition by identifying and ranking the most frequent short DNA/RNA sequence motifs for genomic patterns and quality assessment.

KCompress

Compresses sequence data into unique kmers for memory-efficient analysis by filtering, counting, and assembling sequence fragments.

SendClade

Sends taxonomic queries to a remote QuickClade server for classification without loading the reference database locally, dramatically reducing memory requirements and improving performance for batch processing.

SendSketch

Identifies unknown sequences by comparing them to online reference databases across multiple taxonomic domains for taxonomic characterization.

Sketch

Creates compact, taxonomically-aware sequence representations using MinHash algorithm for genomic comparison and sequence classification.
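
The MinHash idea behind sketching can be illustrated with a bottom-k sketch in Python. This is a generic textbook version under stated assumptions, not the BBSketch implementation: the function names, the BLAKE2b hash, and the default `k` and `size` values are all chosen for the example and say nothing about the tool's file format or parameters.

```python
import hashlib


def kmer_hashes(seq, k):
    """Hash every k-mer of a sequence to a 64-bit integer."""
    seq = seq.upper()
    return {
        int.from_bytes(
            hashlib.blake2b(seq[i:i + k].encode(), digest_size=8).digest(), "big"
        )
        for i in range(len(seq) - k + 1)
    }


def make_sketch(seq, k=8, size=64):
    """Bottom-k MinHash sketch: keep only the `size` smallest k-mer hashes."""
    return sorted(kmer_hashes(seq, k))[:size]


def jaccard_estimate(sketch_a, sketch_b, size=64):
    """Estimate k-mer set similarity from two bottom-k sketches."""
    union = sorted(set(sketch_a) | set(sketch_b))[:size]
    if not union:
        return 0.0
    shared = set(sketch_a) & set(sketch_b)
    # Fraction of the smallest hashes of the union that both sketches contain.
    return sum(1 for h in union if h in shared) / len(union)
```

Two identical sequences give an estimate of 1.0 and two sequences with no k-mers in common give 0.0; everything in between is an approximation whose precision grows with the sketch size.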

BBSketch

Generates compact MinHash genomic sketches for comparing genome sequences, similarity analysis, taxonomic identification, and contamination detection.

CladeServer

High-performance HTTP server for taxonomic classification using QuickClade architecture. Loads reference database once and handles multiple client requests efficiently.

CompareSketch

Compares genomic sequences across multiple samples using MinHash-based k-mer matching for genome similarity assessment, taxonomic annotation, and contamination detection.

MergeSketch

Combines multiple genomic or proteomic sketches into a single sketch for union of k-mer sets across multiple sequence datasets.

SubSketch

Reduces genome sketches to a smaller size while preserving key genomic information for computational analysis and storage.

SketchBlacklist

Identifies and creates a sketch-based blacklist of kmers that frequently occur across multiple sequences or taxa for genomic filtering and contamination detection.

SketchBlacklist2

Generates genomic blacklist sketches by identifying commonly occurring k-mers across multiple sequence sketches for filtering repetitive or non-specific sequences.

AnalyzeSketchResults

Analyzes genome sketch results to generate taxonomic assessments, computing identity metrics, accuracy statistics, and correlations across taxonomic levels.

SummarizeSketch

Summarizes BBSketch results, transforming sketch comparisons into concise reports with taxonomic insights and contamination detection.

Taxonomy

Identifies and classifies organisms by name, ID, or sequence header using NCBI taxonomy databases, supporting multi-format input and taxonomic resolution.

TaxServer

HTTP server that translates NCBI taxonomy and provides sketch-based sequence comparison services for taxonomic identification and phylogenetic analysis.

TaxSize

Quantifies sequence diversity and abundance across taxonomic groups by calculating sequence lengths and counts for each taxonomic node.

TaxTree

Converts NCBI taxonomy files into a compact binary tree for taxonomic queries and hierarchical analysis across BBTools programs.

GI2TaxID

Annotates biological sequences by converting sequence headers with GI numbers, accessions, or organism names to their corresponding NCBI Taxonomy IDs for FASTA and GFF files.

GI2Ancestors

Traces taxonomic lineages by finding common ancestors for biological sequences using NCBI GI numbers for phylogenetic and evolutionary analysis.

GITable

Creates a memory-efficient lookup table for converting legacy NCBI GI numbers to taxonomy IDs for research reproducibility and legacy workflows.

Decontaminate

Decontaminates multiplexed assemblies via normalization and mapping.

RemoveHuman

Removes all reads that map to the human genome with at least 95% identity after quality trimming. Removes approximately 98.6% of human 2x150bp reads, with zero false-positives to non-animals.

RemoveHuman2

Removes all reads that map to the human genome with at least 88% identity after quality trimming. This is more aggressive than removehuman.sh and uses an unmasked human genome reference. It removes...

RemoveCatDogMouseHuman

Removes all reads that map to the cat, dog, mouse, or human genome with at least 95% identity after quality trimming. Removes approximately 98.6% of human 2x150bp reads, with zero false-positives t...

RemoveMicrobes

Removes all reads that map to selected common microbial contaminant genomes. Removes approximately 98.5% of common contaminant reads, with zero false-positives to non-bacteria. NOTE! This program u...

CrossContaminate

Generates synthetic cross-contaminated files from clean files. Intended for use with synthetic reads generated by SynthMDA or RandomReads.

SummarizeContam

Summarizes monthly contam files into a single file. This is for internal JGI use.

Reformat

Converts, trims, filters, and processes sequencing reads across multiple formats (FASTQ, FASTA, SAM, BAM) with quality control, format conversion, and sampling options.

ReformatPB

Processes PacBio sequencing data with Zero Mode Waveguide (ZMW) awareness for filtering, sampling, and quality control of single-molecule real-time sequencing reads.

FastqScan

Lightweight sequence file scanner for counting reads and bases with minimal CPU overhead. Supports FASTQ, FASTA, and SAM formats. 7× faster than needletail on bgzipped files.

FileScan

Fast lightweight file scanner that parses and counts records across multiple sequence and alignment formats (FASTQ, FASTA, SAM, SCARF, GFA, FASTG) with support for raw, gzip, bgzip, and bz2 compression.

Cat

Concatenates multiple genomic sequence files into a single output, supporting various formats and compression levels.

Cbcl2Text

Converts Illumina CBCL (Compressed Base Call) files to text format. Extracts base calls, quality scores, and flowcell coordinates from binary CBCL files. Supports automatic read structure parsing from RunInfo.xml or manual read splitting.

Unzip

File compression and decompression utility for multiple formats including gzip, bzip2, and other compressed files.

VCF2GFF

Converts genomic variant data between VCF and GFF3 formats for genome annotation, comparative genomics, and data visualization.

GBFF2GFF

Converts GenBank flat files (GBFF) to GFF3 format, extracting genomic feature annotations like CDS, rRNA, tRNA, and gene coordinates for genome analysis and visualization.

Phylip2FASTA

Converts interleaved PHYLIP sequence files to standard FASTA format for phylogenetic data analysis.

CG2Illumina

Converts BGI/Complete Genomics read headers to standard Illumina format for compatibility with bioinformatics pipelines that expect Illumina-style sequencing read headers.

MatrixToColumns

Transforms paired identity matrices into a two-column format for correlation analysis, visualization, and comparison of matrix entries across different datasets.

TextFile

Extracts and processes line ranges from text files, supporting file, stdin, and compressed input sources.

FilterByName

Filters sequence reads by name for data subset extraction, contamination removal, and quality control across FASTA, FASTQ, and SAM formats with configurable matching strategies.

FilterBySequence

Filters DNA/RNA sequences by including or excluding sequences based on exact or approximate matches to reference sequences, for contamination removal and target enrichment.

FilterByTaxa

Filters DNA/RNA sequences by taxonomic criteria to extract or exclude sequences from specific organisms, taxonomic levels, or groups using NCBI taxonomy identifiers.

FilterByTile

Removes low-quality sequencing data from Illumina flowcells by assessing micro-tile regions across multiple quality metrics to filter out unreliable reads before downstream analysis.

FilterByCoverage

Removes low-quality, poorly assembled, or contaminated contigs from genome assemblies by filtering based on coverage depth, read support, and coverage consistency.

FilterBarcodes

Filters and validates multiplexed sequencing reads by barcode quality, removing low-quality samples to prevent downstream analysis errors and generating quality assessment metrics.

FilterLines

Filters text files by including or excluding lines based on matching criteria for text data extraction and cleanup across various scientific and data processing workflows.

FilterQC

Sequencing data preprocessing pipeline that removes adapters, contaminants, and low-quality reads to prepare raw sequencing data for downstream genomic analysis.

FilterSAM

Removes unreliable reads with unsupported variants from SAM/BAM files by filtering out likely sequencing errors using variant evidence.

FilterSilva

Cleans Silva database sequences by removing misclassified bacteria, eukaryotic organellar sequences, and taxonomically ambiguous entries to improve microbial sequence dataset accuracy.

FilterSubs

SAM/BAM file filtering tool for identifying reads with substitution errors to diagnose sequencing quality score calibration issues and detect base-calling anomalies.

FilterVCF

Filters variant call files (VCFs) by position, type, and quality to isolate high-confidence genomic variants for downstream genetic analysis and research.

FilterAssemblySummary

Filters NCBI assembly summaries by taxonomy using taxonomic trees and ID-based filtering across multiple taxonomic levels and hierarchical promotion.

NetFilter

Scores sequences using a neural network. Multithreaded with filtering options for sequence classification.

PolyFilter

Removes sequencing artifacts by detecting and filtering artificial homopolymers using analysis of read entropy, depth, quality, and polymeric content to improve read quality for downstream genomic analysis.

PostFilter

Filters genome assemblies by removing low-quality, suspicious contigs through coverage analysis, reducing misassembly rates and improving assembly reliability.

EstherFilter

BLASTs queries against a reference and filters out hits with scores less than 'cutoff'. The score is taken from column 12 of the BLAST output. The specific BLAST command is: blastall -p blastn -i ...

CallVariants

Genetic variant detection tool that identifies single nucleotide variants, insertions, and deletions from aligned sequencing reads, with filtering and multi-sample support.

CallVariants2

Multi-sample variant discovery tool for processing multiple genome samples simultaneously, generating population-level variant calling with independent variant detection and unified genotype reporting.

ApplyVariants

Mutates reference genomes by applying genetic variants, creating consensus sequences while resolving variant interactions and filtering by genomic coverage.

CompareVCF

Identifies unique, common, and shared genetic variants across multiple VCF files through set operations like subtraction, union, and intersection for comparative genomic analysis.

Pileup

Calculates coverage statistics for genomic data to analyze read depth, distribution, and mapping characteristics across scaffolds and individual bases in SAM/BAM files.

Pileup2

Multi-threaded coverage analysis tool for simultaneous processing of multiple SAM/BAM files, supporting genomic data exploration across various sequencing experiments.

CalcTrueQuality

Calculates actual base-call accuracy from mapped sequencing reads and generates quality score recalibration matrices to improve confidence in sequencing data quality.

BamLineStreamer

Converts BAM (Binary Alignment/Map) files to SAM (Sequence Alignment/Map) text format.

SplitSAM

Separates SAM alignment files into three files: plus-mapped reads, minus-mapped reads, and unmapped reads for strand-specific downstream analysis.
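
The three-way split can be pictured with the standard SAM FLAG bits: 0x4 marks an unmapped read and 0x10 marks a read aligned to the reverse strand. The Python sketch below is an illustration only; the function names are hypothetical, and dropping header lines rather than copying them to each output is a simplification of this example.

```python
def strand_bucket(sam_line):
    """Classify one SAM alignment line as 'plus', 'minus', or 'unmapped'."""
    flag = int(sam_line.split("\t")[1])  # FLAG is the second SAM column
    if flag & 0x4:  # 0x4: segment unmapped
        return "unmapped"
    return "minus" if flag & 0x10 else "plus"  # 0x10: reverse-strand alignment


def split_sam(lines):
    """Group alignment lines into the three buckets, skipping '@' header lines."""
    buckets = {"plus": [], "minus": [], "unmapped": []}
    for line in lines:
        if line.startswith("@"):
            continue  # simplification: headers are dropped in this sketch
        buckets[strand_bucket(line)].append(line)
    return buckets
```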

SplitSAM4Way

Categorizes SAM alignment file reads into plus-strand, minus-strand, chimeric, and unmapped reads for mapping analysis and downstream processing.

SplitSAM6Way

Categorizes paired-end sequencing reads by mapping status and strand orientation for genomic analysis and library quality assessment.

MergeSAM

Concatenates multiple SAM alignment files into a single file, preserving the first file's header while handling genomic data merging scenarios.

StreamSAM

Redirects to SamStreamer - renamed tool with enhanced native BAM support and record-breaking performance.

SamStreamer

Converts between SAM, BAM, FASTA, and FASTQ formats with SAM filtering options and BAI index generation.

Stream

Universal format converter for SAM, BAM, FASTA, and FASTQ with subsampling, paired file support, and multithreading.

SAMToROC

Generates Receiver Operating Characteristic (ROC) curves for mapping accuracy by analyzing synthetic read alignments, providing performance metrics for genomic mapping tools.

GradeSAM

Validates mapping accuracy of synthetic read alignments by comparing predicted positions against known true positions to quantify mapper performance across different genomic mapping tools.

BBSplit

Maps reads to multiple reference sequences, separating and assigning reads to specific references while handling ambiguous mappings.

SplitByTaxa

Separates biological sequences into distinct files based on their taxonomic classifications to isolate and analyze specific taxonomic groups from metagenomic datasets.

SplitNextera

Separates Nextera long mate pair (LMP) sequencing libraries into distinct read categories: long mate pairs, fragments, unknown pairs, and singletons by detecting and processing adapter sequences.

SplitRibo

Separates mixed ribosomal RNA sequences into distinct files by type (16S, 18S, 5S, 23S) from databases for taxonomic and phylogenetic analysis.

BBSplitPairs

Filters paired-end sequencing reads by length and optionally quality, separating valid read pairs from singletons and discarding reads too short to be useful after trimming.

Shred

Breaks large genomic sequences into smaller, potentially overlapping fragments for downstream analysis, supporting length distributions and quality score handling.
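
Shredding with overlap can be sketched as a sliding window whose step is the fragment length minus the overlap. The Python below is a minimal illustration with hypothetical names; it covers fixed-length fragments only and does not model length distributions or quality scores.

```python
def shred(seq, length, overlap=0):
    """Cut a sequence into fragments of `length` that share `overlap` bases."""
    if length <= 0 or not 0 <= overlap < length:
        raise ValueError("need length > 0 and 0 <= overlap < length")
    step = length - overlap
    pieces = []
    for start in range(0, len(seq), step):
        pieces.append(seq[start:start + length])
        if start + length >= len(seq):
            break  # this fragment reached the end; avoid emitting a redundant tail
    return pieces
```

For instance, `shred("ABCDEFGHIJ", 4, overlap=2)` yields `["ABCD", "CDEF", "EFGH", "GHIJ"]`, while the same call without overlap ends with the short final piece `"IJ"`.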

Mutate

Generates mutant genome variants with control over substitution, insertion, and deletion rates for testing, simulation, and variant caller evaluation.

KMutate

Generates kmer variant sets with controlled mutations for sequence analysis and filtering of barcodes, oligos, and genomic studies.

CutPrimers

Extracts specific genomic regions between primer sequences in sequencing data for amplicon analysis by removing or preserving primer-flanked sequences from PCR products or metagenomic samples.

CutGFF

Extracts, filters, and validates specific genomic features from genome files using GFF annotation files for sequence selection with quality control and taxonomic integration.

TrimContigs

Removes low-coverage regions from genome assemblies, breaking or trimming contigs to retain only well-supported genomic sequences with high confidence.

Translate6Frames

Translates nucleotide sequences into all 6 possible protein reading frames, or converts amino acids back to canonical nucleotides, supporting both single and paired-end sequencing data formats.
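
Six-frame translation means translating three offsets on the forward strand and three on the reverse complement. The Python sketch below assumes the standard genetic code and DNA input; the names are hypothetical, and writing `X` for codons with ambiguous bases is a choice made for this example rather than documented tool behavior.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
_BASES = "TCAG"
_AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODONS = {"".join(codon): aa for codon, aa in zip(product(_BASES, repeat=3), _AMINO)}
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def translate(seq):
    """Translate complete codons from the start of seq; unknown codons become 'X'."""
    return "".join(CODONS.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3))


def six_frames(seq):
    """Return translations for forward frames 0-2, then reverse-complement frames 0-2."""
    seq = seq.upper()
    reverse_complement = seq.translate(_COMPLEMENT)[::-1]
    return [
        translate(strand[offset:])
        for strand in (seq, reverse_complement)
        for offset in range(3)
    ]
```

For the input `"ATGGCC"` this gives `["MA", "W", "G", "GH", "A", "P"]`.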

BBSort

Read sorting tool that organizes sequencing data by name, length, quality, position, or taxonomy, with memory management for large datasets.

SortByName

Read sorting utility that organizes sequencing data by multiple criteria (name, length, quality, position) while handling large datasets with memory management.

MergeSorted

Merges and sorts partial sorting results across multiple temporary files, supporting sorting strategies for genomic data processing when initial sorting was interrupted.

Shuffle

Randomly reorders sequence reads while preserving read pairing for reproducible sampling and supporting sorting strategies like name, coordinate, or sequence-based organization.

Shuffle2

Randomly reorders sequencing reads while preserving paired-end relationships, with external memory management for handling large datasets beyond RAM limitations.

Partition

Splits a sequence file evenly into multiple files.

RandomReads

Generates synthetic genomic reads for benchmarking bioinformatics tools, testing analysis pipelines, and simulating sequencing scenarios like single-cell, metagenomic, and long-read datasets.

RandomReadsMG

Generates synthetic metagenomic sequencing reads with coverage, error profiles, and taxonomic diversity to benchmark and validate bioinformatics analysis tools across multiple sequencing platforms.

RandomGenome

Generates random genome or protein sequences for testing bioinformatics tools, benchmarking algorithms, and creating controlled datasets for scientific research and method development.

BBFakeReads

Generates synthetic read pairs by extracting ends from input contigs or sequences for simulation of mate-pair or paired-end sequencing libraries for testing and validation purposes.

MakeChimeras

Generates synthetic chimeric sequences from nonchimeric reads to create controlled test datasets for validating sequencing analysis tools, especially for PacBio read processing.

MakeContaminatedGenomes

Generates synthetic chimeric genomes with contamination to create test datasets for bioinformatics tools, metagenomics research, and horizontal gene transfer simulations.

MakePolymers

Generates synthetic genomic sequences with enumerated k-mers for tool testing, algorithm validation, and low-complexity sequence creation for bioinformatics research.

SynthMDA

Simulates single-cell genomic sequencing data by generating synthetic reads with the uneven coverage distribution caused by Multiple Displacement Amplification (MDA) techniques.

DemuxByName

Demultiplexes sequencing reads into multiple output files by parsing read names, barcodes, tiles, or headers for sample separation in high-throughput sequencing data.

MuxByName

Combines multiple sequencing files into a single file by adding source-specific prefixes to read names, preserving read metadata while consolidating multiple samples or datasets.

CountBarcodes

Identifies, counts, and validates unique barcodes in sequencing reads, providing quality control metrics for demultiplexing and experimental integrity verification.

CountBarcodes2

Analyzes and quantifies barcode frequencies in sequencing reads for sample tracking, cross-contamination detection, and error-tolerant barcode assignment across sequencing workflows.

MergeBarcodes

Concatenates barcodes and quality onto read names for demultiplexing workflows.

RemoveBadBarcodes

Filters sequencing reads by removing entries with invalid barcode characters for clean data in downstream bioinformatics processing.

RemoveSmartbell

Removes Smart Bell adapters from PacBio sequencing reads using alignment algorithms to detect and split or mask adapter sequences while preserving read quality.

NovaDemux

Demultiplexes sequencer reads into multiple files based on barcodes, using statistical analysis to accurately assign reads to sample libraries while minimizing crosstalk and errors in sequencing data.

DemuxServer

Starts a multi-threaded HTTP server for probabilistic barcode demultiplexing using maximum likelihood algorithms for genomic data assignment.

Rename

Renames reads to prefix_number, where you specify the prefix and the numbers are ordered. Supports multiple renaming modes including coordinate-based, insert-size based, and custom trimming oper...

BBRename

bbrename.sh is an alias for rename.sh. This tool renames reads to prefix_number format or applies other renaming modes.

RenameByMapping

Renames genomic contigs by appending coverage and taxonomic information from SAM/BAM mapping files for annotation of metagenome assemblies in downstream bioinformatics analysis.

RenameBySketch

Renames genome assembly or metagenome files with their taxonomic ID for file organization and metagenome binning validation using MinHash sketches.

RenameIMG

Renames Integrated Microbial Genomes (IMG) sequence records by prefixing headers with taxonomic and IMG identifiers for sequence tracking and taxonomic annotation in genomic datasets.

RenameRef

Converts reference sequence names across genomics file formats (SAM, BAM, FASTA, VCF, GFF) for standardization of genome reference naming conventions between different databases and research contexts.

ReplaceHeaders

Replaces sequence read headers in FASTA/FASTQ files for header management across bioinformatics workflows like renaming, tracking, or standardizing sequence metadata.

CallGenes

Finds ORFs and calls genes in unspliced prokaryotes. This includes bacteria, archaea, viruses, and mitochondria. Can also predict 16S, 18S, 23S, 5S, and tRNAs.

CallPeaks

Calls peaks from a 2-column (x, y) tab-delimited histogram. Designed primarily for analyzing k-mer frequency histograms to estimate genome characteristics including size, ploidy, heterozygosity rat...

CheckStrand

Estimates RNA-seq library strandedness by analyzing k-mers, stop codons, poly-A tails, and optional reference sequences without requiring full read alignment.

Consect

Generates the conservative consensus of multiple error-correction tools. Corrections will be accepted only if all tools agree. This tool is designed for substitutions only, not indel corrections.
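
The "accept only if all tools agree" rule can be sketched per base position. The Python below is an illustration under stated assumptions, not Consect's code: the function name is hypothetical, and every corrected read is required to have the same length as the original, which mirrors the substitutions-only restriction.

```python
def conservative_consensus(original, corrected_versions):
    """Apply a substitution only where every corrected version proposes the same base."""
    if not corrected_versions:
        return original
    if any(len(v) != len(original) for v in corrected_versions):
        raise ValueError("substitution-only consensus needs equal-length reads")
    result = []
    for i, base in enumerate(original):
        proposals = {version[i] for version in corrected_versions}
        # Unanimous proposal: accept it. Any disagreement: keep the original base.
        result.append(proposals.pop() if len(proposals) == 1 else base)
    return "".join(result)
```

With the original read `"ACGT"`, three tools all returning `"ACCT"` produce `"ACCT"`, whereas a single dissenting tool that leaves the read unchanged keeps the result at `"ACGT"`.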

FindRepeats

Identifies and characterizes repetitive genomic sequences using k-mer analysis for detection of duplications, transposons, and structural variations without complex alignment.

IceCreamFinder

Finds PacBio reads containing inverted repeats. These are candidate triangle reads (ice cream cones). Either ice cream cones only, or all inverted repeats, can be filtered.

IceCreamGrader

Counts the rate of triangle reads in a file generated by IceCreamMaker with custom headers.

IceCreamMaker

Generates synthetic PacBio reads to mimic the chimeric inverted repeats from 'triangle reads', aka 'ice cream cones' - reads missing one adapter.

LilyPad

Uses mapped paired reads to generate scaffolds from contigs. Designed for use with ordinary paired-end Illumina libraries.

BBCRISPRFinder

Finds interspersed repeats contained within sequences; specifically, only information within a sequence is used. This is based on the repeat-spacer model of CRISPRs. Designed for short reads, but...

BBMask

Masks low-complexity and repetitive sequences in genomic data using entropy, repeat detection, and coverage-based strategies for downstream bioinformatics analysis.

AdjustHomopolymers

Corrects and simulates sequencing errors by expanding or shrinking homopolymer runs in DNA sequences for assembly and error correction in genomic data.

IndelFree

A sequence alignment tool for small query sets against large references, specializing in exact matching without insertions or deletions (indels) for tasks like CRISPR spacer mapping and primer validation.

FixGaps

Corrects scaffold assembly gaps by using paired read insert size information to estimate and resize incorrectly-sized N-character regions for genome assembly accuracy.

AddSSU

Adds, removes, or replaces SSU sequence of existing sketches. Sketches and SSU fasta files must be annotated with TaxIDs.

MergeRibo

Consolidates multiple SSU (16S/18S) rRNA sequences from different sources into a single, representative sequence per taxonomic ID for non-redundant taxonomic reference databases in microbial and ecological research.

CompareSSU

Compares SSU ribosomal RNA sequences across taxonomic levels, revealing evolutionary relationships and sequence identities using multi-threaded sequence alignment.

ReduceSilva

Simplifies large taxonomic databases by reducing Silva sequence entries to a single representative per taxonomic group for phylogenetic analysis and reference database creation.

FetchProks

Writes a shell script to download one genome assembly and gff per genus or species from NCBI. Attempts to select the best assembly on the basis of contiguity.

AnalyzeAccession

Analyzes storage of taxonomic mapping files by examining accession identifier patterns and calculating compression strategies.

ShrinkAccession

Reduces size and improves loading speed of large taxonomic mapping files by removing unnecessary columns from accession2taxid tables for storage and processing of genomic identifier databases.

FungalRelease

Reformats a fungal assembly for release. Also creates contig and agp files.

CladeLoader

Loads fasta files with TID-labeled contigs to produce Clade record output with kmer frequencies and taxonomic analysis.

ExplodeTree

Constructs a directory and file tree of sequences corresponding to a taxonomic tree.

SummarizeCoverage

Summarizes coverage information from basecov files created by pileup.sh. They should be named like 'sample1_basecov.txt' but other naming styles are fine too.

TileDump

Processes Illumina flow cell tile data: filters, analyzes, and refines tiles, detecting and removing low-quality or problematic ones to improve the accuracy of downstream analysis.

PlotFlowcell

Identify and analyze low-quality regions in sequencing flowcells, filtering problematic reads to improve overall data quality and research reliability.

PlotGC

Analyzes genomic sequences by calculating GC content across fixed-length intervals, allowing researchers to identify compositional variations, sequence quality, and potential genomic anomalies.
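The idea of GC content over fixed-length intervals can be sketched in a few lines. This is only an illustration of the calculation, not PlotGC's code; the function name `gc_by_interval` is hypothetical, and it assumes that any non-G/C character (including N) counts toward the window length and that the final window may be shorter than the rest.

```python
def gc_by_interval(seq: str, interval: int) -> list[float]:
    """Return the GC fraction of each consecutive window of `interval` bases.

    Illustrative assumption: every character in a window counts toward
    its length, so N and other symbols lower the reported GC fraction.
    """
    if interval <= 0:
        raise ValueError("interval must be positive")
    seq = seq.upper()
    fractions = []
    for start in range(0, len(seq), interval):
        window = seq[start:start + interval]  # last window may be shorter
        gc = sum(1 for base in window if base in "GC")
        fractions.append(gc / len(window))
    return fractions


print(gc_by_interval("GGCCAATTGCAT", 4))  # → [1.0, 0.0, 0.5]
```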

PlotHist

Generates detailed histograms from numeric data files, converting multi-column datasets into individual frequency distribution files for statistical analysis and visualization.

PlotReadPosition

Extracts and analyzes positional and barcode data from Illumina sequencing reads, helping researchers detect spatial biases, validate barcode quality, and diagnose potential sequencing errors.

VisualizeAlignment

Transforms text-based alignment exploration maps produced by aligners into bitmap images, allowing researchers to quickly interpret alignment scoring and patterns.

ProcessHi-C

Identifies and trims junction sites in Hi-C mapped reads, characterizing chromatin interaction breakpoints and junction motifs for genomic structural analysis.

ProcessFrag

Transforms raw BBMerge script output into a standardized, tab-delimited format for data analysis, facilitating performance metrics extraction from bioinformatics tool comparisons.

CompareGFF

Compare gene prediction files to evaluate annotation accuracy for CDS, rRNA, and tRNA features, calculating true/false positives and providing statistical analysis of genomic annotation quality.

CompareLabels

Compares delimited labels in read headers to count how many match. The 'unknown' label is a special case. The original goal was to measure the differences between demultiplexing methods. Labels c...

CountSharedLines

Quantifies shared lines between text file sets, allowing content comparison, data validation, and set intersection analysis across genomics, research, and text processing domains.
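Counting shared lines is essentially a set intersection. The sketch below shows one possible reading, assuming that each distinct line is counted once and that trailing newlines are ignored; the function name `count_shared_lines` is hypothetical and the tool's own counting rules may differ.

```python
def count_shared_lines(lines_a, lines_b) -> int:
    """Count distinct lines that appear in both inputs.

    Assumption for illustration: duplicates within one input count once,
    and only the trailing newline is stripped before comparison.
    """
    set_a = {line.rstrip("\n") for line in lines_a}
    set_b = {line.rstrip("\n") for line in lines_b}
    return len(set_a & set_b)


a = ["chr1\n", "chr2\n", "chr2\n", "chrM\n"]
b = ["chr2\n", "chrM\n", "chrX\n"]
print(count_shared_lines(a, b))  # → 2
```

Because the function accepts any iterable of strings, open file handles can be passed in directly.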

IDMatrix

Generates an all-to-all sequence similarity matrix, revealing genetic relationships and divergence across multiple sequences with configurable alignment parameters.

IDTree

Converts sequence identity matrices into phylogenetic trees, allowing researchers to visualize and analyze evolutionary relationships between sequences through hierarchical clustering.

AllToAll

Compute sequence similarity by generating a complete identity matrix that reveals pairwise relationships across all input sequences, allowing comparative genomic analysis.

AlignRandom

Calculates random DNA sequence alignment identity distributions to establish statistical null models for comparing biological sequence similarity across different sequence lengths.

QuickBin

Bins contigs using coverage and kmer frequencies for metagenome assembly analysis. Supports multiple sam files for improved accuracy and uses neural networks for binning decisions.

MakeQuickBinVector

Generates machine learning training vectors for QuickBin genomic binning by extracting and comparing contig features like tetranucleotide frequencies, coverage depths, and GC content to classify metagenomic contigs.

GradeBins

Systematically assesses metagenome bin quality by calculating completeness and contamination, providing a standardized multi-tier ranking system for genomic bin reliability across different taxonomic groups.

CrossBlock

CrossBlock is an alias for decontaminate.sh - a tool for removing contaminants and normalizing coverage by cross-blocking.

SummarizeCrossBlock

Summarizes CrossBlock results. Used for testing and validating CrossBlock.

QuickClade

Assigns taxonomy to query sequences by comparing kmer frequencies to those in a reference database. Developed for taxonomic assignment of metagenomic bins, but it can also run on a per-sequence bas...

Representative

Condenses large genomic datasets by generating a minimal representative set of taxa, selecting centroids that capture the diversity of all input sequences with customizable filtering.

GradeMerge

Evaluates the accuracy of read merging by comparing merged synthetic reads against their known insert sizes, helping researchers validate their read merging algorithms and quality.

SummarizeMerge

Summarizes the output of GradeMerge for comparing read-merging performance.

MergeOTUs

Consolidates coverage statistics for identical Operational Taxonomic Units (OTUs), supporting metagenomic analysis by merging fragmented sequence data into unified taxonomic summaries.

MergePGM

Merges .pgm files used by prokaryotic gene calling tools. Supports weighted merging with normalization and multiplier options.

TagAndMerge

Consolidates and standardizes sequencing reads from demultiplexed samples by extracting barcodes, merging files, and preparing data for downstream genomic analysis and method comparison.

AddAdapters

Simulates adapter contamination in sequencing reads for benchmarking adapter removal tools and evaluating bioinformatics trimming methods.

BloomFilter

Filters sequencing reads by matching k-mers against a reference, allowing contamination removal, depth filtering, and error correction with memory-efficient probabilistic matching.

BloomFilterParser

Extracts and tabulates detailed metrics from bloomfilter.sh verbose output, supporting reproducibility of bloom filter research results.

TestAligners

Benchmarks multiple sequence alignment algorithms, comparing performance metrics like speed, accuracy, and computational efficiency across various alignment strategies.

TestAligners2

Benchmarks multiple sequence alignment algorithms by generating random sequences with controlled nucleotide identity to compare performance across diverse evolutionary similarity levels.

TestFilesystem

Benchmarks filesystem performance for scientific computing by measuring I/O speed, metadata operations, and directory listing to optimize storage systems for large-scale data analysis.

TestFormat

Identifies and characterizes bioinformatics file formats, detecting compression, quality encoding, interleaving, and read length across multiple file types.

TestFormat2

Sequence file analyzer that determines file format, quality metrics, base composition, and statistical characteristics to support bioinformatics pipeline planning and quality control.

DiskBench

Measures disk I/O performance through multithreaded read/write tests, allowing researchers and system administrators to quantify storage system capabilities and bottlenecks.

ProcessSpeed

Converts Linux time command output to decimal seconds for performance analysis and benchmarking of computational workflows.
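To make the conversion concrete, here is a minimal sketch that parses lines in the style printed by the bash `time` builtin (for example `real 1m2.500s`) into decimal seconds. It is not ProcessSpeed's code; the function name `time_to_seconds` is hypothetical, and the sketch assumes only the minutes-and-seconds form, so other `time` output formats are not handled.

```python
import re

# Matches lines such as "real\t1m2.500s" or "user 0m0.250s".
_TIME_RE = re.compile(r"^(real|user|sys)\s+(\d+)m(\d+(?:\.\d+)?)s$")


def time_to_seconds(line: str):
    """Return (label, seconds) for a bash-style time line, else None."""
    match = _TIME_RE.match(line.strip())
    if match is None:
        return None
    label, minutes, seconds = match.groups()
    return label, int(minutes) * 60 + float(seconds)


print(time_to_seconds("real\t1m2.500s"))  # → ('real', 62.5)
```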

CalcMem

Calculates system memory for BBTools scripts, parsing Java memory parameters and detecting available RAM across diverse computational environments.

MemDetect

Automatically detects and allocates Java memory across different computing environments, handling system constraints and supporting HPC schedulers with platform-specific memory estimation.

Profile

Runs any BBTools program with Java Flight Recorder profiling enabled, capturing CPU usage, memory allocation, thread activity, and performance metrics for optimization analysis.

LogLog

Estimates unique kmer count in sequencing data using the LogLog algorithm, allowing genomic diversity assessment with low memory overhead.

KapaStats

Detects cross-contamination between sequencing plate wells by analyzing Kapa adapter sequences, helping researchers identify and quantify unintended molecular tag mixing in genomic libraries.

SummarizeQuast

Consolidates multiple Quast genome assembly reports, allowing comparative statistical analysis and visualization of assembly metrics across different genomic datasets.

SummarizeScafStats

Consolidates BBMap scaffold statistics across multiple sequencing libraries to detect cross-contamination and quantify read mapping across different organism scaffolds.

SummarizeSeal

Processes Seal mapping stats to generate contamination summaries across multiple libraries, allowing cross-organism and cross-sample contamination analysis.

ScoreSequence

Applies neural network scoring to biological sequences, allowing sequence filtering, annotation, and quality assessment using machine learning-based sequence characterization.

GetReads

Extracts specific reads by their numeric ID from sequencing files, allowing subsampling and selection of reads for downstream genomic analysis.

PickSubset

Selects a diverse subset of genomic files based on pairwise sequence similarity, reducing redundancy in large genomic datasets without using taxonomic information.

LoadReads

Diagnostic tool for measuring memory consumption of sequencing datasets, providing detailed insights into read storage efficiency and performance characteristics for bioinformatics workflows.

KeepBestCopy

Filters ribosomal gene sequences to retain the highest-quality representative copy per taxonomic identifier, improving downstream genomic and metagenomic analyses.

CopyFile

Provides file copying and recompression capabilities for scientific data, allowing format conversion, compression benchmarking, and consistent file processing across bioinformatics workflows.

PrintTime

Provides a lightweight file-based timing utility for measuring execution intervals and performance checkpoints in bioinformatics workflows and shell scripts.

InvertKey

Reconstructs original k-mer sequences from genomic sketch hash values, allowing sequence identification and reverse lookup in genomic datasets.

Unicode2ASCII

Attempts to convert unicode and control characters to printable ASCII, with significant limitations in character mapping and information preservation.

ReduceColumns

Extracts and reduces specific columns from tab-delimited files, allowing data subsetting for machine learning, bioinformatics, and data preprocessing workflows.

SeqToVec

Transforms biological sequences into machine learning-ready vectors using one-hot encoding or k-mer frequency spectra for computational genomics and predictive modeling.
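One-hot encoding of a nucleotide sequence can be sketched as follows. This is an illustration of the general technique rather than SeqToVec's actual output format; the A, C, G, T channel order, the function name `one_hot`, and the choice to encode ambiguous bases such as N as all zeros are assumptions.

```python
BASES = "ACGT"  # assumed channel order for this illustration


def one_hot(seq: str) -> list[float]:
    """Flatten a sequence into 4 floats per base (A, C, G, T channels).

    Assumption: characters outside ACGT become four zeros.
    """
    vector = []
    for base in seq.upper():
        vector.extend(1.0 if base == channel else 0.0 for channel in BASES)
    return vector


print(one_hot("AC"))  # → [1.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0]
```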

Train

Trains and evaluates multi-layer neural networks for binary and multi-class machine learning tasks, supporting custom architectures, activation functions, and adaptive training strategies.

RunHMM

Processes HMMER search output files, extracting and organizing protein hit details by parsing 23 fields per line to generate concise, length-filtered protein summaries.

BBVersion

Version checking utility for BBTools bioinformatics suite, allowing tracking of software version in genomic research pipelines and reproducible computational analyses.

JavaSetup

Configures Java runtime environment for BBTools, managing memory allocation, performance settings, and path optimization across different computing platforms.

WebCheck

Analyzes web server log files to generate performance metrics, tracking response times, status codes, and identifying potential server reliability issues.

BBEst

Analyzes EST mapping efficiency by processing SAM files, categorizing expressed sequence tag capture across assemblies and quantifying mapping quality with detailed intron and scaffold analysis.

BBCountUnique

Quantifies sequence library complexity by analyzing kmer uniqueness, helping detect PCR duplication, sequencing bias, and improve genomic data collection.

Fuse

Concatenates genomic sequences or paired-end reads into longer fragments, supporting flexible padding and length control for sequence assembly and analysis workflows.

A_Sample_MT

A template for creating multi-threaded read processing tools, providing a robust framework for developing parallel sequence data manipulation applications in BBTools.

AnalyzeGenes

Generates prokaryotic gene models by analyzing fasta and GFF files to identify coding and non-coding gene sequences with multi-type recognition.

📌 Universal BBTools Parameters

🎯 Core Alignment & Mapping Tools

🔬 Alignment Algorithms

🔍 Quality Control & Preprocessing

🧬 Assembly & Error Correction

📊 Statistics & Analysis

🔤 Kmer Tools & Analysis

🦠 Taxonomic Analysis & Classification

🧹 Contamination Detection & Removal

🔄 File Format Conversion & Manipulation

🎛️ Filtering Tools

🧪 Variant Calling & Analysis

🔧 SAM/BAM Processing

✂️ Sequence Manipulation

📋 Sorting & Organization

🎭 Read Simulation & Generation

🏷️ Barcode & Demultiplexing

🏷️ Renaming & Header Manipulation

🔬 Specialized Analysis Tools

🎭 Masking & Modification

🧬 Ribosomal RNA Tools

🗄️ Database Tools

📏 Coverage & Depth Analysis

📈 Plotting & Visualization

🔗 Hi-C & Special Formats

⚖️ Comparison & Analysis

📦 Binning & Clustering

🔀 Merging Tools

🔗 Adapter & Primer Tools

🌸 Bloom Filter Tools

🧪 Testing & Benchmarking

💾 Memory & Resource Tools

🎯 Quality Assessment

🔧 Specialized Utilities

🤖 Machine Learning & AI

⚙️ System & Configuration

🔨 Development & Special Tools