GradeBins

Script: gradebins.sh
Package: bin
Class: GradeBins.java

Grades metagenome bins for completeness and contamination. The contigs can be labeled with their taxID, in which case each header should contain 'tid_X' somewhere, where X is a number unique to the contig's proper genome. Alternatively, CheckM2 and/or EukCC output can be supplied. Do not include a 'chaff' file (for unbinned contigs) when grading.

Completeness Score is (sum of completeness*size)/(total size) for all bins.
Contamination Score is (sum of contam*size)/(total size) for all bins.
Total Score is (sum of (completeness-5*contam)^2) for all bins.

Bin Definitions:

UHQ: >=99% complete and <=1% contam (subset of VHQ)
VHQ: >=95% complete and <=2% contam (subset of HQ)
HQ: >=90% complete and <=5% contam
MQ: >=50% complete and <=10% contam, but not HQ
LQ: <50% complete or >10% contam
VLQ: <20% complete or >5% contam (subset of LQ)

Basic Usage

gradebins.sh ref=assembly bin*.fa
gradebins.sh ref=assembly.fa in=bin_directory
gradebins.sh taxin=tax.txt in=bins

GradeBins evaluates the quality of metagenome bins by calculating completeness and contamination metrics. It can process bins with labeled contigs (containing taxID information) or integrate results from external quality assessment tools like CheckM2 and EukCC.

Parameters

Parameters are grouped by their role in the bin grading workflow.

Input parameters

ref=<file>: The original assembly that was binned. Required for calculating completeness when not using taxin. The reference assembly is used to build a size map for each taxonomic ID to determine expected genome sizes.
in=<directory>: Location of bin fastas. Accepts individual files or a directory containing multiple bin files; directories are scanned using Tools.getFileOrFiles().
checkm=<file>: Optional CheckM2 quality_report.tsv file or directory. If a directory is provided, looks for quality_report.tsv within it. CheckM2 results take precedence over internal calculations when available.
eukcc=<file>: Optional EukCC eukcc.csv file or directory. If a directory is provided, looks for eukcc.csv within it. Used for eukaryotic bin assessment and compared with CheckM2 results to select the best quality scores.
cami=<file>: Optional binning file from CAMI which indicates contig TaxIDs. Provides taxonomic labels for contigs in standardized CAMI format, overriding any taxID information parsed from contig headers.
taxin=<file>: Optional file with taxIDs and sizes, used instead of loading ref. Does not need to include taxIDs. Tab-delimited format with columns: taxID, size, contigs. Loads faster than the reference because the assembly does not have to be parsed.
gtdb=<file>: Optional gtdbtk file. Can be a single file or directory containing gtdbtk.bac120.summary.tsv and gtdbtk.ar53.summary.tsv files. Provides GTDB taxonomic classifications for lineage reporting.
gff=<file>: Optional gff file. Used for rRNA and tRNA annotation when userna=t is enabled. Provides gene annotations necessary for high-quality genome determination based on essential RNA content.
imgmap=<file>: Optional IMG map file, for renamed IMG gff input. Maps between original and renamed contig identifiers in IMG (Integrated Microbial Genomes) datasets to ensure proper GFF annotation matching.
spectra=<file>: Optional path to QuickClade index. Enables taxonomic classification using k-mer based spectra matching when quickclade=t is set. Uses pre-built reference databases for rapid taxonomic assignment.
cov=<file>: Optional path to QuickBin coverage file. Provides per-contig coverage information for depth-aware analysis and reporting. Coverage data is incorporated into bin statistics and quality assessment.
loadmt=t: Load bins multithreaded. Default: true. Bins are loaded in parallel with a thread count taken from Shared.threads(); when that count exceeds 16 it is reduced to Tools.mid(16, threads/2, 32).

Output parameters

report=<file>: Report on bin size, quality, and taxonomy. Generates tab-delimited report with columns for bin name, size, contig count, GC content, depth, completeness, contamination, taxonomic ID, quality type, and optional RNA/gene counts.
taxout=<file>: Generate a tax file from the reference (for use with taxin). Creates a tab-delimited file with taxID, size, and contig count that can be reused in subsequent runs to avoid re-parsing the reference assembly.
hist=<file>: Cumulative bin size and contamination histogram. Generates histogram data showing the distribution of bin sizes and contamination levels for visualization and analysis of binning quality across the dataset.
ccplot=<file>: Per-bin completeness/contam data. Outputs a simple two-column format with completeness and contamination values for each bin, suitable for creating completeness vs contamination scatter plots.
contamhist=<file>: Histogram plotting #bins or bases vs %contam. Creates histogram data showing the distribution of contamination percentages across bins, enabling assessment of overall dataset contamination patterns.

Processing parameters

userna=f: Require rRNAs and tRNAs for HQ genomes. This needs either a gff file or the callgenes flag. When enabled, HQ and its subtypes require at least one each of 16S, 23S, and 5S rRNA, plus 18 tRNAs.
callgenes=f: Call rRNAs and tRNAs. Suboptimal for some RNA types. Enables internal gene calling using built-in algorithms. Less accurate than external annotation tools but provides basic RNA detection when GFF files are unavailable.
aligner=ssa2: Do not change this. Specifies the internal aligner type used for gene calling and sequence analysis. The ssa2 aligner is optimized for the specific requirements of bin quality assessment.
quickclade=f: Assign taxonomy using QuickClade. Enables k-mer based taxonomic classification using the QuickClade algorithm. Requires spectra parameter to specify the reference database index.

Java Parameters

-Xmx: This will set Java's memory usage, overriding autodetection. -Xmx20g will specify 20 gigs of RAM, and -Xmx200m will specify 200 megs. The max is typically 85% of physical memory. Memory usage scales with the number and size of bins being processed.
-eoom: This flag will cause the process to exit if an out-of-memory exception occurs. Requires Java 8u92+. Recommended for pipelines to ensure clean failure handling when memory is insufficient.
-da: Disable assertions. Removes runtime assertion checks for performance gains in production environments. Generally not recommended unless performance profiling indicates assertion overhead is significant.

Examples

Basic Bin Grading

gradebins.sh ref=assembly.fasta bin1.fa bin2.fa bin3.fa

Grades individual bin files against the original assembly. Each bin is assessed for completeness and contamination based on taxonomic labels in contig headers.
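
For this mode to work, the contig headers must already carry 'tid_X' labels. The following is a minimal Python sketch of one way to add them to a FASTA file when the true taxID of each contig is known (for example, in simulated data). The helper names and the choice to append '_tid_X' to the contig name are assumptions for illustration, not part of BBTools.

```python
def label_header(header, tax_id):
    """Insert '_tid_<taxID>' after the contig name unless a tid_ tag already exists.

    Hypothetical helper: the documentation only requires that 'tid_X'
    appears somewhere in the header.
    """
    if "tid_" in header:
        return header
    name, sep, rest = header.partition(" ")
    return f"{name}_tid_{tax_id}{sep}{rest}"


def label_fasta_lines(lines, contig_to_taxid):
    """Yield FASTA lines, tagging the header of every contig found in the mapping."""
    for line in lines:
        if line.startswith(">"):
            header = line[1:].rstrip("\r\n")
            name = header.partition(" ")[0]  # contig name is the first token
            if name in contig_to_taxid:
                yield ">" + label_header(header, contig_to_taxid[name]) + "\n"
                continue
        yield line
```

Contigs missing from the mapping are passed through unchanged, so they remain unlabeled.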

Directory-based Processing

gradebins.sh ref=assembly.fasta in=bins_directory report=bin_quality.tsv

Processes all bin files in a directory and generates a quality report. Uses Tools.getFileOrFiles() to discover all FASTA files in the specified directory.

Integration with CheckM2

gradebins.sh in=bins checkm=checkm2_results/quality_report.tsv report=combined_results.tsv

Uses external CheckM2 quality assessments instead of internal calculations. CheckM2 results are generally more accurate for prokaryotic genomes.

Quality Assessment with Multiple Tools

gradebins.sh ref=assembly.fasta in=bins \
    checkm=checkm2_results eukcc=eukcc_results \
    gtdb=gtdbtk_output gff=annotations.gff userna=t \
    report=comprehensive_report.tsv hist=size_hist.tsv ccplot=cc_plot.tsv

Integrates multiple external tools in a single run. Includes RNA-based quality criteria, taxonomic classification, and multiple output formats for visualization.

Fast Re-grading with Tax File

gradebins.sh taxin=genome_sizes.tsv in=bins report=results.tsv

Uses a pre-computed tax file instead of parsing the original assembly. Significantly faster for repeated analyses of the same dataset.
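
To inspect or sanity-check a tax file outside GradeBins, it can be read with a few lines of Python. This sketch assumes the tab-delimited taxID, size, contigs layout described under taxin, and additionally assumes that blank lines and lines starting with '#' can be skipped; the function name is hypothetical.

```python
import io


def load_tax_sizes(handle):
    """Return {taxID: (size, contigs)} from a tab-delimited tax file."""
    sizes = {}
    for line in handle:
        line = line.rstrip("\r\n")
        if not line or line.startswith("#"):  # assumed comment/header convention
            continue
        fields = line.split("\t")
        tax_id, size = int(fields[0]), int(fields[1])
        contigs = int(fields[2]) if len(fields) > 2 else 0
        sizes[tax_id] = (size, contigs)
    return sizes


# Made-up example content, not real output.
example = io.StringIO("#taxID\tsize\tcontigs\n562\t4600000\t12\n1280\t2800000\t5\n")
tax_sizes = load_tax_sizes(example)
```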

Algorithm Details

Contamination Calculation Algorithm

GradeBins implements a dual-mode contamination assessment system using the calcContam() method:

Internal Contamination Assessment

When external quality tools (CheckM2/EukCC) are not provided, GradeBins uses a taxonomic ID-based approach:

Taxonomic Mapping: For each bin, contigs are grouped by their taxonomic IDs (parsed from headers containing 'tid_X' patterns using BinObject.parseTaxID())
Dominant Taxon Identification: The taxonomic ID with the largest total sequence length becomes the bin's primary taxon
Completeness Calculation: completeness = dominant_taxon_size / expected_genome_size
Contamination Calculation: contamination = (total_bin_size - dominant_taxon_size) / total_bin_size
Bad Contig Counting: Contigs with taxonomic IDs different from the dominant taxon are flagged as potential contamination
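
The steps above can be sketched in Python as follows. This is an illustration of the described logic, not the GradeBins source: the regular expression stands in for BinObject.parseTaxID(), unlabeled contigs are grouped under -1, and the bin is assumed to be non-empty with its dominant taxon present in the size map.

```python
import re
from collections import defaultdict

TID_PATTERN = re.compile(r"tid_(\d+)")  # assumed header pattern


def parse_tax_id(header):
    """Return the taxID embedded in a header, or -1 if none is found."""
    match = TID_PATTERN.search(header)
    return int(match.group(1)) if match else -1


def grade_bin(contigs, genome_sizes):
    """Grade one bin.

    contigs: list of (header, length); genome_sizes: {taxID: expected size}.
    Returns (dominant taxID, completeness, contamination, bad contig count).
    """
    size_by_tax = defaultdict(int)
    for header, length in contigs:
        size_by_tax[parse_tax_id(header)] += length
    total = sum(size_by_tax.values())
    dominant = max(size_by_tax, key=size_by_tax.get)  # largest total length
    dominant_size = size_by_tax[dominant]
    completeness = dominant_size / genome_sizes[dominant]
    contamination = (total - dominant_size) / total
    bad_contigs = sum(1 for header, _ in contigs if parse_tax_id(header) != dominant)
    return dominant, completeness, contamination, bad_contigs
```

For a bin holding 4000 bases of a 5000-base genome plus 1000 bases from another taxon, this gives a completeness of 0.8 and a contamination of 0.2.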

External Tool Integration

When CheckM2 and/or EukCC results are provided through loadCheckM() and loadEukCC():

Best Score Selection: Compares completeness scores using checkm.completeness>=eukcc.completeness and selects the higher value
Quality Metrics Adoption: Uses the completeness and contamination values from the selected tool via c.completeness=best.completeness; c.contam=best.contam
Tool Prioritization: CheckM2 is generally preferred for prokaryotic genomes, EukCC for eukaryotic
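
A minimal Python sketch of the selection rule quoted above. The record type and function name are hypothetical, and the handling of a missing result (use whichever tool reported the bin) is an assumption rather than documented behavior.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class QualityRecord:
    completeness: float  # fraction, 0-1
    contam: float        # fraction, 0-1


def pick_best(checkm: Optional[QualityRecord],
              eukcc: Optional[QualityRecord]) -> Optional[QualityRecord]:
    """Pick the record with higher completeness; ties go to CheckM2 (>=)."""
    if checkm is None:
        return eukcc
    if eukcc is None:
        return checkm
    return checkm if checkm.completeness >= eukcc.completeness else eukcc
```

Because the comparison looks only at completeness, the selected record can carry the higher contamination of the two.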

Quality Classification System

Bins are classified into quality tiers using the printBinQuality() method with hardcoded thresholds:

UHQ (Ultra High Quality): ≥99% complete and ≤1% contamination (comp>=0.99f && contam<=0.01f)
VHQ (Very High Quality): ≥95% complete and ≤2% contamination (comp>=0.95f && contam<=0.02f)
HQ (High Quality): ≥90% complete and ≤5% contamination (contam<=0.05f && comp>=0.9f)
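MQ (Medium Quality): ≥50% complete and ≤10% contamination but not HQ (contam<0.10f && comp>=0.5f; the quoted comparison is strict, so a bin at exactly 10% contamination fails it)
LQ (Low Quality): <50% complete or >10% contamination
VLQ (Very Low Quality, subset of LQ): <20% complete or >20% contamination according to the quoted condition (contam>0.20f || comp<0.20f); the bin definitions in the usage text list >5% contamination instead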
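
The tier list can be expressed as a small Python function that returns the most specific label for a bin. This is a sketch, not the printBinQuality() source: it uses the documented ≤10% cutoff for MQ and, for VLQ, the 20% contamination cutoff from the quoted condition so that VLQ stays a subset of LQ. Both choices are assumptions.

```python
def classify(comp, contam):
    """Return the most specific quality tier for completeness/contamination fractions."""
    if comp >= 0.99 and contam <= 0.01:
        return "UHQ"
    if comp >= 0.95 and contam <= 0.02:
        return "VHQ"
    if comp >= 0.90 and contam <= 0.05:
        return "HQ"
    if comp >= 0.50 and contam <= 0.10:
        return "MQ"
    if comp < 0.20 or contam > 0.20:  # assumed VLQ cutoff, see lead-in
        return "VLQ"
    return "LQ"
```

Since UHQ is a subset of VHQ and VHQ a subset of HQ, a bin labeled UHQ here also satisfies the VHQ and HQ definitions.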

RNA-Enhanced Quality Assessment

When userna=t is enabled, high-quality classifications require essential RNA content validation:

rRNA Requirements: At least 1 copy each of 16S, 23S, and 5S ribosomal RNAs (b.r16Scount>0 && b.r23Scount>0 && b.r5Scount>0)
tRNA Requirements: At least 18 tRNA genes (b.trnaCount>=18)
Gene Detection: Uses either provided GFF annotations via annotate() method or internal gene calling with callGenes()
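
The RNA requirement amounts to a single boolean test. A minimal Python sketch, with hypothetical parameter names mirroring the counters quoted above:

```python
def has_required_rnas(r16s, r23s, r5s, trna, min_trna=18):
    """True if a bin has at least one 16S, 23S, and 5S rRNA and enough tRNAs."""
    return r16s > 0 and r23s > 0 and r5s > 0 and trna >= min_trna
```

A bin that meets the HQ completeness and contamination thresholds but fails this test would not qualify as HQ when userna=t.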

Scoring Metrics

GradeBins calculates several aggregate metrics using the printScore() method for dataset-level assessment:

Completeness Score: (sum of completeness × size) / total_size, accumulated as compltScore+=Math.round(bin.complt*(bin.size-contam))
Contamination Score: (sum of contamination × size) / total_size, accumulated as contamScore+=contam, where contam appears to be the bin's contaminant base count (contamination × size) rather than a fraction
Total Score: sum of (completeness - 5 × contamination)², accumulated as totalScore2+=score*score where score=Math.max(0, bin.complt-5*bin.contam), so bins with a negative score contribute zero
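
The three formulas can be reproduced from per-bin values with a short Python sketch. It follows the formulas as written, not the rounding in the quoted Java, and assumes that total size defaults to the summed size of the bins supplied; the function name is hypothetical.

```python
def dataset_scores(bins, total_size=None):
    """Compute (completeness score, contamination score, total score).

    bins: iterable of (size, completeness, contam) with fractions in 0-1.
    total_size: denominator for the weighted scores; defaults to the sum
    of bin sizes (an assumption) and must be non-zero.
    """
    bins = list(bins)
    if total_size is None:
        total_size = sum(size for size, _, _ in bins)
    completeness_score = sum(comp * size for size, comp, _ in bins) / total_size
    contam_score = sum(contam * size for size, _, contam in bins) / total_size
    # Bins whose completeness - 5*contam is negative contribute zero.
    total_score = sum(max(0.0, comp - 5 * contam) ** 2 for _, comp, contam in bins)
    return completeness_score, contam_score, total_score
```

The factor of 5 means that a bin with 50% completeness and 10% contamination adds nothing to the total score.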

Performance Implementation

Multithreaded Loading: Bins are loaded in parallel using ProcessThread class with configurable thread pools (default enabled via loadMT=true)
Memory-Efficient Parsing: Uses ConcurrentReadInputStream for streaming FASTA parsing to minimize memory footprint for large datasets
Tax File Caching: Pre-computed genome size maps using loadTaxIn() eliminate repeated assembly parsing
Thread Scaling: Thread count uses Shared.threads() with scaling limits: if(threads>16) {threads=Tools.mid(16, threads/2, 32);}
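
The thread scaling rule can be checked with a small Python sketch. It assumes that Tools.mid() returns the median of its three arguments, which makes the rule "half the thread count, clamped to the range 16 to 32".

```python
def mid(a, b, c):
    """Median of three values (assumed behavior of Tools.mid)."""
    return sorted((a, b, c))[1]


def loader_threads(threads):
    """Apply the quoted scaling rule to a thread count."""
    if threads > 16:
        threads = mid(16, threads // 2, 32)  # integer division, as in Java
    return threads
```

Under this reading, a 48-thread machine loads bins with 24 threads, and machines with 64 or more threads are capped at 32.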

Integration with BBTools Ecosystem

GradeBins leverages multiple BBTools components:

QuickClade: K-mer based taxonomic classification using callTax() method for bins without taxonomic labels
Gene Calling: Uses CallGenes class and ProkObject for internal prokaryotic gene detection
Coverage Integration: Incorporates QuickBin coverage data using DataLoader.loadCovFile() for depth-aware analysis
Chart Generation: Uses ChartMaker class for histogram and plotting capabilities for quality visualization

Output Formats

Quality Report Format

The main report file generated by printClusterReport() contains the following columns:

Bin: Bin file name
Size: Total sequence length in base pairs
Contigs: Number of contigs in the bin
GC: Average GC content percentage
Depth: Average sequencing depth
MinDepth/MaxDepth: Depth range across contigs
Completeness: Estimated genome completeness (0-1)
Contam: Estimated contamination fraction (0-1)
TaxID: Primary taxonomic identifier
Type: Quality classification (UHQ/VHQ/HQ/MQ/LQ/VLQ)
RNA Counts: 16S, 18S, 23S, 5S, tRNA counts (if enabled)
Gene Counts: CDS count and total length (if enabled)
Lineage: Full taxonomic lineage (if GTDB data provided)
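
A report in this layout can be filtered downstream with a few lines of Python. This sketch assumes the file starts with a tab-delimited header line using the column names listed above, possibly prefixed with '#'; the exact header text is not confirmed here, and the sample rows are made up.

```python
import csv
import io


def read_report(handle):
    """Read a tab-delimited report into a list of dicts keyed by column name."""
    reader = csv.reader(handle, delimiter="\t")
    header = [name.lstrip("#") for name in next(reader)]
    return [dict(zip(header, row)) for row in reader if row]


def bins_of_type(rows, wanted):
    """Return the names of bins whose Type is in the wanted set."""
    return [row["Bin"] for row in rows if row.get("Type") in wanted]


# Made-up example content, not real GradeBins output.
sample = io.StringIO(
    "#Bin\tSize\tContigs\tCompleteness\tContam\tType\n"
    "bin1.fa\t4000000\t50\t0.97\t0.01\tVHQ\n"
    "bin2.fa\t1000000\t200\t0.40\t0.15\tLQ\n"
)
rows = read_report(sample)
high_quality = bins_of_type(rows, {"UHQ", "VHQ", "HQ"})
```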

Summary Statistics

Console output includes dataset-level statistics from multiple methods:

Recovery Metrics: Percentage of sequences and contigs recovered in bins (printScore())
Quality Distribution: Counts of bins in each quality tier (printBinQuality())
Contamination Analysis: Clean vs. dirty bin statistics (printCleanDirty())
Taxonomic Diversity: Unique taxa counts at different taxonomic levels (printTaxLevels())
Performance Scores: Aggregate completeness, contamination, and total scores (printScore())

Support

For questions and support:

Email: bbushnell@lbl.gov
Documentation: bbmap.org