GradeBins

Script: gradebins.sh Package: bin Class: GradeBins.java

Grades metagenome bins for completeness and contamination. The contigs can be labeled with their taxID, in which case the header should contain 'tid_X' somewhere, where X is a number unique to their proper genome. Alternatively, CheckM2 and/or EukCC output can be fed to it. Do not include a 'chaff' file (for unbinned contigs) when grading.

Completeness Score is (sum of completeness*size)/(total size) for all bins.
Contamination Score is (sum of contam*size)/(total size) for all bins.
Total Score is (sum of (completeness-5*contam)^2) for all bins.

Bin Definitions:
UHQ: >=99% complete and <=1% contam (subset of VHQ)
VHQ: >=95% complete and <=2% contam (subset of HQ)
HQ: >=90% complete and <=5% contam
MQ: >=50% complete and <=10% contam, but not HQ
LQ: <50% complete or >10% contam
VLQ: <20% complete or >5% contam (subset of LQ)

Basic Usage

gradebins.sh ref=assembly bin*.fa
gradebins.sh ref=assembly.fa in=bin_directory
gradebins.sh taxin=tax.txt in=bins

GradeBins evaluates the quality of metagenome bins by calculating completeness and contamination metrics. It can process bins with labeled contigs (containing taxID information) or integrate results from external quality assessment tools like CheckM2 and EukCC.
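As a rough illustration of the header labeling convention, the Python sketch below extracts the taxID from a contig header that contains 'tid_X'. The function name parse_tid and the regular expression are assumptions made for this example; the actual parsing in GradeBins.java may differ.

```python
import re

# Assumed pattern: the literal 'tid_' followed by digits, anywhere in the header.
TID_PATTERN = re.compile(r"tid_(\d+)")

def parse_tid(header):
    """Return the taxID embedded in a contig header, or None if there is none."""
    match = TID_PATTERN.search(header)
    return int(match.group(1)) if match else None
```

For example, parse_tid(">contig_12 tid_562 length=4000") yields 562, while a header without a 'tid_' label yields None.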

Parameters

Parameters are grouped by their role in the bin grading workflow: input, output, processing, and Java options.

Input parameters

ref=<file>
The original assembly that was binned. Required for calculating completeness when not using taxin. The reference assembly is used to build a size map for each taxonomic ID to determine expected genome sizes.
in=<directory>
Location of bin fastas. Accepts individual files or a directory containing multiple bin files; directories are scanned using Tools.getFileOrFiles().
checkm=<file>
Optional CheckM2 quality_report.tsv file or directory. If a directory is provided, looks for quality_report.tsv within it. CheckM2 results take precedence over internal calculations when available.
eukcc=<file>
Optional EukCC eukcc.csv file or directory. If a directory is provided, looks for eukcc.csv within it. Used for eukaryotic bin assessment and compared with CheckM2 results to select the best quality scores.
cami=<file>
Optional binning file from CAMI which indicates contig TaxIDs. Provides taxonomic labels for contigs in standardized CAMI format, overriding any taxID information parsed from contig headers.
taxin=<file>
Optional file with taxIDs and sizes, used instead of loading ref. Does not need to include taxIDs. Tab-delimited format with columns: taxID, size, contigs. Loads faster than ref because the assembly does not need to be parsed.
gtdb=<file>
Optional gtdbtk file. Can be a single file or directory containing gtdbtk.bac120.summary.tsv and gtdbtk.ar53.summary.tsv files. Provides GTDB taxonomic classifications for lineage reporting.
gff=<file>
Optional gff file. Used for rRNA and tRNA annotation when userna=t is enabled. Provides gene annotations necessary for high-quality genome determination based on essential RNA content.
imgmap=<file>
Optional IMG map file, for renamed IMG gff input. Maps between original and renamed contig identifiers in IMG (Integrated Microbial Genomes) datasets to ensure proper GFF annotation matching.
spectra=<file>
Optional path to QuickClade index. Enables taxonomic classification using k-mer based spectra matching when quickclade=t is set. Uses pre-built reference databases for rapid taxonomic assignment.
cov=<file>
Optional path to QuickBin coverage file. Provides per-contig coverage information for depth-aware analysis and reporting. Coverage data is incorporated into bin statistics and quality assessment.
loadmt=t
Load bins multithreaded. Default: true. Bins are loaded in parallel using Shared.threads() threads; when that count exceeds 16, it is limited to Tools.mid(16, threads/2, 32).
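The tax file described under taxin (tab-delimited columns: taxID, size, contigs) is simple enough to read with a few lines of code. The Python sketch below is one hedged way to load it; the function name load_tax_sizes is hypothetical, and the handling of blank lines and '#' comment lines is an assumption, since the exact file layout written by taxout is not specified here. It also assumes every row carries a taxID.

```python
def load_tax_sizes(lines):
    """Map taxID -> (size in bases, contig count) from tab-delimited lines."""
    sizes = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # assumption: blank lines and '#' comments are skipped
        fields = line.split("\t")
        # Assumed column order, as listed for taxin: taxID, size, contigs.
        tax_id, size, contigs = int(fields[0]), int(fields[1]), int(fields[2])
        sizes[tax_id] = (size, contigs)
    return sizes
```

The function accepts any iterable of lines, so an open file handle can be passed directly.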

Output parameters

report=<file>
Report on bin size, quality, and taxonomy. Generates tab-delimited report with columns for bin name, size, contig count, GC content, depth, completeness, contamination, taxonomic ID, quality type, and optional RNA/gene counts.
taxout=<file>
Generate a tax file from the reference (for use with taxin). Creates a tab-delimited file with taxID, size, and contig count that can be reused in subsequent runs to avoid re-parsing the reference assembly.
hist=<file>
Cumulative bin size and contamination histogram. Generates histogram data showing the distribution of bin sizes and contamination levels for visualization and analysis of binning quality across the dataset.
ccplot=<file>
Per-bin completeness/contam data. Outputs a simple two-column format with completeness and contamination values for each bin, suitable for creating completeness vs contamination scatter plots.
contamhist=<file>
Histogram plotting #bins or bases vs %contam. Creates histogram data showing the distribution of contamination percentages across bins, enabling assessment of overall dataset contamination patterns.

Processing parameters

userna=f
Require rRNAs and tRNAs for HQ genomes. This needs either a gff file or the callgenes flag. Specifically, HQ and its subtypes (VHQ, UHQ) require at least one each of 16S, 23S, and 5S rRNA, plus 18 tRNAs.
callgenes=f
Call rRNAs and tRNAs. Suboptimal for some RNA types. Enables internal gene calling using built-in algorithms. Less accurate than external annotation tools but provides basic RNA detection when GFF files are unavailable.
aligner=ssa2
Do not change this. Specifies the internal aligner type used for gene calling and sequence analysis. The ssa2 aligner is optimized for the specific requirements of bin quality assessment.
quickclade=f
Assign taxonomy using QuickClade. Enables k-mer based taxonomic classification using the QuickClade algorithm. Requires spectra parameter to specify the reference database index.

Java Parameters

-Xmx
This will set Java's memory usage, overriding autodetection. -Xmx20g will specify 20 gigs of RAM, and -Xmx200m will specify 200 megs. The max is typically 85% of physical memory. Memory usage scales with the number and size of bins being processed.
-eoom
This flag will cause the process to exit if an out-of-memory exception occurs. Requires Java 8u92+. Recommended for pipelines to ensure clean failure handling when memory is insufficient.
-da
Disable assertions. Removes runtime assertion checks for performance gains in production environments. Generally not recommended unless performance profiling indicates assertion overhead is significant.

Examples

Basic Bin Grading

gradebins.sh ref=assembly.fasta bin1.fa bin2.fa bin3.fa

Grades individual bin files against the original assembly. Each bin is assessed for completeness and contamination based on taxonomic labels in contig headers.

Directory-based Processing

gradebins.sh ref=assembly.fasta in=bins_directory report=bin_quality.tsv

Processes all bin files in a directory and generates a quality report. Uses Tools.getFileOrFiles() to discover all FASTA files in the specified directory.

Integration with CheckM2

gradebins.sh in=bins checkm=checkm2_results/quality_report.tsv report=combined_results.tsv

Uses external CheckM2 quality assessments instead of internal calculations. CheckM2 results are generally more accurate for prokaryotic genomes.

Quality Assessment with Multiple Tools

gradebins.sh ref=assembly.fasta in=bins \
    checkm=checkm2_results eukcc=eukcc_results \
    gtdb=gtdbtk_output gff=annotations.gff userna=t \
    report=comprehensive_report.tsv hist=size_hist.tsv ccplot=cc_plot.tsv

Quality assessment integrating multiple external tools. Includes RNA-based quality criteria, taxonomic classification, and multiple output formats for visualization.

Fast Re-grading with Tax File

gradebins.sh taxin=genome_sizes.tsv in=bins report=results.tsv

Uses a pre-computed tax file instead of parsing the original assembly. Significantly faster for repeated analyses of the same dataset.

Algorithm Details

Contamination Calculation Algorithm

GradeBins implements a dual-mode contamination assessment system using the calcContam() method:

Internal Contamination Assessment

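When external quality tools (CheckM2/EukCC) are not provided, GradeBins uses a taxonomic ID-based approach: contig taxIDs (from headers or a CAMI binning file) are compared against the per-taxID genome sizes built from ref or taxin.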
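The Python sketch below shows one plausible way a taxID-based grade could be computed for a single bin. It is an assumption for illustration, not the calcContam() implementation: it takes the taxID with the most bases as the bin's genome, defines completeness as those bases divided by the expected genome size, and defines contamination as the fraction of bin bases from any other taxID. The function name grade_bin is hypothetical.

```python
from collections import defaultdict

def grade_bin(contigs, genome_sizes):
    """contigs: list of (tax_id, length); genome_sizes: tax_id -> genome size.

    Returns (dominant_tax_id, completeness, contamination) as fractions.
    Assumes the bin contains at least one contig.
    """
    bases_by_tax = defaultdict(int)
    for tax_id, length in contigs:
        bases_by_tax[tax_id] += length
    bin_size = sum(bases_by_tax.values())
    # Assumption: the genome contributing the most bases owns the bin.
    dominant = max(bases_by_tax, key=bases_by_tax.get)
    dominant_bases = bases_by_tax[dominant]
    completeness = dominant_bases / genome_sizes[dominant]
    contamination = (bin_size - dominant_bases) / bin_size
    return dominant, completeness, contamination
```

With contigs [(1, 900), (1, 50), (2, 50)] and a 1000-base genome for taxID 1, this gives taxID 1 at 95% completeness and 5% contamination.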

External Tool Integration

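When CheckM2 and/or EukCC results are provided, they are loaded through loadCheckM() and loadEukCC() and take precedence over the internal calculation. Where both tools report on a bin, the results are compared and the best quality scores are used.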

Quality Classification System

Bins are classified into quality tiers by the printBinQuality() method, using the hardcoded thresholds listed under Bin Definitions in the description above.
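The tier thresholds from the Bin Definitions can be expressed as a small decision function. The Python sketch below is illustrative only (classify_bin is a hypothetical name, not a method of GradeBins.java) and returns the most specific tier for a bin, given completeness and contamination as percentages. It treats VLQ strictly as a subset of LQ, which is how the definitions are worded.

```python
def classify_bin(completeness, contam):
    """Return the most specific quality tier for percentages in [0, 100]."""
    if completeness >= 99 and contam <= 1:
        return "UHQ"
    if completeness >= 95 and contam <= 2:
        return "VHQ"
    if completeness >= 90 and contam <= 5:
        return "HQ"
    if completeness >= 50 and contam <= 10:
        return "MQ"
    # Everything else is LQ (<50% complete or >10% contam). VLQ is the
    # subset of LQ that is <20% complete or >5% contaminated.
    if completeness < 20 or contam > 5:
        return "VLQ"
    return "LQ"
```

Because the tiers nest, a bin labeled UHQ here also satisfies the VHQ and HQ definitions, and a VLQ bin also counts as LQ.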

RNA-Enhanced Quality Assessment

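When userna=t is enabled, HQ and its subtypes additionally require essential RNA content: at least one each of 16S, 23S, and 5S rRNA, plus 18 tRNAs. The counts come from a gff file or from internal gene calling (callgenes).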
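The RNA requirement stated for userna (at least one each of 16S, 23S, and 5S, plus 18 tRNAs) amounts to a simple predicate. The Python sketch below is a hedged illustration; the function name and argument names are hypothetical, and it assumes "18 tRNAs" means a raw count of tRNA genes. What tier a bin falls to when it fails the check is not stated in this document, so the sketch only reports whether the requirement is met.

```python
def meets_rna_requirements(count_16s, count_23s, count_5s, count_trna):
    """True if a bin has the RNA content required for HQ when userna=t."""
    has_all_rrnas = count_16s >= 1 and count_23s >= 1 and count_5s >= 1
    # Assumption: the tRNA threshold is a count of tRNA genes, not distinct types.
    return has_all_rrnas and count_trna >= 18
```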

Scoring Metrics

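GradeBins calculates several aggregate metrics using the printScore() method for dataset-level assessment: the size-weighted Completeness Score and Contamination Score, and the Total Score, as defined in the description at the top of this page.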
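The three score formulas given in the description can be written out directly. The Python sketch below follows them literally under two assumptions that the text does not settle: completeness and contamination are expressed as fractions in [0, 1], and "total size" is the summed size of the bins being graded. The function name dataset_scores is hypothetical.

```python
def dataset_scores(bins):
    """bins: non-empty list of (size, completeness, contam), fractions in [0, 1].

    Returns (completeness_score, contamination_score, total_score).
    """
    total_size = sum(size for size, _, _ in bins)
    # Size-weighted means over all bins.
    completeness_score = sum(c * size for size, c, _ in bins) / total_size
    contamination_score = sum(x * size for size, _, x in bins) / total_size
    # Sum of (completeness - 5*contam)^2, taken literally from the description.
    total_score = sum((c - 5 * x) ** 2 for _, c, x in bins)
    return completeness_score, contamination_score, total_score
```

Note that squaring makes the per-bin term positive even when 5*contam exceeds completeness; whether the real implementation handles that case differently is not specified here.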

Performance Implementation
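The one concrete performance detail stated in this document is the thread limit in the loadmt description: when more than 16 threads are available, the loader uses Tools.mid(16, threads/2, 32). The Python sketch below illustrates that rule, assuming Tools.mid returns the median of its three arguments and that threads/2 is integer division; both function names are stand-ins for this example.

```python
def mid(a, b, c):
    """Median of three values (assumed behaviour of Tools.mid)."""
    return sorted((a, b, c))[1]

def load_threads(threads):
    """Threads used for loading bins, given the available thread count."""
    if threads > 16:
        # Half the available threads, kept within the range 16..32.
        return mid(16, threads // 2, 32)
    return threads
```

Under these assumptions, 8 available threads gives 8 loader threads, 24 gives 16, 48 gives 24, and 128 gives 32.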

Integration with BBTools Ecosystem

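GradeBins leverages multiple BBTools components, including Tools.getFileOrFiles() for input discovery, Shared.threads() for thread allocation, QuickClade for optional taxonomic assignment, and the built-in gene caller for rRNA/tRNA detection (callgenes).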

Output Formats

Quality Report Format

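The main report file generated by printClusterReport() is tab-delimited, with columns for bin name, size, contig count, GC content, depth, completeness, contamination, taxonomic ID, quality type, and optional RNA/gene counts.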

Summary Statistics

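Console output includes dataset-level statistics from multiple methods, such as the aggregate scores reported by printScore() and the quality-tier summary from printBinQuality().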

Support

For questions and support: