GradeBins
Grades metagenome bins for completeness and contamination. Contigs can be labeled with their taxID, in which case each header should contain 'tid_X' somewhere, where X is a number unique to the contig's proper genome. Alternatively, CheckM2 and/or EukCC output can be supplied. Do not include a 'chaff' file (for unbinned contigs) when grading.
Scores:
- Completeness Score: (sum of completeness*size)/(total size) for all bins
- Contamination Score: (sum of contam*size)/(total size) for all bins
- Total Score: (sum of (completeness-5*contam)^2) for all bins
Bin Definitions:
- UHQ: >=99% complete and <=1% contam (subset of VHQ)
- VHQ: >=95% complete and <=2% contam (subset of HQ)
- HQ: >=90% complete and <=5% contam
- MQ: >=50% complete and <=10% contam, but not HQ
- LQ: <50% complete or >10% contam
- VLQ: <20% complete or >5% contam (subset of LQ)
Basic Usage
gradebins.sh ref=assembly bin*.fa
gradebins.sh ref=assembly.fa in=bin_directory
gradebins.sh taxin=tax.txt in=bins
GradeBins evaluates the quality of metagenome bins by calculating completeness and contamination metrics. It can process bins with labeled contigs (containing taxID information) or integrate results from external quality assessment tools like CheckM2 and EukCC.
Parameters
Parameters are grouped by function: input, output, processing, and Java settings.
Input parameters
- ref=<file>
- The original assembly that was binned. Required for calculating completeness when not using taxin. The reference assembly is used to build a size map for each taxonomic ID to determine expected genome sizes.
- in=<directory>
- Location of bin fastas. Accepts either individual files or a directory containing multiple bin files; directories are scanned using Tools.getFileOrFiles().
- checkm=<file>
- Optional CheckM2 quality_report.tsv file or directory. If a directory is provided, looks for quality_report.tsv within it. CheckM2 results take precedence over internal calculations when available.
- eukcc=<file>
- Optional EukCC eukcc.csv file or directory. If a directory is provided, looks for eukcc.csv within it. Used for eukaryotic bin assessment and compared with CheckM2 results to select the best quality scores.
- cami=<file>
- Optional binning file from CAMI which indicates contig TaxIDs. Provides taxonomic labels for contigs in standardized CAMI format, overriding any taxID information parsed from contig headers.
- taxin=<file>
- Optional file with taxIDs and sizes, used instead of loading ref. Does not need to include taxIDs. Tab-delimited format with columns: taxID, size, contigs. The tax file loads faster than the reference because the assembly does not have to be parsed.
- gtdb=<file>
- Optional gtdbtk file. Can be a single file or directory containing gtdbtk.bac120.summary.tsv and gtdbtk.ar53.summary.tsv files. Provides GTDB taxonomic classifications for lineage reporting.
- gff=<file>
- Optional gff file. Used for rRNA and tRNA annotation when userna=t is enabled. Provides gene annotations necessary for high-quality genome determination based on essential RNA content.
- imgmap=<file>
- Optional IMG map file, for renamed IMG gff input. Maps between original and renamed contig identifiers in IMG (Integrated Microbial Genomes) datasets to ensure proper GFF annotation matching.
- spectra=<file>
- Optional path to QuickClade index. Enables taxonomic classification using k-mer based spectra matching when quickclade=t is set. Uses pre-built reference databases for rapid taxonomic assignment.
- cov=<file>
- Optional path to QuickBin coverage file. Provides per-contig coverage information for depth-aware analysis and reporting. Coverage data is incorporated into bin statistics and quality assessment.
- loadmt=t
- Load bins multithreaded. Default: true. Bins are loaded in parallel using Shared.threads() threads; when threads>16, the count is adjusted to Tools.mid(16, threads/2, 32).
Output parameters
- report=<file>
- Report on bin size, quality, and taxonomy. Generates tab-delimited report with columns for bin name, size, contig count, GC content, depth, completeness, contamination, taxonomic ID, quality type, and optional RNA/gene counts.
- taxout=<file>
- Generate a tax file from the reference (for use with taxin). Creates a tab-delimited file with taxID, size, and contig count that can be reused in subsequent runs to avoid re-parsing the reference assembly.
- hist=<file>
- Cumulative bin size and contamination histogram. Generates histogram data showing the distribution of bin sizes and contamination levels for visualization and analysis of binning quality across the dataset.
- ccplot=<file>
- Per-bin completeness/contam data. Outputs a simple two-column format with completeness and contamination values for each bin, suitable for creating completeness vs contamination scatter plots.
- contamhist=<file>
- Histogram plotting #bins or bases vs %contam. Creates histogram data showing the distribution of contamination percentages across bins, enabling assessment of overall dataset contamination patterns.
Processing parameters
- userna=f
- Require rRNAs and tRNAs for HQ genomes. This needs either a gff file or the callgenes flag. Specifically, HQ and subtypes require at least 1 16S, 23S, and 5S, plus 18 tRNAs. When enabled, applies stricter quality criteria based on essential RNA content.
- callgenes=f
- Call rRNAs and tRNAs. Suboptimal for some RNA types. Enables internal gene calling using built-in algorithms. Less accurate than external annotation tools but provides basic RNA detection when GFF files are unavailable.
- aligner=ssa2
- Do not change this. Specifies the internal aligner type used for gene calling and sequence analysis. The ssa2 aligner is optimized for the specific requirements of bin quality assessment.
- quickclade=f
- Assign taxonomy using QuickClade. Enables k-mer based taxonomic classification using the QuickClade algorithm. Requires spectra parameter to specify the reference database index.
Java Parameters
- -Xmx
- This will set Java's memory usage, overriding autodetection. -Xmx20g will specify 20 gigs of RAM, and -Xmx200m will specify 200 megs. The max is typically 85% of physical memory. Memory usage scales with the number and size of bins being processed.
- -eoom
- This flag will cause the process to exit if an out-of-memory exception occurs. Requires Java 8u92+. Recommended for pipelines to ensure clean failure handling when memory is insufficient.
- -da
- Disable assertions. Removes runtime assertion checks for performance gains in production environments. Generally not recommended unless performance profiling indicates assertion overhead is significant.
Examples
Basic Bin Grading
gradebins.sh ref=assembly.fasta bin1.fa bin2.fa bin3.fa
Grades individual bin files against the original assembly. Each bin is assessed for completeness and contamination based on taxonomic labels in contig headers.
Directory-based Processing
gradebins.sh ref=assembly.fasta in=bins_directory report=bin_quality.tsv
Processes all bin files in a directory and generates a quality report. Uses Tools.getFileOrFiles() to discover all FASTA files in the specified directory.
Integration with CheckM2
gradebins.sh in=bins checkm=checkm2_results/quality_report.tsv report=combined_results.tsv
Uses external CheckM2 quality assessments instead of internal calculations. CheckM2 results are generally more accurate for prokaryotic genomes.
Quality Assessment with Multiple Tools
gradebins.sh ref=assembly.fasta in=bins \
checkm=checkm2_results eukcc=eukcc_results \
gtdb=gtdbtk_output gff=annotations.gff userna=t \
report=comprehensive_report.tsv hist=size_hist.tsv ccplot=cc_plot.tsv
Quality assessment integrating multiple external tools. Includes RNA-based quality criteria, taxonomic classification, and multiple output formats for visualization.
Fast Re-grading with Tax File
gradebins.sh taxin=genome_sizes.tsv in=bins report=results.tsv
Uses a pre-computed tax file instead of parsing the original assembly. Significantly faster for repeated analyses of the same dataset.
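For readers who want to inspect or post-process a tax file, the following is a minimal sketch of reading one in Java. It is not GradeBins' own loading code. It assumes the three tab-separated columns listed under taxin (taxID, size, contigs), and it assumes that lines starting with '#' are header lines to skip, which the text above does not confirm. The class and method names are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical reader for a taxID/size/contigs table; not part of BBTools. */
final class TaxFileSketch {

    /** Maps taxID to genome size in bases, skipping blank and '#' lines (an assumption). */
    static Map<Integer, Long> parseSizes(List<String> lines) {
        Map<Integer, Long> sizeByTaxId = new LinkedHashMap<>();
        for (String line : lines) {
            if (line.isBlank() || line.startsWith("#")) continue;
            String[] cols = line.split("\t");
            int taxId = Integer.parseInt(cols[0].trim());
            long size = Long.parseLong(cols[1].trim());
            sizeByTaxId.merge(taxId, size, Long::sum); // tolerate repeated taxIDs by summing
        }
        return sizeByTaxId;
    }

    public static void main(String[] args) {
        // Made-up rows for illustration only.
        List<String> demo = List.of("#taxID\tsize\tcontigs", "562\t4600000\t12", "1280\t2800000\t7");
        System.out.println(parseSizes(demo));
    }
}
```

To read a real file, the lines could be obtained with java.nio.file.Files.readAllLines(Path.of("genome_sizes.tsv")) and passed to parseSizes.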
Algorithm Details
Contamination Calculation Algorithm
GradeBins implements a dual-mode contamination assessment system using the calcContam() method:
Internal Contamination Assessment
When external quality tools (CheckM2/EukCC) are not provided, GradeBins uses a taxonomic ID-based approach:
- Taxonomic Mapping: For each bin, contigs are grouped by their taxonomic IDs (parsed from headers containing 'tid_X' patterns using BinObject.parseTaxID())
- Dominant Taxon Identification: The taxonomic ID with the largest total sequence length becomes the bin's primary taxon
- Completeness Calculation:
completeness = dominant_taxon_size / expected_genome_size
- Contamination Calculation:
contamination = (total_bin_size - dominant_taxon_size) / total_bin_size
- Bad Contig Counting: Contigs with taxonomic IDs different from the dominant taxon are flagged as potential contamination
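The steps above can be sketched as follows. This is an illustration of the two formulas, not the calcContam() implementation: the regular expression for 'tid_X', the handling of unlabeled contigs (counted toward bin size but never chosen as the dominant taxon), and all class and method names are assumptions made for the example.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical illustration of taxID-based grading; not the BBTools implementation. */
final class InternalGradeSketch {
    private static final Pattern TID = Pattern.compile("tid_(\\d+)");

    /** Returns the number following 'tid_' in a contig header, or -1 if absent. */
    static int parseTaxId(String header) {
        Matcher m = TID.matcher(header);
        return m.find() ? Integer.parseInt(m.group(1)) : -1;
    }

    /** Returns {dominantTaxId, completeness, contamination} for one bin. */
    static double[] grade(Map<String, Long> contigLengths, Map<Integer, Long> genomeSizes) {
        Map<Integer, Long> basesByTaxId = new HashMap<>();
        long binSize = 0;
        for (Map.Entry<String, Long> e : contigLengths.entrySet()) {
            int taxId = parseTaxId(e.getKey());
            binSize += e.getValue();
            if (taxId >= 0) basesByTaxId.merge(taxId, e.getValue(), Long::sum);
        }
        int dominant = -1;
        long dominantSize = 0;
        for (Map.Entry<Integer, Long> e : basesByTaxId.entrySet()) {
            if (e.getValue() > dominantSize) { // taxon with the most bases wins
                dominant = e.getKey();
                dominantSize = e.getValue();
            }
        }
        Long expected = genomeSizes.get(dominant);
        double completeness = (expected == null || expected == 0) ? 0 : dominantSize / (double) expected;
        double contamination = binSize == 0 ? 0 : (binSize - dominantSize) / (double) binSize;
        return new double[] {dominant, completeness, contamination};
    }

    public static void main(String[] args) {
        // Made-up contig headers and lengths for illustration only.
        Map<String, Long> bin = Map.of("c1 tid_562", 900L, "c2 tid_562", 100L, "c3 tid_1280", 250L);
        double[] r = grade(bin, Map.of(562, 2000L));
        System.out.println("taxID=" + (int) r[0] + " completeness=" + r[1] + " contamination=" + r[2]);
    }
}
```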
External Tool Integration
When CheckM2 and/or EukCC results are provided through loadCheckM() and loadEukCC():
- Best Score Selection: Compares completeness scores using checkm.completeness>=eukcc.completeness and selects the higher value
- Quality Metrics Adoption: Uses the completeness and contamination values from the selected tool via c.completeness=best.completeness; c.contam=best.contam
- Tool Prioritization: CheckM2 is generally preferred for prokaryotic genomes, EukCC for eukaryotic
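The selection rule can be illustrated with a small sketch. The QualityEstimate class below is hypothetical and exists only for this example; the comparison mirrors the quoted checkm.completeness>=eukcc.completeness expression, and the handling of a missing result (fall back to whichever tool reported) is an assumption.

```java
/** Hypothetical holder for one tool's estimate; not a BBTools class. */
final class QualityEstimate {
    final String tool;
    final double completeness;
    final double contam;

    QualityEstimate(String tool, double completeness, double contam) {
        this.tool = tool;
        this.completeness = completeness;
        this.contam = contam;
    }

    /** Picks CheckM2 when its completeness is at least EukCC's; ties go to CheckM2. */
    static QualityEstimate best(QualityEstimate checkm, QualityEstimate eukcc) {
        if (checkm == null) return eukcc; // assumption: use whichever result exists
        if (eukcc == null) return checkm;
        return checkm.completeness >= eukcc.completeness ? checkm : eukcc;
    }

    public static void main(String[] args) {
        // Made-up values for illustration only.
        QualityEstimate checkm = new QualityEstimate("CheckM2", 0.40, 0.01);
        QualityEstimate eukcc = new QualityEstimate("EukCC", 0.85, 0.03);
        QualityEstimate chosen = best(checkm, eukcc);
        System.out.println(chosen.tool + " " + chosen.completeness + " " + chosen.contam);
    }
}
```

Because both completeness and contamination are taken from the same selected estimate, the two values always come from one tool rather than being mixed.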
Quality Classification System
Bins are classified into quality tiers using the printBinQuality() method with hardcoded thresholds:
- UHQ (Ultra High Quality): ≥99% complete and ≤1% contamination (comp>=0.99f && contam<=0.01f)
- VHQ (Very High Quality): ≥95% complete and ≤2% contamination (comp>=0.95f && contam<=0.02f)
- HQ (High Quality): ≥90% complete and ≤5% contamination (contam<=0.05f && comp>=0.9f)
- MQ (Medium Quality): ≥50% complete and ≤10% contamination but not HQ (contam<0.10f && comp>=0.5f)
- LQ (Low Quality): <50% complete or >10% contamination
- VLQ (Very Low Quality): <20% complete or >5% contamination (contam>0.20f || comp<0.20f)
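As a rough illustration of how the tiers nest, the sketch below returns the most specific tier for a bin. It is not printBinQuality(): the class and method names are hypothetical, and the cutoffs follow the written definitions (so MQ uses ≤10%, whereas the quoted expression uses a strict <). VLQ is left out because the written definition and the quoted expression give different contamination cutoffs (5% versus 0.20), and this sketch does not guess which applies.

```java
/** Hypothetical tier classifier built from the written bin definitions; not BBTools code. */
final class BinTierSketch {

    /** Returns the most specific tier for completeness and contamination given as fractions in [0,1]. */
    static String tier(double comp, double contam) {
        if (comp >= 0.99 && contam <= 0.01) return "UHQ";
        if (comp >= 0.95 && contam <= 0.02) return "VHQ";
        if (comp >= 0.90 && contam <= 0.05) return "HQ";
        if (comp >= 0.50 && contam <= 0.10) return "MQ";
        return "LQ";
    }

    public static void main(String[] args) {
        // Made-up bins for illustration only.
        double[][] bins = {{0.995, 0.005}, {0.96, 0.015}, {0.92, 0.04}, {0.60, 0.08}, {0.40, 0.00}};
        for (double[] b : bins) {
            System.out.println(b[0] + "\t" + b[1] + "\t" + tier(b[0], b[1]));
        }
    }
}
```

Checking the strictest tier first means a UHQ bin is reported once, even though it also satisfies the VHQ and HQ conditions.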
RNA-Enhanced Quality Assessment
When userna=t is enabled, high-quality classifications require essential RNA content validation:
- rRNA Requirements: At least 1 copy each of 16S, 23S, and 5S ribosomal RNAs (b.r16Scount>0 && b.r23Scount>0 && b.r5Scount>0)
- tRNA Requirements: At least 18 tRNA genes (b.trnaCount>=18)
- Gene Detection: Uses either provided GFF annotations via the annotate() method or internal gene calling with callGenes()
Scoring Metrics
GradeBins calculates several aggregate metrics using the printScore() method for dataset-level assessment:
- Completeness Score: (sum of completeness × size) / total_size, implemented as compltScore+=Math.round(bin.complt*(bin.size-contam))
- Contamination Score: (sum of contamination × size) / total_size, implemented as contamScore+=contam
- Total Score: sum of (completeness - 5 × contamination)², implemented as totalScore2+=score*score, where score=Math.max(0, bin.complt-5*bin.contam)
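A small worked sketch may help when checking these numbers by hand. It follows the written formulas rather than the accumulator expressions quoted from printScore(), and it assumes that "total size" means the summed size of the bins being scored; the per-bin score is clamped at zero as in the quoted Math.max expression. The class name and input layout are hypothetical.

```java
/** Hypothetical calculator for the three aggregate scores; not the printScore() implementation. */
final class ScoreSketch {

    /**
     * Each row is {sizeInBases, completeness, contamination}, with fractions in [0,1].
     * Returns {completenessScore, contaminationScore, totalScore}.
     */
    static double[] scores(double[][] bins) {
        double totalSize = 0, compSum = 0, contamSum = 0, total = 0;
        for (double[] b : bins) {
            totalSize += b[0];
            compSum += b[1] * b[0];   // completeness weighted by bin size
            contamSum += b[2] * b[0]; // contamination weighted by bin size
            double s = Math.max(0, b[1] - 5 * b[2]); // clamp negative scores to zero
            total += s * s;
        }
        if (totalSize == 0) return new double[] {0, 0, total};
        return new double[] {compSum / totalSize, contamSum / totalSize, total};
    }

    public static void main(String[] args) {
        // Made-up bins for illustration only.
        double[][] bins = {{1000, 1.0, 0.0}, {3000, 0.5, 0.1}};
        double[] s = scores(bins);
        System.out.println("completeness=" + s[0] + " contamination=" + s[1] + " total=" + s[2]);
    }
}
```

The factor of 5 means a bin with 10% contamination loses 0.5 from its score, so a half-complete bin at that contamination level contributes nothing to the Total Score.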
Performance Implementation
- Multithreaded Loading: Bins are loaded in parallel using ProcessThread class with configurable thread pools (default enabled via loadMT=true)
- Memory-Efficient Parsing: Uses ConcurrentReadInputStream for streaming FASTA parsing to minimize memory footprint for large datasets
- Tax File Caching: Pre-computed genome size maps using loadTaxIn() eliminate repeated assembly parsing
- Thread Scaling: Thread count uses Shared.threads() with scaling limits: if(threads>16) {threads=Tools.mid(16, threads/2, 32);}
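To see what the thread scaling rule does for a given machine, the sketch below reproduces the quoted expression with a local stand-in for Tools.mid. It assumes Tools.mid returns the middle value of its three arguments; the mid() helper and the class name here are hypothetical and are not the BBTools code.

```java
/** Hypothetical reproduction of the quoted thread scaling rule; not BBTools code. */
final class ThreadScalingSketch {

    /** Median of three ints; a stand-in for Tools.mid, assumed to behave the same way. */
    static int mid(int a, int b, int c) {
        return Math.max(Math.min(a, b), Math.min(Math.max(a, b), c));
    }

    /** Leaves small thread counts alone; above 16, halves the count and keeps it within 16..32. */
    static int loaderThreads(int threads) {
        if (threads > 16) {
            threads = mid(16, threads / 2, 32);
        }
        return threads;
    }

    public static void main(String[] args) {
        for (int t : new int[] {8, 16, 20, 48, 128}) {
            System.out.println(t + " -> " + loaderThreads(t));
        }
    }
}
```

Under this reading, machines with 17 to 32 threads load with 16, and the loader never uses more than 32 threads regardless of core count.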
Integration with BBTools Ecosystem
GradeBins leverages multiple BBTools components:
- QuickClade: K-mer based taxonomic classification using callTax() method for bins without taxonomic labels
- Gene Calling: Uses CallGenes class and ProkObject for internal prokaryotic gene detection
- Coverage Integration: Incorporates QuickBin coverage data using DataLoader.loadCovFile() for depth-aware analysis
- Chart Generation: Uses the ChartMaker class to produce histograms and plots for quality visualization
Output Formats
Quality Report Format
The main report file generated by printClusterReport() contains the following columns:
- Bin: Bin file name
- Size: Total sequence length in base pairs
- Contigs: Number of contigs in the bin
- GC: Average GC content percentage
- Depth: Average sequencing depth
- MinDepth/MaxDepth: Depth range across contigs
- Completeness: Estimated genome completeness (0-1)
- Contam: Estimated contamination fraction (0-1)
- TaxID: Primary taxonomic identifier
- Type: Quality classification (UHQ/VHQ/HQ/MQ/LQ/VLQ)
- RNA Counts: 16S, 18S, 23S, 5S, tRNA counts (if enabled)
- Gene Counts: CDS count and total length (if enabled)
- Lineage: Full taxonomic lineage (if GTDB data provided)
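Because the report is tab-delimited, it is straightforward to tally bins downstream. The sketch below counts rows per value of a named column. It assumes the first line is a header row, possibly prefixed with '#', and that the quality column is literally named "Type"; neither detail is confirmed above, so check the header of an actual report first. The class and method names are hypothetical.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical tally over a tab-delimited report; not part of BBTools. */
final class ReportTallySketch {

    /** Counts rows per value of the named column; the first line must be the header. */
    static Map<String, Integer> countBy(List<String> lines, String column) {
        List<String> header = Arrays.asList(lines.get(0).replaceFirst("^#", "").split("\t"));
        int idx = header.indexOf(column);
        if (idx < 0) throw new IllegalArgumentException("No column named " + column);
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : lines.subList(1, lines.size())) {
            if (line.isBlank()) continue;
            String[] cols = line.split("\t");
            if (idx < cols.length) counts.merge(cols[idx], 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Made-up report rows with a reduced set of columns, for illustration only.
        List<String> demo = List.of("#Bin\tSize\tType", "bin1.fa\t100\tHQ", "bin2.fa\t200\tMQ", "bin3.fa\t300\tHQ");
        System.out.println(countBy(demo, "Type"));
    }
}
```

Looking the column up by name rather than by position keeps the tally working when optional columns such as RNA counts or lineage are present.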
Summary Statistics
Console output includes dataset-level statistics from multiple methods:
- Recovery Metrics: Percentage of sequences and contigs recovered in bins from printScore() method
- Quality Distribution: Counts of bins in each quality tier from printBinQuality() method
- Contamination Analysis: Clean vs dirty bin statistics from printCleanDirty() method
- Taxonomic Diversity: Unique taxa counts at different phylogenetic levels from printTaxLevels() method
- Performance Scores: Aggregate completeness, contamination, and total scores calculated by printScore()
Support
For questions and support:
- Email: bbushnell@lbl.gov
- Documentation: bbmap.org