QuickBin

Script: quickbin.sh Package: bin Class: QuickBin.java

Bins contigs using coverage and kmer frequencies for metagenome assembly analysis. Coverage can be calculated from reads/sam files or parsed from contig headers (Spades/Tadpole format). Supports multiple sam files from different samples for improved binning accuracy and uses neural networks with quantized GC-depth matrices for clustering decisions.

Basic Usage

# Basic binning with sam files
quickbin.sh in=contigs.fa out=bins *.sam covout=cov.txt

# Using pre-calculated coverage file
quickbin.sh in=contigs.fa out=bins cov=cov.txt

# Simple format (positional arguments)
quickbin.sh contigs.fa out=bins *.sam

QuickBin processes metagenomic assemblies to group contigs into bins representing individual organisms. Coverage information can be provided via sam/bam files or pre-calculated coverage files for faster rerunning.

Parameters

Parameters are organized by their function in the binning process. Coverage-based binning works best with multiple sam files from different samples of the same environment.

File parameters

in=<file>: Assembly input file in fasta format. Files ending in *.fa do not require the 'in=' prefix and can be specified as positional arguments.
reads=<file>: Read input in sam or bam format. Multiple sam files may be specified comma-delimited or as plain arguments without 'reads='. Each file is treated as an independent sample, and more samples generally improve binning accuracy.
covout=<file>: Output file for coverage statistics summarizing sam files. This allows much faster rerunning of QuickBin by skipping the sam file processing step.
cov=<file>: Pre-calculated coverage file generated by QuickBin via 'covout' parameter. Can be used instead of sam/bam files for faster processing. Files named cov*.txt are automatically recognized without the 'cov=' prefix.
out=<pattern>: Output pattern for binned contigs. If the pattern contains a % symbol (e.g., bin%.fa), one file will be created per bin with the % replaced by the bin number. Without %, all contigs are written to a single file with bin numbers in headers. Patterns without a '.' symbol (e.g., 'out=output') are treated as directories.
chaff: Enable writing small clusters below the minimum size threshold to a shared residual file instead of discarding them.

Size parameters

mincluster=50k: (mcs) Minimum output cluster size in base pairs. Smaller clusters will be written to a shared residual file if chaff=t, otherwise they are discarded from output.
mincontig=100: Minimum contig length to load and process. Contigs below this size are ignored, which reduces memory usage but may affect completeness for short contigs.
minseed=3000: Minimum contig length to initiate a new cluster. Reducing this value can dramatically increase speed for large metagenomes and increase sensitivity for small contigs, but may slightly increase contamination. For large metagenomes with only 1 sample, values below 2000 will run slowly; with 3+ samples, speed is less affected.
minresidue=200: Discard unclustered contigs shorter than this length after binning is complete. This reduces memory usage by removing very short sequences unlikely to contribute meaningfully to bins.
dumpsequence: (TODO) Future feature to discard sequence data after processing to reduce memory usage while retaining statistical information.
dumpheaders: (TODO) Future feature to discard header information after processing to reduce memory usage.
minpentamersize=2k: Minimum contig length for pentamer frequency analysis. Increasing this value reduces memory usage but may decrease binning accuracy for contigs near the threshold.

Stringency parameters

normal: Default stringency setting. Available stringency levels in order of increasing sensitivity: xstrict (xs), ustrict (us), vstrict (vs), strict (s), normal (n), loose (l), vloose (vl), uloose (ul), xloose (xl). 'Normal' aims for under 1% contamination, while 'uloose' is more comparable to other binners. Set stringency by adding the flag without an equals sign.
xstrict, xs: Extremely strict binning mode with maximum purity but lowest sensitivity. Uses stringency multiplier of 0.6.
ustrict, us: Ultra-strict binning mode with very high purity. Uses stringency multiplier of 0.7.
vstrict, vs: Very strict binning mode with high purity. Uses stringency multiplier of 0.8.
strict, s: Strict binning mode favoring purity over completeness. Uses stringency multiplier of 0.9.
loose, l: Loose binning mode favoring completeness over purity. Uses stringency multiplier of 1.125.
vloose, vl: Very loose binning mode with higher sensitivity. Uses stringency multiplier of 1.25.
uloose, ul: Ultra-loose binning mode with high sensitivity, comparable to other binners. Uses stringency multiplier of 1.375.
xloose, xl: Extremely loose binning mode with maximum sensitivity but higher contamination risk. Uses stringency multiplier of 1.5.
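
The flags and multipliers listed above can be collected into a small lookup table. The Python sketch below is illustrative only: the names STRINGENCY, ALIASES, and stringency_multiplier are hypothetical, and the 1.0 multiplier for 'normal' is an assumption, since no value for it is listed here.

```python
# Stringency flags and the multipliers listed above.
# The 1.0 entry for "normal" is an assumption; no value is documented for it.
STRINGENCY = {
    "xstrict": 0.6, "ustrict": 0.7, "vstrict": 0.8, "strict": 0.9,
    "normal": 1.0,
    "loose": 1.125, "vloose": 1.25, "uloose": 1.375, "xloose": 1.5,
}

# Short aliases accepted in place of the full flag names.
ALIASES = {
    "xs": "xstrict", "us": "ustrict", "vs": "vstrict", "s": "strict",
    "n": "normal",
    "l": "loose", "vl": "vloose", "ul": "uloose", "xl": "xloose",
}

def stringency_multiplier(flag):
    """Return the multiplier for a stringency flag or its short alias."""
    name = ALIASES.get(flag.lower(), flag.lower())
    if name not in STRINGENCY:
        raise ValueError(f"unknown stringency flag: {flag}")
    return STRINGENCY[name]
```

Read this way, multipliers below 1 tighten merge criteria and multipliers above 1 relax them.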

Quantization parameters

gcwidth=0.02: Width of GC content matrix gridlines for binning calculations. Smaller values give a finer grid, and each halving of the width can roughly double processing speed.
depthwidth=0.5: Width of coverage depth matrix gridlines on a log2 scale. A value of 0.5 creates 2 gridlines per power of 2 depth (at 0.707, 1, 1.414, 2, 2.828, 4, etc.). Smaller values give a finer grid.
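
To make the log2 scale concrete, the following Python sketch lists where depth gridlines fall for a given depthwidth. It only illustrates the arithmetic described above and is not taken from QuickBin; the function name and the exponent range are hypothetical.

```python
def depth_gridlines(depthwidth, lo_step=-1, hi_step=4):
    """Return gridline depths 2**(k * depthwidth) for integer steps k in [lo_step, hi_step]."""
    return [2.0 ** (k * depthwidth) for k in range(lo_step, hi_step + 1)]

# With depthwidth=0.5 there are two gridlines per power of 2.
print([round(d, 3) for d in depth_gridlines(0.5)])
# [0.707, 1.0, 1.414, 2.0, 2.828, 4.0]

# With depthwidth=1.0 there is one gridline per power of 2.
print(depth_gridlines(1.0, 0, 3))
# [1.0, 2.0, 4.0, 8.0]
```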

Neural network parameters

net=auto: Specify a neural network file for binning decisions. Default uses bbmap/resources/quickbin1D_all.bbnet, which provides clustering based on coverage and composition patterns using multi-layer k-mer frequency analysis.
cutoff=0.52: Neural network output threshold for binning decisions. Higher values increase specificity (fewer false positive merges), lower values increase sensitivity (more merges). This is a soft cutoff that moderates other stringency settings; for example, increasing it makes 'strict' mode even stricter.

Edge-processing parameters

e1=0: Number of edge-first clustering passes. Edge-first clustering uses read-pair connections to merge contigs early in the process, potentially increasing speed at the cost of purity.
e2=4: Number of later edge-based clustering passes. These occur after initial clustering and merge bins connected by paired reads, using the edgeStringency2 threshold.
edgeStringency1=0.25: Stringency threshold for edge-first clustering passes. Lower values are more stringent (require stronger evidence for merging).
edgeStringency2=1.1: Stringency threshold for later edge-based clustering passes. This is typically less stringent than edge-first clustering.
maxEdges=3: Maximum number of edges to follow per contig during edge-based clustering. Limiting edges prevents excessive merging of weakly connected contigs.
minEdgeWeight=2: Minimum number of read pairs required to create an edge between contigs. Edges with fewer supporting read pairs are ignored as potentially spurious connections.
minEdgeRatio=0.4: Minimum fraction of the maximum edge weight required to consider an edge. Edges below this fraction of the strongest edge from a contig are ignored to focus on the most significant connections.
goodEdgeMult=1.4: Stringency multiplier for contigs connected by edges. Lower values make edge-connected merging more stringent, requiring stronger evidence beyond the edge connection.
minmapq=20: Minimum mapping quality for reads used in edge creation. Reads with lower MAPQ scores are excluded from edge formation but still contribute to coverage depth. Setting to 0 allows ambiguously-mapped reads, potentially improving completeness but risking false edges.
minid=0.96: Minimum alignment identity for reads used in both edge creation and coverage calculation. Reads with lower identity are completely excluded from analysis.

Other parameters

sketchoutput=f: Use SendSketch to identify taxonomy of output clusters after binning. This provides taxonomic annotation of the resulting bins.
validate=f: Enable validation mode if contig headers contain taxonomic information (e.g., 'tid_1234'). This information will be parsed and used to evaluate binning correctness and calculate contamination/completeness metrics.
printcc=f: Print completeness and contamination statistics after each processing step. Useful for monitoring binning progress and quality during execution.
callssu=f: Identify 16S and 18S ribosomal RNA genes in contigs and prevent merging of clusters with incompatible SSU sequences. This helps maintain taxonomic coherence in bins.
minssuid=0.98: Minimum identity threshold for SSU compatibility. SSU sequences with identity below this threshold are considered incompatible and will prevent cluster merging.
aligner=quantum: Alignment algorithm for SSU comparison. The default is the quantum aligner; other available options include ssa2, glocal, drifting, banded, and crosscut.

Java Parameters

-Xmx: Set Java heap memory usage, overriding automatic detection. Use format -Xmx20g for 20 gigabytes or -Xmx200m for 200 megabytes. Maximum recommended is typically 85% of physical memory.
-eoom: Exit immediately if an out-of-memory exception occurs. Prevents hanging processes when memory is exhausted. Requires Java 8u92 or later.
-da: Disable Java assertions for slightly improved performance in production runs.
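
As a rough aid for choosing a heap size, the Python sketch below turns a physical memory size into an -Xmx flag at 85% of that size. The helper name xmx_flag is hypothetical, and it assumes you already know the machine's physical memory in bytes; it is a convenience sketch, not part of BBTools.

```python
def xmx_flag(physical_bytes, fraction=0.85):
    """Format a -Xmx flag using the given fraction of physical memory."""
    megabytes = int(physical_bytes * fraction) // (1024 * 1024)
    if megabytes >= 1024:
        # Round down to whole gigabytes, e.g. -Xmx27g.
        return f"-Xmx{megabytes // 1024}g"
    return f"-Xmx{megabytes}m"

print(xmx_flag(32 * 1024**3))  # -Xmx27g
```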

Examples

Basic Metagenome Binning

# Bin contigs using multiple sam files
quickbin.sh in=assembly.fa out=bin%.fa sample1.sam sample2.sam sample3.sam

# Save coverage for faster rerunning
quickbin.sh in=assembly.fa out=bin%.fa covout=coverage.txt *.sam

# Rerun with saved coverage
quickbin.sh in=assembly.fa out=bin%.fa cov=coverage.txt

Multiple sam files from different samples of the same environment improve binning accuracy by providing better coverage patterns for organism identification.
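
The save-then-rerun pattern above can be wrapped in a small script. The Python sketch below builds the argument list, reusing a coverage file when one already exists and otherwise writing one from the sam files. The function name quickbin_command is hypothetical; only the in=, out=, cov=, and covout= parameters described in this document are used.

```python
import os

def quickbin_command(assembly, out_pattern, sam_files, cov_file):
    """Build a quickbin.sh argument list, reusing cov_file if it already exists."""
    cmd = ["quickbin.sh", f"in={assembly}", f"out={out_pattern}"]
    if os.path.exists(cov_file):
        # Coverage was saved by an earlier run, so the sam files are not needed.
        cmd.append(f"cov={cov_file}")
    else:
        # First run: read the sam files and save coverage for next time.
        cmd.append(f"covout={cov_file}")
        cmd.extend(sam_files)
    return cmd
```

The resulting list can be passed to subprocess.run(cmd, check=True). Because no shell is involved, wildcards such as *.sam are not expanded; build the list with glob.glob("*.sam") instead.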

Stringency Adjustment

# High purity binning (low contamination)
quickbin.sh in=assembly.fa out=bins strict *.sam

# High completeness binning (may have more contamination)
quickbin.sh in=assembly.fa out=bins uloose *.sam

# Custom neural network threshold
quickbin.sh in=assembly.fa out=bins cutoff=0.6 *.sam

Adjust stringency based on your priorities: strict modes for high purity, loose modes for high completeness.

Large Metagenome Optimization

# Optimize for speed and memory usage
quickbin.sh in=large_assembly.fa out=bins \
    minseed=2000 mincontig=500 minpentamersize=5k \
    gcwidth=0.04 depthwidth=0.75 *.sam

For large metagenomes, raise mincontig and minpentamersize to reduce memory usage, and lower minseed to improve speed. The wider gcwidth and depthwidth values shrink the quantization matrices.

PacBio Metagenome Workflow

# Create synthetic paired reads from PacBio CCS
randomreadsmg.sh in=ccs.fa out=synth.fq depth=10 variance=0 \
    paired length=250 avginsert=600

# Align synthetic reads
bbmap.sh ref=contigs.fa in=synth.fq ambig=random mateqtag \
    minid=0.9 maxindel=10 out=mapped.sam

# Bin using synthetic read alignments
quickbin.sh in=contigs.fa out=bins mapped.sam

For PacBio-only metagenomes, create synthetic paired reads to provide the paired-end information needed for edge-based clustering.

Algorithm Details

QuickBin implements a multi-stage binning pipeline extending the BinObject superclass that processes contigs through DataLoader.loadData(), applies Binner clustering algorithms, and uses Oracle similarity comparisons with neural network moderation.

Core Binning Strategy

The algorithm implements quantized GC-depth matrices using Key.java for efficient binning across log2-scaled depth values and linear GC content ranges. The Key class converts continuous genomic measurements to discrete quantization levels using bit-packed hash codes, with depthwidth controlling logarithmic depth granularity and gcwidth controlling linear GC resolution.
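
The idea of converting continuous measurements into a bit-packed key can be sketched as follows. This Python example is not the Key.java implementation: the 16-bit field layout, the offset for negative depth levels, the rounding rule, and the function names are all assumptions chosen for illustration.

```python
import math

# Assumed layout: high bits hold the GC level, low 16 bits hold the depth level.
DEPTH_BITS = 16
DEPTH_OFFSET = 1 << 15  # assumed offset so negative log2 depth levels stay non-negative

def cell_key(gc, depth, gcwidth=0.02, depthwidth=0.5):
    """Quantize (gc, depth) to grid levels and pack them into one integer."""
    gc_level = int(round(gc / gcwidth))                       # linear GC scale
    log_depth = math.log2(max(depth, 1e-6))                   # avoid log2(0)
    depth_level = int(round(log_depth / depthwidth)) + DEPTH_OFFSET
    return (gc_level << DEPTH_BITS) | depth_level

def unpack(key):
    """Recover (gc_level, depth_level) from a packed key."""
    return key >> DEPTH_BITS, (key & ((1 << DEPTH_BITS) - 1)) - DEPTH_OFFSET

# Contigs with similar GC and depth land in the same cell.
print(unpack(cell_key(0.50, 4.0)))                      # (25, 4)
print(cell_key(0.50, 4.0) == cell_key(0.504, 4.2))      # True
```

Packing both levels into one integer lets contigs be grouped with a single hash lookup per cell.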

Multi-Sample Integration

When multiple sam files are provided through SamLoader, QuickBin calculates samplesEquivalent values and applies sample-adjusted stringency thresholds via binner.setSamples(). The DataLoader processes each SAM file as an independent sample, tracking multi-sample depth patterns through IntLongHashMap structures to improve organism-specific coverage signatures.

Neural Network Enhancement

Oracle.java performs the similarity assessments, combining 3-mer, 4-mer, and 5-mer frequency comparisons with depth ratio analysis. The Oracle class implements multiple comparison methods, including cosine, Hellinger, and Euclidean distance metrics, with neural network cutoffs (net0small.cutoff, net0mid.cutoff, net0large.cutoff) providing size-dependent thresholds for merge decisions.
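
For readers unfamiliar with these metrics, the Python sketch below computes k-mer frequency vectors and the three distances named above using their standard definitions. It is a simplified illustration, not the Oracle code: it counts forward-strand k-mers only, and all function names are hypothetical.

```python
import math
from itertools import product

def kmer_frequencies(seq, k):
    """Normalized counts of all 4**k k-mers (forward strand only; k-mers with non-ACGT are skipped)."""
    index = {"".join(p): i for i, p in enumerate(product("ACGT", repeat=k))}
    counts = [0] * len(index)
    seq = seq.upper()
    for i in range(len(seq) - k + 1):
        j = index.get(seq[i:i + k])
        if j is not None:
            counts[j] += 1
    total = sum(counts)
    return [c / total for c in counts] if total else counts

def cosine_distance(p, q):
    """1 minus the cosine similarity of two frequency vectors."""
    dot = sum(a * b for a, b in zip(p, q))
    norm = math.sqrt(sum(a * a for a in p)) * math.sqrt(sum(b * b for b in q))
    return 1.0 - dot / norm if norm else 1.0

def euclidean_distance(p, q):
    """Straight-line distance between two frequency vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def hellinger_distance(p, q):
    """Hellinger distance between two probability vectors (range 0 to 1)."""
    return math.sqrt(sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q)) / 2.0)
```

For example, comparing kmer_frequencies(contig_a, 4) with kmer_frequencies(contig_b, 4) gives small distances for contigs of similar tetramer composition.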

Edge-Based Clustering

Paired-read connections are processed through binner.followEdges() using Oracle instances with configurable edgeStringency1 and edgeStringency2 thresholds. The algorithm performs multiple passes with decreasing stringency, following up to maxEdges=3 connections per contig while requiring minimum edge weights (minEdgeWeight=2 read pairs) and edge ratios (minEdgeRatio=0.4) to prevent spurious merges.
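
The edge filters described above can be pictured as a selection step applied to each contig's outgoing edges. The Python sketch below is a guess at how the three limits interact, not the followEdges() code; the function name, the heaviest-first ordering, and the inclusive comparisons are assumptions.

```python
def usable_edges(edges, min_edge_weight=2, min_edge_ratio=0.4, max_edges=3):
    """Select edges worth following from one contig.

    edges maps a neighboring contig ID to its number of supporting read pairs.
    """
    if not edges:
        return []
    strongest = max(edges.values())
    kept = [
        (dest, weight) for dest, weight in edges.items()
        if weight >= min_edge_weight                 # enough read pairs
        and weight >= min_edge_ratio * strongest     # not dwarfed by the best edge
    ]
    # Assumed ordering: heaviest edges first, ties broken by contig ID.
    kept.sort(key=lambda e: (-e[1], e[0]))
    return kept[:max_edges]

edges = {"c2": 10, "c3": 5, "c4": 3, "c5": 1, "c6": 4, "c7": 4}
print(usable_edges(edges))  # [('c2', 10), ('c3', 5), ('c6', 4)]
```

Here c4 and c5 fall below 40% of the strongest edge, and c7 is dropped by the three-edge limit.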

Iterative Refinement Pipeline

The binning process executes through discrete phases: initial clustering via BinMap.makeBinMap(), optional refinement through binner.refineBinMap(), edge-following passes, residue processing via binner.processResidue(), and purification through binner.purify(). Each phase tracks comparison statistics (fastComparisons, slowComparisons, netComparisons) and applies progressively stringent Oracle similarity thresholds.

Performance Characteristics

Memory usage is dominated by quantization matrices scaled by gcwidth and depthwidth parameters, with typical usage ranging from 4-20GB for assemblies >100MB. The comparison engine achieves 10,000-100,000 comparisons per second depending on Oracle comparison depth, with performance tracked through fastComp, midComp, and slowComp counters reported in comparisons-per-second (cps) metrics.

Quality Assessment Integration

When validate=t, QuickBin parses taxonomic IDs from contig headers (tid_XXXX format) and calculates contamination through Bin.calcContam() using IntLongHashMap sizeMap structures. Quality metrics follow MAG standards via GradeBins.printBinQuality(), reporting completeness and contamination scores with BinStats classification into quality tiers (UHQ, VHQ, HQ, MQ, LQ, VLQ).
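
As an illustration of header-based validation, the Python sketch below parses tid_XXXX labels and sums contig lengths per taxon, then reports the fraction of bases that do not belong to the dominant taxon. That definition of contamination is an assumption made for this example and may differ from what Bin.calcContam() computes; the function name is hypothetical.

```python
import re
from collections import defaultdict

TID_PATTERN = re.compile(r"tid_(\d+)")

def bin_contamination(contigs):
    """Fraction of labeled bases outside the dominant taxon.

    contigs is an iterable of (header, length) pairs; headers without
    a tid_XXXX label are ignored.
    """
    size_by_tid = defaultdict(int)
    for header, length in contigs:
        match = TID_PATTERN.search(header)
        if match:
            size_by_tid[int(match.group(1))] += length
    total = sum(size_by_tid.values())
    if total == 0:
        return 0.0
    return 1.0 - max(size_by_tid.values()) / total

example_bin = [
    ("contig_1 tid_1234", 9000),
    ("contig_2 tid_1234", 500),
    ("contig_3 tid_99", 500),
]
print(round(bin_contamination(example_bin), 3))  # 0.05
```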

Output Format

QuickBin produces binned contigs in fasta format with several output modes:

Individual bin files: When output pattern contains %, creates separate fasta files for each bin (bin0.fa, bin1.fa, etc.)
Single file with headers: When output pattern lacks %, writes all contigs to one file with bin numbers in sequence headers
Directory mode: Output patterns without file extensions are treated as directories containing individual bin files
Quality annotations: Bin filenames can include contamination (%contam) and completeness (%comp) scores when validation is enabled
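
The naming rules for the first three modes can be summarized in a few lines. The Python sketch below is a reading of those rules, not QuickBin's code: the function name is hypothetical, and the file name used inside a directory (bin<N>.fa) is an assumption.

```python
def output_path(pattern, bin_number):
    """Return (mode, path) for one bin under the output rules described above."""
    if "%" in pattern:
        # One file per bin, with % replaced by the bin number.
        return "per-bin", pattern.replace("%", str(bin_number))
    if "." not in pattern:
        # No '.' in the pattern: treated as a directory.
        # The file name inside the directory is an assumption.
        return "directory", f"{pattern}/bin{bin_number}.fa"
    # Otherwise all contigs go to a single file, with bin numbers in the headers.
    return "single", pattern

print(output_path("bin%.fa", 3))   # ('per-bin', 'bin3.fa')
print(output_path("output", 0))    # ('directory', 'output/bin0.fa')
print(output_path("all.fa", 7))    # ('single', 'all.fa')
```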

Performance Tuning

For optimal performance on different dataset sizes:

Large Metagenomes (>1GB assembly)

Reduce minseed (for example, to 2000) for faster processing; with only 1 sample, values below 2000 will run slowly
Use coarser quantization: gcwidth=0.04, depthwidth=1.0
Increase mincontig and minpentamersize thresholds
Consider using multiple samples to maintain accuracy with relaxed settings

Small Metagenomes (<100MB assembly)

Use finer quantization for better resolution: gcwidth=0.01, depthwidth=0.25
Lower minseed to 1000-2000 for better sensitivity to small contigs
Enable edge processing with multiple passes for improved connectivity

High-Contamination Environments

Use strict or vstrict stringency modes
Increase the neural network cutoff to 0.6-0.7 (e.g., cutoff=0.6)
Enable SSU calling (callssu=t) to prevent taxonomically inconsistent merges

Support

For questions and support:

Email: bbushnell@lbl.gov
Documentation: bbmap.org