MakeQuickBinVector

Basic Usage

makequickbinvector.sh in=contigs.fa out=vector.txt cov=cov.txt lines=1m

This tool generates training vectors for QuickBin's machine learning network by analyzing contig pairs and extracting features including GC content, tetranucleotide frequency (TNF), coverage depth ratios, and other genomic characteristics.

Parameters

Parameters are organized by their function in the vector generation process. The tool analyzes contig pairs to generate feature vectors suitable for neural network training.

Input/Output Parameters

in=<file>: Assembly input file in FASTA format. This is the only required parameter. Contains the contigs to be analyzed for vector generation.
cov=<file>: Coverage file generated by QuickBin from SAM files. Contains per-contig depth information, which is used for coverage-based filtering and feature extraction.
out=<file>: Output file for the generated feature vectors. Each line represents one contig pair comparison with tab-delimited feature values.

Vector Generation Parameters

lines=1m: Number of lines (vector comparisons) to output. Default: 1,000,000. Controls the size of the training dataset generated.
rate=0.5: Fraction of vectors with positive results (same-bin pairs vs different-bin pairs). Default: 0.5. Balances the training dataset between positive and negative examples.
seed=<long>: Random seed for reproducible vector generation. If not specified, uses system time. Ensures consistent results across runs for testing.
rolls=<int>: Number of random rolls for contig selection. Higher values bias toward smaller indices. Default value determined by implementation.

Filtering Parameters

mincontig=200: Do not load contigs shorter than this length. Default: 200. Filters out very short contigs that may not have reliable genomic signatures.
minlen=0: Do not print comparisons where either contig is shorter than this length. Default: 0. Additional length filtering for vector generation.
maxlen=2B: Do not print comparisons where both contigs are longer than this length. Default: 2 billion bases. Excludes comparisons between two extremely large contigs.
maxgcdif=1.0: Maximum allowed GC content difference for output. Default: 1.0 (no filtering). Contig pairs whose GC difference exceeds this threshold are excluded from the output.
maxkmerdif=1.0: Maximum allowed tetranucleotide frequency (TNF) cosine difference for output. Default: 1.0 (no filtering). Filters contig pairs based on k-mer signature similarity.
maxdepthratio=1000.0: Maximum allowed coverage depth ratio for output. Default: 1000.0. Filters contig pairs with vastly different coverage depths, as they are unlikely to be from the same bin.
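
Several values above use shorthand suffixes (lines=1m is 1,000,000; maxlen=2B is 2 billion; the examples also use lines=500k). The following is a minimal sketch of how such values expand, assuming decimal multipliers as stated in the defaults. parse_count is a hypothetical helper written for illustration, not the parser the tool uses.

```python
# Hypothetical helper, not part of the tool: expands the k/m/b suffixes
# that appear in this guide (lines=500k, lines=1m, maxlen=2B).
SUFFIXES = {"k": 1_000, "m": 1_000_000, "b": 1_000_000_000}


def parse_count(text: str) -> int:
    """Convert '500k', '1m', or '2B' to an integer; plain digits pass through."""
    text = text.strip()
    suffix = text[-1].lower()
    if suffix in SUFFIXES:
        return int(float(text[:-1]) * SUFFIXES[suffix])
    return int(text)


print(parse_count("1m"))    # 1000000
print(parse_count("500k"))  # 500000
print(parse_count("2B"))    # 2000000000
```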

Clustering Parameters

mcc=9: Maximum contigs per cluster. Default: 9. Controls cluster size when generating multi-contig comparisons for more complex vector features.
edgefraction=<float>: Fraction of comparisons that include edge connections from contig pair maps. Incorporates assembly graph information when available.

Advanced Parameters

kmerdif=<file>: Output k-mer difference statistics to specified file. Use % in filename to create separate files for positive and negative examples.
kmerfraction=<file>: Output k-mer difference fraction analysis to specified file. Generates percentile-based k-mer difference distributions.
printSizeInVector=<boolean>: Include contig size information in the feature vector. Default determined by Oracle settings.
printNetOutputInVector=<boolean>: Include network output information in the feature vector. Default determined by Oracle settings.

Java Parameters

-Xmx: Set Java's memory usage, overriding autodetection. Example: -Xmx20g specifies 20GB of RAM, -Xmx200m specifies 200MB. Maximum is typically 85% of physical memory.
-eoom: Exit if an out-of-memory exception occurs. Requires Java 8u92+. Prevents hanging on memory exhaustion.
-da: Disable Java assertions for slightly better performance in production use.

Examples

Basic Vector Generation

makequickbinvector.sh in=assembly.fasta out=training_vectors.txt cov=coverage.txt

Generate 1 million training vectors from an assembly using default parameters. The coverage file provides depth information for filtering and feature extraction.

Balanced Training Dataset

makequickbinvector.sh in=contigs.fa out=vectors.txt cov=cov.txt lines=500k rate=0.6 mincontig=500

Generate 500,000 vectors with 60% positive examples, loading only contigs of at least 500 bp. A higher positive rate may help when same-bin pairs would otherwise be underrepresented in training.

Strict Filtering

makequickbinvector.sh in=assembly.fa out=filtered_vectors.txt cov=coverage.txt maxgcdif=0.15 maxkmerdif=0.05 maxdepthratio=3.0

Generate vectors with strict filtering: GC difference ≤0.15 (15%), k-mer difference ≤0.05, and depth ratio ≤3x. Tighter thresholds restrict the output to pairs that are similar in composition and coverage.

K-mer Analysis Output

makequickbinvector.sh in=contigs.fa out=vectors.txt cov=coverage.txt kmerdif=kmer_stats_%.txt kmerfraction=kmer_fractions.txt

Generate vectors while also outputting k-mer difference statistics. The % in the filename creates separate files for positive (1) and negative (0) examples.

Algorithm Details

Vector Generation Strategy

The makeVector() method implements three comparison modes using randomIndex() selection with bias rolls:

Single Contig Comparisons (numClusters=0): selectContig() chooses individual contigs from taxonomic groups using IntHashSet for duplicate prevention
Hybrid Comparisons (numClusters=1): selectCluster() creates multi-contig clusters with up to maxClusterContigs members, compared against single contigs
Cluster Comparisons (numClusters=2): Two separate clusters created with selectCluster(), each containing from 2 up to maxClusterContigs contigs (9 by default, set with mcc)
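
The three modes above can be sketched roughly as follows. This is an illustration under stated assumptions, not the tool's code: pick_side and make_pair are hypothetical names, contigs are plain integer ids grouped by a label, and a Python set stands in for the duplicate-prevention set mentioned above.

```python
import random


def pick_side(group, as_cluster, max_cluster_contigs, used, rng):
    """Return one contig id, or a cluster of 2..max_cluster_contigs ids."""
    available = [c for c in group if c not in used]
    want = rng.randint(2, max_cluster_contigs) if as_cluster else 1
    chosen = rng.sample(available, min(want, len(available)))
    used.update(chosen)  # prevents the same contig appearing on both sides
    return chosen


def make_pair(groups, num_clusters, positive, max_cluster_contigs=9, rng=None):
    """Build both sides of one comparison.

    num_clusters=0: contig vs contig
    num_clusters=1: cluster vs contig
    num_clusters=2: cluster vs cluster
    """
    rng = rng or random.Random()
    used = set()
    label_a = rng.choice(list(groups))
    if positive:
        label_b = label_a  # same group: positive example
    else:
        label_b = rng.choice([g for g in groups if g != label_a])
    side_a = pick_side(groups[label_a], num_clusters >= 1,
                       max_cluster_contigs, used, rng)
    side_b = pick_side(groups[label_b], num_clusters == 2,
                       max_cluster_contigs, used, rng)
    return side_a, side_b


# Made-up groups: label -> contig ids
groups = {1: list(range(0, 20)), 2: list(range(100, 120))}
print(make_pair(groups, num_clusters=1, positive=False, rng=random.Random(1)))
```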

Feature Extraction

Oracle.toVector() generates feature vectors containing:

Tetranucleotide Frequency (TNF): SimilarityMeasures.calculateDifferenceAverage() computes cosine differences between 4-mer count arrays
Coverage Depth Ratios: depthRatio() calculates logarithmic depth differences using Bin class methods
GC Content Differences: Tools.absdif() computes absolute GC content differences using gc() methods
Length-based Features: size() comparisons and contig length ratios when Oracle.printSizeInVector enabled
Assembly Graph Features: edgeFraction parameter controls inclusion of pairMap edge weights from assembly graphs
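
To make the first three feature types concrete, here is a small self-contained sketch that computes a tetranucleotide cosine difference, a depth ratio, and a GC difference for two sequences. All function names are hypothetical, and the formulas are simplified assumptions: k-mers are not collapsed with their reverse complements, and the depth ratio is a plain max/min ratio with a small pseudocount rather than whatever depthRatio() computes internally.

```python
from collections import Counter
from math import sqrt


def gc_fraction(seq):
    """Fraction of G and C among the A/C/G/T bases of seq."""
    seq = seq.upper()
    acgt = sum(seq.count(base) for base in "ACGT")
    return (seq.count("G") + seq.count("C")) / acgt if acgt else 0.0


def tetramer_counts(seq):
    """Count overlapping 4-mers, skipping any that contain non-ACGT symbols."""
    seq = seq.upper()
    kmers = (seq[i:i + 4] for i in range(len(seq) - 3))
    return Counter(k for k in kmers if set(k) <= set("ACGT"))


def cosine_difference(a, b):
    """1 - cosine similarity of two count vectors (0 = identical direction)."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return 1.0 - dot / norm if norm else 1.0


def depth_ratio(d1, d2, pseudo=0.01):
    """Assumed form: larger depth over smaller depth, always >= 1."""
    hi, lo = max(d1, d2), min(d1, d2)
    return (hi + pseudo) / (lo + pseudo)


def features(seq_a, depth_a, seq_b, depth_b):
    """Feature order follows this guide: TNF difference, depth ratio, GC difference."""
    return [
        cosine_difference(tetramer_counts(seq_a), tetramer_counts(seq_b)),
        depth_ratio(depth_a, depth_b),
        abs(gc_fraction(seq_a) - gc_fraction(seq_b)),
    ]


print(features("ACGTACGTTGCA", 12.0, "GGCCGGCCAATT", 30.0))
```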

Quality Control and Filtering

The passesFilter() method implements cascade filtering:

Size Filtering: selectContig() enforces minSize/maxSize bounds during selection with 40-iteration retry loops
Composition Filtering: Tools.absdif(a.gc(), b.gc()) comparison against maxGCDif threshold
Coverage Filtering: depthRatio() calculation with maxDepthRatio threshold enforcement
Product Filtering: A combined threshold, maxProduct = maxKmerDif * maxDepthRatio * 0.75f, rejects pairs that sit near both the k-mer and depth limits at once
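
The cascade above can be sketched as a chain of early-exit checks. This is a hedged illustration, not the actual passesFilter() code: the function name and arguments are hypothetical, the defaults mirror the parameter defaults listed earlier, and comparing kmer_dif * depth_ratio against maxProduct is an assumption about how the combined threshold is applied. Length checks are included here for completeness, although the text above places them in contig selection.

```python
def passes_filter(len_a, len_b, gc_dif, kmer_dif, depth_ratio,
                  min_len=0, max_len=2_000_000_000,
                  max_gc_dif=1.0, max_kmer_dif=1.0, max_depth_ratio=1000.0):
    """Return True if a contig pair survives every filter, cheapest checks first."""
    if min(len_a, len_b) < min_len:   # either contig is too short
        return False
    if min(len_a, len_b) > max_len:   # both contigs are too long
        return False
    if gc_dif > max_gc_dif:
        return False
    if kmer_dif > max_kmer_dif:
        return False
    if depth_ratio > max_depth_ratio:
        return False
    # Assumed use of the combined threshold described in the text.
    max_product = max_kmer_dif * max_depth_ratio * 0.75
    if kmer_dif * depth_ratio > max_product:
        return False
    return True


# Thresholds from the "Strict Filtering" example.
strict = dict(max_gc_dif=0.15, max_kmer_dif=0.05, max_depth_ratio=3.0)
print(passes_filter(1000, 2000, 0.05, 0.02, 1.5, **strict))
print(passes_filter(1000, 2000, 0.05, 0.05, 3.0, **strict))
```

In the second call each value is within its own limit, but the pair is rejected by the product check because it is at the edge of both the k-mer and depth limits.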

Training Data Balance

The outputResults() method maintains balance using:

randy.nextFloat() <= positiveRate comparison controls positive/negative example ratios
HashMap<Integer, ArrayList<Contig>> taxonomic grouping by labelTaxid for stratified sampling
randomIndex() with baseRolls parameter biases selection toward smaller indices for diverse representation
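
A rough sketch of the two random mechanisms above, assuming that "rolls" means taking the smallest of several uniform draws (one common way to bias toward small indices; the real randomIndex() may differ). Function names are hypothetical, and Python's random.Random stands in for the seeded generator.

```python
import random


def random_index(n, rolls, rng):
    """Smallest of `rolls` uniform draws in [0, n); more rolls favor small indices."""
    return min(rng.randrange(n) for _ in range(rolls))


def choose_labels(lines, positive_rate, rng):
    """Label each output line: 1 = same-bin pair, 0 = different-bin pair."""
    return [1 if rng.random() <= positive_rate else 0 for _ in range(lines)]


rng = random.Random(42)  # a fixed seed makes the run repeatable
labels = choose_labels(10_000, 0.5, rng)
print("positive fraction:", sum(labels) / len(labels))

flat = [random_index(100, 1, rng) for _ in range(5_000)]
biased = [random_index(100, 4, rng) for _ in range(5_000)]
print("mean index, 1 roll: ", sum(flat) / len(flat))
print("mean index, 4 rolls:", sum(biased) / len(biased))
```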

Performance Characteristics

Memory Usage: Each thread reuses a FloatList vecBuffer and a ByteBuilder lineBuffer; overall memory scales with allContigs.size()
Output Format: ByteStreamWriter writes tab-delimited vectors with header specifying vecBuffer.size dimensions
Reproducibility: Shared.threadLocalRandom(seed) ensures deterministic Random state when seed parameter specified
Scalability: selectCluster() uses 100-iteration contig selection loops with early termination for large assemblies

Output Format

ByteStreamWriter creates tab-delimited files with specific structure:

Header Line: #dims format specifies vecBuffer.size() dimensions, weights flag, and format version
Feature Vectors: toLine() method converts FloatList to tab-delimited strings with 7-digit precision
Classification Label: vector.lastElement() contains binary classification (1=positive, 0=negative)
Feature Order: Oracle class defines vector element ordering: TNF cosine differences, depth ratios, GC differences, optional size/network features
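
For downstream use, the structure described above (a header line starting with #, tab-delimited values, label in the last column) can be read with a few lines of code. This is a minimal sketch: read_vectors is a hypothetical helper, and the sample header and values below are made-up placeholders, not real tool output.

```python
def read_vectors(lines):
    """Yield (features, label) from tab-delimited vector lines; '#' lines are skipped."""
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue
        values = [float(v) for v in line.split("\t")]
        yield values[:-1], int(values[-1])  # last column is the 0/1 label


# Placeholder data; the real header fields and feature count may differ.
sample = [
    "#dims\t3\t1\n",
    "0.0123456\t1.5000000\t0.0200000\t1\n",
    "0.4000000\t12.0000000\t0.3000000\t0\n",
]
for feats, label in read_vectors(sample):
    print(label, feats)
```

To read a real file, pass an open file handle instead of the sample list, for example read_vectors(open("vectors.txt")).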

Integration with QuickBin

The generated vectors match the Oracle vector format required for QuickBin neural network input. Feature dimensions and scaling are compatible with network training pipelines.

Support

For questions and support:

Email: bbushnell@lbl.gov
Documentation: bbmap.org