MakeQuickBinVector

Script: makequickbinvector.sh
Package: bin
Class: AllToAllVectorMaker.java

Makes vectors for QuickBin network training. Each vector is a set of features computed from a pair of contigs and is used to train a classifier for genomic binning.

Basic Usage

makequickbinvector.sh in=contigs.fa out=vector.txt cov=cov.txt lines=1m

This tool generates training vectors for QuickBin's machine learning network by analyzing contig pairs and extracting features including GC content, tetranucleotide frequency (TNF), coverage depth ratios, and other genomic characteristics.
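As an illustration of the simplest feature mentioned above, the following sketch computes GC content for two sequences and the absolute difference between them. The class and method names are hypothetical and not part of BBTools, and skipping ambiguous bases such as N is an assumption of this sketch rather than documented QuickBin behavior.

```java
// Hypothetical helper, not part of BBTools: GC fraction of a contig sequence.
class GcContentSketch {

    /** Returns the fraction of G/C among unambiguous A/C/G/T bases, or 0 if there are none. */
    static double gcFraction(String seq) {
        long gc = 0, acgt = 0;
        for (int i = 0; i < seq.length(); i++) {
            char c = Character.toUpperCase(seq.charAt(i));
            if (c == 'G' || c == 'C') {
                gc++;
                acgt++;
            } else if (c == 'A' || c == 'T') {
                acgt++;
            }
            // Assumption: ambiguous symbols such as N are skipped entirely.
        }
        return acgt == 0 ? 0.0 : (double) gc / acgt;
    }

    public static void main(String[] args) {
        double a = gcFraction("ACGTGGCC"); // 6 of 8 bases are G or C
        double b = gcFraction("ATATNNGC"); // 2 of 6 unambiguous bases are G or C
        System.out.println("gcA=" + a + " gcB=" + b + " dif=" + Math.abs(a - b));
    }
}
```

The difference printed at the end is the kind of quantity that a threshold such as maxgcdif would be compared against.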

Parameters

Parameters are organized by their function in the vector generation process.

Input/Output Parameters

in=<file>
Assembly input file in FASTA format. This is the only required parameter. Contains the contigs to be analyzed for vector generation.
cov=<file>
Coverage file generated by QuickBin from SAM files. Contains depth information for each contig used in coverage-based filtering and feature extraction.
out=<file>
Output file for the generated feature vectors. Each line represents one contig pair comparison with tab-delimited feature values.
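For downstream inspection, a line of the output file can be split on tabs. The sketch below assumes every field is numeric, which this page does not state; the example line, its values, and its column count are made up for illustration, and the class name is hypothetical.

```java
// Hypothetical reader for one line of tab-delimited vector output.
class VectorLineSketch {

    /** Splits a tab-delimited line into float fields. Assumes all fields are numeric. */
    static float[] parseLine(String line) {
        String[] fields = line.split("\t");
        float[] values = new float[fields.length];
        for (int i = 0; i < fields.length; i++) {
            values[i] = Float.parseFloat(fields[i].trim());
        }
        return values;
    }

    public static void main(String[] args) {
        // Made-up example line; real output has its own columns and order.
        float[] v = parseLine("0.12\t0.034\t1.5\t1");
        System.out.println("fields=" + v.length + " last=" + v[v.length - 1]);
    }
}
```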

Vector Generation Parameters

lines=1m
Number of lines (vector comparisons) to output. Default: 1,000,000. Controls the size of the training dataset generated.
rate=0.5
Fraction of vectors with positive results (same-bin pairs vs different-bin pairs). Default: 0.5. Balances the training dataset between positive and negative examples.
seed=<long>
Random seed for reproducible vector generation. If not specified, the system time is used. Setting a fixed seed gives consistent results across runs, which is useful for testing.
rolls=<int>
Number of random rolls for contig selection. Higher values bias toward smaller indices. Default value determined by implementation.
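One common way to obtain the bias described for rolls is to draw several uniform indices and keep the smallest. The sketch below shows that technique together with a fixed seed for reproducibility. It is an assumption about how such a bias could work, not a transcription of the tool's randomIndex() method; the class and method names are hypothetical.

```java
// Hypothetical sketch: min-of-k uniform draws biases selection toward small indices.
class BiasedIndexSketch {

    /** Draws `rolls` uniform indices in [0, size) and returns the smallest. */
    static int biasedIndex(java.util.Random randy, int size, int rolls) {
        int best = randy.nextInt(size);
        for (int i = 1; i < rolls; i++) {
            best = Math.min(best, randy.nextInt(size));
        }
        return best;
    }

    /** Mean selected index over many samples, using a fixed seed so runs are repeatable. */
    static double meanIndex(long seed, int size, int rolls, int samples) {
        java.util.Random randy = new java.util.Random(seed);
        long sum = 0;
        for (int i = 0; i < samples; i++) {
            sum += biasedIndex(randy, size, rolls);
        }
        return sum / (double) samples;
    }

    public static void main(String[] args) {
        // More rolls pull the mean index toward 0; the same seed reproduces the same value.
        System.out.println("rolls=1 mean=" + meanIndex(42L, 1000, 1, 100000));
        System.out.println("rolls=4 mean=" + meanIndex(42L, 1000, 4, 100000));
    }
}
```

With a single roll the selection is uniform. Each additional roll lowers the expected index, which matches the direction of the bias described for this parameter.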

Filtering Parameters

mincontig=200
Do not load contigs shorter than this length. Default: 200. Filters out very short contigs that may not have reliable genomic signatures.
minlen=0
Do not print comparisons where either contig is shorter than this length. Default: 0. Additional length filtering for vector generation.
maxlen=2B
Do not print comparisons where both contigs are longer than this length. Default: 2 billion bases. Prevents analysis of extremely large contigs.
maxgcdif=1.0
Maximum allowed GC content difference for output. Default: 1.0 (no filtering). Contig pairs whose GC difference exceeds this threshold are excluded from the output.
maxkmerdif=1.0
Maximum allowed tetranucleotide frequency (TNF) cosine difference for output. Default: 1.0 (no filtering). Filters contig pairs based on k-mer signature similarity.
maxdepthratio=1000.0
Maximum allowed coverage depth ratio for output. Default: 1000.0. Filters contig pairs with vastly different coverage depths, as they are unlikely to be from the same bin.
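To make the k-mer threshold concrete, the sketch below computes a cosine difference between two frequency vectors using the textbook definition, 1 minus cosine similarity. Whether QuickBin uses exactly this formula, and how it builds its tetranucleotide vectors, is not stated here, so treat both as assumptions. The class name and the short example vectors are hypothetical.

```java
// Hypothetical sketch: cosine difference (1 - cosine similarity) of two frequency vectors.
class CosineDifSketch {

    /** Returns 1 - cos(a, b); 0 means identical direction, 1 means orthogonal. */
    static double cosineDifference(float[] a, float[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("Vectors must have equal length");
        }
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += (double) a[i] * b[i];
            normA += (double) a[i] * a[i];
            normB += (double) b[i] * b[i];
        }
        if (normA == 0 || normB == 0) {
            return 1.0; // Assumption: an all-zero vector is treated as maximally different.
        }
        return 1.0 - dot / Math.sqrt(normA * normB);
    }

    public static void main(String[] args) {
        // Real TNF vectors have one entry per tetramer; these short vectors are toy data.
        float[] x = {4f, 1f, 0f, 3f};
        float[] y = {3f, 2f, 1f, 3f};
        System.out.println("cosine difference = " + cosineDifference(x, y));
    }
}
```

Under this definition, counts that differ only by a constant factor (for example, the same composition in a longer contig) give a difference of 0, which is why cosine measures suit composition comparisons between contigs of different lengths.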

Clustering Parameters

mcc=9
Maximum contigs per cluster. Default: 9. Controls cluster size when generating multi-contig comparisons for more complex vector features.
edgefraction=<float>
Fraction of comparisons that draw on edge connections from contig pair maps. This incorporates assembly graph information when it is available.

Advanced Parameters

kmerdif=<file>
Output k-mer difference statistics to specified file. Use % in filename to create separate files for positive and negative examples.
kmerfraction=<file>
Output k-mer difference fraction analysis to specified file. Generates percentile-based k-mer difference distributions.
printSizeInVector=<boolean>
Include contig size information in the feature vector. Default determined by Oracle settings.
printNetOutputInVector=<boolean>
Include network output information in the feature vector. Default determined by Oracle settings.

Java Parameters

-Xmx
Set Java's memory usage, overriding autodetection. Example: -Xmx20g specifies 20GB of RAM, -Xmx200m specifies 200MB. Maximum is typically 85% of physical memory.
-eoom
Exit if an out-of-memory exception occurs. Requires Java 8u92+. Prevents hanging on memory exhaustion.
-da
Disable Java assertions for slightly better performance in production use.

Examples

Basic Vector Generation

makequickbinvector.sh in=assembly.fasta out=training_vectors.txt cov=coverage.txt

Generate 1 million training vectors from an assembly using default parameters. The coverage file provides depth information for filtering and feature extraction.

Balanced Training Dataset

makequickbinvector.sh in=contigs.fa out=vectors.txt cov=cov.txt lines=500k rate=0.6 mincontig=500

Generate 500,000 vectors with 60% positive examples, loading only contigs of at least 500 bp. A higher positive rate may help with imbalanced datasets.

Strict Filtering

makequickbinvector.sh in=assembly.fa out=filtered_vectors.txt cov=coverage.txt maxgcdif=0.15 maxkmerdif=0.05 maxdepthratio=3.0

Generate vectors with strict filtering: GC difference ≤0.15 (15 percentage points), k-mer difference ≤0.05, and depth ratio ≤3x. This restricts the output to pairs that are similar in composition and coverage.

K-mer Analysis Output

makequickbinvector.sh in=contigs.fa out=vectors.txt cov=coverage.txt kmerdif=kmer_stats_%.txt kmerfraction=kmer_fractions.txt

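Generate vectors while also outputting k-mer difference statistics. The % in the kmerdif filename creates separate files for positive (1) and negative (0) examples, and kmerfraction writes the percentile-based k-mer difference distribution to kmer_fractions.txt.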
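The % placeholder can be pictured as a plain string substitution. The sketch below only illustrates the resulting file names under that assumption; the class and method are hypothetical, and the tool's own substitution code is not shown on this page.

```java
// Hypothetical sketch: expanding a % placeholder into per-class file names.
class PercentNameSketch {

    /** Replaces % with 1 for positive examples or 0 for negative examples. */
    static String expand(String pattern, boolean positive) {
        return pattern.replace("%", positive ? "1" : "0");
    }

    public static void main(String[] args) {
        System.out.println(expand("kmer_stats_%.txt", true));  // kmer_stats_1.txt
        System.out.println(expand("kmer_stats_%.txt", false)); // kmer_stats_0.txt
    }
}
```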

Algorithm Details

Vector Generation Strategy

The makeVector() method implements three comparison modes, choosing contigs through randomIndex() with bias rolls.

Feature Extraction

Oracle.toVector() generates the feature vectors.

Quality Control and Filtering

The passesFilter() method implements cascade filtering:

  1. Size Filtering: selectContig() enforces minSize/maxSize bounds during selection with 40-iteration retry loops
  2. Composition Filtering: Tools.absdif(a.gc(), b.gc()) comparison against maxGCDif threshold
  3. Coverage Filtering: depthRatio() calculation with maxDepthRatio threshold enforcement
  4. Product Filtering: Combined metric maxProduct = maxKmerDif * maxDepthRatio * 0.75f prevents edge cases
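The composition, coverage, and product checks above can be sketched as a chain of early exits. This is a minimal illustration, not the tool's passesFilter() code: the class name is hypothetical, the depth ratio is assumed to be the larger depth divided by the smaller, and the product check is assumed to compare kmerDif * depthRatio against maxProduct. Only the maxProduct formula itself comes from the list above. Size filtering during contig selection is omitted.

```java
// Hypothetical sketch of a cascading pair filter; thresholds mirror the command-line flags.
class PairFilterSketch {
    final float maxGCDif, maxKmerDif, maxDepthRatio, maxProduct;

    PairFilterSketch(float maxGCDif, float maxKmerDif, float maxDepthRatio) {
        this.maxGCDif = maxGCDif;
        this.maxKmerDif = maxKmerDif;
        this.maxDepthRatio = maxDepthRatio;
        // Formula as given in the text above.
        this.maxProduct = maxKmerDif * maxDepthRatio * 0.75f;
    }

    /** Assumption: ratio of the larger depth to the smaller, always at least 1. */
    static float depthRatio(float depthA, float depthB) {
        float lo = Math.min(depthA, depthB), hi = Math.max(depthA, depthB);
        if (hi <= 0) return 1f;                      // both zero: treat as equal
        if (lo <= 0) return Float.POSITIVE_INFINITY; // one zero: unbounded ratio
        return hi / lo;
    }

    /** Returns true only if the pair survives every check, cheapest checks first. */
    boolean passes(float gcA, float gcB, float kmerDif, float depthA, float depthB) {
        if (Math.abs(gcA - gcB) > maxGCDif) return false;
        if (kmerDif > maxKmerDif) return false;
        float ratio = depthRatio(depthA, depthB);
        if (ratio > maxDepthRatio) return false;
        // Assumption: the combined metric is the product of the two differences.
        return kmerDif * ratio <= maxProduct;
    }

    public static void main(String[] args) {
        PairFilterSketch f = new PairFilterSketch(0.15f, 0.05f, 3.0f);
        System.out.println(f.passes(0.50f, 0.55f, 0.02f, 10f, 12f)); // true
        System.out.println(f.passes(0.30f, 0.60f, 0.02f, 10f, 12f)); // false: GC difference
        System.out.println(f.passes(0.50f, 0.50f, 0.04f, 10f, 30f)); // false: product only
    }
}
```

The last call shows why a combined threshold is useful under these assumptions: the pair is within each individual limit, yet being near both limits at once is enough to reject it.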

Training Data Balance

The outputResults() method maintains the balance between positive and negative examples.

Performance Characteristics

Output Format

ByteStreamWriter creates the tab-delimited output files.

Integration with QuickBin

The generated vectors match the Oracle vector format required for QuickBin neural network input. Feature dimensions and scaling are compatible with network training pipelines.

Support

For questions and support: