MakeQuickBinVector
Generates training vectors for the QuickBin neural network. Each vector holds features computed from a pair of contigs and is used to train a classifier for genomic binning.
Basic Usage
makequickbinvector.sh in=contigs.fa out=vector.txt cov=cov.txt lines=1m
This tool generates training vectors for QuickBin's machine learning network by analyzing contig pairs and extracting features including GC content, tetranucleotide frequency (TNF), coverage depth ratios, and other genomic characteristics.
Parameters
Parameters are grouped by their role in the vector generation process.
Input/Output Parameters
- in=<file>
- Assembly input file in FASTA format. This is the only required parameter. Contains the contigs to be analyzed for vector generation.
- cov=<file>
- Coverage file generated by QuickBin from SAM files. Contains per-contig depth information, which is used for coverage-based filtering and feature extraction.
- out=<file>
- Output file for the generated feature vectors. Each line represents one contig pair comparison with tab-delimited feature values.
Vector Generation Parameters
- lines=1m
- Number of lines (vector comparisons) to output. Default: 1,000,000. Controls the size of the training dataset generated.
- rate=0.5
- Fraction of output vectors that are positive examples (same-bin pairs, as opposed to different-bin pairs). Default: 0.5, which gives roughly equal numbers of positive and negative examples.
- seed=<long>
- Random seed for reproducible vector generation. If not specified, the system time is used. Set a fixed seed when runs must produce identical output, for example in testing.
- rolls=<int>
- Number of random rolls for contig selection. Higher values bias toward smaller indices. Default value determined by implementation.
Filtering Parameters
- mincontig=200
- Do not load contigs shorter than this length. Default: 200. Filters out very short contigs that may not have reliable genomic signatures.
- minlen=0
- Do not print comparisons where either contig is shorter than this length. Default: 0. Additional length filtering for vector generation.
- maxlen=2B
- Do not print comparisons where both contigs are longer than this length. Default: 2 billion bases. Prevents analysis of extremely large contigs.
- maxgcdif=1.0
- Maximum allowed GC content difference for output. Default: 1.0 (no filtering). Contig pairs whose GC difference exceeds this threshold are not output.
- maxkmerdif=1.0
- Maximum allowed tetranucleotide frequency (TNF) cosine difference for output. Default: 1.0 (no filtering). Filters contig pairs based on k-mer signature similarity.
- maxdepthratio=1000.0
- Maximum allowed coverage depth ratio for output. Default: 1000.0. Filters contig pairs with vastly different coverage depths, as they are unlikely to be from the same bin.
Clustering Parameters
- mcc=9
- Maximum contigs per cluster. Default: 9. Controls cluster size when generating multi-contig comparisons for more complex vector features.
- edgefraction=<float>
- Fraction of comparisons that include edge connections from the contig pair map. Incorporates assembly graph information when it is available.
Advanced Parameters
- kmerdif=<file>
- Output k-mer difference statistics to specified file. Use % in filename to create separate files for positive and negative examples.
- kmerfraction=<file>
- Output k-mer difference fraction analysis to specified file. Generates percentile-based k-mer difference distributions.
- printSizeInVector=<boolean>
- Include contig size information in the feature vector. Default determined by Oracle settings.
- printNetOutputInVector=<boolean>
- Include network output information in the feature vector. Default determined by Oracle settings.
Java Parameters
- -Xmx
- Set Java's memory usage, overriding autodetection. Example: -Xmx20g specifies 20GB of RAM, -Xmx200m specifies 200MB. Maximum is typically 85% of physical memory.
- -eoom
- Exit if an out-of-memory exception occurs. Requires Java 8u92+. Prevents hanging on memory exhaustion.
- -da
- Disable Java assertions for slightly better performance in production use.
Examples
Basic Vector Generation
makequickbinvector.sh in=assembly.fasta out=training_vectors.txt cov=coverage.txt
Generate 1 million training vectors from an assembly using default parameters. The coverage file provides depth information for filtering and feature extraction.
Balanced Training Dataset
makequickbinvector.sh in=contigs.fa out=vectors.txt cov=cov.txt lines=500k rate=0.6 mincontig=500
Generate 500,000 vectors with 60% positive examples, loading only contigs of at least 500 bp. Raising rate above 0.5 over-represents same-bin pairs, which may help when positive examples would otherwise be under-represented in training.
Strict Filtering
makequickbinvector.sh in=assembly.fa out=filtered_vectors.txt cov=coverage.txt maxgcdif=0.15 maxkmerdif=0.05 maxdepthratio=3.0
Generate vectors with strict filtering: GC difference ≤ 0.15, k-mer difference ≤ 0.05, and depth ratio ≤ 3. Only pairs that are similar in composition and coverage are written.
K-mer Analysis Output
makequickbinvector.sh in=contigs.fa out=vectors.txt cov=coverage.txt kmerdif=kmer_stats_%.txt kmerfraction=kmer_fractions.txt
Generate vectors while also outputting k-mer difference statistics. The % in the filename creates separate files for positive (1) and negative (0) examples.
Algorithm Details
Vector Generation Strategy
The makeVector() method implements three comparison modes using randomIndex() selection with bias rolls:
- Single Contig Comparisons (numClusters=0): selectContig() chooses individual contigs from taxonomic groups using IntHashSet for duplicate prevention
- Hybrid Comparisons (numClusters=1): selectCluster() creates multi-contig clusters with up to maxClusterContigs members, compared against single contigs
- Cluster Comparisons (numClusters=2): Two separate clusters created with selectCluster(), each containing from 2 up to maxClusterContigs contigs (9 by default, set with mcc)
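The three modes can be pictured with a small Python sketch. This is not QuickBin's Java code: the function names pick and make_pair, the data layout (a dict from taxid to a list of contig ids), and the way the cluster size is drawn are all assumptions made for illustration.

```python
import random

def pick(group, as_cluster, max_cluster_contigs, rng, used):
    """Pick one contig, or a cluster of 2..max_cluster_contigs contigs,
    skipping ids that were already used (hypothetical helper)."""
    available = [c for c in group if c not in used]
    if as_cluster:
        upper = min(max_cluster_contigs, len(available))
        if upper < 2:
            return None  # not enough unused contigs to form a cluster
        chosen = rng.sample(available, rng.randint(2, upper))
    else:
        if not available:
            return None
        chosen = [rng.choice(available)]
    used.update(chosen)  # duplicate prevention, analogous to the IntHashSet in the text
    return chosen

def make_pair(groups, num_clusters, positive, rng, max_cluster_contigs=9):
    """groups: dict taxid -> list of contig ids.
    num_clusters: 0 = contig vs contig, 1 = cluster vs contig, 2 = cluster vs cluster.
    Returns (side_a, side_b) as lists of contig ids, or None if selection failed."""
    taxids = sorted(groups)
    tax_a = rng.choice(taxids)
    if positive:
        tax_b = tax_a  # same group: positive example
    else:
        others = [t for t in taxids if t != tax_a]
        if not others:
            return None
        tax_b = rng.choice(others)  # different group: negative example
    used = set()
    side_a = pick(groups[tax_a], num_clusters >= 1, max_cluster_contigs, rng, used)
    side_b = pick(groups[tax_b], num_clusters >= 2, max_cluster_contigs, rng, used)
    if side_a is None or side_b is None:
        return None
    return side_a, side_b
```

For example, make_pair(groups, 1, True, random.Random(0)) would return one multi-contig cluster and one single contig drawn from the same group.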
Feature Extraction
Oracle.toVector() generates feature vectors containing:
- Tetranucleotide Frequency (TNF): SimilarityMeasures.calculateDifferenceAverage() computes cosine differences between 4-mer count arrays
- Coverage Depth Ratios: depthRatio() calculates logarithmic depth differences using Bin class methods
- GC Content Differences: Tools.absdif() computes absolute GC content differences using gc() methods
- Length-based Features: size() comparisons and contig length ratios when Oracle.printSizeInVector enabled
- Assembly Graph Features: edgeFraction parameter controls inclusion of pairMap edge weights from assembly graphs
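As a rough illustration of the first three feature types, the following Python sketch computes a TNF cosine difference, a logarithmic depth ratio, and an absolute GC difference for one contig pair. It is a sketch under stated assumptions, not the Oracle.toVector() implementation: the function names, the pseudocount in the depth ratio, the use of non-canonical 4-mers, and the feature order are all hypothetical.

```python
import math
from itertools import product

# All 256 tetranucleotides in a fixed order, mapped to array positions.
KMERS = ["".join(p) for p in product("ACGT", repeat=4)]
INDEX = {k: i for i, k in enumerate(KMERS)}

def gc(seq):
    """GC fraction over unambiguous bases."""
    seq = seq.upper()
    acgt = sum(seq.count(b) for b in "ACGT")
    return (seq.count("G") + seq.count("C")) / acgt if acgt else 0.0

def tetramer_counts(seq):
    """Count 4-mers; windows containing non-ACGT symbols are skipped."""
    seq = seq.upper()
    counts = [0] * len(KMERS)
    for i in range(len(seq) - 3):
        idx = INDEX.get(seq[i:i + 4])
        if idx is not None:
            counts[idx] += 1
    return counts

def cosine_difference(a, b):
    """1 - cosine similarity of two count vectors, clamped at 0."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    if na == 0 or nb == 0:
        return 1.0
    return max(0.0, 1.0 - dot / (na * nb))

def log_depth_ratio(d1, d2, pseudo=0.5):
    """Log of the larger depth over the smaller; the pseudocount is an assumption."""
    hi, lo = max(d1, d2), min(d1, d2)
    return math.log((hi + pseudo) / (lo + pseudo))

def pair_features(seq_a, depth_a, seq_b, depth_b):
    """Feature list for one pair: [TNF difference, log depth ratio, GC difference]."""
    return [
        cosine_difference(tetramer_counts(seq_a), tetramer_counts(seq_b)),
        log_depth_ratio(depth_a, depth_b),
        abs(gc(seq_a) - gc(seq_b)),
    ]
```

Two identical contigs at equal depth give a feature list of zeros, and the values grow as composition or coverage diverges.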
Quality Control and Filtering
The passesFilter() method implements cascade filtering:
- Size Filtering: selectContig() enforces minSize/maxSize bounds during selection with 40-iteration retry loops
- Composition Filtering: Tools.absdif(a.gc(), b.gc()) comparison against maxGCDif threshold
- Coverage Filtering: depthRatio() calculation with maxDepthRatio threshold enforcement
- Product Filtering: a combined threshold, maxProduct = maxKmerDif * maxDepthRatio * 0.75f, is applied in addition to the individual k-mer and depth limits
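The composition, coverage, and product checks can be sketched in Python as below. The thresholds mirror the maxgcdif, maxkmerdif, and maxdepthratio parameters; the function name and, in particular, the assumption that the product check compares kmer_dif * depth_ratio against maxProduct are illustrative guesses, since the text above gives only the definition of maxProduct.

```python
def passes_filter(gc_dif, kmer_dif, depth_ratio,
                  max_gc_dif=1.0, max_kmer_dif=1.0, max_depth_ratio=1000.0):
    """Return True if a contig pair survives every check (hypothetical sketch)."""
    # Cheap checks first, so most rejected pairs exit early.
    if gc_dif > max_gc_dif:
        return False
    if kmer_dif > max_kmer_dif:
        return False
    if depth_ratio > max_depth_ratio:
        return False
    # Combined threshold as defined in the text.
    max_product = max_kmer_dif * max_depth_ratio * 0.75
    # ASSUMPTION: the quantity compared is the product of the two observed values.
    return kmer_dif * depth_ratio <= max_product
```

Under this assumption, a pair that sits exactly at both the k-mer and depth limits is rejected by the product check even though it passes each limit individually.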
Training Data Balance
The outputResults() method maintains balance using:
- randy.nextFloat() <= positiveRate comparison controls positive/negative example ratios
- HashMap<Integer, ArrayList<Contig>> taxonomic grouping by labelTaxid for stratified sampling
- randomIndex() with baseRolls parameter biases selection toward smaller indices for diverse representation
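The two random choices above can be illustrated in Python. The label draw follows the comparison quoted in the list; the index draw assumes that "rolls" means taking the minimum of several uniform draws, which is one simple way to bias toward smaller indices and may differ from what randomIndex() actually does.

```python
import random

def choose_label(positive_rate, rng):
    """True for a positive (same-bin) example, mirroring nextFloat() <= positiveRate."""
    return rng.random() <= positive_rate

def random_index(n, rolls, rng):
    """Smallest of `rolls` uniform draws from range(n).
    ASSUMPTION: min-of-rolls is a stand-in for the real biasing scheme."""
    return min(rng.randrange(n) for _ in range(max(1, rolls)))
```

With one roll the index is uniform; each extra roll pulls the expected index further toward zero.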
Performance Characteristics
- Memory Usage: the FloatList vecBuffer and ByteBuilder lineBuffer are reused per thread; overall memory scales with allContigs.size()
- Output Format: ByteStreamWriter writes tab-delimited vectors with header specifying vecBuffer.size dimensions
- Reproducibility: Shared.threadLocalRandom(seed) ensures deterministic Random state when seed parameter specified
- Scalability: selectCluster() uses 100-iteration contig selection loops with early termination for large assemblies
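The bounded retry loops mentioned for selectContig() and selectCluster() follow a common pattern: try random candidates a fixed number of times and give up rather than loop forever. A minimal Python sketch, with a hypothetical function name and a simple per-contig length test standing in for the real acceptance criteria:

```python
import random

def select_in_bounds(lengths, min_size, max_size, rng, max_tries=40):
    """Return the index of a random contig whose length is within bounds,
    or None if no acceptable contig is found within max_tries attempts."""
    if not lengths:
        return None
    for _ in range(max_tries):
        i = rng.randrange(len(lengths))
        if min_size <= lengths[i] <= max_size:
            return i  # early termination on the first acceptable candidate
    return None
```

Capping the attempts keeps the cost per comparison bounded even when few contigs satisfy the filters; the caller simply skips the comparison when None is returned.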
Output Format
ByteStreamWriter creates tab-delimited files with specific structure:
- Header Line: #dims format specifies vecBuffer.size() dimensions, weights flag, and format version
- Feature Vectors: toLine() method converts FloatList to tab-delimited strings with 7-digit precision
- Classification Label: vector.lastElement() contains binary classification (1=positive, 0=negative)
- Feature Order: Oracle class defines vector element ordering: TNF cosine differences, depth ratios, GC differences, optional size/network features
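For downstream use, the layout described above can be written and read with a few lines of Python. This is a sketch only: the helper names are hypothetical, "7-digit precision" is assumed to mean seven decimal places, and the reader skips any line starting with # rather than interpreting the header, whose exact layout is not reproduced here.

```python
def to_line(vector, decimals=7):
    """Format one vector as a tab-delimited line (label is the last element)."""
    return "\t".join(f"{v:.{decimals}f}" for v in vector)

def parse_vectors(lines):
    """Split vector lines into feature lists and integer labels.
    Blank lines and header/comment lines starting with '#' are skipped."""
    features, labels = [], []
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue
        values = [float(x) for x in line.split("\t")]
        features.append(values[:-1])           # everything except the label
        labels.append(int(round(values[-1])))  # 1 = positive, 0 = negative
    return features, labels
```

Typical use would be to pass an open file handle for the vector file to parse_vectors() and feed the resulting lists to a training script.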
Integration with QuickBin
The generated vectors match the Oracle vector format that the QuickBin neural network expects as input. Feature dimensions and scaling are compatible with the network training pipeline.
Support
For questions and support:
- Email: bbushnell@lbl.gov
- Documentation: bbmap.org