Train

Script: train.sh | Package: ml | Class: Trainer.java

Trains or evaluates neural networks using configurable architectures, activation functions, and optimization techniques including annealing, seed scanning, and multi-threaded processing.

Basic Usage

train.sh in=<data> dims=<X,Y,Z> out=<trained network>

train.sh in=<data> netin=<network> evaluate

Input is a tab-delimited file of data vectors: the first line specifies the dimensions ('#dims X Y'), and subsequent lines contain tab-delimited floating point numbers. Such files can be created via seqtovec.sh.
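
For illustration, a minimal input file for a network with 3 inputs and 1 output might look like the following (values are placeholders; columns are tab-delimited, with the last column holding the desired result):

#dims 3 1
0.12	0.80	0.33	1.0
0.95	0.02	0.47	0.0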

Parameters

Parameters control neural network architecture, training algorithms, and optimization strategies. The trainer uses techniques including multi-threaded processing, seed scanning, annealing, and configurable activation functions.

I/O parameters

in=<file>
Tab-delimited data vectors. The first line should look like '#dims 5 1' with the number of inputs and outputs; the first X columns are inputs, and the last Y the desired result. Subsequent lines are tab-delimited floating point numbers. Can be created via seqtovec.sh.
validate=<file>
Optional validation dataset used exclusively for evaluation.
net=<file>
Optional input network to train.
out=<file>
Final output network after the last epoch.
outb=<file>
Best discovered network according to evaluation metrics.
overwrite=f
(ow) Set to false to force the program to abort rather than overwrite an existing file.

Processing parameters

evaluate=f
Don't do any training, just evaluate the network.
dims=
Set network dimensions. E.g. dims=5,12,7,1
mindims=
Minimum dimensions for random network generation; together with maxdims this allows randomly-sized networks, but the number of inputs and outputs must agree between them. E.g. mindims=5,6,3,1 maxdims=5,18,15,1
maxdims=
Maximum dimensions for random network generation. Used with mindims to create variable-size networks while maintaining fixed input/output layer sizes.
batches=400k
Number of batches to train.
alpha=0.08
Amount to adjust weights during backpropagation. Larger numbers train faster but may not converge.
balance=0.2
If the positive and negative samples are unequal, make copies of whichever has fewer until this ratio is met. 1.0 would make an equal number of positive and negative samples.
density=1.0
Retain at least this fraction of edges.
edges=-1
If positive, cap the maximum number of edges.
dense=t
Set dense=f (or sparse) to process as a sparse network. Dense mode is fastest for fully- or mostly-connected networks; sparse becomes faster below 0.25 density or so.
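
As a sketch of how these options combine (file names are placeholders), the command below trains a randomly-sized network whose hidden layers fall between mindims and maxdims, while replicating the minority class to a 1:1 ratio:

# Train a randomly-dimensioned network with balanced classes
train.sh in=vectors.txt mindims=20,8,4,1 maxdims=20,32,16,1 balance=1.0 out=random.net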

Advanced training parameters

seed=-1
A positive seed will yield deterministic output; negative will use a random seed. For multiple networks, each gets a different seed but you only need to set it once.
nets=1
Train this many networks concurrently (per cycle). Only the best network will be reported, so training more networks will yield a better result. Higher values increase memory use, but can also improve CPU utilization on many-threaded CPUs.
cycles=1
Each cycle trains 'nets' networks in parallel.
setsize=60000
Iterate through subsets of at most this size while training; larger makes batches take longer.
fpb=0.08
Only train this fraction of the subset per batch, prioritizing samples with the most error; larger is slower.
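
For example (file names are placeholders), the command below trains several networks per cycle over multiple cycles, with a fixed seed for reproducibility:

# Train 8 networks per cycle over 4 cycles, deterministically
train.sh in=vectors.txt dims=50,20,1 seed=1 nets=8 cycles=4 out=best.net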

Evaluation parameters

vfraction=0.1
If no validation file is given, split off this fraction of the input dataset to use exclusively for validation.
inclusive=f
Use the full training dataset for validation. Note that 'evaluate' mode automatically uses the input for validation.
cutoffeval=
Set the evaluation cutoff directly; any output above this cutoff will be considered positive, and below will be considered negative, when evaluating a sample. This does not affect training other than the printed results and the best network selection. Overrides fpr, fnr, and crossover.
crossover=1
Set 'cutoffeval' dynamically using the intersection of the FPR and FNR curves. If false positives are 3x as detrimental as false negatives, set this at 3.0; if false negatives are 2x as bad as false positives, set this at 0.5, etc.
fpr=
Set 'cutoffeval' dynamically using this false positive rate.
fnr=
Set 'cutoffeval' dynamically using this false negative rate.
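
For example (file names are placeholders), the following evaluates against a separate validation file and sets the cutoff at a weighted FPR/FNR crossover where false positives count twice as much as false negatives:

# Use an external validation set and a weighted crossover cutoff
train.sh in=vectors.txt dims=50,20,1 validate=holdout.txt crossover=2.0 out=trained.net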

Activation functions; fractions are relative and don't need to add to 1

sig=0.6
Fraction of nodes using sigmoid function.
tanh=0.4
Fraction of nodes using tanh function.
rslog=0.02
Fraction of nodes using rotationally symmetric log.
msig=0.02
Fraction of nodes using mirrored sigmoid.
swish=0.0
Fraction of nodes using swish.
esig=0.0
Fraction of nodes using extended sigmoid.
emsig=0.0
Fraction of nodes using extended mirrored sigmoid.
bell=0.0
Fraction of nodes using a bell curve.
max=0.0
Fraction of nodes using a max function (TODO).
final=rslog
Type of function used in the final layer.

Exotic parameters

scan=0
Test this many seeds initially before picking one to train.
scanbatches=1k
Evaluate scanned seeds at this point to choose the best.
simd=f
Use SIMD instructions for greater speed; requires Java 18+.
cutoffbackprop=0.5
Optimize around this point for separating positive and negative results. Unrelated to cutoffeval.
pem=1.0
Positive error mult; when value>target, multiply the error by this number to adjust the backpropagation penalty.
nem=1.0
Negative error mult; when value<target, multiply the error by this number to adjust the backpropagation penalty.
fpem=10.5
False positive error mult; when target<cutoffbackprop and value>(cutoffbackprop-spread), multiply error by this.
fnem=10.5
False negative error mult; when target>cutoffbackprop and value<(cutoffbackprop+spread), multiply error by this.
spread=0.05
Allows applying the fpem/fnem multipliers to values that are barely on the correct side of the cutoff, but too close to it.
epem=0.2
Excess positive error mult; error multiplier when target>cutoff and value>target (overshot the target).
enem=0.2
Excess negative error mult; error multiplier when target<cutoff and value<target (excessively negative output).
epm=0.2
Excess pivot mult; lower numbers give less priority to training samples that are excessively positive or negative.
cutoff=
Set both cutoffbackprop and cutoffeval.
ptriage=0.0001
Ignore this fraction of positive samples as untrainable.
ntriage=0.0005
Ignore this fraction of negative samples as untrainable.
anneal=0.003
Randomize weights by this much to avoid local minima.
annealprob=0.225
Probability of any given weight being annealed per batch.
ebs=1
(edgeblocksize) 8x gives best performance with AVX256 in sparse networks. 4x may be useful for raw sequence.

Java Parameters

-Xmx
This will set Java's memory usage, overriding autodetection. -Xmx20g will specify 20 gigs of RAM, and -Xmx200m will specify 200 megs. The max is typically 85% of physical memory.
-eoom
This flag will cause the process to exit if an out-of-memory exception occurs. Requires Java 8u92+.
-da
Disable assertions.

Examples

Basic Network Training

# Train a 3-layer network from data vectors
train.sh in=vectors.txt dims=100,50,1 out=trained.net batches=10000

Creates a neural network with 100 inputs, 50 hidden nodes, and 1 output, training for 10,000 batches.

Network Evaluation

# Evaluate existing network without training
train.sh in=test_data.txt netin=trained.net evaluate

Loads a pre-trained network and evaluates its performance on test data without further training.

Advanced Training with Multiple Networks

# Train 4 networks concurrently, scan 100 seeds
train.sh in=data.txt dims=200,100,50,1 nets=4 scan=100 \
         alpha=0.1 density=0.8 out=best.net outb=intermediate.net

Trains 4 networks in parallel, scans 100 different random seeds to find the best starting point, retains 80% of possible edges (density=0.8), and saves both the final network (out) and the best network found during training (outb).

Custom Activation Functions

# Use custom activation function mixture
train.sh in=data.txt dims=50,25,1 sig=0.3 tanh=0.3 swish=0.4 \
         final=sigmoid batches=50000

Trains with 30% sigmoid, 30% tanh, and 40% swish activation functions in hidden layers, with sigmoid in the final layer.

Algorithm Details

Multi-Threaded Training Architecture

The trainer uses a multi-threaded architecture with separate WorkerThread and TrainerThread classes. WorkerThread instances handle JobData processing and network evaluation, while TrainerThread instances manage the training algorithms, coordinating jobs through an ArrayBlockingQueue. Thread pools are sized by Tools.mid(1, Shared.threads(), samples), so a pool never exceeds either the available threads or the number of samples.

Seed Scanning and Network Initialization

When scan>0, the trainer evaluates multiple random seeds before selecting the best starting point. This process involves creating networks with different random initializations, training them briefly (scanbatches), and selecting those with the best early performance. This seed scanning improves final network quality by avoiding poor local minima from the start.
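
For example (file names are placeholders), the command below scans 200 seeds, evaluates each candidate after 2k batches, and then fully trains only the best one:

# Scan seeds briefly before committing to full training
train.sh in=vectors.txt dims=100,40,1 scan=200 scanbatches=2k batches=200k out=trained.net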

Adaptive Alpha and Annealing

Training combines two mechanisms to refine convergence: the learning rate (alpha) controls how far weights are adjusted during backpropagation, while annealing (anneal, annealprob) randomly perturbs a fraction of weights each batch to help escape local minima.
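
For example (file names are placeholders), the command below lowers the learning rate while increasing both the annealing strength and the per-batch annealing probability:

# Smaller weight updates, more aggressive annealing
train.sh in=vectors.txt dims=100,40,1 alpha=0.04 anneal=0.006 annealprob=0.3 out=trained.net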

Error Multipliers

The trainer applies error multipliers, stored as Cell class fields, to weight different types of classification errors during backpropagation: pem and nem scale ordinary positive and negative errors, fpem and fnem heavily penalize false positives and false negatives that fall within 'spread' of the cutoff, and epem/enem reduce the penalty for samples that overshoot their target.
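
For instance (file names are placeholders), the following penalizes false negatives more heavily than false positives near the cutoff:

# Emphasize avoiding false negatives during backpropagation
train.sh in=vectors.txt dims=100,40,1 fnem=20 fpem=5 spread=0.1 cutoff=0.5 out=trained.net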

Sparse vs Dense Network Processing

The implementation processes networks in either dense or sparse mode, selected via the 'dense' flag: dense mode is fastest for fully- or mostly-connected networks, while sparse mode becomes faster below roughly 0.25 density, where the edge block size (ebs) can also be tuned for SIMD performance.
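
For example (file names are placeholders), a lightly-connected network can be trained in sparse mode with a larger edge block size:

# Sparse processing for a low-density network
train.sh in=vectors.txt dims=200,80,1 dense=f density=0.15 ebs=8 out=sparse.net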

Activation Function Diversity

The trainer supports multiple activation functions simultaneously within the same network, including sigmoid, tanh, rotationally symmetric log (rslog), swish, and extended variants. This diversity helps the network learn more complex decision boundaries and improves generalization.

Memory Management

Training uses subset-based processing (setsize parameter) to manage memory usage. Large datasets are broken into subsets, and only a fraction (fpb) of each subset is processed per batch, prioritizing samples with highest error for continued training.
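
For example (file names are placeholders), memory pressure and batch time can be reduced by shrinking the working subset and the fraction of it trained per batch:

# Smaller subsets and per-batch fractions for large datasets
train.sh in=big_vectors.txt dims=100,40,1 setsize=20000 fpb=0.05 out=trained.net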

Performance Considerations

Training speed depends mainly on network size and connectivity. Enabling simd=t (Java 18+) accelerates the math-heavy operations, training multiple networks per cycle (nets) improves CPU utilization on many-threaded machines, and smaller setsize and fpb values shorten each batch at the cost of seeing less data per batch. Networks below roughly 0.25 density typically train faster in sparse mode (dense=f).

Support

For questions and support: