ScoreSequence

Basic Usage

scoresequence.sh in=<sequences> out=<renamed sequences> net=<net file>

Input may be fasta or fastq, compressed or uncompressed. The tool applies a neural network to score sequences and can filter, annotate, or generate histograms based on the scores.

Parameters

Parameters are organized by their function in the sequence scoring process. All parameters from the shell script usage function are documented below.

Standard parameters

in=<file>: Input sequence data. Can be fasta or fastq format, compressed or uncompressed.
out=<file>: Output sequences renamed with their scores. Only sequences passing the filter (if enabled) will be written.
net=<file>: Neural network file (.bbnet format) to apply to the sequences. This contains the trained model weights and architecture.
hist=<file>: Output histogram of scores (x100, so scores 0-1 map to bins 0-100). Generates separate counts for positive and negative examples if parsing headers.
overwrite=f: (ow) Set to false to force the program to abort rather than overwrite an existing file. Default: f

Processing parameters

rcomp=f: Use the maximum score of a sequence and its reverse complement. When true, both orientations are scored and the higher score is used. Default: f
parse=f: Parse sequence headers for 'result=' field to determine whether they are positive or negative examples. Used for generating separate histogram bins. Default: f
annotate=t: Rename output reads by appending 'score=X.XXXX' to the sequence identifier. Default: t
filter=f: Retain only reads above or below a cutoff score. Setting the cutoff or highpass flag will automatically set this to true. Default: f
cutoff=0.5: Score cutoff for filtering. Scores typically range from 0 to 1. Sequences are retained based on the highpass setting. Default: 0.5
highpass=t: Retain sequences ABOVE cutoff if true, else retain sequences BELOW cutoff. Default: t

Advanced parameters

width=<int>: Width of the sequence window for neural network input. If not specified, automatically calculated from network architecture as (numInputs-4)/4.
k=<int>: Kmer length for sequence encoding. Used by SequenceToVector for converting sequences to neural network input vectors. Default: 0 (auto-detect)
lowpass=<boolean>: Inverse of highpass. When true, retains sequences BELOW the cutoff. Automatically sets filter=true.

Java Parameters

-Xmx: This will set Java's memory usage, overriding autodetection. -Xmx20g will specify 20 gigs of RAM, and -Xmx200m will specify 200 megs. The max is typically 85% of physical memory.
-eoom: This flag will cause the process to exit if an out-of-memory exception occurs. Requires Java 8u92+.
-da: Disable assertions.

Examples

Basic sequence scoring

scoresequence.sh in=sequences.fasta out=scored.fasta net=model.bbnet

Score sequences using a trained neural network and output with score annotations in headers.

Filtering with score cutoff

scoresequence.sh in=reads.fastq out=filtered.fastq net=classifier.bbnet cutoff=0.7 filter=t

Filter sequences, keeping only those with scores above 0.7.

Generate score histogram

scoresequence.sh in=test_data.fasta net=model.bbnet hist=score_distribution.txt parse=t

Generate a histogram of scores without output sequences, parsing headers to separate positive and negative examples.

Reverse complement scoring

scoresequence.sh in=dna.fasta out=scored_rcomp.fasta net=strand_model.bbnet rcomp=t

Score both orientations of each sequence and use the maximum score.

Algorithm Details

Neural Network Scoring Process

ScoreSequence applies trained neural networks to biological sequences using the following process:

Sequence Encoding: DNA sequences are converted to numerical vectors using SequenceToVector.fillVector(), which creates a fixed-width representation suitable for neural network input
Network Application: The CellNet class handles the neural network computation, applying the trained weights through feedforward propagation
Score Generation: Each sequence receives a numerical score, typically ranging from 0 to 1, representing the network's confidence or classification

Reverse Complement Processing

When rcomp=true, the algorithm performs dual-orientation scoring:

Scores the original sequence orientation using net.feedForward()
Generates the reverse complement using AminoAcid.reverseComplementBasesInPlace() on the byte array
Scores the reverse complement orientation with SequenceToVector.fillVector() and net.feedForward()
Returns the maximum of both scores using Tools.max(f, r) where f=forward, r=reverse
Restores the original sequence orientation with a second reverseComplementBasesInPlace() call

Filtering and Annotation

The tool provides flexible output options:

Score Annotation: When annotate=true, sequence identifiers are modified using String.format("\tscore=%.4f", score) appended to r.id
Threshold Filtering: Sequences are filtered using boolean logic (score>=cutoff)==highpass where pass sequences increment readsOut counter and are written to output via bsw.print()
Histogram Generation: Score distributions use LongList arrays (phist/mhist) with 101 bins where scores are scaled by 100 using Tools.mid(0, Math.round(score*100), 100) and categorized by result header field parsing

Memory and Performance

The implementation uses specific optimizations:

Streaming Processing: Sequences are processed in batches using ConcurrentReadInputStream with automatic list management and garbage collection
Vector Reuse: A single float array (width*4+4 elements) is reused for all sequence-to-vector conversions via SequenceToVector.fillVector() to minimize allocation overhead
Network Width Calculation: Width is automatically calculated as (numInputs-4)/4 from the CellNet architecture when not specified

Input Validation

The constructor performs validation checks:

Neural network loaded via CellNetParser.load(netFile) with assert(net!=null) validation
Width dimension checked using assert(width==(net.numInputs()-4)/4) when manually specified
Input format detected using FileFormat.testInput(in1, FileFormat.FASTQ, null, true, true)

Technical Notes

Network File Format

The tool loads neural networks using CellNetParser.load() from .bbnet files, which contain:

Network architecture loaded into CellNet class (layer sizes, activation functions)
Trained weights and biases for feedforward computation
Input dimension validation via net.numInputs() for width calculation

Sequence Length Limitations

Only the initial width base pairs are processed by SequenceToVector.fillVector(bases, vec, k) for sequences longer than the network input dimensions. The width parameter determines the fixed window size for neural network input.

Score Interpretation

Score meanings depend on the specific neural network training:

Classification networks typically output probabilities (0-1 range)
Regression networks may use different score ranges
Higher scores generally indicate stronger positive classification

Support

For questions and support:

Email: bbushnell@lbl.gov
Documentation: bbmap.org