ScoreSequence

Script: scoresequence.sh | Package: ml | Class: ScoreSequence.java

Scores sequences using a neural network. For sequences longer than the network input width, only the first X bp are used, where X is that width.

Basic Usage

scoresequence.sh in=<sequences> out=<renamed sequences> net=<net file>

Input may be fasta or fastq, compressed or uncompressed. The tool applies a neural network to score sequences and can filter, annotate, or generate histograms based on the scores.

Parameters

Parameters are organized by their function in the sequence scoring process. All parameters from the shell script usage function are documented below.

Standard parameters

in=<file>
Input sequence data. Can be fasta or fastq format, compressed or uncompressed.
out=<file>
Output sequences renamed with their scores. Only sequences passing the filter (if enabled) will be written.
net=<file>
Neural network file (.bbnet format) to apply to the sequences. This contains the trained model weights and architecture.
hist=<file>
Output histogram of scores, multiplied by 100 so that scores 0-1 map to bins 0-100. Generates separate counts for positive and negative examples if headers are parsed (parse=t).
overwrite=f
(ow) Set to false to force the program to abort rather than overwrite an existing file. Default: f
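
The x100 binning described for hist can be illustrated with a short Python sketch. This is not the tool's code: the function name score_histogram is hypothetical, and both the rounding rule (nearest bin) and the clamping of out-of-range scores are assumptions, since the description above only states that scores 0-1 map to bins 0-100.

```python
def score_histogram(scores, bins=101):
    """Count scores into integer bins, with scores 0..1 mapped to bins 0..100."""
    counts = [0] * bins
    for s in scores:
        # Assumption: multiply by 100 and round to the nearest bin;
        # the real tool may truncate instead.
        b = int(round(s * 100))
        # Assumption: scores outside 0-1 are clamped into the end bins.
        b = max(0, min(bins - 1, b))
        counts[b] += 1
    return counts


hist = score_histogram([0.0, 0.25, 0.251, 0.999, 1.0])
print(hist[0], hist[25], hist[100])
```

With parse=t, one such array would presumably be kept for positive examples and another for negative examples.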

Processing parameters

rcomp=f
Use the maximum score of a sequence and its reverse complement. When true, both orientations are scored and the higher score is used. Default: f
parse=f
Parse sequence headers for 'result=' field to determine whether they are positive or negative examples. Used for generating separate histogram bins. Default: f
annotate=t
Rename output reads by appending 'score=X.XXXX' to the sequence identifier. Default: t
filter=f
Retain only reads above or below a cutoff score. Setting the cutoff or highpass flag will automatically set this to true. Default: f
cutoff=0.5
Score cutoff for filtering. Scores typically range from 0 to 1. Sequences are retained based on the highpass setting. Default: 0.5
highpass=t
Retain sequences ABOVE cutoff if true, else retain sequences BELOW cutoff. Default: t
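
The interaction of annotate, filter, cutoff, and highpass can be sketched in Python as follows. This is an illustration under stated assumptions, not the tool's implementation: the function names are hypothetical, the handling of a score exactly equal to the cutoff is a guess, and the space used to separate the name from the score=X.XXXX suffix is also assumed.

```python
def passes_filter(score, cutoff=0.5, highpass=True):
    # Assumption: a score equal to the cutoff is kept in highpass mode and
    # dropped otherwise; the documentation does not state the boundary case.
    return score >= cutoff if highpass else score < cutoff


def annotate(name, score):
    # Assumption: the suffix is appended after a single space.
    return f"{name} score={score:.4f}"


def process(records, cutoff=0.5, highpass=True, do_filter=False, do_annotate=True):
    """records: iterable of (name, sequence, score) tuples."""
    out = []
    for name, seq, score in records:
        if do_filter and not passes_filter(score, cutoff, highpass):
            continue  # filtered reads are not written
        out.append((annotate(name, score) if do_annotate else name, seq))
    return out


records = [("r1", "ACGT", 0.91), ("r2", "TTTT", 0.12)]
kept_high = process(records, cutoff=0.7, do_filter=True)
kept_low = process(records, cutoff=0.7, highpass=False, do_filter=True)
print(kept_high)
print(kept_low)
```

In this sketch, highpass=False plays the role of lowpass=t.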

Advanced parameters

width=<int>
Width of the sequence window for neural network input. If not specified, automatically calculated from network architecture as (numInputs-4)/4.
k=<int>
Kmer length for sequence encoding. Used by SequenceToVector for converting sequences to neural network input vectors. Default: 0 (auto-detect)
lowpass=<boolean>
Inverse of highpass. When true, retains sequences BELOW the cutoff. Automatically sets filter=true.
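
As a worked example of the default width formula, the following Python sketch evaluates (numInputs-4)/4 for a network with 404 inputs. The function name default_width and the example input count are hypothetical, and integer division is assumed.

```python
def default_width(num_inputs):
    # Formula from the width description: (numInputs - 4) / 4.
    # Assumption: the division is integer division.
    return (num_inputs - 4) // 4


print(default_width(404))
```

Under that formula, a network with 404 inputs would use a 100 bp window.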

Java Parameters

-Xmx
This will set Java's memory usage, overriding autodetection. -Xmx20g will specify 20 gigs of RAM, and -Xmx200m will specify 200 megs. The max is typically 85% of physical memory.
-eoom
This flag will cause the process to exit if an out-of-memory exception occurs. Requires Java 8u92+.
-da
Disable assertions.

Examples

Basic sequence scoring

scoresequence.sh in=sequences.fasta out=scored.fasta net=model.bbnet

Score sequences using a trained neural network and output with score annotations in headers.

Filtering with score cutoff

scoresequence.sh in=reads.fastq out=filtered.fastq net=classifier.bbnet cutoff=0.7 filter=t

Filter sequences, keeping only those with scores above 0.7 (highpass defaults to true).

Generate score histogram

scoresequence.sh in=test_data.fasta net=model.bbnet hist=score_distribution.txt parse=t

Generate a histogram of scores without writing output sequences, parsing headers to separate positive and negative examples.

Reverse complement scoring

scoresequence.sh in=dna.fasta out=scored_rcomp.fasta net=strand_model.bbnet rcomp=t

Score both orientations of each sequence and use the maximum score.

Algorithm Details

Neural Network Scoring Process

ScoreSequence applies a trained neural network to biological sequences: each sequence is converted to an input vector by SequenceToVector.fillVector(), and net.feedForward() produces its score.

Reverse Complement Processing

When rcomp=true, the algorithm performs dual-orientation scoring:

  1. Scores the original sequence orientation using net.feedForward()
  2. Generates the reverse complement using AminoAcid.reverseComplementBasesInPlace() on the byte array
  3. Scores the reverse complement orientation with SequenceToVector.fillVector() and net.feedForward()
  4. Returns the maximum of both scores using Tools.max(f, r) where f=forward, r=reverse
  5. Restores the original sequence orientation with a second reverseComplementBasesInPlace() call
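
The five steps above can be sketched in Python. This is an illustration only: score_fn stands in for the fillVector and feedForward calls, and toy_score is a made-up scorer used purely for the demonstration. Unlike the in-place approach described above, the sketch builds a new bytes object for the reverse complement, so no restoring second call is needed.

```python
# Translation table for complementing bases; other bytes (e.g. N) are unchanged.
_COMPLEMENT = bytes.maketrans(b"ACGTacgt", b"TGCAtgca")


def reverse_complement(bases: bytes) -> bytes:
    # Complement every base, then reverse the order.
    return bases.translate(_COMPLEMENT)[::-1]


def score_both_strands(bases: bytes, score_fn) -> float:
    f = score_fn(bases)                      # forward orientation
    r = score_fn(reverse_complement(bases))  # reverse-complement orientation
    return max(f, r)                         # keep the higher score


def toy_score(bases: bytes) -> float:
    # Hypothetical scorer for demonstration: fraction of A bases.
    return bases.count(b"A") / len(bases)


print(reverse_complement(b"AACG"))
print(score_both_strands(b"TTTT", toy_score))
```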

Filtering and Annotation

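The tool provides flexible output options: annotating read names with score=X.XXXX (annotate), retaining only reads above or below a cutoff (filter, cutoff, highpass, lowpass), and writing a score histogram (hist).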

Memory and Performance

The implementation uses specific optimizations:

Input Validation

The constructor performs validation checks:

Technical Notes

Network File Format

The tool loads neural networks using CellNetParser.load() from .bbnet files, which contain the trained model weights and architecture.

Sequence Length Limitations

For sequences longer than the network input dimensions, SequenceToVector.fillVector(bases, vec, k) processes only the first width bases, where width is the value of the width parameter. That parameter sets the fixed window size for neural network input.
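
The truncation to a fixed window can be sketched in Python. This is not SequenceToVector's actual encoding, which is not described here. The sketch assumes four one-hot inputs per base plus four extra inputs left at zero, a layout chosen only because it matches numInputs = 4*width + 4. The function name encode_window and the placement of the extra inputs at the end of the vector are hypothetical.

```python
def encode_window(bases: str, width: int):
    # Only the first `width` bases are used; anything beyond is ignored.
    window = bases[:width]
    # Assumption: 4 one-hot inputs per base, plus 4 extra inputs (left as
    # zero here) at the end, giving 4*width + 4 inputs in total.
    vec = [0.0] * (4 * width + 4)
    for i, b in enumerate(window.upper()):
        j = "ACGT".find(b)
        if j >= 0:  # ambiguous bases such as N stay all-zero
            vec[4 * i + j] = 1.0
    return vec


v = encode_window("ACGTACGT", 2)
print(len(v), v)
```

Sequences shorter than the window simply leave the remaining positions at zero in this sketch.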

Score Interpretation

Score meanings depend on how the specific neural network was trained. Scores typically range from 0 to 1.
