ScoreSequence
Scores sequences using a neural network. Only the initial Xbp are used, for sequences longer than the network size.
Basic Usage
scoresequence.sh in=<sequences> out=<renamed sequences> net=<net file>
Input may be fasta or fastq, compressed or uncompressed. The tool applies a neural network to score sequences and can filter, annotate, or generate histograms based on the scores.
Parameters
Parameters are organized by their function in the sequence scoring process. All parameters from the shell script usage function are documented below.
Standard parameters
- in=<file>
- Input sequence data. Can be fasta or fastq format, compressed or uncompressed.
- out=<file>
- Output sequences renamed with their scores. Only sequences passing the filter (if enabled) will be written.
- net=<file>
- Neural network file (.bbnet format) to apply to the sequences. This contains the trained model weights and architecture.
- hist=<file>
- Output histogram of scores (x100, so scores 0-1 map to bins 0-100). Generates separate counts for positive and negative examples if parsing headers.
- overwrite=f
- (ow) Set to false to force the program to abort rather than overwrite an existing file. Default: f
Processing parameters
- rcomp=f
- Use the maximum score of a sequence and its reverse complement. When true, both orientations are scored and the higher score is used. Default: f
- parse=f
- Parse sequence headers for 'result=' field to determine whether they are positive or negative examples. Used for generating separate histogram bins. Default: f
- annotate=t
- Rename output reads by appending 'score=X.XXXX' to the sequence identifier. Default: t
- filter=f
- Retain only reads above or below a cutoff score. Setting the cutoff or highpass flag will automatically set this to true. Default: f
- cutoff=0.5
- Score cutoff for filtering. Scores typically range from 0 to 1. Sequences are retained based on the highpass setting. Default: 0.5
- highpass=t
- Retain sequences ABOVE cutoff if true, else retain sequences BELOW cutoff. Default: t
Advanced parameters
- width=<int>
- Width of the sequence window for neural network input. If not specified, automatically calculated from network architecture as (numInputs-4)/4.
- k=<int>
- Kmer length for sequence encoding. Used by SequenceToVector for converting sequences to neural network input vectors. Default: 0 (auto-detect)
- lowpass=<boolean>
- Inverse of highpass. When true, retains sequences BELOW the cutoff. Automatically sets filter=true.
Java Parameters
- -Xmx
- This will set Java's memory usage, overriding autodetection. -Xmx20g will specify 20 gigs of RAM, and -Xmx200m will specify 200 megs. The max is typically 85% of physical memory.
- -eoom
- This flag will cause the process to exit if an out-of-memory exception occurs. Requires Java 8u92+.
- -da
- Disable assertions.
Examples
Basic sequence scoring
scoresequence.sh in=sequences.fasta out=scored.fasta net=model.bbnet
Score sequences using a trained neural network and output with score annotations in headers.
Filtering with score cutoff
scoresequence.sh in=reads.fastq out=filtered.fastq net=classifier.bbnet cutoff=0.7 filter=t
Filter sequences, keeping only those with scores above 0.7.
Generate score histogram
scoresequence.sh in=test_data.fasta net=model.bbnet hist=score_distribution.txt parse=t
Generate a histogram of scores without output sequences, parsing headers to separate positive and negative examples.
Reverse complement scoring
scoresequence.sh in=dna.fasta out=scored_rcomp.fasta net=strand_model.bbnet rcomp=t
Score both orientations of each sequence and use the maximum score.
Algorithm Details
Neural Network Scoring Process
ScoreSequence applies trained neural networks to biological sequences using the following process:
- Sequence Encoding: DNA sequences are converted to numerical vectors using SequenceToVector.fillVector(), which creates a fixed-width representation suitable for neural network input
- Network Application: The CellNet class handles the neural network computation, applying the trained weights through feedforward propagation
- Score Generation: Each sequence receives a numerical score, typically ranging from 0 to 1, representing the network's confidence or classification
Reverse Complement Processing
When rcomp=true, the algorithm performs dual-orientation scoring:
- Scores the original sequence orientation using net.feedForward()
- Generates the reverse complement using AminoAcid.reverseComplementBasesInPlace() on the byte array
- Scores the reverse complement orientation with SequenceToVector.fillVector() and net.feedForward()
- Returns the maximum of both scores using Tools.max(f, r) where f=forward, r=reverse
- Restores the original sequence orientation with a second reverseComplementBasesInPlace() call
Filtering and Annotation
The tool provides flexible output options:
- Score Annotation: When annotate=true, sequence identifiers are modified using String.format("\tscore=%.4f", score) appended to r.id
- Threshold Filtering: Sequences are filtered using boolean logic (score>=cutoff)==highpass where pass sequences increment readsOut counter and are written to output via bsw.print()
- Histogram Generation: Score distributions use LongList arrays (phist/mhist) with 101 bins where scores are scaled by 100 using Tools.mid(0, Math.round(score*100), 100) and categorized by result header field parsing
Memory and Performance
The implementation uses specific optimizations:
- Streaming Processing: Sequences are processed in batches using ConcurrentReadInputStream with automatic list management and garbage collection
- Vector Reuse: A single float array (width*4+4 elements) is reused for all sequence-to-vector conversions via SequenceToVector.fillVector() to minimize allocation overhead
- Network Width Calculation: Width is automatically calculated as (numInputs-4)/4 from the CellNet architecture when not specified
Input Validation
The constructor performs validation checks:
- Neural network loaded via CellNetParser.load(netFile) with assert(net!=null) validation
- Width dimension checked using assert(width==(net.numInputs()-4)/4) when manually specified
- Input format detected using FileFormat.testInput(in1, FileFormat.FASTQ, null, true, true)
Technical Notes
Network File Format
The tool loads neural networks using CellNetParser.load() from .bbnet files, which contain:
- Network architecture loaded into CellNet class (layer sizes, activation functions)
- Trained weights and biases for feedforward computation
- Input dimension validation via net.numInputs() for width calculation
Sequence Length Limitations
Only the initial width base pairs are processed by SequenceToVector.fillVector(bases, vec, k) for sequences longer than the network input dimensions. The width parameter determines the fixed window size for neural network input.
Score Interpretation
Score meanings depend on the specific neural network training:
- Classification networks typically output probabilities (0-1 range)
- Regression networks may use different score ranges
- Higher scores generally indicate stronger positive classification
Support
For questions and support:
- Email: bbushnell@lbl.gov
- Documentation: bbmap.org