NetFilter

Script: netfilter.sh Package: ml Class: NetFilter.java

Scores sequences using a neural network with multithreaded processing for sequence classification. Provides configurable filtering modes (highpass/lowpass cutoffs), reverse complement scoring, and statistical metrics accumulation (TPR, FPR, score distributions) for biological sequence screening.

Basic Usage

netfilter.sh in=<sequences> out=<pass> outu=<fail> net=<net file>

Input may be fasta or fastq, compressed or uncompressed.

Parameters

Parameters are organized by their function in the neural network filtering process. The tool provides four scoring modes (single, average, max, min) and binary filtering via configurable cutoff thresholds for sequence classification.

Standard Parameters

in=<file>
Input sequences. Accepts fasta or fastq format, compressed or uncompressed.
out=<file>
Sequences passing the filter. Output format matches input format.
outu=<file>
Sequences failing the filter. Alternative output for rejected sequences.
net=<file>
Network file to apply to the sequences. Must be a trained neural network in BBNet format.
hist=<file>
Histogram of scores (x100, so 0-1 maps to 0-100). Useful for analyzing score distributions.
overwrite=f
(ow) Set to false to force the program to abort rather than overwrite an existing file. Default: f

Processing Parameters

rcomp=f
Use the max score of a sequence and its reverse complement. Useful for double-stranded analysis. Default: f
parse=f
Parse sequence headers for 'result=' to determine whether they are positive or negative examples. Positive examples should be annotated with result=1, and negative with result=0. Default: f
annotate=f
Append '\tscore=X.XXXX' (four decimal places) to each output sequence header, providing traceability of neural network scores in downstream analysis. Annotation is applied after all scoring calculations (single/average/max/min modes, reverse complement, pair mode). Default: f
filter=t
Retain only reads above or below a cutoff. Setting the cutoff or highpass flag will automatically set this to true. Default: t
cutoff=auto
Score cutoff for filtering; neural network scores typically range from 0 to 1. Setting cutoff=auto uses net.cutoff embedded in the BBNet file. Custom values override embedded cutoffs. Filtering logic: pass = (score>=cutoff)==highpass, where scores are compared after all processing (pair mode, reverse complement, etc.). Default: auto
highpass=t
Retain sequences ABOVE cutoff if true, else BELOW cutoff. Default: t
scoremode=single
Scoring strategy for sequences:
  • single (default): Apply scoreSingle() once per sequence using the first W bases (where W = network width). Fastest method for sequences longer than network input size.
  • average: Apply scoreFrames() with sliding windows every 'stepsize' bases, return arithmetic mean of all window scores. Computationally intensive.
  • max: Like average but return maximum window score using Tools.max(). Useful for identifying best-matching regions.
  • min: Like average but return minimum window score using Tools.min(). Conservative classification requiring all regions to meet criteria.
Default: single
pairmode=average
Scoring strategy for paired reads:
  • average (default): Calculate (score1+score2)*0.5f for paired reads.
  • max: Use Tools.max(score1, score2) - higher of the two scores.
  • min: Use Tools.min(score1, score2) - lower of the two scores.
Implementation handles single reads by using score1 for both calculations. Default: average
stepsize=1
If scoremode is other than 'single', score a window every this many bases using scoreFrames() method. The window width is defined by the network input dimensions. Higher stepsize values reduce computational load but may miss features between windows. Calculation: start+=stepsize, stop+=stepsize for each iteration. Default: 1
overlap=
Alternative to stepsize specification; if either flag is used it will override the other. Controls window overlap in sliding window scoring modes. Implementation: stepsize = width - overlap. Setting overlap=0 is equivalent to stepsize=W (network width), meaning no overlap between consecutive windows. Higher overlap values provide finer-grained scoring at computational cost.
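The interaction of cutoff, highpass, scoremode, stepsize, and overlap can be sketched in Python. This is an illustrative sketch of the logic described above, not the Java implementation; score_window is a stand-in for the network call.

```python
# Illustrative sketch of the filtering and windowing logic described above.
# score_window() is a hypothetical stand-in for the neural network call.

def passes(score, cutoff, highpass=True):
    # Filtering logic from the cutoff parameter: pass = (score >= cutoff) == highpass
    return (score >= cutoff) == highpass

def score_frames(seq, width, stepsize, score_window, mode="average"):
    # Slide a window of the network's input width across the sequence,
    # advancing by stepsize bases each iteration.
    scores = [score_window(seq[start:start + width])
              for start in range(0, len(seq) - width + 1, stepsize)]
    if mode == "average":
        return sum(scores) / len(scores)
    return max(scores) if mode == "max" else min(scores)

def stepsize_from_overlap(width, overlap):
    # overlap is an alternative way to express stepsize:
    # overlap=0 means consecutive windows do not overlap at all.
    return width - overlap
```

For example, with width=100 and overlap=0, consecutive windows start 100 bases apart, which is the no-overlap case noted above.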

Java Parameters

-Xmx
This will set Java's memory usage, overriding autodetection. -Xmx20g will specify 20 gigs of RAM, and -Xmx200m will specify 200 megs. The max is typically 85% of physical memory.
-eoom
This flag will cause the process to exit if an out-of-memory exception occurs. Requires Java 8u92+.
-da
Disable assertions. Can improve performance in production environments.

Examples

Basic Sequence Classification

netfilter.sh in=reads.fq out=classified.fq outu=unclassified.fq net=model.bbnet

Classify sequences using a trained neural network, separating passing sequences from failing ones.

Scoring with Reverse Complement Analysis

netfilter.sh in=sequences.fa out=positive.fa outu=negative.fa net=classifier.bbnet rcomp=t annotate=t

Score sequences considering both forward and reverse complement orientations, annotating output with scores.

Sliding Window Scoring

netfilter.sh in=long_reads.fq out=filtered.fq net=window_model.bbnet scoremode=average stepsize=50

Apply sliding window scoring with 50-base steps, using average scores across windows for classification.

Custom Cutoff with Statistics

netfilter.sh in=test_data.fq out=pass.fq outu=fail.fq net=model.bbnet cutoff=0.7 hist=score_distribution.txt parse=t

Use custom score cutoff of 0.7, parse sequence headers for ground truth labels, and generate score histogram.

Algorithm Details

Neural Network Processing

NetFilter implements multithreaded neural network inference using Shared.threads() for thread count determination. Each worker thread maintains its own copy of the neural network (CellNet.copy()) to avoid synchronization overhead. The system uses ConcurrentReadInputStream and ConcurrentReadOutputStream with configurable buffer sizes (8-128 based on thread count and ordering requirements). Default memory allocation is 2GB with scaling via freeRam() calculation (85% of physical memory maximum).
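The per-thread isolation described above can be sketched in Python, with threads and a deep copy standing in for the Java worker threads and CellNet.copy(); the Net class here is a hypothetical stand-in, not the real CellNet.

```python
import copy
import threading

class Net:
    """Hypothetical stand-in for CellNet; holds mutable scratch state,
    which is why each worker needs its own copy."""
    def __init__(self):
        self.scratch = []
    def score(self, seq):
        self.scratch.append(seq)   # mutable state: unsafe to share across threads
        return (len(seq) % 2) * 1.0

def worker(shared_net, reads, out, lock):
    net = copy.deepcopy(shared_net)          # analogous to CellNet.copy()
    results = [(r, net.score(r)) for r in reads]
    with lock:                               # synchronize only on accumulation
        out.extend(results)

def run(shared_net, batches):
    out, lock = [], threading.Lock()
    threads = [threading.Thread(target=worker, args=(shared_net, b, out, lock))
               for b in batches]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return out
```

Copying the network per worker trades a little memory for zero lock contention during scoring; the only synchronized step is appending results.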

Scoring Modes

The tool supports four distinct scoring modes (single, average, max, and min), described under the scoremode parameter above.

Reverse Complement Processing

When rcomp=t, the algorithm processes both the forward sequence and its reverse complement, taking the maximum score between orientations. The implementation uses an efficient in-place algorithm:

  1. Score the original sequence using scoreSingle() or scoreFrames()
  2. In-place reverse complement using AminoAcid.reverseComplementBasesInPlace() - no memory allocation
  3. Score the reverse complement with the same scoring method
  4. Restore original sequence orientation with another in-place reverse complement
  5. Return Tools.max(forward_score, reverse_score)

This approach avoids creating new byte arrays, minimizing memory overhead and garbage collection pressure during processing.
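The steps above can be sketched in Python, with a bytearray standing in for the Java byte[] and score() as a placeholder for the network call:

```python
# In-place analogue of AminoAcid.reverseComplementBasesInPlace();
# COMP is a 256-entry translation table built once.
COMP = bytes.maketrans(b"ACGTacgt", b"TGCAtgca")

def reverse_complement_in_place(bases: bytearray) -> None:
    # Complement each base, then reverse the buffer; no new array is allocated.
    for i in range(len(bases)):
        bases[i] = COMP[bases[i]]
    bases.reverse()

def score_with_rcomp(bases: bytearray, score) -> float:
    forward = score(bases)                # 1. score the original orientation
    reverse_complement_in_place(bases)    # 2. flip in place
    reverse = score(bases)                # 3. score the reverse complement
    reverse_complement_in_place(bases)    # 4. restore the original orientation
    return max(forward, reverse)          # 5. keep the better orientation
```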

Paired-End Handling

For paired-end reads, the tool provides three combination strategies (average, max, and min), selected via the pairmode parameter.
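The three strategies reduce to a small combiner; a minimal Python sketch, where score1 and score2 are the per-read network scores:

```python
def combine_pair(score1, score2=None, pairmode="average"):
    # Unpaired reads use score1 for both slots, matching the implementation note
    # under the pairmode parameter.
    if score2 is None:
        score2 = score1
    if pairmode == "average":
        return (score1 + score2) * 0.5
    return max(score1, score2) if pairmode == "max" else min(score1, score2)
```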

Sequence-to-Vector Conversion

The tool uses SequenceToVector.fillVector() to convert DNA sequences into floating-point input vectors for neural network processing. The conversion process operates on raw byte arrays and supports both full-sequence and windowed processing.

Each thread maintains its own float[] vector to avoid synchronization during parallel processing.
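The exact vector layout is defined by SequenceToVector.fillVector(); purely as an illustration, a common one-hot-per-base scheme (an assumption, not the confirmed encoding) with buffer reuse might look like:

```python
# Hypothetical one-hot encoding; the real layout is defined by
# SequenceToVector.fillVector() and may differ.
BASE_INDEX = {"A": 0, "C": 1, "G": 2, "T": 3}

def fill_vector(seq, width, vec=None):
    # Reuse a per-thread buffer (vec) across calls to avoid reallocation,
    # mirroring the per-thread float[] described above.
    if vec is None:
        vec = [0.0] * (4 * width)
    else:
        for i in range(len(vec)):
            vec[i] = 0.0
    for pos, base in enumerate(seq[:width]):
        idx = BASE_INDEX.get(base.upper())
        if idx is not None:               # ambiguous bases (e.g. N) stay zero
            vec[4 * pos + idx] = 1.0
    return vec
```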

Statistical Analysis

NetFilter provides statistical analysis with thread-safe accumulation of metrics via synchronized blocks in the Accumulator pattern.

Statistical output includes detailed breakdowns: "Average Score: X.XXXX", "True Positive: N (X.XX%)", with all calculations performed at program completion.
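Given ground-truth labels from parse=t, the accumulated confusion counts yield the reported rates. A minimal Python sketch of the accumulation; the field names here are illustrative, not the tool's output format:

```python
def accumulate_stats(records, cutoff, highpass=True):
    # records: (score, truth) pairs, with truth parsed from 'result=' headers.
    tp = fp = tn = fn = 0
    total = 0.0
    for score, truth in records:
        total += score
        passed = (score >= cutoff) == highpass
        if truth:
            tp += passed
            fn += not passed
        else:
            fp += passed
            tn += not passed
    n = len(records)
    return {
        "avg_score": total / n,
        "tpr": tp / (tp + fn) if tp + fn else 0.0,  # true positive rate
        "fpr": fp / (fp + tn) if fp + tn else 0.0,  # false positive rate
    }
```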

Network Compatibility

NetFilter loads neural networks using CellNetParser.load() from BBNet-format files.

Network loading includes validation: assert(net!=null) ensures successful parsing, and dimension consistency is verified against specified width parameters.

Thread Safety and Synchronization

NetFilter implements thread safety through ReentrantReadWriteLock synchronization and per-thread resource isolation.

Memory Management

The tool implements several memory optimization strategies, including in-place reverse complementing, per-thread reuse of input vectors, and bounded stream buffer sizes.

Support

For questions and support: