NetFilter
Scores sequences with a trained neural network, using multithreaded processing for fast classification. Provides configurable filtering modes (highpass/lowpass cutoffs), reverse-complement scoring, and accumulation of statistical metrics (TPR, FPR, score distributions) for biological sequence screening.
Basic Usage
netfilter.sh in=<sequences> out=<pass> outu=<fail> net=<net file>
Input may be fasta or fastq, compressed or uncompressed.
Parameters
Parameters are organized by their function in the neural network filtering process. The tool provides four scoring modes (single, average, max, min) and binary filtering via configurable cutoff thresholds for sequence classification.
Standard Parameters
- in=<file>
- Input sequences. Accepts fasta or fastq format, compressed or uncompressed.
- out=<file>
- Sequences passing the filter. Output format matches input format.
- outu=<file>
- Sequences failing the filter. Alternative output for rejected sequences.
- net=<file>
- Network file to apply to the sequences. Must be a trained neural network in BBNet format.
- hist=<file>
- Histogram of scores (x100, so 0-1 maps to 0-100). Useful for analyzing score distributions.
- overwrite=f
- (ow) Set to false to force the program to abort rather than overwrite an existing file. Default: f
Processing Parameters
- rcomp=f
- Use the max score of a sequence and its reverse complement. Useful for double-stranded analysis. Default: f
- parse=f
- Parse sequence headers for 'result=' to determine whether they are positive or negative examples. Positive examples should be annotated with result=1, and negative with result=0. Default: f
- annotate=f
- Annotate output sequences by appending '\tscore=X.XXXX' to their headers via String.format("\tscore=%.4f", score). Provides traceability of neural network scores in downstream analysis. Applied after all scoring calculations (single/average/max/min modes, reverse complement, pair mode). Default: f
- filter=t
- Retain only reads above or below a cutoff. Setting the cutoff or highpass flag will automatically set this to true. Default: t
- cutoff=auto
- Score cutoff for filtering; neural network scores typically range from 0 to 1. Setting cutoff=auto uses net.cutoff embedded in the BBNet file. Custom values override embedded cutoffs. Filtering logic: pass = (score>=cutoff)==highpass, where scores are compared after all processing (pair mode, reverse complement, etc.). Default: auto
- highpass=t
- Retain sequences ABOVE cutoff if true, else BELOW cutoff. Default: t
- scoremode=single
- Scoring strategy for sequences:
- single (default): Apply scoreSingle() once per sequence using the first W bases (where W = network width). Fastest method for sequences longer than network input size.
- average: Apply scoreFrames() with sliding windows every 'stepsize' bases, return arithmetic mean of all window scores. Computationally intensive.
- max: Like average but return maximum window score using Tools.max(). Useful for identifying best-matching regions.
- min: Like average but return minimum window score using Tools.min(). Conservative classification requiring all regions to meet criteria.
- pairmode=average
- Scoring strategy for paired reads:
- average (default): Calculate (score1+score2)*0.5f for paired reads.
- max: Use Tools.max(score1, score2) - higher of the two scores.
- min: Use Tools.min(score1, score2) - lower of the two scores.
- stepsize=1
- If scoremode is anything other than 'single', score a window every this many bases using the scoreFrames() method. The window width is defined by the network input dimensions. Higher stepsize values reduce computational load but may miss features between windows. Calculation: start+=stepsize, stop+=stepsize for each iteration. Default: 1
- overlap=
- Alternative to stepsize specification; if either flag is used it will override the other. Controls window overlap in sliding window scoring modes. Implementation: stepsize = width - overlap. Setting overlap=0 is equivalent to stepsize=W (network width), meaning no overlap between consecutive windows. Higher overlap values provide finer-grained scoring at computational cost.
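The cutoff/highpass decision and the stepsize/overlap window arithmetic described above can be sketched as follows. This is a minimal Python illustration of the documented logic; both function names are hypothetical and not part of NetFilter:

```python
def passes_filter(score, cutoff, highpass=True):
    """Keep a read when (score >= cutoff) == highpass: with
    highpass=True, high scorers pass; with highpass=False the
    decision is inverted and low scorers pass instead."""
    return (score >= cutoff) == highpass

def window_starts(seq_len, width, stepsize=None, overlap=None):
    """Start positions of the scoring windows. overlap is converted
    via stepsize = width - overlap, so overlap=0 yields
    non-overlapping windows (stepsize equal to the network width)."""
    if overlap is not None:
        stepsize = width - overlap
    starts, start = [], 0
    while start + width <= seq_len:
        starts.append(start)
        start += stepsize
    return starts

# A 0.8-scoring read passes a 0.5 highpass filter but fails the
# equivalent lowpass filter:
assert passes_filter(0.8, 0.5) and not passes_filter(0.8, 0.5, highpass=False)
# overlap=0 tiles the sequence with non-overlapping windows:
assert window_starts(100, 20, overlap=0) == [0, 20, 40, 60, 80]
# stepsize=1 (the default) scores a window at every position:
assert window_starts(25, 20, stepsize=1) == [0, 1, 2, 3, 4, 5]
```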
Java Parameters
- -Xmx
- This will set Java's memory usage, overriding autodetection. -Xmx20g will specify 20 gigs of RAM, and -Xmx200m will specify 200 megs. The max is typically 85% of physical memory.
- -eoom
- This flag will cause the process to exit if an out-of-memory exception occurs. Requires Java 8u92+.
- -da
- Disable assertions. Can improve performance in production environments.
Examples
Basic Sequence Classification
netfilter.sh in=reads.fq out=classified.fq outu=unclassified.fq net=model.bbnet
Classify sequences using a trained neural network, separating passing sequences from failing ones.
Scoring with Reverse Complement Analysis
netfilter.sh in=sequences.fa out=positive.fa outu=negative.fa net=classifier.bbnet rcomp=t annotate=t
Score sequences considering both forward and reverse complement orientations, annotating output with scores.
Sliding Window Scoring
netfilter.sh in=long_reads.fq out=filtered.fq net=window_model.bbnet scoremode=average stepsize=50
Apply sliding window scoring with 50-base steps, using average scores across windows for classification.
Custom Cutoff with Statistics
netfilter.sh in=test_data.fq out=pass.fq outu=fail.fq net=model.bbnet cutoff=0.7 hist=score_distribution.txt parse=t
Use custom score cutoff of 0.7, parse sequence headers for ground truth labels, and generate score histogram.
Algorithm Details
Neural Network Processing
NetFilter implements multithreaded neural network inference using Shared.threads() for thread count determination. Each worker thread maintains its own copy of the neural network (CellNet.copy()) to avoid synchronization overhead. The system uses ConcurrentReadInputStream and ConcurrentReadOutputStream with configurable buffer sizes (8-128 based on thread count and ordering requirements). Default memory allocation is 2GB with scaling via freeRam() calculation (85% of physical memory maximum).
Scoring Modes
The tool supports four distinct scoring modes:
- Single Mode: Applies the network once per sequence using the first W bases (where W is the network input width). This is the fastest method for sequences longer than the network input size.
- Average Mode: Applies sliding window scoring across the entire sequence with configurable step sizes. The final score is the arithmetic mean of all window scores.
- Maximum Mode: Like average mode but retains the highest scoring window as the final score. Useful for identifying the best-matching region.
- Minimum Mode: Like average mode but uses the lowest scoring window. Useful for conservative classification requiring all regions to meet criteria.
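The three sliding-window modes reduce to different aggregations over the per-window scores. A minimal sketch of that reduction (the helper name is hypothetical; 'single' is handled upstream by scoring only the first window):

```python
def combine_window_scores(scores, mode="average"):
    """Combine per-window scores according to the scoring mode.
    Mode names mirror the scoremode flag described above."""
    if mode == "average":
        return sum(scores) / len(scores)   # arithmetic mean
    if mode == "max":
        return max(scores)                 # best-matching region
    if mode == "min":
        return min(scores)                 # all regions must score well
    raise ValueError("unknown mode: " + mode)

scores = [0.2, 0.9, 0.5]
assert abs(combine_window_scores(scores, "average") - 0.5333333) < 1e-6
assert combine_window_scores(scores, "max") == 0.9
assert combine_window_scores(scores, "min") == 0.2
```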
Reverse Complement Processing
When rcomp=t, the algorithm processes both the forward sequence and its reverse complement, taking the maximum score between orientations. The implementation uses an efficient in-place algorithm:
- Score the original sequence using scoreSingle() or scoreFrames()
- In-place reverse complement using AminoAcid.reverseComplementBasesInPlace() - no memory allocation
- Score the reverse complement with the same scoring method
- Restore original sequence orientation with another in-place reverse complement
- Return Tools.max(forward_score, reverse_score)
This approach avoids creating new byte arrays, minimizing memory overhead and garbage collection pressure during processing.
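The score-rcomp-score-restore sequence above can be sketched in Python. This is an illustrative analogue, not the Java implementation; the toy scorer stands in for the network call:

```python
# Complement table for in-place reverse complementing.
COMP = {ord(a): ord(b) for a, b in zip("ACGTacgt", "TGCAtgca")}

def reverse_complement_in_place(bases):
    """Reverse-complement a bytearray of bases without allocating
    a new array, analogous to the in-place step described above."""
    bases.reverse()
    for i, b in enumerate(bases):
        bases[i] = COMP.get(b, b)

def score_with_rcomp(bases, score_fn):
    """Score both orientations and return the max, restoring the
    original orientation afterward. score_fn is a stand-in for the
    network's scoring call."""
    fwd = score_fn(bytes(bases))
    reverse_complement_in_place(bases)
    rev = score_fn(bytes(bases))
    reverse_complement_in_place(bases)   # restore original orientation
    return max(fwd, rev)

seq = bytearray(b"ACGTT")
# Toy scorer: fraction of A bases in the sequence.
score = score_with_rcomp(seq, lambda s: s.count(b"A") / len(s))
assert seq == bytearray(b"ACGTT")   # orientation restored
assert score == 0.4                 # the rcomp AACGT has two A's
```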
Paired-End Handling
For paired-end reads, the tool provides three combination strategies:
- Average: Takes the mean of read 1 and read 2 scores
- Maximum: Uses the higher score between the two reads
- Minimum: Uses the lower score, requiring both reads to pass
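The three pair-combination strategies map directly onto simple arithmetic, matching the pairmode flag documented above (the helper name is hypothetical):

```python
def combine_pair(score1, score2, pairmode="average"):
    """Combine read-1/read-2 scores per the pairmode flag."""
    if pairmode == "average":
        return (score1 + score2) * 0.5   # mean of the two scores
    if pairmode == "max":
        return max(score1, score2)       # higher score wins
    if pairmode == "min":
        return min(score1, score2)       # both reads must score well
    raise ValueError("unknown pairmode: " + pairmode)

assert abs(combine_pair(0.9, 0.3) - 0.6) < 1e-9
assert combine_pair(0.9, 0.3, "max") == 0.9
# min is the conservative choice: the pair is only as good as its
# weaker read.
assert combine_pair(0.9, 0.3, "min") == 0.3
```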
Sequence-to-Vector Conversion
The tool uses SequenceToVector.fillVector() to convert DNA sequences into floating-point input vectors for neural network processing. The conversion process operates on raw byte arrays and supports both full-sequence and windowed processing:
- Vector Size: Determined by (net.numInputs() - 4) / 4, typically accommodating k-mer or positional encoding
- Windowed Processing: For sliding window modes, fillVector() is called with start/stop positions
- Network Width: Automatically detected from neural network input dimensions
- K-mer Support: Configurable k-mer size for sequence encoding (default varies by network)
Each thread maintains its own float[] vector to avoid synchronization during parallel processing.
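The (numInputs - 4) / 4 sizing suggests roughly four input floats per base plus a few extra features. A plausible one-hot sketch of such a conversion; this is a hypothetical illustration, and the real SequenceToVector.fillVector() encoding may differ:

```python
BASE_INDEX = {"A": 0, "C": 1, "G": 2, "T": 3}

def fill_vector(seq, width):
    """One-hot encode the first `width` bases, 4 floats per base.
    Hypothetical sketch of sequence-to-vector conversion; ambiguous
    bases (e.g. N) are left as all zeros."""
    vec = [0.0] * (4 * width)
    for i, base in enumerate(seq[:width]):
        idx = BASE_INDEX.get(base.upper())
        if idx is not None:
            vec[4 * i + idx] = 1.0
    return vec

v = fill_vector("ACN", 3)
assert v[0:4] == [1.0, 0.0, 0.0, 0.0]    # A
assert v[4:8] == [0.0, 1.0, 0.0, 0.0]    # C
assert v[8:12] == [0.0, 0.0, 0.0, 0.0]   # N left as zeros
```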
Performance Characteristics
- Memory Usage: Default 2GB (-Xmx2g -Xms2g) with automatic scaling via freeRam() calculation (85% of physical memory maximum)
- Threading: Automatic thread scaling with Shared.threads(), each thread maintaining independent CellNet copy and float[] vector
- I/O Optimization: Uses ByteFile.FORCE_MODE_BF2 for >2 threads, ConcurrentRead streams with configurable buffer sizes (8-128 based on thread count)
- Read Validation: Disabled in worker threads (Read.VALIDATE_IN_CONSTRUCTOR=false) when threads <4 to increase processing speed
- Garbage Collection: Minimized through in-place operations, reusable vectors, and efficient stream management
Statistical Analysis
NetFilter provides statistical analysis with thread-safe accumulation of metrics via synchronized blocks in the Accumulator pattern:
- Classification Metrics: True/False Positive/Negative counts with automatic percentage calculation
- Score Statistics: Average scores calculated separately for positive/negative examples and pass/fail categories
- Score Histograms: Dual histograms (positive/negative) with 101 bins (0-100) when hist= specified
- Header Parsing: Extracts ground truth from 'result=' annotations in sequence headers for validation
- Thread Accumulation: All statistics synchronized via Accumulator pattern with per-thread counters
Statistical output includes detailed breakdowns: "Average Score: X.XXXX", "True Positive: N (X.XX%)", with all calculations performed at program completion.
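The histogram binning (scores x100 into 101 bins) and the confusion-matrix tally can be sketched as follows; both helper names are hypothetical, illustrating the statistics described above:

```python
def score_histogram(scores, bins=101):
    """Bin scores in [0,1] into 101 counts via round(score*100),
    matching the x100 histogram scaling described above."""
    hist = [0] * bins
    for s in scores:
        hist[min(bins - 1, int(round(s * 100)))] += 1
    return hist

def classification_counts(records):
    """Tally (TP, FP, TN, FN) from (truth, passed) pairs, where
    truth comes from parsed 'result=' header annotations."""
    tp = fp = tn = fn = 0
    for truth, passed in records:
        if truth == 1 and passed:
            tp += 1
        elif truth == 0 and passed:
            fp += 1
        elif truth == 1:
            fn += 1
        else:
            tn += 1
    return tp, fp, tn, fn

hist = score_histogram([0.0, 0.5, 0.504, 1.0])
assert hist[0] == 1 and hist[50] == 2 and hist[100] == 1
assert classification_counts([(1, True), (0, True), (0, False)]) == (1, 1, 1, 0)
```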
Network Compatibility
NetFilter loads neural networks using CellNetParser.load() from BBNet format files. The BBNet format specification includes:
- Embedded Cutoffs: net.cutoff values for automatic thresholding when cutoff=auto
- Input Dimensions: Automatically detected network width via (net.numInputs()-4)/4
- Feed-Forward Architecture: Compatible with CellNet.feedForward() processing
- Copy Support: Networks support efficient copying (net.copy(false)) for multithreaded processing
- Vector Input: Designed for SequenceToVector.fillVector() input processing
Network loading includes validation: assert(net!=null) ensures successful parsing, and dimension consistency is verified against specified width parameters.
Thread Safety and Synchronization
NetFilter implements thread safety through ReentrantReadWriteLock synchronization and per-thread resource isolation:
- Read-Write Locks: Uses ReentrantReadWriteLock for shared resource access
- Accumulator Pattern: Thread-safe statistics accumulation with synchronized blocks
- Per-Thread Resources: Each ProcessThread maintains independent CellNet copy and float[] vector
- Stream Management: ConcurrentReadInputStream/OutputStream handle thread coordination
- ThreadWaiter: Manages thread lifecycle with success/failure tracking
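The per-thread isolation plus end-of-run merge can be illustrated with a small Python threading sketch. This mirrors the accumulator pattern described above in spirit only; the class and function names are hypothetical, not the Java implementation:

```python
import threading

class Stats:
    """Per-thread counters, merged once at program completion so the
    hot path needs no locking."""
    def __init__(self):
        self.reads = 0
        self.passed = 0
    def accumulate(self, other):
        self.reads += other.reads
        self.passed += other.passed

def worker(scores, cutoff, local):
    # Each thread touches only its own Stats object.
    for s in scores:
        local.reads += 1
        if s >= cutoff:
            local.passed += 1

totals = Stats()
locals_ = [Stats() for _ in range(4)]
chunks = [[0.1, 0.9], [0.8], [0.2, 0.7], [0.95]]
threads = [threading.Thread(target=worker, args=(c, 0.5, l))
           for c, l in zip(chunks, locals_)]
for t in threads: t.start()
for t in threads: t.join()
for l in locals_:   # single merge step after all workers finish
    totals.accumulate(l)
assert totals.reads == 6 and totals.passed == 4
```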
Memory Management
The tool implements several memory optimization strategies:
- Automatic Memory Scaling: Uses calcmem.sh and freeRam() to determine optimal -Xmx values
- In-Place Operations: AminoAcid.reverseComplementBasesInPlace() avoids array allocation
- Buffer Management: Configurable buffer sizes (8-128) based on thread count and ordering requirements
- Vector Reuse: Each thread reuses the same float[] vector across all sequences
- Stream Optimization: ByteFile.FORCE_MODE_BF2 for efficient I/O when >2 threads
Support
For questions and support:
- Email: bbushnell@lbl.gov
- Documentation: bbmap.org