NetFilter
Scores sequences with a trained neural network, using multithreaded processing for fast classification. Provides configurable filtering modes (highpass/lowpass cutoffs), reverse-complement scoring, and accumulation of statistical metrics (TPR, FPR, score distributions) for biological sequence screening.
Basic Usage
netfilter.sh in=<sequences> out=<pass> outu=<fail> net=<net file>
Input may be fasta or fastq, compressed or uncompressed.
Parameters
Parameters are organized by their function in the neural network filtering process. The tool provides four scoring modes (single, average, max, min) and binary filtering via configurable cutoff thresholds for sequence classification.
Standard Parameters
- in=<file>
- Input sequences. Accepts fasta or fastq format, compressed or uncompressed.
- out=<file>
- Sequences passing the filter. Output format matches input format.
- outu=<file>
- Sequences failing the filter. Alternative output for rejected sequences.
- net=<file>
- Network file to apply to the sequences. Must be a trained neural network in BBNet format.
- hist=<file>
- Histogram of scores (x100, so 0-1 maps to 0-100). Useful for analyzing score distributions.
- overwrite=f
- (ow) Set to false to force the program to abort rather than overwrite an existing file. Default: f
Processing Parameters
- rcomp=f
- Use the max score of a sequence and its reverse complement. Useful for double-stranded analysis. Default: f
- parse=f
- Parse sequence headers for 'result=' to determine whether they are positive or negative examples. Positive examples should be annotated with result=1, and negative with result=0. Default: f
- annotate=f
- Annotate output sequences by appending '\tscore=X.XXXX' to their headers via String.format("\tscore=%.4f", score). Provides traceability of neural network scores in downstream analysis. Applied after all scoring calculations (single/average/max/min modes, reverse complement, pair mode). Default: f
- filter=t
- Retain only reads above or below a cutoff. Setting the cutoff or highpass flag will automatically set this to true. Default: t
- cutoff=auto
- Score cutoff for filtering; neural network scores typically range from 0 to 1. Setting cutoff=auto uses net.cutoff embedded in the BBNet file. Custom values override embedded cutoffs. Filtering logic: pass = (score>=cutoff)==highpass, where scores are compared after all processing (pair mode, reverse complement, etc.). Default: auto
- highpass=t
- Retain sequences ABOVE cutoff if true, else BELOW cutoff. Default: t
- scoremode=single
- Scoring strategy for sequences:
- single (default): Apply scoreSingle() once per sequence using the first W bases (where W = network width). Fastest method for sequences longer than network input size.
- average: Apply scoreFrames() with sliding windows every 'stepsize' bases, return arithmetic mean of all window scores. Computationally intensive.
- max: Like average but return maximum window score using Tools.max(). Useful for identifying best-matching regions.
- min: Like average but return minimum window score using Tools.min(). Conservative classification requiring all regions to meet criteria.
- pairmode=average
- Scoring strategy for paired reads:
- average (default): Calculate (score1+score2)*0.5f for paired reads.
- max: Use Tools.max(score1, score2) - higher of the two scores.
- min: Use Tools.min(score1, score2) - lower of the two scores.
- stepsize=1
- If scoremode is anything other than 'single', score a window every this many bases using the scoreFrames() method. The window width is defined by the network input dimensions. Higher stepsize values reduce computational load but may miss features between windows. Calculation: start+=stepsize, stop+=stepsize for each iteration. Default: 1
- overlap=
- Alternative to stepsize specification; if either flag is used it will override the other. Controls window overlap in sliding window scoring modes. Implementation: stepsize = width - overlap. Setting overlap=0 is equivalent to stepsize=W (network width), meaning no overlap between consecutive windows. Higher overlap values provide finer-grained scoring at computational cost.
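The cutoff/highpass decision and the stepsize/overlap window arithmetic described above can be sketched as follows. This is a minimal Python illustration of the documented logic; both function names are hypothetical and not part of NetFilter:

```python
def passes_filter(score, cutoff, highpass=True):
    """Keep a read when (score >= cutoff) == highpass: with
    highpass=True, high scorers pass; with highpass=False the
    decision is inverted and low scorers pass instead."""
    return (score >= cutoff) == highpass

def window_starts(seq_len, width, stepsize=None, overlap=None):
    """Start positions of the scoring windows. overlap is converted
    via stepsize = width - overlap, so overlap=0 yields
    non-overlapping windows (stepsize equal to the network width)."""
    if overlap is not None:
        stepsize = width - overlap
    starts, start = [], 0
    while start + width <= seq_len:
        starts.append(start)
        start += stepsize
    return starts

# A 0.8-scoring read passes a 0.5 highpass filter but fails the
# equivalent lowpass filter:
assert passes_filter(0.8, 0.5) and not passes_filter(0.8, 0.5, highpass=False)
# overlap=0 tiles the sequence with non-overlapping windows:
assert window_starts(100, 20, overlap=0) == [0, 20, 40, 60, 80]
# stepsize=1 (the default) scores a window at every position:
assert window_starts(25, 20, stepsize=1) == [0, 1, 2, 3, 4, 5]
```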
Java Parameters
- -Xmx
- This will set Java's memory usage, overriding autodetection. -Xmx20g will specify 20 gigs of RAM, and -Xmx200m will specify 200 megs. The max is typically 85% of physical memory.
- -eoom
- This flag will cause the process to exit if an out-of-memory exception occurs. Requires Java 8u92+.
- -da
- Disable assertions. Can improve performance in production environments.
Examples
Basic Sequence Classification
netfilter.sh in=reads.fq out=classified.fq outu=unclassified.fq net=model.bbnet
Classify sequences using a trained neural network, separating passing sequences from failing ones.
Scoring with Reverse Complement Analysis
netfilter.sh in=sequences.fa out=positive.fa outu=negative.fa net=classifier.bbnet rcomp=t annotate=t
Score sequences considering both forward and reverse complement orientations, annotating output with scores.
Sliding Window Scoring
netfilter.sh in=long_reads.fq out=filtered.fq net=window_model.bbnet scoremode=average stepsize=50
Apply sliding window scoring with 50-base steps, using average scores across windows for classification.
Custom Cutoff with Statistics
netfilter.sh in=test_data.fq out=pass.fq outu=fail.fq net=model.bbnet cutoff=0.7 hist=score_distribution.txt parse=t
Use custom score cutoff of 0.7, parse sequence headers for ground truth labels, and generate score histogram.
Algorithm Details
Neural Network Processing
NetFilter implements multithreaded neural network inference using Shared.threads() for thread count determination. Each worker thread maintains its own copy of the neural network (CellNet.copy()) to avoid synchronization overhead. The system uses ConcurrentReadInputStream and ConcurrentReadOutputStream with configurable buffer sizes (8-128 based on thread count and ordering requirements). Default memory allocation is 2GB with scaling via freeRam() calculation (85% of physical memory maximum).
Scoring Modes
The tool supports four distinct scoring modes:
- Single Mode: Applies the network once per sequence using the first W bases (where W is the network input width). This is the fastest method for sequences longer than the network input size.
- Average Mode: Applies sliding window scoring across the entire sequence with configurable step sizes. The final score is the arithmetic mean of all window scores.
- Maximum Mode: Like average mode but retains the highest scoring window as the final score. Useful for identifying the best-matching region.
- Minimum Mode: Like average mode but uses the lowest scoring window. Useful for conservative classification requiring all regions to meet criteria.
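The three sliding-window modes reduce to different aggregations over the per-window scores. A minimal sketch of that reduction (the helper name is hypothetical; 'single' is handled upstream by scoring only the first window):

```python
def combine_window_scores(scores, mode="average"):
    """Combine per-window scores according to the scoring mode.
    Mode names mirror the scoremode flag described above."""
    if mode == "average":
        return sum(scores) / len(scores)   # arithmetic mean
    if mode == "max":
        return max(scores)                 # best-matching region
    if mode == "min":
        return min(scores)                 # all regions must score well
    raise ValueError("unknown mode: " + mode)

scores = [0.2, 0.9, 0.5]
assert abs(combine_window_scores(scores, "average") - 0.5333333) < 1e-6
assert combine_window_scores(scores, "max") == 0.9
assert combine_window_scores(scores, "min") == 0.2
```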
Reverse Complement Processing
When rcomp=t, the algorithm processes both the forward sequence and its reverse complement, taking the maximum score between orientations. The implementation uses an efficient in-place algorithm:
- Score the original sequence using scoreSingle() or scoreFrames()
- In-place reverse complement using AminoAcid.reverseComplementBasesInPlace() - no memory allocation
- Score the reverse complement with the same scoring method
- Restore original sequence orientation with another in-place reverse complement
- Return Tools.max(forward_score, reverse_score)
This approach avoids creating new byte arrays, minimizing memory overhead and garbage collection pressure during processing.
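The score-rcomp-score-restore sequence above can be sketched in Python. This is an illustrative analogue, not the Java implementation; the toy scorer stands in for the network call:

```python
# Complement table for in-place reverse complementing.
COMP = {ord(a): ord(b) for a, b in zip("ACGTacgt", "TGCAtgca")}

def reverse_complement_in_place(bases):
    """Reverse-complement a bytearray of bases without allocating
    a new array, analogous to the in-place step described above."""
    bases.reverse()
    for i, b in enumerate(bases):
        bases[i] = COMP.get(b, b)

def score_with_rcomp(bases, score_fn):
    """Score both orientations and return the max, restoring the
    original orientation afterward. score_fn is a stand-in for the
    network's scoring call."""
    fwd = score_fn(bytes(bases))
    reverse_complement_in_place(bases)
    rev = score_fn(bytes(bases))
    reverse_complement_in_place(bases)   # restore original orientation
    return max(fwd, rev)

seq = bytearray(b"ACGTT")
# Toy scorer: fraction of A bases in the sequence.
score = score_with_rcomp(seq, lambda s: s.count(b"A") / len(s))
assert seq == bytearray(b"ACGTT")   # orientation restored
assert score == 0.4                 # the rcomp AACGT has two A's
```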
Paired-End Handling
For paired-end reads, the tool provides three combination strategies:
- Average: Takes the mean of read 1 and read 2 scores
- Maximum: Uses the higher score between the two reads
- Minimum: Uses the lower score, requiring both reads to pass
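The three pair-combination strategies map directly onto simple arithmetic, matching the pairmode flag documented above (the helper name is hypothetical):

```python
def combine_pair(score1, score2, pairmode="average"):
    """Combine read-1/read-2 scores per the pairmode flag."""
    if pairmode == "average":
        return (score1 + score2) * 0.5   # mean of the two scores
    if pairmode == "max":
        return max(score1, score2)       # higher score wins
    if pairmode == "min":
        return min(score1, score2)       # both reads must score well
    raise ValueError("unknown pairmode: " + pairmode)

assert abs(combine_pair(0.9, 0.3) - 0.6) < 1e-9
assert combine_pair(0.9, 0.3, "max") == 0.9
# min is the conservative choice: the pair is only as good as its
# weaker read.
assert combine_pair(0.9, 0.3, "min") == 0.3
```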
Sequence-to-Vector Conversion
The tool uses SequenceToVector.fillVector() to convert DNA sequences into floating-point input vectors for neural network processing. The conversion process operates on raw byte arrays and supports both full-sequence and windowed processing:
- Vector Size: Determined by (net.numInputs() - 4) / 4, typically accommodating k-mer or positional encoding
- Windowed Processing: For sliding window modes, fillVector() is called with start/stop positions
- Network Width: Automatically detected from neural network input dimensions
- K-mer Support: Configurable k-mer size for sequence encoding (default varies by network)
Each thread maintains its own float[] vector to avoid synchronization during parallel processing.
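The (numInputs - 4) / 4 sizing suggests roughly four input floats per base plus a few extra features. A plausible one-hot sketch of such a conversion; this is a hypothetical illustration, and the real SequenceToVector.fillVector() encoding may differ:

```python
BASE_INDEX = {"A": 0, "C": 1, "G": 2, "T": 3}

def fill_vector(seq, width):
    """One-hot encode the first `width` bases, 4 floats per base.
    Hypothetical sketch of sequence-to-vector conversion; ambiguous
    bases (e.g. N) are left as all zeros."""
    vec = [0.0] * (4 * width)
    for i, base in enumerate(seq[:width]):
        idx = BASE_INDEX.get(base.upper())
        if idx is not None:
            vec[4 * i + idx] = 1.0
    return vec

v = fill_vector("ACN", 3)
assert v[0:4] == [1.0, 0.0, 0.0, 0.0]    # A
assert v[4:8] == [0.0, 1.0, 0.0, 0.0]    # C
assert v[8:12] == [0.0, 0.0, 0.0, 0.0]   # N left as zeros
```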
Performance Characteristics
- Memory Usage: Default 2GB (-Xmx2g -Xms2g) with automatic scaling via freeRam() calculation (85% of physical memory maximum)
- Threading: Automatic thread scaling with Shared.threads(), each thread maintaining independent CellNet copy and float[] vector
- I/O Optimization: Uses ByteFile.FORCE_MODE_BF2 for >2 threads, ConcurrentRead streams with configurable buffer sizes (8-128 based on thread count)
- Read Validation: Disabled in worker threads (Read.VALIDATE_IN_CONSTRUCTOR=false) when threads <4 to increase processing speed
- Garbage Collection: Minimized through in-place operations, reusable vectors, and efficient stream management
Statistical Analysis
NetFilter provides statistical analysis with thread-safe accumulation of metrics via synchronized blocks in the Accumulator pattern:
- Classification Metrics: True/False Positive/Negative counts with automatic percentage calculation
- Score Statistics: Average scores calculated separately for positive/negative examples and pass/fail categories
- Score Histograms: Dual histograms (positive/negative) with 101 bins (0-100) when hist= specified
- Header Parsing: Extracts ground truth from 'result=' annotations in sequence headers for validation
- Thread Accumulation: All statistics synchronized via Accumulator pattern with per-thread counters
Statistical output includes detailed breakdowns: "Average Score: X.XXXX", "True Positive: N (X.XX%)", with all calculations performed at program completion.
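The histogram binning (scores x100 into 101 bins) and the confusion-matrix tally can be sketched as follows; both helper names are hypothetical, illustrating the statistics described above:

```python
def score_histogram(scores, bins=101):
    """Bin scores in [0,1] into 101 counts via round(score*100),
    matching the x100 histogram scaling described above."""
    hist = [0] * bins
    for s in scores:
        hist[min(bins - 1, int(round(s * 100)))] += 1
    return hist

def classification_counts(records):
    """Tally (TP, FP, TN, FN) from (truth, passed) pairs, where
    truth comes from parsed 'result=' header annotations."""
    tp = fp = tn = fn = 0
    for truth, passed in records:
        if truth == 1 and passed:
            tp += 1
        elif truth == 0 and passed:
            fp += 1
        elif truth == 1:
            fn += 1
        else:
            tn += 1
    return tp, fp, tn, fn

hist = score_histogram([0.0, 0.5, 0.504, 1.0])
assert hist[0] == 1 and hist[50] == 2 and hist[100] == 1
assert classification_counts([(1, True), (0, True), (0, False)]) == (1, 1, 1, 0)
```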
Network Compatibility
NetFilter loads neural networks using CellNetParser.load() from BBNet format files. The BBNet format specification includes:
- Embedded Cutoffs: net.cutoff values for automatic thresholding when cutoff=auto
- Input Dimensions: Automatically detected network width via (net.numInputs()-4)/4
- Feed-Forward Architecture: Compatible with CellNet.feedForward() processing
- Copy Support: Networks support efficient copying (net.copy(false)) for multithreaded processing
- Vector Input: Designed for SequenceToVector.fillVector() input processing
Network loading includes validation: assert(net!=null) ensures successful parsing, and dimension consistency is verified against specified width parameters.
Thread Safety and Synchronization
NetFilter implements thread safety through ReentrantReadWriteLock synchronization and per-thread resource isolation:
- Read-Write Locks: Uses ReentrantReadWriteLock for shared resource access
- Accumulator Pattern: Thread-safe statistics accumulation with synchronized blocks
- Per-Thread Resources: Each ProcessThread maintains independent CellNet copy and float[] vector
- Stream Management: ConcurrentReadInputStream/OutputStream handle thread coordination
- ThreadWaiter: Manages thread lifecycle with success/failure tracking
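The per-thread isolation plus end-of-run merge can be illustrated with a small Python threading sketch. This mirrors the accumulator pattern described above in spirit only; the class and function names are hypothetical, not the Java implementation:

```python
import threading

class Stats:
    """Per-thread counters, merged once at program completion so the
    hot path needs no locking."""
    def __init__(self):
        self.reads = 0
        self.passed = 0
    def accumulate(self, other):
        self.reads += other.reads
        self.passed += other.passed

def worker(scores, cutoff, local):
    # Each thread touches only its own Stats object.
    for s in scores:
        local.reads += 1
        if s >= cutoff:
            local.passed += 1

totals = Stats()
locals_ = [Stats() for _ in range(4)]
chunks = [[0.1, 0.9], [0.8], [0.2, 0.7], [0.95]]
threads = [threading.Thread(target=worker, args=(c, 0.5, l))
           for c, l in zip(chunks, locals_)]
for t in threads: t.start()
for t in threads: t.join()
for l in locals_:   # single merge step after all workers finish
    totals.accumulate(l)
assert totals.reads == 6 and totals.passed == 4
```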
Memory Management
The tool implements several memory optimization strategies:
- Automatic Memory Scaling: Uses calcmem.sh and freeRam() to determine optimal -Xmx values
- In-Place Operations: AminoAcid.reverseComplementBasesInPlace() avoids array allocation
- Buffer Management: Configurable buffer sizes (8-128) based on thread count and ordering requirements
- Vector Reuse: Each thread reuses the same float[] vector across all sequences
- Stream Optimization: ByteFile.FORCE_MODE_BF2 for efficient I/O when >2 threads
Support
For questions and support:
- Email: bbushnell@lbl.gov
- Documentation: bbmap.org