FilterBySequence

Basic Usage

filterbysequence.sh in=<file> out=<file> ref=<file> include=<t/f>

FilterBySequence filters sequences based on exact or approximate matches to reference sequences. It can operate in two modes: inclusion (keeping matching sequences) or exclusion (removing matching sequences). The tool supports both reference files and literal sequences as filtering criteria.

Parameters

Parameters are organized by their functional role in the filtering process. Each parameter controls different aspects of sequence matching and processing behavior.

I/O Parameters

in=: Primary input file. Use 'in2' to specify a second file for paired-end data. Can accept FASTA or FASTQ formats, including compressed files.
out=: Primary output file. Use 'out2' to specify a second file for paired-end data. Output format matches input format.
ref=: Reference file containing sequences to match against. Can be a single file or comma-delimited list of files. Accepts FASTA or FASTQ formats.
literal=: Literal sequence or comma-delimited list of sequences to match against. Use this for small numbers of specific sequences instead of reference files.
ow=t: (overwrite) Overwrites files that already exist. Set to false to prevent accidental overwriting of existing output files.

Processing Parameters

include=f: Set to 'true' to include (keep) matching sequences rather than excluding (removing) them. When false, matching sequences are filtered out.
rcomp=t: Match reverse complements as well as forward sequences. Essential for DNA sequence filtering where orientation may vary.
case=f: (casesensitive) Require matching case for sequence comparison. When false, case is ignored for matching. Requires storebases=true.
storebases=t: (sb) Store reference bases in memory. When true, Code constructor stores byte[] bases for Tools.equals() comparison. When false, bases=null and equals() relies on 64-bit hash pair comparison (a,b values from Dedupe.hash() methods).
threads=auto: (t) Number of worker threads for processing. Auto-detection uses Shared.threads() to create ProcessThread and LoadThread instances for concurrent read processing via ConcurrentReadInputStream.
subs=0: Maximum number of substitutions allowed in approximate matching. Higher values increase sensitivity but drastically reduce speed.
mf=0.0: (mismatchfraction) Maximum fraction of bases that can mismatch. The actual number allowed is max(subs, mf*min(query.length, ref.length)). Range: 0.0-1.0.
lengthdif=0: Maximum allowed length difference between query and reference sequences. Allows matching of sequences with different lengths within this tolerance.

Java Parameters

-Xmx: Sets Java's memory usage, overriding autodetection. Use format like -Xmx20g for 20 GB or -Xmx200m for 200 MB. Maximum is typically 85% of physical memory.
-eoom: Exit process if an out-of-memory exception occurs. Requires Java 8u92+. Useful for preventing hung processes in memory-constrained environments.
-da: Disable Java assertions. May provide slight performance improvement in production environments.

Examples

Basic Sequence Filtering

# Remove contaminating sequences from reads
filterbysequence.sh in=reads.fq out=clean.fq ref=contaminants.fa include=f

# Keep only sequences matching specific targets
filterbysequence.sh in=reads.fq out=targets.fq ref=targets.fa include=t

Basic filtering operations showing exclusion (contamination removal) and inclusion (target enrichment).

Literal Sequence Filtering

# Filter using specific sequences
filterbysequence.sh in=reads.fq out=filtered.fq literal=ATGCGTACGT,GCTAGCTAGC include=f

# Remove adapter sequences
filterbysequence.sh in=reads.fq out=clean.fq \
    literal=AGATCGGAAGAGC,CTGTCTCTTATACACATCT include=f

Using literal sequences for targeted filtering without needing reference files.

Approximate Matching

# Allow up to 2 mismatches
filterbysequence.sh in=reads.fq out=filtered.fq ref=targets.fa subs=2 include=t

# Allow 5% mismatch rate
filterbysequence.sh in=reads.fq out=filtered.fq ref=targets.fa mf=0.05 include=t

# Flexible length matching
filterbysequence.sh in=reads.fq out=filtered.fq ref=targets.fa lengthdif=10 include=t

Approximate matching examples. Note: these modes are significantly slower than exact matching.

Memory-Optimized Filtering

# Probabilistic matching for large reference sets
filterbysequence.sh in=reads.fq out=filtered.fq ref=large_db.fa \
    storebases=f case=f include=f

# High-memory exact matching
filterbysequence.sh in=reads.fq out=filtered.fq ref=targets.fa \
    storebases=t case=t include=t -Xmx32g

Memory usage optimization strategies for different scenarios.

Algorithm Details

Exact Matching Strategy

FilterBySequence implements a two-tier hashing strategy for exact sequence matching using the Code class constructor:

Dual Hash System: Code class computes forward hash via Dedupe.hash(bases_) and reverse hash via Dedupe.hashReversed(bases_), storing a=Tools.max(fwd,rev) and b=Tools.min(fwd,rev) when rcomp=true
64-bit Hash Pairs: Each sequence represented by two long values (a,b) for both orientations, with hashCode() method returning (int)(a&0x7FFFFFFF)
Memory Optimization: When storebases=false, bases field set to null and equals() method relies solely on hash comparison (a!=c.a || b!=c.b)
Case Handling: Code constructor applies Tools.toUpperCase(bases[i]) transformation when toUpperCase=true (default case-insensitive mode)

Reference Loading Process

Reference sequences are loaded using multithreaded processing via LoadThread instances:

Concurrent Loading: Multiple LoadThread instances (equal to Shared.threads() count) process reference files via ConcurrentReadInputStream
Batch Processing: LoadThread accumulates Code objects in LinkedHashSet<Code> buffer, flushing to shared refSet when codes.size()>2000
Memory Management: synchronized(refSet) blocks prevent race conditions during addToRef() calls from multiple LoadThreads
Deduplication: HashSet<Code> refSet naturally eliminates duplicates via Code.equals() method comparing hash pairs and bases arrays

Filtering Algorithm

The core filtering logic operates through ProcessThread instances executing contains() method chain:

Hash Lookup: contains() method first calls refSet.contains(c) where c=new Code(r.bases) for O(1) HashSet lookup
Paired-End Support: containsBoth() requires both match1=contains(r1) && match2=contains(r2) to return true
Reverse Complement: Code constructor automatically handles both orientations when rcomp=true via Tools.max(fwd,rev) selection
Empty Sequence Handling: contains() method explicitly returns true when r.bases==null || r.bases.length<1

Approximate Matching (Slow Mode)

When maxSubs>0 || mismatchFraction>0 || maxLengthDif>0, contains() calls bruteForce() method:

Linear Search: bruteForce() calls compare(bases) which iterates through ArrayList<byte[]> refList
Length Filtering: compare() pre-filters using minlen=bases.length-maxLengthDif, maxlen=bases.length+maxLengthDif bounds
Sliding Window: compare(byte[] a, byte[] b) slides shorter sequence along longer using aStartMax=a.length-b.length loop
Mismatch Counting: Inner loop counts subs+=(a[apos]==b[bpos] ? 0 : 1), breaking when subs>maxSubs0

Performance Characteristics

Exact Mode: O(1) lookup via HashSet.contains() using Code.hashCode() returning (int)(a&0x7FFFFFFF)
Memory Usage: Each Code object stores 2 long values (16 bytes) plus optional byte[] bases; storebases=false eliminates bases storage
Thread Scalability: Shared.threads() ProcessThread instances process reads concurrently via ConcurrentReadInputStream
Approximate Mode: O(n*m*k) where bruteForce() compares n queries against m refList entries with k-length sequence comparisons

Code Structure

The implementation uses several key classes with specific responsibilities:

Code Class: Inner class with constructor Code(byte[] bases_) that computes hash pairs (a,b) via Dedupe.hash() methods and optionally stores bases array
ProcessThread: Extends Thread, implements processReadPair() calling containsBoth() which uses refSet.contains() for filtering
LoadThread: Extends Thread, accumulates sequences in LinkedHashSet<Code> buffer and calls addToRef() to populate shared refSet
Synchronized Collections: HashSet<Code> refSet for O(1) exact lookups, ArrayList<byte[]> refList for linear approximate matching

Performance Considerations

Speed Optimization

Use exact matching (subs=0, mf=0.0) to leverage HashSet.contains() O(1) lookup performance
Set storebases=false for large reference sets to eliminate byte[] bases storage in Code objects
Use literal= for small sequences to avoid FileFormat.testInput() and ConcurrentReadInputStream overhead
Consider case=false to avoid Tools.toUpperCase() transformation in Code constructor

Memory Management

Default z="-Xmx800m" allocation from calcXmx() function works for most datasets
Increase -Xmx when HashSet<Code> refSet plus ArrayList<byte[]> refList exceed available memory
Monitor memory when Code objects store both hash pairs (16 bytes) and bases arrays (sequence length bytes)
Use storebases=false to reduce Code memory footprint by eliminating bases storage

Approximate Matching Caveats

Enabling subs>0 or mf>0 switches from O(1) HashSet lookup to O(n*m) bruteForce() linear search
Runtime increases with reference set size due to ArrayList<byte[]> refList iteration in compare() method
Consider refSet.contains() filtering before bruteForce() to eliminate obvious non-matches
Use restrictive maxSubs and mismatchFraction to minimize sliding window iterations in compare(byte[] a, byte[] b)

Support

For questions and support:

Email: bbushnell@lbl.gov
Documentation: bbmap.org