FilterBySequence
Filters sequences by exact sequence matches. This can also handle inexact matches but is extremely slow in that mode.
Basic Usage
filterbysequence.sh in=<file> out=<file> ref=<file> include=<t/f>
FilterBySequence filters sequences based on exact or approximate matches to reference sequences. It can operate in two modes: inclusion (keeping matching sequences) or exclusion (removing matching sequences). The tool supports both reference files and literal sequences as filtering criteria.
Parameters
Parameters are organized by their functional role in the filtering process. Each parameter controls different aspects of sequence matching and processing behavior.
I/O Parameters
- in=
- Primary input file. Use 'in2' to specify a second file for paired-end data. Can accept FASTA or FASTQ formats, including compressed files.
- out=
- Primary output file. Use 'out2' to specify a second file for paired-end data. Output format matches input format.
- ref=
- Reference file containing sequences to match against. Can be a single file or comma-delimited list of files. Accepts FASTA or FASTQ formats.
- literal=
- Literal sequence or comma-delimited list of sequences to match against. Use this for small numbers of specific sequences instead of reference files.
- ow=t
- (overwrite) Overwrites files that already exist. Set to false to prevent accidental overwriting of existing output files.
Processing Parameters
- include=f
- Set to 'true' to include (keep) matching sequences rather than excluding (removing) them. When false, matching sequences are filtered out.
- rcomp=t
- Match reverse complements as well as forward sequences. Essential for DNA sequence filtering where orientation may vary.
- case=f
- (casesensitive) Require matching case for sequence comparison. When false, case is ignored for matching. Requires storebases=true.
- storebases=t
- (sb) Store reference bases in memory. When true, Code constructor stores byte[] bases for Tools.equals() comparison. When false, bases=null and equals() relies on 64-bit hash pair comparison (a,b values from Dedupe.hash() methods).
- threads=auto
- (t) Number of worker threads for processing. Auto-detection uses Shared.threads() to create ProcessThread and LoadThread instances for concurrent read processing via ConcurrentReadInputStream.
- subs=0
- Maximum number of substitutions allowed in approximate matching. Higher values increase sensitivity but drastically reduce speed.
- mf=0.0
- (mismatchfraction) Maximum fraction of bases that can mismatch. The actual number allowed is max(subs, mf*min(query.length, ref.length)). Range: 0.0-1.0.
- lengthdif=0
- Maximum allowed length difference between query and reference sequences. Allows matching of sequences with different lengths within this tolerance.
Java Parameters
- -Xmx
- Sets Java's memory usage, overriding autodetection. Use format like -Xmx20g for 20 GB or -Xmx200m for 200 MB. Maximum is typically 85% of physical memory.
- -eoom
- Exit process if an out-of-memory exception occurs. Requires Java 8u92+. Useful for preventing hung processes in memory-constrained environments.
- -da
- Disable Java assertions. May provide slight performance improvement in production environments.
Examples
Basic Sequence Filtering
# Remove contaminating sequences from reads
filterbysequence.sh in=reads.fq out=clean.fq ref=contaminants.fa include=f
# Keep only sequences matching specific targets
filterbysequence.sh in=reads.fq out=targets.fq ref=targets.fa include=t
Basic filtering operations showing exclusion (contamination removal) and inclusion (target enrichment).
Literal Sequence Filtering
# Filter using specific sequences
filterbysequence.sh in=reads.fq out=filtered.fq literal=ATGCGTACGT,GCTAGCTAGC include=f
# Remove adapter sequences
filterbysequence.sh in=reads.fq out=clean.fq \
literal=AGATCGGAAGAGC,CTGTCTCTTATACACATCT include=f
Using literal sequences for targeted filtering without needing reference files.
Approximate Matching
# Allow up to 2 mismatches
filterbysequence.sh in=reads.fq out=filtered.fq ref=targets.fa subs=2 include=t
# Allow 5% mismatch rate
filterbysequence.sh in=reads.fq out=filtered.fq ref=targets.fa mf=0.05 include=t
# Flexible length matching
filterbysequence.sh in=reads.fq out=filtered.fq ref=targets.fa lengthdif=10 include=t
Approximate matching examples. Note: these modes are significantly slower than exact matching.
Memory-Optimized Filtering
# Probabilistic matching for large reference sets
filterbysequence.sh in=reads.fq out=filtered.fq ref=large_db.fa \
storebases=f case=f include=f
# High-memory exact matching
filterbysequence.sh in=reads.fq out=filtered.fq ref=targets.fa \
storebases=t case=t include=t -Xmx32g
Memory usage optimization strategies for different scenarios.
Algorithm Details
Exact Matching Strategy
FilterBySequence implements a two-tier hashing strategy for exact sequence matching using the Code class constructor:
- Dual Hash System: Code class computes forward hash via Dedupe.hash(bases_) and reverse hash via Dedupe.hashReversed(bases_), storing a=Tools.max(fwd,rev) and b=Tools.min(fwd,rev) when rcomp=true
- 64-bit Hash Pairs: Each sequence represented by two long values (a,b) for both orientations, with hashCode() method returning (int)(a&0x7FFFFFFF)
- Memory Optimization: When storebases=false, bases field set to null and equals() method relies solely on hash comparison (a!=c.a || b!=c.b)
- Case Handling: Code constructor applies Tools.toUpperCase(bases[i]) transformation when toUpperCase=true (default case-insensitive mode)
Reference Loading Process
Reference sequences are loaded using multithreaded processing via LoadThread instances:
- Concurrent Loading: Multiple LoadThread instances (equal to Shared.threads() count) process reference files via ConcurrentReadInputStream
- Batch Processing: LoadThread accumulates Code objects in LinkedHashSet<Code> buffer, flushing to shared refSet when codes.size()>2000
- Memory Management: synchronized(refSet) blocks prevent race conditions during addToRef() calls from multiple LoadThreads
- Deduplication: HashSet<Code> refSet naturally eliminates duplicates via Code.equals() method comparing hash pairs and bases arrays
Filtering Algorithm
The core filtering logic operates through ProcessThread instances executing contains() method chain:
- Hash Lookup: contains() method first calls refSet.contains(c) where c=new Code(r.bases) for O(1) HashSet lookup
- Paired-End Support: containsBoth() requires both match1=contains(r1) && match2=contains(r2) to return true
- Reverse Complement: Code constructor automatically handles both orientations when rcomp=true via Tools.max(fwd,rev) selection
- Empty Sequence Handling: contains() method explicitly returns true when r.bases==null || r.bases.length<1
Approximate Matching (Slow Mode)
When maxSubs>0 || mismatchFraction>0 || maxLengthDif>0, contains() calls bruteForce() method:
- Linear Search: bruteForce() calls compare(bases) which iterates through ArrayList<byte[]> refList
- Length Filtering: compare() pre-filters using minlen=bases.length-maxLengthDif, maxlen=bases.length+maxLengthDif bounds
- Sliding Window: compare(byte[] a, byte[] b) slides shorter sequence along longer using aStartMax=a.length-b.length loop
- Mismatch Counting: Inner loop counts subs+=(a[apos]==b[bpos] ? 0 : 1), breaking when subs>maxSubs0
Performance Characteristics
- Exact Mode: O(1) lookup via HashSet.contains() using Code.hashCode() returning (int)(a&0x7FFFFFFF)
- Memory Usage: Each Code object stores 2 long values (16 bytes) plus optional byte[] bases; storebases=false eliminates bases storage
- Thread Scalability: Shared.threads() ProcessThread instances process reads concurrently via ConcurrentReadInputStream
- Approximate Mode: O(n*m*k) where bruteForce() compares n queries against m refList entries with k-length sequence comparisons
Code Structure
The implementation uses several key classes with specific responsibilities:
- Code Class: Inner class with constructor Code(byte[] bases_) that computes hash pairs (a,b) via Dedupe.hash() methods and optionally stores bases array
- ProcessThread: Extends Thread, implements processReadPair() calling containsBoth() which uses refSet.contains() for filtering
- LoadThread: Extends Thread, accumulates sequences in LinkedHashSet<Code> buffer and calls addToRef() to populate shared refSet
- Synchronized Collections: HashSet<Code> refSet for O(1) exact lookups, ArrayList<byte[]> refList for linear approximate matching
Performance Considerations
Speed Optimization
- Use exact matching (subs=0, mf=0.0) to leverage HashSet.contains() O(1) lookup performance
- Set storebases=false for large reference sets to eliminate byte[] bases storage in Code objects
- Use literal= for small sequences to avoid FileFormat.testInput() and ConcurrentReadInputStream overhead
- Consider case=false to avoid Tools.toUpperCase() transformation in Code constructor
Memory Management
- Default z="-Xmx800m" allocation from calcXmx() function works for most datasets
- Increase -Xmx when HashSet<Code> refSet plus ArrayList<byte[]> refList exceed available memory
- Monitor memory when Code objects store both hash pairs (16 bytes) and bases arrays (sequence length bytes)
- Use storebases=false to reduce Code memory footprint by eliminating bases storage
Approximate Matching Caveats
- Enabling subs>0 or mf>0 switches from O(1) HashSet lookup to O(n*m) bruteForce() linear search
- Runtime increases with reference set size due to ArrayList<byte[]> refList iteration in compare() method
- Consider refSet.contains() filtering before bruteForce() to eliminate obvious non-matches
- Use restrictive maxSubs and mismatchFraction to minimize sliding window iterations in compare(byte[] a, byte[] b)
Support
For questions and support:
- Email: bbushnell@lbl.gov
- Documentation: bbmap.org