FilterBySequence

Script: filterbysequence.sh
Package: jgi
Class: FilterBySequence.java

Filters sequences by exact match against a set of reference sequences. Inexact matching is also supported but is extremely slow.

Basic Usage

filterbysequence.sh in=<file> out=<file> ref=<file> include=<t/f>

FilterBySequence filters sequences based on exact or approximate matches to reference sequences. It can operate in two modes: inclusion (keeping matching sequences) or exclusion (removing matching sequences). The tool supports both reference files and literal sequences as filtering criteria.
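The two modes can be pictured with a minimal Python sketch. It is illustrative only and is not the tool's Java implementation: filter_reads, revcomp, and the in-memory set are hypothetical names, and matching here is exact and case-insensitive with reverse complements enabled, mirroring the documented defaults case=f and rcomp=t.

```python
# Illustrative sketch only; the real tool is written in Java and hashes sequences.
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")

def revcomp(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]

def filter_reads(reads, refs, include=False, rcomp=True):
    """Keep matching reads when include=True, otherwise drop them."""
    ref_set = {r.upper() for r in refs}      # case-insensitive, as with case=f
    kept = []
    for read in reads:
        query = read.upper()
        hit = query in ref_set or (rcomp and revcomp(query) in ref_set)
        if hit == include:                   # include=t keeps hits; include=f keeps misses
            kept.append(read)
    return kept

reads = ["ACGTTT", "AAACGT", "GGGGGG"]
refs = ["ACGTTT"]
print(filter_reads(reads, refs, include=True))    # reads matching the reference
print(filter_reads(reads, refs, include=False))   # reads not matching the reference
```

In this sketch the second read matches only through its reverse complement, so it is kept with include=t and removed with include=f.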

Parameters

Parameters are grouped by function: I/O, processing, and Java settings.

I/O Parameters

in=
Primary input file. Use 'in2' to specify a second file for paired-end data. Can accept FASTA or FASTQ formats, including compressed files.
out=
Primary output file. Use 'out2' to specify a second file for paired-end data. Output format matches input format.
ref=
Reference file containing sequences to match against. Can be a single file or comma-delimited list of files. Accepts FASTA or FASTQ formats.
literal=
Literal sequence or comma-delimited list of sequences to match against. Use this for small numbers of specific sequences instead of reference files.
ow=t
(overwrite) Overwrites files that already exist. Set to false to prevent accidental overwriting of existing output files.

Processing Parameters

include=f
Set to 'true' to include (keep) matching sequences rather than excluding (removing) them. When false, matching sequences are filtered out.
rcomp=t
Match reverse complements as well as forward sequences. Essential for DNA sequence filtering where orientation may vary.
case=f
(casesensitive) Require matching case for sequence comparison. When false, case is ignored. Case-sensitive matching requires storebases=t.
storebases=t
(sb) Store reference bases in memory. When true, the Code constructor keeps the byte[] bases so that matches can be confirmed with Tools.equals(). When false, bases is null and equals() relies only on comparing the pair of 64-bit hashes (the a and b values from the Dedupe.hash() methods), which saves memory but makes matching probabilistic.
threads=auto
(t) Number of worker threads. With auto, the thread count comes from Shared.threads() and is used to create the ProcessThread and LoadThread instances, which read concurrently through ConcurrentReadInputStream.
subs=0
Maximum number of substitutions allowed in approximate matching. Higher values increase sensitivity, but any value above 0 switches to the brute-force path, which is drastically slower than exact matching.
mf=0.0
(mismatchfraction) Maximum fraction of bases that can mismatch. The actual number allowed is max(subs, mf*min(query.length, ref.length)). Range: 0.0-1.0.
lengthdif=0
Maximum allowed length difference between query and reference sequences. Allows matching of sequences with different lengths within this tolerance.
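
As a worked example of the limits above, the following Python sketch evaluates the documented formula max(subs, mf*min(query.length, ref.length)) together with the lengthdif check. The function names are hypothetical, and truncating the fractional product to a whole number is an assumption, since this document does not say how it is rounded.

```python
def allowed_mismatches(subs, mf, query_len, ref_len):
    """Documented limit: max(subs, mf * min(query_len, ref_len)).

    Truncating the product to an integer is an assumption of this sketch.
    """
    return max(subs, int(mf * min(query_len, ref_len)))

def length_ok(query_len, ref_len, lengthdif):
    """True when the two lengths differ by at most lengthdif bases."""
    return abs(query_len - ref_len) <= lengthdif

# subs=2 and mf=0.05 with a 150-base query and a 100-base reference
print(allowed_mismatches(subs=2, mf=0.05, query_len=150, ref_len=100))
# mf=0.0 leaves subs as the only limit
print(allowed_mismatches(subs=2, mf=0.0, query_len=150, ref_len=100))
# a 5-base length difference passes lengthdif=10
print(length_ok(150, 145, lengthdif=10))
```

The shorter of the two sequences sets the fractional limit, so a long query compared with a short reference gets the smaller allowance.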

Java Parameters

-Xmx
Sets Java's memory usage, overriding autodetection. Use format like -Xmx20g for 20 GB or -Xmx200m for 200 MB. Maximum is typically 85% of physical memory.
-eoom
Exit process if an out-of-memory exception occurs. Requires Java 8u92+. Useful for preventing hung processes in memory-constrained environments.
-da
Disable Java assertions. May provide slight performance improvement in production environments.

Examples

Basic Sequence Filtering

# Remove contaminating sequences from reads
filterbysequence.sh in=reads.fq out=clean.fq ref=contaminants.fa include=f

# Keep only sequences matching specific targets
filterbysequence.sh in=reads.fq out=targets.fq ref=targets.fa include=t

Basic filtering operations showing exclusion (contamination removal) and inclusion (target enrichment).

Literal Sequence Filtering

# Filter using specific sequences
filterbysequence.sh in=reads.fq out=filtered.fq literal=ATGCGTACGT,GCTAGCTAGC include=f

# Remove adapter sequences
filterbysequence.sh in=reads.fq out=clean.fq \
    literal=AGATCGGAAGAGC,CTGTCTCTTATACACATCT include=f

Using literal sequences for targeted filtering without needing reference files.

Approximate Matching

# Allow up to 2 mismatches
filterbysequence.sh in=reads.fq out=filtered.fq ref=targets.fa subs=2 include=t

# Allow 5% mismatch rate
filterbysequence.sh in=reads.fq out=filtered.fq ref=targets.fa mf=0.05 include=t

# Flexible length matching
filterbysequence.sh in=reads.fq out=filtered.fq ref=targets.fa lengthdif=10 include=t

Approximate matching examples. Note: these modes are significantly slower than exact matching.

Memory-Optimized Filtering

# Probabilistic matching for large reference sets
filterbysequence.sh in=reads.fq out=filtered.fq ref=large_db.fa \
    storebases=f case=f include=f

# High-memory exact matching
filterbysequence.sh in=reads.fq out=filtered.fq ref=targets.fa \
    storebases=t case=t include=t -Xmx32g

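Two memory strategies: storebases=f lowers memory use at the cost of probabilistic (hash-only) matching, while storebases=t with case=t gives exact, case-sensitive matching and needs more memory.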

Algorithm Details

Exact Matching Strategy

FilterBySequence implements a two-tier hashing strategy for exact sequence matching, built on the Code class constructor.
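
One way to picture the two tiers is a key object that always carries a pair of 64-bit hashes and optionally the bases themselves, as described for storebases. The Python sketch below is an illustration built on assumptions: SeqKey is a hypothetical stand-in for the Code class, and BLAKE2b digests replace the Dedupe.hash() functions, which are not reproduced here.

```python
import hashlib

def _hash64(data, salt):
    """64-bit hash of data; BLAKE2b stands in for the tool's own hash functions."""
    digest = hashlib.blake2b(data, digest_size=8, salt=salt).digest()
    return int.from_bytes(digest, "big")

class SeqKey:
    """Hypothetical analogue of Code: a hash pair plus optional bases."""

    def __init__(self, bases, store_bases=True):
        self.a = _hash64(bases, b"a")
        self.b = _hash64(bases, b"b")
        self.bases = bases if store_bases else None

    def __hash__(self):
        return self.a  # tier 1: bucket lookup through one hash

    def __eq__(self, other):
        if not isinstance(other, SeqKey):
            return NotImplemented
        if (self.a, self.b) != (other.a, other.b):
            return False
        # Tier 2: confirm with the bases when both sides kept them.
        if self.bases is not None and other.bases is not None:
            return self.bases == other.bases
        return True  # hash-only mode: a probabilistic match

refs = {SeqKey(b"ACGTACGT", store_bases=False)}
print(SeqKey(b"ACGTACGT") in refs)
print(SeqKey(b"ACGTACGA") in refs)
```

Dropping the bases shrinks each key to two integers, which is the trade-off behind storebases=f: less memory per reference, with equality decided by the hashes alone.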

Reference Loading Process

Reference sequences are loaded in parallel by LoadThread instances.
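
The loading step can be sketched as several workers parsing reference files and adding their sequences to one shared set. This Python version is only an analogy for the LoadThread design: load_refs and parse_fasta are hypothetical helpers, a thread pool stands in for the tool's own threads, and upper-casing mirrors case=f.

```python
import os
import tempfile
import threading
from concurrent.futures import ThreadPoolExecutor

def parse_fasta(path):
    """Yield sequences from a FASTA file, joining wrapped lines."""
    chunks = []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                if chunks:
                    yield "".join(chunks)
                chunks = []
            elif line:
                chunks.append(line)
    if chunks:
        yield "".join(chunks)

def load_refs(paths, threads=4):
    """Load every reference sequence into one shared, lock-guarded set."""
    ref_set, lock = set(), threading.Lock()

    def worker(path):
        local = {seq.upper() for seq in parse_fasta(path)}
        with lock:                       # merge once per file to limit contention
            ref_set.update(local)

    with ThreadPoolExecutor(max_workers=threads) as pool:
        list(pool.map(worker, paths))    # consume results so worker errors surface
    return ref_set

# Usage with a throwaway reference file
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "refs.fa")
    with open(path, "w") as out:
        out.write(">r1\nACGT\nacgt\n>r2\nGGCC\n")
    print(sorted(load_refs([path])))
```

Each worker builds a local set first and takes the lock once per file, a common way to keep threads from contending on every insertion.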

Filtering Algorithm

The core filtering logic runs in ProcessThread instances, which execute the contains() method chain.

Approximate Matching (Slow Mode)

When maxSubs>0, mismatchFraction>0, or maxLengthDif>0, contains() calls the bruteForce() method.
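
A brute-force comparison of this kind can be sketched as a linear scan over every reference, with an early exit once the mismatch budget is spent. The Python below is a hedged illustration and not the tool's bruteForce() method: the function names are hypothetical, sequences are compared position by position from the first base, and how the real code aligns sequences of unequal length is not stated in this document.

```python
def within_budget(query, ref, max_subs, mismatch_fraction, max_length_dif):
    """True if query matches ref within the substitution and length limits."""
    if abs(len(query) - len(ref)) > max_length_dif:
        return False
    shared = min(len(query), len(ref))
    budget = max(max_subs, int(mismatch_fraction * shared))  # truncation is assumed
    mismatches = 0
    for q, r in zip(query, ref):         # zip stops at the shorter sequence
        if q != r:
            mismatches += 1
            if mismatches > budget:
                return False             # early exit once over budget
    return True

def brute_force(query, refs, max_subs=0, mismatch_fraction=0.0, max_length_dif=0):
    """Scan all references; cost grows with reference count times read length."""
    return any(
        within_budget(query, ref, max_subs, mismatch_fraction, max_length_dif)
        for ref in refs
    )

refs = ["ACGTACGTAC", "TTTTTTTTTT"]
print(brute_force("ACGTACGTAC", refs))              # identical to the first reference
print(brute_force("ACGAACGTAA", refs, max_subs=2))  # two substitutions, budget of 2
print(brute_force("ACGAACGTAA", refs, max_subs=1))  # two substitutions, budget of 1
```

Because each query visits every reference instead of performing one hash lookup, the cost scales with the size of the reference set, which is consistent with the warning that these modes are much slower than exact matching.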

Performance Characteristics

Code Structure

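The implementation uses several key classes with specific responsibilities, including Code (sequence keys for matching), LoadThread (reference loading), and ProcessThread (read filtering).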

Performance Considerations

Speed Optimization

Memory Management

Approximate Matching Caveats

Support

For questions and support: