EstherFilter

Script: estherfilter.sh
Package: driver
Class: EstherFilter.java

BLASTs queries against a reference and filters out hits with scores less than 'cutoff'. The score is taken from column 12 of the BLAST output, which is the bit score in tabular format. The specific BLAST command is:

blastall -p blastn -i QUERY -d REFERENCE -e 0.00001 -m 8

Basic Usage

estherfilter.sh <query> <reference> <cutoff>

This tool requires three positional arguments (an optional fourth is described under Parameters):

query: The input query FASTA file to BLAST against the reference
reference: The reference database to BLAST against (must be formatted for BLAST)
cutoff: The minimum BLAST score threshold from column 12 of BLAST output

Parameters

This tool takes three required positional arguments and one optional fourth argument:

Positional Arguments

<query>: Input FASTA file containing query sequences to be BLASTed
<reference>: Reference database file (must be BLAST-formatted)
<cutoff>: Minimum BLAST score threshold. Only hits with scores >= cutoff will be retained

Optional Parameters

fasta: Fourth argument. When set to "fasta", outputs results in FASTA format instead of just sequence names. Requires more memory as it loads the entire query file.

Examples

Basic Filtering

estherfilter.sh reads.fasta genes.fasta 1000 > results.txt

BLASTs reads.fasta against genes.fasta and outputs only the names of query sequences that have BLAST hits with scores ≥ 1000.

FASTA Output

estherfilter.sh reads.fasta genes.fasta 1000 fasta > filtered_sequences.fasta

Same filtering as above, but outputs the actual FASTA sequences instead of just names. Uses more memory to load and process the query file.

Lower Stringency Filtering

estherfilter.sh contigs.fasta reference_genome.fasta 500 > matching_contigs.txt

Filters contigs against a reference genome using a lower score threshold of 500.

Algorithm Details

EstherFilter is a BLAST-based sequence filtering tool. It runs BLAST as an external process through ReadWrite.getInputStreamFromProcess() and then filters the reported hits by score. The implementation operates in distinct phases with specific data structures and methods:

BLAST Execution Phase

The tool executes BLAST commands through ReadWrite.getInputStreamFromProcess("foo", command, false, false, true):

blastall -p blastn -i [query] -d [reference] -e 0.00001 -m 8

-p blastn: Uses nucleotide-nucleotide BLAST (blastn)
-e 0.00001: Sets the E-value threshold to 1e-5 for statistical significance
-m 8: Outputs results in tabular (tab-delimited) format
Process Management: Sets ReadWrite.FORCE_KILL=true for reliable process termination
Stream Handling: Wraps the process output in an InputStreamReader and a BufferedReader with a 32 KB buffer (BufferedReader(isr, 32768))
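
The execution phase above can be approximated with the standard library alone. The following is a minimal sketch, not the EstherFilter source: it swaps ReadWrite.getInputStreamFromProcess() for java.lang.ProcessBuilder, and the class and method names (BlastCommandSketch, buildCommand, open) are hypothetical. It assumes blastall is on the PATH and that the reference has already been formatted for BLAST.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for the BLAST execution phase; not part of BBTools.
class BlastCommandSketch {

    /** Builds the argument list for the blastall command quoted in the text. */
    static List<String> buildCommand(String query, String reference) {
        return new ArrayList<>(Arrays.asList(
            "blastall", "-p", "blastn", "-i", query, "-d", reference,
            "-e", "0.00001", "-m", "8"));
    }

    /** Wraps the process stdout in a reader with a 32 KB buffer. */
    static BufferedReader open(Process p) {
        return new BufferedReader(new InputStreamReader(p.getInputStream()), 32768);
    }

    public static void main(String[] args) throws IOException, InterruptedException {
        if (args.length < 2) {
            System.err.println("usage: BlastCommandSketch <query> <reference>");
            return;
        }
        Process p = new ProcessBuilder(buildCommand(args[0], args[1]))
            .redirectError(ProcessBuilder.Redirect.INHERIT) // pass BLAST errors through
            .start();
        try (BufferedReader br = open(p)) {
            // Echo the raw tabular output; score filtering is a separate step.
            for (String line = br.readLine(); line != null; line = br.readLine()) {
                System.out.println(line);
            }
        } finally {
            // Assumed analogue of forced termination; no effect if BLAST already exited.
            p.destroyForcibly();
        }
        p.waitFor();
    }
}
```

Passing the arguments as a list rather than a single command string avoids shell quoting problems when file names contain spaces.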

Score-Based Filtering Implementation

Two distinct processing methods handle BLAST output parsing:

processToNames() Method

Streams BLAST output line-by-line through BufferedReader.readLine()
Splits tab-delimited lines using String.split("\t")
Extracts BLAST score via Float.parseFloat(split[11].trim()) from column 12
Applies threshold filtering with score >= cutoff comparison
Outputs query names directly via System.out.println(split[0])
No duplicate tracking: each qualifying line is processed independently, so a query with several passing hits is printed once per hit
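
The steps listed for names mode can be sketched as follows. This is an illustration under stated assumptions rather than the actual method body: the names NamesFilterSketch, passingName, and printPassingNames are hypothetical, the sample BLAST lines are made up, and skipping lines with fewer than 12 columns is an added safeguard that the text does not describe.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;

// Hypothetical sketch of names-mode filtering; not the BBTools implementation.
class NamesFilterSketch {

    /** Returns the query name (column 1) if the column 12 score is >= cutoff, else null. */
    static String passingName(String line, float cutoff) {
        String[] split = line.split("\t");
        if (split.length < 12) { return null; } // assumption: ignore short or malformed lines
        float score = Float.parseFloat(split[11].trim());
        return score >= cutoff ? split[0] : null;
    }

    /** Streams lines from the reader and prints each passing name immediately. */
    static void printPassingNames(BufferedReader br, float cutoff) throws IOException {
        for (String line = br.readLine(); line != null; line = br.readLine()) {
            String name = passingName(line, cutoff);
            if (name != null) { System.out.println(name); }
        }
    }

    public static void main(String[] args) throws IOException {
        // Made-up tabular BLAST output: 12 tab-separated columns per hit.
        String blast =
            "read1\tgeneA\t99.0\t600\t6\t0\t1\t600\t1\t600\t0.0\t1100\n" +
            "read1\tgeneB\t98.0\t580\t10\t0\t1\t580\t1\t580\t0.0\t1050\n" +
            "read2\tgeneA\t90.0\t300\t30\t0\t1\t300\t1\t300\t1e-80\t400\n";
        printPassingNames(new BufferedReader(new StringReader(blast), 32768), 1000f);
    }
}
```

With a cutoff of 1000, the sample input prints read1 twice, once for each passing hit, because nothing in this mode tracks names that were already printed.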

processToFasta() Method

Uses ArrayList<String> names collection for qualified sequence storage
Prevents duplicates by comparing each name with the previous one via prev.equals(split[0]), which suppresses names repeated on consecutive lines
Collects all passing query names before sequence extraction
Calls outputFasta(query, names) for sequence retrieval phase
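
The name-collection step for FASTA mode might look like the sketch below. It is not the EstherFilter source: FastaNamesSketch and collectNames are hypothetical names, the input is a plain list of lines instead of a process stream, and updating prev only when a line passes the cutoff is an assumption.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of FASTA-mode name collection; not the BBTools implementation.
class FastaNamesSketch {

    /** Collects passing query names, skipping a name equal to the one stored just before it. */
    static ArrayList<String> collectNames(List<String> blastLines, float cutoff) {
        ArrayList<String> names = new ArrayList<>();
        String prev = "";
        for (String line : blastLines) {
            String[] split = line.split("\t");
            if (split.length < 12) { continue; } // assumption: ignore short or malformed lines
            if (Float.parseFloat(split[11].trim()) < cutoff) { continue; }
            if (!prev.equals(split[0])) {
                names.add(split[0]);
                prev = split[0]; // assumption: only passing lines update prev
            }
        }
        return names;
    }

    public static void main(String[] args) {
        // Made-up tabular BLAST output: 12 tab-separated columns per hit.
        List<String> lines = Arrays.asList(
            "read1\tgeneA\t99.0\t600\t6\t0\t1\t600\t1\t600\t0.0\t1100",
            "read1\tgeneB\t98.0\t580\t10\t0\t1\t580\t1\t580\t0.0\t1050",
            "read2\tgeneA\t90.0\t300\t30\t0\t1\t300\t1\t300\t1e-80\t400",
            "read3\tgeneC\t99.5\t700\t3\t0\t1\t700\t1\t700\t0.0\t1200");
        System.out.println(collectNames(lines, 1000f)); // prints [read1, read3]
    }
}
```

Comparing against only the previous name is enough when all hits for a query appear on adjacent lines, which is how tabular BLAST output is normally grouped. A name that reappeared later in the stream would be stored again.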

FASTA Output Implementation

The outputFasta() method implements a two-stage sequence extraction process:

Stage 1: Name Sorting

Sorts name collection using Shared.sort(names) for consistent ordering
Prepares ArrayList for binary search operations

Stage 2: Sequence Extraction

Creates FileFormat object via FileFormat.testInput(fname, FileFormat.FASTA, null, false, true)
Initializes ConcurrentReadInputStream.getReadInputStream(-1L, false, ff, null)
Processes reads in ListNum<Read> chunks through cris.nextList()
Matches sequence IDs using Collections.binarySearch(names, r.id) for O(log n) lookup
Outputs matched sequences via r.toFasta(70) with 70-character line wrapping
Manages memory through cris.returnList(ln) after each chunk
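
The two stages can be illustrated without the BBTools classes. In this sketch, which is an approximation and not the outputFasta() source, records are plain {id, bases} string pairs standing in for Read objects, Collections.sort replaces Shared.sort, and the helper names (OutputFastaSketch, toFasta, select) are hypothetical. The exact output of the real Read.toFasta(70) may differ.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical sketch of the sort-then-binary-search extraction; not the BBTools implementation.
class OutputFastaSketch {

    /** Formats one record as FASTA, breaking the sequence every `wrap` characters. */
    static String toFasta(String id, String bases, int wrap) {
        StringBuilder sb = new StringBuilder();
        sb.append('>').append(id).append('\n');
        for (int i = 0; i < bases.length(); i += wrap) {
            sb.append(bases, i, Math.min(i + wrap, bases.length())).append('\n');
        }
        return sb.toString();
    }

    /** Returns FASTA text for every {id, bases} record whose id occurs in names. */
    static String select(List<String> names, List<String[]> records) {
        // Stage 1: sort a copy of the names so binary search is valid.
        List<String> sorted = new ArrayList<>(names);
        Collections.sort(sorted);
        // Stage 2: scan the records once, keeping those whose id is found.
        StringBuilder out = new StringBuilder();
        for (String[] r : records) {
            if (Collections.binarySearch(sorted, r[0]) >= 0) {
                out.append(toFasta(r[0], r[1], 70));
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        List<String> names = Arrays.asList("read3", "read1");
        List<String[]> records = new ArrayList<>();
        records.add(new String[] {"read1", "ACGTACGTAC"});
        records.add(new String[] {"read2", "GGGGCCCC"});
        records.add(new String[] {"read3", "TTTTAAAA"});
        System.out.print(select(names, records));
    }
}
```

Output follows the order of the sequence file, not the order of the BLAST hits, because the file is scanned once and each record is looked up in the sorted name list.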

Performance Characteristics

Memory Usage: Names mode uses a 32 KB BufferedReader buffer; FASTA mode also stores an ArrayList<String> of names plus Read objects
Search Complexity: Binary search gives O(log n) matching per sequence ID in FASTA mode, where n is the number of stored names
I/O Strategy: Single-pass streaming for names mode, two-pass for FASTA mode (BLAST + sequence file)
Process Management: ReadWrite.finishReading() ensures proper stream cleanup and process termination

BBTools Integration Points

EstherFilter leverages specific BBTools infrastructure components:

ReadWrite: Process management through getInputStreamFromProcess() and finishReading()
ConcurrentReadInputStream: Multithreaded sequence file reading with chunk-based processing
Shared.sort(): Consistent sorting implementation across BBTools
FileFormat: Automatic file type detection with testInput() method
Read.toFasta(): Standardized FASTA output formatting with configurable line length

Support

For questions and support:

Email: bbushnell@lbl.gov
Documentation: bbmap.org