EstherFilter
BLASTs query sequences against a reference and filters out hits whose score is below 'cutoff'. The score is taken from column 12 of the BLAST output, which is the bit score in the tabular (-m 8) format. The specific BLAST command is: blastall -p blastn -i QUERY -d REFERENCE -e 0.00001 -m 8
Basic Usage
estherfilter.sh <query> <reference> <cutoff>
This tool requires three positional arguments and accepts an optional fourth (described under Parameters):
- query: The input query FASTA file to BLAST against the reference
- reference: The reference database to BLAST against (must be formatted for BLAST)
- cutoff: The minimum BLAST score a hit must reach to be kept; the score is read from column 12 of the BLAST output
Parameters
All arguments are positional. The first three are required and the fourth is optional:
Positional Arguments
- <query>: Input FASTA file containing query sequences to be BLASTed
- <reference>: Reference database file (must be BLAST-formatted)
- <cutoff>: Minimum BLAST score threshold. Only hits with scores >= cutoff will be retained
Optional Parameters
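- fasta: Fourth argument. When set to "fasta", the tool outputs the matching sequences in FASTA format instead of just their names. This mode requires more memory, because the passing names are held in memory and the query file is read again to extract the sequences.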
Examples
Basic Filtering
estherfilter.sh reads.fasta genes.fasta 1000 > results.txt
BLASTs reads.fasta against genes.fasta and outputs the query name of every BLAST hit with a score ≥ 1000. A query with several passing hits is listed once per hit.
FASTA Output
estherfilter.sh reads.fasta genes.fasta 1000 fasta > filtered_sequences.fasta
Same filtering as above, but outputs the actual FASTA sequences instead of just names. Uses more memory to load and process the query file.
Lower Stringency Filtering
estherfilter.sh contigs.fasta reference_genome.fasta 500 > matching_contigs.txt
Filters contigs against a reference genome using a lower score threshold of 500
Algorithm Details
EstherFilter is a BLAST-based sequence filtering tool. It runs BLAST as an external process through ReadWrite.getInputStreamFromProcess() and then filters the reported hits by score. The implementation operates in distinct phases with specific data structures and methods:
BLAST Execution Phase
The tool executes BLAST commands through ReadWrite.getInputStreamFromProcess("foo", command, false, false, true):
blastall -p blastn -i [query] -d [reference] -e 0.00001 -m 8
- -p blastn: Uses nucleotide-nucleotide BLAST (blastn)
- -e 0.00001: Reports only hits with an E-value of at most 1e-5
- -m 8: Outputs results in tabular format (tab-delimited, one hit per line, 12 columns)
- Process Management: Uses ReadWrite.FORCE_KILL=true for reliable process termination
- Stream Handling: Wraps the process output in an InputStreamReader and a BufferedReader with a 32KB buffer (BufferedReader(isr, 32768))
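The launch step can be sketched with the standard library alone. This is a minimal illustration, not the BBTools code: ProcessBuilder stands in for ReadWrite.getInputStreamFromProcess(), and the class and method names (BlastLauncherSketch, buildCommand, open) are hypothetical. Running main assumes blastall is installed and on the PATH.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.util.List;

class BlastLauncherSketch {

    // Builds the argument list for the blastall command quoted in this guide.
    static List<String> buildCommand(String query, String reference) {
        return List.of("blastall", "-p", "blastn", "-i", query,
                       "-d", reference, "-e", "0.00001", "-m", "8");
    }

    // Wraps the process's standard output in a reader with a 32KB buffer.
    static BufferedReader open(Process p) {
        return new BufferedReader(new InputStreamReader(p.getInputStream()), 32768);
    }

    public static void main(String[] args) throws Exception {
        if (args.length < 2) {
            System.err.println("usage: BlastLauncherSketch <query> <reference>");
            return;
        }
        // Assumption: ProcessBuilder replaces the BBTools process helper here.
        Process p = new ProcessBuilder(buildCommand(args[0], args[1]))
                .redirectError(ProcessBuilder.Redirect.INHERIT)
                .start();
        try (BufferedReader br = open(p)) {
            // Echo the raw tabular hits; filtering is a separate step.
            for (String line = br.readLine(); line != null; line = br.readLine()) {
                System.out.println(line);
            }
        }
        p.waitFor();
    }
}
```

Passing the arguments as a list avoids shell quoting problems when file names contain spaces.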
Score-Based Filtering Implementation
Two distinct processing methods handle BLAST output parsing:
processToNames() Method
- Streams BLAST output line-by-line through BufferedReader.readLine()
- Splits tab-delimited lines using String.split("\t")
- Extracts BLAST score via Float.parseFloat(split[11].trim()) from column 12
- Applies threshold filtering with score >= cutoff comparison
- Outputs query names directly via System.out.println(split[0])
- No duplicate tracking: each qualifying line is processed independently, so a query with several passing hits is printed once per hit
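The names-mode steps above can be sketched as follows. This is an illustration under stated assumptions rather than the actual processToNames() source: the class and method names are hypothetical, the result is returned as a list instead of being printed directly, the length check on split is an added guard, and the BLAST lines in main are made-up sample data.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

class NamesFilterSketch {

    // Returns the query name (column 1) of every hit whose score (column 12) is >= cutoff.
    static List<String> passingNames(BufferedReader blastOutput, float cutoff) {
        List<String> out = new ArrayList<>();
        try {
            for (String line = blastOutput.readLine(); line != null; line = blastOutput.readLine()) {
                String[] split = line.split("\t");
                if (split.length < 12) { continue; } // added guard: skip blank or short lines
                float score = Float.parseFloat(split[11].trim());
                if (score >= cutoff) { out.add(split[0]); } // no duplicate tracking
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out;
    }

    public static void main(String[] args) {
        // Made-up hits in -m 8 layout; only columns 1 and 12 matter here.
        String demo =
            "read1\tgeneA\t99.0\t600\t6\t0\t1\t600\t1\t600\t0.0\t1100\n" +
            "read1\tgeneB\t98.0\t580\t11\t0\t1\t580\t1\t580\t0.0\t1050\n" +
            "read2\tgeneA\t90.0\t300\t30\t0\t1\t300\t1\t300\t1e-80\t400\n";
        for (String name : passingNames(new BufferedReader(new StringReader(demo)), 1000f)) {
            System.out.println(name);
        }
    }
}
```

Because nothing is remembered between lines, memory use stays flat no matter how many hits BLAST reports.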
processToFasta() Method
- Uses ArrayList<String> names collection for qualified sequence storage
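- Implements duplicate prevention with a prev.equals(split[0]) comparison against the previously seen query name, which suppresses repeats when a query's hits appear on consecutive lines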
- Collects all passing query names before sequence extraction
- Calls outputFasta(query, names) for sequence retrieval phase
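The name-collection step can be sketched as below. It is a hedged illustration, not the processToFasta() source: the class and method names are hypothetical, and the point at which prev is updated (here, only when a name is stored) is an assumption, since the guide does not say.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;

class FastaNamesSketch {

    // Collects passing query names, skipping a name equal to the last one stored.
    static ArrayList<String> collectNames(BufferedReader blastOutput, float cutoff) {
        ArrayList<String> names = new ArrayList<>();
        String prev = "";
        try {
            for (String line = blastOutput.readLine(); line != null; line = blastOutput.readLine()) {
                String[] split = line.split("\t");
                if (split.length < 12) { continue; } // added guard: skip blank or short lines
                float score = Float.parseFloat(split[11].trim());
                if (score >= cutoff && !prev.equals(split[0])) {
                    names.add(split[0]);
                    prev = split[0]; // assumption: updated only when a name is stored
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return names;
    }

    // Builds a made-up -m 8 line; only the query name and score vary.
    static String hit(String query, String score) {
        return query + "\tgeneA\t99.0\t600\t6\t0\t1\t600\t1\t600\t0.0\t" + score + "\n";
    }

    public static void main(String[] args) {
        String demo = hit("read1", "1100") + hit("read1", "1050")
                    + hit("read2", "900") + hit("read3", "1200");
        System.out.println(collectNames(new BufferedReader(new StringReader(demo)), 1000f));
    }
}
```

Comparing against only the previous name costs no extra memory, but a name that reappears after a different passing name would be stored again. That is harmless for the later lookup, which only tests membership.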
FASTA Output Implementation
The outputFasta() method implements a two-stage sequence extraction process:
Stage 1: Name Sorting
- Sorts the name collection using Shared.sort(names) for consistent ordering
- A sorted ArrayList is required for the binary search in Stage 2 to give correct results
Stage 2: Sequence Extraction
- Creates FileFormat object via FileFormat.testInput(fname, FileFormat.FASTA, null, false, true)
- Initializes ConcurrentReadInputStream.getReadInputStream(-1L, false, ff, null)
- Processes reads in ListNum<Read> chunks through cris.nextList()
- Matches sequence IDs using Collections.binarySearch(names, r.id) for O(log n) lookup
- Outputs matched sequences via r.toFasta(70) with 70-character line wrapping
- Manages memory through cris.returnList(ln) after each chunk
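The two stages can be sketched with the standard library. This is an illustration only: a simple line-based FASTA reader stands in for ConcurrentReadInputStream, Collections.sort stands in for Shared.sort, and the class and method names are hypothetical. It also assumes the sequence ID is the whole header line after '>', which may differ from how BLAST reports names when headers contain spaces.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

class FastaExtractSketch {

    // Formats one record, wrapping the sequence at 'wrap' characters per line.
    static String toFasta(String id, CharSequence bases, int wrap) {
        StringBuilder sb = new StringBuilder(">").append(id).append('\n');
        for (int i = 0; i < bases.length(); i += wrap) {
            sb.append(bases, i, Math.min(i + wrap, bases.length())).append('\n');
        }
        return sb.toString();
    }

    // Stage 1 sorts the names; Stage 2 streams the FASTA and keeps matching records.
    static String extract(BufferedReader fasta, List<String> names) {
        List<String> sorted = new ArrayList<>(names);
        Collections.sort(sorted); // binarySearch requires sorted input
        StringBuilder out = new StringBuilder();
        StringBuilder bases = new StringBuilder();
        String id = null;
        try {
            for (String line = fasta.readLine(); ; line = fasta.readLine()) {
                if (line == null || line.startsWith(">")) {
                    // A record just ended: emit it if its ID is in the name list.
                    if (id != null && Collections.binarySearch(sorted, id) >= 0) {
                        out.append(toFasta(id, bases, 70));
                    }
                    if (line == null) { break; }
                    id = line.substring(1); // assumption: ID is the full header
                    bases.setLength(0);
                } else {
                    bases.append(line.trim());
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String demo = ">read1\nACGT\nACGT\n>read2\nTTTT\n"; // made-up records
        System.out.print(extract(new BufferedReader(new StringReader(demo)), List.of("read1")));
    }
}
```

Sorting once and searching per record keeps each lookup logarithmic in the number of names. A HashSet would be a common alternative, but it is not what this guide describes.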
Performance Characteristics
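- Memory Usage: Names mode needs little beyond the 32KB BufferedReader buffer; FASTA mode additionally stores the ArrayList<String> of passing names plus Read objects
- Search Complexity: Binary search provides O(log n) sequence ID matching per read in FASTA mode, where n is the number of stored names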
- I/O Strategy: Single-pass streaming for names mode, two-pass for FASTA mode (BLAST + sequence file)
- Process Management: ReadWrite.finishReading() ensures proper stream cleanup and process termination
BBTools Integration Points
EstherFilter leverages specific BBTools infrastructure components:
- ReadWrite: Process management through getInputStreamFromProcess() and finishReading()
- ConcurrentReadInputStream: Multithreaded sequence file reading with chunk-based processing
- Shared.sort(): Consistent sorting implementation across BBTools
- FileFormat: Automatic file type detection with testInput() method
- Read.toFasta(): Standardized FASTA output formatting with configurable line length
Support
For questions and support:
- Email: bbushnell@lbl.gov
- Documentation: bbmap.org