FilterByTaxa
Filters sequences according to their taxonomy, as determined by the sequence name. Sequences should be labeled with a gi number, NCBI taxID, or species name.
Basic Usage
filterbytaxa.sh in=<input file> out=<output file> tree=<tree file> table=<table file> ids=<numbers> level=<name or number>
Parameters
Parameters are organized by their function in the taxonomic filtering process. The tool uses NCBI taxonomy data structures to filter sequences based on taxonomic classification, with support for hierarchical filtering at different taxonomic levels.
I/O parameters
- in=<file>
- Primary input, or read 1 input.
- out=<file>
- Primary output, or read 1 output.
- results=<file>
- Optional; prints a list indicating which taxa were retained.
- overwrite=f
- (ow) Set to false to force the program to abort rather than overwrite an existing file.
- showspeed=t
- (ss) Set to 'f' to suppress display of processing speed.
- ziplevel=2
- (zl) Set to 1 (lowest) through 9 (max) to change compression level; lower compression is faster.
Processing parameters
- level=
- Taxonomic level, such as phylum. Filtering will operate on sequences within the same taxonomic level as specified ids. If not set, only matches to a node or its descendants will be considered.
- reqlevel=
- Require nodes to have ancestors at these levels. For example, reqlevel=species,genus would ban nodes that are not defined at both the species and genus levels.
- ids=
- Comma-delimited list of NCBI numeric IDs. Can also be a file with one taxID per line. Names (like bacteria) are also acceptable.
- include=f
- 'f' will discard filtered sequences, 't' will keep them.
- besteffort=f
- Intended for include mode. Iteratively increases level while the input file has no hits to the tax list.
- tree=<file>
- Specify a TaxTree file like tree.taxtree.gz. On Genepool, use 'auto'.
- gi=<file>
- Specify a gitable file like gitable.int1d.gz. Only needed if gi numbers will be used. On Genepool, use 'auto'.
- accession=
- Specify one or more comma-delimited NCBI accession to taxid files. Only needed if accessions will be used; requires ~45GB of memory. On Genepool, use 'auto'.
- printnodes=t
- Print the names of nodes added to the filter.
- requirepresent=t
- Crash with an error message if a header cannot be resolved to a taxid.
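As a rough illustration of how an ids= value could be interpreted, the sketch below treats the argument as a file when one exists at that path, and as a comma-delimited list otherwise. IdListParser and its parse() method are hypothetical names for this example and are not part of BBTools; the real tool may resolve the argument differently.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper for illustration; not part of BBTools. */
final class IdListParser {

    /** Returns the ID or name tokens from a comma-delimited string or a one-per-line file. */
    static List<String> parse(String arg) {
        Path path = Paths.get(arg);
        List<String> raw = new ArrayList<>();
        if (Files.isRegularFile(path)) {
            try {
                raw.addAll(Files.readAllLines(path)); // one taxID or name per line
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        } else {
            for (String token : arg.split(",")) {
                raw.add(token);
            }
        }
        List<String> cleaned = new ArrayList<>();
        for (String token : raw) {
            String t = token.trim();
            if (!t.isEmpty()) {
                cleaned.add(t); // skip blank lines and empty tokens
            }
        }
        return cleaned;
    }

    public static void main(String[] args) {
        System.out.println(parse("2,bacteria,2759"));
    }
}
```

Tokens are returned as strings because ids= accepts both numeric IDs and names; turning names into numbers would be a separate step against the taxonomy tree.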
String-matching parameters
- regex=
- Filter names matching this Java regular expression.
- contains=
- Filter names containing this substring (case-insensitive).
Java Parameters
- -Xmx
- This will set Java's memory usage, overriding autodetection. -Xmx20g will specify 20 gigs of RAM, and -Xmx200m will specify 200 megs. The max is typically 85% of physical memory.
- -eoom
- This flag will cause the process to exit if an out-of-memory exception occurs. Requires Java 8u92+.
- -da
- Disable assertions.
Examples
Filter by Taxonomic ID
filterbytaxa.sh in=sequences.fasta out=bacteria.fasta tree=auto gi=auto ids=2 include=t
Keeps only sequences from bacteria (taxID 2). Here tree=auto and gi=auto select the default taxonomy files, which are available on Genepool.
Filter by Taxonomic Level
filterbytaxa.sh in=sequences.fasta out=eukaryotes.fasta tree=auto ids=2759 level=kingdom include=t
Keeps only sequences from Eukaryota (taxID 2759), with filtering applied at the kingdom level. Note that NCBI ranks Eukaryota above kingdom.
Filter Using Multiple Criteria
filterbytaxa.sh in=sequences.fasta out=filtered.fasta tree=auto ids=bacteria,archaea reqlevel=species,genus
Discards sequences from bacteria and archaea (include defaults to f) and, through reqlevel, bans nodes that are not defined at both the species and genus levels.
String-based Filtering
filterbytaxa.sh in=sequences.fasta out=ecoli.fasta regex=".*Escherichia.*coli.*" include=t
Uses regular expression matching to keep sequences whose names contain "Escherichia" followed by "coli".
Generate Taxa Report
filterbytaxa.sh in=sequences.fasta out=filtered.fasta tree=auto ids=2 results=taxa_report.txt
Discards bacterial sequences (include defaults to f) and writes a report file listing the taxa that were retained.
Algorithm Details
Taxonomic Filtering Architecture
FilterByTaxa builds its filter through the TaxFilter.makeFilter() factory method, which constructs filtering criteria from NCBI's hierarchical taxonomy. Three mechanisms do the matching: a HashSet<Integer> of taxIDs for exact numeric matching, a regular expression compiled with Pattern.compile(), and case-insensitive substring matching via String.toLowerCase().contains().
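The two string-based mechanisms can be sketched as follows. NameMatcher is a hypothetical class written for this example, not the TaxFilter code itself. It assumes the regex must match the whole name (as the leading and trailing .* in the regex example suggest) and that a name passes if either criterion matches; both are assumptions about behavior the text does not spell out.

```java
import java.util.Locale;
import java.util.regex.Pattern;

/** Hypothetical helper for illustration; not part of BBTools. */
final class NameMatcher {
    private final Pattern regex;  // null when no regex was supplied
    private final String needle;  // lowercased substring, null when not supplied

    NameMatcher(String regexText, String containsText) {
        // Compile once up front so each header only pays for matching.
        this.regex = (regexText == null) ? null : Pattern.compile(regexText);
        this.needle = (containsText == null) ? null : containsText.toLowerCase(Locale.ROOT);
    }

    /** True if the name matches the regex (whole-name match) or contains the substring. */
    boolean matches(String name) {
        if (regex != null && regex.matcher(name).matches()) {
            return true;
        }
        return needle != null && name.toLowerCase(Locale.ROOT).contains(needle);
    }

    public static void main(String[] args) {
        NameMatcher m = new NameMatcher(".*Escherichia.*coli.*", null);
        System.out.println(m.matches("gi|123|Escherichia coli K-12"));
        System.out.println(m.matches("Bacillus subtilis"));
    }
}
```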
Core Processing Strategy
The tool reads sequences through ConcurrentReadInputStream.getReadInputStream(), with buffer counts capped by Shared.capBuffers(4). Each sequence is evaluated in the processReadPair() method, which calls TaxFilter.passesFilter() on the sequence header ID:
- Taxonomic ID resolution: TaxTree.parseNodeFromHeader() extracts taxIDs from sequence headers using GI number, NCBI taxID, or species name parsing
- HashSet-based matching: Uses HashSet<Integer> taxSet.contains() for O(1) average-case taxID lookups
- Hierarchical traversal: TaxNode parent-child navigation via tn.pid references until tn.id==tn.pid root condition
- Bitflag level validation: reqLevels uses bit masking with (1<<tn.level) operations for taxonomic level requirements
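The set lookup, parent traversal, and level bitmask described above can be sketched with a toy tree. TaxWalk, its Node class, and the level codes are made up for this example and do not mirror the real TaxTree or TaxNode classes; the lineage is also abbreviated to four nodes.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical toy tree for illustration; not part of BBTools. */
final class TaxWalk {
    // Made-up level codes for this sketch; the real numbering may differ.
    static final int SPECIES = 1, GENUS = 2, DOMAIN = 3, ROOT = 4;

    static final class Node {
        final int id, pid, level;
        Node(int id, int pid, int level) { this.id = id; this.pid = pid; this.level = level; }
    }

    private final Map<Integer, Node> nodes = new HashMap<>();

    void add(int id, int pid, int level) { nodes.put(id, new Node(id, pid, level)); }

    /** True if the node itself or any of its ancestors is in taxSet. */
    boolean matchesSetOrAncestor(int id, Set<Integer> taxSet) {
        Node tn = nodes.get(id);
        while (tn != null) {
            if (taxSet.contains(tn.id)) { return true; }
            if (tn.id == tn.pid) { return false; } // the root is its own parent
            tn = nodes.get(tn.pid);
        }
        return false;
    }

    /** True if the lineage covers every level whose bit is set in reqLevels. */
    boolean hasRequiredLevels(int id, int reqLevels) {
        int seen = 0;
        Node tn = nodes.get(id);
        while (tn != null) {
            seen |= (1 << tn.level);
            if (tn.id == tn.pid) { break; }
            tn = nodes.get(tn.pid);
        }
        return (seen & reqLevels) == reqLevels;
    }

    public static void main(String[] args) {
        TaxWalk tree = new TaxWalk();
        tree.add(1, 1, ROOT);        // root
        tree.add(2, 1, DOMAIN);      // Bacteria
        tree.add(561, 2, GENUS);     // Escherichia (intermediate ranks omitted)
        tree.add(562, 561, SPECIES); // Escherichia coli
        Set<Integer> taxSet = new HashSet<>(Arrays.asList(2));
        int req = (1 << SPECIES) | (1 << GENUS);
        System.out.println(tree.matchesSetOrAncestor(562, taxSet));
        System.out.println(tree.hasRequiredLevels(562, req));
        System.out.println(tree.hasRequiredLevels(2, req));
    }
}
```

Packing the required levels into one int means the final check is a single mask comparison, regardless of how many levels were requested.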
TaxTree Integration
The tool integrates with BBTools' TaxTree data structure loaded via TaxTree.loadTaxTree() which provides taxonomic tree operations:
- Node lookup: TaxTree.getNode(taxID) provides direct taxID-to-TaxNode mapping
- Level promotion: TaxTree.getIdAtLevelExtended() converts taxIDs to specified taxonomic levels using tree traversal
- Name resolution: TaxTree.parseNameToTaxid() converts taxonomic names to numeric IDs via nameMap HashMap
- Extended level support: TaxTree.parseLevelExtended() handles both standard NCBI levels and extended hierarchy levels
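A minimal sketch of name resolution and level promotion follows. MiniTaxTree, resolve(), and promote() are hypothetical stand-ins for the roles of the TaxTree methods listed above, with made-up level codes. The sketch assumes promotion climbs until it reaches a node at or above the target level and leaves a node unchanged if it already sits above that level.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

/** Hypothetical toy tree for illustration; not part of BBTools. */
final class MiniTaxTree {
    // Made-up level codes for this sketch; the real numbering may differ.
    static final int SPECIES = 1, GENUS = 2, DOMAIN = 3, ROOT = 4;

    private static final class Node {
        final int id, pid, level;
        Node(int id, int pid, int level) { this.id = id; this.pid = pid; this.level = level; }
    }

    private final Map<Integer, Node> byId = new HashMap<>();
    private final Map<String, Integer> nameMap = new HashMap<>();

    void add(int id, int pid, int level, String name) {
        byId.put(id, new Node(id, pid, level));
        nameMap.put(name.toLowerCase(Locale.ROOT), id);
    }

    /** Accepts a numeric ID or a taxon name; returns -1 if a name is unknown. */
    int resolve(String idOrName) {
        String s = idOrName.trim();
        if (!s.isEmpty() && s.chars().allMatch(Character::isDigit)) {
            return Integer.parseInt(s);
        }
        Integer id = nameMap.get(s.toLowerCase(Locale.ROOT));
        return (id == null) ? -1 : id;
    }

    /** Climbs toward the root until a node at or above targetLevel is reached. */
    int promote(int id, int targetLevel) {
        Node tn = byId.get(id);
        if (tn == null) { return -1; }
        while (tn.level < targetLevel && tn.id != tn.pid) {
            Node parent = byId.get(tn.pid);
            if (parent == null) { break; }
            tn = parent;
        }
        return tn.id;
    }

    public static void main(String[] args) {
        MiniTaxTree tree = new MiniTaxTree();
        tree.add(1, 1, ROOT, "root");
        tree.add(2, 1, DOMAIN, "Bacteria");
        tree.add(561, 2, GENUS, "Escherichia");
        tree.add(562, 561, SPECIES, "Escherichia coli");
        System.out.println(tree.resolve("bacteria"));
        System.out.println(tree.promote(562, GENUS));
    }
}
```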
Memory Management
The tool implements memory optimization through several concrete strategies:
- Concurrent streaming: ConcurrentReadInputStream, with buffers capped through Shared.capBuffers(), limits the memory footprint of input buffering
- Conditional loading: The GI table is loaded via GiToTaxid.initialize() only when the giTableFile parameter is provided, saving ~2-4GB when it is not needed
- LinkedHashSet tracking: Uses LinkedHashSet<TaxNode> nodes for results file generation, maintaining insertion order while preventing duplicates
- ByteFile mode selection: ByteFile.FORCE_MODE_BF2 is enabled when the thread count is greater than 2, selecting the multithreaded ByteFile2 reader for input files
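The LinkedHashSet tracking strategy can be illustrated with a small sketch. RetainedTaxa is a hypothetical class that stores taxon names as strings instead of TaxNode objects; it only shows why an insertion-ordered set suits a results file.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical helper for illustration; not part of BBTools. */
final class RetainedTaxa {
    // LinkedHashSet drops duplicates but remembers first-seen order.
    private final Set<String> seen = new LinkedHashSet<>();

    /** Records a taxon; returns false if it had already been recorded. */
    boolean record(String taxon) {
        return seen.add(taxon);
    }

    /** Returns the distinct taxa in the order they were first recorded. */
    List<String> report() {
        return new ArrayList<>(seen);
    }

    public static void main(String[] args) {
        RetainedTaxa taxa = new RetainedTaxa();
        taxa.record("Escherichia coli");
        taxa.record("Bacillus subtilis");
        taxa.record("Escherichia coli"); // duplicate, ignored
        for (String line : taxa.report()) {
            System.out.println(line);
        }
    }
}
```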
Best Effort Mode Implementation
When besteffort=true, TaxFilter.reviseByBestEffort() implements adaptive taxonomic scope widening through iterative level promotion:
- Initial level: Uses user-specified taxLevelE from TaxTree.parseLevelExtended()
- Intersection detection: Compares desired HashSet with present taxa from input file via TextFile line-by-line parsing
- Level escalation: Increments currentLevelE while currentLevelE<TaxTree.LIFE_E until intersection found
- Completion reporting: System.err.println() reports final taxonomic level used when different from initial
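The escalation loop above can be sketched as follows. BestEffort, the level codes, and the lineage lookup table (taxID to an array of ancestor IDs indexed by level) are all made up for this example; the real implementation works on the taxonomy tree and on headers read from the input file.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical sketch of best-effort level escalation; not part of BBTools. */
final class BestEffort {
    // Made-up level codes; LIFE plays the role of the top of the tree.
    static final int SPECIES = 0, GENUS = 1, FAMILY = 2, LIFE = 3;

    /** Maps each ID to its ancestor at the given level; IDs without a lineage are skipped. */
    static Set<Integer> promoteAll(Set<Integer> ids, Map<Integer, int[]> lineage, int level) {
        Set<Integer> out = new HashSet<>();
        for (int id : ids) {
            int[] chain = lineage.get(id);
            if (chain != null) { out.add(chain[level]); }
        }
        return out;
    }

    /** Raises the level until desired and present taxa intersect, stopping at LIFE. */
    static int reviseLevel(Set<Integer> desired, Set<Integer> present,
                           Map<Integer, int[]> lineage, int startLevel) {
        int level = startLevel;
        while (level < LIFE) {
            Set<Integer> hits = promoteAll(desired, lineage, level);
            hits.retainAll(promoteAll(present, lineage, level));
            if (!hits.isEmpty()) { break; } // found a shared ancestor at this level
            level++;
        }
        return level;
    }

    public static void main(String[] args) {
        Map<Integer, int[]> lineage = new HashMap<>();
        lineage.put(101, new int[] {101, 10, 5, 1}); // toy species in genus 10
        lineage.put(102, new int[] {102, 10, 5, 1}); // sibling species, same genus
        Set<Integer> desired = new HashSet<>();
        desired.add(101);
        Set<Integer> present = new HashSet<>();
        present.add(102);
        int level = reviseLevel(desired, present, lineage, SPECIES);
        if (level != SPECIES) {
            System.err.println("Widened filter to level " + level);
        }
    }
}
```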
Performance Characteristics
The filtering performance depends on specific implementation factors:
- TaxTree size: Full NCBI tree via TaxTree.loadTaxTree() contains ~2 million nodes with HashMap-based lookups
- GI table overhead: GiToTaxid.initialize() loads GI-to-taxID mapping requiring ~2-4GB memory via int array structures
- AccessionToTaxid overhead: AccessionToTaxid.load() requires ~45GB memory for full NCBI accession-to-taxID mapping
- Regex compilation cost: Pattern.compile() runs once during TaxFilter construction; each subsequent Pattern.matcher() call typically scales with the length of the sequence name, though backtracking-heavy patterns can cost more
- File format impact: FastaReadInputStream processing is more efficient than FASTQ due to simpler header parsing in parseNodeFromHeader()
Error Handling Implementation
The tool provides structured error handling through specific validation mechanisms:
- Node resolution failures: REQUIRE_PRESENT flag controls KillSwitch.kill() behavior when TaxTree.parseNodeFromHeader() returns null
- File validation: Tools.testInputFiles() and Tools.testOutputFiles() perform pre-flight accessibility checks before processing
- Out-of-memory handling: The -eoom flag makes the process exit if an out-of-memory exception occurs; requires Java 8u92+
- Duplicate file detection: Tools.testForDuplicateFiles() prevents input/output file conflicts during initialization
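As an illustration of the duplicate file check, the sketch below flags any file name that appears more than once among the supplied arguments. FileChecks.hasDuplicates() is a hypothetical helper, not the Tools.testForDuplicateFiles() implementation, and it compares names as plain strings without resolving paths.

```java
import java.util.HashSet;
import java.util.Set;

/** Hypothetical helper for illustration; not part of BBTools. */
final class FileChecks {

    /** True if any non-null file name appears more than once. */
    static boolean hasDuplicates(String... names) {
        Set<String> seen = new HashSet<>();
        for (String name : names) {
            if (name == null) { continue; } // unset optional files are ignored
            if (!seen.add(name)) { return true; }
        }
        return false;
    }

    public static void main(String[] args) {
        String in = "sequences.fasta";
        String out = "sequences.fasta"; // same as input, so the check should trip
        String results = null;          // optional file not set
        if (hasDuplicates(in, out, results)) {
            System.err.println("Error: a file was specified more than once.");
        }
    }
}
```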
Support
For questions and support:
- Email: bbushnell@lbl.gov
- Documentation: bbmap.org