FilterByTaxa

Basic Usage

filterbytaxa.sh in=<input file> out=<output file> tree=<tree file> table=<table file> ids=<numbers> level=<name or number>

Parameters

Parameters are grouped by their function in the taxonomic filtering process. The tool filters sequences by taxonomic classification using NCBI taxonomy data, and supports hierarchical filtering at a chosen taxonomic level.

I/O parameters

in=<file>: Primary input, or read 1 input.
out=<file>: Primary output, or read 1 output.
results=<file>: Optional; prints a list indicating which taxa were retained.
overwrite=f: (ow) Set to false to force the program to abort rather than overwrite an existing file.
showspeed=t: (ss) Set to 'f' to suppress display of processing speed.
ziplevel=2: (zl) Set to 1 (lowest) through 9 (max) to change compression level; lower compression is faster.

Processing parameters

level=: Taxonomic level, such as phylum. Filtering will operate on sequences within the same taxonomic level as specified ids. If not set, only matches to a node or its descendants will be considered.
reqlevel=: Require nodes to have ancestors at these levels. For example, reqlevel=species,genus would ban nodes that are not defined at both the species and genus levels.
ids=: Comma-delimited list of NCBI numeric IDs. Can also be a file with one taxID per line. Names (like bacteria) are also acceptable.
include=f: 'f' will discard sequences that match the filter; 't' will keep only the sequences that match.
besteffort=f: Intended for include mode. Iteratively increases level while the input file has no hits to the tax list.
tree=<file>: Specify a TaxTree file like tree.taxtree.gz. On Genepool, use 'auto'.
gi=<file>: Specify a gitable file like gitable.int1d.gz. Only needed if gi numbers will be used. On Genepool, use 'auto'.
accession=: Specify one or more comma-delimited NCBI accession-to-taxid files. Only needed if accessions will be used; requires ~45GB of memory. On Genepool, use 'auto'.
printnodes=t: Print the names of nodes added to the filter.
requirepresent=t: Crash with an error message if a header cannot be resolved to a taxid.
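
The three accepted forms of ids= (a comma-delimited list, a file with one taxID per line, or names) can be illustrated with a small sketch. This is not the tool's own parsing code: the class IdListSketch, the method parseIds, and the two-entry name table are hypothetical stand-ins for the lookup the tool performs against its taxonomy tree.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Illustrative only: IdListSketch is hypothetical, not BBTools code.
final class IdListSketch {
    // Toy name table (assumption); the real tool resolves names through its tree.
    static final Map<String, Integer> NAMES = new HashMap<>();
    static {
        NAMES.put("bacteria", 2);
        NAMES.put("archaea", 2157);
    }

    /** Expands an ids= value into taxIDs: a file of IDs, or a comma-delimited list. */
    static Set<Integer> parseIds(String arg) {
        List<String> tokens = new ArrayList<>();
        Path p = Paths.get(arg);
        if (Files.isRegularFile(p)) {
            try {
                tokens.addAll(Files.readAllLines(p)); // one taxID per line
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        } else {
            for (String t : arg.split(",")) { tokens.add(t); }
        }
        Set<Integer> ids = new LinkedHashSet<>();
        for (String raw : tokens) {
            String t = raw.trim();
            if (t.isEmpty()) { continue; }
            if (t.chars().allMatch(Character::isDigit)) {
                ids.add(Integer.parseInt(t));          // numeric taxID
            } else {
                Integer id = NAMES.get(t.toLowerCase()); // name lookup
                if (id == null) {
                    throw new IllegalArgumentException("Unknown taxon name: " + t);
                }
                ids.add(id);
            }
        }
        return ids;
    }

    public static void main(String[] args) {
        System.out.println(parseIds("2,archaea,2759"));
    }
}
```

Numeric tokens and names can be mixed in one list, which is why ids=bacteria,archaea and ids=2,2157 describe the same filter.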

String-matching parameters

regex=: Filter names matching this Java regular expression.
contains=: Filter names containing this substring (case-insensitive).

Java Parameters

-Xmx: This will set Java's memory usage, overriding autodetection. -Xmx20g will specify 20 gigs of RAM, and -Xmx200m will specify 200 megs. The max is typically 85% of physical memory.
-eoom: This flag will cause the process to exit if an out-of-memory exception occurs. Requires Java 8u92+.
-da: Disable assertions.

Examples

Filter by Taxonomic ID

filterbytaxa.sh in=sequences.fasta out=bacteria.fasta tree=auto gi=auto ids=2 include=t

Keeps only sequences from bacteria (taxID 2), using the default locations ('auto') for the taxonomy tree and GI table.

Filter by Taxonomic Level

filterbytaxa.sh in=sequences.fasta out=eukaryotes.fasta tree=auto ids=2759 level=kingdom include=t

Keeps only sequences from Eukaryota (taxID 2759), with filtering applied at the kingdom taxonomic level.

Filter Using Multiple Criteria

filterbytaxa.sh in=sequences.fasta out=filtered.fasta tree=auto ids=bacteria,archaea reqlevel=species,genus

Filters on bacteria and archaea, requiring that nodes are defined at both the species and genus levels. Because include defaults to 'f', the matching sequences are discarded; add include=t to keep them instead.

String-based Filtering

filterbytaxa.sh in=sequences.fasta out=ecoli.fasta regex=".*Escherichia.*coli.*" include=t

Uses regular expression matching to keep sequences whose names contain "Escherichia" followed by "coli".

Generate Taxa Report

filterbytaxa.sh in=sequences.fasta out=filtered.fasta tree=auto ids=2 results=taxa_report.txt

Filters on bacteria (taxID 2) and writes a report file, taxa_report.txt, listing the taxa that were retained. With the default include=f, the bacterial sequences are the ones discarded.

Algorithm Details

Taxonomic Filtering Architecture

FilterByTaxa implements taxonomic filtering through the TaxFilter.makeFilter() factory method, which constructs filtering criteria using NCBI's hierarchical taxonomy structure. The tool operates through three core filtering mechanisms: HashSet<Integer> taxID collections for exact numeric matching, regex matching with the Java regex engine via Pattern.compile(), and case-insensitive substring filtering via String.toLowerCase().contains().
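
A minimal sketch of the three mechanisms side by side may help. NameAndIdMatcher is a hypothetical class written for illustration and is not TaxFilter itself; in particular, the use of whole-string matches() for the regex is an assumption suggested by the ".*Escherichia.*coli.*" example, not something confirmed by the source.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.regex.Pattern;

// Illustrative only: NameAndIdMatcher is hypothetical, not the BBTools TaxFilter.
final class NameAndIdMatcher {
    private final Set<Integer> taxSet = new HashSet<>();
    private final Pattern regex;        // null when no regex was given
    private final String containsLower; // null when no substring was given

    NameAndIdMatcher(int[] ids, String regex, String contains) {
        for (int id : ids) { taxSet.add(id); }
        // Compile once up front so each header only pays for matching.
        this.regex = (regex == null) ? null : Pattern.compile(regex);
        this.containsLower = (contains == null) ? null : contains.toLowerCase();
    }

    /** Exact numeric matching against the taxID set. */
    boolean matchesId(int taxId) {
        return taxSet.contains(taxId);
    }

    /** Regex matching; whole-string semantics are an assumption of this sketch. */
    boolean matchesRegex(String name) {
        return regex != null && regex.matcher(name).matches();
    }

    /** Case-insensitive substring matching. */
    boolean matchesSubstring(String name) {
        return containsLower != null && name.toLowerCase().contains(containsLower);
    }

    public static void main(String[] args) {
        NameAndIdMatcher m =
            new NameAndIdMatcher(new int[] {2, 2157}, ".*Escherichia.*coli.*", "COLI");
        System.out.println(m.matchesId(2));
        System.out.println(m.matchesRegex("gi|123 Escherichia coli K-12"));
        System.out.println(m.matchesSubstring("escherichia Coli"));
    }
}
```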

Core Processing Strategy

The tool reads sequences through ConcurrentReadInputStream.getReadInputStream(), with thread buffers capped by Shared.capBuffers(4). Each sequence is evaluated in the processReadPair() method, which calls TaxFilter.passesFilter() on the sequence header ID:

Taxonomic ID resolution: TaxTree.parseNodeFromHeader() extracts taxIDs from sequence headers using GI number, NCBI taxID, or species name parsing
HashSet-based matching: Uses HashSet<Integer> taxSet.contains() for O(1) average-case taxID lookups
Hierarchical traversal: TaxNode parent-child navigation via tn.pid references until tn.id==tn.pid root condition
Bitflag level validation: reqLevels uses bit masking with (1<<tn.level) operations for taxonomic level requirements
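
The traversal and bit-masking steps listed above can be sketched as follows. This is a simplified model, not the BBTools code: AncestorWalkSketch, its Node class, the toy tree, and the numeric level codes are all hypothetical, and only the id, pid, and level field names and the id==pid root condition are taken from the description.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Illustrative only: a toy tree, not the BBTools TaxTree or TaxNode.
final class AncestorWalkSketch {
    // Hypothetical level codes; the real numbering may differ.
    static final int SPECIES = 2, GENUS = 3, PHYLUM = 7, LIFE = 10;

    static final class Node {
        final int id, pid, level;
        Node(int id, int pid, int level) { this.id = id; this.pid = pid; this.level = level; }
    }

    final Map<Integer, Node> nodes = new HashMap<>();
    final Set<Integer> taxSet = new HashSet<>();

    void add(int id, int pid, int level) { nodes.put(id, new Node(id, pid, level)); }

    /** True if the node itself or any ancestor is in taxSet. */
    boolean inSetOrDescendant(int id) {
        Node tn = nodes.get(id);
        while (tn != null) {
            if (taxSet.contains(tn.id)) { return true; }
            if (tn.id == tn.pid) { break; } // the root is its own parent
            tn = nodes.get(tn.pid);
        }
        return false;
    }

    /** True if the node's ancestry covers every level whose bit is set in reqLevels. */
    boolean hasRequiredLevels(int id, int reqLevels) {
        int seen = 0;
        Node tn = nodes.get(id);
        while (tn != null) {
            seen |= (1 << tn.level);
            if (tn.id == tn.pid) { break; }
            tn = nodes.get(tn.pid);
        }
        return (seen & reqLevels) == reqLevels;
    }

    /** Toy tree: 1 root; 10 phylum; 20 genus under 10; 30 species under 20;
     *  40 species directly under 10 (no genus); 50 phylum outside the filter. */
    static AncestorWalkSketch demo() {
        AncestorWalkSketch t = new AncestorWalkSketch();
        t.add(1, 1, LIFE);
        t.add(10, 1, PHYLUM);
        t.add(20, 10, GENUS);
        t.add(30, 20, SPECIES);
        t.add(40, 10, SPECIES);
        t.add(50, 1, PHYLUM);
        t.taxSet.add(10);
        return t;
    }

    public static void main(String[] args) {
        AncestorWalkSketch t = demo();
        int req = (1 << SPECIES) | (1 << GENUS);
        System.out.println(t.inSetOrDescendant(30));
        System.out.println(t.hasRequiredLevels(30, req));
        System.out.println(t.hasRequiredLevels(40, req));
    }
}
```

In this model, node 40 descends from the filtered phylum but has no genus-level ancestor, so a requirement such as reqlevel=species,genus would reject it.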

TaxTree Integration

The tool integrates with BBTools' TaxTree data structure, loaded via TaxTree.loadTaxTree(), which provides the following taxonomic tree operations:

Node lookup: TaxTree.getNode(taxID) provides direct taxID-to-TaxNode mapping
Level promotion: TaxTree.getIdAtLevelExtended() converts taxIDs to specified taxonomic levels using tree traversal
Name resolution: TaxTree.parseNameToTaxid() converts taxonomic names to numeric IDs via nameMap HashMap
Extended level support: TaxTree.parseLevelExtended() handles both standard NCBI levels and extended hierarchy levels
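
Level promotion is the operation that makes level= work, so a small sketch of the idea may be useful. It is not the implementation of TaxTree.getIdAtLevelExtended(): the class LevelPromotionSketch, the array-based tree, and the level codes are hypothetical, and the sketch assumes levels never decrease when walking toward the root, which real NCBI data with unranked nodes does not guarantee.

```java
// Illustrative only: a four-node toy tree, not the BBTools TaxTree.
final class LevelPromotionSketch {
    // Hypothetical level codes; the real numbering may differ.
    static final int SPECIES = 2, GENUS = 3, PHYLUM = 7, LIFE = 10;

    // PARENT[i] and LEVEL[i] describe node i; node 0 is the root and its own parent.
    static final int[] PARENT = {0, 0, 1, 2};
    static final int[] LEVEL  = {LIFE, PHYLUM, GENUS, SPECIES};

    /** Returns the highest ancestor of id whose level does not exceed targetLevel. */
    static int promote(int id, int targetLevel) {
        while (PARENT[id] != id && LEVEL[PARENT[id]] <= targetLevel) {
            id = PARENT[id];
        }
        return id;
    }

    public static void main(String[] args) {
        System.out.println(promote(3, GENUS));  // species node promoted to its genus
        System.out.println(promote(3, PHYLUM)); // species node promoted to its phylum
        System.out.println(promote(1, GENUS));  // already above genus, so unchanged
    }
}
```

Two sequences then count as matching at a given level when their taxIDs promote to the same node.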

Memory Management

The tool implements memory optimization through several concrete strategies:

Concurrent streaming: ConcurrentReadInputStream, with buffer counts capped through Shared.capBuffers(), limits the memory footprint per thread
Conditional loading: The GI table is loaded via GiToTaxid.initialize() only when the giTableFile parameter is provided, which saves ~2-4GB when it is not needed
LinkedHashSet tracking: Uses LinkedHashSet<TaxNode> nodes for results file generation, maintaining insertion order while preventing duplicates
Threaded file reading: ByteFile.FORCE_MODE_BF2 is enabled when thread count > 2, so that large files are read through the ByteFile2 reader on a separate thread
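
The reason for choosing a LinkedHashSet for the results list is easy to see in isolation. The sketch below uses plain strings instead of TaxNode objects, and the class RetainedTaxaSketch and method retainedInOrder are hypothetical names used only for illustration.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Illustrative only: strings stand in for TaxNode objects.
final class RetainedTaxaSketch {
    /** Collapses repeated taxa while keeping the order of first appearance. */
    static List<String> retainedInOrder(String[] observed) {
        Set<String> nodes = new LinkedHashSet<>();
        for (String name : observed) {
            nodes.add(name); // add() is a no-op for a name already present
        }
        return new ArrayList<>(nodes);
    }

    public static void main(String[] args) {
        String[] observed = {
            "Escherichia coli", "Bacillus subtilis", "Escherichia coli"
        };
        for (String name : retainedInOrder(observed)) {
            System.out.println(name);
        }
    }
}
```

A plain HashSet would also remove the duplicates, but the report order would then depend on hash codes rather than on the order in which taxa were first seen.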

Best Effort Mode Implementation

When besteffort=true, TaxFilter.reviseByBestEffort() implements adaptive taxonomic scope widening through iterative level promotion:

Initial level: Uses user-specified taxLevelE from TaxTree.parseLevelExtended()
Intersection detection: Compares desired HashSet with present taxa from input file via TextFile line-by-line parsing
Level escalation: Increments currentLevelE while currentLevelE<TaxTree.LIFE_E until intersection found
Completion reporting: System.err.println() reports final taxonomic level used when different from initial
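
The escalation loop described above can be sketched on a toy tree. This is a model of the idea rather than the code of TaxFilter.reviseByBestEffort(): BestEffortSketch, its array-based tree, the level codes, and the choice to return -1 when no level produces a hit are all assumptions of the sketch.

```java
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

// Illustrative only: a toy tree, not the BBTools TaxTree.
final class BestEffortSketch {
    // Hypothetical level codes; the real numbering may differ.
    static final int SPECIES = 2, GENUS = 3, FAMILY = 4, LIFE = 10;

    // Nodes: 0 root; 1 family; 2 and 3 genera in family 1; 4 species in genus 2;
    // 5 species in genus 3; 6 an unrelated family; 7 a species in family 6.
    static final int[] PARENT = {0, 0, 1, 1, 2, 3, 0, 6};
    static final int[] LEVEL  = {LIFE, FAMILY, GENUS, GENUS, SPECIES, SPECIES, FAMILY, SPECIES};

    /** Returns the highest ancestor of id whose level does not exceed targetLevel. */
    static int promote(int id, int targetLevel) {
        while (PARENT[id] != id && LEVEL[PARENT[id]] <= targetLevel) {
            id = PARENT[id];
        }
        return id;
    }

    static Set<Integer> promoteAll(Set<Integer> ids, int targetLevel) {
        Set<Integer> out = new HashSet<>();
        for (int id : ids) { out.add(promote(id, targetLevel)); }
        return out;
    }

    /** Lowest level, starting at startLevel, where desired and present taxa
     *  share a node; -1 if no level below LIFE produces a hit. */
    static int firstMatchingLevel(Set<Integer> desired, Set<Integer> present, int startLevel) {
        for (int cur = startLevel; cur < LIFE; cur++) {
            Set<Integer> d = promoteAll(desired, cur);
            Set<Integer> p = promoteAll(present, cur);
            if (!Collections.disjoint(d, p)) { return cur; }
        }
        return -1;
    }

    public static void main(String[] args) {
        Set<Integer> desired = Collections.singleton(4);
        Set<Integer> present = Collections.singleton(5);
        int level = firstMatchingLevel(desired, present, SPECIES);
        System.err.println("Level used: " + level);
    }
}
```

Here the desired species is absent from the input, but a sibling genus in the same family is present, so the search settles on the family level.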

Performance Characteristics

The filtering performance depends on specific implementation factors:

TaxTree size: Full NCBI tree via TaxTree.loadTaxTree() contains ~2 million nodes with HashMap-based lookups
GI table overhead: GiToTaxid.initialize() loads GI-to-taxID mapping requiring ~2-4GB memory via int array structures
AccessionToTaxid overhead: AccessionToTaxid.load() requires ~45GB memory for full NCBI accession-to-taxID mapping
Regex compilation cost: Pattern.compile() runs once during TaxFilter construction; subsequent Pattern.matcher() calls are typically O(n) in the length of each sequence name
File format impact: FastaReadInputStream processing is more efficient than FASTQ due to simpler header parsing in parseNodeFromHeader()

Error Handling Implementation

The tool provides structured error handling through specific validation mechanisms:

Node resolution failures: REQUIRE_PRESENT flag controls KillSwitch.kill() behavior when TaxTree.parseNodeFromHeader() returns null
File validation: Tools.testInputFiles() and Tools.testOutputFiles() perform pre-flight accessibility checks before processing
Memory overflow protection: The -eoom flag causes the process to exit when an out-of-memory exception occurs; requires Java 8u92+
Duplicate file detection: Tools.testForDuplicateFiles() prevents input/output file conflicts during initialization
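
How requirepresent interacts with include can be illustrated with a short sketch. It is not the tool's code: RequirePresentSketch and its header-to-taxID map are hypothetical, an exception stands in for KillSwitch.kill(), and treating an unresolved header as a non-match when requirepresent=f is an assumption rather than documented behavior.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

// Illustrative only: a map lookup stands in for TaxTree.parseNodeFromHeader().
final class RequirePresentSketch {
    /** Returns true if the sequence with this header should be written to out=. */
    static boolean keep(Map<String, Integer> headerToTaxId, Set<Integer> taxSet,
                        String header, boolean include, boolean requirePresent) {
        Integer taxId = headerToTaxId.get(header);
        if (taxId == null) {
            if (requirePresent) {
                // The real tool terminates the process; an exception is used here.
                throw new IllegalStateException("Could not resolve a taxID for header: " + header);
            }
            return !include; // assumption: unresolved headers count as non-matching
        }
        boolean matches = taxSet.contains(taxId);
        return matches == include; // include=t keeps matches, include=f keeps the rest
    }

    public static void main(String[] args) {
        Map<String, Integer> headers = new HashMap<>();
        headers.put("seqA", 2);
        headers.put("seqB", 2759);
        Set<Integer> taxSet = Collections.singleton(2);
        System.out.println(keep(headers, taxSet, "seqA", true, true));
        System.out.println(keep(headers, taxSet, "seqB", true, true));
        System.out.println(keep(headers, taxSet, "seqC", false, false));
    }
}
```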

Support

For questions and support:

Email: bbushnell@lbl.gov
Documentation: bbmap.org