FilterByTaxa

Script: filterbytaxa.sh Package: tax Class: FilterByTaxa.java

Filters sequences according to their taxonomy, as determined by the sequence name. Sequences should be labeled with a gi number, NCBI taxID, or species name.

Basic Usage

filterbytaxa.sh in=<input file> out=<output file> tree=<tree file> table=<table file> ids=<numbers> level=<name or number>

Parameters

Parameters are organized by their function in the taxonomic filtering process. The tool uses NCBI taxonomy data structures to filter sequences based on taxonomic classification, with support for hierarchical filtering at different taxonomic levels.

I/O parameters

in=<file>
Primary input, or read 1 input.
out=<file>
Primary output, or read 1 output.
results=<file>
Optional; prints a list indicating which taxa were retained.
overwrite=f
(ow) Set to false to force the program to abort rather than overwrite an existing file.
showspeed=t
(ss) Set to 'f' to suppress display of processing speed.
ziplevel=2
(zl) Set to 1 (lowest) through 9 (max) to change compression level; lower compression is faster.

Processing parameters

level=
Taxonomic level, such as phylum. Filtering will operate on sequences within the same taxonomic level as specified ids. If not set, only matches to a node or its descendants will be considered.
reqlevel=
Require nodes to have ancestors at these levels. For example, reqlevel=species,genus would ban nodes that are not defined at both the species and genus levels.
ids=
Comma-delimited list of NCBI numeric IDs. Can also be a file with one taxID per line. Names (like bacteria) are also acceptable.
include=f
'f' will discard filtered sequences, 't' will keep them.
besteffort=f
Intended for include mode. Iteratively increases level while the input file has no hits to the tax list.
tree=<file>
Specify a TaxTree file like tree.taxtree.gz. On Genepool, use 'auto'.
gi=<file>
Specify a gitable file like gitable.int1d.gz. Only needed if gi numbers will be used. On Genepool, use 'auto'.
accession=
Specify one or more comma-delimited NCBI accession-to-taxid files. Only needed if accessions will be used; requires ~45GB of memory. On Genepool, use 'auto'.
printnodes=t
Print the names of nodes added to the filter.
requirepresent=t
Crash with an error message if a header cannot be resolved to a taxid.
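
The ids= value mixes numeric taxIDs and names in one comma-delimited list. The sketch below shows one way such a list could be resolved; it is not the tool's own parser. The class IdListSketch, the method parseIds, and the three-entry name table are hypothetical, and the real tool resolves names through the loaded TaxTree instead. Reading ids from a file is omitted.

```java
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

/** Illustrative only: resolves an ids= value into numeric taxIDs. */
final class IdListSketch {

    // Hypothetical name table standing in for a lookup in the taxonomy tree.
    static final Map<String, Integer> NAME_TO_ID =
            Map.of("bacteria", 2, "archaea", 2157, "eukaryota", 2759);

    /** Splits on commas; numeric tokens are used as-is, other tokens are looked up by name. */
    static Set<Integer> parseIds(String arg) {
        Set<Integer> ids = new LinkedHashSet<>();
        for (String token : arg.split(",")) {
            String t = token.trim();
            if (t.isEmpty()) { continue; }
            if (t.chars().allMatch(Character::isDigit)) {
                ids.add(Integer.parseInt(t));
            } else {
                Integer id = NAME_TO_ID.get(t.toLowerCase());
                if (id == null) {
                    throw new IllegalArgumentException("Unknown taxon name: " + t);
                }
                ids.add(id);
            }
        }
        return ids;
    }

    public static void main(String[] args) {
        System.out.println(parseIds("2,archaea")); // prints [2, 2157]
    }
}
```

Failing on an unknown name, rather than skipping it, is a design choice of this sketch; the source does not say how the tool treats unresolvable names in ids=.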

String-matching parameters

regex=
Filter names matching this Java regular expression.
contains=
Filter names containing this substring (case-insensitive).
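
The two string-matching options behave differently on the same header. The sketch below contrasts a whole-name regex match with a case-insensitive substring test. It assumes whole-name matching for regex=, which the source does not state but which would explain the leading and trailing ".*" in the documented example. NameMatchSketch and its method names are hypothetical.

```java
import java.util.regex.Pattern;

/** Illustrative only: whole-name regex matching versus substring matching. */
final class NameMatchSketch {

    /** True only if the pattern covers the entire header (assumed behavior for regex=). */
    static boolean matchesWhole(String regex, String header) {
        return Pattern.compile(regex).matcher(header).matches();
    }

    /** Case-insensitive substring test, as described for contains=. */
    static boolean containsIgnoreCase(String needle, String header) {
        return header.toLowerCase().contains(needle.toLowerCase());
    }

    public static void main(String[] args) {
        String header = "Escherichia coli str. K-12 substr. MG1655";
        System.out.println(matchesWhole(".*Escherichia.*coli.*", header)); // true
        System.out.println(matchesWhole("Escherichia.*coli", header));     // false: trailing text is not covered
        System.out.println(containsIgnoreCase("ESCHERICHIA", header));     // true
    }
}
```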

Java Parameters

-Xmx
This will set Java's memory usage, overriding autodetection. -Xmx20g will specify 20 gigs of RAM, and -Xmx200m will specify 200 megs. The max is typically 85% of physical memory.
-eoom
This flag will cause the process to exit if an out-of-memory exception occurs. Requires Java 8u92+.
-da
Disable assertions.

Examples

Filter by Taxonomic ID

filterbytaxa.sh in=sequences.fasta out=bacteria.fasta tree=auto gi=auto ids=2 include=t

Filters sequences to keep only those from bacteria (taxID 2), using default NCBI taxonomy files.

Filter by Taxonomic Level

filterbytaxa.sh in=sequences.fasta out=eukaryotes.fasta tree=auto ids=2759 level=kingdom include=t

Keeps only sequences from Eukaryota (taxID 2759), with filtering set to the kingdom level. Note that NCBI ranks Eukaryota above kingdom (as a superkingdom or domain), not as a kingdom itself.

Filter Using Multiple Criteria

filterbytaxa.sh in=sequences.fasta out=filtered.fasta tree=auto ids=bacteria,archaea reqlevel=species,genus

Because include defaults to 'f', this discards sequences assigned to bacteria or archaea. reqlevel=species,genus additionally bans nodes that are not defined at both the species and genus levels.

String-based Filtering

filterbytaxa.sh in=sequences.fasta out=ecoli.fasta regex=".*Escherichia.*coli.*" include=t

Uses regular expression matching to keep sequences whose names contain "Escherichia" followed by "coli".

Generate Taxa Report

filterbytaxa.sh in=sequences.fasta out=filtered.fasta tree=auto ids=2 results=taxa_report.txt

With the default include=f, discards bacterial sequences and writes a report file listing the taxa that were retained.

Algorithm Details

Taxonomic Filtering Architecture

FilterByTaxa builds its filter through the TaxFilter.makeFilter() factory method, which constructs filtering criteria from NCBI's hierarchical taxonomy. It relies on three filtering mechanisms: a HashSet<Integer> of taxIDs for exact numeric matching, regular-expression matching with a pattern compiled by Pattern.compile(), and case-insensitive substring matching via String.toLowerCase().contains().
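
A minimal sketch of how the three mechanisms and the include flag could fit together is shown here. It is not the TaxFilter source. FilterSketch, matches, and passes are hypothetical names, and combining the three tests with a logical OR is an assumption. The taxID is assumed to be already resolved from the header, and the set is assumed to already contain every taxID the filter should match.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.regex.Pattern;

/** Illustrative only: taxID set, regex, and substring tests combined with an include flag. */
final class FilterSketch {
    private final Set<Integer> taxIds;
    private final Pattern regex;      // null when no regex was given
    private final String substring;   // stored lower-cased; null when not given
    private final boolean include;

    FilterSketch(Set<Integer> ids, String regex, String contains, boolean include) {
        this.taxIds = new HashSet<>(ids);
        this.regex = (regex == null) ? null : Pattern.compile(regex);
        this.substring = (contains == null) ? null : contains.toLowerCase();
        this.include = include;
    }

    /** Assumption: a sequence matches if any one of the three tests succeeds. */
    boolean matches(int taxId, String name) {
        if (taxIds.contains(taxId)) { return true; }
        if (regex != null && regex.matcher(name).matches()) { return true; }
        return substring != null && name.toLowerCase().contains(substring);
    }

    /** include=t keeps matching sequences; include=f keeps everything else. */
    boolean passes(int taxId, String name) {
        return matches(taxId, name) == include;
    }

    public static void main(String[] args) {
        FilterSketch keepBacteria = new FilterSketch(Set.of(2), null, null, true);
        System.out.println(keepBacteria.passes(2, "some header"));    // true
        System.out.println(keepBacteria.passes(9606, "some header")); // false
    }
}
```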

Core Processing Strategy

The tool reads sequences through ConcurrentReadInputStream.getReadInputStream(), with thread buffers capped by Shared.capBuffers(4). Each sequence is evaluated in the processReadPair() method, which calls TaxFilter.passesFilter() on the sequence header ID.
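
The per-sequence decision can be pictured as a header predicate applied to each record. The single-threaded sketch below filters FASTA lines held in memory. It does not use ConcurrentReadInputStream or any other BBTools class, and FastaFilterSketch and filter are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

/** Illustrative only: keeps or drops whole FASTA records based on a header test. */
final class FastaFilterSketch {

    /** Returns the lines of every record whose header (without '>') passes the predicate. */
    static List<String> filter(List<String> lines, Predicate<String> keepHeader) {
        List<String> out = new ArrayList<>();
        boolean keeping = false;
        for (String line : lines) {
            if (line.startsWith(">")) {
                // A new record starts here; decide once and apply to its sequence lines.
                keeping = keepHeader.test(line.substring(1));
            }
            if (keeping) { out.add(line); }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> fasta = List.of(">Escherichia coli", "ACGT", ">Homo sapiens", "TTTT", "GGGG");
        System.out.println(filter(fasta, h -> h.contains("coli"))); // prints [>Escherichia coli, ACGT]
    }
}
```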

TaxTree Integration

The tool uses BBTools' TaxTree data structure, loaded via TaxTree.loadTaxTree(), for taxonomic tree operations.
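
To show the kind of tree operations that the level and reqlevel options rely on, here is a toy tree built from parent pointers. It is not the TaxTree API. ToyTaxTree and its methods are hypothetical, and the lineage is simplified, with ranks between genus and phylum omitted.

```java
import java.util.HashMap;
import java.util.Map;

/** Illustrative only: a parent-pointer taxonomy with level lookups. */
final class ToyTaxTree {
    private final Map<Integer, Integer> parent = new HashMap<>();
    private final Map<Integer, String> level = new HashMap<>();

    void add(int id, int parentId, String lvl) {
        parent.put(id, parentId);
        level.put(id, lvl);
    }

    /** Returns the node itself or its nearest ancestor at the given level, or -1 if none exists. */
    int ancestorAtLevel(int id, String lvl) {
        int cur = id;
        while (parent.containsKey(cur)) {
            if (lvl.equals(level.get(cur))) { return cur; }
            int p = parent.get(cur);
            if (p == cur) { break; } // the root is its own parent
            cur = p;
        }
        return -1;
    }

    /** True if id equals ancestor or lies below it in the tree. */
    boolean isDescendantOrSelf(int id, int ancestor) {
        int cur = id;
        while (parent.containsKey(cur)) {
            if (cur == ancestor) { return true; }
            int p = parent.get(cur);
            if (p == cur) { break; }
            cur = p;
        }
        return false;
    }

    /** Simplified lineage: E. coli (562) -> Escherichia (561) -> phylum 1224 -> Bacteria (2) -> root (1). */
    static ToyTaxTree example() {
        ToyTaxTree t = new ToyTaxTree();
        t.add(1, 1, "no rank");
        t.add(2, 1, "superkingdom");
        t.add(1224, 2, "phylum");
        t.add(561, 1224, "genus");
        t.add(562, 561, "species");
        return t;
    }

    public static void main(String[] args) {
        ToyTaxTree t = example();
        System.out.println(t.ancestorAtLevel(562, "phylum")); // prints 1224
        System.out.println(t.isDescendantOrSelf(562, 2));     // prints true
    }
}
```

Under this model, a reqlevel=species,genus check would amount to requiring that ancestorAtLevel returns a valid node for both "species" and "genus".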

Memory Management

The tool implements memory optimization through several concrete strategies:

Best Effort Mode Implementation

When besteffort=true, TaxFilter.reviseByBestEffort() widens the taxonomic scope adaptively, promoting the filter level step by step while the input has no hits to the tax list.
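
The widening loop can be sketched as follows. This is a simplified model, not the reviseByBestEffort() source: one parent step stands in for one level increase, the taxIDs are made-up toy values, and BestEffortSketch, hits, anyHit, and widen are hypothetical names.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Illustrative only: widen a taxID filter until at least one input taxID falls under it. */
final class BestEffortSketch {

    /** True if inputId or any of its ancestors is in the filter. */
    static boolean hits(Map<Integer, Integer> parent, int inputId, Set<Integer> filter) {
        int cur = inputId;
        while (true) {
            if (filter.contains(cur)) { return true; }
            Integer p = parent.get(cur);
            if (p == null || p == cur) { return false; } // unknown node or root reached
            cur = p;
        }
    }

    static boolean anyHit(Map<Integer, Integer> parent, List<Integer> inputIds, Set<Integer> filter) {
        for (int id : inputIds) {
            if (hits(parent, id, filter)) { return true; }
        }
        return false;
    }

    /** Promotes every filter node to its parent until the input has a hit or nothing changes. */
    static Set<Integer> widen(Map<Integer, Integer> parent, Set<Integer> filter, List<Integer> inputIds) {
        Set<Integer> current = new HashSet<>(filter);
        while (!anyHit(parent, inputIds, current)) {
            Set<Integer> promoted = new HashSet<>();
            for (int id : current) {
                Integer p = parent.get(id);
                promoted.add(p == null ? id : p);
            }
            if (promoted.equals(current)) { break; } // already at the root; cannot widen further
            current = promoted;
        }
        return current;
    }

    public static void main(String[] args) {
        // Toy tree: root 1 <- phylum 10 <- genus 100 <- species 1000 and 1001.
        Map<Integer, Integer> parent = Map.of(1, 1, 10, 1, 100, 10, 1000, 100, 1001, 100);
        System.out.println(widen(parent, Set.of(1000), List.of(1001))); // prints [100]
    }
}
```

In the usage above, the filter starts at species 1000, finds no hit for the input species 1001, and stops after one promotion because genus 100 contains both.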

Performance Characteristics

The filtering performance depends on specific implementation factors:

Error Handling Implementation

The tool provides structured error handling through specific validation mechanisms:

Support

For questions and support: