FilterAssemblySummary

Script: filterassemblysummary.sh
Package: driver
Class: FilterAssemblySummary.java

Filters NCBI assembly summaries according to their taxonomy using taxonomic trees and ID-based filtering. Supports hierarchical taxonomic filtering with customizable taxonomic levels and ancestor requirements. Processes standard NCBI assembly summary files with tab-separated values, extracting taxonomic IDs from column 6 for classification.

Basic Usage

filterassemblysummary.sh in=<input file> out=<output file> tree=<tree file> table=<table file> ids=<numbers> level=<name or number>

Filters NCBI assembly summary files based on taxonomic criteria. Requires a taxonomic tree file and can optionally use GI-to-TaxID translation tables for sequence identification.

Data Sources

FilterAssemblySummary processes standard NCBI assembly summary files available from:

Parameters

Parameters are organized by their function in the filtering process, following the organization used in the shell script.

Standard parameters

in=<file>
Primary input file. Must be an NCBI assembly summary file with tab-separated values. The tool expects taxonomic ID information in column 6 (zero-indexed).
out=<file>
Primary output file. Filtered results will be written here, maintaining the original file format but containing only entries that pass the taxonomic filter.
overwrite=f
(ow) Set to false to force the program to abort rather than overwrite an existing file. Default: false

Processing parameters

level=
Taxonomic level for filtering, such as phylum, class, order, family, genus, or species. Can be specified by name or numeric code. An entry matches when it shares an ancestor at this level with one of the specified IDs. With hierarchical filtering (promote=t), taxonomic nodes are promoted toward this level before matching.
reqlevel=
Require nodes to have ancestors at these specific taxonomic levels. Comma-delimited list of level names. For example, reqlevel=species,genus would exclude nodes that are not defined at both the species and genus levels. Uses bitflag representation internally for efficient level checking.
ids=
Comma-delimited list of NCBI numeric taxonomic IDs to filter by. These IDs define the taxonomic groups to include or exclude. Can also accept file paths containing taxonomic IDs, with one ID per line. IDs are automatically promoted to the specified taxonomic level if promote=true.
names=
Alternative to numeric IDs - comma-delimited list of taxonomic names (such as 'Homo sapiens', 'Bacteria'). Scientific names with spaces require special shell escaping. Names are resolved to taxonomic IDs using the loaded taxonomic tree. Multiple nodes may match ambiguous names.
include=f
Controls filter behavior: 'f' (false) will discard entries matching the filter criteria (exclude mode), 't' (true) will keep only entries matching the criteria (include mode). Default: false (exclude mode)
tree=
Path to taxonomic tree file created by TaxTree (such as tree.taxtree.gz). Required for taxonomic name resolution and hierarchical filtering. Contains the complete NCBI taxonomic hierarchy with parent-child relationships and taxonomic levels.
table=
Path to translation table mapping GI numbers to NCBI taxonomic IDs. Only needed if the assembly summary contains GI numbers instead of direct taxonomic IDs. Generated using gitable.sh from NCBI data.
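
The reqlevel description above mentions a bitflag representation. The following is a minimal sketch of how a comma-delimited level list could be packed into an integer mask and checked with a single AND. The class name, the level ordering, and the bit positions are all hypothetical and chosen for illustration; they are not the codes TaxTree uses.

```java
import java.util.Arrays;
import java.util.List;

/** Illustrative only: level order and bit positions are hypothetical, not TaxTree's codes. */
final class ReqLevelSketch {
    // Each level gets one bit, determined by its position in this list.
    static final List<String> LEVELS =
            Arrays.asList("species", "genus", "family", "order", "class", "phylum");

    /** Converts a list such as "species,genus" into a bitmask. */
    static int parseMask(String csv) {
        int mask = 0;
        for (String name : csv.split(",")) {
            int idx = LEVELS.indexOf(name.trim().toLowerCase());
            if (idx < 0) {
                throw new IllegalArgumentException("Unknown level: " + name);
            }
            mask |= (1 << idx);
        }
        return mask;
    }

    /** True when every required level bit is also set in the lineage mask. */
    static boolean hasAllRequired(int lineageMask, int requiredMask) {
        return (lineageMask & requiredMask) == requiredMask;
    }
}
```

With this layout, a lineage defined only at genus and family fails a reqlevel=species,genus check because the species bit is missing.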

Advanced filtering parameters

promote=t
Enable hierarchical promotion of taxonomic nodes. When true, nodes are promoted to ancestors until they reach the specified taxonomic level. When false, uses exact taxonomic ID matching only. Default: true
regex=
Regular expression pattern for name-based filtering. Applied to taxonomic names in addition to ID-based filtering. Uses Java regex syntax. Must match the entire name string.
contains=
Substring matching for taxonomic names. Case-insensitive search within taxonomic names. Applied in addition to other filtering criteria.
requirepresent=t
Controls behavior when taxonomic nodes are not found in the tree. When true, missing nodes cause the program to exit with an error. When false, missing nodes generate warnings but processing continues. Default: true
printnodesadded=t
(printnodes) Print debugging information about taxonomic nodes added to the filter set during hierarchical promotion. Useful for verifying filter behavior. Default: true
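
The difference between regex= (which must match the entire name) and contains= (a case-insensitive substring test) is easy to get wrong. The sketch below illustrates both behaviors with standard java.util.regex calls. The class and method names are hypothetical and not taken from the tool's source.

```java
import java.util.Locale;
import java.util.regex.Pattern;

/** Illustrative only: shows whole-string regex matching versus substring matching. */
final class NameMatchSketch {
    /** Matcher.matches() succeeds only if the pattern covers the entire name. */
    static boolean matchesRegex(String name, String regex) {
        return Pattern.compile(regex).matcher(name).matches();
    }

    /** Case-insensitive substring test. */
    static boolean containsIgnoreCase(String name, String fragment) {
        return name.toLowerCase(Locale.ROOT).contains(fragment.toLowerCase(Locale.ROOT));
    }

    public static void main(String[] args) {
        String name = "environmental samples";
        System.out.println(matchesRegex(name, "environmental"));     // false: pattern does not span the whole name
        System.out.println(matchesRegex(name, ".*environmental.*")); // true
        System.out.println(containsIgnoreCase(name, "ENVIRON"));     // true
    }
}
```

This is why a regex filter usually needs leading and trailing .* to behave like a substring search.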

Java Parameters

-Xmx
Sets Java's memory usage, overriding autodetection. Format: -Xmx20g for 20 GB, -Xmx200m for 200 MB. Maximum is typically 85% of physical memory. Large taxonomic trees may require substantial memory (4GB+ recommended).
-eoom
Exit on out-of-memory exception. Requires Java 8u92 or later. Prevents incomplete processing when memory is exhausted during large file operations or tree loading.
-da
Disable Java assertions. May provide minor performance improvement in production use, but assertions help catch configuration errors during development.

Examples

Basic taxonomic filtering

filterassemblysummary.sh in=assembly_summary_genbank.txt out=bacteria_assemblies.txt \
    tree=tree.taxtree.gz ids=2 level=kingdom include=t

Filters GenBank assembly summaries to keep only bacterial assemblies (taxonomic ID 2 = Bacteria) at the kingdom level. The include=t flag is needed because the default, include=f, would discard the matching entries instead.

Species-level filtering with multiple organisms

filterassemblysummary.sh in=assembly_summary_refseq.txt out=pathogens.txt \
    tree=tree.taxtree.gz ids=562,1313,1311 level=species include=t

Includes only assemblies from three pathogenic species: Escherichia coli (562), Streptococcus pneumoniae (1313), and Streptococcus agalactiae (1311).

Name-based filtering with ancestor requirements

filterassemblysummary.sh in=assembly_summary_genbank.txt out=mammals.txt \
    tree=tree.taxtree.gz names="Mammalia" level=class \
    reqlevel=species,genus include=t

Filters for mammalian assemblies, requiring that entries have taxonomic definitions at both species and genus levels.

Exclusion filtering with regex patterns

filterassemblysummary.sh in=assembly_summary_genbank.txt out=no_env_samples.txt \
    tree=tree.taxtree.gz regex=".*environmental.*" include=f

Excludes environmental samples by filtering out entries with "environmental" in their taxonomic names.
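
The include flag used in the examples above can be read as a simple boolean comparison: an entry is kept when "matched the filter" equals "include mode". The sketch below shows that reading under the assumption that matching reduces to set membership of an already-promoted taxonomic ID. The class and method names are hypothetical.

```java
import java.util.Set;

/** Illustrative only: include/exclude semantics as a single comparison. */
final class IncludeModeSketch {
    /**
     * @param taxId     taxonomic ID of the entry (assumed already promoted)
     * @param filterIds IDs that define the filter
     * @param include   true keeps matches, false discards matches
     */
    static boolean keep(int taxId, Set<Integer> filterIds, boolean include) {
        boolean matched = filterIds.contains(taxId);
        return matched == include;
    }
}
```

With include=false, an entry whose ID is in the set is dropped and every other entry is kept; with include=true the outcome is reversed.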

Algorithm Details

Input Processing Architecture

FilterAssemblySummary.processLine() implements single-pass line processing using TextFile.nextLine() for streaming input. The tool parses tab-separated values through String.split("\t"), extracting taxonomic IDs from column index 6 using Integer.parseInt(). Comment lines starting with '#' return null from processLine() and are automatically excluded from output.
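
The parsing step described above can be sketched as follows. This is not the tool's source: the class name, the method name, and the -1 sentinel are hypothetical, and the column index is simply the value stated in this document.

```java
/** Illustrative only: extracts a taxonomic ID from one assembly summary line. */
final class AssemblyLineSketch {
    // Column index as stated in this document; adjust if your file layout differs.
    static final int TAXID_COLUMN = 6;

    /**
     * Returns the taxonomic ID, or -1 for comment, empty, or too-short lines.
     * A non-numeric field makes Integer.parseInt throw NumberFormatException.
     */
    static int parseTaxId(String line) {
        if (line == null || line.isEmpty() || line.startsWith("#")) {
            return -1;
        }
        String[] fields = line.split("\t");
        if (fields.length <= TAXID_COLUMN) {
            return -1;
        }
        return Integer.parseInt(fields[TAXID_COLUMN].trim());
    }
}
```

Returning a sentinel for comment lines plays the same role as the null return described above: the caller skips the line and nothing is written to the output.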

Taxonomic Filtering Implementation

The core filtering operates through the TaxFilter.passesFilter(int) method, which calls TaxTree.getNode(int) to retrieve the taxonomic node for an ID. Each TaxNode contains:

Hierarchical Promotion Strategy

When promote=true, TaxFilter.addNode() implements tree traversal using parent-child navigation:

  1. Level Resolution: TaxTree.parseLevelExtended() converts level names to extended integer codes
  2. Node Traversal: the loop while(tn.id!=tn.pid && tn.levelExtended<taxLevelE) ascends the tree toward the root
  3. Level Matching: the comparison tn.levelExtended<=taxLevelE determines the inclusion boundary
  4. Set Population: HashSet.add(tn.id) accumulates the promoted nodes
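
The steps above can be sketched on a toy tree as shown below. This is one possible reading, not the TaxFilter source: the Node class is a minimal stand-in whose field names mirror those quoted above, the numeric levels are hypothetical, and stopping before a parent that would overshoot the target level is an assumption about how the inclusion boundary is applied.

```java
import java.util.Map;
import java.util.Set;

/** Illustrative only: promotion of a node toward a target level on a toy tree. */
final class PromotionSketch {
    /** Minimal stand-in for a tree node; the root is its own parent (id == pid). */
    static final class Node {
        final int id, pid, levelExtended;
        Node(int id, int pid, int levelExtended) {
            this.id = id;
            this.pid = pid;
            this.levelExtended = levelExtended;
        }
    }

    /** Promotes startId, records the result in filterIds, and returns it (-1 if unknown). */
    static int addPromoted(Map<Integer, Node> tree, int startId, int targetLevel, Set<Integer> filterIds) {
        Node tn = tree.get(startId);
        if (tn == null) {
            return -1;
        }
        // Ascend while below the target level and not yet at the root.
        while (tn.id != tn.pid && tn.levelExtended < targetLevel) {
            Node parent = tree.get(tn.pid);
            // Assumption: do not step onto a parent that lies above the target level.
            if (parent == null || parent.levelExtended > targetLevel) {
                break;
            }
            tn = parent;
        }
        filterIds.add(tn.id);
        return tn.id;
    }
}
```

In this sketch a species whose lineage has no genus node stays at the species node when the target is genus, rather than jumping to its family.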

Multi-Criteria Filtering Engine

TaxFilter.passesFilter() implements compound boolean evaluation through multiple filter stages:

Performance Optimizations

Multiple algorithmic strategies optimize large dataset processing:

Memory Management

Explicit memory management prevents heap exhaustion during large file processing:

Error Handling and Validation

Comprehensive validation ensures data integrity through multiple safety mechanisms:

File Format Specifications

Input Format

NCBI assembly summary files use tab-separated values with the following relevant columns:

Taxonomic Tree Format

Tree files (.taxtree.gz) contain serialized taxonomic hierarchies with node relationships, levels, and name mappings. Generated by TaxTree from NCBI taxonomy dumps.

GI Table Format

Translation tables map GenInfo identifiers to NCBI taxonomic IDs for legacy sequence identification. Generated using gitable.sh from NCBI gi_taxid_nucl.dmp files.

Resource Requirements

File Locations

For Genepool users, pre-built taxonomic resources are available at:

/global/projectb/sandbox/gaag/bbtools/tax

Building Custom Resources

For non-Genepool users or to create updated resources:

Memory Requirements

Support

For questions and support: