FilterAssemblySummary

Basic Usage

filterassemblysummary.sh in=<input file> out=<output file> tree=<tree file> table=<table file> ids=<numbers> level=<name or number>

Filters NCBI assembly summary files based on taxonomic criteria. Requires a taxonomic tree file and can optionally use GI-to-TaxID translation tables for sequence identification.

Data Sources

FilterAssemblySummary processes standard NCBI assembly summary files available from:

GenBank: ftp://ftp.ncbi.nlm.nih.gov/genomes/ASSEMBLY_REPORTS/assembly_summary_genbank.txt
GenBank (alternate): ftp://ftp.ncbi.nlm.nih.gov/genomes/genbank/assembly_summary_genbank.txt
RefSeq: ftp://ftp.ncbi.nlm.nih.gov/genomes/ASSEMBLY_REPORTS/assembly_summary_refseq.txt
RefSeq (alternate): ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/assembly_summary_refseq.txt

Parameters

Parameters are grouped by function, following the organization used in the shell script.

Standard parameters

in=<file>: Primary input file. Must be an NCBI assembly summary file with tab-separated values. The tool expects taxonomic ID information in column 6 (zero-indexed).
out=<file>: Primary output file. Filtered results will be written here, maintaining the original file format but containing only entries that pass the taxonomic filter.
overwrite=f: (ow) Set to false to force the program to abort rather than overwrite an existing file. Default: false

Processing parameters

level=: Taxonomic level for filtering, such as phylum, class, order, family, genus, or species. Can be specified by name or numeric code. Filtering operates on entries that fall within the same taxon, at this level, as the specified IDs. With hierarchical filtering, taxonomic nodes are promoted or demoted to match this level.
reqlevel=: Require nodes to have ancestors at these specific taxonomic levels. Comma-delimited list of level names. For example, reqlevel=species,genus would exclude nodes that are not defined at both the species and genus levels. Uses bitflag representation internally for efficient level checking.
ids=: Comma-delimited list of NCBI numeric taxonomic IDs to filter by. These IDs define the taxonomic groups to include or exclude. Can also accept file paths containing taxonomic IDs, with one ID per line. IDs are automatically promoted to the specified taxonomic level if promote=true.
names=: Alternative to numeric IDs: a comma-delimited list of taxonomic names (such as 'Homo sapiens' or 'Bacteria'). Names containing spaces must be quoted or escaped in the shell. Names are resolved to taxonomic IDs using the loaded taxonomic tree; an ambiguous name may match more than one node.
include=f: Controls filter behavior: 'f' (false) will discard entries matching the filter criteria (exclude mode), 't' (true) will keep only entries matching the criteria (include mode). Default: false (exclude mode)
tree=: Path to taxonomic tree file created by TaxTree (such as tree.taxtree.gz). Required for taxonomic name resolution and hierarchical filtering. Contains the complete NCBI taxonomic hierarchy with parent-child relationships and taxonomic levels.
table=: Path to translation table mapping GI numbers to NCBI taxonomic IDs. Only needed if the assembly summary contains GI numbers instead of direct taxonomic IDs. Generated using gitable.sh from NCBI data.

Advanced filtering parameters

promote=t: Enable hierarchical promotion of taxonomic nodes. When true, nodes are promoted to ancestors until they reach the specified taxonomic level. When false, uses exact taxonomic ID matching only. Default: true
regex=: Regular expression pattern for name-based filtering. Applied to taxonomic names in addition to ID-based filtering. Uses Java regex syntax. Must match the entire name string.
contains=: Substring matching for taxonomic names. Case-insensitive search within taxonomic names. Applied in addition to other filtering criteria.
requirepresent=t: Controls behavior when taxonomic nodes are not found in the tree. When true, missing nodes cause the program to exit with an error. When false, missing nodes generate warnings but processing continues. Default: true
printnodesadded=t: (printnodes) Print debugging information about taxonomic nodes added to the filter set during hierarchical promotion. Useful for verifying filter behavior. Default: true
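
The difference between regex= (whole-name match), contains= (case-insensitive substring), and the include= switch can be illustrated with plain java.util.regex. This is only a sketch of the behavior described above; the class and method names are hypothetical and are not part of BBTools.

```java
import java.util.regex.Pattern;

// Hypothetical illustration of the documented matching rules; not BBTools code.
class NameMatchSketch {

    // regex= behavior as documented: the pattern must match the entire name.
    static boolean matchesRegex(Pattern p, String name) {
        return p.matcher(name).matches();
    }

    // contains= behavior as documented: case-insensitive substring search.
    static boolean containsIgnoreCase(String name, String fragment) {
        return name.toLowerCase().contains(fragment.toLowerCase());
    }

    // include= semantics: include=t keeps matches, include=f (default) discards them.
    static boolean keep(boolean matched, boolean include) {
        return matched == include;
    }

    public static void main(String[] args) {
        String name = "uncultured environmental sample";
        System.out.println(matchesRegex(Pattern.compile("environmental"), name));     // false
        System.out.println(matchesRegex(Pattern.compile(".*environmental.*"), name)); // true
        System.out.println(containsIgnoreCase(name, "ENVIRONMENTAL"));                // true
        System.out.println(keep(true, false));                                        // false
    }
}
```

Because the regex must cover the whole name, a bare word such as "environmental" needs leading and trailing ".*" to behave like a substring search.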

Java Parameters

-Xmx: Sets Java's memory usage, overriding autodetection. Format: -Xmx20g for 20 GB, -Xmx200m for 200 MB. Maximum is typically 85% of physical memory. Large taxonomic trees may require substantial memory (4GB+ recommended).
-eoom: Exit on out-of-memory exception. Requires Java 8u92 or later. Prevents incomplete processing when memory is exhausted during large file operations or tree loading.
-da: Disable Java assertions. May provide minor performance improvement in production use, but assertions help catch configuration errors during development.

Examples

Basic taxonomic filtering

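filterassemblysummary.sh in=assembly_summary_genbank.txt out=bacteria_assemblies.txt \
    tree=tree.taxtree.gz ids=2 level=kingdom include=t

Keeps only bacterial assemblies (taxonomic ID 2 = Bacteria) from the GenBank assembly summary, filtering at the kingdom level. include=t is needed here because the default (include=f) would discard the matching entries instead of keeping them.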

Species-level filtering with multiple organisms

filterassemblysummary.sh in=assembly_summary_refseq.txt out=pathogens.txt \
    tree=tree.taxtree.gz ids=562,1313,1311 level=species include=t

Includes only assemblies from specific pathogenic species: Escherichia coli (562), Streptococcus pneumoniae (1313), and Streptococcus agalactiae (1311).

Name-based filtering with ancestor requirements

filterassemblysummary.sh in=assembly_summary_genbank.txt out=mammals.txt \
    tree=tree.taxtree.gz names="Mammalia" level=class \
    reqlevel=species,genus include=t

Filters for mammalian assemblies, requiring that entries have taxonomic definitions at both species and genus levels.

Exclusion filtering with regex patterns

filterassemblysummary.sh in=assembly_summary_genbank.txt out=no_env_samples.txt \
    tree=tree.taxtree.gz regex=".*environmental.*" include=f

Excludes environmental samples by filtering out entries with "environmental" in their taxonomic names.

Algorithm Details

Input Processing Architecture

FilterAssemblySummary.processLine() implements single-pass line processing using TextFile.nextLine() for streaming input. The tool parses tab-separated values through String.split("\t"), extracting taxonomic IDs from column index 6 using Integer.parseInt(). Comment lines starting with '#' return null from processLine() and are automatically excluded from output.
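
The per-line logic described above can be sketched with the standard library alone. This is a hypothetical stand-in, not the BBTools source: the class name, the -1 sentinel (the real method is described as returning null for comments), and the lenient handling of short or non-numeric lines are assumptions made for illustration. The column index is the one stated in this document.

```java
// Hypothetical sketch of the documented per-line parsing; not the BBTools source.
class AssemblyLineSketch {

    // Column index as stated in this document; adjust if your file layout differs.
    static final int TAXID_COLUMN = 6;

    // Returns the taxonomic ID of a data line, or -1 for comment or malformed lines.
    static int parseTaxId(String line) {
        if (line.isEmpty() || line.startsWith("#")) {
            return -1; // Header and comment lines carry no taxonomic ID.
        }
        String[] split = line.split("\t");
        if (split.length <= TAXID_COLUMN) {
            return -1; // Too few tab-separated fields.
        }
        try {
            return Integer.parseInt(split[TAXID_COLUMN].trim());
        } catch (NumberFormatException e) {
            return -1; // Field is not a plain integer.
        }
    }

    public static void main(String[] args) {
        // Placeholder fields; only the field at TAXID_COLUMN matters here.
        String row = "acc\tb\tc\td\te\tf\t562\tg\tname";
        System.out.println(parseTaxId(row));        // 562
        System.out.println(parseTaxId("# header")); // -1
    }
}
```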

Taxonomic Filtering Implementation

The core filtering operates through the TaxFilter.passesFilter(int) method, which queries TaxTree.getNode(int) to retrieve taxonomic nodes. Each TaxNode contains:

id: Primary NCBI taxonomic identifier
pid: Parent node identifier for tree traversal
level: Integer taxonomic rank for level-based filtering
levelExtended: Extended level representation via TaxTree.levelToExtended()
maxChildLevelExtended: Subtree pruning optimization boundary

Hierarchical Promotion Strategy

When promote=true, TaxFilter.addNode() implements tree traversal using parent-child navigation:

Level Resolution: TaxTree.parseLevelExtended() converts level names to extended integers
Node Traversal: while(tn.id!=tn.pid && tn.levelExtended<taxLevelE) loop ascends tree
Level Matching: Comparison tn.levelExtended<=taxLevelE determines inclusion boundary
Set Population: HashSet.add(tn.id) accumulates promoted nodes
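
The promotion steps above can be sketched on a toy tree. Everything here is a simplified assumption: the Node class, the numeric level codes (species=1, genus=2, family=3, root=10), and the truncated Escherichia coli lineage are made up for illustration and do not reproduce the TaxTree data structures.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical toy model of promotion; not the BBTools TaxFilter source.
class PromotionSketch {

    // Minimal stand-in for a taxonomy node. Larger level values mean broader ranks.
    static final class Node {
        final int id;
        final int pid;
        final int level;

        Node(int id, int pid, int level) {
            this.id = id;
            this.pid = pid;
            this.level = level;
        }
    }

    // Ascends parent links until the node reaches targetLevel or the root (id == pid).
    static Node promote(Map<Integer, Node> tree, Node start, int targetLevel) {
        Node tn = start;
        while (tn.id != tn.pid && tn.level < targetLevel) {
            Node parent = tree.get(tn.pid);
            if (parent == null) {
                break; // Parent missing from the toy tree; stop at the current node.
            }
            tn = parent;
        }
        return tn;
    }

    // Truncated lineage: 562 (species) -> 561 (genus) -> 543 (family) -> 1 (root).
    static Map<Integer, Node> toyTree() {
        Map<Integer, Node> tree = new HashMap<>();
        tree.put(1, new Node(1, 1, 10));
        tree.put(543, new Node(543, 1, 3));
        tree.put(561, new Node(561, 543, 2));
        tree.put(562, new Node(562, 561, 1));
        return tree;
    }

    public static void main(String[] args) {
        Map<Integer, Node> tree = toyTree();
        Set<Integer> filterIds = new HashSet<>();
        // Promote the species node to genus level and record the result.
        filterIds.add(promote(tree, tree.get(562), 2).id);
        System.out.println(filterIds); // [561]
    }
}
```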

Multi-Criteria Filtering Engine

TaxFilter.passesFilter() implements compound boolean evaluation through multiple filter stages:

ID-based filtering: HashSet.contains(tn.id) provides O(1) taxonomic ID lookup
Name-based filtering: TaxTree.parseNodeFromHeader() and TaxTree.getNodeByName() for string resolution
Regex filtering: Pattern.compile().matcher().matches() using pre-compiled regex patterns
Substring filtering: String.toLowerCase().contains() for case-insensitive matching
Ancestor requirements: Bitwise operations (levels&reqLevels)==reqLevels validate taxonomic completeness
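
One plausible way to combine the ID lookup with tree traversal is to test a node and each of its ancestors against the filter set, then apply the include flag. The sketch below assumes that approach; it is not taken from the TaxFilter source, and the class name, the parent map, and the shortened lineages are hypothetical.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch of ID-based filtering with ancestor lookup; not BBTools code.
class AncestorFilterSketch {

    // parentOf maps each node ID to its parent ID; the root maps to itself.
    static boolean inFilterSet(Map<Integer, Integer> parentOf, Set<Integer> filterIds, int id) {
        Integer current = id;
        while (current != null) {
            if (filterIds.contains(current)) {
                return true; // The node itself or one of its ancestors is in the set.
            }
            Integer parent = parentOf.get(current);
            if (parent == null || parent.equals(current)) {
                return false; // Unknown node, or root reached without a match.
            }
            current = parent;
        }
        return false;
    }

    // include=true keeps matching entries; include=false keeps non-matching ones.
    static boolean passes(Map<Integer, Integer> parentOf, Set<Integer> filterIds,
                          int id, boolean include) {
        return inFilterSet(parentOf, filterIds, id) == include;
    }

    // Toy tree with intermediate ranks omitted: 562 -> 561 -> 2 -> 1 and 9606 -> 1.
    static Map<Integer, Integer> toyParents() {
        Map<Integer, Integer> parentOf = new HashMap<>();
        parentOf.put(1, 1);
        parentOf.put(2, 1);
        parentOf.put(561, 2);
        parentOf.put(562, 561);
        parentOf.put(9606, 1);
        return parentOf;
    }

    public static void main(String[] args) {
        Map<Integer, Integer> parentOf = toyParents();
        Set<Integer> bacteria = new HashSet<>();
        bacteria.add(2);
        System.out.println(passes(parentOf, bacteria, 562, true));  // true
        System.out.println(passes(parentOf, bacteria, 9606, true)); // false
    }
}
```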

Performance Optimizations

Several implementation choices keep the processing of large datasets efficient:

HashSet-based lookups: Java HashSet provides O(1) average taxonomic ID verification
Compiled regex patterns: Pattern.compile() pre-compilation avoids repeated compilation overhead
Subtree pruning: tn.maxChildLevelExtended<=maxChildLevelExtended comparison eliminates unnecessary tree traversal
Stream processing: TextFile line-by-line reading prevents full file memory loading
Bitflag operations: Integer bitwise AND (1<<tn.level) for level requirement validation
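
The bitflag check used for reqlevel can be shown in isolation. The level codes below are made up for the example (the real numbering is defined by the taxonomy tree), and the class and method names are hypothetical; only the (levels&reqLevels)==reqLevels test is taken from the description above.

```java
// Hypothetical sketch of level bitflags; the level codes are illustrative only.
class LevelFlagsSketch {

    static final int SPECIES = 1;
    static final int GENUS = 2;
    static final int FAMILY = 3;

    // Each level occupies one bit of an int.
    static int flag(int level) {
        return 1 << level;
    }

    // ORs the flags of all given levels into a single mask.
    static int flagsOf(int... levels) {
        int flags = 0;
        for (int level : levels) {
            flags |= flag(level);
        }
        return flags;
    }

    // True only if every required level bit is also set in the present mask.
    static boolean hasRequired(int present, int required) {
        return (present & required) == required;
    }

    public static void main(String[] args) {
        int required = flagsOf(SPECIES, GENUS); // like reqlevel=species,genus
        System.out.println(hasRequired(flagsOf(SPECIES, GENUS, FAMILY), required)); // true
        System.out.println(hasRequired(flagsOf(SPECIES, FAMILY), required));        // false
    }
}
```

A single AND and comparison checks all required levels at once, regardless of how many levels are listed.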

Memory Management

Several measures limit memory use during large file processing:

Default allocation: -Xmx4g and -Xms4g heap initialization in shell script
Buffer management: Shared.capBuffers(4) limits concurrent I/O buffer allocation
Garbage collection: System.gc() calls in TaxFilter.loadAccession() after data structure loading
Streaming I/O: TextFile and TextStreamWriter prevent full file loading

Error Handling and Validation

Several checks guard against missing or malformed input:

Missing nodes: REQUIRE_PRESENT static flag triggers KillSwitch.kill() for unresolved taxonomic IDs
File validation: Tools.testOutputFiles() and FileFormat.testInput() verify accessibility
Format validation: assert(split.length>6) ensures proper tab-delimited structure
Tree consistency: TaxTree.loadTaxTree() validates taxonomic hierarchy completeness

File Format Specifications

Input Format

NCBI assembly summary files use tab-separated values with the following relevant columns:

Column 0: Assembly accession
Column 6: Taxonomic ID (used for filtering)
Column 7: Species taxonomic ID
Column 8: Organism name

Taxonomic Tree Format

Tree files (.taxtree.gz) contain serialized taxonomic hierarchies with node relationships, levels, and name mappings. Generated by TaxTree from NCBI taxonomy dumps.

GI Table Format

Translation tables map GenInfo identifiers to NCBI taxonomic IDs for legacy sequence identification. Generated using gitable.sh from NCBI gi_taxid_nucl.dmp files.

Resource Requirements

File Locations

For Genepool users, pre-built taxonomic resources are available at:

/global/projectb/sandbox/gaag/bbtools/tax

Building Custom Resources

For non-Genepool users or to create updated resources:

Taxonomic trees: Use taxtree.sh with NCBI taxonomy dumps
GI tables: Use gitable.sh with NCBI gi_taxid files
Accession tables: Generated automatically from NCBI accession data

Memory Requirements

Small trees (<100K nodes): 2-4 GB RAM
Full NCBI taxonomy (>2M nodes): 4-8 GB RAM
Large assembly summaries (>1M entries): Additional 1-2 GB

Support

For questions and support:

Email: bbushnell@lbl.gov
Documentation: bbmap.org