FilterAssemblySummary
Filters NCBI assembly summary files by taxonomy, using a taxonomic tree and ID-based filtering. Supports hierarchical filtering with a configurable taxonomic level and ancestor-level requirements. Input is a standard tab-separated NCBI assembly summary file; the taxonomic ID is read from the taxid column, which is the sixth field.
Basic Usage
filterassemblysummary.sh in=<input file> out=<output file> tree=<tree file> table=<table file> ids=<numbers> level=<name or number>
Filters NCBI assembly summary files based on taxonomic criteria. Requires a taxonomic tree file and can optionally use GI-to-TaxID translation tables for sequence identification.
Data Sources
FilterAssemblySummary processes standard NCBI assembly summary files available from:
- GenBank: ftp://ftp.ncbi.nlm.nih.gov/genomes/ASSEMBLY_REPORTS/assembly_summary_genbank.txt
- GenBank (alternate): ftp://ftp.ncbi.nlm.nih.gov/genomes/genbank/assembly_summary_genbank.txt
- RefSeq: ftp://ftp.ncbi.nlm.nih.gov/genomes/ASSEMBLY_REPORTS/assembly_summary_refseq.txt
- RefSeq (alternate): ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/assembly_summary_refseq.txt
Parameters
Parameters are organized by their function in the filtering process, following the organization used in the shell script.
Standard parameters
- in=<file>
- Primary input file. Must be an NCBI assembly summary file with tab-separated values. The tool reads the taxonomic ID from the taxid column, which is the sixth field of the standard layout (index 5 when counting from zero).
- out=<file>
- Primary output file. Filtered results will be written here, maintaining the original file format but containing only entries that pass the taxonomic filter.
- overwrite=f
- (ow) Set to false to force the program to abort rather than overwrite an existing file. Default: false
Processing parameters
- level=
- Taxonomic level at which to filter, such as phylum, class, order, family, genus, or species. Can be specified by name or numeric code. Assemblies are matched against the specified IDs at this level, so an entry can match through a shared ancestor rather than only through an identical ID. When hierarchical filtering is used, the tool promotes or demotes taxonomic nodes to this level.
- reqlevel=
- Require nodes to have ancestors at these specific taxonomic levels. Comma-delimited list of level names. For example, reqlevel=species,genus would exclude nodes that are not defined at both the species and genus levels. Uses bitflag representation internally for efficient level checking.
- ids=
- Comma-delimited list of NCBI numeric taxonomic IDs to filter by. These IDs define the taxonomic groups to include or exclude. Can also accept file paths containing taxonomic IDs, with one ID per line. IDs are automatically promoted to the specified taxonomic level if promote=true.
- names=
- Alternative to numeric IDs - comma-delimited list of taxonomic names (such as 'Homo sapiens', 'Bacteria'). Names containing spaces must be quoted or escaped so the shell passes them as a single argument. Names are resolved to taxonomic IDs using the loaded taxonomic tree. An ambiguous name may match more than one node.
- include=f
- Controls filter behavior: 'f' (false) will discard entries matching the filter criteria (exclude mode), 't' (true) will keep only entries matching the criteria (include mode). Default: false (exclude mode)
- tree=
- Path to taxonomic tree file created by TaxTree (such as tree.taxtree.gz). Required for taxonomic name resolution and hierarchical filtering. Contains the complete NCBI taxonomic hierarchy with parent-child relationships and taxonomic levels.
- table=
- Path to translation table mapping GI numbers to NCBI taxonomic IDs. Only needed if the assembly summary contains GI numbers instead of direct taxonomic IDs. Generated using gitable.sh from NCBI data.
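The ids= and include= parameters above interact in a simple way, which the following Python sketch illustrates. It is not the tool's Java code: parse_ids and keep_entry are hypothetical helper names, and the handling of file paths (one ID per line, non-numeric lines ignored) is an assumption based on the parameter description.

```python
import os


def parse_ids(spec):
    """Expand an ids= value into a set of integer taxonomic IDs.

    Each comma-separated item is treated as a number or, failing that,
    as a path to a text file with one ID per line (assumed behavior).
    """
    ids = set()
    for item in spec.split(","):
        item = item.strip()
        if not item:
            continue
        if item.isdecimal():
            ids.add(int(item))
        elif os.path.isfile(item):
            with open(item) as handle:
                for line in handle:
                    line = line.strip()
                    if line.isdecimal():
                        ids.add(int(line))
        else:
            raise ValueError(f"not a number or a readable file: {item!r}")
    return ids


def keep_entry(matches_filter, include):
    """include=t keeps matching entries; include=f keeps everything else."""
    return matches_filter == include
```

With the default include=f, keep_entry(True, False) is False, so an entry that matches the filter is discarded. This is why commands meant to select a group need include=t.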
Advanced filtering parameters
- promote=t
- Enable hierarchical promotion of taxonomic nodes. When true, nodes are promoted to ancestors until they reach the specified taxonomic level. When false, uses exact taxonomic ID matching only. Default: true
- regex=
- Regular expression pattern for name-based filtering. Applied to taxonomic names in addition to ID-based filtering. Uses Java regex syntax. Must match the entire name string.
- contains=
- Substring matching for taxonomic names. Case-insensitive search within taxonomic names. Applied in addition to other filtering criteria.
- requirepresent=t
- Controls behavior when taxonomic nodes are not found in the tree. When true, missing nodes cause the program to exit with an error. When false, missing nodes generate warnings but processing continues. Default: true
- printnodesadded=t
- (printnodes) Print debugging information about taxonomic nodes added to the filter set during hierarchical promotion. Useful for verifying filter behavior. Default: true
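The difference between regex= (whole-name match) and contains= (case-insensitive substring) is easy to trip over. The Python sketch below mirrors the two behaviors as described above; the function names are hypothetical, and Python's re syntax differs from Java's in some details, so treat it as an illustration rather than the tool's matching code.

```python
import re


def matches_regex(name, pattern):
    # fullmatch requires the pattern to cover the entire name,
    # mirroring "must match the entire name string".
    return pattern.fullmatch(name) is not None


def matches_contains(name, fragment):
    # Lower-case both sides for a case-insensitive substring test.
    return fragment.lower() in name.lower()


bare = re.compile(r"environmental")
wrapped = re.compile(r".*environmental.*")

name = "environmental samples"
bare_result = matches_regex(name, bare)        # pattern covers only part of the name
wrapped_result = matches_regex(name, wrapped)  # .* on both sides covers the rest
```

Under whole-name matching, a bare word such as environmental fails on longer names, so patterns usually need a leading and trailing .* as in the regex example in this document.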
Java Parameters
- -Xmx
- Sets Java's memory usage, overriding autodetection. Format: -Xmx20g for 20 GB, -Xmx200m for 200 MB. Maximum is typically 85% of physical memory. Large taxonomic trees may require substantial memory (4GB+ recommended).
- -eoom
- Exit on out-of-memory exception. Requires Java 8u92 or later. Prevents incomplete processing when memory is exhausted during large file operations or tree loading.
- -da
- Disable Java assertions. May provide minor performance improvement in production use, but assertions help catch configuration errors during development.
Examples
Basic taxonomic filtering
filterassemblysummary.sh in=assembly_summary_genbank.txt out=bacteria_assemblies.txt \
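tree=tree.taxtree.gz ids=2 level=kingdom include=t
Keeps only bacterial assemblies (taxonomic ID 2 = Bacteria) from the GenBank assembly summary. include=t is needed here because the default, include=f, would discard the matching entries instead of keeping them.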
Species-level filtering with multiple organisms
filterassemblysummary.sh in=assembly_summary_refseq.txt out=pathogens.txt \
tree=tree.taxtree.gz ids=562,1313,1311 level=species include=t
Includes only assemblies from specific pathogenic species: Escherichia coli (562), Streptococcus pneumoniae (1313), and Streptococcus agalactiae (1311).
Name-based filtering with ancestor requirements
filterassemblysummary.sh in=assembly_summary_genbank.txt out=mammals.txt \
tree=tree.taxtree.gz names="Mammalia" level=class \
reqlevel=species,genus include=t
Filters for mammalian assemblies, requiring that entries have taxonomic definitions at both species and genus levels.
Exclusion filtering with regex patterns
filterassemblysummary.sh in=assembly_summary_genbank.txt out=no_env_samples.txt \
tree=tree.taxtree.gz regex=".*environmental.*" include=f
Excludes environmental samples by filtering out entries with "environmental" in their taxonomic names.
Algorithm Details
Input Processing Architecture
FilterAssemblySummary.processLine() handles the input in a single pass, reading lines with TextFile.nextLine() so the file is streamed rather than loaded whole. Each line is split on tabs with String.split("\t"), and the taxonomic ID is parsed with Integer.parseInt() from the taxid field, the sixth column of the standard NCBI layout (index 5 when counting from zero). For comment lines starting with '#', processLine() returns null, so they are excluded from the output.
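The single-pass flow can be sketched in Python as a generator. This is an illustration under assumptions, not the Java implementation: filter_lines is a hypothetical name, the position of the taxid field is passed in rather than hard-coded, and silently skipping malformed lines is a choice made for this sketch.

```python
def filter_lines(lines, passes, taxid_index):
    """Yield the lines whose taxonomic ID satisfies the predicate.

    lines       -- any iterable of text lines (for example an open file)
    passes      -- function taking an integer taxonomic ID, returning bool
    taxid_index -- zero-based position of the taxid field in each line
    """
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # comment and blank lines never reach the output
        fields = line.split("\t")
        if len(fields) <= taxid_index:
            continue  # too few columns (skipped in this sketch)
        try:
            taxid = int(fields[taxid_index])
        except ValueError:
            continue  # non-numeric taxid field (skipped in this sketch)
        if passes(taxid):
            yield line


rows = [
    "# assembly_accession\tbioproject\tbiosample\twgs_master\trefseq_category\ttaxid",
    "GCA_1\ta\tb\tc\td\t562\t562\tEscherichia coli",
    "GCA_2\ta\tb\tc\td\t9606\t9606\tHomo sapiens",
]
kept = list(filter_lines(rows, lambda taxid: taxid == 562, 5))
```

Because the function is a generator, only one line is held in memory at a time, which is the same property the streaming reader provides.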
Taxonomic Filtering Implementation
The core filtering operates through TaxFilter.passesFilter(int) method, which queries TaxTree.getNode(int) for taxonomic node retrieval. Each TaxNode contains:
- id: Primary NCBI taxonomic identifier
- pid: Parent node identifier for tree traversal
- level: Integer taxonomic rank for level-based filtering
- levelExtended: Extended level representation via TaxTree.levelToExtended()
- maxChildLevelExtended: Subtree pruning optimization boundary
Hierarchical Promotion Strategy
When promote=true, TaxFilter.addNode() implements tree traversal using parent-child navigation:
- Level Resolution: TaxTree.parseLevelExtended() converts level names to extended integers
- Node Traversal: while(tn.id!=tn.pid && tn.levelExtended<taxLevelE) loop ascends tree
- Level Matching: Comparison tn.levelExtended<=taxLevelE determines inclusion boundary
- Set Population: HashSet.add(tn.id) accumulates promoted nodes
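The traversal loop above can be sketched in Python on a toy tree. The Node class, the promote function, and the numeric level codes (species=1, genus=2, family=3) are assumptions for illustration and do not reproduce the tool's extended-level encoding. The toy tree attaches family 543 directly to the root, skipping the real intermediate ranks.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    id: int
    pid: int    # parent ID; the root is its own parent
    level: int  # hypothetical rank code: higher means closer to the root


def promote(tree, taxid, target_level):
    """Climb from taxid toward the root until target_level is reached."""
    node = tree[taxid]
    # Same shape as the loop described above: stop at the root
    # (id == pid) or once the node is at or above the target level.
    while node.id != node.pid and node.level < target_level:
        node = tree[node.pid]
    return node


SPECIES, GENUS, FAMILY, ROOT = 1, 2, 3, 99
tree = {
    1: Node(1, 1, ROOT),
    543: Node(543, 1, FAMILY),    # Enterobacteriaceae
    561: Node(561, 543, GENUS),   # Escherichia
    562: Node(562, 561, SPECIES), # Escherichia coli
}
promoted = {promote(tree, 562, GENUS).id}  # set of promoted node IDs
```

A node that already sits at or above the target level is returned unchanged, since the loop only ascends.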
Multi-Criteria Filtering Engine
TaxFilter.passesFilter() implements compound boolean evaluation through multiple filter stages:
- ID-based filtering: HashSet.contains(tn.id) provides average O(1) taxonomic ID lookup
- Name-based filtering: TaxTree.parseNodeFromHeader() and TaxTree.getNodeByName() for string resolution
- Regex filtering: Pattern.compile().matcher().matches() using pre-compiled regex patterns
- Substring filtering: String.toLowerCase().contains() for case-insensitive matching
- Ancestor requirements: Bitwise operations (levels&reqLevels)==reqLevels validate taxonomic completeness
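The ancestor requirement check in the last item can be sketched as follows. This Python version is an illustration only: the tree layout (ID mapped to a parent ID and a level), the level codes, and the function names are assumptions, and taxonomic ID 900001 is a made-up species attached directly to a family so that the genus level is missing from its lineage.

```python
def level_mask(levels):
    """Combine level codes into one integer with a bit set per level."""
    mask = 0
    for level in levels:
        mask |= 1 << level
    return mask


def lineage_flags(tree, taxid):
    """Set one bit for every level found between taxid and the root."""
    flags = 0
    node_id = taxid
    while True:
        pid, level = tree[node_id]
        flags |= 1 << level
        if pid == node_id:  # the root is its own parent
            return flags
        node_id = pid


def has_required_levels(flags, required):
    # Same test as in the text: every required bit must be present.
    return (flags & required) == required


ROOT, SPECIES, GENUS, FAMILY = 0, 1, 2, 3
tree = {
    1: (1, ROOT),
    543: (1, FAMILY),
    561: (543, GENUS),
    562: (561, SPECIES),
    900001: (543, SPECIES),  # hypothetical species with no genus ancestor
}
required = level_mask([SPECIES, GENUS])
```

Here has_required_levels(lineage_flags(tree, 562), required) holds, while the same call for 900001 does not, which corresponds to what reqlevel=species,genus is described as doing.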
Performance Optimizations
Multiple algorithmic strategies optimize large dataset processing:
- HashSet-based lookups: Java HashSet provides O(1) average taxonomic ID verification
- Compiled regex patterns: Pattern.compile() pre-compilation avoids repeated compilation overhead
- Subtree pruning: tn.maxChildLevelExtended<=maxChildLevelExtended comparison eliminates unnecessary tree traversal
- Stream processing: TextFile line-by-line reading prevents full file memory loading
- Bitflag operations: each level is encoded as a bit (1<<tn.level), so level requirements are validated with a single integer bitwise AND
Memory Management
Several measures limit heap usage during large file processing:
- Default allocation: -Xmx4g and -Xms4g heap initialization in shell script
- Buffer management: Shared.capBuffers(4) limits concurrent I/O buffer allocation
- Garbage collection: System.gc() calls in TaxFilter.loadAccession() after data structure loading
- Streaming I/O: TextFile and TextStreamWriter prevent full file loading
Error Handling and Validation
Several checks guard against missing data and malformed input:
- Missing nodes: REQUIRE_PRESENT static flag triggers KillSwitch.kill() for unresolved taxonomic IDs
- File validation: Tools.testOutputFiles() and FileFormat.testInput() verify accessibility
- Format validation: assert(split.length>6) checks for the expected tab-delimited structure; like any Java assertion, it is skipped when assertions are disabled with -da
- Tree consistency: TaxTree.loadTaxTree() validates taxonomic hierarchy completeness
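The missing-node behavior controlled by requirepresent can be sketched in Python as below. The names lookup and MissingNodeError are hypothetical. The sketch raises an exception where the tool is described as terminating the process, and the dictionary stands in for the taxonomic tree.

```python
import sys


class MissingNodeError(Exception):
    """Raised when a taxonomic ID is absent and presence is required."""


def lookup(tree, taxid, require_present=True, log=sys.stderr):
    """Return the node for taxid, or None if it is absent and tolerated."""
    node = tree.get(taxid)
    if node is None:
        if require_present:
            # Strict mode: stop rather than filter on incomplete data.
            raise MissingNodeError(f"taxonomic ID {taxid} not found in tree")
        # Lenient mode: warn and let the caller continue.
        print(f"warning: taxonomic ID {taxid} not found in tree", file=log)
    return node


tree = {562: "Escherichia coli"}
```

Calling lookup(tree, 0) raises MissingNodeError, while lookup(tree, 0, require_present=False) writes a warning and returns None so processing can continue.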
File Format Specifications
Input Format
NCBI assembly summary files use tab-separated values. The relevant columns, numbered from 1 as in the NCBI header line, are:
- Column 1: Assembly accession
- Column 6: Taxonomic ID (used for filtering)
- Column 7: Species taxonomic ID
- Column 8: Organism name
Taxonomic Tree Format
Tree files (.taxtree.gz) contain serialized taxonomic hierarchies with node relationships, levels, and name mappings. Generated by TaxTree from NCBI taxonomy dumps.
GI Table Format
Translation tables map GenInfo identifiers to NCBI taxonomic IDs for legacy sequence identification. Generated using gitable.sh from NCBI gi_taxid_nucl.dmp files.
Resource Requirements
File Locations
For Genepool users, pre-built taxonomic resources are available at:
/global/projectb/sandbox/gaag/bbtools/tax
Building Custom Resources
For non-Genepool users or to create updated resources:
- Taxonomic trees: Use taxtree.sh with NCBI taxonomy dumps
- GI tables: Use gitable.sh with NCBI gi_taxid files
- Accession tables: Generated automatically from NCBI accession data
Memory Requirements
- Small trees (<100K nodes): 2-4 GB RAM
- Full NCBI taxonomy (>2M nodes): 4-8 GB RAM
- Large assembly summaries (>1M entries): Additional 1-2 GB
Support
For questions and support:
- Email: bbushnell@lbl.gov
- Documentation: bbmap.org