TaxSize

Script: taxsize.sh; Package: tax; Class: TaxSize.java

Calculates the amount of sequence per taxonomic node from FASTA files annotated with taxonomic information in headers. Processes RefSeq-style headers to aggregate sequence lengths and counts by taxonomic ID, providing both direct and cumulative statistics through the taxonomic hierarchy.

Basic Usage

taxsize.sh in=<file> out=<file> tree=<file>

TaxSize requires a taxonomically annotated FASTA file (typically modified RefSeq), an output file for the size statistics, and a taxonomic tree file for hierarchy traversal.

Parameters

Parameters fall into two groups: core input/output parameters and Java memory-management options. The in, out, and tree parameters are required; verbose is optional.

Core Parameters

in=: Input FASTA file annotated with taxonomic data in headers, such as modified RefSeq format. Headers must contain taxonomic information that can be parsed to identify TaxIDs. Required parameter.
out=: Output file location to write the taxonomic size statistics. Output format includes columns for taxID, bases, basesC (cumulative), seqs, seqsC (cumulative), and nodesC (cumulative nodes). Required parameter.
tree= or taxtree=: Path to the taxonomic tree file used for hierarchy traversal and cumulative calculations. Can be set to "auto" to use the default tree file location. Required parameter.
verbose=f: Enable verbose output for debugging and detailed processing information. Default: false

Java Parameters

-Xmx: Sets Java's memory usage, overriding autodetection. Format: -Xmx20g for 20 gigabytes, -Xmx200m for 200 megabytes. Maximum is typically 85% of physical memory. Default: 2000m
-eoom: Causes the process to exit immediately if an out-of-memory exception occurs. Useful for preventing system instability. Requires Java 8u92 or later.
-da: Disables Java assertions for potentially improved performance in production environments.

Examples

Basic Taxonomic Size Analysis

taxsize.sh in=refseq_annotated.fasta out=taxonomy_sizes.txt tree=auto

Processes a RefSeq-annotated FASTA file to calculate sequence sizes per taxonomic node using the default taxonomic tree.

Custom Tree Analysis

taxsize.sh in=sequences.fasta out=sizes.tsv tree=custom_taxonomy.tree verbose=t

Analyzes sequences with a custom taxonomic tree file and enables verbose output for detailed processing information.

High-Memory Analysis

taxsize.sh in=large_dataset.fasta out=results.txt tree=auto -Xmx50g

Processes a large dataset with increased memory allocation (50GB) for handling extensive taxonomic databases.
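
When TaxSize is run from a pipeline rather than typed by hand, the command lines above can be assembled programmatically. The following is a minimal sketch in Python; the helper name build_taxsize_command and its arguments are hypothetical, and it emits only the flags documented in this guide.

```python
import shlex

def build_taxsize_command(in_path, out_path, tree="auto", verbose=False, xmx=None):
    """Assemble a taxsize.sh argument list (hypothetical helper, not part of BBTools)."""
    cmd = ["taxsize.sh", f"in={in_path}", f"out={out_path}", f"tree={tree}"]
    if verbose:
        cmd.append("verbose=t")
    if xmx:
        # xmx is a size string such as "50g" or "200m"
        cmd.append(f"-Xmx{xmx}")
    return cmd

cmd = build_taxsize_command("large_dataset.fasta", "results.txt", xmx="50g")
print(shlex.join(cmd))
```

The returned list can be passed to subprocess.run(cmd, check=True), provided taxsize.sh is on the PATH.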

Output Format

TaxSize generates tab-delimited output with the following columns:

taxID: Taxonomic identifier
bases: Direct sequence length for this taxonomic node
basesC: Cumulative sequence length including all descendant nodes
seqs: Direct sequence count for this taxonomic node
seqsC: Cumulative sequence count including all descendant nodes
nodesC: Cumulative count of taxonomic nodes in the subtree

Sample Output

#taxID	bases	basesC	seqs	seqsC	nodesC
1	0	458923156	0	12453	8734
2	125467	125467	45	45	1
131567	0	458797689	0	12408	8733

The sample shows hierarchical accumulation: higher-level taxa (such as cellular organisms, taxID 131567) carry cumulative statistics from all of their descendants.
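
For downstream analysis, the table can be loaded with a few lines of Python. This is a sketch that assumes the layout of the sample output: a header line starting with "#", tab-separated integer fields, and the six columns in the order listed. The function name read_taxsize is hypothetical.

```python
import io

COLUMNS = ("taxID", "bases", "basesC", "seqs", "seqsC", "nodesC")

def read_taxsize(handle):
    """Parse TaxSize output into {taxID: {column: int}}, skipping '#' header lines."""
    table = {}
    for line in handle:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue
        values = [int(v) for v in line.split("\t")]
        row = dict(zip(COLUMNS, values))
        table[row["taxID"]] = row
    return table

# The three rows from the sample output
sample = io.StringIO(
    "#taxID\tbases\tbasesC\tseqs\tseqsC\tnodesC\n"
    "1\t0\t458923156\t0\t12453\t8734\n"
    "2\t125467\t125467\t45\t45\t1\n"
    "131567\t0\t458797689\t0\t12408\t8733\n"
)
table = read_taxsize(sample)

# Node with the most cumulative sequence
largest = max(table.values(), key=lambda r: r["basesC"])
print(largest["taxID"], largest["basesC"])
```

For a real run, replace the StringIO object with open("taxonomy_sizes.txt").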

Algorithm Details

Two-Phase Processing Strategy: TaxSize first reads the input sequentially to collect direct per-node totals, then percolates those totals up the taxonomic hierarchy:

Phase 1: Sequential Processing

Header Parsing: Extracts taxonomic information from FASTA headers using TaxTree.parseNodeFromHeader()
Direct Accumulation: Maintains IntLongHashMap structures for immediate size and count tracking per taxonomic node
Streaming Architecture: Processes files line-by-line to minimize memory footprint for large datasets
Sequence Tracking: Accumulates both base count and sequence count for each identified taxonomic node
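
The Phase 1 steps above can be sketched in Python as follows. This illustrates the idea and is not the TaxSize source. It assumes headers of the form ">tid|<number>|..."; parse_taxid is a hypothetical stand-in for TaxTree.parseNodeFromHeader(), which may accept other layouts. Skipping sequences whose header yields no TaxID is also an assumption.

```python
import re
from collections import defaultdict

# Assumed header layout: ">tid|<number>|rest of name"
TID_PATTERN = re.compile(r"tid\|(\d+)")

def parse_taxid(header):
    """Return the TaxID embedded in a header, or None if absent (hypothetical parser)."""
    match = TID_PATTERN.search(header)
    return int(match.group(1)) if match else None

def accumulate_direct(lines):
    """Stream FASTA lines once, summing bases and sequence counts per TaxID."""
    bases = defaultdict(int)
    seqs = defaultdict(int)
    current = None  # TaxID of the sequence currently being read
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            current = parse_taxid(line)
            if current is not None:
                seqs[current] += 1
        elif current is not None:
            # Sequence line: add its length to the current node
            bases[current] += len(line)
    return dict(bases), dict(seqs)

fasta = [
    ">tid|562|record one", "ACGT", "AC",
    ">tid|562|record two", "GGG",
    ">no taxonomic label", "TTTT",
    ">tid|2|record three", "A",
]
direct_bases, direct_seqs = accumulate_direct(fasta)
print(direct_bases, direct_seqs)
```

Only the two dictionaries persist between lines, which is why memory tracks the number of distinct TaxIDs rather than the number of sequences.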

Phase 2: Hierarchical Percolation

Bottom-Up Traversal: Uses percolateUp() method to propagate statistics through taxonomic hierarchy
Cumulative Statistics: Calculates cumulative values (basesC, seqsC, nodesC) by traversing parent relationships
Tree Navigation: Uses TaxTree.getParentID() method to traverse parent-child relationships in the taxonomic hierarchy
Termination Detection: Stops traversal when parent ID equals current ID (root detection)
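
The percolation phase can be sketched the same way. Here the dictionary parent_of is a hypothetical stand-in for TaxTree.getParentID(), and the TaxIDs form a toy tree rather than real NCBI taxonomy. Counting nodesC as the number of nodes with direct data in each subtree (including the node itself) is an assumption; the source only describes it as a cumulative node count.

```python
from collections import defaultdict

def percolate_up(direct_bases, direct_seqs, parent_of):
    """Add each node's direct totals to itself and every ancestor up to the root."""
    c_bases = defaultdict(int)
    c_seqs = defaultdict(int)
    c_nodes = defaultdict(int)
    for taxid in set(direct_bases) | set(direct_seqs):
        b = direct_bases.get(taxid, 0)
        s = direct_seqs.get(taxid, 0)
        node = taxid
        while True:
            c_bases[node] += b
            c_seqs[node] += s
            c_nodes[node] += 1  # assumed meaning of nodesC
            parent = parent_of[node]
            if parent == node:  # root detection: the root is its own parent
                break
            node = parent
    return dict(c_bases), dict(c_seqs), dict(c_nodes)

# Toy tree: 1 is the root, 10 is its child, 20 and 30 are children of 10
parent_of = {1: 1, 10: 1, 20: 10, 30: 10}
direct_bases = {10: 5, 20: 100, 30: 50}
direct_seqs = {10: 1, 20: 2, 30: 1}

c_bases, c_seqs, c_nodes = percolate_up(direct_bases, direct_seqs, parent_of)
print(c_bases[10], c_seqs[10], c_nodes[10])
```

In this toy run, node 10 accumulates its own 5 bases plus 150 from its two children, and the root ends with the same cumulative totals because it has no direct sequence of its own.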

Memory Management Strategy

Hash Map Utilization: Uses IntLongHashMap data structure for integer-to-long key-value mappings
Separate Tracking: Maintains distinct maps for direct vs cumulative statistics (sizeMap vs cSizeMap)
Configurable Memory: Supports custom memory allocation via -Xmx parameter for large taxonomic databases
ByteFile Integration: Uses BBTools' ByteFile class for line-by-line FASTA file reading and processing

Performance Characteristics

Linear Complexity: O(n) processing time, where n is the size of the input file (each line is read once, in order)
Hierarchical Overhead: Additional O(k × t) time for percolation, where k is the number of taxonomic nodes with data and t is the taxonomic tree depth
Memory Scaling: Memory usage scales with unique taxonomic node count, not sequence count
I/O Support: Handles compressed input files automatically and supports concurrent I/O operations

Taxonomic Integration

TaxSize uses BBTools' taxonomic infrastructure components:

TaxTree Compatibility: Uses BBTools' TaxTree class for taxonomic tree data structure access
Header Parsing: Supports standard RefSeq and NCBI taxonomic annotation formats
Hierarchy Awareness: Properly handles taxonomic ranks and parent-child relationships
Default Tree Support: Automatic detection and loading of standard taxonomic tree files

Performance Notes

Memory Requirements: Memory usage depends on taxonomic diversity rather than sequence count
Large Datasets: For datasets with >100,000 unique taxa, consider raising -Xmx to 8 GB or more (for example, -Xmx8g)
I/O Support: Compressed input files are automatically detected and processed
Tree Loading: Taxonomic tree loading is a one-time operation; tree size affects startup time

Support

For questions and support:

Email: bbushnell@lbl.gov
Documentation: bbmap.org