TaxSize
Calculates the amount of sequence per taxonomic node from FASTA files annotated with taxonomic information in headers. Processes RefSeq-style headers to aggregate sequence lengths and counts by taxonomic ID, providing both direct and cumulative statistics through the taxonomic hierarchy.
Basic Usage
taxsize.sh in=<file> out=<file> tree=<file>
TaxSize requires a taxonomically annotated FASTA file (typically modified RefSeq), an output file for the size statistics, and a taxonomic tree file for hierarchy traversal.
Parameters
Parameters are organized into core input/output parameters and Java memory management options. The in, out, and tree parameters are required; verbose is optional.
Core Parameters
- in=
- Input FASTA file annotated with taxonomic data in headers, such as modified RefSeq format. Headers must contain taxonomic information that can be parsed to identify TaxIDs. Required parameter.
- out=
- Output file location to write the taxonomic size statistics. Output format includes columns for taxID, bases, basesC (cumulative), seqs, seqsC (cumulative), and nodesC (cumulative nodes). Required parameter.
- tree= or taxtree=
- Path to the taxonomic tree file used for hierarchy traversal and cumulative calculations. Can be set to "auto" to use the default tree file location. Required parameter.
- verbose=f
- Enable verbose output for debugging and detailed processing information. Default: false
Java Parameters
- -Xmx
- Sets Java's memory usage, overriding autodetection. Format: -Xmx20g for 20 gigabytes, -Xmx200m for 200 megabytes. Maximum is typically 85% of physical memory. Default: 2000m
- -eoom
- Causes the process to exit immediately if an out-of-memory exception occurs. Useful for preventing system instability. Requires Java 8u92 or later.
- -da
- Disables Java assertions for potentially improved performance in production environments.
Examples
Basic Taxonomic Size Analysis
taxsize.sh in=refseq_annotated.fasta out=taxonomy_sizes.txt tree=auto
Processes a RefSeq-annotated FASTA file to calculate sequence sizes per taxonomic node using the default taxonomic tree.
Custom Tree Analysis
taxsize.sh in=sequences.fasta out=sizes.tsv tree=custom_taxonomy.tree verbose=t
Analyzes sequences with a custom taxonomic tree file and enables verbose output for detailed processing information.
High-Memory Analysis
taxsize.sh in=large_dataset.fasta out=results.txt tree=auto -Xmx50g
Processes a large dataset with the Java memory allocation raised to 50 GB, for inputs that span many taxonomic nodes.
Output Format
TaxSize generates tab-delimited output with the following columns:
- taxID: Taxonomic identifier
- bases: Total length, in bases, of the sequences assigned directly to this taxonomic node
- basesC: Cumulative length: this node's bases plus those of all descendant nodes
- seqs: Number of sequences assigned directly to this taxonomic node
- seqsC: Cumulative sequence count: this node's sequences plus those of all descendant nodes
- nodesC: Cumulative count of taxonomic nodes in the subtree
Sample Output
#taxID bases basesC seqs seqsC nodesC
1 0 458923156 0 12453 8734
2 125467 125467 45 45 1
131567 0 458797689 0 12408 8733
The sample shows hierarchical accumulation: higher-level taxa such as cellular organisms (taxID 131567) have no directly assigned bases, yet their cumulative columns carry the totals from all descendants.
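For downstream analysis, the output can be loaded into a lookup table keyed by taxID. The following is a minimal Python sketch, assuming the file is tab-delimited with a single header line starting with "#", as described above. The names read_taxsize and TaxSizeRow are hypothetical helpers for illustration and are not part of BBTools.

```python
import io
from typing import Dict, NamedTuple


class TaxSizeRow(NamedTuple):
    """One output row, minus the taxID (hypothetical container)."""
    bases: int
    bases_c: int
    seqs: int
    seqs_c: int
    nodes_c: int


def read_taxsize(handle) -> Dict[int, TaxSizeRow]:
    """Read TaxSize-style output into a dict keyed by taxID.

    Assumes tab-delimited columns in the order
    taxID, bases, basesC, seqs, seqsC, nodesC.
    """
    rows = {}
    for line in handle:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and the header
        tax_id, *counts = (int(field) for field in line.split("\t")[:6])
        rows[tax_id] = TaxSizeRow(*counts)
    return rows


# The sample rows from this section, written with explicit tabs.
sample = (
    "#taxID\tbases\tbasesC\tseqs\tseqsC\tnodesC\n"
    "1\t0\t458923156\t0\t12453\t8734\n"
    "2\t125467\t125467\t45\t45\t1\n"
    "131567\t0\t458797689\t0\t12408\t8733\n"
)
rows = read_taxsize(io.StringIO(sample))
print(rows[131567].bases_c)
```

To read a real result, pass an open file handle for the path given to out= in place of the io.StringIO object.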
Algorithm Details
Two-Phase Processing Strategy: TaxSize makes a single sequential pass over the input file, then percolates the per-node totals up the taxonomic hierarchy:
Phase 1: Sequential Processing
- Header Parsing: Extracts taxonomic information from FASTA headers using TaxTree.parseNodeFromHeader()
- Direct Accumulation: Maintains IntLongHashMap structures for immediate size and count tracking per taxonomic node
- Streaming Architecture: Processes files line-by-line to minimize memory footprint for large datasets
- Sequence Tracking: Accumulates both base count and sequence count for each identified taxonomic node
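The Phase 1 steps above can be illustrated with a short Python sketch. It is not the TaxSize implementation (which is Java and uses TaxTree.parseNodeFromHeader() and IntLongHashMap); it only mirrors the described logic with plain dictionaries. The header layout "tid|<number>|" and the function name accumulate_direct are assumptions made for this example.

```python
import io
import re
from collections import defaultdict

# Assumed header layout for this sketch: ">tid|562|description".
TID_PATTERN = re.compile(r"tid\|(\d+)\|")


def accumulate_direct(lines):
    """Stream FASTA lines and sum bases and sequences per taxID.

    Returns (size_map, seq_map): direct base totals and direct
    sequence counts. Sequences whose header has no recognizable
    taxID are skipped.
    """
    size_map = defaultdict(int)
    seq_map = defaultdict(int)
    current = None  # taxID of the sequence being read, if any
    for line in lines:
        line = line.rstrip("\r\n")
        if line.startswith(">"):
            match = TID_PATTERN.search(line)
            current = int(match.group(1)) if match else None
            if current is not None:
                seq_map[current] += 1
        elif current is not None:
            size_map[current] += len(line)
    return dict(size_map), dict(seq_map)


fasta = io.StringIO(
    ">tid|562|seq1\nACGT\nAC\n"
    ">tid|562|seq2\nGG\n"
    ">tid|1280|seq3\nTTTT\n"
    ">unlabeled\nAAAA\n"
)
sizes, seqs = accumulate_direct(fasta)
print(sizes)  # {562: 8, 1280: 4}
print(seqs)   # {562: 2, 1280: 1}
```

Because only one line is held at a time and the maps are keyed by taxID, memory in this sketch grows with the number of distinct taxa rather than with the size of the file.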
Phase 2: Hierarchical Percolation
- Bottom-Up Traversal: Uses percolateUp() method to propagate statistics through taxonomic hierarchy
- Cumulative Statistics: Calculates cumulative values (basesC, seqsC, nodesC) by traversing parent relationships
- Tree Navigation: Uses TaxTree.getParentID() method to traverse parent-child relationships in the taxonomic hierarchy
- Termination Detection: Stops traversal when parent ID equals current ID (root detection)
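The percolation step can be sketched in Python as follows. This is an illustration under stated assumptions, not the actual percolateUp() code: the tree is a plain parent dictionary standing in for TaxTree.getParentID(), the toy tree collapses intermediate ranks, and nodesC is assumed to count nodes that have directly assigned sequences.

```python
from collections import defaultdict


def percolate_up(size_map, seq_map, parent_of):
    """Propagate direct totals from each node to all of its ancestors.

    parent_of maps taxID -> parent taxID; the root is its own parent,
    which is the termination condition described above.
    """
    c_size = defaultdict(int)
    c_seqs = defaultdict(int)
    c_nodes = defaultdict(int)
    for tax_id in sorted(set(size_map) | set(seq_map)):
        node = tax_id
        while True:
            c_size[node] += size_map.get(tax_id, 0)
            c_seqs[node] += seq_map.get(tax_id, 0)
            c_nodes[node] += 1  # assumption: count nodes with direct data
            parent = parent_of.get(node, node)
            if parent == node:  # root reached
                break
            node = parent
    return dict(c_size), dict(c_seqs), dict(c_nodes)


# Toy tree for illustration only: 1 is the root, intermediate ranks omitted.
parent_of = {1: 1, 131567: 1, 2: 131567, 562: 2, 1280: 2}
size_map = {562: 8, 1280: 4}
seq_map = {562: 2, 1280: 1}

c_size, c_seqs, c_nodes = percolate_up(size_map, seq_map, parent_of)
print(c_size[1], c_seqs[1], c_nodes[1])  # 12 3 2
```

Keeping the cumulative maps separate from the direct maps, as in the sizeMap versus cSizeMap distinction, means the direct totals stay available for the bases and seqs columns.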
Memory Management Strategy
- Hash Map Utilization: Uses IntLongHashMap data structure for integer-to-long key-value mappings
- Separate Tracking: Maintains distinct maps for direct vs cumulative statistics (sizeMap vs cSizeMap)
- Configurable Memory: Supports custom memory allocation via -Xmx parameter for large taxonomic databases
- ByteFile Integration: Uses BBTools' ByteFile class for line-by-line FASTA file reading and processing
Performance Characteristics
- Linear Complexity: O(n) processing time in the size of the input file, since each line is read once
- Hierarchical Overhead: Percolation adds O(t) time for each taxonomic node with direct counts, where t is the taxonomic tree depth
- Memory Scaling: Memory usage scales with unique taxonomic node count, not sequence count
- I/O Support: Handles compressed input files automatically and supports concurrent I/O operations
Taxonomic Integration
TaxSize uses BBTools' taxonomic infrastructure components:
- TaxTree Compatibility: Uses BBTools' TaxTree class for taxonomic tree data structure access
- Header Parsing: Supports standard RefSeq and NCBI taxonomic annotation formats
- Hierarchy Awareness: Properly handles taxonomic ranks and parent-child relationships
- Default Tree Support: Automatic detection and loading of standard taxonomic tree files
Performance Notes
- Memory Requirements: Memory usage depends on taxonomic diversity rather than sequence count
- Large Datasets: For datasets with >100,000 unique taxa, consider raising the memory allocation to 8 GB or more (for example, -Xmx8g)
- I/O Support: Compressed input files are automatically detected and processed
- Tree Loading: Taxonomic tree loading is a one-time operation; tree size affects startup time
Support
For questions and support:
- Email: bbushnell@lbl.gov
- Documentation: bbmap.org