TaxSize

Script: taxsize.sh Package: tax Class: TaxSize.java

Calculates the amount of sequence per taxonomic node from FASTA files annotated with taxonomic information in headers. Processes RefSeq-style headers to aggregate sequence lengths and counts by taxonomic ID, providing both direct and cumulative statistics through the taxonomic hierarchy.

Basic Usage

taxsize.sh in=<file> out=<file> tree=<file>

TaxSize requires a taxonomically annotated FASTA file (typically modified RefSeq), an output file for the size statistics, and a taxonomic tree file for hierarchy traversal.

Parameters

Parameters are organized into core input/output parameters and Java memory management options. The in, out, and tree parameters are required; verbose is optional.

Core Parameters

in=
Input FASTA file annotated with taxonomic data in headers, such as modified RefSeq format. Headers must contain taxonomic information that can be parsed to identify TaxIDs. Required parameter.
out=
Output file location to write the taxonomic size statistics. Output format includes columns for taxID, bases, basesC (cumulative), seqs, seqsC (cumulative), and nodesC (cumulative nodes). Required parameter.
tree= or taxtree=
Path to the taxonomic tree file used for hierarchy traversal and cumulative calculations. Can be set to "auto" to use the default tree file location. Required parameter.
verbose=f
Enable verbose output for debugging and detailed processing information. Default: false

Java Parameters

-Xmx
Sets Java's memory usage, overriding autodetection. Format: -Xmx20g for 20 gigabytes, -Xmx200m for 200 megabytes. Maximum is typically 85% of physical memory. Default: 2000m
-eoom
Causes the process to exit immediately if an out-of-memory exception occurs. Useful for preventing system instability. Requires Java 8u92 or later.
-da
Disables Java assertions for potentially improved performance in production environments.

Examples

Basic Taxonomic Size Analysis

taxsize.sh in=refseq_annotated.fasta out=taxonomy_sizes.txt tree=auto

Processes a RefSeq-annotated FASTA file to calculate sequence sizes per taxonomic node using the default taxonomic tree.

Custom Tree Analysis

taxsize.sh in=sequences.fasta out=sizes.tsv tree=custom_taxonomy.tree verbose=t

Analyzes sequences with a custom taxonomic tree file and enables verbose output for detailed processing information.

High-Memory Analysis

taxsize.sh in=large_dataset.fasta out=results.txt tree=auto -Xmx50g

Processes a large dataset with increased memory allocation (50GB) for handling extensive taxonomic databases.

Output Format

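TaxSize generates tab-delimited output with the following columns: taxID, bases, basesC (cumulative bases), seqs, seqsC (cumulative sequences), and nodesC (cumulative nodes). The bases and seqs columns are the direct statistics for a node; the columns ending in C are accumulated through the taxonomic hierarchy.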

Sample Output

#taxID	bases	basesC	seqs	seqsC	nodesC
1	0	458923156	0	12453	8734
2	125467	125467	45	45	1
131567	0	458797689	0	12408	8733

The sample shows hierarchical accumulation: higher-level taxa (such as cellular organisms, taxID 131567) carry cumulative statistics from all of their descendants.
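
To work with this output downstream, the table can be loaded into a dictionary keyed by taxID. The following is a minimal Python sketch, not part of BBTools: the function name read_taxsize is hypothetical, and the only assumptions about the file are the ones visible in the sample above (tab-delimited, six integer columns, a header line starting with "#").

```python
import csv
import io

COLUMNS = ("taxID", "bases", "basesC", "seqs", "seqsC", "nodesC")

def read_taxsize(handle):
    """Parse TaxSize output into {taxID: {column: int}}, skipping '#' lines.

    Hypothetical helper for illustration; not part of BBTools.
    """
    table = {}
    for fields in csv.reader(handle, delimiter="\t"):
        if not fields or fields[0].startswith("#"):
            continue
        values = [int(x) for x in fields[:len(COLUMNS)]]
        row = dict(zip(COLUMNS, values))
        table[row["taxID"]] = row
    return table

# The three rows from the sample output, reproduced as an in-memory file.
SAMPLE = (
    "#taxID\tbases\tbasesC\tseqs\tseqsC\tnodesC\n"
    "1\t0\t458923156\t0\t12453\t8734\n"
    "2\t125467\t125467\t45\t45\t1\n"
    "131567\t0\t458797689\t0\t12408\t8733\n"
)

table = read_taxsize(io.StringIO(SAMPLE))

# Rank nodes by cumulative bases, largest first.
ranked = sorted(table.values(), key=lambda r: r["basesC"], reverse=True)
for row in ranked:
    print(row["taxID"], row["basesC"], row["seqsC"])
```

For a real run, replace the in-memory sample with open("taxonomy_sizes.txt", newline="") or whatever path was passed to out=.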

Algorithm Details

Two-Phase Processing Strategy: TaxSize first reads the input file sequentially, accumulating sequence lengths and counts per taxonomic ID, and then percolates those totals up the taxonomic hierarchy to produce the cumulative statistics.

Phase 1: Sequential Processing

Phase 2: Hierarchical Percolation

Memory Management Strategy

Performance Characteristics

Taxonomic Integration

TaxSize uses BBTools' taxonomic infrastructure components:

Performance Notes

Support

For questions and support: