SplitByTaxa

Script: splitbytaxa.sh
Package: tax
Class: SplitByTaxa.java

Splits sequences according to their taxonomy, as determined by the sequence name. Sequences should be labeled with a gi number, NCBI taxID, or species name. The tool writes one output file per taxonomic group at the specified taxonomic level. It extracts identifiers with TaxTree.parseNodeFromHeader() and keeps the open output streams in a HashMap.

Basic Usage

splitbytaxa.sh in=<input file> out=<output pattern> tree=<tree file> table=<table file> level=<name or number>

Input may be fasta or fastq, compressed or uncompressed. The output pattern must contain a % symbol, which is replaced with each taxonomic name to create a separate file per taxonomic group.
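The % substitution can be illustrated with a minimal Java sketch. It is not the SplitByTaxa source: the class and method names (OutputPatternSketch, sanitize, expand) are hypothetical, and replacing every % in the pattern is an assumption. The sanitizing rule (spaces become underscores, path separators are removed) follows the description of the out parameter.

```java
public class OutputPatternSketch {

    /** Make a taxonomic name safe for a file name:
     *  spaces become underscores, path separators are dropped. */
    static String sanitize(String taxName) {
        return taxName.replace(' ', '_').replace("/", "").replace("\\", "");
    }

    /** Substitute the sanitized name for the % placeholder.
     *  Assumption: every % in the pattern is replaced. */
    static String expand(String pattern, String taxName) {
        if (!pattern.contains("%")) {
            throw new IllegalArgumentException("Output pattern must contain %: " + pattern);
        }
        return pattern.replace("%", sanitize(taxName));
    }

    public static void main(String[] args) {
        System.out.println(expand("split_%.fq", "Bacteria"));
        System.out.println(expand("genus_%.fa", "Candidatus Pelagibacter"));
    }
}
```

With these inputs the sketch prints split_Bacteria.fq and genus_Candidatus_Pelagibacter.fa.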

Parameters

The parameters control how the taxonomic reference files are loaded (TaxFilter.loadTree() and TaxFilter.loadGiTable()), how the taxonomic level is parsed (TaxTree.parseLevelExtended()), and how ConcurrentReadOutputStream is configured for multi-file output.

Standard parameters

in=<file>
Primary input file. Sequences should be labeled with gi numbers, NCBI taxIDs, or species names in their headers for taxonomic identification. Supports fasta and fastq formats, compressed or uncompressed.
out=<file>
Output file pattern; must contain % symbol. The % will be replaced with the taxonomic name (spaces replaced with underscores, path separators removed) to create separate output files for each taxonomic group. Example: "split_%.fq" creates files like "split_Bacteria.fq", "split_Archaea.fq", etc.
overwrite=f
(ow) Set to false to force the program to abort rather than overwrite an existing file. Default is false for safety.
showspeed=t
(ss) Set to 'f' to suppress display of processing speed statistics during execution. Default is true to show progress information.
ziplevel=2
(zl) Set to 1 (lowest) through 9 (max) to change compression level for output files; lower compression is faster. Only applies when writing compressed output. Default compression level is 2.

Processing parameters

level=phylum
Taxonomic level for splitting sequences, such as "phylum", "class", "order", "family", "genus", or "species". Can also be specified as a number corresponding to the taxonomic level. Sequences will be grouped according to their classification at this taxonomic level. Default is "phylum".
tree=
A taxonomic tree file made by TaxTree, such as tree.taxtree.gz. This file contains the hierarchical taxonomic structure needed for classification. On Genepool systems, use 'tree=auto' to automatically locate the standard taxonomic tree. Required for taxonomic classification.
table=
A table file translating gi numbers to NCBI taxIDs, typically ending in .int1d.gz. Only needed if sequences are identified by gi numbers rather than taxIDs or species names. On Genepool systems, use 'table=auto' to automatically locate the standard gi table. Note: Tree and table files are available in /global/projectb/sandbox/gaag/bbtools/tax on Genepool.

Java Parameters

-Xmx
This will set Java's memory usage, overriding automatic memory detection. Specify like -Xmx20g for 20 gigabytes of RAM, or -Xmx200m for 200 megabytes. The maximum is typically 85% of physical memory. Default is 4g for this tool.
-eoom
This flag will cause the process to exit if an out-of-memory exception occurs. Requires Java 8u92+ and helps prevent hanging when insufficient memory is available.
-da
Disable Java assertions. May provide a small performance boost in production use but reduces debugging information if errors occur.

Examples

Basic Taxonomic Splitting

splitbytaxa.sh in=sequences.fq out=split_%.fq tree=auto table=auto level=phylum

Splits sequences in sequences.fq by phylum, creating separate files like split_Bacteria.fq, split_Archaea.fq, etc. Uses automatic taxonomic reference files on Genepool systems.

Genus-Level Splitting

splitbytaxa.sh in=metagenome.fa out=genus_%.fa tree=taxonomy.taxtree.gz level=genus

Splits sequences at the genus level using a custom taxonomic tree file, creating files like genus_Escherichia.fa, genus_Bacillus.fa, etc.

Species-Level Splitting with Custom Parameters

splitbytaxa.sh in=reads.fastq.gz out=species_%.fastq.gz tree=tree.taxtree.gz table=gi_table.int1d.gz level=species overwrite=t ziplevel=6

Splits compressed reads at the species level, writes compressed output files at a higher compression level (6), and allows existing files to be overwritten.

High Memory Usage

splitbytaxa.sh in=large_dataset.fq out=split_%.fq tree=auto table=auto level=family -Xmx32g

Processes a large dataset with 32GB memory allocation, splitting sequences at the family level.

Algorithm Details

Taxonomic Classification Implementation

SplitByTaxa calls TaxTree.parseNodeFromHeader() to extract a taxonomic identifier from each sequence header, then traverses the tree with tree.getNode() to find the sequence's classification at the specified taxonomic level.
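The idea of resolving a sequence's node to the requested level can be sketched as a walk from the node toward the root. The sketch below is an illustration under stated assumptions, not the TaxTree API: the Node class, the add() and ancestorAtLevel() methods, the self-parented root, and the toy tree are all hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

public class TaxLevelSketch {

    /** Hypothetical stand-in for a taxonomy node; not the TaxTree node type. */
    static final class Node {
        final int id;
        final int parentId;
        final String level;
        final String name;

        Node(int id, int parentId, String level, String name) {
            this.id = id;
            this.parentId = parentId;
            this.level = level;
            this.name = name;
        }
    }

    private final Map<Integer, Node> nodes = new HashMap<>();

    void add(int id, int parentId, String level, String name) {
        nodes.put(id, new Node(id, parentId, level, name));
    }

    /** Walk toward the root until a node at the requested level is found.
     *  Returns null if the lineage has no node at that level. */
    Node ancestorAtLevel(int id, String level) {
        Node n = nodes.get(id);
        while (n != null) {
            if (n.level.equals(level)) {
                return n;
            }
            if (n.parentId == n.id) {
                return null; // assumption: the root is its own parent
            }
            n = nodes.get(n.parentId);
        }
        return null;
    }

    public static void main(String[] args) {
        // Toy tree with a few illustrative nodes; intermediate ranks are omitted.
        TaxLevelSketch tree = new TaxLevelSketch();
        tree.add(1, 1, "root", "root");
        tree.add(2, 1, "superkingdom", "Bacteria");
        tree.add(1224, 2, "phylum", "Proteobacteria");
        tree.add(1236, 1224, "class", "Gammaproteobacteria");
        tree.add(562, 1236, "species", "Escherichia coli");

        Node phylum = tree.ancestorAtLevel(562, "phylum");
        System.out.println(phylum == null ? "unclassified" : phylum.name);
    }
}
```

A sequence whose lineage has no node at the requested level returns null here. How the real tool handles such sequences is not stated in this document.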

Sequence Processing Pipeline

Dynamic Output Stream Management

Uses HashMap<String, ConcurrentReadOutputStream> for on-demand output file creation with ConcurrentReadOutputStream.getStream() initialization:
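As a rough illustration of the on-demand pattern, the following Java sketch creates one stream per taxonomic group the first time that group is seen. It is not the SplitByTaxa source: StringWriter stands in for ConcurrentReadOutputStream so the example runs without touching the file system, and the class and method names are hypothetical.

```java
import java.io.StringWriter;
import java.util.HashMap;
import java.util.Map;

public class OnDemandOutputSketch {

    // One stream per taxonomic group, keyed by group name.
    // StringWriter is a stand-in for the real output stream type.
    private final Map<String, StringWriter> streams = new HashMap<>();

    /** Return the stream for a group, creating it on first use. */
    StringWriter streamFor(String group) {
        return streams.computeIfAbsent(group, g -> new StringWriter());
    }

    /** Append one record to the stream of its group. */
    void write(String group, String record) {
        StringWriter w = streamFor(group);
        w.write(record);
        w.write('\n');
    }

    int openStreamCount() {
        return streams.size();
    }

    String contents(String group) {
        StringWriter w = streams.get(group);
        return w == null ? "" : w.toString();
    }

    public static void main(String[] args) {
        OnDemandOutputSketch out = new OnDemandOutputSketch();
        out.write("Bacteria", ">read1");
        out.write("Archaea", ">read2");
        out.write("Bacteria", ">read3");

        // Only two streams exist, although three records were written.
        System.out.println(out.openStreamCount());
        System.out.print(out.contents("Bacteria"));
    }
}
```

With real file-backed streams, every stream in the map would also need to be closed once the input has been read.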

TaxTree Integration

Uses TaxFilter.loadTree() and TaxFilter.loadGiTable() for taxonomic reference data loading:

I/O Implementation

File Organization Implementation

FileFormat objects handle input/output format detection and compression:

Notes and Considerations

Input Requirements

File Locations (Genepool Systems)

Standard taxonomic reference files are available at /global/projectb/sandbox/gaag/bbtools/tax.

Creating Custom Reference Files

For non-Genepool users or custom taxonomies:

Support

For questions and support: