SplitByTaxa

Script: splitbytaxa.sh Package: tax Class: SplitByTaxa.java

Splits sequences according to their taxonomy, as determined by the sequence name. Sequences should be labeled with a gi number, NCBI taxID, or species name. Creates separate output files for each taxonomic group at a specified taxonomic level using TaxTree.parseNodeFromHeader() for identifier extraction and HashMap-based stream management for file organization.

Basic Usage

splitbytaxa.sh in=<input file> out=<output pattern> tree=<tree file> table=<table file> level=<name or number>

Input may be fasta or fastq, compressed or uncompressed. The output pattern must contain a % symbol which will be replaced with taxonomic names to create separate files for each taxonomic group.

Parameters

The parameters below control which taxonomic reference files are loaded (via TaxFilter.loadTree() and TaxFilter.loadGiTable()), which taxonomic level is used for splitting (parsed by TaxTree.parseLevelExtended()), and how the per-taxon output files are written (via ConcurrentReadOutputStream).

Standard parameters

in=<file>: Primary input file. Sequences should be labeled with gi numbers, NCBI taxIDs, or species names in their headers for taxonomic identification. Supports fasta and fastq formats, compressed or uncompressed.
out=<file>: Output file pattern; must contain % symbol. The % will be replaced with the taxonomic name (spaces replaced with underscores, path separators removed) to create separate output files for each taxonomic group. Example: "split_%.fq" creates files like "split_Bacteria.fq", "split_Archaea.fq", etc.
overwrite=f: (ow) Set to false to force the program to abort rather than overwrite an existing file. Default is false for safety.
showspeed=t: (ss) Set to 'f' to suppress display of processing speed statistics during execution. Default is true to show progress information.
ziplevel=2: (zl) Set to 1 (lowest) through 9 (max) to change compression level for output files; lower compression is faster. Only applies when writing compressed output. Default compression level is 2.

Processing parameters

level=phylum: Taxonomic level for splitting sequences, such as "phylum", "class", "order", "family", "genus", or "species". Can also be specified as a number corresponding to the taxonomic level. Sequences will be grouped according to their classification at this taxonomic level. Default is "phylum".
tree=: A taxonomic tree file made by TaxTree, such as tree.taxtree.gz. This file contains the hierarchical taxonomic structure needed for classification. On Genepool systems, use 'tree=auto' to automatically locate the standard taxonomic tree. Required for taxonomic classification.
table=: A table file translating gi numbers to NCBI taxIDs, typically ending in .int1d.gz. Only needed if sequences are identified by gi numbers rather than taxIDs or species names. On Genepool systems, use 'table=auto' to automatically locate the standard gi table. Note: Tree and table files are available in /global/projectb/sandbox/gaag/bbtools/tax on Genepool.

Java Parameters

-Xmx: This will set Java's memory usage, overriding automatic memory detection. Specify like -Xmx20g for 20 gigabytes of RAM, or -Xmx200m for 200 megabytes. The maximum is typically 85% of physical memory. Default is 4g for this tool.
-eoom: This flag will cause the process to exit if an out-of-memory exception occurs. Requires Java 8u92+ and helps prevent hanging when insufficient memory is available.
-da: Disable Java assertions. May provide a small performance boost in production use but reduces debugging information if errors occur.

Examples

Basic Taxonomic Splitting

splitbytaxa.sh in=sequences.fq out=split_%.fq tree=auto table=auto level=phylum

Splits sequences in sequences.fq by phylum, creating one file per phylum, such as split_Proteobacteria.fq and split_Firmicutes.fq (exact names depend on the taxonomy version in the tree file). Uses automatic taxonomic reference files on Genepool systems.

Genus-Level Splitting

splitbytaxa.sh in=metagenome.fa out=genus_%.fa tree=taxonomy.taxtree.gz level=genus

Splits sequences at the genus level using a custom taxonomic tree file, creating files like genus_Escherichia.fa, genus_Bacillus.fa, etc.

Species-Level Splitting with Custom Parameters

splitbytaxa.sh in=reads.fastq.gz out=species_%.fastq.gz tree=tree.taxtree.gz table=gi_table.int1d.gz level=species overwrite=t ziplevel=6

Splits compressed sequences at species level, creating compressed output files with higher compression (level 6), allowing overwrite of existing files.

High Memory Usage

splitbytaxa.sh in=large_dataset.fq out=split_%.fq tree=auto table=auto level=family -Xmx32g

Processes a large dataset with 32GB memory allocation, splitting sequences at the family level.

Algorithm Details

Taxonomic Classification Implementation

SplitByTaxa calls TaxTree.parseNodeFromHeader() to extract a taxonomic identifier from each sequence header, then follows parent links in the tree (nodes are looked up with tree.getNode()) to classify the sequence at the specified taxonomic level.

Sequence Processing Pipeline

Header Parsing: Calls tree.parseNodeFromHeader(r1.id, true) to extract gi numbers, NCBI taxIDs, or species names from sequence headers
Fallback Resolution: Uses tree.getNodeByName(r1.id) if initial parsing fails
Hierarchical Traversal: Runs a while loop (tn.levelExtended<taxLevelE && tn.id!=tn.pid) that climbs parent links until the node reaches the target level or the root, which is the node that is its own parent
Unknown Assignment: Creates TaxNode(-99, -99, TaxTree.LIFE, TaxTree.LIFE_E, "UNKNOWN") for unclassifiable sequences
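
The traversal step above can be illustrated with a minimal sketch. It is not the BBTools code: ToyNode, TraversalSketch, and the level numbers (10 to 40) are hypothetical stand-ins, and only the loop condition is taken from the description above. The sketch assumes that a larger levelExtended means a broader rank and that the root is its own parent.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy stand-in for a taxonomy node; not the BBTools TaxNode class. */
final class ToyNode {
    final int id;            // this node's identifier
    final int pid;           // parent identifier; the root points at itself
    final int levelExtended; // hypothetical numbering: larger = broader rank
    final String name;

    ToyNode(int id, int pid, int levelExtended, String name) {
        this.id = id;
        this.pid = pid;
        this.levelExtended = levelExtended;
        this.name = name;
    }
}

final class TraversalSketch {
    /** Climbs parent links until the node is at or above the target level, or the root is reached. */
    static ToyNode promote(Map<Integer, ToyNode> tree, ToyNode start, int targetLevel) {
        ToyNode tn = start;
        while (tn.levelExtended < targetLevel && tn.id != tn.pid) {
            tn = tree.get(tn.pid); // step to the parent node
        }
        return tn;
    }

    /** Builds a four-node toy tree: root -> phylum -> genus -> species. */
    static Map<Integer, ToyNode> toyTree() {
        Map<Integer, ToyNode> tree = new HashMap<>();
        tree.put(1, new ToyNode(1, 1, 40, "root"));
        tree.put(2, new ToyNode(2, 1, 30, "Proteobacteria"));
        tree.put(3, new ToyNode(3, 2, 20, "Escherichia"));
        tree.put(4, new ToyNode(4, 3, 10, "Escherichia coli"));
        return tree;
    }
}
```

In this toy tree, promoting the species node with target level 30 stops at the phylum node. A target above every level in the tree stops at the root because of the id != pid check.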

Dynamic Output Stream Management

Uses HashMap<String, ConcurrentReadOutputStream> for on-demand output file creation with ConcurrentReadOutputStream.getStream() initialization:

Stream Caching: HashMap stores ConcurrentReadOutputStream instances keyed by TaxNode.name
File Naming Logic: Uses tn.name.replaceAll("\\s+", "_").replaceAll("[/\\\\]", "") for filesystem-safe names
Buffer Configuration: Sets buff=4 for ConcurrentReadOutputStream buffer management
Stream Initialization: Calls ros.start() to activate buffered I/O streams
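
A rough sketch of the on-demand stream map described above follows. It assumes a plain java.io.StringWriter in place of ConcurrentReadOutputStream so that the example runs without BBTools; LazyWriterSketch and its methods are hypothetical names, not part of the tool.

```java
import java.io.StringWriter;
import java.util.LinkedHashMap;
import java.util.Map;

final class LazyWriterSketch {
    // One writer per taxon name, created the first time that taxon is seen.
    // StringWriter is an in-memory stand-in for a real per-file output stream.
    private final Map<String, StringWriter> writers = new LinkedHashMap<>();

    /** Returns the cached writer for this taxon, creating it on first use. */
    StringWriter writerFor(String taxonName) {
        return writers.computeIfAbsent(taxonName, k -> new StringWriter());
    }

    /** Appends one record, followed by a newline, to the taxon's writer. */
    void write(String taxonName, String record) {
        writerFor(taxonName).append(record).append('\n');
    }

    /** Number of writers opened so far. */
    int openCount() {
        return writers.size();
    }

    /** Everything written for a taxon, or an empty string if it was never seen. */
    String contents(String taxonName) {
        StringWriter w = writers.get(taxonName);
        return w == null ? "" : w.toString();
    }
}
```

Keying the map by taxon name means a group that never appears in the input never gets a stream, which matches the on-demand creation described above.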

TaxTree Integration

Uses TaxFilter.loadTree() and TaxFilter.loadGiTable() for taxonomic reference data loading:

Tree Loading: TaxFilter.loadTree(taxTreeFile) initializes taxonomic hierarchy from *.taxtree.gz files
GI Mapping: TaxFilter.loadGiTable(giTableFile) loads gi-to-taxid translation from *.int1d.gz files
Level Parsing: TaxTree.parseLevelExtended(b) converts string taxonomic levels to integer constants
Node Traversal: Uses tn.pid references for parent node navigation through taxonomic hierarchy
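
Because level= accepts either a name or a number, level parsing can be pictured roughly as below. This is only a sketch: LevelParseSketch is a hypothetical class, and the numeric values are placeholders that preserve the narrow-to-broad ordering. They are not the constants that TaxTree.parseLevelExtended() returns.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

final class LevelParseSketch {
    private static final Map<String, Integer> LEVELS = new LinkedHashMap<>();
    static {
        // Placeholder numbering: only the ordering (narrow to broad) matters here.
        String[] names = {"species", "genus", "family", "order", "class", "phylum"};
        for (int i = 0; i < names.length; i++) {
            LEVELS.put(names[i], i + 1);
        }
    }

    /** Accepts a rank name (case-insensitive) or an integer written as text. */
    static int parseLevel(String s) {
        String key = s.trim().toLowerCase(Locale.ROOT);
        Integer byName = LEVELS.get(key);
        if (byName != null) {
            return byName;
        }
        try {
            return Integer.parseInt(key);
        } catch (NumberFormatException e) {
            throw new IllegalArgumentException("Unknown taxonomic level: " + s);
        }
    }
}
```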

I/O Implementation

Memory Allocation: Default -Xmx4g with calcXmx() dynamic memory calculation using freeRam(1000m, 84) for 84% RAM utilization
Concurrent Processing: ConcurrentReadInputStream.getReadInputStream() with maxReads parameter for multi-threaded input processing
Buffer Strategy: Shared.capBuffers(4) configures 4-buffer system for stream operations
Format Handling: FileFormat.testInput() and FileFormat.testOutput() with FASTQ default for format auto-detection

File Organization Implementation

FileFormat objects handle input/output format detection and compression:

Pattern Substitution: out1.replaceFirst("%", tn.name) creates unique output filenames per taxonomic group
Format Preservation: FileFormat.testOutput() maintains input format (FASTA/FASTQ) in output files
Paired-End Handling: Supports ffout1 and ffout2 for paired-end read preservation
Lazy Creation: Output files are created only when the first sequence from a taxonomic group is encountered
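
The file naming rules can be sketched as follows. The two replaceAll patterns and the replaceFirst call are the ones quoted in this document. OutputNameSketch, safeName, and outputPath are hypothetical names, and the sketch assumes that the sanitized name is what gets substituted for %. Matcher.quoteReplacement is an addition here, so that a taxon name containing $ or a backslash is inserted literally.

```java
import java.util.regex.Matcher;

final class OutputNameSketch {
    /** Collapses each whitespace run to one underscore and drops path separators. */
    static String safeName(String taxonName) {
        return taxonName.replaceAll("\\s+", "_").replaceAll("[/\\\\]", "");
    }

    /** Substitutes the sanitized taxon name for the first % in the output pattern. */
    static String outputPath(String pattern, String taxonName) {
        if (!pattern.contains("%")) {
            throw new IllegalArgumentException("Output pattern must contain a % symbol: " + pattern);
        }
        // quoteReplacement keeps '$' and '\' in the name from being read as regex group references
        return pattern.replaceFirst("%", Matcher.quoteReplacement(safeName(taxonName)));
    }
}
```

For example, outputPath("split_%.fq", "Escherichia coli") returns "split_Escherichia_coli.fq". Because replaceFirst is used, any second % in the pattern is left unchanged.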

Notes and Considerations

Input Requirements

Sequences must have taxonomic identifiers in their headers (gi numbers, taxIDs, or species names)
Requires taxonomic reference files: tree file is mandatory, gi table only needed for gi-based identification
Output pattern must contain a % symbol for taxonomic name substitution; because the substitution uses replaceFirst, only the first % is replaced

File Locations (Genepool Systems)

Standard taxonomic reference files are available at:

Location: /global/projectb/sandbox/gaag/bbtools/tax
Tree files: *.taxtree.gz (taxonomic hierarchy)
GI tables: *.int1d.gz (gi number to taxID translation)
Usage: Use 'tree=auto table=auto' on Genepool systems

Creating Custom Reference Files

For non-Genepool users or custom taxonomies:

Tree creation: Use taxtree.sh to create custom taxonomic trees
GI table creation: Use gitable.sh to create gi-to-taxID translation tables
Compatibility: Ensure reference files are compatible with NCBI taxonomy structure

Support

For questions and support:

Email: bbushnell@lbl.gov
Documentation: bbmap.org