SplitByTaxa
Splits sequences according to their taxonomy, as determined by the sequence name. Sequences should be labeled with a gi number, NCBI taxID, or species name. One output file is created per taxonomic group at the specified taxonomic level. Identifiers are extracted with TaxTree.parseNodeFromHeader(), and the output streams are tracked in a HashMap.
Basic Usage
splitbytaxa.sh in=<input file> out=<output pattern> tree=<tree file> table=<table file> level=<name or number>
Input may be fasta or fastq, compressed or uncompressed. The output pattern must contain a % symbol which will be replaced with taxonomic names to create separate files for each taxonomic group.
Parameters
The parameters below specify the taxonomic reference files (loaded via TaxFilter.loadTree() and TaxFilter.loadGiTable()), the taxonomic level (parsed by TaxTree.parseLevelExtended()), and the settings for the ConcurrentReadOutputStream instances that write the per-group output files.
Standard parameters
- in=<file>
- Primary input file. Sequences should be labeled with gi numbers, NCBI taxIDs, or species names in their headers for taxonomic identification. Supports fasta and fastq formats, compressed or uncompressed.
- out=<file>
- Output file pattern; must contain % symbol. The % will be replaced with the taxonomic name (spaces replaced with underscores, path separators removed) to create separate output files for each taxonomic group. Example: "split_%.fq" creates files like "split_Bacteria.fq", "split_Archaea.fq", etc.
- overwrite=f
- (ow) Set to false to force the program to abort rather than overwrite an existing file. Default is false for safety.
- showspeed=t
- (ss) Set to 'f' to suppress display of processing speed statistics during execution. Default is true to show progress information.
- ziplevel=2
- (zl) Set to 1 (lowest) through 9 (max) to change compression level for output files; lower compression is faster. Only applies when writing compressed output. Default compression level is 2.
Processing parameters
- level=phylum
- Taxonomic level for splitting sequences, such as "phylum", "class", "order", "family", "genus", or "species". Can also be specified as a number corresponding to the taxonomic level. Sequences will be grouped according to their classification at this taxonomic level. Default is "phylum".
- tree=
- A taxonomic tree file made by TaxTree, such as tree.taxtree.gz. This file contains the hierarchical taxonomic structure needed for classification. On Genepool systems, use 'tree=auto' to automatically locate the standard taxonomic tree. Required for taxonomic classification.
- table=
- A table file translating gi numbers to NCBI taxIDs, typically ending in .int1d.gz. Only needed if sequences are identified by gi numbers rather than taxIDs or species names. On Genepool systems, use 'table=auto' to automatically locate the standard gi table. Note: Tree and table files are available in /global/projectb/sandbox/gaag/bbtools/tax on Genepool.
Java Parameters
- -Xmx
- Sets Java's memory usage, overriding automatic memory detection. Specify -Xmx20g for 20 gigabytes of RAM, or -Xmx200m for 200 megabytes. The automatically detected maximum is typically about 85% of physical memory. Default is 4g for this tool.
- -eoom
- This flag will cause the process to exit if an out-of-memory exception occurs. Requires Java 8u92+ and helps prevent hanging when insufficient memory is available.
- -da
- Disable Java assertions. May provide a small performance boost in production use but reduces debugging information if errors occur.
Examples
Basic Taxonomic Splitting
splitbytaxa.sh in=sequences.fq out=split_%.fq tree=auto table=auto level=phylum
Splits sequences in sequences.fq by phylum, creating one file per phylum, such as split_Proteobacteria.fq and split_Firmicutes.fq (actual names depend on the taxonomy tree in use). Uses automatic taxonomic reference files on Genepool systems.
Genus-Level Splitting
splitbytaxa.sh in=metagenome.fa out=genus_%.fa tree=taxonomy.taxtree.gz level=genus
Splits sequences at the genus level using a custom taxonomic tree file, creating files like genus_Escherichia.fa, genus_Bacillus.fa, etc.
Species-Level Splitting with Custom Parameters
splitbytaxa.sh in=reads.fastq.gz out=species_%.fastq.gz tree=tree.taxtree.gz table=gi_table.int1d.gz level=species overwrite=t ziplevel=6
Splits compressed sequences at species level, creating compressed output files with higher compression (level 6), allowing overwrite of existing files.
High Memory Usage
splitbytaxa.sh in=large_dataset.fq out=split_%.fq tree=auto table=auto level=family -Xmx32g
Processes a large dataset with 32GB memory allocation, splitting sequences at the family level.
Algorithm Details
Taxonomic Classification Implementation
For each sequence, SplitByTaxa calls TaxTree.parseNodeFromHeader() to extract a taxonomic identifier from the header, then uses tree.getNode() to walk up the tree to the node at the specified taxonomic level.
Sequence Processing Pipeline
- Header Parsing: Calls tree.parseNodeFromHeader(r1.id, true) to extract gi numbers, NCBI taxIDs, or species names from sequence headers
- Fallback Resolution: Uses tree.getNodeByName(r1.id) if initial parsing fails
- Hierarchical Traversal: Runs a while loop (tn.levelExtended<taxLevelE && tn.id!=tn.pid) that moves up the taxonomic tree until the node reaches the target level or a node that is its own parent (the root)
- Unknown Assignment: Creates TaxNode(-99, -99, TaxTree.LIFE, TaxTree.LIFE_E, "UNKNOWN") for unclassifiable sequences
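The pipeline above can be illustrated with a minimal Python sketch. It is not the tool's Java code: the Node class, the numeric level values, the toy tree, and the parse_header and get_by_name callables are all hypothetical stand-ins, and the sketch assumes that larger level numbers mean broader ranks and that the root is its own parent.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Node:
    id: int     # taxonomic ID
    pid: int    # parent ID; the root is its own parent
    level: int  # hypothetical numeric rank, larger = broader
    name: str

# Placeholder for unclassifiable sequences (the level value here is arbitrary).
UNKNOWN = Node(-99, -99, 0, "UNKNOWN")

def promote(node, nodes, target_level):
    """Climb parent links until the level reaches target_level or the root."""
    if node is None:
        return UNKNOWN
    while node.level < target_level and node.id != node.pid:
        node = nodes[node.pid]
    return node

def classify(header, parse_header, get_by_name, nodes, target_level):
    """Try header parsing first, then a lookup by name, then promote."""
    node = parse_header(header)
    if node is None:
        node = get_by_name(header)
    return promote(node, nodes, target_level)

# Toy tree; IDs, levels, and names are for illustration only.
nodes = {
    1: Node(1, 1, 9, "root"),
    2: Node(2, 1, 5, "ExamplePhylum"),
    3: Node(3, 2, 2, "ExampleGenus"),
    4: Node(4, 3, 1, "Example species"),
}
```

With this toy tree, promoting node 4 to level 5 stops at ExamplePhylum, and a header that neither lookup can resolve ends up in the UNKNOWN group.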
Dynamic Output Stream Management
Output files are created on demand: a HashMap<String, ConcurrentReadOutputStream> caches one stream per taxonomic group, each initialized with ConcurrentReadOutputStream.getStream():
- Stream Caching: HashMap stores ConcurrentReadOutputStream instances keyed by TaxNode.name
- File Naming Logic: Uses tn.name.replaceAll("\\s+", "_").replaceAll("[/\\\\]", "") for filesystem-safe names
- Buffer Configuration: Sets buff=4 for ConcurrentReadOutputStream buffer management
- Stream Initialization: Calls ros.start() to activate buffered I/O streams
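As a rough illustration of the caching and naming steps listed above, the following Python sketch keeps one writer per taxon name and opens it on first use. The LazyWriters class and the in-memory opener are hypothetical; only the two name-cleaning substitutions mirror the expressions quoted above.

```python
import io
import re

def safe_name(name):
    """Replace whitespace runs with '_' and drop slashes and backslashes."""
    return re.sub(r"[/\\]", "", re.sub(r"\s+", "_", name))

class LazyWriters:
    """Cache of output streams keyed by taxon name, opened on first use."""

    def __init__(self, pattern, opener):
        self.pattern = pattern  # output pattern containing '%'
        self.opener = opener    # callable that turns a path into a stream
        self.streams = {}

    def get(self, taxon_name):
        stream = self.streams.get(taxon_name)
        if stream is None:
            # Substitute the cleaned name for the first '%' only.
            path = self.pattern.replace("%", safe_name(taxon_name), 1)
            stream = self.opener(path)
            self.streams[taxon_name] = stream
        return stream

    def close(self):
        for stream in self.streams.values():
            stream.close()

# Demonstration with in-memory streams instead of real files.
opened = []

def open_in_memory(path):
    opened.append(path)
    return io.StringIO()

writers = LazyWriters("split_%.fq", open_in_memory)
writers.get("Homo sapiens").write("@read1\n")
writers.get("Homo sapiens").write("@read2\n")
```

Both writes go to the same stream, so only one path is ever opened for that taxon; passing the built-in open (wrapped to use write mode) as the opener would write real files instead.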
TaxTree Integration
Uses TaxFilter.loadTree() and TaxFilter.loadGiTable() for taxonomic reference data loading:
- Tree Loading: TaxFilter.loadTree(taxTreeFile) initializes taxonomic hierarchy from *.taxtree.gz files
- GI Mapping: TaxFilter.loadGiTable(giTableFile) loads gi-to-taxid translation from *.int1d.gz files
- Level Parsing: TaxTree.parseLevelExtended(b) converts string taxonomic levels to integer constants
- Node Traversal: Uses tn.pid references for parent node navigation through taxonomic hierarchy
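Since level= accepts either a name or a number, the level-parsing step can be pictured with the short Python sketch below. The rank list and its numbering are invented for illustration; the real integer constants are defined in TaxTree and are not reproduced here.

```python
# Hypothetical rank numbering for illustration only; the actual
# constants used by TaxTree.parseLevelExtended() may differ.
LEVELS = ["species", "genus", "family", "order", "class", "phylum"]

def parse_level(value):
    """Accept a rank name (case-insensitive) or a number given as text."""
    text = value.strip().lower()
    if text.isdigit():
        return int(text)
    if text in LEVELS:
        return LEVELS.index(text) + 1
    raise ValueError(f"unknown taxonomic level: {value!r}")
```

Under this made-up numbering, parse_level("genus") and parse_level("2") name the same level.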
I/O Implementation
- Memory Allocation: Default is -Xmx4g; calcXmx() computes the heap size dynamically via freeRam(1000m, 84), targeting 84% of RAM
- Concurrent Processing: ConcurrentReadInputStream.getReadInputStream(), called with the maxReads parameter, supplies reads through a multi-threaded input stream
- Buffer Strategy: Shared.capBuffers(4) caps the number of stream buffers at 4
- Format Handling: FileFormat.testInput() and FileFormat.testOutput() auto-detect the file format, defaulting to FASTQ
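The memory arithmetic can be made concrete with a small Python sketch. It assumes that the 84 in freeRam(1000m, 84) is a percentage of detected memory, as stated above, and that 1000m acts as a fallback when memory cannot be detected; the fallback role is an assumption, and the function names are hypothetical.

```python
def xmx_megabytes(detected_mb, percent=84, fallback_mb=1000):
    """Return a heap size in MB: a percentage of detected memory,
    or the fallback when no memory figure is available (assumed behavior)."""
    if detected_mb is None or detected_mb <= 0:
        return fallback_mb
    return detected_mb * percent // 100

def xmx_flag(detected_mb):
    """Format the heap size as a Java -Xmx flag in megabytes."""
    return f"-Xmx{xmx_megabytes(detected_mb)}m"
```

For a machine reporting 64000 MB, this yields -Xmx53760m. Passing -Xmx explicitly on the command line bypasses any such calculation.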
File Organization Implementation
FileFormat objects handle input/output format detection and compression:
- Pattern Substitution: out1.replaceFirst("%", tn.name) creates unique output filenames per taxonomic group
- Format Preservation: FileFormat.testOutput() maintains input format (FASTA/FASTQ) in output files
- Paired-End Handling: Supports ffout1 and ffout2 for paired-end read preservation
- Lazy Creation: Output files created only when first sequence from taxonomic group is encountered
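The pattern substitution step behaves like a replace-first operation, which the Python sketch below imitates. The output_path function and its error check are illustrative only; whether the tool itself rejects a pattern without % in this way is not stated here.

```python
def output_path(pattern, taxon_name):
    """Build an output filename by replacing the first '%' in the pattern."""
    if "%" not in pattern:
        raise ValueError("output pattern must contain a % symbol")
    # count=1 mirrors replace-first semantics: later '%' characters stay as-is.
    return pattern.replace("%", taxon_name, 1)
```

For example, output_path("split_%.fq", "Bacteria") gives "split_Bacteria.fq", while a second % in the pattern would be left untouched.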
Notes and Considerations
Input Requirements
- Sequences must have taxonomic identifiers in their headers (gi numbers, taxIDs, or species names)
- Requires taxonomic reference files: tree file is mandatory, gi table only needed for gi-based identification
- Output pattern must contain a % symbol for taxonomic name substitution; because substitution uses replaceFirst, only the first % is replaced
File Locations (Genepool Systems)
Standard taxonomic reference files are available at:
- Location: /global/projectb/sandbox/gaag/bbtools/tax
- Tree files: *.taxtree.gz (taxonomic hierarchy)
- GI tables: *.int1d.gz (gi number to taxID translation)
- Usage: Use 'tree=auto table=auto' on Genepool systems
Creating Custom Reference Files
For non-Genepool users or custom taxonomies:
- Tree creation: Use taxtree.sh to create custom taxonomic trees
- GI table creation: Use gitable.sh to create gi-to-taxID translation tables
- Compatibility: Ensure reference files are compatible with NCBI taxonomy structure
Support
For questions and support:
- Email: bbushnell@lbl.gov
- Documentation: bbmap.org