Taxonomy
Prints the full taxonomy of a string. The string may be a gi number, an NCBI taxID, or a Latin name. An NCBI identifier should be either a bare number or ncbi|number; a gi number should be gi|number. Note: It is more convenient to use taxonomy.jgi-psf.org.
Basic Usage
taxonomy.sh tree=<tree file> <identifier>
taxonomy.sh tree=<tree file> in=<file>
The taxonomy tool accepts organism identifiers as command-line arguments or reads them from an input file. Identifiers can be gi numbers (gi|123456), NCBI taxIDs (9606), or Latin names (homo_sapiens).
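The identifier conventions above can be illustrated with a minimal Java sketch. The class IdentifierKindSketch and its classify() method are hypothetical names introduced here for illustration; they are not part of BBTools, and the tool's actual parsing may differ (for example, in how it treats malformed input).

```java
import java.util.Locale;

/** Hypothetical helper for illustration only; not part of BBTools. */
class IdentifierKindSketch {

    enum Kind { GI, TAXID, NAME }

    /** Classifies one identifier by the prefix conventions described above. */
    static Kind classify(String raw) {
        String s = raw.trim().toLowerCase(Locale.ROOT);
        if (s.startsWith("gi|") && isDigits(s.substring(3))) {
            return Kind.GI;       // gi|number
        }
        if (s.startsWith("ncbi|") && isDigits(s.substring(5))) {
            return Kind.TAXID;    // ncbi|number
        }
        if (isDigits(s)) {
            return Kind.TAXID;    // bare number
        }
        // Assumption of this sketch: anything else is treated as a Latin name.
        return Kind.NAME;
    }

    /** True if the string is non-empty and contains only ASCII digits. */
    static boolean isDigits(String s) {
        if (s.isEmpty()) {
            return false;
        }
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c < '0' || c > '9') {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        String[] examples = {"gi|123456", "ncbi|9606", "9606", "homo_sapiens"};
        for (String id : examples) {
            System.out.println(id + "\t" + classify(id));
        }
    }
}
```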
Parameters
Parameters control input/output handling, taxonomy database files, output formatting, and filtering options. The tool requires taxonomy database files, which can be generated with the taxtree.sh and gitable.sh tools.
Processing parameters
- in=<file>
- A file containing named sequences, or just the names. Can be FASTA, FASTQ, or text format.
- out=<file>
- Output file. If blank, use stdout. Results will contain taxonomic classifications.
- tree=<file>
- Specify a TaxTree file like tree.taxtree.gz. On Genepool, use 'auto'. This file contains the taxonomic hierarchy structure.
- gi=<file>
- Specify a gitable file like gitable.int1d.gz. Only needed if gi numbers will be used. On Genepool, use 'auto'. Maps gi numbers to taxIDs.
- accession=
- Specify one or more comma-delimited NCBI accession-to-taxID files. Only needed if accessions will be used; requires ~45GB of memory. On Genepool, use 'auto'.
- level=null
- Set to a taxonomic level like phylum to just print that level. Options include: species, genus, family, order, class, phylum, kingdom, superkingdom.
- minlevel=-1
- For multi-level printing, do not print levels below this. Use -1 for no minimum restriction.
- maxlevel=life
- For multi-level printing, do not print levels above this. Default is 'life' which includes all levels.
- silva=f
- Parse headers using Silva or semicolon-delimited syntax. Set to true when working with Silva-formatted taxonomic strings.
- taxpath=auto
- Set the path to taxonomy files; auto only works at NERSC. Specify custom path for taxonomy database files.
Additional Processing Parameters
- counts=<file>
- Output file for taxonomic counts. Writes the number of sequences classified at each taxonomic node.
- verbose=f
- Print verbose status messages during processing. Helpful for debugging and monitoring progress.
- table=<file>
- Alias for gi parameter. Specify gitable file for gi number to taxID mapping.
- printname=t
- Print the query name before the taxonomy result. Useful for tracking which results correspond to which queries.
- reverse=t
- Reverse the order of taxonomic levels in output. When true, prints from specific to general (species to kingdom).
- unite=f
- Enable UNITE database mode for fungal taxonomy. Adjusts parsing for UNITE-specific formatting.
- simple=f
- Skip non-canonical taxonomic levels. Only prints standard taxonomic ranks (kingdom, phylum, class, order, family, genus, species).
- column=-1
- If set to a non-negative integer, parse the taxonomy information from this column in a tab-delimited file. Useful for processing structured data files.
- name=<string>
- Specify organism names directly as comma-delimited list. Alternative to providing names as separate arguments.
- names=<string>
- Alias for name parameter. Specify organism names directly as comma-delimited list.
- id=<string>
- Specify organism IDs directly as comma-delimited list. Can be taxIDs, gi numbers, or accessions.
- ids=<string>
- Alias for id parameter. Specify organism IDs directly as comma-delimited list.
Java Parameters
- -Xmx
- This will set Java's memory usage, overriding autodetection. -Xmx20g will specify 20 gigs of RAM, and -Xmx200m will specify 200 megs. The max is typically 85% of physical memory. Large taxonomy databases require substantial memory.
- -eoom
- This flag will cause the process to exit if an out-of-memory exception occurs. Requires Java 8u92+. Useful for preventing hanging when taxonomy databases exceed available memory.
- -da
- Disable assertions. May provide minor performance improvement in production environments.
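As a worked example of the -Xmx suffixes, the following minimal sketch converts a flag such as -Xmx20g into bytes, using the JVM convention that k, m, and g are powers of 1024. The class XmxSketch and the method xmxToBytes() are hypothetical and exist only for this illustration; Runtime.getRuntime().maxMemory() is the standard call for reading the heap limit the running JVM actually received.

```java
import java.util.Locale;

/** Hypothetical helper for illustration only; not part of BBTools. */
class XmxSketch {

    /** Converts a flag such as "-Xmx20g" or "-Xmx200m" into a byte count. */
    static long xmxToBytes(String flag) {
        String s = flag.trim();
        if (!s.startsWith("-Xmx") || s.length() < 5) {
            throw new IllegalArgumentException("Not an -Xmx flag: " + flag);
        }
        String body = s.substring(4).toLowerCase(Locale.ROOT);
        char last = body.charAt(body.length() - 1);
        long multiplier = 1L;
        if (last == 'k') {
            multiplier = 1024L;
        } else if (last == 'm') {
            multiplier = 1024L * 1024L;
        } else if (last == 'g') {
            multiplier = 1024L * 1024L * 1024L;
        }
        // Strip the unit suffix, if any, before parsing the number.
        String digits = (multiplier == 1L) ? body : body.substring(0, body.length() - 1);
        return Long.parseLong(digits) * multiplier;
    }

    public static void main(String[] args) {
        System.out.println("-Xmx20g  = " + xmxToBytes("-Xmx20g") + " bytes");
        System.out.println("-Xmx200m = " + xmxToBytes("-Xmx200m") + " bytes");
        // Heap limit of the JVM running this sketch.
        System.out.println("this JVM = " + Runtime.getRuntime().maxMemory() + " bytes");
    }
}
```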
Examples
Basic Organism Lookup
taxonomy.sh tree=tree.taxtree.gz homo_sapiens canis_lupus 9606
Look up taxonomy for human (by Latin name), gray wolf (Canis lupus, by Latin name), and human again (by NCBI taxID 9606).
File-Based Processing
taxonomy.sh tree=tree.taxtree.gz gi=gitable.int1d.gz in=sequences.fasta out=taxonomy_results.txt
Process a FASTA file containing sequences with gi numbers in headers, outputting complete taxonomic classifications.
Specific Taxonomic Level
taxonomy.sh tree=tree.taxtree.gz level=genus in=species_list.txt
Extract only the genus-level classification for organisms listed in the input file.
Silva Database Processing
taxonomy.sh tree=tree.taxtree.gz silva=t in=silva_sequences.fasta
Process sequences with Silva-formatted taxonomic headers, using semicolon-delimited taxonomy strings.
Taxonomic Range Filtering
taxonomy.sh tree=tree.taxtree.gz minlevel=family maxlevel=kingdom in=organisms.txt
Show only taxonomic levels from family up to kingdom, excluding species and genus levels.
Algorithm Details
The taxonomy tool implements taxonomic classification using PrintTaxonomy.java with TaxTree data structures and multi-stage identifier resolution through specialized mapping classes.
Core Processing Pipeline
The PrintTaxonomy class orchestrates taxonomic lookup through these processing methods:
- processNames(): Direct command-line identifier processing with taxLevelExtended comparison
- processFile(): TextFile line-by-line processing with optional keyColumn extraction
- processReads(): ConcurrentReadInputStream processing for FASTA/FASTQ sequences
- parseNodeFromHeader(): Delegates to TaxTree.parseNodeFromHeader() for identifier extraction
TaxTree Data Structure Implementation
The TaxTree class maintains taxonomic hierarchy using array-based node storage with HashMap lookup acceleration:
- nodes array: Direct access to TaxNode objects by taxonomic ID
- nameMap HashMap: O(1) lookup from scientific names to node lists
- getNodesByNameExtended(): Returns List<TaxNode> for name-based queries
- levelExtended fields: Numeric level encoding for hierarchy traversal
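The combination of an ID-indexed array and a name-to-nodes map can be sketched as follows. This is a toy model under stated assumptions, not the TaxTree source: the class TaxStoreSketch, its methods, and the name normalization (lowercase, spaces to underscores) are all hypothetical, and the four-node tree omits the intermediate ranks of the real NCBI taxonomy. Names map to lists because more than one node can share a name.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

/** Hypothetical miniature of an array-plus-HashMap taxonomy store; not BBTools code. */
class TaxStoreSketch {

    static final class Node {
        final int id;
        final int pid;      // parent ID; the root points to itself
        final String name;

        Node(int id, int pid, String name) {
            this.id = id;
            this.pid = pid;
            this.name = name;
        }
    }

    private final Node[] nodes;  // index == taxonomic ID
    private final Map<String, List<Node>> nameMap = new HashMap<String, List<Node>>();

    TaxStoreSketch(int maxId) {
        nodes = new Node[maxId + 1];
    }

    void add(int id, int pid, String name) {
        Node n = new Node(id, pid, name);
        nodes[id] = n;
        String key = normalize(name);
        List<Node> list = nameMap.get(key);
        if (list == null) {
            list = new ArrayList<Node>();
            nameMap.put(key, list);
        }
        list.add(n);
    }

    /** Direct array access by ID; returns null for unknown IDs. */
    Node getNode(int id) {
        return (id >= 0 && id < nodes.length) ? nodes[id] : null;
    }

    /** Hash lookup by name; returns an empty list when nothing matches. */
    List<Node> getNodesByName(String name) {
        List<Node> hits = nameMap.get(normalize(name));
        return hits == null ? Collections.<Node>emptyList() : hits;
    }

    /** Assumed normalization for this sketch: lowercase, spaces to underscores. */
    static String normalize(String name) {
        return name.trim().toLowerCase(Locale.ROOT).replace(' ', '_');
    }

    /** Toy tree; intermediate ranks are deliberately omitted. */
    static TaxStoreSketch toyTree() {
        TaxStoreSketch t = new TaxStoreSketch(9606);
        t.add(1, 1, "root");
        t.add(9604, 1, "Hominidae");
        t.add(9605, 9604, "Homo");
        t.add(9606, 9605, "Homo sapiens");
        return t;
    }

    public static void main(String[] args) {
        TaxStoreSketch t = toyTree();
        System.out.println(t.getNode(9606).name);
        System.out.println(t.getNodesByName("homo_sapiens").get(0).id);
    }
}
```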
Identifier Resolution Classes
Identifiers are resolved through three specialized mapping paths, plus direct taxID lookup:
- GiToTaxid.initialize(): Loads integer array mapping from gi numbers to taxIDs
- AccessionToTaxid.load(): HashMap-based accession string to taxID mapping
- TaxTree.parseNodeFromHeader(): Pattern matching for embedded identifiers in sequence headers
- Direct taxID lookup: Array index access via tree.getNode(taxID)
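To make the gi path concrete, here is a minimal sketch of an int-array table combined with a regular expression that pulls a gi number out of a sequence header. The class GiTableSketch, its methods, the -1 "not found" convention, and the toy mapping of gi 123456 to taxID 9606 are assumptions made for illustration; they do not reproduce GiToTaxid or TaxTree.parseNodeFromHeader().

```java
import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical gi-to-taxID table for illustration only; not BBTools code. */
class GiTableSketch {

    private static final Pattern GI = Pattern.compile("gi\\|(\\d+)");

    private final int[] giToTaxid;  // index == gi number, value == taxID or -1

    GiTableSketch(int maxGi) {
        giToTaxid = new int[maxGi + 1];
        Arrays.fill(giToTaxid, -1);
    }

    void put(int gi, int taxid) {
        giToTaxid[gi] = taxid;
    }

    /** Array lookup; returns -1 when the gi number is out of range or unmapped. */
    int lookup(int gi) {
        return (gi >= 0 && gi < giToTaxid.length) ? giToTaxid[gi] : -1;
    }

    /** Extracts the first gi|number from a header; returns -1 if none is found. */
    static int parseGi(String header) {
        Matcher m = GI.matcher(header);
        if (!m.find()) {
            return -1;
        }
        try {
            return Integer.parseInt(m.group(1));
        } catch (NumberFormatException e) {
            return -1;  // digit run too large for an int
        }
    }

    /** Header to taxID in two steps: parse the gi number, then index the array. */
    int resolveHeader(String header) {
        int gi = parseGi(header);
        return gi < 0 ? -1 : lookup(gi);
    }

    public static void main(String[] args) {
        GiTableSketch table = new GiTableSketch(123456);
        table.put(123456, 9606);  // toy mapping, not real NCBI data
        System.out.println(table.resolveHeader(">gi|123456|toy header"));
    }
}
```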
Tree Traversal Algorithm
Lineage generation uses parent node traversal with level-based filtering:
- Parent traversal: while loop using tn=tree.getNode(tn.pid) until tn.id==tn.pid
- Level filtering: tn.levelExtended comparisons against minLevelExtended/maxLevelExtended
- Canonical filtering: tn.isSimple() method excludes non-standard taxonomic ranks
- Output formatting: TextStreamWriter with tab-delimited level/id/name structure
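The traversal and filtering steps above can be sketched as a single loop. The class LineageSketch, the numeric level codes, and the toy nodes are hypothetical; only the loop shape (follow pid until a node is its own parent, keeping nodes whose level falls within the min/max bounds) is taken from the description above.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical lineage walker for illustration only; not BBTools code. */
class LineageSketch {

    // Assumed numeric level codes for this sketch; larger means more general.
    static final int SPECIES = 1;
    static final int GENUS = 2;
    static final int FAMILY = 3;
    static final int LIFE = 4;

    static final class Node {
        final int id;
        final int pid;
        final int level;
        final String levelName;
        final String name;

        Node(int id, int pid, int level, String levelName, String name) {
            this.id = id;
            this.pid = pid;
            this.level = level;
            this.levelName = levelName;
            this.name = name;
        }
    }

    /** Walks from startId to the root, keeping levels in [minLevel, maxLevel]. */
    static List<String> lineage(Node[] nodes, int startId, int minLevel, int maxLevel) {
        List<String> rows = new ArrayList<String>();
        Node tn = nodes[startId];
        while (tn != null) {
            if (tn.level >= minLevel && tn.level <= maxLevel) {
                // Tab-delimited level / id / name row.
                rows.add(tn.levelName + "\t" + tn.id + "\t" + tn.name);
            }
            if (tn.id == tn.pid) {
                break;  // the root is its own parent
            }
            tn = nodes[tn.pid];
        }
        return rows;
    }

    /** Toy tree; intermediate ranks are deliberately omitted. */
    static Node[] toyNodes() {
        Node[] nodes = new Node[9607];
        nodes[1] = new Node(1, 1, LIFE, "life", "root");
        nodes[9604] = new Node(9604, 1, FAMILY, "family", "Hominidae");
        nodes[9605] = new Node(9605, 9604, GENUS, "genus", "Homo");
        nodes[9606] = new Node(9606, 9605, SPECIES, "species", "Homo sapiens");
        return nodes;
    }

    public static void main(String[] args) {
        for (String row : lineage(toyNodes(), 9606, FAMILY, LIFE)) {
            System.out.println(row);
        }
    }
}
```

With the bounds set to FAMILY and LIFE, the species and genus nodes are visited during the walk but not emitted, which mirrors the effect of minlevel=family.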
Output Generation Methods
The tool provides multiple output formatting approaches:
- printTaxonomy(): Full lineage with level/id/name tab-delimited format
- printTaxLevel(): Single taxonomic level extraction with level comparison
- makeTaxLine(): Semicolon-delimited format with level__name structure
- translateLine(): In-place column replacement for tab-delimited files
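Two of these formats are easy to picture with a small sketch: a semicolon-delimited line of level__name pairs, and replacement of one column in a tab-delimited line. The class TaxLineSketch and the methods joinLevels() and replaceColumn() are hypothetical stand-ins, and the use of full level names (rather than abbreviations) is an assumption; the exact strings produced by makeTaxLine() and translateLine() may differ.

```java
/** Hypothetical formatting helpers for illustration only; not BBTools code. */
class TaxLineSketch {

    /** Joins {level, name} pairs as level__name, separated by semicolons. */
    static String joinLevels(String[][] lineage) {
        StringBuilder sb = new StringBuilder();
        for (String[] pair : lineage) {
            if (sb.length() > 0) {
                sb.append(';');
            }
            sb.append(pair[0]).append("__").append(pair[1]);
        }
        return sb.toString();
    }

    /** Replaces one zero-based column of a tab-delimited line. */
    static String replaceColumn(String line, int column, String replacement) {
        String[] fields = line.split("\t", -1);  // -1 keeps trailing empty fields
        if (column < 0 || column >= fields.length) {
            return line;  // column not present; leave the line unchanged
        }
        fields[column] = replacement;
        return String.join("\t", fields);
    }

    public static void main(String[] args) {
        String[][] lineage = {
            {"family", "Hominidae"},
            {"genus", "Homo"},
            {"species", "Homo sapiens"}
        };
        String taxLine = joinLevels(lineage);
        System.out.println(taxLine);
        System.out.println(replaceColumn("read1\t9606\t42", 1, taxLine));
    }
}
```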
Memory Management and Performance
Resource usage characteristics based on data structure implementation:
- TaxTree loading: Complete NCBI nodes array requires 8-16GB RAM
- GiToTaxid mapping: Integer array storage adds 2-4GB memory overhead
- AccessionToTaxid HashMap: String key storage requires ~45GB for complete NCBI data
- Processing throughput: Limited by identifier resolution, not tree traversal
Thread Safety and Concurrency
The implementation uses thread-safe components for file processing:
- ConcurrentReadInputStream: Multi-threaded FASTA/FASTQ parsing
- TextStreamWriter: Thread-safe output stream management
- TaxTree immutability: Read-only tree structure supports concurrent access
- Shared mapping tables: GiToTaxid and AccessionToTaxid provide thread-safe lookup
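The point about read-only structures can be illustrated with a minimal sketch: once a parent array has been fully built and is never written again, many threads can walk it at once without locks. The class ConcurrentLookupSketch and the toy parent array are hypothetical, and a parallel stream is used here only as a convenient stand-in for the tool's own threading.

```java
import java.util.Arrays;

/** Hypothetical concurrent reader of a read-only parent array; not BBTools code. */
class ConcurrentLookupSketch {

    /** Counts the steps from id to the root, where the root is its own parent. */
    static int depth(int[] parent, int id) {
        int d = 0;
        while (parent[id] != id) {
            id = parent[id];
            d++;
        }
        return d;
    }

    /** Resolves all queries in parallel; the array is only read, never written. */
    static int[] depthsParallel(final int[] parent, int[] queries) {
        return Arrays.stream(queries)
                .parallel()
                .map(q -> depth(parent, q))
                .toArray();  // results keep the order of the queries
    }

    public static void main(String[] args) {
        int[] parent = {0, 0, 1, 2};  // toy chain: 3 -> 2 -> 1 -> 0 (root)
        int[] queries = {3, 1, 0, 2};
        System.out.println(Arrays.toString(depthsParallel(parent, queries)));
    }
}
```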
Database Requirements
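The taxonomy tool requires a taxonomic tree file and, depending on the identifiers used, optional mapping files for gi numbers and accessions. These files can be generated using BBTools utilities: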
Required Files
- tree.taxtree.gz: Compressed taxonomic tree structure (generated with taxtree.sh)
- gitable.int1d.gz: GI number to taxID mapping (generated with gitable.sh, optional)
- Accession files: Accession to taxID mappings (optional, requires ~45GB RAM)
File Generation
Use the following BBTools utilities to create taxonomy database files:
# Generate taxonomic tree
taxtree.sh
# Generate GI table (if processing gi numbers)
gitable.sh
NERSC/Genepool Users
Users at NERSC can use the 'auto' option for database files, which automatically locates pre-built databases at /global/projectb/sandbox/gaag/bbtools/tax/
Support
For questions and support:
- Email: bbushnell@lbl.gov
- Documentation: bbmap.org
- Guide: Read bbtools/docs/guides/TaxonomyGuide.txt for detailed information