Taxonomy

Basic Usage

taxonomy.sh tree=<tree file> <identifier>
taxonomy.sh tree=<tree file> in=<file>

The taxonomy tool accepts organism identifiers as command-line arguments or reads them from an input file. Identifiers can be gi numbers (gi|123456), NCBI taxIDs (9606), or Latin names (homo_sapiens).
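
The three identifier forms above can be told apart by shape alone. The following is a minimal Python sketch of that idea, not the tool's actual parser: the function name classify_identifier and the exact matching rules are assumptions made for illustration.

```python
import re

def classify_identifier(token):
    """Guess which kind of identifier a single query token is.

    Hypothetical helper: the real tool's parsing rules may differ.
    """
    token = token.strip()
    # gi numbers are written with a "gi|" prefix, e.g. gi|123456
    match = re.fullmatch(r"gi\|([0-9]+)", token)
    if match:
        return ("gi", int(match.group(1)))
    # A bare run of digits is treated as an NCBI taxID, e.g. 9606
    if re.fullmatch(r"[0-9]+", token):
        return ("taxid", int(token))
    # Anything else is treated as a name; underscores stand in for spaces
    return ("name", token.replace("_", " ").lower())
```

Under these assumptions, classify_identifier("homo_sapiens") yields a name query for "homo sapiens", while "9606" yields a taxID query.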

Parameters

Parameters control input/output handling, taxonomy database files, output formatting, and filtering. The tool requires taxonomy database files, which can be generated with the taxtree.sh and gitable.sh tools.

Processing parameters

in=<file>: A file containing named sequences, or just the names. Can be FASTA, FASTQ, or text format.
out=<file>: Output file. If blank, use stdout. Results will contain taxonomic classifications.
tree=<file>: Specify a TaxTree file like tree.taxtree.gz. On Genepool, use 'auto'. This file contains the taxonomic hierarchy structure.
gi=<file>: Specify a gitable file like gitable.int1d.gz. Only needed if gi numbers will be used. On Genepool, use 'auto'. Maps gi numbers to taxIDs.
accession=<file>: Specify one or more comma-delimited NCBI accession-to-taxid files. Only needed if accessions will be used; requires ~45GB of memory. On Genepool, use 'auto'.
level=null: Set to a taxonomic level like phylum to just print that level. Options include: species, genus, family, order, class, phylum, kingdom, superkingdom.
minlevel=-1: For multi-level printing, do not print levels below this. Use -1 for no minimum restriction.
maxlevel=life: For multi-level printing, do not print levels above this. Default is 'life' which includes all levels.
silva=f: Parse headers using Silva or semicolon-delimited syntax. Set to true when working with Silva-formatted taxonomic strings.
taxpath=auto: Set the path to taxonomy files; auto only works at NERSC. Specify custom path for taxonomy database files.

Additional Processing Parameters

counts=<file>: Output file for taxonomic counts. Writes the number of sequences classified at each taxonomic node.
verbose=f: Print verbose status messages during processing. Helpful for debugging and monitoring progress.
table=<file>: Alias for gi parameter. Specify gitable file for gi number to taxID mapping.
printname=t: Print the query name before the taxonomy result. Useful for tracking which results correspond to which queries.
reverse=t: Reverse the order of taxonomic levels in output. When true, prints from specific to general (species to kingdom).
unite=f: Enable UNITE database mode for fungal taxonomy. Adjusts parsing for UNITE-specific formatting.
simple=f: Skip non-canonical taxonomic levels. Only prints standard taxonomic ranks (kingdom, phylum, class, order, family, genus, species).
column=-1: If set to a non-negative integer, parse the taxonomy information from this column in a tab-delimited file. Useful for processing structured data files.
name=<string>: Specify organism names directly as comma-delimited list. Alternative to providing names as separate arguments.
names=<string>: Alias for name parameter. Specify organism names directly as comma-delimited list.
id=<string>: Specify organism IDs directly as comma-delimited list. Can be taxIDs, gi numbers, or accessions.
ids=<string>: Alias for id parameter. Specify organism IDs directly as comma-delimited list.
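
To illustrate the column parameter, here is a small Python sketch of pulling the query key out of one line of a tab-delimited file. It is an assumption-laden illustration, not the tool's code: the helper name extract_key is hypothetical, and the sketch assumes column numbers are zero-based, which this document does not state.

```python
def extract_key(line, column=-1):
    """Return the query key from one input line.

    column < 0 mirrors the documented default (column=-1): the whole
    line is the key. Zero-based indexing is an assumption of this sketch.
    """
    line = line.rstrip("\n")
    if column < 0:
        return line
    fields = line.split("\t")
    if column >= len(fields):
        raise ValueError(f"line has only {len(fields)} columns: {line!r}")
    return fields[column]
```

With column=1, a line such as "read1<TAB>9606<TAB>0.98" would contribute the key "9606".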

Java Parameters

-Xmx: This will set Java's memory usage, overriding autodetection. -Xmx20g will specify 20 gigs of RAM, and -Xmx200m will specify 200 megs. The max is typically 85% of physical memory. Large taxonomy databases require substantial memory.
-eoom: This flag will cause the process to exit if an out-of-memory exception occurs. Requires Java 8u92+. Useful for preventing hanging when taxonomy databases exceed available memory.
-da: Disable assertions. May provide minor performance improvement in production environments.

Examples

Basic Organism Lookup

taxonomy.sh tree=tree.taxtree.gz homo_sapiens canis_lupus 9606

Look up taxonomy for human (by Latin name), gray wolf (by Latin name; the domestic dog is the subspecies Canis lupus familiaris), and human again (by NCBI taxID).

File-Based Processing

taxonomy.sh tree=tree.taxtree.gz gi=gitable.int1d.gz in=sequences.fasta out=taxonomy_results.txt

Process a FASTA file containing sequences with gi numbers in headers, outputting complete taxonomic classifications.

Specific Taxonomic Level

taxonomy.sh tree=tree.taxtree.gz level=genus in=species_list.txt

Extract only the genus-level classification for organisms listed in the input file.

Silva Database Processing

taxonomy.sh tree=tree.taxtree.gz silva=t in=silva_sequences.fasta

Process sequences with Silva-formatted taxonomic headers, using semicolon-delimited taxonomy strings.
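
As a rough picture of what semicolon-delimited parsing involves, the Python sketch below splits a Silva-style header into an accession and a list of taxon names. The function name parse_silva_header and the sample header are made up for illustration; the tool's own handling of Silva headers may differ in detail.

```python
def parse_silva_header(header):
    """Split a Silva-style header into (accession, [taxon names]).

    Assumes the layout "<accession> <name>;<name>;...", with the
    lineage ordered from general to specific.
    """
    header = header.lstrip(">").strip()
    # Everything before the first space is the sequence accession
    accession, _, lineage = header.partition(" ")
    # Remaining text is the semicolon-delimited lineage; drop empty parts
    names = [part.strip() for part in lineage.split(";") if part.strip()]
    return accession, names


# Made-up header in the Silva style
SAMPLE_HEADER = ">AB000001.1.1500 Bacteria;Proteobacteria;Gammaproteobacteria;Escherichia coli"
```

Here parse_silva_header(SAMPLE_HEADER) returns the accession "AB000001.1.1500" and four names, the last being "Escherichia coli".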

Taxonomic Range Filtering

taxonomy.sh tree=tree.taxtree.gz minlevel=family maxlevel=kingdom in=organisms.txt

Show only taxonomic levels from family up to kingdom, excluding species and genus levels.

Algorithm Details

The taxonomy tool is implemented in PrintTaxonomy.java. It resolves each identifier to a taxID through specialized mapping classes, then looks up the lineage in a TaxTree data structure.

Core Processing Pipeline

The PrintTaxonomy class orchestrates taxonomic lookup through these processing methods:

processNames(): Direct command-line identifier processing with taxLevelExtended comparison
processFile(): TextFile line-by-line processing with optional keyColumn extraction
processReads(): ConcurrentReadInputStream processing for FASTA/FASTQ sequences
parseNodeFromHeader(): Delegates to TaxTree.parseNodeFromHeader() for identifier extraction

TaxTree Data Structure Implementation

The TaxTree class maintains taxonomic hierarchy using array-based node storage with HashMap lookup acceleration:

nodes array: Direct access to TaxNode objects by taxonomic ID
nameMap HashMap: O(1) lookup from scientific names to node lists
getNodesByNameExtended(): Returns List<TaxNode> for name-based queries
levelExtended fields: Numeric level encoding for hierarchy traversal
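
The layout described above (an array indexed by taxID plus a name-to-nodes map) can be sketched in a few lines of Python. This is a toy model for illustration only: TinyTaxTree and its methods are hypothetical names, not the Java TaxTree API, and the parent links in the sample data are simplified.

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass(frozen=True)
class TaxNode:
    id: int      # taxID of this node
    pid: int     # taxID of the parent node (root points at itself)
    level: str   # rank name, e.g. "species"
    name: str    # scientific name

class TinyTaxTree:
    """Toy tree: list indexed by taxID, dict from name to node list."""

    def __init__(self, nodes):
        nodes = list(nodes)
        self.nodes = [None] * (max(n.id for n in nodes) + 1)
        self.name_map = defaultdict(list)
        for n in nodes:
            self.nodes[n.id] = n
            # A name maps to a list because names are not always unique
            self.name_map[n.name.lower()].append(n)

    def get_node(self, tax_id):
        if 0 <= tax_id < len(self.nodes):
            return self.nodes[tax_id]
        return None

    def get_nodes_by_name(self, name):
        key = name.replace("_", " ").lower()
        return list(self.name_map.get(key, []))

# Toy data: real taxIDs, but parent links skip intermediate ranks
tree = TinyTaxTree([
    TaxNode(1, 1, "life", "root"),
    TaxNode(9605, 1, "genus", "Homo"),
    TaxNode(9606, 9605, "species", "Homo sapiens"),
])
```

The sparse list trades memory for constant-time lookup by taxID, which is the same trade-off the array-based design above implies.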

Identifier Resolution Classes

Four resolution paths handle the different identifier types:

GiToTaxid.initialize(): Loads integer array mapping from gi numbers to taxIDs
AccessionToTaxid.load(): HashMap-based accession string to taxID mapping
TaxTree.parseNodeFromHeader(): Pattern matching for embedded identifiers in sequence headers
Direct taxID lookup: Array index access via tree.getNode(taxID)
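
The difference between the array-backed gi table and the hash-based accession table can be shown with a toy Python sketch. Everything here is illustrative: the table contents are invented, resolve_taxid is a hypothetical helper, and stripping the accession version suffix is an assumption rather than documented behavior.

```python
from array import array

# Toy gi table: index = gi number, value = taxID, -1 = unknown
GI_TABLE = array("i", [-1] * 8)
GI_TABLE[5] = 9606

# Toy accession table keyed by accession without its version suffix
ACCESSION_TABLE = {"TOY0001": 562}

def resolve_taxid(kind, value):
    """Map an identifier of the given kind to a taxID, or -1 if unknown."""
    if kind == "taxid":
        return int(value)
    if kind == "gi":
        gi = int(value)
        # Array lookup: the gi number is used directly as an index
        return GI_TABLE[gi] if 0 <= gi < len(GI_TABLE) else -1
    if kind == "accession":
        # Assumption of this sketch: ignore a trailing ".N" version
        return ACCESSION_TABLE.get(str(value).split(".")[0], -1)
    raise ValueError(f"unknown identifier kind: {kind}")
```

An integer array costs a fixed few bytes per possible gi number, whereas a string-keyed map pays per-entry overhead, which is consistent with the much larger memory figure quoted for accessions.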

Tree Traversal Algorithm

Lineage generation uses parent node traversal with level-based filtering:

Parent traversal: while loop using tn=tree.getNode(tn.pid) until tn.id==tn.pid
Level filtering: tn.levelExtended comparisons against minLevelExtended/maxLevelExtended
Canonical filtering: tn.isSimple() method excludes non-standard taxonomic ranks
Output formatting: TextStreamWriter with tab-delimited level/id/name structure
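
The traversal and level filtering described above can be sketched as follows. This Python version is a simplified illustration, not a port of the Java code: the rank ordering, the lineage function, and the abridged human lineage (intermediate NCBI ranks omitted) are assumptions of the sketch.

```python
LEVELS = ["species", "genus", "family", "order", "class",
          "phylum", "kingdom", "superkingdom", "life"]
RANK = {name: i for i, name in enumerate(LEVELS)}

# taxID -> (parent taxID, level, name); abridged human lineage
NODES = {
    1: (1, "life", "root"),
    2759: (1, "superkingdom", "Eukaryota"),
    33208: (2759, "kingdom", "Metazoa"),
    7711: (33208, "phylum", "Chordata"),
    40674: (7711, "class", "Mammalia"),
    9443: (40674, "order", "Primates"),
    9604: (9443, "family", "Hominidae"),
    9605: (9604, "genus", "Homo"),
    9606: (9605, "species", "Homo sapiens"),
}

def lineage(tax_id, min_level=None, max_level="life"):
    """Walk parent links to the root, keeping levels within the range."""
    lo = RANK[min_level] if min_level is not None else -1
    hi = RANK[max_level]
    out = []
    tid = tax_id
    while True:
        pid, level, name = NODES[tid]
        if lo <= RANK[level] <= hi:
            out.append((level, tid, name))
        if tid == pid:  # the root is its own parent, so stop here
            break
        tid = pid
    return out
```

Calling lineage(9606, "family", "kingdom") corresponds to the minlevel=family maxlevel=kingdom example earlier: species and genus are skipped, and the walk reports family through kingdom.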

Output Generation Methods

The tool provides multiple output formatting approaches:

printTaxonomy(): Full lineage with level/id/name tab-delimited format
printTaxLevel(): Single taxonomic level extraction with level comparison
makeTaxLine(): Semicolon-delimited format with level__name structure
translateLine(): In-place column replacement for tab-delimited files
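
The output shapes named above can be illustrated with a short Python sketch. The function names mirror the Java method names only loosely, and the details are assumptions: in particular, using the full level name as the prefix in level__name and ordering the line from general to specific are guesses, not documented behavior.

```python
def format_full(lineage):
    """One tab-delimited level/id/name line per lineage entry."""
    return "\n".join(f"{level}\t{tax_id}\t{name}"
                     for level, tax_id, name in lineage)

def make_tax_line(lineage):
    """Semicolon-delimited level__name string, general to specific."""
    parts = [f"{level}__{name.replace(' ', '_')}"
             for level, _, name in reversed(lineage)]
    return ";".join(parts)

def translate_line(line, column, replacement):
    """Replace one column of a tab-delimited line, keeping the rest."""
    fields = line.rstrip("\n").split("\t")
    fields[column] = replacement
    return "\t".join(fields)

# Toy lineage, ordered from specific to general
SAMPLE = [("species", 9606, "Homo sapiens"), ("genus", 9605, "Homo")]
```

For SAMPLE, make_tax_line produces "genus__Homo;species__Homo_sapiens", which translate_line can then substitute into the chosen column of an input row.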

Memory Management and Performance

Approximate resource usage, by data structure:

TaxTree loading: Complete NCBI nodes array requires 8-16GB RAM
GiToTaxid mapping: Integer array storage adds 2-4GB memory overhead
AccessionToTaxid HashMap: String key storage requires ~45GB for complete NCBI data
Processing throughput: Limited by identifier resolution, not tree traversal

Thread Safety and Concurrency

The implementation uses thread-safe components for file processing:

ConcurrentReadInputStream: Multi-threaded FASTA/FASTQ parsing
TextStreamWriter: Thread-safe output stream management
TaxTree immutability: Read-only tree structure supports concurrent access
Shared mapping tables: GiToTaxid and AccessionToTaxid provide thread-safe lookup

Database Requirements

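The taxonomy tool always requires a TaxTree file. The gi table and accession files are needed only when those identifier types are used. The tree and gi table can be generated using BBTools utilities: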

Required Files

tree.taxtree.gz: Compressed taxonomic tree structure (generated with taxtree.sh)
gitable.int1d.gz: GI number to taxID mapping (generated with gitable.sh, optional)
Accession files: Accession to taxID mappings (optional, requires ~45GB RAM)

File Generation

Use the following BBTools utilities to create taxonomy database files:

# Generate taxonomic tree
taxtree.sh

# Generate GI table (if processing gi numbers)
gitable.sh

NERSC/Genepool Users

Users at NERSC can use the 'auto' option for database files, which automatically locates pre-built databases at /global/projectb/sandbox/gaag/bbtools/tax/

Support

For questions and support:

Email: bbushnell@lbl.gov
Documentation: bbmap.org
Guide: Read bbtools/docs/guides/TaxonomyGuide.txt for detailed information