Taxonomy

Script: taxonomy.sh
Package: tax
Class: PrintTaxonomy.java

Prints the full taxonomy of a string. The string may be a gi number, an NCBI taxID, or a Latin name. An NCBI identifier should be just a number or ncbi|number. A gi number should be gi|number. Note: It is more convenient to use taxonomy.jgi-psf.org.

Basic Usage

taxonomy.sh tree=<tree file> <identifier>
taxonomy.sh tree=<tree file> in=<file>

The taxonomy tool accepts organism identifiers as command-line arguments or reads them from an input file. Identifiers can be gi numbers (gi|123456), NCBI taxIDs (9606), or Latin names (homo_sapiens).
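
The identifier forms described above can be told apart by their prefix and content. The following is a minimal sketch in POSIX shell; classify_id is a hypothetical helper written for illustration, not part of BBTools, and it does not reproduce the tool's actual parsing.

```shell
# Hypothetical helper (not part of BBTools): guess which kind of
# identifier a string is, using only the formats described above.
classify_id() {
    case "$1" in
        '')          echo "empty" ;;
        'gi|'*)      echo "gi number" ;;
        'ncbi|'*)    echo "NCBI taxID" ;;
        *[!0-9]*)    echo "Latin name" ;;   # contains a non-digit character
        *)           echo "NCBI taxID" ;;   # digits only
    esac
}

# Print each example identifier next to its guessed type.
for id in 'gi|123456' 9606 'ncbi|9606' homo_sapiens; do
    printf '%s\t%s\n' "$id" "$(classify_id "$id")"
done
```

A check like this can catch malformed identifiers in a list before the list is passed to taxonomy.sh.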

Parameters

Parameters control input/output handling, taxonomy database files, output formatting, and filtering options. The tool requires taxonomy database files that can be generated using taxtree.sh and gitable.sh tools.

Processing parameters

in=<file>
A file containing named sequences, or just the names. Can be FASTA, FASTQ, or text format.
out=<file>
Output file. If blank, use stdout. Results will contain taxonomic classifications.
tree=<file>
Specify a TaxTree file like tree.taxtree.gz. On Genepool, use 'auto'. This file contains the taxonomic hierarchy structure.
gi=<file>
Specify a gitable file like gitable.int1d.gz. Only needed if gi numbers will be used. On Genepool, use 'auto'. Maps gi numbers to taxIDs.
accession=
Specify one or more comma-delimited NCBI accession-to-taxid files. Only needed if accessions will be used; requires ~45GB of memory. On Genepool, use 'auto'.
level=null
Set to a taxonomic level like phylum to just print that level. Options include: species, genus, family, order, class, phylum, kingdom, superkingdom.
minlevel=-1
For multi-level printing, do not print levels below this. Use -1 for no minimum restriction.
maxlevel=life
For multi-level printing, do not print levels above this. Default is 'life' which includes all levels.
silva=f
Parse headers using Silva or semicolon-delimited syntax. Set to true when working with Silva-formatted taxonomic strings.
taxpath=auto
Set the path to the taxonomy files. The value 'auto' only works at NERSC; elsewhere, specify the directory that holds the taxonomy database files.

Additional Processing Parameters

counts=<file>
Output file for taxonomic counts. Writes the number of sequences classified at each taxonomic node.
verbose=f
Print verbose status messages during processing. Helpful for debugging and monitoring progress.
table=<file>
Alias for gi parameter. Specify gitable file for gi number to taxID mapping.
printname=t
Print the query name before the taxonomy result. Useful for tracking which results correspond to which queries.
reverse=t
Reverse the order of taxonomic levels in output. When true, prints from specific to general (species to kingdom).
unite=f
Enable UNITE database mode for fungal taxonomy. Adjusts parsing for UNITE-specific formatting.
simple=f
Skip non-canonical taxonomic levels. Only prints standard taxonomic ranks (kingdom, phylum, class, order, family, genus, species).
column=-1
If set to a non-negative integer, parse the taxonomy information from this column in a tab-delimited file. Useful for processing structured data files.
name=<string>
Specify organism names directly as comma-delimited list. Alternative to providing names as separate arguments.
names=<string>
Alias for name parameter. Specify organism names directly as comma-delimited list.
id=<string>
Specify organism IDs directly as comma-delimited list. Can be taxIDs, gi numbers, or accessions.
ids=<string>
Alias for id parameter. Specify organism IDs directly as comma-delimited list.
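
The name= and id= parameters take comma-delimited lists. As a hedged sketch, a one-per-line list can be joined into that form with the standard paste utility. The temporary file and the variable names are illustrative, and the command is only printed here, not executed.

```shell
# Toy list of names, one per line (illustrative temporary file).
names_file=$(mktemp)
printf '%s\n' homo_sapiens canis_lupus > "$names_file"

# Join the lines with commas to get the comma-delimited form used by name=.
name_list=$(paste -s -d ',' "$names_file")

# Show the resulting command line instead of running it.
echo "taxonomy.sh tree=tree.taxtree.gz name=$name_list"
rm -f "$names_file"
```

For long lists, passing the file directly with in=<file> avoids building a long command line.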

Java Parameters

-Xmx
This will set Java's memory usage, overriding autodetection. -Xmx20g will specify 20 gigs of RAM, and -Xmx200m will specify 200 megs. The max is typically 85% of physical memory. Large taxonomy databases require substantial memory.
-eoom
This flag will cause the process to exit if an out-of-memory exception occurs. Requires Java 8u92+. Useful for preventing hanging when taxonomy databases exceed available memory.
-da
Disable assertions. May provide minor performance improvement in production environments.
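
To make the -Xmx arithmetic concrete, here is a small sketch that turns a physical-memory figure into a flag using the 85% figure quoted above. xmx_flag is a hypothetical helper, not a BBTools function, and the way total memory is obtained (shown in a comment) is Linux-specific.

```shell
# Hypothetical helper: convert total physical memory in kB into an
# -Xmx flag in megabytes, using 85% of the total.
xmx_flag() {
    total_kb=$1
    echo "-Xmx$(( total_kb / 1024 * 85 / 100 ))m"
}

xmx_flag 16777216   # 16 GiB of RAM; prints -Xmx13926m

# On Linux, the total could be read from /proc/meminfo, for example:
#   xmx_flag "$(awk '/^MemTotal:/ {print $2}' /proc/meminfo)"
```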

Examples

Basic Organism Lookup

taxonomy.sh tree=tree.taxtree.gz homo_sapiens canis_lupus 9606

Look up taxonomy for human (by Latin name), gray wolf (by Latin name), and human again (by NCBI taxID).

File-Based Processing

taxonomy.sh tree=tree.taxtree.gz gi=gitable.int1d.gz in=sequences.fasta out=taxonomy_results.txt

Process a FASTA file containing sequences with gi numbers in headers, outputting complete taxonomic classifications.
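
Since the gi= table is only needed when gi numbers are used, it can help to check the headers first. The sketch below assumes headers that begin with gi|number, as described at the top of this page; has_gi_headers is a hypothetical helper and the FASTA content is a toy example.

```shell
# Hypothetical helper: succeed if any FASTA header line starts with "gi|".
# In grep's default basic regular expressions, "|" is a literal character.
has_gi_headers() {
    grep -q '^>gi|' "$1"
}

# Toy FASTA file for demonstration.
fasta=$(mktemp)
printf '>gi|123456|example\nACGT\n' > "$fasta"

if has_gi_headers "$fasta"; then
    echo "gi numbers found: pass gi=<gitable file>"
else
    echo "no gi numbers found: gi= can be omitted"
fi
rm -f "$fasta"
```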

Specific Taxonomic Level

taxonomy.sh tree=tree.taxtree.gz level=genus in=species_list.txt

Extract only the genus-level classification for organisms listed in the input file.

Silva Database Processing

taxonomy.sh tree=tree.taxtree.gz silva=t in=silva_sequences.fasta

Process sequences with Silva-formatted taxonomic headers, using semicolon-delimited taxonomy strings.
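
To illustrate what a semicolon-delimited taxonomy string looks like once split, here is a short sketch. The lineage string is a toy example rather than a line from a real Silva file, split_lineage is a hypothetical helper, and the tool's own parsing may differ.

```shell
# Hypothetical helper: print one taxonomic level per line from a
# semicolon-delimited lineage string.
split_lineage() {
    printf '%s\n' "$1" | tr ';' '\n'
}

# Toy lineage string, ordered general to specific.
lineage='Bacteria;Proteobacteria;Gammaproteobacteria;Enterobacterales'

# Number each level for readability.
split_lineage "$lineage" | awk '{ printf "%d\t%s\n", NR, $0 }'
```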

Taxonomic Range Filtering

taxonomy.sh tree=tree.taxtree.gz minlevel=family maxlevel=kingdom in=organisms.txt

Show only taxonomic levels from family up to kingdom, excluding species and genus levels.
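
The effect of minlevel and maxlevel can be pictured as selecting a slice from an ordered list of ranks. The sketch below uses only the ranks named in the level= description; levels_between is a hypothetical helper, and the tool itself may recognize additional levels.

```shell
# Ranks from the level= description, ordered from specific to general.
RANKS="species genus family order class phylum kingdom superkingdom"

# Hypothetical helper: print the ranks from $1 (lowest) up to $2 (highest).
levels_between() {
    echo "$RANKS" | tr ' ' '\n' | awk -v lo="$1" -v hi="$2" '
        $0 == lo { on = 1 }                              # start of the range
        on       { printf "%s%s", sep, $0; sep = " " }   # inside the range
        $0 == hi { on = 0 }                              # end of the range
        END      { print "" }'
}

levels_between family kingdom   # prints: family order class phylum kingdom
```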

Algorithm Details

The taxonomy tool implements taxonomic classification using PrintTaxonomy.java with TaxTree data structures and multi-stage identifier resolution through specialized mapping classes.

Core Processing Pipeline

The PrintTaxonomy class orchestrates taxonomic lookup through these processing methods:

TaxTree Data Structure Implementation

The TaxTree class maintains taxonomic hierarchy using array-based node storage with HashMap lookup acceleration:

Identifier Resolution Classes

Three specialized mapping systems handle different identifier types:

Tree Traversal Algorithm

Lineage generation uses parent node traversal with level-based filtering:
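
A minimal sketch of parent-node traversal, written in shell and awk rather than taken from the Java source. The node table is a toy with made-up IDs, the lineage function is hypothetical, and the filter handles only a single named rank; none of this is the TaxTree implementation itself.

```shell
# Toy node table: id <TAB> parent id <TAB> rank <TAB> name.
# IDs are made up for illustration; the root is its own parent.
nodes=$(mktemp)
printf '%s\t%s\t%s\t%s\n' \
    1 1 life         root \
    2 1 superkingdom Eukaryota \
    3 2 kingdom      Metazoa \
    4 3 phylum       Chordata \
    5 4 class        Mammalia \
    6 5 order        Primates \
    7 6 family       Hominidae \
    8 7 genus        Homo \
    9 8 species      'Homo sapiens' > "$nodes"

# Hypothetical helper: walk from node $2 to the root of table $1,
# optionally printing only the rank named in $3.
lineage() {
    awk -F '\t' -v start="$2" -v only="$3" '
        { parent[$1] = $2; rank[$1] = $3; name[$1] = $4 }
        END {
            id = start
            while (id in parent) {
                if (only == "" || rank[id] == only)
                    print rank[id] "\t" name[id]
                if (parent[id] == id) break   # reached the root
                id = parent[id]
            }
        }' "$1"
}

lineage "$nodes" 9          # full lineage, species first
lineage "$nodes" 9 genus    # only the genus line
rm -f "$nodes"
```

The awk arrays play the role of a keyed lookup from node ID to parent, so each step up the tree is a single lookup.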

Output Generation Methods

The tool provides multiple output formatting approaches:

Memory Management and Performance

Resource usage characteristics based on data structure implementation:

Thread Safety and Concurrency

The implementation uses thread-safe components for file processing:

Database Requirements

The taxonomy tool requires several database files that can be generated using BBTools utilities:

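Required Files

A TaxTree file (such as tree.taxtree.gz) is passed with tree=. A gitable file (such as gitable.int1d.gz) is passed with gi= and is only needed if gi numbers will be used. Accession-to-taxid files are passed with accession= and are only needed if accessions will be used.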

File Generation

Use the following BBTools utilities to create taxonomy database files:

# Generate taxonomic tree
taxtree.sh

# Generate GI table (if processing gi numbers)
gitable.sh
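
Before running taxonomy.sh, it may be worth confirming that the database files exist. The following pre-flight check is a sketch only: check_tax_files is a hypothetical helper, and the file names are the ones used in the examples on this page.

```shell
# Hypothetical helper: report each unreadable file on stderr and
# return non-zero if any file is missing.
check_tax_files() {
    missing=0
    for f in "$@"; do
        if [ ! -r "$f" ]; then
            echo "missing: $f" >&2
            missing=1
        fi
    done
    return "$missing"
}

# File names follow the examples in this document.
if check_tax_files tree.taxtree.gz gitable.int1d.gz; then
    echo "taxonomy files found"
else
    echo "generate the missing files with taxtree.sh and gitable.sh"
fi
```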

NERSC/Genepool Users

Users at NERSC can use the 'auto' option for database files, which automatically locates pre-built databases at /global/projectb/sandbox/gaag/bbtools/tax/

Support

For questions and support: