TaxServer
HTTP server for NCBI taxonomy translation and sketch-based sequence identification. Maintains reference sketches and taxonomy databases in memory to provide high-performance remote queries for taxonomic classification and phylogenetic analysis.
Overview
TaxServer is one of the four core BBSketch programs (sketch.sh, comparesketch.sh, sendsketch.sh, taxserver.sh). Its primary purpose is to eliminate the overhead of repeatedly loading large reference datasets by maintaining them in memory as a persistent HTTP service. This enables rapid sketch-based taxonomic identification for remote clients using tools like SendSketch.
The server provides two main services:
- Taxonomy Translation: Convert between GI numbers, accession numbers, taxonomic names, and NCBI taxonomy IDs
- Sketch Comparison: Compare query sequences against pre-loaded reference sketch databases for species identification
With a large reference set and single queries, comparison time is dominated by loading the reference database. TaxServer solves this by keeping references in memory, making individual queries extremely fast.
Basic Usage
Taxonomy Server
taxserver.sh tree=tree.taxtree.gz table=gitable.int1d.gz port=1234
Start a basic taxonomy translation server.
Sketch Server with Reference Database
taxserver.sh port=1234 tree=tree.taxtree.gz gi=gitable.int1d.gz refseq*.sketch 1>log.o 2>&1 &
Load RefSeq sketches into memory and listen for sketch comparison queries on port 1234.
After starting the server, clients can query it using SendSketch:
sendsketch.sh in=assembly.fa address=http://localhost:1234/sketch
Parameters
Core Parameters
- tree=auto
  Path to the taxonomy tree file (taxtree format). Always required. "auto" uses the default location at LBL.
- table=auto
  Path to the GI table file (gitable format). Required for GI number support. Also accepted as "gi=".
- accession=null
  Comma-delimited accession file paths. Example: prot.accession2taxid.gz,nucl_wgs.accession2taxid.gz
- img=null
  IMG dump file path for IMG genome database support.
- pattern=null
  Pattern file for compressed accession storage.
- port=3068
  HTTP server port number. The taxonomy server defaults to 3068.
- domain=
  Domain name displayed in help messages. Default: taxonomy.jgi-psf.org
- dbname=
  Database name displayed in responses and help messages.
- taxpath=
  Base path for taxonomy files when "auto" is used for a file parameter.
Sketch Parameters
- sketchcomparethreads=16
  Maximum comparison threads per connection for sketch operations.
- sketchloadthreads=4
  Maximum load threads for local fastq file processing.
- sketchonly=f
  Run in sketch-only mode, disabling taxonomy name hashing.
- k=31
  Kmer length (1-32). Dual lengths are supported for sensitivity: k=31,24
- prealloc=f
  Preallocate data structures. Use a boolean (true/false) or a fraction (0.75) for partial allocation.
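The documented range for k can be checked before a server is launched. The function below is a minimal sketch: valid_k is a hypothetical helper, not part of BBTools, and it only enforces the 1-32 range and the comma-separated form described above.

```shell
# Hypothetical helper (not part of BBTools): succeed only if every
# comma-separated value is an integer in the documented 1-32 range.
valid_k() {
  local rest="$1" value
  case "$rest" in
    ''|*,) return 1 ;;             # empty input or trailing comma
  esac
  while [ -n "$rest" ]; do
    value="${rest%%,*}"            # text before the first comma
    case "$value" in
      ''|*[!0-9]*) return 1 ;;     # empty or non-numeric entry
    esac
    { [ "$value" -ge 1 ] && [ "$value" -le 32 ]; } || return 1
    case "$rest" in
      *,*) rest="${rest#*,}" ;;    # drop the entry just checked
      *)   rest="" ;;
    esac
  done
  return 0
}

# Usage:
#   valid_k 31,24 && taxserver.sh tree=tree.taxtree.gz port=1234 k=31,24 refseq*.sketch
```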
Security Parameters
- killcode=
  Password for remote server shutdown via the /kill/<password> endpoint.
- oldcode=
  Password of a prior server instance, used to shut it down.
- oldaddress=
  Address used to kill the prior instance after initialization. Example: taxonomy.jgi-psf.org/kill/
- allowremotefileaccess=f
  Allow external queries to access the server filesystem for local sketching.
- allowlocalhost=f
  Treat localhost queries as internal without proxy requirements.
- addressprefix=128.
  IP prefix used to identify the internal network. Default "128." for LBL.
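To illustrate how addressprefix separates internal from external clients, the sketch below assumes a plain string-prefix comparison against the client IP. That matching rule is an assumption made for illustration, and is_internal_address is a hypothetical helper rather than server code.

```shell
# Hypothetical helper: succeed if the client IP string starts with the prefix.
# Assumes simple string-prefix matching; the default mirrors addressprefix=128.
is_internal_address() {
  local ip="$1" prefix="${2:-128.}"
  case "$ip" in
    "$prefix"*) return 0 ;;
    *)          return 1 ;;
  esac
}

# Usage:
#   is_internal_address 128.3.40.7 && echo "internal client"
#   is_internal_address 10.1.2.3 10. && echo "internal on a custom network"
```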
Java Parameters
- -Xmx
  Maximum Java heap size. Examples: -Xmx20g (20 GB), -Xmx200m (200 MB). The usable maximum is typically about 85% of physical memory.
- -eoom
  Exit on out-of-memory exception. Requires Java 8u92+.
- -da
  Disable Java assertions for performance.
Setting Up Sketch Servers
Setting up your own sketch server involves three main steps: preparing taxonomy files, creating reference sketches, and starting the server. The BBTools package includes pipeline scripts demonstrating this process.
1. Prepare Taxonomy Files
# Download and prepare taxonomy files (see fetchTaxonomy.sh)
# This creates tree.taxtree.gz, gitable.int1d.gz, and accession files
2. Create Reference Sketches
# Sketch a reference database (see fetchNt.sh, fetchRefSeq.sh)
sketch.sh in=refseq.fa.gz out=refseq#.sketch files=31 mode=taxa \
tree=tree.taxtree.gz gi=gitable.int1d.gz taxlevel=subspecies
3. Start the Server
# Start server with taxonomy files and sketches
taxserver.sh -Xmx45g tree=tree.taxtree.gz gi=gitable.int1d.gz \
accession=prot.accession2taxid.gz,nucl_wgs.accession2taxid.gz refseq*.sketch port=1234 \
domain=your.domain.org killcode=your_password
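The accession= parameter is documented as comma-delimited, so a set of files matched by a shell glob has to be joined before it is passed. A minimal sketch, where join_commas is a hypothetical helper and the file locations are assumed:

```shell
# Hypothetical helper: join all arguments into one comma-delimited string.
join_commas() {
  local IFS=,
  printf '%s\n' "$*"
}

# Usage (assumes the accession files are in the current directory):
#   taxserver.sh -Xmx45g tree=tree.taxtree.gz gi=gitable.int1d.gz \
#     accession="$(join_commas *.accession2taxid.gz)" refseq*.sketch port=1234
```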
Important Configuration Note
When using custom taxonomy file locations (not at JGI), add the taxpath=X parameter to all BBTools commands, where X is the path containing your taxonomy files. This applies to fetchNt.sh, startNtServer.sh, and all sketch operations.
Examples
Basic Taxonomy Server
taxserver.sh tree=tree.taxtree.gz table=gitable.int1d.gz port=1234
Start taxonomy translation server with GI number support.
Full-Featured Server
taxserver.sh -Xmx45g tree=tree.taxtree.gz table=gitable.int1d.gz \
accession=prot.accession2taxid.gz,nucl_wgs.accession2taxid.gz \
port=1234 refseq*.sketch
Server with taxonomy translation, accession support, and sketch database.
LBL Configuration
taxserver.sh tree=auto table=auto accession=auto port=1234
Use default LBL file locations with auto-detection.
Custom Path Setup
taxserver.sh -Xmx45g tree=auto table=auto accession=auto \
port=1234 taxpath=/custom/taxonomy/path refseq*.sketch
Auto-detection with custom taxonomy file directory.
Sketch-Only Mode
taxserver.sh tree=tree.taxtree.gz port=1234 sketchonly=t \
k=31,24 refseq*.sketch
Sketch comparison only, with dual kmer lengths for enhanced sensitivity.
High-Performance Configuration
taxserver.sh -Xmx128g tree=auto table=auto accession=auto \
port=1234 sketchcomparethreads=32 prealloc=0.8 \
refseq*.sketch nt*.sketch
High-memory server with increased threading and preallocation for large datasets.
Client Usage
Once your TaxServer is running, clients can query it using SendSketch or direct HTTP requests:
SendSketch Queries
# Query your local server
sendsketch.sh in=assembly.fa address=http://localhost:1234/sketch
# Query with additional parameters
sendsketch.sh in=reads.fq address=http://your.server.org:1234/sketch \
reads=1m samplerate=0.5 minkeycount=2
JGI Public Servers
# Use JGI's public servers (shorthand notation)
sendsketch.sh in=assembly.fa nt
sendsketch.sh in=assembly.fa refseq
sendsketch.sh in=assembly.fa silva
# Equivalent full addresses:
# https://nt-sketch.jgi-psf.org/sketch
# https://refseq-sketch.jgi-psf.org/sketch
# https://ribo-sketch.jgi-psf.org/sketch
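For clients other than SendSketch, the shorthand names can be resolved to full addresses in a script. The mapping below is a sketch that assumes the three addresses correspond to the three shorthand names in the order listed; sketch_address is a hypothetical helper.

```shell
# Hypothetical helper: map a SendSketch shorthand name to its full address.
# The pairing is assumed from the order of the lists above.
sketch_address() {
  case "$1" in
    nt)     printf '%s\n' 'https://nt-sketch.jgi-psf.org/sketch' ;;
    refseq) printf '%s\n' 'https://refseq-sketch.jgi-psf.org/sketch' ;;
    silva)  printf '%s\n' 'https://ribo-sketch.jgi-psf.org/sketch' ;;
    *)      printf 'unknown shorthand: %s\n' "$1" >&2; return 1 ;;
  esac
}

# Usage:
#   sendsketch.sh in=assembly.fa address="$(sketch_address refseq)"
```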
Algorithm Details
Server Architecture
TaxServer implements a multi-threaded HTTP server using Java's HttpServer framework with four specialized handlers:
- TaxHandler: Processes taxonomy translation requests (/tax/, /stax/)
- SketchHandler: Handles sketch comparison queries (/sketch/)
- KillHandler: Manages secure server shutdown (/kill/)
- HelpHandler: Provides usage documentation (/help/, /usage/)
Request processing uses Executors.newFixedThreadPool() with configurable thread count (handlerThreads parameter, default max(2, CPU_cores)).
Memory-Resident Data Structures
The server loads and maintains several key data structures in memory:
- TaxTree: Complete NCBI taxonomy hierarchy with parent-child relationships and taxonomic levels
- GiToTaxid mapping: Hash table for rapid GI number to taxonomy ID resolution
- AccessionToTaxid tables: Distributed hash tables supporting protein and nucleotide accession lookups
- Reference sketches: Pre-computed MinHash sketches organized by taxonomy ID for O(1) lookup
- Name indexes: Optional taxonomic name hashing for species name resolution
Sketch Comparison Engine
The server's sketch functionality provides several comparison modes:
- Remote queries: Accept sketch data from SendSketch clients via HTTP POST
- Local file processing: Internal clients can specify local files for sketching (requires allowRemoteFileAccess=true)
- Reference comparison: Direct comparison against specific reference sketches by taxonomy ID
- Batch processing: Multiple sketches per request with individual result reporting
Thread management uses maxConcurrentSketchCompareThreads (default 16) for comparison operations and maxConcurrentSketchLoadThreads (default 4) for file I/O.
Performance Optimizations
- Preallocation strategy: The prealloc parameter (a boolean or a fraction between 0 and 1) reserves hash table capacity up front, which avoids repeated resizing while the tables are loaded
- Memory compaction: clearMem parameter triggers System.gc() after data loading to optimize heap layout
- Distributed processing: Supports multi-server deployment with serverCount/serverNum parameters for horizontal scaling
- Connection pooling: Reuses HTTP connections and maintains persistent sketch data to minimize per-query overhead
Query Processing Pipeline
Each sketch query follows this processing sequence:
1. Request parsing: Extract parameters and query mode from URL path
2. Access control: Verify client permissions based on IP address and query type
3. Sketch loading: Parse incoming sketch data or load from specified files
4. Compatibility check: Verify kmer length and hash version match server configuration
5. Database search: Compare query sketch(es) against reference database
6. Result formatting: Generate response with similarity metrics and taxonomy information
API Endpoints
Taxonomy Translation
- /tax/name/<organism_name>
  Look up taxonomy by organism name (e.g., /tax/name/Escherichia%20coli)
- /tax/taxid/<taxonomy_id>
  Look up taxonomy by NCBI taxonomy ID (e.g., /tax/taxid/511145)
- /tax/gi/<gi_number>
  Look up taxonomy by GI number (e.g., /tax/gi/556503834)
- /tax/accession/<accession>
  Look up taxonomy by accession number (e.g., /tax/accession/NC_000913)
- /stax/
  Simple taxonomy queries returning only canonical taxonomic levels
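Taxonomy lookups are plain HTTP GET requests, so they can be issued from any HTTP client once the path is percent-encoded. The sketch below builds such URLs in shell. urlencode and tax_url are hypothetical helpers, the host and port are placeholders, and the encoder is only meant for ASCII input.

```shell
# Hypothetical helper: percent-encode a string one byte at a time.
urlencode() {
  local LC_ALL=C s="$1" c out=""
  while [ -n "$s" ]; do
    c="${s%"${s#?}"}"              # first character of s
    s="${s#?}"                     # remainder of s
    case "$c" in
      [A-Za-z0-9._~-]) out="$out$c" ;;
      *) out="$out$(printf '%%%02X' "'$c")" ;;
    esac
  done
  printf '%s\n' "$out"
}

# Hypothetical helper: build a /tax/<kind>/<value> URL for a server base address.
tax_url() {
  local base="${1%/}" kind="$2" value="$3"
  printf '%s/tax/%s/%s\n' "$base" "$kind" "$(urlencode "$value")"
}

# Usage (assumes a server listening on localhost:1234 and curl installed):
#   curl -s "$(tax_url http://localhost:1234 name 'Escherichia coli')"
#   curl -s "$(tax_url http://localhost:1234 accession NC_000913)"
```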
Sketch Comparison
- /sketch/
  Submit sketch data for comparison against reference database (POST request)
- /sketch/file/<filename>
  Process local server files for sketching (internal clients only)
- /sketch/ref/<taxid>
  Compare against specific reference sketches by taxonomy ID
Server Management
- /help
  Display server usage documentation
- /usage
  Alias for the /help endpoint
- /stats
  Server performance statistics and query metrics
- /kill/<password>
  Shut down the server; <password> must match the killcode parameter
File Requirements
Essential Files
- Taxonomy tree: .taxtree.gz format containing NCBI taxonomy hierarchy
- GI table: .int1d.gz format mapping GI numbers to taxonomy IDs
Optional Enhancement Files
- Accession files: .accession2taxid.gz for protein/nucleotide accession support
- Reference sketches: Pre-computed .sketch files for sequence comparison
- Pattern file: Compressed accession patterns for memory efficiency
- Size file: Genome size annotations for taxonomy nodes
- IMG file: IMG genome database integration
File Preparation
Use the provided pipeline scripts for file preparation:
- fetchTaxonomy.sh: Download and format NCBI taxonomy files
- fetchNt.sh / fetchRefSeq.sh: Create reference sketch databases
- startNtServer.sh: Example server startup configuration
Performance Considerations
Memory Requirements
- Basic taxonomy server: 4-8GB for tree and GI table
- With accession support: 16-32GB additional for accession tables
- Large sketch databases: 64-128GB for complete RefSeq or nt sketches
- Recommended allocation: -Xmx with 85% of available physical memory
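The 85% guideline can be turned into an -Xmx flag with integer arithmetic. This is a minimal sketch: xmx_flag is a hypothetical helper, and reading MemTotal from /proc/meminfo assumes a Linux host.

```shell
# Hypothetical helper: convert total memory in kB into an -Xmx flag in MB,
# using a percentage of physical memory (default 85, per the guideline above).
xmx_flag() {
  local total_kb="$1" percent="${2:-85}"
  printf '%s%dm\n' '-Xmx' "$(( total_kb * percent / 100 / 1024 ))"
}

# Usage on Linux:
#   total_kb=$(awk '/^MemTotal:/ {print $2}' /proc/meminfo)
#   taxserver.sh "$(xmx_flag "$total_kb")" tree=tree.taxtree.gz port=1234 refseq*.sketch
```

For a 64 GB machine (67108864 kB) this yields -Xmx55705m.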
Threading Configuration
- handlerThreads: HTTP request handling (default: max(2, CPU_cores))
- sketchcomparethreads: Sketch comparison parallelism (default: 16)
- sketchloadthreads: File I/O operations (default: 4)
Startup Time
- Taxonomy files: 30-60 seconds for full NCBI taxonomy
- Large sketch databases: 5-15 minutes for RefSeq-scale datasets
- Accession tables: 10-30 minutes depending on size and preallocation
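Because loading can take several minutes, a client script may need to wait until the server answers before sending queries. The sketch below retries a command with doubling delays. retry_with_backoff is a hypothetical helper, and probing /help as a readiness check is an assumption rather than documented behavior.

```shell
# Hypothetical helper: run a command until it succeeds, doubling the wait
# between attempts. Arguments: max attempts, initial delay in seconds, command.
retry_with_backoff() {
  local attempts="$1" delay="$2" n=1
  shift 2
  while ! "$@"; do
    [ "$n" -ge "$attempts" ] && return 1   # give up after the last attempt
    sleep "$delay"
    delay=$(( delay * 2 ))
    n=$(( n + 1 ))
  done
  return 0
}

# Usage (assumes curl and a server starting on localhost:1234):
#   retry_with_backoff 8 5 curl -sf -o /dev/null http://localhost:1234/help \
#     && sendsketch.sh in=assembly.fa address=http://localhost:1234/sketch
```

With 8 attempts and an initial delay of 5 seconds, the total wait is at most 5+10+20+40+80+160+320 = 635 seconds.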
Troubleshooting
Common Issues
- Port binding errors
  If the port is busy, the server retries binding up to 8 times with exponential backoff before giving up
- Out of memory errors
  Increase the -Xmx allocation or use the prealloc parameter to optimize memory layout
- Slow sketch loading
  Verify sketch files are accessible and consider increasing sketchloadthreads
- Access denied for file queries
  File queries are for internal clients only: verify that addressprefix matches the client IP and that allowRemoteFileAccess=true
Monitoring
- Check /stats endpoint for query counts and performance metrics
- Monitor server logs for client IPs and response times
- Use verbose=true for detailed operation logging
Integration with BBSketch Ecosystem
TaxServer is designed to work seamlessly with other BBSketch tools:
- sketch.sh: Creates reference sketches loaded by TaxServer
- comparesketch.sh: Direct comparison tool for offline analysis
- sendsketch.sh: Primary client for querying TaxServer instances
The typical workflow involves using sketch.sh to create reference databases, TaxServer to serve them, and sendsketch.sh for queries. This architecture enables high-throughput taxonomic identification with minimal per-query overhead.
Notes
- Unrecognized parameters without '=' are treated as sketch file paths
- Server supports both HTTP and HTTPS protocols
- Multiple server instances support distributed processing
- See BBSketchGuide.txt and TaxonomyGuide.txt for comprehensive usage examples
- JGI maintains public servers at nt-sketch.jgi-psf.org, refseq-sketch.jgi-psf.org, and ribo-sketch.jgi-psf.org
Support
For questions and support:
- Contact: bbushnell@lbl.gov
- Documentation: bbmap.org
- Latest version: SourceForge BBMap project
- Guides: BBTools/docs/guides/BBSketchGuide.txt and TaxonomyGuide.txt