Start Protein Server VM

Script: startProteinServerVM.sh
Source Directory: pipelines/server/
Author: Brian Bushnell

Startup script for the JGI protein sketch server with taxonomic classification capabilities. This script launches a high-performance taxonomic server specifically configured for protein sequence analysis and classification at the Joint Genome Institute.

Overview

startProteinServerVM.sh is a deployment script that configures and starts a TaxServer instance for amino acid sequence sketching and taxonomic classification of protein sequences.

This server provides remote access to protein sequence analysis capabilities, including k-mer based sketching, taxonomic classification, and similarity searching against prokaryotic protein databases. The server is specifically tuned for high-throughput protein analysis workflows used in genomics and metagenomics research.

Note: This script is designed for deployment on JGI infrastructure (specifically jgi-web-5) and contains hardcoded paths and configurations specific to that environment.

Prerequisites

System Requirements

Database Requirements

Infrastructure Requirements

Server Configuration

Service Parameters

Network Configuration

PORT=3074
TCP port number for the server to listen on. This port must be available and accessible through any firewalls.
DOMAIN=https://protein-sketch.jgi.doe.gov
The public domain name where the service will be accessible. This domain should have proper SSL/TLS certificates configured.
KILL=https://protein-sketch.jgi.doe.gov/kill/
Remote shutdown endpoint URL. Combined with the kill code, allows secure remote termination of the server.

Security Configuration

PASS=xxxxx
Security password for remote server management operations. This should be changed from the placeholder value before deployment.
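
Because the placeholder must be replaced before deployment, a guard like the one below could run before launch. This is a minimal sketch and not part of the original script; the function names check_pass and generate_pass are hypothetical. The generated value uses only hex characters, which keeps it safe to place in the kill URL path.

```shell
# Hypothetical helpers; not part of startProteinServerVM.sh.

# Succeeds only if the password is non-empty and not the documented placeholder.
check_pass() {
    [ -n "$1" ] && [ "$1" != "xxxxx" ]
}

# Prints 32 hex characters built from 16 random bytes of /dev/urandom.
generate_pass() {
    od -An -N16 -tx1 /dev/urandom | tr -d ' \n'
}

PASS=xxxxx
if ! check_pass "$PASS"; then
    echo "PASS is still the placeholder; generating a candidate value"
    PASS=$(generate_pass)
fi
```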

Database Configuration

DB=ProkProt
Database identifier specifying the prokaryotic protein database to be used for sequence analysis and classification.
LOG=proteinlogVM_32.txt
Log file path for server output and error messages. The log captures all server activity and diagnostic information.

TaxServer Parameters

The script launches taxserver.sh with the following configuration:

JVM Configuration

-da
Disable Java assertions. Skipping assertion checks improves runtime speed in production.
-Xmx16g
Maximum heap size allocation of 16 gigabytes. This large memory allocation supports handling of large protein databases and concurrent user requests.

Server Behavior

prealloc=0.9
Pre-allocate 90% of available memory structures during startup. This improves performance by reducing allocation overhead during operation.
port=$PORT
Listening port, taken from the PORT variable (3074).
verbose
Enable verbose logging output for detailed operational monitoring and debugging.
tree=auto
Automatically locate and load the taxonomic tree structure from the standard JGI location.
sizemult=2
Size multiplier for internal data structures. A value of 2 provides extra capacity for handling peak loads.
sketchonly
Operate in sketch-only mode, focusing on k-mer based sequence sketching and comparison rather than full alignment.
index
Build or use indexed data structures for faster sequence lookups and comparisons.
amino
Configure the server specifically for amino acid (protein) sequence analysis rather than nucleotide sequences.

Security Parameters

domain=$DOMAIN
Domain name displayed in help messages and used for generating proper URLs in responses.
killcode=$PASS
Password required for remote server termination via the kill endpoint.
oldcode=$PASS
Password for terminating any previously running instance of the server.
oldaddress=$KILL
URL endpoint for terminating old server instances before starting the new one.

Analysis Parameters

k=12,9
Dual k-mer lengths for protein sequence analysis. Uses 12-mers for specificity and 9-mers for sensitivity.
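
To show how the documented variables and parameters fit together, the following dry run assembles one possible launch command and prints it instead of executing it. It is a sketch under stated assumptions: the taxserver.sh path is borrowed from the commented testing command, and the argument order and the log redirection are guesses rather than the script's actual text.

```shell
# Variables as documented above.
PORT=3074
DOMAIN=https://protein-sketch.jgi.doe.gov
KILL=https://protein-sketch.jgi.doe.gov/kill/
PASS=xxxxx            # placeholder; replace before deployment
DB=ProkProt
LOG=proteinlogVM_32.txt

# Assumption: same install path as the commented testing command.
TAXSERVER=/global/projectb/sandbox/gaag/bbtools/jgi-bbtools/taxserver.sh

# Collect the arguments in "$@" so each parameter stays a separate word.
# Assumption: the ordering here is illustrative only.
set -- -da -Xmx16g prealloc=0.9 port="$PORT" verbose tree=auto sizemult=2 \
    sketchonly "$DB" index amino k=12,9 \
    domain="$DOMAIN" killcode="$PASS" oldcode="$PASS" oldaddress="$KILL"

# Dry run: build and print the command line rather than running it.
# Assumption: output is redirected to $LOG and the job is backgrounded.
LAUNCH_CMD="nohup $TAXSERVER $* > $LOG 2>&1 &"
echo "$LAUNCH_CMD"
```

Printing the command first makes it easy to confirm that every variable expanded as intended before a real launch.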

Service Endpoints

Once started, the server provides several HTTP endpoints for protein analysis:

Analysis Endpoints

Management Endpoints

Usage

Basic Startup

# 1. Ensure proper environment (run on jgi-web-5)
ssh user@jgi-web-5

# 2. Navigate to script location
cd /path/to/pipelines/server/

# 3. Update security password (IMPORTANT!)
vim startProteinServerVM.sh
# Change PASS=xxxxx to a secure password

# 4. Start the server
bash startProteinServerVM.sh

Verification

# Check if server started successfully
tail -f proteinlogVM_32.txt

# Test server response
curl https://protein-sketch.jgi.doe.gov/help

# Check if port is listening
netstat -tulpn | grep 3074
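
Loading a large database can take a while, so a single probe right after launch may fail even when startup is proceeding normally. The sketch below retries an arbitrary probe command until it succeeds or a retry budget runs out. The function name wait_until is hypothetical, and the retry counts in the usage comment are illustrative values, not measured startup times.

```shell
# Hypothetical helper: wait_until TRIES DELAY_SECONDS COMMAND [ARGS...]
# Runs COMMAND up to TRIES times, sleeping DELAY_SECONDS after each failure.
# Returns 0 as soon as COMMAND succeeds, 1 if every attempt fails.
wait_until() {
    tries=$1
    delay=$2
    shift 2
    i=0
    while [ "$i" -lt "$tries" ]; do
        if "$@" >/dev/null 2>&1; then
            return 0
        fi
        i=$((i + 1))
        sleep "$delay"
    done
    return 1
}

# Usage (not executed here): poll the help page for up to about 5 minutes.
# wait_until 30 10 curl -fsS https://protein-sketch.jgi.doe.gov/help
```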

Client Usage Examples

# Submit protein sequence for analysis
curl -X POST https://protein-sketch.jgi.doe.gov/sketch \
  -H "Content-Type: text/plain" \
  -d ">protein1
MKLVLSLSLVLALLLPAALASQLNLQDPDFQQQWAFIGLCLTGAYLDSSSTFQNQGLNFQPLTQEVYRTQTNRKMEPFIPLTPETGAVSWEYGDSEQ"

# Compare sequences
sendsketch.sh in=proteins.fa address=protein-sketch.jgi.doe.gov

# Get taxonomic information
curl "https://protein-sketch.jgi.doe.gov/taxonomy?taxid=511145"

Process Management

Background Execution

The server runs in the background using nohup, so it keeps running after the launching terminal session ends.

Monitoring

# Monitor real-time logs
tail -f proteinlogVM_32.txt

# Check server process
ps aux | grep taxserver

# Monitor memory usage (pgrep -d, joins multiple PIDs with commas, as top -p expects)
top -p "$(pgrep -d, -f taxserver)"

Shutdown

# Graceful shutdown (if kill code is known)
curl https://protein-sketch.jgi.doe.gov/kill/[your_password]

# Force shutdown
pkill -f "taxserver.sh.*ProkProt"

# Or by process ID
kill [PID_from_ps_command]
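
When stopping the server by process ID, it is usually kinder to send a normal termination signal first and force the kill only if the process does not exit. The sketch below does that for a local PID. The function name stop_pid is hypothetical, and it works with signals only; it does not use the server's kill endpoint.

```shell
# Hypothetical helper: stop_pid PID TIMEOUT_SECONDS
# Sends SIGTERM, waits up to TIMEOUT_SECONDS for the process to exit,
# then sends SIGKILL if it is still running.
stop_pid() {
    pid=$1
    timeout=$2
    # If the signal cannot be delivered, the process is already gone.
    kill "$pid" 2>/dev/null || return 0
    waited=0
    # kill -0 only tests whether the process still exists.
    while kill -0 "$pid" 2>/dev/null; do
        if [ "$waited" -ge "$timeout" ]; then
            kill -9 "$pid" 2>/dev/null
            return 0
        fi
        sleep 1
        waited=$((waited + 1))
    done
    return 0
}

# Usage (not executed here): stop_pid 12345 30
```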

Performance Characteristics

Memory Usage

Scalability

Analysis Performance

Testing Mode

The script includes a commented testing configuration for development:

# Testing mode (uncomment and modify as needed):
# nohup /global/projectb/sandbox/gaag/bbtools/jgi-bbtools/taxserver.sh -ea -Xmx28g port=$PORT verbose tree=auto sizemult=2 sketchonly $DB k=12,9 index=f

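Testing Mode Differences

As written, the commented command differs from the production parameters described above in these visible ways:
  • -ea enables Java assertions instead of disabling them with -da
  • -Xmx28g raises the maximum heap size from 16 GB to 28 GB
  • index=f turns indexing off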

Security Considerations

Important Security Notes:
  • Change the default PASS value from "xxxxx" to a secure password before deployment
  • Ensure proper SSL/TLS configuration for the HTTPS domain
  • Restrict network access to the server port as needed
  • Monitor server logs for unauthorized access attempts
  • Regularly update the kill code password

Network Security

Troubleshooting

Common Issues

Server Won't Start
  • Check if port 3074 is already in use: netstat -tulpn | grep 3074
  • Verify Java is installed and accessible
  • Check log file for error messages: tail -f proteinlogVM_32.txt
  • Ensure sufficient memory is available (16GB+)
Database Loading Fails
  • Verify ProkProt database files are accessible
  • Check file permissions on database directory
  • Ensure taxonomic tree files are present
  • Verify network access to database resources
Connection Issues
  • Verify SSL certificate configuration for HTTPS domain
  • Check firewall rules for port 3074
  • Test local connection: curl localhost:3074/help
  • Verify DNS resolution for protein-sketch.jgi.doe.gov
Performance Issues
  • Monitor memory usage: top -p "$(pgrep -d, -f taxserver)"
  • Check for memory swapping: vmstat 1
  • Review concurrent connection limits
  • Consider adjusting sizemult parameter
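
The memory check from the list above can be automated on Linux by reading MemTotal from /proc/meminfo. This is a minimal sketch with hypothetical function names. Note that a machine sold as "16 GB" typically reports slightly less than 16 GiB, so the threshold may need adjusting.

```shell
# Hypothetical pre-flight check; not part of the original script.

# Prints total system memory in kB, read from a meminfo-style file.
# $1 optionally overrides the default /proc/meminfo.
mem_total_kb() {
    awk '/^MemTotal:/ {print $2}' "${1:-/proc/meminfo}"
}

# Succeeds if total memory is at least $1 GiB.
# $2 optionally names an alternative meminfo-style file.
has_enough_mem() {
    kb=$(mem_total_kb "${2:-}")
    [ -n "$kb" ] && [ "$kb" -ge $(( $1 * 1024 * 1024 )) ]
}

# Usage (Linux only, not executed here):
# has_enough_mem 16 || echo "Less than 16 GiB of RAM; -Xmx16g may not fit"
```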

Log Analysis

The server logs all activity to proteinlogVM_32.txt. Key information includes:

Startup Messages

Runtime Activity

Log Monitoring Commands

# Real-time log monitoring
tail -f proteinlogVM_32.txt

# Search for specific events
grep -i "error" proteinlogVM_32.txt
grep "connection" proteinlogVM_32.txt

# Monitor server startup
tail -f proteinlogVM_32.txt | grep -E "(Ready|Loading|Error)"
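
For a quick summary rather than a live view, matching lines can be counted. The helper below is a small sketch; the name count_matches is hypothetical, and because the server's log line formats are not documented here, the search patterns are left to the caller.

```shell
# Hypothetical helper: count_matches PATTERN FILE
# Prints how many lines of FILE match PATTERN, ignoring case.
# grep -c exits non-zero when the count is 0, so "|| true" keeps the
# helper from failing in scripts that run with "set -e".
count_matches() {
    grep -ci -- "$1" "$2" || true
}

# Usage (not executed here): count_matches error proteinlogVM_32.txt
```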

Integration

This server integrates with the broader BBTools ecosystem:

Related Tools

Client Integration

# Using sendsketch.sh with this server
sendsketch.sh in=proteins.fa address=protein-sketch.jgi.doe.gov

# Direct HTTP API usage (--data-binary preserves the newlines in the FASTA file; -d @file would strip them)
curl -X POST https://protein-sketch.jgi.doe.gov/sketch \
  -H "Content-Type: text/plain" \
  --data-binary @protein_sequences.fa