Start Protein Server VM
Startup script for the JGI protein sketch server with taxonomic classification capabilities. This script launches a high-performance taxonomic server specifically configured for protein sequence analysis and classification at the Joint Genome Institute.
Overview
The startProteinServerVM.sh script configures and starts a TaxServer instance tuned for amino acid sequence sketching and taxonomic classification of protein sequences at JGI.
This server provides remote access to protein sequence analysis capabilities, including k-mer based sketching, taxonomic classification, and similarity searching against prokaryotic protein databases. The server is specifically tuned for high-throughput protein analysis workflows used in genomics and metagenomics research.
Prerequisites
System Requirements
- Linux environment with Bash shell
- Java Runtime Environment (JRE) or JDK
- BBTools suite installed with taxserver.sh
- Access to prokaryotic protein database (ProkProt)
- Sufficient memory allocation (16GB configured)
- Network port 3074 available
Database Requirements
- ProkProt Database: Prokaryotic protein sequence database
- Taxonomic Tree: NCBI taxonomy tree structure
- Sketch Index: Pre-computed k-mer sketches for protein sequences
Infrastructure Requirements
- JGI computing infrastructure access
- Proper SSL/TLS certificates for HTTPS domain
- Network configuration allowing external connections on port 3074
- Log file write permissions
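The prerequisites above can be checked before launch. The following is a minimal sketch and not part of the original script: the function names (have_java, port_free, log_writable, preflight) are hypothetical, and the port check relies on bash's /dev/tcp pseudo-device, so it only detects listeners reachable on 127.0.0.1.

```shell
#!/usr/bin/env bash
# Preflight sketch. Function names are hypothetical; values mirror the script.
PORT=3074
LOG=proteinlogVM_32.txt

# Succeeds if a java executable is on the PATH.
have_java() { command -v java >/dev/null 2>&1; }

# Succeeds if nothing accepts connections on the given local port.
# Uses bash's /dev/tcp pseudo-device, so this needs bash.
port_free() {
    if (exec 3<>"/dev/tcp/127.0.0.1/$1") 2>/dev/null; then
        return 1
    fi
    return 0
}

# Succeeds if the log file can be appended to, or created in its directory.
log_writable() {
    local dir
    dir=$(dirname -- "$1")
    if [ -e "$1" ]; then [ -w "$1" ]; else [ -w "$dir" ]; fi
}

# Run all checks and report every failure, not just the first one.
preflight() {
    local status=0
    have_java || { echo "java not found on PATH" >&2; status=1; }
    port_free "$PORT" || { echo "port $PORT is already in use" >&2; status=1; }
    log_writable "$LOG" || { echo "cannot write to $LOG" >&2; status=1; }
    return "$status"
}
```

Calling `preflight || exit 1` just before the launch line would stop the script early when a requirement is missing.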
Server Configuration
Service Parameters
Network Configuration
- PORT=3074
- TCP port number for the server to listen on. This port must be available and accessible through any firewalls.
- DOMAIN=https://protein-sketch.jgi.doe.gov
- The public domain name where the service will be accessible. This domain should have proper SSL/TLS certificates configured.
- KILL=https://protein-sketch.jgi.doe.gov/kill/
- Remote shutdown endpoint URL. Combined with the kill code, allows secure remote termination of the server.
Security Configuration
- PASS=xxxxx
- Security password for remote server management operations. This should be changed from the placeholder value before deployment.
Database Configuration
- DB=ProkProt
- Database identifier specifying the prokaryotic protein database to be used for sequence analysis and classification.
- LOG=proteinlogVM_32.txt
- Log file path for server output and error messages. The log captures all server activity and diagnostic information.
TaxServer Parameters
The script launches taxserver.sh with the following configuration:
JVM Configuration
- -da
- Disable Java assertions for production performance. Removes assertion checks to improve runtime speed.
- -Xmx16g
- Maximum heap size allocation of 16 gigabytes. This large memory allocation supports handling of large protein databases and concurrent user requests.
Server Behavior
- prealloc=0.9
- Pre-allocate memory for internal data structures at startup, using up to 90% of the available memory. This reduces allocation overhead during operation.
- port=$PORT
- Port the server listens on, taken from the PORT variable (3074).
- verbose
- Enable verbose logging output for detailed operational monitoring and debugging.
- tree=auto
- Automatically locate and load the taxonomic tree structure from the standard JGI location.
- sizemult=2
- Size multiplier for internal data structures. A value of 2 provides extra capacity for handling peak loads.
- sketchonly
- Operate in sketch-only mode, focusing on k-mer based sequence sketching and comparison rather than full alignment.
- index
- Build or use indexed data structures for faster sequence lookups and comparisons.
- amino
- Configure the server specifically for amino acid (protein) sequence analysis rather than nucleotide sequences.
Security Parameters
- domain=$DOMAIN
- Domain name displayed in help messages and used for generating proper URLs in responses.
- killcode=$PASS
- Password required for remote server termination via the kill endpoint.
- oldcode=$PASS
- Password for terminating any previously running instance of the server.
- oldaddress=$KILL
- URL endpoint for terminating old server instances before starting the new one.
Analysis Parameters
- k=12,9
- Dual k-mer lengths for protein sequence analysis. Uses 12-mers for specificity and 9-mers for sensitivity, with the aim of balancing the two for protein classification.
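To see how the parameters above fit together, the following dry-run sketch assembles them into one command line and prints it instead of running it. It is an illustration under assumptions, not the script itself: TAXSERVER is a placeholder for the real path to taxserver.sh, the position of $DB follows the commented testing line in the script, and appending with >> is assumed (the script may truncate the log instead).

```shell
#!/usr/bin/env bash
# Dry-run sketch: assemble the documented taxserver.sh arguments and print them.
# TAXSERVER is an assumed location; point it at the real taxserver.sh.
TAXSERVER=${TAXSERVER:-taxserver.sh}
PORT=3074
DOMAIN=https://protein-sketch.jgi.doe.gov
KILL=https://protein-sketch.jgi.doe.gov/kill/
PASS=xxxxx   # placeholder, as in the script; replace before deployment
DB=ProkProt
LOG=proteinlogVM_32.txt

# Emit one argument per line, in the order this section lists them.
build_args() {
    printf '%s\n' \
        -da -Xmx16g prealloc=0.9 "port=$PORT" verbose tree=auto sizemult=2 \
        sketchonly index amino "domain=$DOMAIN" "killcode=$PASS" \
        "oldcode=$PASS" "oldaddress=$KILL" "$DB" k=12,9
}

# Print the command line rather than executing it.
echo "nohup $TAXSERVER $(build_args | tr '\n' ' ')>> $LOG 2>&1 &"
```

Because killcode and oldcode both expand from PASS, changing the password in one place updates the new server's kill code and the code sent to the old instance together.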
Service Endpoints
Once started, the server provides several HTTP endpoints for protein analysis:
Analysis Endpoints
- /sketch - Submit protein sequences for k-mer sketching and taxonomic classification
- /compare - Compare protein sequences or sketches against the database
- /taxonomy - Retrieve taxonomic information for specific taxa
- /help - Display API documentation and usage instructions
Management Endpoints
- /kill/[password] - Securely terminate the server (requires kill code)
- /status - Server health and performance metrics
Usage
Basic Startup
# 1. Ensure proper environment (run on jgi-web-5)
ssh user@jgi-web-5
# 2. Navigate to script location
cd /path/to/pipelines/server/
# 3. Update security password (IMPORTANT!)
vim startProteinServerVM.sh
# Change PASS=xxxxx to a secure password
# 4. Start the server
bash startProteinServerVM.sh
Verification
# Check if server started successfully
tail -f proteinlogVM_32.txt
# Test server response
curl https://protein-sketch.jgi.doe.gov/help
# Check if port is listening
netstat -tulpn | grep 3074
Client Usage Examples
# Submit protein sequence for analysis
curl -X POST https://protein-sketch.jgi.doe.gov/sketch \
-H "Content-Type: text/plain" \
-d ">protein1
MKLVLSLSLVLALLLPAALASQLNLQDPDFQQQWAFIGLCLTGAYLDSSSTFQNQGLNFQPLTQEVYRTQTNRKMEPFIPLTPETGAVSWEYGDSEQ"
# Compare sequences
sendsketch.sh in=proteins.fa address=protein-sketch.jgi.doe.gov
# Get taxonomic information
curl "https://protein-sketch.jgi.doe.gov/taxonomy?taxid=511145"
Process Management
Background Execution
The server runs in the background using nohup, which means:
- The process continues running after SSH disconnection
- Output is redirected to the log file
- The server survives terminal session termination
Monitoring
# Monitor real-time logs
tail -f proteinlogVM_32.txt
# Check server process
ps aux | grep taxserver
# Monitor memory usage
# (-d, joins multiple matching PIDs with commas, as top -p expects)
top -p "$(pgrep -d, -f taxserver)"
Shutdown
# Graceful shutdown (if kill code is known)
curl https://protein-sketch.jgi.doe.gov/kill/[your_password]
# Force shutdown
pkill -f "taxserver.sh.*ProkProt"
# Or by process ID
kill [PID_from_ps_command]
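When the kill endpoint is unreachable, one option is to escalate by signal: ask the process to exit, wait a bounded time, and only then force it. The helper below is a hypothetical sketch, not a BBTools feature; the default 10-second grace period is an arbitrary assumption.

```shell
#!/usr/bin/env bash
# Shutdown sketch: request termination, wait, then force if still alive.
# stop_pid is a hypothetical helper, not part of BBTools.
stop_pid() {
    local pid=$1
    local timeout=${2:-10}                        # seconds before SIGKILL
    kill -TERM "$pid" 2>/dev/null || return 0     # process already gone
    while [ "$timeout" -gt 0 ]; do
        kill -0 "$pid" 2>/dev/null || return 0    # exited after SIGTERM
        sleep 1
        timeout=$((timeout - 1))
    done
    kill -KILL "$pid" 2>/dev/null                 # still alive: force it
    return 0
}
```

A call such as `stop_pid "$SERVER_PID" 15` would give the server 15 seconds to exit before it is killed, where SERVER_PID is whatever PID was recorded at launch.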
Performance Characteristics
Memory Usage
- Base Allocation: 16GB heap space for JVM
- Pre-allocation: 90% of structures allocated at startup
- Database Loading: ProkProt database loaded into memory for fast access
- Index Structures: K-mer indexes built for rapid sequence comparison
Scalability
- Concurrent Connections: Supports multiple simultaneous client connections
- Thread Pool: Multi-threaded request processing
- Memory Efficiency: Sketch-based approach reduces memory requirements vs. full alignment
- Size Multiplier: 2x capacity buffer for handling traffic spikes
Analysis Performance
- K-mer Strategy: Dual k-mer lengths (12,9) trade specificity (12-mers) against sensitivity (9-mers)
- Amino Acid Mode: Specialized protein sequence handling
- Index-based Lookup: Pre-computed indexes accelerate searches
- Sketch Comparison: Fast approximate matching for large-scale analysis
Testing Mode
The script includes a commented testing configuration for development:
# Testing mode (uncomment and modify as needed):
# nohup /global/projectb/sandbox/gaag/bbtools/jgi-bbtools/taxserver.sh -ea -Xmx28g port=$PORT verbose tree=auto sizemult=2 sketchonly $DB k=12,9 index=f
Testing Mode Differences
- -ea: Enable assertions for debugging (vs. -da in production)
- -Xmx28g: Higher memory allocation for development testing
- index=f: Disable indexing for faster startup during testing
- Reduced Security: No kill codes or domain restrictions
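The flag differences above could be selected by a single switch instead of commenting lines in and out. This is a sketch only: mode_flags and the mode names are hypothetical, and it covers just the JVM and index flags that differ between the two configurations.

```shell
#!/usr/bin/env bash
# Sketch: pick the flags that differ between production and testing.
# mode_flags and the mode names are hypothetical.
mode_flags() {
    case "$1" in
        production) echo "-da -Xmx16g index" ;;
        testing)    echo "-ea -Xmx28g index=f" ;;
        *)          echo "unknown mode: $1" >&2; return 1 ;;
    esac
}

mode_flags testing
```

The unknown-mode branch fails loudly so that a typo cannot silently start a server with neither set of flags.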
Security Considerations
- Change the default PASS value from "xxxxx" to a secure password before deployment
- Ensure proper SSL/TLS configuration for the HTTPS domain
- Restrict network access to the server port as needed
- Monitor server logs for unauthorized access attempts
- Regularly update the kill code password
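The first and last points can be partly automated. The sketch below refuses the placeholder value and generates a random replacement; check_pass and new_pass are hypothetical helpers, and the 32-character hex format is an assumption chosen because the kill code is passed in a URL path.

```shell
#!/usr/bin/env bash
# Sketch: refuse to launch with the placeholder kill code.
# check_pass and new_pass are hypothetical helpers.
check_pass() {
    case "$1" in
        ""|xxxxx)
            echo "PASS is unset or still the placeholder" >&2
            return 1 ;;
    esac
    return 0
}

# Generate a 32-character hex string from the kernel's random source.
# Hex digits need no escaping when placed in the /kill/ URL.
new_pass() {
    head -c 16 /dev/urandom | od -An -tx1 | tr -d ' \n'
}
```

A launch script could run `check_pass "$PASS" || exit 1` before starting the server, and an operator could run `new_pass` whenever the kill code is rotated.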
Network Security
- HTTPS Only: Server configured for secure HTTP connections
- Port Restrictions: Only necessary port (3074) exposed
- Kill Code Protection: Remote shutdown requires authentication
- Domain Validation: Server validates expected domain configuration
Troubleshooting
Common Issues
- Check if port 3074 is already in use:
netstat -tulpn | grep 3074
- Verify Java is installed and accessible
- Check log file for error messages:
tail -f proteinlogVM_32.txt
- Ensure sufficient memory is available (16GB+)
- Verify ProkProt database files are accessible
- Check file permissions on database directory
- Ensure taxonomic tree files are present
- Verify network access to database resources
- Verify SSL certificate configuration for HTTPS domain
- Check firewall rules for port 3074
- Test local connection:
curl localhost:3074/help
- Verify DNS resolution for protein-sketch.jgi.doe.gov
- Monitor memory usage:
top -p "$(pgrep -d, -f taxserver)"
- Check for memory swapping:
vmstat 1
- Review concurrent connection limits
- Consider adjusting sizemult parameter
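For the memory check in the list above, available memory can be read from /proc/meminfo before starting the server. This is a sketch under assumptions: mem_available_gb is a hypothetical helper, it relies on the MemAvailable field that Linux kernels expose, and the 16 GB threshold simply mirrors the configured heap size.

```shell
#!/usr/bin/env bash
# Sketch: report available memory in whole gigabytes.
# mem_available_gb is a hypothetical helper; the optional argument lets
# a different meminfo-style file be supplied.
mem_available_gb() {
    local file=${1:-/proc/meminfo}
    # MemAvailable is reported in kB; convert to GB and truncate.
    awk '/^MemAvailable:/ { printf "%d\n", $2 / 1024 / 1024 }' "$file"
}

REQUIRED_GB=16   # matches -Xmx16g
avail=$(mem_available_gb)
if [ -n "$avail" ] && [ "$avail" -lt "$REQUIRED_GB" ]; then
    echo "only ${avail} GB available, ${REQUIRED_GB} GB heap requested" >&2
fi
```

The check only warns; whether to abort the launch on low memory is left to the operator.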
Log Analysis
The server logs all activity to proteinlogVM_32.txt. Key information includes:
Startup Messages
- JVM configuration and memory allocation
- Database loading progress and completion
- Network binding and port configuration
- Index building status
Runtime Activity
- Client connection and disconnection events
- Query processing statistics
- Error messages and stack traces
- Performance metrics and timing information
Log Monitoring Commands
# Real-time log monitoring
tail -f proteinlogVM_32.txt
# Search for specific events
grep "ERROR" proteinlogVM_32.txt
grep "connection" proteinlogVM_32.txt
# Monitor server startup
tail -f proteinlogVM_32.txt | grep -E "(Ready|Loading|Error)"
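For unattended startup, the interactive `tail -f` above can be replaced by a bounded wait. The helper below is a hypothetical sketch; the "Ready" pattern is taken from the grep example above, and the actual wording of the server's startup message is an assumption that should be confirmed against a real log.

```shell
#!/usr/bin/env bash
# Sketch: wait until a pattern appears in a log file, or give up.
# wait_for_log is a hypothetical helper.
wait_for_log() {
    local file=$1
    local pattern=$2
    local timeout=${3:-60}          # seconds to wait before giving up
    while [ "$timeout" -gt 0 ]; do
        if grep -Eq "$pattern" "$file" 2>/dev/null; then
            return 0                # pattern found
        fi
        sleep 1
        timeout=$((timeout - 1))
    done
    return 1                        # timed out
}
```

For example, `wait_for_log proteinlogVM_32.txt "Ready" 600 || echo "server did not come up" >&2` would wait up to ten minutes for the database to load.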
Integration
This server integrates with the broader BBTools ecosystem:
Related Tools
- sendsketch.sh: Client tool for submitting sequences to the server
- comparesketch.sh: Local sketch comparison tool
- sketch.sh: Local sketch generation tool
- taxserver.sh: The underlying server implementation
Client Integration
# Using sendsketch.sh with this server
sendsketch.sh in=proteins.fa address=protein-sketch.jgi.doe.gov
# Direct HTTP API usage
# (--data-binary sends the file unchanged; -d @file would strip the FASTA line breaks)
curl -X POST https://protein-sketch.jgi.doe.gov/sketch \
-H "Content-Type: text/plain" \
--data-binary @protein_sequences.fa