RefSeq Protein Processing Pipeline
Specialized pipeline for processing RefSeq sequences by filtering prokaryotic clades, translating them to protein sequences, and creating comprehensive protein sketches for taxonomic analysis. This pipeline is designed for NERSC environments but can be adapted for other systems.
Overview
This pipeline performs comprehensive protein database processing from RefSeq nucleotide sequences. It filters prokaryotic sequences (Viruses, Bacteria, Archaea, plasmids), downloads additional RefSeq data for eukaryotic organelles, translates all sequences to proteins using gene calling, and creates taxonomically-organized protein sketches for rapid sequence comparison.
Prerequisites
System Requirements
- BBTools suite with taxonomy support
- Java 1.7 or higher
- Bash shell environment
- Internet connectivity for RefSeq downloads
- pigz for parallel compression (optional but recommended)
- Minimum 64GB RAM recommended for large-scale processing
Input Requirements
- sorted.fa.gz - Primary input file containing sorted sequences
- Taxonomy files - BBTools taxonomy database (tree.taxtree.gz, gi table files)
- Network access - For downloading RefSeq data via FTP
Environment Configuration
Set the TAXPATH variable to your taxonomy directory:
# For local installations, modify this line in the script:
TAXPATH="/path/to/your/taxonomy_directory/"
# The script uses "auto" for NERSC environments
Pipeline Stages
1. SLURM Job Configuration
The pipeline is configured as a SLURM job with the following specifications:
- Job Name: sketch_refseq
- Queue: genepool
- Account: gtrqc
- Runtime: 71 hours maximum
- Nodes: 1 exclusive Haswell node
- Error/Output logs: log_%j.err, log_%j.out
2. Prokaryotic Sequence Filtering
filterbytaxa.sh in=sorted.fa.gz out=prot/prok.fa.gz \
fastawrap=4095 ids=Viruses,Bacteria,Archaea,plasmids \
tree=auto -Xmx16g include taxpath=$TAXPATH zl=9 requirepresent=f
Filters the input sequences to include only prokaryotic clades:
- Viruses - All viral sequences
- Bacteria - All bacterial sequences
- Archaea - All archaeal sequences
- Plasmids - Plasmid sequences
Parameters used:
- fastawrap=4095 - Sets FASTA line wrapping to 4095 characters
- include - Include sequences matching the specified taxa
- requirepresent=f - Don't require all taxa to be present
- zl=9 - Compression level 9 for maximum compression
3. RefSeq Data Download and Processing
Downloads and processes additional RefSeq genomic data for eukaryotic organelles:
3.1 Protozoa Download
wget -q -O - ftp://ftp.ncbi.nlm.nih.gov/refseq/release/protozoa/*genomic.fna.gz | \
gi2taxid.sh -Xmx1g in=stdin.fa.gz out=protozoa.fa.gz zl=9 server ow
3.2 Mitochondrial Genomes Download
wget -q -O - ftp://ftp.ncbi.nlm.nih.gov/refseq/release/mitochondrion/*genomic.fna.gz | \
gi2taxid.sh -Xmx1g in=stdin.fa.gz out=mito.fa.gz zl=9 server ow
3.3 Plastid Genomes Download
wget -q -O - ftp://ftp.ncbi.nlm.nih.gov/refseq/release/plastid/*genomic.fna.gz | \
gi2taxid.sh -Xmx1g in=stdin.fa.gz out=chloro.fa.gz zl=9 server ow
Each download stream is processed through gi2taxid.sh to:
- Parse GI numbers and map to taxonomic IDs
- Maintain taxonomic annotations during processing
- Compress output with maximum compression (zl=9)
- Overwrite existing files (ow flag)
- Use server mode for efficient processing
4. Gene Calling and Protein Translation
Translates all nucleotide sequences to amino acid sequences using gene calling:
4.1 Prokaryotic Gene Calling
callgenes.sh in=prok.fa.gz outa=prok.faa.gz -Xmx16g ow ordered=f zl=9
4.2 Protozoan Gene Calling
callgenes.sh in=protozoa.fa.gz outa=protozoa.faa.gz -Xmx16g ow ordered=f zl=9
4.3 Mitochondrial Gene Calling
callgenes.sh in=mito.fa.gz outa=mito.faa.gz -Xmx16g ow ordered=f zl=9
4.4 Plastid Gene Calling
callgenes.sh in=chloro.fa.gz outa=chloro.faa.gz -Xmx16g ow ordered=f zl=9
Gene calling parameters:
- outa - Output amino acid sequences (proteins)
- -Xmx16g - Allocate 16GB of memory
- ow - Overwrite existing output files
- ordered=f - Disable ordered processing for better performance
- zl=9 - Maximum compression level
5. Protein Sequence Consolidation
cat prok.faa.gz mito.faa.gz chloro.faa.gz protozoa.faa.gz > all.faa.gz
Combines all protein sequences into a single comprehensive file for downstream sketching operations.
6. Protein Sketch Generation with Blacklists
6.1 Family-Level Blacklist Creation
sketchblacklist.sh -Xmx63g in=prok.faa.gz prepasses=1 tree=auto taxa \
taxlevel=family ow out=blacklist_prokprot_family_40.sketch \
mincount=40 k=9,12 sizemult=3 amino taxpath=$TAXPATH
6.2 Genus-Level Blacklist Creation
sketchblacklist.sh -Xmx63g in=prok.faa.gz prepasses=1 tree=auto taxa \
taxlevel=genus ow out=blacklist_prokprot_genus_80.sketch \
mincount=80 k=9,12 sizemult=3 amino taxpath=$TAXPATH
Blacklist parameters:
- -Xmx63g - Allocate 63GB memory for large-scale processing
- prepasses=1 - Single preparation pass for efficiency
- taxlevel - Taxonomic level for grouping (family vs genus)
- mincount - Minimum occurrence count (40 for family, 80 for genus)
- k=9,12 - Dual k-mer lengths for amino acid sequences
- sizemult=3 - Size multiplier for sketch generation
- amino - Process as amino acid sequences
6.3 Blacklist Merging
mergesketch.sh -Xmx1g in=blacklist_prokprot_genus_80.sketch,blacklist_prokprot_family_40.sketch \
out=blacklist_prokprot_merged.sketch amino name0=blacklist_prokprot_merged k=9,12 ow
Combines the family and genus blacklists into a comprehensive merged blacklist for optimal k-mer filtering.
7. Final Protein Sketch Database Creation
bbsketch.sh -Xmx63g in=all.faa.gz out=taxa#.sketch mode=taxa tree=auto \
files=31 ow unpigz minsize=200 prefilter autosize \
blacklist=blacklist_prokprot_merged.sketch k=9,12 depth sizemult=3 \
amino taxpath=$TAXPATH
Creates the final taxonomically-organized protein sketch database:
- mode=taxa - Group sequences by taxonomic ID
- files=31 - Split output into 31 files for efficient loading
- minsize=200 - Minimum sequence size for inclusion
- prefilter - Enable prefiltering for large datasets
- autosize - Automatically determine optimal sketch sizes
- depth - Store k-mer depth information
- blacklist - Use merged blacklist to exclude common/uninformative k-mers
Algorithm and Design
Protein K-mer Analysis
This pipeline uses dual k-mer lengths (k=9,12) specifically optimized for amino acid sequences:
- k=9 - Shorter k-mers provide higher sensitivity for distantly related proteins
- k=12 - Longer k-mers provide higher specificity and reduce false positives
- Amino acid alphabet - 20-character alphabet vs 4-character nucleotide alphabet
Blacklist Strategy
The dual-level blacklist approach eliminates uninformative k-mers at two taxonomic levels:
- Family level (mincount=40) - Removes k-mers common to 40+ different families
- Genus level (mincount=80) - Removes k-mers common to 80+ different genera
- Merged approach - Combines both blacklists for comprehensive filtering
This strategy preserves taxonomically informative k-mers while removing highly conserved protein domains that would create spurious matches across distantly related organisms.
Gene Calling Integration
The pipeline uses CallGenes.java to identify and translate protein-coding sequences:
- Identifies open reading frames using multiple genetic codes
- Handles prokaryotic, mitochondrial, and plastid genetic codes appropriately
- Outputs amino acid sequences in FASTA format
- Maintains taxonomic annotations from input sequences
Data Processing Workflow
Input Data Flow
- sorted.fa.gz → Prokaryotic filtering → prot/prok.fa.gz
- RefSeq downloads → Taxonomic mapping → protozoa.fa.gz, mito.fa.gz, chloro.fa.gz
- All nucleotide files → Gene calling → Protein FASTA files (.faa.gz)
- All protein files → Concatenation → all.faa.gz
- all.faa.gz → Sketch creation → taxa#.sketch files
RefSeq Data Sources
The pipeline downloads data from these RefSeq release directories:
- protozoa/ - Single-celled eukaryotic organisms
- mitochondrion/ - Mitochondrial genomes from all organisms
- plastid/ - Chloroplast and other plastid genomes
Taxonomic Integration
Each downloaded stream is processed through gi2taxid.sh to maintain taxonomic context:
- Maps GI numbers to NCBI taxonomic identifiers
- Preserves taxonomic lineage information
- Enables species/genus/family level analysis
- Supports downstream taxonomic sketch organization
Memory and Performance Considerations
Memory Allocation
The pipeline uses tiered memory allocation based on processing intensity:
- 1GB - gi2taxid.sh operations (streaming processing)
- 16GB - filterbytaxa.sh and callgenes.sh operations
- 63GB - sketchblacklist.sh and bbsketch.sh (intensive k-mer processing)
Processing Time Estimates
Based on RefSeq database sizes, typical processing times:
- Prokaryotic filtering: 2-4 hours depending on input size
- RefSeq downloads: 1-3 hours depending on network speed
- Gene calling: 4-8 hours total for all datasets
- Sketch generation: 8-16 hours for comprehensive blacklist creation
- Total runtime: 15-30 hours for complete processing
Disk Space Requirements
- Input: Variable based on sorted.fa.gz size
- Downloads: 50-100GB for RefSeq organellar data
- Protein files: 20-40GB for translated sequences
- Sketch files: 1-5GB for final sketch database
- Working space: 200-300GB total recommended
Output Files
Primary Outputs
- prot/prok.fa.gz - Filtered prokaryotic nucleotide sequences
- prot/protozoa.fa.gz - Protozoan genomic sequences with taxonomy
- prot/mito.fa.gz - Mitochondrial genomic sequences
- prot/chloro.fa.gz - Plastid genomic sequences
Protein Sequence Files
- prot/prok.faa.gz - Prokaryotic protein sequences
- prot/protozoa.faa.gz - Protozoan protein sequences
- prot/mito.faa.gz - Mitochondrial protein sequences
- prot/chloro.faa.gz - Plastid protein sequences
- prot/all.faa.gz - Combined protein sequence database
Sketch Database Files
- prot/blacklist_prokprot_family_40.sketch - Family-level blacklist
- prot/blacklist_prokprot_genus_80.sketch - Genus-level blacklist
- prot/blacklist_prokprot_merged.sketch - Combined blacklist
- prot/taxa0.sketch through prot/taxa30.sketch - Final protein sketch database (31 files)
Usage of Output Sketches
The generated protein sketches can be used for:
- Protein sequence classification: comparesketch.sh in=query.faa prot/taxa*.sketch amino
- Taxonomic assignment: sendsketch.sh in=proteins.faa amino (with local server)
- Contamination detection: Identify non-target proteins in samples
- Phylogenetic analysis: Rapid protein-based relatedness assessment
Basic Usage
Standard Execution
# 1. Ensure sorted input file exists
ls -la sorted.fa.gz
# 2. Set taxonomy path (if not using NERSC)
export TAXPATH="/path/to/your/taxonomy"
# 3. Submit job (SLURM environment)
sbatch runRefSeqProtein.sh
# 4. Monitor progress
squeue -u $USER
tail -f log_*.out
Manual Execution (Non-SLURM)
# 1. Remove SLURM directives from script
# 2. Set taxonomy path
TAXPATH="/path/to/taxonomy"
# 3. Run directly
bash runRefSeqProtein.sh
Verification Commands
# Check output file sizes
ls -lh prot/
# Verify sketch database creation
ls -lh prot/taxa*.sketch
# Test sketch functionality
comparesketch.sh in=test.faa prot/taxa*.sketch amino
Customization Options
Taxonomic Filtering Modifications
Modify the filterbytaxa.sh command to include/exclude different clades:
# Include only specific bacteria
ids=Proteobacteria,Firmicutes,Actinobacteria
# Exclude specific groups
exclude ids=Viruses
# Include all prokaryotes plus fungi
ids=Bacteria,Archaea,Fungi
Memory Optimization
Adjust memory allocation based on available resources:
# Reduce memory for smaller systems
-Xmx8g # For callgenes operations
-Xmx32g # For sketch operations
# Increase for larger datasets
-Xmx128g # For very large RefSeq processing
K-mer Parameter Tuning
Modify k-mer settings for different analysis needs:
# Higher specificity
k=12,15
# Higher sensitivity
k=6,9
# Single k-mer length
k=12
Integration with BBSketch Ecosystem
Server Deployment
The generated protein sketches can be deployed as a sketch server:
taxserver.sh tree=$TAXPATH/tree.taxtree.gz prot/taxa*.sketch \
port=3456 amino 1>server.log 2>&1 &
Client Queries
Query the protein sketch database:
# Local comparison
comparesketch.sh in=query.faa prot/taxa*.sketch amino
# Server query (if running taxserver)
sendsketch.sh in=query.faa address=localhost:3456 amino
Complementary Nucleotide Processing
This pipeline complements nucleotide sketch databases by providing protein-level analysis:
- Enables protein domain-based classification
- Provides functional gene content analysis
- Supports cross-kingdom protein similarity searches
- Facilitates horizontal gene transfer detection
Troubleshooting
Common Issues
Network Download Failures
# Retry failed downloads manually
wget -c ftp://ftp.ncbi.nlm.nih.gov/refseq/release/protozoa/protozoa.1.genomic.fna.gz
wget -c ftp://ftp.ncbi.nlm.nih.gov/refseq/release/protozoa/protozoa.2.genomic.fna.gz
# ... etc for all files
Memory Errors
If sketch generation fails with out-of-memory errors:
- Add prefilter=2 to sketchblacklist.sh and bbsketch.sh commands
- Reduce sizemult from 3 to 2 or 1
- Process datasets in smaller chunks
- Increase available memory allocation
Taxonomy Path Issues
If taxonomy operations fail:
- Verify TAXPATH points to directory containing tree.taxtree.gz
- Check that gitable files are present and accessible
- Ensure taxonomy files are from compatible BBTools version
- Consider using absolute paths instead of "auto"
Notes and Best Practices
Pipeline Design Principles
- Comprehensive coverage: Includes major prokaryotic clades plus eukaryotic organelles
- Protein focus: Optimized for functional gene analysis rather than whole-genome comparison
- Blacklist optimization: Dual-level approach balances sensitivity and specificity
- Scalability: Designed for RefSeq-scale datasets (millions of sequences)
Adaptation Guidelines
- Modify taxonomic filters based on research focus
- Adjust k-mer parameters for target sequence divergence
- Scale memory allocation based on available resources
- Consider local vs remote sketch server deployment needs
Quality Control
- Monitor download completion for all RefSeq categories
- Verify gene calling produces expected protein counts
- Check sketch file sizes are reasonable (not empty or oversized)
- Test sketch functionality with known protein queries
Related Tools and Pipelines
Related BBTools
- filterbytaxa.sh - Taxonomic sequence filtering
- callgenes.sh - Gene calling and protein translation
- gi2taxid.sh - GI to taxonomy ID mapping
- sketchblacklist.sh - Blacklist sketch generation
- bbsketch.sh - Main sketch creation tool
- mergesketch.sh - Sketch file manipulation
Related Pipelines
- fetchRefSeq.sh - Downloads and processes RefSeq nucleotide data
- fetchNt.sh - Downloads and processes NCBI nt database
- fetchTaxonomy.sh - Downloads NCBI taxonomy files
- startNtServer.sh - Deploys nucleotide sketch server
Documentation References
- BBSketchGuide.txt - Comprehensive guide to sketch tools
- TaxonomyGuide.txt - NCBI taxonomy integration guide
- UsageGuide.txt - General BBTools usage patterns