RefSeq Protein Processing Pipeline

Overview

This pipeline performs comprehensive protein database processing from RefSeq nucleotide sequences. It filters prokaryotic sequences (Viruses, Bacteria, Archaea, plasmids), downloads additional RefSeq data for eukaryotic organelles, translates all sequences to proteins using gene calling, and creates taxonomically-organized protein sketches for rapid sequence comparison.

NERSC Configuration: This pipeline is configured for NERSC/Genepool with SLURM job scheduling. For use outside NERSC, modify the TAXPATH variable to point to your local taxonomy directory and remove SLURM directives.

Prerequisites

System Requirements

BBTools suite with taxonomy support
Java 1.7 or higher
Bash shell environment
Internet connectivity for RefSeq downloads
pigz for parallel compression (optional but recommended)
Minimum 64GB RAM recommended for large-scale processing

Input Requirements

sorted.fa.gz - Primary input file containing sorted sequences
Taxonomy files - BBTools taxonomy database (tree.taxtree.gz, gi table files)
Network access - For downloading RefSeq data via FTP

Environment Configuration

Set the TAXPATH variable to your taxonomy directory:

# For local installations, modify this line in the script:
TAXPATH="/path/to/your/taxonomy_directory/"

# The script uses "auto" for NERSC environments

Pipeline Stages

1. SLURM Job Configuration

The pipeline is configured as a SLURM job with the following specifications:

Job Name: sketch_refseq
Queue: genepool
Account: gtrqc
Runtime: 71 hours maximum
Nodes: 1 exclusive Haswell node
Error/Output logs: log_%j.err, log_%j.out

2. Prokaryotic Sequence Filtering

filterbytaxa.sh in=sorted.fa.gz out=prot/prok.fa.gz \
    fastawrap=4095 ids=Viruses,Bacteria,Archaea,plasmids \
    tree=auto -Xmx16g include taxpath=$TAXPATH zl=9 requirepresent=f

Filters the input sequences to include only prokaryotic clades:

Viruses - All viral sequences
Bacteria - All bacterial sequences
Archaea - All archaeal sequences
Plasmids - Plasmid sequences

Parameters used:

fastawrap=4095 - Sets FASTA line wrapping to 4095 characters
include - Include sequences matching the specified taxa
requirepresent=f - Don't require all taxa to be present
zl=9 - Compression level 9 for maximum compression

3. RefSeq Data Download and Processing

Downloads and processes additional RefSeq genomic data for eukaryotic organelles:

3.1 Protozoa Download

wget -q -O - ftp://ftp.ncbi.nlm.nih.gov/refseq/release/protozoa/*genomic.fna.gz | \
    gi2taxid.sh -Xmx1g in=stdin.fa.gz out=protozoa.fa.gz zl=9 server ow

3.2 Mitochondrial Genomes Download

wget -q -O - ftp://ftp.ncbi.nlm.nih.gov/refseq/release/mitochondrion/*genomic.fna.gz | \
    gi2taxid.sh -Xmx1g in=stdin.fa.gz out=mito.fa.gz zl=9 server ow

3.3 Plastid Genomes Download

wget -q -O - ftp://ftp.ncbi.nlm.nih.gov/refseq/release/plastid/*genomic.fna.gz | \
    gi2taxid.sh -Xmx1g in=stdin.fa.gz out=chloro.fa.gz zl=9 server ow

Each download stream is processed through gi2taxid.sh to:

Parse GI numbers and map to taxonomic IDs
Maintain taxonomic annotations during processing
Compress output with maximum compression (zl=9)
Overwrite existing files (ow flag)
Use server mode for efficient processing

4. Gene Calling and Protein Translation

Translates all nucleotide sequences to amino acid sequences using gene calling:

4.1 Prokaryotic Gene Calling

callgenes.sh in=prok.fa.gz outa=prok.faa.gz -Xmx16g ow ordered=f zl=9

4.2 Protozoan Gene Calling

callgenes.sh in=protozoa.fa.gz outa=protozoa.faa.gz -Xmx16g ow ordered=f zl=9

4.3 Mitochondrial Gene Calling

callgenes.sh in=mito.fa.gz outa=mito.faa.gz -Xmx16g ow ordered=f zl=9

4.4 Plastid Gene Calling

callgenes.sh in=chloro.fa.gz outa=chloro.faa.gz -Xmx16g ow ordered=f zl=9

Gene calling parameters:

outa - Output amino acid sequences (proteins)
-Xmx16g - Allocate 16GB of memory
ow - Overwrite existing output files
ordered=f - Disable ordered processing for better performance
zl=9 - Maximum compression level

5. Protein Sequence Consolidation

cat prok.faa.gz mito.faa.gz chloro.faa.gz protozoa.faa.gz > all.faa.gz

Combines all protein sequences into a single comprehensive file for downstream sketching operations.

6. Protein Sketch Generation with Blacklists

6.1 Family-Level Blacklist Creation

sketchblacklist.sh -Xmx63g in=prok.faa.gz prepasses=1 tree=auto taxa \
    taxlevel=family ow out=blacklist_prokprot_family_40.sketch \
    mincount=40 k=9,12 sizemult=3 amino taxpath=$TAXPATH

6.2 Genus-Level Blacklist Creation

sketchblacklist.sh -Xmx63g in=prok.faa.gz prepasses=1 tree=auto taxa \
    taxlevel=genus ow out=blacklist_prokprot_genus_80.sketch \
    mincount=80 k=9,12 sizemult=3 amino taxpath=$TAXPATH

Blacklist parameters:

-Xmx63g - Allocate 63GB memory for large-scale processing
prepasses=1 - Single preparation pass for efficiency
taxlevel - Taxonomic level for grouping (family vs genus)
mincount - Minimum occurrence count (40 for family, 80 for genus)
k=9,12 - Dual k-mer lengths for amino acid sequences
sizemult=3 - Size multiplier for sketch generation
amino - Process as amino acid sequences

6.3 Blacklist Merging

mergesketch.sh -Xmx1g in=blacklist_prokprot_genus_80.sketch,blacklist_prokprot_family_40.sketch \
    out=blacklist_prokprot_merged.sketch amino name0=blacklist_prokprot_merged k=9,12 ow

Combines the family and genus blacklists into a comprehensive merged blacklist for optimal k-mer filtering.

7. Final Protein Sketch Database Creation

bbsketch.sh -Xmx63g in=all.faa.gz out=taxa#.sketch mode=taxa tree=auto \
    files=31 ow unpigz minsize=200 prefilter autosize \
    blacklist=blacklist_prokprot_merged.sketch k=9,12 depth sizemult=3 \
    amino taxpath=$TAXPATH

Creates the final taxonomically-organized protein sketch database:

mode=taxa - Group sequences by taxonomic ID
files=31 - Split output into 31 files for efficient loading
minsize=200 - Minimum sequence size for inclusion
prefilter - Enable prefiltering for large datasets
autosize - Automatically determine optimal sketch sizes
depth - Store k-mer depth information
blacklist - Use merged blacklist to exclude common/uninformative k-mers

Algorithm and Design

Protein K-mer Analysis

This pipeline uses dual k-mer lengths (k=9,12) specifically optimized for amino acid sequences:

k=9 - Shorter k-mers provide higher sensitivity for distantly related proteins
k=12 - Longer k-mers provide higher specificity and reduce false positives
Amino acid alphabet - 20-character alphabet vs 4-character nucleotide alphabet

Blacklist Strategy

The dual-level blacklist approach eliminates uninformative k-mers at two taxonomic levels:

Family level (mincount=40) - Removes k-mers common to 40+ different families
Genus level (mincount=80) - Removes k-mers common to 80+ different genera
Merged approach - Combines both blacklists for comprehensive filtering

This strategy preserves taxonomically informative k-mers while removing highly conserved protein domains that would create spurious matches across distantly related organisms.

Gene Calling Integration

The pipeline uses CallGenes.java to identify and translate protein-coding sequences:

Identifies open reading frames using multiple genetic codes
Handles prokaryotic, mitochondrial, and plastid genetic codes appropriately
Outputs amino acid sequences in FASTA format
Maintains taxonomic annotations from input sequences

Data Processing Workflow

Input Data Flow

sorted.fa.gz → Prokaryotic filtering → prot/prok.fa.gz
RefSeq downloads → Taxonomic mapping → protozoa.fa.gz, mito.fa.gz, chloro.fa.gz
All nucleotide files → Gene calling → Protein FASTA files (.faa.gz)
All protein files → Concatenation → all.faa.gz
all.faa.gz → Sketch creation → taxa#.sketch files

RefSeq Data Sources

The pipeline downloads data from these RefSeq release directories:

protozoa/ - Single-celled eukaryotic organisms
mitochondrion/ - Mitochondrial genomes from all organisms
plastid/ - Chloroplast and other plastid genomes

Taxonomic Integration

Each downloaded stream is processed through gi2taxid.sh to maintain taxonomic context:

Maps GI numbers to NCBI taxonomic identifiers
Preserves taxonomic lineage information
Enables species/genus/family level analysis
Supports downstream taxonomic sketch organization

Memory and Performance Considerations

Memory Allocation

The pipeline uses tiered memory allocation based on processing intensity:

1GB - gi2taxid.sh operations (streaming processing)
16GB - filterbytaxa.sh and callgenes.sh operations
63GB - sketchblacklist.sh and bbsketch.sh (intensive k-mer processing)

Processing Time Estimates

Based on RefSeq database sizes, typical processing times:

Prokaryotic filtering: 2-4 hours depending on input size
RefSeq downloads: 1-3 hours depending on network speed
Gene calling: 4-8 hours total for all datasets
Sketch generation: 8-16 hours for comprehensive blacklist creation
Total runtime: 15-30 hours for complete processing

Disk Space Requirements

Input: Variable based on sorted.fa.gz size
Downloads: 50-100GB for RefSeq organellar data
Protein files: 20-40GB for translated sequences
Sketch files: 1-5GB for final sketch database
Working space: 200-300GB total recommended

Output Files

Primary Outputs

prot/prok.fa.gz - Filtered prokaryotic nucleotide sequences
prot/protozoa.fa.gz - Protozoan genomic sequences with taxonomy
prot/mito.fa.gz - Mitochondrial genomic sequences
prot/chloro.fa.gz - Plastid genomic sequences

Protein Sequence Files

prot/prok.faa.gz - Prokaryotic protein sequences
prot/protozoa.faa.gz - Protozoan protein sequences
prot/mito.faa.gz - Mitochondrial protein sequences
prot/chloro.faa.gz - Plastid protein sequences
prot/all.faa.gz - Combined protein sequence database

Sketch Database Files

prot/blacklist_prokprot_family_40.sketch - Family-level blacklist
prot/blacklist_prokprot_genus_80.sketch - Genus-level blacklist
prot/blacklist_prokprot_merged.sketch - Combined blacklist
prot/taxa0.sketch through prot/taxa30.sketch - Final protein sketch database (31 files)

Usage of Output Sketches

The generated protein sketches can be used for:

Protein sequence classification: comparesketch.sh in=query.faa prot/taxa*.sketch amino
Taxonomic assignment: sendsketch.sh in=proteins.faa amino (with local server)
Contamination detection: Identify non-target proteins in samples
Phylogenetic analysis: Rapid protein-based relatedness assessment

Basic Usage

Standard Execution

# 1. Ensure sorted input file exists
ls -la sorted.fa.gz

# 2. Set taxonomy path (if not using NERSC)
export TAXPATH="/path/to/your/taxonomy"

# 3. Submit job (SLURM environment)
sbatch runRefSeqProtein.sh

# 4. Monitor progress
squeue -u $USER
tail -f log_*.out

Manual Execution (Non-SLURM)

# 1. Remove SLURM directives from script
# 2. Set taxonomy path
TAXPATH="/path/to/taxonomy"

# 3. Run directly
bash runRefSeqProtein.sh

Verification Commands

# Check output file sizes
ls -lh prot/

# Verify sketch database creation
ls -lh prot/taxa*.sketch

# Test sketch functionality
comparesketch.sh in=test.faa prot/taxa*.sketch amino

Customization Options

Taxonomic Filtering Modifications

Modify the filterbytaxa.sh command to include/exclude different clades:

# Include only specific bacteria
ids=Proteobacteria,Firmicutes,Actinobacteria

# Exclude specific groups
exclude ids=Viruses

# Include all prokaryotes plus fungi
ids=Bacteria,Archaea,Fungi

Memory Optimization

Adjust memory allocation based on available resources:

# Reduce memory for smaller systems
-Xmx8g    # For callgenes operations
-Xmx32g   # For sketch operations

# Increase for larger datasets
-Xmx128g  # For very large RefSeq processing

K-mer Parameter Tuning

Modify k-mer settings for different analysis needs:

# Higher specificity
k=12,15

# Higher sensitivity  
k=6,9

# Single k-mer length
k=12

Integration with BBSketch Ecosystem

Server Deployment

The generated protein sketches can be deployed as a sketch server:

taxserver.sh tree=$TAXPATH/tree.taxtree.gz prot/taxa*.sketch \
    port=3456 amino 1>server.log 2>&1 &

Client Queries

Query the protein sketch database:

# Local comparison
comparesketch.sh in=query.faa prot/taxa*.sketch amino

# Server query (if running taxserver)
sendsketch.sh in=query.faa address=localhost:3456 amino

Complementary Nucleotide Processing

This pipeline complements nucleotide sketch databases by providing protein-level analysis:

Enables protein domain-based classification
Provides functional gene content analysis
Supports cross-kingdom protein similarity searches
Facilitates horizontal gene transfer detection

Troubleshooting

Common Issues

Network Download Failures

# Retry failed downloads manually
wget -c ftp://ftp.ncbi.nlm.nih.gov/refseq/release/protozoa/protozoa.1.genomic.fna.gz
wget -c ftp://ftp.ncbi.nlm.nih.gov/refseq/release/protozoa/protozoa.2.genomic.fna.gz
# ... etc for all files

Memory Errors

If sketch generation fails with out-of-memory errors:

Add prefilter=2 to sketchblacklist.sh and bbsketch.sh commands
Reduce sizemult from 3 to 2 or 1
Process datasets in smaller chunks
Increase available memory allocation

Taxonomy Path Issues

If taxonomy operations fail:

Verify TAXPATH points to directory containing tree.taxtree.gz
Check that gitable files are present and accessible
Ensure taxonomy files are from compatible BBTools version
Consider using absolute paths instead of "auto"

Notes and Best Practices

Pipeline Design Principles

Comprehensive coverage: Includes major prokaryotic clades plus eukaryotic organelles
Protein focus: Optimized for functional gene analysis rather than whole-genome comparison
Blacklist optimization: Dual-level approach balances sensitivity and specificity
Scalability: Designed for RefSeq-scale datasets (millions of sequences)

Adaptation Guidelines

Modify taxonomic filters based on research focus
Adjust k-mer parameters for target sequence divergence
Scale memory allocation based on available resources
Consider local vs remote sketch server deployment needs

Quality Control

Monitor download completion for all RefSeq categories
Verify gene calling produces expected protein counts
Check sketch file sizes are reasonable (not empty or oversized)
Test sketch functionality with known protein queries

Related Tools and Pipelines

Related BBTools

filterbytaxa.sh - Taxonomic sequence filtering
callgenes.sh - Gene calling and protein translation
gi2taxid.sh - GI to taxonomy ID mapping
sketchblacklist.sh - Blacklist sketch generation
bbsketch.sh - Main sketch creation tool
mergesketch.sh - Sketch file manipulation

Related Pipelines

fetchRefSeq.sh - Downloads and processes RefSeq nucleotide data
fetchNt.sh - Downloads and processes NCBI nt database
fetchTaxonomy.sh - Downloads NCBI taxonomy files
startNtServer.sh - Deploys nucleotide sketch server

Documentation References

BBSketchGuide.txt - Comprehensive guide to sketch tools
TaxonomyGuide.txt - NCBI taxonomy integration guide
UsageGuide.txt - General BBTools usage patterns