RefSeq Protein Processing Pipeline

Script: runRefSeqProtein.sh Source Directory: pipelines/fetch/ Author: Brian Bushnell Last Updated: August 19, 2019

Specialized pipeline for processing RefSeq sequences by filtering prokaryotic clades, translating them to protein sequences, and creating comprehensive protein sketches for taxonomic analysis. This pipeline is designed for NERSC environments but can be adapted for other systems.

Overview

This pipeline performs comprehensive protein database processing from RefSeq nucleotide sequences. It filters prokaryotic sequences (Viruses, Bacteria, Archaea, plasmids), downloads additional RefSeq data for eukaryotic organelles, translates all sequences to proteins using gene calling, and creates taxonomically-organized protein sketches for rapid sequence comparison.

NERSC Configuration: This pipeline is configured for NERSC/Genepool with SLURM job scheduling. For use outside NERSC, modify the TAXPATH variable to point to your local taxonomy directory and remove SLURM directives.

Prerequisites

System Requirements

Input Requirements

Environment Configuration

Set the TAXPATH variable to your taxonomy directory:

# For local installations, modify this line in the script:
TAXPATH="/path/to/your/taxonomy_directory/"

# The script uses "auto" for NERSC environments

Pipeline Stages

1. SLURM Job Configuration

The pipeline is configured as a SLURM job with the following specifications:

2. Prokaryotic Sequence Filtering

filterbytaxa.sh in=sorted.fa.gz out=prot/prok.fa.gz \
    fastawrap=4095 ids=Viruses,Bacteria,Archaea,plasmids \
    tree=auto -Xmx16g include taxpath=$TAXPATH zl=9 requirepresent=f

Filters the input sequences to include only prokaryotic clades:

Parameters used:

3. RefSeq Data Download and Processing

Downloads and processes additional RefSeq genomic data for eukaryotic organelles:

3.1 Protozoa Download

wget -q -O - ftp://ftp.ncbi.nlm.nih.gov/refseq/release/protozoa/*genomic.fna.gz | \
    gi2taxid.sh -Xmx1g in=stdin.fa.gz out=protozoa.fa.gz zl=9 server ow

3.2 Mitochondrial Genomes Download

wget -q -O - ftp://ftp.ncbi.nlm.nih.gov/refseq/release/mitochondrion/*genomic.fna.gz | \
    gi2taxid.sh -Xmx1g in=stdin.fa.gz out=mito.fa.gz zl=9 server ow

3.3 Plastid Genomes Download

wget -q -O - ftp://ftp.ncbi.nlm.nih.gov/refseq/release/plastid/*genomic.fna.gz | \
    gi2taxid.sh -Xmx1g in=stdin.fa.gz out=chloro.fa.gz zl=9 server ow

Each download stream is processed through gi2taxid.sh to:

4. Gene Calling and Protein Translation

Translates all nucleotide sequences to amino acid sequences using gene calling:

4.1 Prokaryotic Gene Calling

callgenes.sh in=prok.fa.gz outa=prok.faa.gz -Xmx16g ow ordered=f zl=9

4.2 Protozoan Gene Calling

callgenes.sh in=protozoa.fa.gz outa=protozoa.faa.gz -Xmx16g ow ordered=f zl=9

4.3 Mitochondrial Gene Calling

callgenes.sh in=mito.fa.gz outa=mito.faa.gz -Xmx16g ow ordered=f zl=9

4.4 Plastid Gene Calling

callgenes.sh in=chloro.fa.gz outa=chloro.faa.gz -Xmx16g ow ordered=f zl=9

Gene calling parameters:

5. Protein Sequence Consolidation

cat prok.faa.gz mito.faa.gz chloro.faa.gz protozoa.faa.gz > all.faa.gz

Combines all protein sequences into a single comprehensive file for downstream sketching operations.

6. Protein Sketch Generation with Blacklists

6.1 Family-Level Blacklist Creation

sketchblacklist.sh -Xmx63g in=prok.faa.gz prepasses=1 tree=auto taxa \
    taxlevel=family ow out=blacklist_prokprot_family_40.sketch \
    mincount=40 k=9,12 sizemult=3 amino taxpath=$TAXPATH

6.2 Genus-Level Blacklist Creation

sketchblacklist.sh -Xmx63g in=prok.faa.gz prepasses=1 tree=auto taxa \
    taxlevel=genus ow out=blacklist_prokprot_genus_80.sketch \
    mincount=80 k=9,12 sizemult=3 amino taxpath=$TAXPATH

Blacklist parameters:

6.3 Blacklist Merging

mergesketch.sh -Xmx1g in=blacklist_prokprot_genus_80.sketch,blacklist_prokprot_family_40.sketch \
    out=blacklist_prokprot_merged.sketch amino name0=blacklist_prokprot_merged k=9,12 ow

Combines the family and genus blacklists into a comprehensive merged blacklist for optimal k-mer filtering.

7. Final Protein Sketch Database Creation

bbsketch.sh -Xmx63g in=all.faa.gz out=taxa#.sketch mode=taxa tree=auto \
    files=31 ow unpigz minsize=200 prefilter autosize \
    blacklist=blacklist_prokprot_merged.sketch k=9,12 depth sizemult=3 \
    amino taxpath=$TAXPATH

Creates the final taxonomically-organized protein sketch database:

Algorithm and Design

Protein K-mer Analysis

This pipeline uses dual k-mer lengths (k=9,12) specifically optimized for amino acid sequences:

Blacklist Strategy

The dual-level blacklist approach eliminates uninformative k-mers at two taxonomic levels:

  1. Family level (mincount=40) - Removes k-mers common to 40+ different families
  2. Genus level (mincount=80) - Removes k-mers common to 80+ different genera
  3. Merged approach - Combines both blacklists for comprehensive filtering

This strategy preserves taxonomically informative k-mers while removing highly conserved protein domains that would create spurious matches across distantly related organisms.

Gene Calling Integration

The pipeline uses CallGenes.java to identify and translate protein-coding sequences:

Data Processing Workflow

Input Data Flow

  1. sorted.fa.gz → Prokaryotic filtering → prot/prok.fa.gz
  2. RefSeq downloads → Taxonomic mapping → protozoa.fa.gz, mito.fa.gz, chloro.fa.gz
  3. All nucleotide files → Gene calling → Protein FASTA files (.faa.gz)
  4. All protein files → Concatenation → all.faa.gz
  5. all.faa.gz → Sketch creation → taxa#.sketch files

RefSeq Data Sources

The pipeline downloads data from these RefSeq release directories:

Taxonomic Integration

Each downloaded stream is processed through gi2taxid.sh to maintain taxonomic context:

Memory and Performance Considerations

Memory Allocation

The pipeline uses tiered memory allocation based on processing intensity:

Processing Time Estimates

Based on RefSeq database sizes, typical processing times:

Disk Space Requirements

Output Files

Primary Outputs

Protein Sequence Files

Sketch Database Files

Usage of Output Sketches

The generated protein sketches can be used for:

Basic Usage

Standard Execution

# 1. Ensure sorted input file exists
ls -la sorted.fa.gz

# 2. Set taxonomy path (if not using NERSC)
export TAXPATH="/path/to/your/taxonomy"

# 3. Submit job (SLURM environment)
sbatch runRefSeqProtein.sh

# 4. Monitor progress
squeue -u $USER
tail -f log_*.out

Manual Execution (Non-SLURM)

# 1. Remove SLURM directives from script
# 2. Set taxonomy path
TAXPATH="/path/to/taxonomy"

# 3. Run directly
bash runRefSeqProtein.sh

Verification Commands

# Check output file sizes
ls -lh prot/

# Verify sketch database creation
ls -lh prot/taxa*.sketch

# Test sketch functionality
comparesketch.sh in=test.faa prot/taxa*.sketch amino

Customization Options

Taxonomic Filtering Modifications

Modify the filterbytaxa.sh command to include/exclude different clades:

# Include only specific bacteria
ids=Proteobacteria,Firmicutes,Actinobacteria

# Exclude specific groups
exclude ids=Viruses

# Include all prokaryotes plus fungi
ids=Bacteria,Archaea,Fungi

Memory Optimization

Adjust memory allocation based on available resources:

# Reduce memory for smaller systems
-Xmx8g    # For callgenes operations
-Xmx32g   # For sketch operations

# Increase for larger datasets
-Xmx128g  # For very large RefSeq processing

K-mer Parameter Tuning

Modify k-mer settings for different analysis needs:

# Higher specificity
k=12,15

# Higher sensitivity  
k=6,9

# Single k-mer length
k=12

Integration with BBSketch Ecosystem

Server Deployment

The generated protein sketches can be deployed as a sketch server:

taxserver.sh tree=$TAXPATH/tree.taxtree.gz prot/taxa*.sketch \
    port=3456 amino 1>server.log 2>&1 &

Client Queries

Query the protein sketch database:

# Local comparison
comparesketch.sh in=query.faa prot/taxa*.sketch amino

# Server query (if running taxserver)
sendsketch.sh in=query.faa address=localhost:3456 amino

Complementary Nucleotide Processing

This pipeline complements nucleotide sketch databases by providing protein-level analysis:

Troubleshooting

Common Issues

Network Download Failures

# Retry failed downloads manually
wget -c ftp://ftp.ncbi.nlm.nih.gov/refseq/release/protozoa/protozoa.1.genomic.fna.gz
wget -c ftp://ftp.ncbi.nlm.nih.gov/refseq/release/protozoa/protozoa.2.genomic.fna.gz
# ... etc for all files

Memory Errors

If sketch generation fails with out-of-memory errors:

Taxonomy Path Issues

If taxonomy operations fail:

Notes and Best Practices

Pipeline Design Principles

Adaptation Guidelines

Quality Control

Related Tools and Pipelines

Related BBTools

Related Pipelines

Documentation References