RefSeq Sketch Generation Pipeline

Script: sketchRefSeq.sh Source Directory: pipelines/fetch/ Author: Brian Bushnell Last Updated: August 7, 2019

A comprehensive pipeline for generating taxonomic sketches from RefSeq genomic data. This script processes RefSeq sequences through taxonomic sorting, k-mer blacklist generation, and creates optimized sketch databases for rapid taxonomic classification and similarity analysis.

Overview

This pipeline transforms RefSeq genomic data into efficient sketch-based reference databases for taxonomic analysis. The process includes taxonomic sorting for memory optimization, k-mer blacklist generation to remove uninformative sequences, and creation of taxonomically-organized sketch files that enable rapid sequence comparison and classification.

Important: Ensure taxonomy data is updated before running this pipeline. The script is designed for NERSC systems but can be adapted for other environments by modifying the TAXPATH variable to point to your BBTools taxonomy directory.

Prerequisites

System Requirements

Input Requirements

Environment Configuration

TAXPATH="auto"

The pipeline uses automatic taxonomy path detection. For custom installations, modify this to your taxonomy directory path (e.g., TAXPATH="/path/to/taxonomy_directory/").

Pipeline Stages

Stage 1: Taxonomic Sorting

sortbyname.sh -Xmx100g in=renamed.fa.gz memmult=0.33 out=sorted.fa.gz zl=9 pigz=64 bgzip taxa tree=auto gi=ignore fastawrap=1023 minlen=60 readbufferlen=2 readbuffers=1 taxpath=$TAXPATH

Purpose: Sorts RefSeq sequences by taxonomic classification to optimize memory usage during subsequent sketch generation.

Key Parameters:

Memory Troubleshooting:

If sorting runs out of memory, use the alternative merging command:

mergesorted.sh -Xmx100g sort_temp* bgzip unbgzip zl=8 out=sorted.fa.gz taxa tree=auto fastawrap=1023 minlen=60 readbufferlen=2 readbuffers=1 ow taxpath=$TAXPATH

Or add the maxfiles=4 parameter to the original sortbyname command to limit concurrent file operations.

Stage 2: K-mer Blacklist Generation

sketchblacklist.sh -Xmx63g in=sorted.fa.gz prepasses=1 tree=auto taxa taxlevel=species ow out=blacklist_refseq_genus_140.sketch mincount=140 k=32,24 sizemult=2 taxpath=$TAXPATH

Purpose: Creates a blacklist of k-mers that occur in 140 or more different genera, removing uninformative k-mers that appear too frequently across diverse taxa.

Key Parameters:

Algorithm Details:

The blacklist generation uses a two-phase approach: first counting k-mer occurrences across taxonomic groups, then identifying k-mers that appear too frequently to be taxonomically informative. This removes universal or highly conserved k-mers that would interfere with accurate taxonomic classification.

Stage 3: Taxonomic Sketch Generation

bbsketch.sh -Xmx63g in=sorted.fa.gz out=taxa#.sketch mode=taxa tree=auto accession=null gi=null files=31 ow unpigz minsize=400 prefilter autosize blacklist=blacklist_refseq_genus_140.sketch k=32,24 depth sizemult=2 taxpath=$TAXPATH

Purpose: Generates 31 separate sketch files, with one sketch per species, optimized for rapid taxonomic classification of query sequences.

Key Parameters:

Multi-file Strategy:

The 31-file distribution enables parallel sketch loading and comparison. Each file contains a subset of species-level sketches, allowing the comparison tools to load and process subsets independently, significantly reducing memory requirements and enabling parallel processing.

Usage Examples

Basic Pipeline Execution

# Run the complete pipeline (ensure renamed.fa.gz exists) bash sketchRefSeq.sh

Query Against Generated Sketches

# Compare contigs against the new reference sketches comparesketch.sh in=contigs.fa k=32,24 tree=auto taxa*.sketch blacklist=blacklist_refseq_genus_140 taxpath=$TAXPATH

Using Default RefSeq Database (NERSC)

# On NERSC systems with default paths configured comparesketch.sh in=contigs.fa refseq tree=auto

This command automatically uses the default RefSeq sketch path, blacklist, and k-mer values configured for the system.

Memory Management Examples

# If sortbyname runs out of memory, use limited file approach sortbyname.sh -Xmx100g in=renamed.fa.gz maxfiles=4 out=sorted.fa.gz ...
# Alternative: merge pre-sorted temporary files mergesorted.sh -Xmx100g sort_temp* out=sorted.fa.gz ...

Output Files

Primary Outputs

Intermediate Files

File Characteristics

Algorithm Details

Taxonomic Sorting Strategy

The pipeline sorts sequences by taxonomy to enable memory-efficient sketch generation. When sketches are organized taxonomically, they can be written to disk immediately after creation rather than being held in memory. This approach dramatically reduces memory requirements for large-scale RefSeq processing.

Blacklist Generation Algorithm

The blacklist identifies k-mers that occur in 140+ different genera. These k-mers are typically:

Removing these k-mers improves taxonomic specificity and reduces false positive matches during classification.

Multi-K Strategy

The pipeline uses dual k-mer sizes (k=32,24) to balance sensitivity and specificity:

This dual-k approach enables robust taxonomic classification across varying evolutionary distances.

File Distribution Strategy

The 31-file output distribution provides several advantages:

System Configuration

SLURM Parameters

The pipeline includes SLURM job scheduler configuration for high-performance computing environments:

Memory Requirements

Performance Optimizations

Troubleshooting

Memory Issues

Sorting Stage

If sortbyname.sh runs out of memory:

  1. Increase -Xmx flag: Try -Xmx150g or higher if available
  2. Add maxfiles parameter: Use maxfiles=4 to limit concurrent file operations
  3. Use mergesorted fallback: Run the provided mergesorted.sh command to merge temporary files

Sketch Generation

If bbsketch.sh runs out of memory:

Taxonomy Issues

Integration with BBTools Ecosystem

Upstream Dependencies

Downstream Applications

Related Pipelines

Performance Characteristics

Runtime Estimates

Resource Utilization

Scalability Considerations

Output Validation

Quality Checks

# Verify sketch file creation ls -la taxa*.sketch # Check blacklist effectiveness wc -l blacklist_refseq_genus_140.sketch # Test sketch functionality comparesketch.sh in=test_sequence.fa taxa0.sketch

Expected Results

Validation Commands

# Count sketches in each file for f in taxa*.sketch; do echo "$f: $(sketch.sh in=$f list=t | wc -l) sketches" done # Test blacklist application bbsketch.sh in=test.fa blacklist=blacklist_refseq_genus_140.sketch

Configuration and Customization

Custom Taxonomy Path

For installations outside NERSC, modify the taxonomy path:

TAXPATH="/custom/path/to/taxonomy/"

Memory Optimization

For systems with limited memory, reduce memory allocations proportionally:

# For 32GB system, use roughly 1/3 of the standard allocations sortbyname.sh -Xmx30g ... sketchblacklist.sh -Xmx20g ... bbsketch.sh -Xmx20g ...

Alternative Blacklist Thresholds

Adjust the blacklist threshold based on your specificity requirements:

File Distribution Options

Adjust the number of output files based on your system:

Notes and Best Practices

Preprocessing Requirements

System Considerations

Quality Assurance

Maintenance and Updates