RefSeq Sketch Generation Pipeline

Overview

This pipeline transforms RefSeq genomic data into efficient sketch-based reference databases for taxonomic analysis. The process includes taxonomic sorting for memory optimization, k-mer blacklist generation to remove uninformative sequences, and creation of taxonomically-organized sketch files that enable rapid sequence comparison and classification.

Important: Ensure taxonomy data is updated before running this pipeline. The script is designed for NERSC systems but can be adapted for other environments by modifying the TAXPATH variable to point to your BBTools taxonomy directory.

Prerequisites

System Requirements

BBTools suite - Complete installation with sketch tools
High memory allocation - Pipeline requires up to 100GB RAM for sorting phase
pigz - Parallel compression utility for efficient file handling
Updated taxonomy database - Essential for proper taxonomic classification
Adequate storage - Multiple large intermediate files are generated

Input Requirements

renamed.fa.gz - RefSeq sequences renamed with taxonomic information
Taxonomy database - BBTools taxonomy files (auto-detected or custom path)
Minimum sequence length - Sequences must be ≥60bp for processing

Environment Configuration

TAXPATH="auto"

The pipeline uses automatic taxonomy path detection. For custom installations, modify this to your taxonomy directory path (e.g., TAXPATH="/path/to/taxonomy_directory/").

Pipeline Stages

Stage 1: Taxonomic Sorting

                sortbyname.sh -Xmx100g in=renamed.fa.gz memmult=0.33 out=sorted.fa.gz zl=9 pigz=64 bgzip taxa tree=auto gi=ignore fastawrap=1023 minlen=60 readbufferlen=2 readbuffers=1 taxpath=$TAXPATH
            

Purpose: Sorts RefSeq sequences by taxonomic classification to optimize memory usage during subsequent sketch generation.

Key Parameters:

-Xmx100g - Allocates 100GB memory for sorting large RefSeq datasets
memmult=0.33 - Uses 33% of allocated memory for temporary sorting operations
zl=9 - Maximum compression level for output file
pigz=64 - Uses 64 threads for parallel compression
bgzip - Uses bgzip compression format for block-based access
taxa - Enables taxonomic sorting mode
fastawrap=1023 - Sets FASTA line width to 1023 characters
minlen=60 - Excludes sequences shorter than 60 bases
gi=ignore - Ignores GI numbers in sequence headers

Memory Troubleshooting:

If sorting runs out of memory, use the alternative merging command:

                mergesorted.sh -Xmx100g sort_temp* bgzip unbgzip zl=8 out=sorted.fa.gz taxa tree=auto fastawrap=1023 minlen=60 readbufferlen=2 readbuffers=1 ow taxpath=$TAXPATH
            

Or add the maxfiles=4 parameter to the original sortbyname command to limit concurrent file operations.

Stage 2: K-mer Blacklist Generation

                sketchblacklist.sh -Xmx63g in=sorted.fa.gz prepasses=1 tree=auto taxa taxlevel=species ow out=blacklist_refseq_genus_140.sketch mincount=140 k=32,24 sizemult=2 taxpath=$TAXPATH
            

Purpose: Creates a blacklist of k-mers that occur in 140 or more different genera, removing uninformative k-mers that appear too frequently across diverse taxa.

Key Parameters:

-Xmx63g - Allocates 63GB memory for k-mer frequency analysis
prepasses=1 - Single preprocessing pass for k-mer counting
taxlevel=species - Performs analysis at species taxonomic level
mincount=140 - K-mers must appear in at least 140 different genera to be blacklisted
k=32,24 - Uses dual k-mer sizes (32bp and 24bp) for robust analysis
sizemult=2 - Doubles the default sketch size for higher sensitivity
ow - Overwrites existing output files

Algorithm Details:

The blacklist generation uses a two-phase approach: first counting k-mer occurrences across taxonomic groups, then identifying k-mers that appear too frequently to be taxonomically informative. This removes universal or highly conserved k-mers that would interfere with accurate taxonomic classification.

Stage 3: Taxonomic Sketch Generation

                bbsketch.sh -Xmx63g in=sorted.fa.gz out=taxa#.sketch mode=taxa tree=auto accession=null gi=null files=31 ow unpigz minsize=400 prefilter autosize blacklist=blacklist_refseq_genus_140.sketch k=32,24 depth sizemult=2 taxpath=$TAXPATH
            

Purpose: Generates 31 separate sketch files, with one sketch per species, optimized for rapid taxonomic classification of query sequences.

Key Parameters:

out=taxa#.sketch - Creates numbered sketch files (taxa0.sketch, taxa1.sketch, etc.)
mode=taxa - Generates one sketch per taxonomic unit (species level)
files=31 - Distributes sketches across 31 output files for parallel processing
accession=null, gi=null - Disables accession and GI number parsing for cleaner taxonomy
minsize=400 - Minimum sketch size threshold
prefilter - Enables prefiltering for memory efficiency
autosize - Automatically adjusts sketch sizes based on genome characteristics
depth - Includes k-mer depth information in sketches
unpigz - Uses parallel decompression for input processing

Multi-file Strategy:

The 31-file distribution enables parallel sketch loading and comparison. Each file contains a subset of species-level sketches, allowing the comparison tools to load and process subsets independently, significantly reducing memory requirements and enabling parallel processing.

Usage Examples

Basic Pipeline Execution

                # Run the complete pipeline (ensure renamed.fa.gz exists)
bash sketchRefSeq.sh
            

Query Against Generated Sketches

                # Compare contigs against the new reference sketches
comparesketch.sh in=contigs.fa k=32,24 tree=auto taxa*.sketch blacklist=blacklist_refseq_genus_140 taxpath=$TAXPATH
            

Using Default RefSeq Database (NERSC)

                # On NERSC systems with default paths configured
comparesketch.sh in=contigs.fa refseq tree=auto
            

This command automatically uses the default RefSeq sketch path, blacklist, and k-mer values configured for the system.

Memory Management Examples

                # If sortbyname runs out of memory, use limited file approach
sortbyname.sh -Xmx100g in=renamed.fa.gz maxfiles=4 out=sorted.fa.gz ...
            

                # Alternative: merge pre-sorted temporary files
mergesorted.sh -Xmx100g sort_temp* out=sorted.fa.gz ...
            

Output Files

Primary Outputs

sorted.fa.gz - Taxonomically sorted RefSeq sequences
blacklist_refseq_genus_140.sketch - K-mer blacklist for filtering uninformative sequences
taxa0.sketch through taxa30.sketch - 31 sketch files containing species-level genomic sketches

Intermediate Files

sort_temp* - Temporary files created during sorting (can be deleted after successful completion)
log_*.out, log_*.err - SLURM job output and error logs

File Characteristics

Compression - All outputs use high-level compression (zl=9) for storage efficiency
Format - Sketch files use BBTools binary format optimized for rapid loading
Taxonomy Integration - All sketches include taxonomic metadata for classification

Algorithm Details

Taxonomic Sorting Strategy

The pipeline sorts sequences by taxonomy to enable memory-efficient sketch generation. When sketches are organized taxonomically, they can be written to disk immediately after creation rather than being held in memory. This approach dramatically reduces memory requirements for large-scale RefSeq processing.

Blacklist Generation Algorithm

The blacklist identifies k-mers that occur in 140+ different genera. These k-mers are typically:

Universal sequences - Highly conserved regions present across many organisms
Low-complexity regions - Repetitive or low-information content sequences
Contamination artifacts - Sequences that may represent technical artifacts

Removing these k-mers improves taxonomic specificity and reduces false positive matches during classification.

Multi-K Strategy

The pipeline uses dual k-mer sizes (k=32,24) to balance sensitivity and specificity:

k=32 - Higher specificity for closely related species discrimination
k=24 - Higher sensitivity for detecting more distant relationships

This dual-k approach enables robust taxonomic classification across varying evolutionary distances.

File Distribution Strategy

The 31-file output distribution provides several advantages:

Parallel Loading - Comparison tools can load subsets independently
Memory Management - Reduces peak memory usage during sketch comparison
Scalability - Enables processing on systems with limited memory
Load Balancing - Distributes computational load across file subsets

System Configuration

SLURM Parameters

The pipeline includes SLURM job scheduler configuration for high-performance computing environments:

Job Name: sketch_refseq
Queue: genepool
Account: gtrqc
Time Limit: 71 hours
Architecture: Haswell processors
Exclusive Access: Entire node allocation

Memory Requirements

Sorting Phase: Up to 100GB RAM (with memmult=0.33 using ~33GB actively)
Blacklist Generation: 63GB RAM for k-mer frequency analysis
Sketch Creation: 63GB RAM for taxonomic sketch generation

Performance Optimizations

Parallel Compression: pigz=64 uses 64 threads for compression/decompression
Read Buffering: Optimized I/O with readbufferlen=2, readbuffers=1
Taxonomic Sorting: Enables immediate sketch writing, reducing memory usage
Prefiltering: Available for memory-constrained environments

Troubleshooting

Memory Issues

Sorting Stage

If sortbyname.sh runs out of memory:

Increase -Xmx flag: Try -Xmx150g or higher if available
Add maxfiles parameter: Use maxfiles=4 to limit concurrent file operations
Use mergesorted fallback: Run the provided mergesorted.sh command to merge temporary files

Sketch Generation

If bbsketch.sh runs out of memory:

Enable prefilter: Add prefilter=1 or prefilter=2 for additional memory efficiency
Reduce sizemult: Lower sizemult from 2 to 1 for smaller sketches
Increase file count: Use more output files (files=62 instead of files=31)

Taxonomy Issues

Outdated taxonomy: Update BBTools taxonomy database before running
Custom taxonomy path: Modify TAXPATH variable for non-standard installations
Missing GI mappings: Ensure renamed.fa.gz includes proper taxonomic headers

Integration with BBTools Ecosystem

Upstream Dependencies

RefSeq Download: Requires preprocessed RefSeq data with taxonomic headers
Sequence Renaming: Input must be renamed with taxonomic information
Taxonomy Database: Updated BBTools taxonomy files are essential

Downstream Applications

comparesketch.sh: Compare query sequences against generated sketches
sendsketch.sh: Submit queries to remote sketch databases
Taxonomic Classification: Rapid identification of unknown sequences
Contamination Detection: Identify non-target organisms in sequencing data
Quality Control: Verify expected taxonomic composition

Related Pipelines

fetchRefSeq.sh: Downloads and preprocesses RefSeq data
renameRefSeq.sh: Adds taxonomic information to sequence headers
updateTaxonomy.sh: Updates BBTools taxonomy database

Performance Characteristics

Runtime Estimates

Sorting Phase: 10-20 hours depending on RefSeq size and system performance
Blacklist Generation: 5-10 hours for comprehensive k-mer frequency analysis
Sketch Creation: 8-15 hours for complete taxonomic sketch generation
Total Runtime: 25-45 hours for complete pipeline execution

Resource Utilization

CPU: High utilization during compression/decompression phases
Memory: Peak usage during sorting phase (up to 100GB)
I/O: Intensive during file reading/writing phases
Storage: Multiple large temporary files require substantial disk space

Scalability Considerations

RefSeq Growth: Pipeline scales with RefSeq database size
Memory Scaling: May require memory increases for future RefSeq versions
File Distribution: Can increase file count for better parallelization
Network Storage: Consider I/O patterns for network filesystem deployments

Output Validation

Quality Checks

                # Verify sketch file creation
ls -la taxa*.sketch

# Check blacklist effectiveness
wc -l blacklist_refseq_genus_140.sketch

# Test sketch functionality
comparesketch.sh in=test_sequence.fa taxa0.sketch
            

Expected Results

Sketch Files: 31 files (taxa0.sketch through taxa30.sketch) with non-zero sizes
Blacklist: Substantial blacklist file indicating effective k-mer filtering
Sorted Data: sorted.fa.gz should be similar size to input renamed.fa.gz
Log Files: No critical errors in SLURM output logs

Validation Commands

                # Count sketches in each file
for f in taxa*.sketch; do
    echo "$f: $(sketch.sh in=$f list=t | wc -l) sketches"
done

# Test blacklist application
bbsketch.sh in=test.fa blacklist=blacklist_refseq_genus_140.sketch
            

Configuration and Customization

Custom Taxonomy Path

For installations outside NERSC, modify the taxonomy path:

TAXPATH="/custom/path/to/taxonomy/"

Memory Optimization

For systems with limited memory, reduce memory allocations proportionally:

                # For 32GB system, use roughly 1/3 of the standard allocations
sortbyname.sh -Xmx30g ...
sketchblacklist.sh -Xmx20g ...
bbsketch.sh -Xmx20g ...
            

Alternative Blacklist Thresholds

Adjust the blacklist threshold based on your specificity requirements:

mincount=100: More permissive, retains more k-mers
mincount=200: More stringent, removes more common k-mers
mincount=140: Default balanced threshold

File Distribution Options

Adjust the number of output files based on your system:

files=15: Fewer files for smaller systems
files=62: More files for larger parallel systems
files=31: Default balanced approach

Notes and Best Practices

Preprocessing Requirements

Update Taxonomy: Always ensure taxonomy database is current before running
Sequence Quality: Input sequences should be quality-filtered and properly named
Header Format: RefSeq headers must include taxonomic information for proper classification
File Preparation: Ensure renamed.fa.gz is properly formatted and accessible

System Considerations

Storage Space: Pipeline generates multiple large files; ensure adequate disk space
I/O Performance: Use high-performance storage for optimal throughput
Network Considerations: For network storage, consider I/O patterns and bandwidth
Backup Strategy: Consider backing up final sketch files before deletion of intermediates

Quality Assurance

Validation Testing: Test generated sketches with known sequences
Comparative Analysis: Compare results with previous sketch versions
Coverage Verification: Ensure taxonomic coverage meets expectations
Performance Testing: Benchmark query performance against generated sketches

Maintenance and Updates

Regular Updates: Regenerate sketches when new RefSeq versions are released
Taxonomy Synchronization: Keep taxonomy database synchronized with RefSeq versions
Archive Management: Archive old sketch versions for reproducibility
Documentation Updates: Update parameter documentation for customizations