RefSeq Sketch Generation Pipeline
A comprehensive pipeline for generating taxonomic sketches from RefSeq genomic data. This script processes RefSeq sequences through taxonomic sorting, k-mer blacklist generation, and creates optimized sketch databases for rapid taxonomic classification and similarity analysis.
Overview
This pipeline transforms RefSeq genomic data into efficient sketch-based reference databases for taxonomic analysis. The process includes taxonomic sorting for memory optimization, k-mer blacklist generation to remove uninformative sequences, and creation of taxonomically-organized sketch files that enable rapid sequence comparison and classification.
Prerequisites
System Requirements
- BBTools suite - Complete installation with sketch tools
- High memory allocation - Pipeline requires up to 100GB RAM for sorting phase
- pigz - Parallel compression utility for efficient file handling
- Updated taxonomy database - Essential for proper taxonomic classification
- Adequate storage - Multiple large intermediate files are generated
Input Requirements
- renamed.fa.gz - RefSeq sequences renamed with taxonomic information
- Taxonomy database - BBTools taxonomy files (auto-detected or custom path)
- Minimum sequence length - Sequences must be ≥60bp for processing
Environment Configuration
TAXPATH="auto"
The pipeline uses automatic taxonomy path detection. For custom installations, modify this to your taxonomy directory path (e.g., TAXPATH="/path/to/taxonomy_directory/").
Pipeline Stages
Stage 1: Taxonomic Sorting
sortbyname.sh -Xmx100g in=renamed.fa.gz memmult=0.33 out=sorted.fa.gz zl=9 pigz=64 bgzip taxa tree=auto gi=ignore fastawrap=1023 minlen=60 readbufferlen=2 readbuffers=1 taxpath=$TAXPATH
Purpose: Sorts RefSeq sequences by taxonomic classification to optimize memory usage during subsequent sketch generation.
Key Parameters:
- -Xmx100g - Allocates 100GB memory for sorting large RefSeq datasets
- memmult=0.33 - Uses 33% of allocated memory for temporary sorting operations
- zl=9 - Maximum compression level for output file
- pigz=64 - Uses 64 threads for parallel compression
- bgzip - Uses bgzip compression format for block-based access
- taxa - Enables taxonomic sorting mode
- fastawrap=1023 - Sets FASTA line width to 1023 characters
- minlen=60 - Excludes sequences shorter than 60 bases
- gi=ignore - Ignores GI numbers in sequence headers
Memory Troubleshooting:
If sorting runs out of memory, use the alternative merging command:
mergesorted.sh -Xmx100g sort_temp* bgzip unbgzip zl=8 out=sorted.fa.gz taxa tree=auto fastawrap=1023 minlen=60 readbufferlen=2 readbuffers=1 ow taxpath=$TAXPATH
Or add the maxfiles=4
parameter to the original sortbyname command to limit concurrent file operations.
Stage 2: K-mer Blacklist Generation
sketchblacklist.sh -Xmx63g in=sorted.fa.gz prepasses=1 tree=auto taxa taxlevel=species ow out=blacklist_refseq_genus_140.sketch mincount=140 k=32,24 sizemult=2 taxpath=$TAXPATH
Purpose: Creates a blacklist of k-mers that occur in 140 or more different genera, removing uninformative k-mers that appear too frequently across diverse taxa.
Key Parameters:
- -Xmx63g - Allocates 63GB memory for k-mer frequency analysis
- prepasses=1 - Single preprocessing pass for k-mer counting
- taxlevel=species - Performs analysis at species taxonomic level
- mincount=140 - K-mers must appear in at least 140 different genera to be blacklisted
- k=32,24 - Uses dual k-mer sizes (32bp and 24bp) for robust analysis
- sizemult=2 - Doubles the default sketch size for higher sensitivity
- ow - Overwrites existing output files
Algorithm Details:
The blacklist generation uses a two-phase approach: first counting k-mer occurrences across taxonomic groups, then identifying k-mers that appear too frequently to be taxonomically informative. This removes universal or highly conserved k-mers that would interfere with accurate taxonomic classification.
Stage 3: Taxonomic Sketch Generation
bbsketch.sh -Xmx63g in=sorted.fa.gz out=taxa#.sketch mode=taxa tree=auto accession=null gi=null files=31 ow unpigz minsize=400 prefilter autosize blacklist=blacklist_refseq_genus_140.sketch k=32,24 depth sizemult=2 taxpath=$TAXPATH
Purpose: Generates 31 separate sketch files, with one sketch per species, optimized for rapid taxonomic classification of query sequences.
Key Parameters:
- out=taxa#.sketch - Creates numbered sketch files (taxa0.sketch, taxa1.sketch, etc.)
- mode=taxa - Generates one sketch per taxonomic unit (species level)
- files=31 - Distributes sketches across 31 output files for parallel processing
- accession=null, gi=null - Disables accession and GI number parsing for cleaner taxonomy
- minsize=400 - Minimum sketch size threshold
- prefilter - Enables prefiltering for memory efficiency
- autosize - Automatically adjusts sketch sizes based on genome characteristics
- depth - Includes k-mer depth information in sketches
- unpigz - Uses parallel decompression for input processing
Multi-file Strategy:
The 31-file distribution enables parallel sketch loading and comparison. Each file contains a subset of species-level sketches, allowing the comparison tools to load and process subsets independently, significantly reducing memory requirements and enabling parallel processing.
Usage Examples
Basic Pipeline Execution
# Run the complete pipeline (ensure renamed.fa.gz exists)
bash sketchRefSeq.sh
Query Against Generated Sketches
# Compare contigs against the new reference sketches
comparesketch.sh in=contigs.fa k=32,24 tree=auto taxa*.sketch blacklist=blacklist_refseq_genus_140 taxpath=$TAXPATH
Using Default RefSeq Database (NERSC)
# On NERSC systems with default paths configured
comparesketch.sh in=contigs.fa refseq tree=auto
This command automatically uses the default RefSeq sketch path, blacklist, and k-mer values configured for the system.
Memory Management Examples
# If sortbyname runs out of memory, use limited file approach
sortbyname.sh -Xmx100g in=renamed.fa.gz maxfiles=4 out=sorted.fa.gz ...
# Alternative: merge pre-sorted temporary files
mergesorted.sh -Xmx100g sort_temp* out=sorted.fa.gz ...
Output Files
Primary Outputs
- sorted.fa.gz - Taxonomically sorted RefSeq sequences
- blacklist_refseq_genus_140.sketch - K-mer blacklist for filtering uninformative sequences
- taxa0.sketch through taxa30.sketch - 31 sketch files containing species-level genomic sketches
Intermediate Files
- sort_temp* - Temporary files created during sorting (can be deleted after successful completion)
- log_*.out, log_*.err - SLURM job output and error logs
File Characteristics
- Compression - All outputs use high-level compression (zl=9) for storage efficiency
- Format - Sketch files use BBTools binary format optimized for rapid loading
- Taxonomy Integration - All sketches include taxonomic metadata for classification
Algorithm Details
Taxonomic Sorting Strategy
The pipeline sorts sequences by taxonomy to enable memory-efficient sketch generation. When sketches are organized taxonomically, they can be written to disk immediately after creation rather than being held in memory. This approach dramatically reduces memory requirements for large-scale RefSeq processing.
Blacklist Generation Algorithm
The blacklist identifies k-mers that occur in 140+ different genera. These k-mers are typically:
- Universal sequences - Highly conserved regions present across many organisms
- Low-complexity regions - Repetitive or low-information content sequences
- Contamination artifacts - Sequences that may represent technical artifacts
Removing these k-mers improves taxonomic specificity and reduces false positive matches during classification.
Multi-K Strategy
The pipeline uses dual k-mer sizes (k=32,24) to balance sensitivity and specificity:
- k=32 - Higher specificity for closely related species discrimination
- k=24 - Higher sensitivity for detecting more distant relationships
This dual-k approach enables robust taxonomic classification across varying evolutionary distances.
File Distribution Strategy
The 31-file output distribution provides several advantages:
- Parallel Loading - Comparison tools can load subsets independently
- Memory Management - Reduces peak memory usage during sketch comparison
- Scalability - Enables processing on systems with limited memory
- Load Balancing - Distributes computational load across file subsets
System Configuration
SLURM Parameters
The pipeline includes SLURM job scheduler configuration for high-performance computing environments:
- Job Name: sketch_refseq
- Queue: genepool
- Account: gtrqc
- Time Limit: 71 hours
- Architecture: Haswell processors
- Exclusive Access: Entire node allocation
Memory Requirements
- Sorting Phase: Up to 100GB RAM (with memmult=0.33 using ~33GB actively)
- Blacklist Generation: 63GB RAM for k-mer frequency analysis
- Sketch Creation: 63GB RAM for taxonomic sketch generation
Performance Optimizations
- Parallel Compression: pigz=64 uses 64 threads for compression/decompression
- Read Buffering: Optimized I/O with readbufferlen=2, readbuffers=1
- Taxonomic Sorting: Enables immediate sketch writing, reducing memory usage
- Prefiltering: Available for memory-constrained environments
Troubleshooting
Memory Issues
Sorting Stage
If sortbyname.sh runs out of memory:
- Increase -Xmx flag: Try -Xmx150g or higher if available
- Add maxfiles parameter: Use maxfiles=4 to limit concurrent file operations
- Use mergesorted fallback: Run the provided mergesorted.sh command to merge temporary files
Sketch Generation
If bbsketch.sh runs out of memory:
- Enable prefilter: Add prefilter=1 or prefilter=2 for additional memory efficiency
- Reduce sizemult: Lower sizemult from 2 to 1 for smaller sketches
- Increase file count: Use more output files (files=62 instead of files=31)
Taxonomy Issues
- Outdated taxonomy: Update BBTools taxonomy database before running
- Custom taxonomy path: Modify TAXPATH variable for non-standard installations
- Missing GI mappings: Ensure renamed.fa.gz includes proper taxonomic headers
Integration with BBTools Ecosystem
Upstream Dependencies
- RefSeq Download: Requires preprocessed RefSeq data with taxonomic headers
- Sequence Renaming: Input must be renamed with taxonomic information
- Taxonomy Database: Updated BBTools taxonomy files are essential
Downstream Applications
- comparesketch.sh: Compare query sequences against generated sketches
- sendsketch.sh: Submit queries to remote sketch databases
- Taxonomic Classification: Rapid identification of unknown sequences
- Contamination Detection: Identify non-target organisms in sequencing data
- Quality Control: Verify expected taxonomic composition
Related Pipelines
- fetchRefSeq.sh: Downloads and preprocesses RefSeq data
- renameRefSeq.sh: Adds taxonomic information to sequence headers
- updateTaxonomy.sh: Updates BBTools taxonomy database
Performance Characteristics
Runtime Estimates
- Sorting Phase: 10-20 hours depending on RefSeq size and system performance
- Blacklist Generation: 5-10 hours for comprehensive k-mer frequency analysis
- Sketch Creation: 8-15 hours for complete taxonomic sketch generation
- Total Runtime: 25-45 hours for complete pipeline execution
Resource Utilization
- CPU: High utilization during compression/decompression phases
- Memory: Peak usage during sorting phase (up to 100GB)
- I/O: Intensive during file reading/writing phases
- Storage: Multiple large temporary files require substantial disk space
Scalability Considerations
- RefSeq Growth: Pipeline scales with RefSeq database size
- Memory Scaling: May require memory increases for future RefSeq versions
- File Distribution: Can increase file count for better parallelization
- Network Storage: Consider I/O patterns for network filesystem deployments
Output Validation
Quality Checks
# Verify sketch file creation
ls -la taxa*.sketch
# Check blacklist effectiveness
wc -l blacklist_refseq_genus_140.sketch
# Test sketch functionality
comparesketch.sh in=test_sequence.fa taxa0.sketch
Expected Results
- Sketch Files: 31 files (taxa0.sketch through taxa30.sketch) with non-zero sizes
- Blacklist: Substantial blacklist file indicating effective k-mer filtering
- Sorted Data: sorted.fa.gz should be similar size to input renamed.fa.gz
- Log Files: No critical errors in SLURM output logs
Validation Commands
# Count sketches in each file
for f in taxa*.sketch; do
echo "$f: $(sketch.sh in=$f list=t | wc -l) sketches"
done
# Test blacklist application
bbsketch.sh in=test.fa blacklist=blacklist_refseq_genus_140.sketch
Configuration and Customization
Custom Taxonomy Path
For installations outside NERSC, modify the taxonomy path:
TAXPATH="/custom/path/to/taxonomy/"
Memory Optimization
For systems with limited memory, reduce memory allocations proportionally:
# For 32GB system, use roughly 1/3 of the standard allocations
sortbyname.sh -Xmx30g ...
sketchblacklist.sh -Xmx20g ...
bbsketch.sh -Xmx20g ...
Alternative Blacklist Thresholds
Adjust the blacklist threshold based on your specificity requirements:
- mincount=100: More permissive, retains more k-mers
- mincount=200: More stringent, removes more common k-mers
- mincount=140: Default balanced threshold
File Distribution Options
Adjust the number of output files based on your system:
- files=15: Fewer files for smaller systems
- files=62: More files for larger parallel systems
- files=31: Default balanced approach
Notes and Best Practices
Preprocessing Requirements
- Update Taxonomy: Always ensure taxonomy database is current before running
- Sequence Quality: Input sequences should be quality-filtered and properly named
- Header Format: RefSeq headers must include taxonomic information for proper classification
- File Preparation: Ensure renamed.fa.gz is properly formatted and accessible
System Considerations
- Storage Space: Pipeline generates multiple large files; ensure adequate disk space
- I/O Performance: Use high-performance storage for optimal throughput
- Network Considerations: For network storage, consider I/O patterns and bandwidth
- Backup Strategy: Consider backing up final sketch files before deletion of intermediates
Quality Assurance
- Validation Testing: Test generated sketches with known sequences
- Comparative Analysis: Compare results with previous sketch versions
- Coverage Verification: Ensure taxonomic coverage meets expectations
- Performance Testing: Benchmark query performance against generated sketches
Maintenance and Updates
- Regular Updates: Regenerate sketches when new RefSeq versions are released
- Taxonomy Synchronization: Keep taxonomy database synchronized with RefSeq versions
- Archive Management: Archive old sketch versions for reproducibility
- Documentation Updates: Update parameter documentation for customizations