SSU Covering Set Generator
Specialized pipeline for generating a near-minimal kmer covering set from Silva SSU (Small Subunit) ribosomal RNA sequences. The result is an optimized reference set in which every input SSU sequence contains at least one kmer from the covering set, enabling efficient taxonomic classification and database queries.
Overview
This pipeline is designed for processing Silva SSU ribosomal RNA databases into a near-minimal covering set of 31-mers. The covering set guarantees that every SSU sequence in the database contains at least one kmer from the set, which dramatically speeds up taxonomic classification and sequence matching operations.
The pipeline uses a greedy algorithm: it first generates an initial covering set from the complete sequences, then shreds the sequences into overlapping fragments to fill coverage gaps, ensuring comprehensive representation of all SSU diversity.
Prerequisites
System Requirements
- BBTools suite installed
- At least 8GB RAM (the pipeline's commands request -Xmx8g)
- Sufficient disk space for temporary files and shredded sequences
Input Requirements
- ssu_deduped100pct.fa.gz - Silva SSU database deduplicated at 100% identity
- Input sequences should be high-quality SSU ribosomal RNA sequences
- Sequences should be properly formatted with Silva-style headers
Pipeline Stages
Stage 1: Initial Covering Set Generation
kmerfilterset.sh in=ssu_deduped100pct.fa.gz k=31 rcomp=f out=ssu_covering_31mers.fa maxkpp=1 -Xmx8g
Purpose: Generates an initial minimal covering set of 31-mers from the complete SSU database.
Key Parameters:
- k=31 - Uses 31-mer kmers for optimal specificity/sensitivity balance
- rcomp=f - Disables reverse-complement consideration (forward strand only)
- maxkpp=1 - Retains only 1 kmer per pass for minimal set size
- -Xmx8g - Allocates 8GB memory for processing large databases
Algorithm: Uses a greedy approach: each pass selects the most frequent kmer among the remaining sequences (with maxkpp=1, a single kmer per pass), removes every sequence containing it, and repeats until all sequences are covered. This creates a compact initial covering set.
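As a rough illustration, the pass structure can be sketched in Python (a simplified in-memory model of the greedy loop, not the actual KmerFilterSetMaker.java implementation):

```python
def kmers(seq, k=31):
    """All forward-strand k-mers of a sequence (the rcomp=f case)."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def greedy_covering_set(seqs, k=31):
    """Greedy set cover: each pass keeps the single most frequent kmer
    (maxkpp=1) and drops every sequence that contains it."""
    remaining = [kmers(s, k) for s in seqs if len(s) >= k]
    covering = []
    while remaining:
        counts = {}
        for ks in remaining:
            for km in ks:
                counts[km] = counts.get(km, 0) + 1
        best = max(counts, key=counts.get)   # most frequent kmer this pass
        covering.append(best)
        # progressive filtering: covered sequences leave the pool
        remaining = [ks for ks in remaining if best not in ks]
    return covering
```

Each pass removes at least one sequence (the selected kmer occurs in at least one), so the loop always terminates with every input sequence covered.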
Stage 2: Sequence Padding
reformat.sh in=ssu_deduped100pct.fa.gz out=ssu_deduped100pct_padded.fa.gz padleft=31 padright=31 ow
Purpose: Adds 31 bases of padding to both ends of each sequence to ensure complete kmer coverage at sequence boundaries.
Key Parameters:
- padleft=31 - Adds 31 Ns to the left end of each sequence
- padright=31 - Adds 31 Ns to the right end of each sequence
- ow - Overwrites output file if it exists
Rationale: Padding ensures that kmers can be generated from the very ends of sequences, preventing coverage gaps at sequence boundaries that could reduce the effectiveness of the covering set.
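A minimal sketch of the padding step (illustrative Python, mirroring the padleft=31/padright=31 behavior described above with N as the pad symbol):

```python
def pad(seq, k=31, symbol="N"):
    """Add k pad symbols to each end of the sequence,
    as reformat.sh does with padleft=31 padright=31."""
    return symbol * k + seq + symbol * k

s = "ACGT" * 20                  # 80 bp toy sequence
p = pad(s)
assert len(p) == len(s) + 62
# The first and last real bases now sit 31 positions from the padded
# ends, so the shreds produced in Stage 3 place them mid-fragment
# rather than flush against a fragment boundary.
```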
Stage 3: Sequence Shredding
shred.sh in=ssu_deduped100pct_padded.fa.gz length=150 minlength=62 overlap=30 out=shredsSsu.fa.gz ow
Purpose: Breaks the padded sequences into overlapping fragments to create additional kmer combinations and improve covering set completeness.
Key Parameters:
- length=150 - Target length for each shredded fragment
- minlength=62 - Minimum acceptable fragment length (ensures at least 32 kmers per fragment)
- overlap=30 - 30-base overlap between consecutive fragments
Strategy: The overlapping shreds create additional kmer contexts that may not be present in the original full-length sequences, helping to identify kmers that can distinguish between closely related SSU sequences.
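One plausible reading of these shredding parameters, sketched in Python (illustrative only; shred.sh may place fragment boundaries differently):

```python
def shred(seq, length=150, minlength=62, overlap=30):
    """Cut a sequence into overlapping fragments: each fragment starts
    (length - overlap) bases after the previous one; trailing fragments
    shorter than minlength are discarded."""
    step = length - overlap              # 120 bp advance per fragment
    frags = []
    for start in range(0, max(len(seq) - overlap, 1), step):
        frag = seq[start:start + length]
        if len(frag) >= minlength:       # keep only fragments >= 62 bp
            frags.append(frag)
    return frags
```

With minlength=62 and k=31, every kept fragment yields at least 62 - 31 + 1 = 32 forward 31-mers, matching the parenthetical note above.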
Stage 4: Comprehensive Covering Set Generation
time kmerfilterset.sh in=shredsSsu.fa.gz initial=ssu_covering_31mers.fa k=31 rcomp=f out=ssu_shred_covering_31mers.fa maxkpp=1 maxpasses=200000 fastawrap=99999 -Xmx8g ow
Purpose: Creates the final comprehensive covering set by processing the shredded sequences and incorporating the initial covering set.
Key Parameters:
- initial=ssu_covering_31mers.fa - Uses Stage 1 output as starting point
- maxpasses=200000 - Allows extensive iteration for thorough coverage
- fastawrap=99999 - Prevents line wrapping in output for processing efficiency
- time - Shell prefix (not a tool parameter) that reports the stage's runtime for performance monitoring
Algorithm: Iteratively processes shredded sequences using the greedy covering set algorithm, starting with the initial covering set and adding kmers until every shredded fragment contains at least one kmer from the final set.
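The seeded variant differs from Stage 1 only in starting from an existing kmer set: shreds already hit by the initial set drop out before any new passes run. A sketch (an illustrative Python model, not the BBTools implementation; the initial kmers are assumed to be loaded as a plain set):

```python
def kmers(seq, k=31):
    """All forward-strand k-mers of a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def seeded_covering_set(shreds, initial, k=31):
    """Greedy covering passes seeded with an existing kmer set
    (the initial= parameter): already-covered shreds are dropped up
    front, then one kmer is added per pass until all are covered."""
    initial = set(initial)
    remaining = [ks for ks in (kmers(s, k) for s in shreds if len(s) >= k)
                 if not (ks & initial)]       # drop already-covered shreds
    covering = list(initial)
    while remaining:
        counts = {}
        for ks in remaining:
            for km in ks:
                counts[km] = counts.get(km, 0) + 1
        best = max(counts, key=counts.get)
        covering.append(best)
        remaining = [ks for ks in remaining if best not in ks]
    return covering
```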
Algorithm Details
The SSU covering set generation employs a two-phase approach tailored to ribosomal RNA diversity:
Greedy Covering Set Algorithm
The core algorithm uses a greedy approach implemented in KmerFilterSetMaker.java:
- Pass-based Processing: Each pass identifies the most frequent kmers in the remaining sequences
- Minimal Retention: maxkpp=1 ensures only the single most effective kmer is retained per pass
- Progressive Filtering: Sequences containing selected kmers are removed from subsequent passes
- Coverage Guarantee: Algorithm continues until every input sequence contains at least one selected kmer
Two-Phase Strategy
- Full-Length Phase: Generates initial covering set from complete SSU sequences, capturing major taxonomic groups
- Fragment Phase: Processes overlapping shreds to identify kmers distinguishing closely related sequences
SSU-Specific Optimizations
- k=31: 31-mer length provides optimal balance between specificity and computational efficiency for SSU sequences
- Forward Strand Only: rcomp=f prevents redundant coverage since SSU databases are standardized to forward orientation
- Boundary Padding: 31-base padding ensures complete kmer coverage at sequence ends
- Overlap Strategy: 30-base overlaps with 150bp fragments create comprehensive kmer contexts
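The interplay of these parameters can be checked with simple arithmetic. In particular, the 30-base overlap equals k - 1, so every 31-mer of the original sequence falls entirely within at least one shred (ignoring any trailing fragment discarded by minlength):

```python
k = 31
length, minlength, overlap = 150, 62, 30

# A fragment of length L yields L - k + 1 forward k-mers.
assert length - k + 1 == 120      # full-length 150 bp shred: 120 kmers
assert minlength - k + 1 == 32    # shortest kept shred: 32 kmers

# Consecutive shreds advance length - overlap = 120 bases.
assert length - overlap == 120

# overlap = k - 1: a kmer starting at any position of the original
# sequence fits entirely inside some fragment, so no kmer contexts
# are lost at fragment boundaries.
assert overlap == k - 1
```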
Memory and Performance
- Memory Scaling: 8GB allocation handles large Silva databases efficiently
- Pass Limitation: maxpasses=200000 prevents infinite loops while allowing thorough processing
- Kmer Selection: Greedy selection minimizes set size while maintaining coverage guarantees
Basic Usage
# 1. Ensure input file is present
ls -la ssu_deduped100pct.fa.gz
# 2. Run the covering set generation pipeline
bash pipelines/silva/makeCoveringSetSsu.sh
# 3. Verify outputs
ls -la ssu_covering_31mers.fa ssu_shred_covering_31mers.fa
Expected Inputs and Outputs
Input Files
- ssu_deduped100pct.fa.gz - Silva SSU database deduplicated at 100% identity (required)
Output Files
- ssu_covering_31mers.fa - Initial covering set from full-length sequences
- ssu_deduped100pct_padded.fa.gz - Padded sequences (intermediate file)
- shredsSsu.fa.gz - Shredded overlapping fragments (intermediate file)
- ssu_shred_covering_31mers.fa - Final comprehensive covering set
Temporary Files
The pipeline generates temporary files during kmerfilterset processing. These are automatically cleaned up upon completion.
Performance Characteristics
Memory Usage
- Base Requirement: 8GB RAM allocation for kmer table construction
- Peak Usage: Memory usage scales with database size and kmer diversity
- Optimization: Forward-strand-only processing reduces memory footprint by ~50%
Execution Time
- Stage 1: Initial covering set generation - typically fastest phase
- Stage 2: Sequence padding - very fast reformatting operation
- Stage 3: Sequence shredding - moderate time, creates many fragments
- Stage 4: Comprehensive covering set - longest phase due to maxpasses=200000
Output Size Optimization
- Greedy Selection: maxkpp=1 creates near-minimal covering sets
- Two-Phase Approach: Balances coverage completeness with set size
- Further Compression: Output can be compressed using kcompress.sh for additional space savings
Applications
Taxonomic Classification
The covering set enables rapid taxonomic classification of unknown SSU sequences by guaranteeing that every known SSU sequence will have at least one kmer match in the covering set.
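As a sketch of how such screening could work (illustrative Python, not a BBTools API):

```python
def matches_covering_set(query, covering, k=31):
    """True if any forward k-mer of the query is in the covering set.
    Every sequence used to build the set is guaranteed to match;
    unrelated sequences usually will not."""
    covering = set(covering)
    return any(query[i:i + k] in covering
               for i in range(len(query) - k + 1))
```

A negative result is a strong signal that the query is not a known SSU sequence; a positive result warrants more detailed alignment-based follow-up.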
Database Indexing
Covering sets can be used as compact indices for large SSU databases, allowing fast preliminary screening before more detailed alignment-based analysis.
Quality Control
The covering set can identify sequences that may be chimeric or low-quality by checking for expected kmer representation patterns in SSU data.
Technical Details
Kmer Selection Strategy
- Frequency-Based Selection: Algorithm prioritizes highly frequent kmers that cover many sequences
- Iterative Refinement: Each pass removes covered sequences, focusing on previously uncovered diversity
- Minimal Redundancy: Single kmer per pass (maxkpp=1) prevents set inflation
Coverage Guarantee
- Complete Coverage: Every input sequence guaranteed to contain ≥1 kmer from final set
- Boundary Handling: Sequence padding ensures kmers can be extracted from sequence ends
- Fragment Coverage: Shredding creates overlapping contexts for comprehensive kmer representation
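The coverage guarantee can be spot-checked after a run with a short script. This sketch assumes plain (optionally gzipped) FASTA inputs and uses the pipeline's file names as an example; the FASTA parsing is deliberately minimal:

```python
import gzip

def read_fasta(path):
    """Minimal FASTA reader (handles .gz); yields sequences only."""
    opener = gzip.open if path.endswith(".gz") else open
    seq = []
    with opener(path, "rt") as fh:
        for line in fh:
            line = line.strip()
            if line.startswith(">"):
                if seq:
                    yield "".join(seq)
                    seq = []
            elif line:
                seq.append(line)
    if seq:
        yield "".join(seq)

def uncovered(db_path, covering_path, k=31):
    """Count database sequences containing no covering kmer;
    0 confirms the coverage guarantee."""
    covering = set(read_fasta(covering_path))
    misses = 0
    for seq in read_fasta(db_path):
        if not any(seq[i:i + k] in covering
                   for i in range(len(seq) - k + 1)):
            misses += 1
    return misses

# Example invocation against the pipeline's outputs:
# uncovered("ssu_deduped100pct.fa.gz", "ssu_shred_covering_31mers.fa")
```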
SSU-Specific Considerations
- Directional Processing: rcomp=f assumes Silva sequences are in standard forward orientation
- Variable Regions: Shredding captures kmers from both conserved and variable SSU regions
- Phylogenetic Diversity: Two-phase approach ensures representation across SSU phylogenetic breadth
Workflow Integration
Silva Database Preparation
This pipeline is typically used as part of Silva database processing workflows:
- Silva database download and formatting
- Sequence deduplication at 100% identity
- Covering set generation (this pipeline)
- Database indexing and deployment
Downstream Applications
- SendSketch: Covering sets accelerate taxonomic sketching operations
- BBSketch: Can use covering sets for rapid similarity estimation
- Taxonomic Servers: Covering sets reduce memory requirements for SSU classification services
Notes and Tips
- The pipeline requires the specific input filename "ssu_deduped100pct.fa.gz" to be present in the working directory
- Stage 4 is timed to monitor performance - large databases may require substantial processing time
- The maxpasses=200000 setting allows for very thorough processing but may not be needed for smaller databases
- Output files can be further compressed using kcompress.sh to reduce storage requirements
- The covering set quality depends on the input database quality - ensure SSU sequences are properly curated
- Memory requirements scale with database size; increase -Xmx values for larger datasets if needed
Related Tools
- kmerfilterset.sh - Core tool for generating covering sets from any sequence collection
- reformat.sh - Sequence formatting and padding utility
- shred.sh - Sequence fragmentation tool for creating overlapping subsequences
- filtersilva.sh - Silva database cleaning and filtering
- kcompress.sh - Kmer set compression for storage optimization
Troubleshooting
Memory Issues
- Increase -Xmx setting if out-of-memory errors occur
- Consider processing smaller subsets for very large databases
- Monitor system memory usage during Stage 4 processing
Performance Issues
- Stage 4 may take significant time for large databases - this is expected
- Use the timing output to estimate total runtime
- Consider reducing maxpasses for faster processing with potentially larger covering sets
File Issues
- Ensure ssu_deduped100pct.fa.gz is properly formatted and accessible
- Check disk space for temporary files and output
- Verify write permissions in the working directory