SSU Covering Set Generator
Specialized pipeline for generating a near-minimal kmer covering set from Silva SSU (Small Subunit) ribosomal RNA sequences. The result is an optimized reference set in which every input SSU sequence contains at least one kmer from the covering set, enabling efficient taxonomic classification and database queries.
Overview
This pipeline is designed for processing Silva SSU ribosomal RNA databases into a near-minimal covering set of 31-mers. The covering set guarantees that every SSU sequence in the database contains at least one kmer from the set, which dramatically speeds up taxonomic classification and sequence matching operations.
The pipeline uses a greedy algorithm: it first generates an initial covering set from the complete sequences, then shreds the sequences into overlapping fragments to fill coverage gaps, ensuring comprehensive representation of all SSU diversity.
Prerequisites
System Requirements
- BBTools suite installed
- At least 8GB RAM (the pipeline's commands request -Xmx8g)
- Sufficient disk space for temporary files and shredded sequences
Input Requirements
- ssu_deduped100pct.fa.gz - Silva SSU database deduplicated at 100% identity
- Input sequences should be high-quality SSU ribosomal RNA sequences
- Sequences should be properly formatted with Silva-style headers
Pipeline Stages
Stage 1: Initial Covering Set Generation
kmerfilterset.sh in=ssu_deduped100pct.fa.gz k=31 rcomp=f out=ssu_covering_31mers.fa maxkpp=1 -Xmx8g
Purpose: Generates an initial minimal covering set of 31-mers from the complete SSU database.
Key Parameters:
- k=31 - Uses 31-mer kmers for optimal specificity/sensitivity balance
- rcomp=f - Disables reverse-complement consideration (forward strand only)
- maxkpp=1 - Retains only 1 kmer per pass for minimal set size
- -Xmx8g - Allocates 8GB memory for processing large databases
Algorithm: Uses a greedy approach: each pass selects the most frequent kmer among the remaining sequences (with maxkpp=1, a single kmer per pass), removes every sequence containing it, and repeats until all sequences are covered. This creates a compact initial covering set.
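As a rough illustration, the pass structure can be sketched in Python (a simplified in-memory model of the greedy loop, not the actual KmerFilterSetMaker.java implementation):

```python
def kmers(seq, k=31):
    """All forward-strand k-mers of a sequence (the rcomp=f case)."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def greedy_covering_set(seqs, k=31):
    """Greedy set cover: each pass keeps the single most frequent kmer
    (maxkpp=1) and drops every sequence that contains it."""
    remaining = [kmers(s, k) for s in seqs if len(s) >= k]
    covering = []
    while remaining:
        counts = {}
        for ks in remaining:
            for km in ks:
                counts[km] = counts.get(km, 0) + 1
        best = max(counts, key=counts.get)   # most frequent kmer this pass
        covering.append(best)
        # progressive filtering: covered sequences leave the pool
        remaining = [ks for ks in remaining if best not in ks]
    return covering
```

Each pass removes at least one sequence (the selected kmer occurs in at least one), so the loop always terminates with every input sequence covered.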
Stage 2: Sequence Padding
reformat.sh in=ssu_deduped100pct.fa.gz out=ssu_deduped100pct_padded.fa.gz padleft=31 padright=31 ow
Purpose: Adds 31 bases of padding to both ends of each sequence to ensure complete kmer coverage at sequence boundaries.
Key Parameters:
- padleft=31 - Adds 31 Ns to the left end of each sequence
- padright=31 - Adds 31 Ns to the right end of each sequence
- ow - Overwrites output file if it exists
Rationale: Padding ensures that kmers can be generated from the very ends of sequences, preventing coverage gaps at sequence boundaries that could reduce the effectiveness of the covering set.
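A minimal sketch of the padding step (illustrative Python, mirroring the padleft=31/padright=31 behavior described above with N as the pad symbol):

```python
def pad(seq, k=31, symbol="N"):
    """Add k pad symbols to each end of the sequence,
    as reformat.sh does with padleft=31 padright=31."""
    return symbol * k + seq + symbol * k

s = "ACGT" * 20                  # 80 bp toy sequence
p = pad(s)
assert len(p) == len(s) + 62
# The first and last real bases now sit 31 positions from the padded
# ends, so the shreds produced in Stage 3 place them mid-fragment
# rather than flush against a fragment boundary.
```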
Stage 3: Sequence Shredding
shred.sh in=ssu_deduped100pct_padded.fa.gz length=150 minlength=62 overlap=30 out=shredsSsu.fa.gz ow
Purpose: Breaks the padded sequences into overlapping fragments to create additional kmer combinations and improve covering set completeness.
Key Parameters:
- length=150 - Target length for each shredded fragment
- minlength=62 - Minimum acceptable fragment length (ensures at least 32 kmers per fragment)
- overlap=30 - 30-base overlap between consecutive fragments
Strategy: The overlapping shreds create additional kmer contexts that may not be present in the original full-length sequences, helping to identify kmers that can distinguish between closely related SSU sequences.
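One plausible reading of these shredding parameters, sketched in Python (illustrative only; shred.sh may place fragment boundaries differently):

```python
def shred(seq, length=150, minlength=62, overlap=30):
    """Cut a sequence into overlapping fragments: each fragment starts
    (length - overlap) bases after the previous one; trailing fragments
    shorter than minlength are discarded."""
    step = length - overlap              # 120 bp advance per fragment
    frags = []
    for start in range(0, max(len(seq) - overlap, 1), step):
        frag = seq[start:start + length]
        if len(frag) >= minlength:       # keep only fragments >= 62 bp
            frags.append(frag)
    return frags
```

With minlength=62 and k=31, every kept fragment yields at least 62 - 31 + 1 = 32 forward 31-mers, matching the parenthetical note above.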
Stage 4: Comprehensive Covering Set Generation
time kmerfilterset.sh in=shredsSsu.fa.gz initial=ssu_covering_31mers.fa k=31 rcomp=f out=ssu_shred_covering_31mers.fa maxkpp=1 maxpasses=200000 fastawrap=99999 -Xmx8g ow
Purpose: Creates the final comprehensive covering set by processing the shredded sequences and incorporating the initial covering set.
Key Parameters:
- initial=ssu_covering_31mers.fa - Uses Stage 1 output as starting point
- maxpasses=200000 - Allows extensive iteration for thorough coverage
- fastawrap=99999 - Prevents line wrapping in output for processing efficiency
- time - Shell prefix (not a tool parameter) that reports the stage's runtime for performance monitoring
Algorithm: Iteratively processes shredded sequences using the greedy covering set algorithm, starting with the initial covering set and adding kmers until every shredded fragment contains at least one kmer from the final set.
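The seeded variant differs from Stage 1 only in starting from an existing kmer set: shreds already hit by the initial set drop out before any new passes run. A sketch (an illustrative Python model, not the BBTools implementation; the initial kmers are assumed to be loaded as a plain set):

```python
def kmers(seq, k=31):
    """All forward-strand k-mers of a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def seeded_covering_set(shreds, initial, k=31):
    """Greedy covering passes seeded with an existing kmer set
    (the initial= parameter): already-covered shreds are dropped up
    front, then one kmer is added per pass until all are covered."""
    initial = set(initial)
    remaining = [ks for ks in (kmers(s, k) for s in shreds if len(s) >= k)
                 if not (ks & initial)]       # drop already-covered shreds
    covering = list(initial)
    while remaining:
        counts = {}
        for ks in remaining:
            for km in ks:
                counts[km] = counts.get(km, 0) + 1
        best = max(counts, key=counts.get)
        covering.append(best)
        remaining = [ks for ks in remaining if best not in ks]
    return covering
```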
Algorithm Details
The SSU covering set generation employs a two-phase approach tailored to ribosomal RNA diversity:
Greedy Covering Set Algorithm
The core algorithm uses a greedy approach implemented in KmerFilterSetMaker.java:
- Pass-based Processing: Each pass identifies the most frequent kmers in the remaining sequences
- Minimal Retention: maxkpp=1 ensures only the single most effective kmer is retained per pass
- Progressive Filtering: Sequences containing selected kmers are removed from subsequent passes
- Coverage Guarantee: Algorithm continues until every input sequence contains at least one selected kmer
Two-Phase Strategy
- Full-Length Phase: Generates initial covering set from complete SSU sequences, capturing major taxonomic groups
- Fragment Phase: Processes overlapping shreds to identify kmers distinguishing closely related sequences
SSU-Specific Optimizations
- k=31: 31-mer length provides optimal balance between specificity and computational efficiency for SSU sequences
- Forward Strand Only: rcomp=f prevents redundant coverage since SSU databases are standardized to forward orientation
- Boundary Padding: 31-base padding ensures complete kmer coverage at sequence ends
- Overlap Strategy: 30-base overlaps with 150bp fragments create comprehensive kmer contexts
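The interplay of these parameters can be checked with simple arithmetic. In particular, the 30-base overlap equals k - 1, so every 31-mer of the original sequence falls entirely within at least one shred (ignoring any trailing fragment discarded by minlength):

```python
k = 31
length, minlength, overlap = 150, 62, 30

# A fragment of length L yields L - k + 1 forward k-mers.
assert length - k + 1 == 120      # full-length 150 bp shred: 120 kmers
assert minlength - k + 1 == 32    # shortest kept shred: 32 kmers

# Consecutive shreds advance length - overlap = 120 bases.
assert length - overlap == 120

# overlap = k - 1: a kmer starting at any position of the original
# sequence fits entirely inside some fragment, so no kmer contexts
# are lost at fragment boundaries.
assert overlap == k - 1
```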
Memory and Performance
- Memory Scaling: 8GB allocation handles large Silva databases efficiently
- Pass Limitation: maxpasses=200000 prevents infinite loops while allowing thorough processing
- Kmer Selection: Greedy selection minimizes set size while maintaining coverage guarantees
Basic Usage
# 1. Ensure input file is present
ls -la ssu_deduped100pct.fa.gz
# 2. Run the covering set generation pipeline
bash pipelines/silva/makeCoveringSetSsu.sh
# 3. Verify outputs
ls -la ssu_covering_31mers.fa ssu_shred_covering_31mers.fa
Expected Inputs and Outputs
Input Files
- ssu_deduped100pct.fa.gz - Silva SSU database deduplicated at 100% identity (required)
Output Files
- ssu_covering_31mers.fa - Initial covering set from full-length sequences
- ssu_deduped100pct_padded.fa.gz - Padded sequences (intermediate file)
- shredsSsu.fa.gz - Shredded overlapping fragments (intermediate file)
- ssu_shred_covering_31mers.fa - Final comprehensive covering set
Temporary Files
The pipeline generates temporary files during kmerfilterset processing. These are automatically cleaned up upon completion.
Performance Characteristics
Memory Usage
- Base Requirement: 8GB RAM allocation for kmer table construction
- Peak Usage: Memory usage scales with database size and kmer diversity
- Optimization: Forward-strand-only processing reduces memory footprint by ~50%
Execution Time
- Stage 1: Initial covering set generation - typically fastest phase
- Stage 2: Sequence padding - very fast reformatting operation
- Stage 3: Sequence shredding - moderate time, creates many fragments
- Stage 4: Comprehensive covering set - longest phase due to maxpasses=200000
Output Size Optimization
- Greedy Selection: maxkpp=1 creates near-minimal covering sets
- Two-Phase Approach: Balances coverage completeness with set size
- Further Compression: Output can be compressed using kcompress.sh for additional space savings
Applications
Taxonomic Classification
The covering set enables rapid taxonomic classification of unknown SSU sequences by guaranteeing that every known SSU sequence will have at least one kmer match in the covering set.
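As a sketch of how such screening could work (illustrative Python, not a BBTools API):

```python
def matches_covering_set(query, covering, k=31):
    """True if any forward k-mer of the query is in the covering set.
    Every sequence used to build the set is guaranteed to match;
    unrelated sequences usually will not."""
    covering = set(covering)
    return any(query[i:i + k] in covering
               for i in range(len(query) - k + 1))
```

A negative result is a strong signal that the query is not a known SSU sequence; a positive result warrants more detailed alignment-based follow-up.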
Database Indexing
Covering sets can be used as compact indices for large SSU databases, allowing fast preliminary screening before more detailed alignment-based analysis.
Quality Control
The covering set can identify sequences that may be chimeric or low-quality by checking for expected kmer representation patterns in SSU data.
Technical Details
Kmer Selection Strategy
- Frequency-Based Selection: Algorithm prioritizes highly frequent kmers that cover many sequences
- Iterative Refinement: Each pass removes covered sequences, focusing on previously uncovered diversity
- Minimal Redundancy: Single kmer per pass (maxkpp=1) prevents set inflation
Coverage Guarantee
- Complete Coverage: Every input sequence guaranteed to contain ≥1 kmer from final set
- Boundary Handling: Sequence padding ensures kmers can be extracted from sequence ends
- Fragment Coverage: Shredding creates overlapping contexts for comprehensive kmer representation
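The coverage guarantee can be spot-checked after a run with a short script. This sketch assumes plain (optionally gzipped) FASTA inputs and uses the pipeline's file names as an example; the FASTA parsing is deliberately minimal:

```python
import gzip

def read_fasta(path):
    """Minimal FASTA reader (handles .gz); yields sequences only."""
    opener = gzip.open if path.endswith(".gz") else open
    seq = []
    with opener(path, "rt") as fh:
        for line in fh:
            line = line.strip()
            if line.startswith(">"):
                if seq:
                    yield "".join(seq)
                    seq = []
            elif line:
                seq.append(line)
    if seq:
        yield "".join(seq)

def uncovered(db_path, covering_path, k=31):
    """Count database sequences containing no covering kmer;
    0 confirms the coverage guarantee."""
    covering = set(read_fasta(covering_path))
    misses = 0
    for seq in read_fasta(db_path):
        if not any(seq[i:i + k] in covering
                   for i in range(len(seq) - k + 1)):
            misses += 1
    return misses

# Example invocation against the pipeline's outputs:
# uncovered("ssu_deduped100pct.fa.gz", "ssu_shred_covering_31mers.fa")
```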
SSU-Specific Considerations
- Directional Processing: rcomp=f assumes Silva sequences are in standard forward orientation
- Variable Regions: Shredding captures kmers from both conserved and variable SSU regions
- Phylogenetic Diversity: Two-phase approach ensures representation across SSU phylogenetic breadth
Workflow Integration
Silva Database Preparation
This pipeline is typically used as part of Silva database processing workflows:
- Silva database download and formatting
- Sequence deduplication at 100% identity
- Covering set generation (this pipeline)
- Database indexing and deployment
Downstream Applications
- SendSketch: Covering sets accelerate taxonomic sketching operations
- BBSketch: Can use covering sets for rapid similarity estimation
- Taxonomic Servers: Covering sets reduce memory requirements for SSU classification services
Notes and Tips
- The pipeline requires the specific input filename "ssu_deduped100pct.fa.gz" to be present in the working directory
- Stage 4 is timed to monitor performance - large databases may require substantial processing time
- The maxpasses=200000 setting allows for very thorough processing but may not be needed for smaller databases
- Output files can be further compressed using kcompress.sh to reduce storage requirements
- The covering set quality depends on the input database quality - ensure SSU sequences are properly curated
- Memory requirements scale with database size; increase -Xmx values for larger datasets if needed
Related Tools
- kmerfilterset.sh - Core tool for generating covering sets from any sequence collection
- reformat.sh - Sequence formatting and padding utility
- shred.sh - Sequence fragmentation tool for creating overlapping subsequences
- filtersilva.sh - Silva database cleaning and filtering
- kcompress.sh - Kmer set compression for storage optimization
Troubleshooting
Memory Issues
- Increase -Xmx setting if out-of-memory errors occur
- Consider processing smaller subsets for very large databases
- Monitor system memory usage during Stage 4 processing
Performance Issues
- Stage 4 may take significant time for large databases - this is expected
- Use the timing output to estimate total runtime
- Consider reducing maxpasses for faster processing with potentially larger covering sets
File Issues
- Ensure ssu_deduped100pct.fa.gz is properly formatted and accessible
- Check disk space for temporary files and output
- Verify write permissions in the working directory