SSU Covering Set Generator

Script: makeCoveringSetSsu.sh
Source Directory: pipelines/silva/
Author: Brian Bushnell

Specialized pipeline for generating a minimal kmer covering set from Silva SSU (Small Subunit) ribosomal RNA sequences. This creates an optimized reference set where every input SSU sequence contains at least one kmer from the covering set, enabling efficient taxonomic classification and database queries.

Overview

This pipeline processes Silva SSU ribosomal RNA databases to create a minimal covering set of 31-mers. Because every SSU sequence in the database contains at least one kmer from the set, a query can first be screened against the small covering set instead of the full database, which speeds up taxonomic classification and sequence matching.

The pipeline uses a greedy algorithm that first generates an initial covering set from the complete sequences, then shreds the sequences into overlapping fragments and extends the set until every fragment is covered.

Note: This pipeline is designed to work with Silva database files and expects an input file named "ssu_deduped100pct.fa.gz" containing SSU sequences deduplicated at 100% identity.

Prerequisites

System Requirements

Input Requirements

Pipeline Stages

Stage 1: Initial Covering Set Generation

kmerfilterset.sh in=ssu_deduped100pct.fa.gz k=31 rcomp=f out=ssu_covering_31mers.fa maxkpp=1 -Xmx8g

Purpose: Generates an initial minimal covering set of 31-mers from the complete SSU database.

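Key Parameters:

  in=ssu_deduped100pct.fa.gz: the deduplicated Silva SSU input
  k=31: kmer length
  rcomp=f: forward and reverse-complement kmers are treated as distinct
  maxkpp=1: at most one kmer is retained per pass
  out=ssu_covering_31mers.fa: the initial covering set
  -Xmx8g: 8 GB Java heap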

Algorithm: Uses a greedy approach that selects the most frequent kmers in each pass, removes sequences containing those kmers, and repeats until all sequences are covered. This creates a compact initial covering set.
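The greedy loop described above can be sketched on a toy scale. This is a minimal illustration, not the kmerfilterset.sh implementation: the function name greedy_cover is hypothetical, input is assumed to be one plain sequence per line, only forward kmers are counted (as with rcomp=f), and one kmer is chosen per pass (as with maxkpp=1). Ties are broken alphabetically, an arbitrary choice made here for reproducibility.

```shell
# Toy greedy kmer covering set. Reads one sequence per line on stdin,
# prints the chosen kmers in the order they were selected.
greedy_cover() {
  local k="$1"
  awk -v k="$k" '
    { seq[NR] = toupper($0); live[NR] = 1; remaining++ }
    END {
      while (remaining > 0) {
        # Count, for each kmer, how many uncovered sequences contain it.
        split("", count)
        for (i = 1; i <= NR; i++) {
          if (!live[i]) continue
          split("", seen)
          for (p = 1; p + k - 1 <= length(seq[i]); p++) {
            km = substr(seq[i], p, k)
            if (!(km in seen)) { seen[km] = 1; count[km]++ }
          }
        }
        # Choose the most frequent kmer; break ties alphabetically.
        best = ""; bestn = 0
        for (km in count)
          if (count[km] > bestn || (count[km] == bestn && km < best)) {
            best = km; bestn = count[km]
          }
        if (bestn == 0) break   # remaining sequences are shorter than k
        print best
        # Retire every sequence that contains the chosen kmer.
        for (i = 1; i <= NR; i++)
          if (live[i] && index(seq[i], best) > 0) { live[i] = 0; remaining-- }
      }
    }'
}
```

For example, `printf 'ACGTA\nCGTAC\nTTTTT\n' | greedy_cover 3` selects CGT in the first pass (it occurs in two sequences, tied with GTA and chosen alphabetically) and TTT in the second.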

Stage 2: Sequence Padding

reformat.sh in=ssu_deduped100pct.fa.gz out=ssu_deduped100pct_padded.fa.gz padleft=31 padright=31 ow

Purpose: Adds 31 bases of padding to both ends of each sequence to ensure complete kmer coverage at sequence boundaries.

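Key Parameters:

  padleft=31, padright=31: number of padding bases added to each end
  ow: overwrite an existing output file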

Rationale: Padding ensures that kmers can be generated from the very ends of sequences, preventing coverage gaps at sequence boundaries that could reduce the effectiveness of the covering set.
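What the padding step does to each record can be sketched as follows. This is an illustration under assumptions, not reformat.sh itself: the function name pad_fasta is hypothetical, and N is assumed as the padding symbol unless another is passed as the second argument.

```shell
# Pad both ends of every FASTA sequence read from stdin.
# Usage: pad_fasta <count> [symbol]   (symbol defaults to N, an assumption)
pad_fasta() {
  local n="$1" sym="${2:-N}"
  awk -v n="$n" -v sym="$sym" '
    # Print the accumulated sequence with padding on both sides.
    function emit() { if (have) print pad seq pad; seq = ""; have = 0 }
    BEGIN { for (i = 0; i < n; i++) pad = pad sym }
    /^>/  { emit(); print; have = 1; next }   # header starts a new record
          { seq = seq $0 }                    # join wrapped sequence lines
    END   { emit() }'
}
```

With a count of 31, a sequence of length L is written as a single line of length L + 62.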

Stage 3: Sequence Shredding

shred.sh in=ssu_deduped100pct_padded.fa.gz length=150 minlength=62 overlap=30 out=shredsSsu.fa.gz ow

Purpose: Breaks the padded sequences into overlapping 150-base fragments so that coverage can be required of each fragment rather than only of each full-length sequence.

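Key Parameters:

  length=150: target fragment length
  overlap=30: bases shared by consecutive fragments
  minlength=62: shortest fragment kept (twice the kmer length of 31)
  ow: overwrite an existing output file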

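Strategy: Each shred is a substring of a padded sequence, so shredding introduces no kmers beyond those already in the padded input. What changes is the coverage requirement: every fragment, rather than every full-length sequence, must contain a covering kmer. This spreads the set along the length of each sequence, so that partial sequences from any region still match.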
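The fragment layout implied by length, overlap, and minlength can be worked out with a short sketch. The function name shred_coords is hypothetical, input is assumed to be one plain sequence per line, and the handling of the final short fragment (dropped when below minlength) is an assumption about shred.sh, not a statement of its behavior.

```shell
# Print "offset<TAB>length" for each fragment of every input line.
# Usage: shred_coords <length> <overlap> <minlength>
shred_coords() {
  local len="$1" overlap="$2" minlen="$3"
  awk -v len="$len" -v ov="$overlap" -v minlen="$minlen" '
    {
      step = len - ov                     # distance between fragment starts
      if (step <= 0) {
        print "overlap must be smaller than length" > "/dev/stderr"
        exit 1
      }
      n = length($0)
      for (start = 1; start <= n; start += step) {
        fraglen = length(substr($0, start, len))
        if (fraglen >= minlen) printf "%d\t%d\n", start - 1, fraglen
        if (start + len - 1 >= n) break   # this fragment reached the end
      }
    }'
}
```

In this sketch, with length=150, overlap=30, and minlength=62, a 300-base line yields fragments at offsets 0 and 120 only, because the 60-base tail is below minlength. The same line padded to 362 bases also yields a 122-base fragment at offset 240.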

Stage 4: Comprehensive Covering Set Generation

time kmerfilterset.sh in=shredsSsu.fa.gz initial=ssu_covering_31mers.fa k=31 rcomp=f out=ssu_shred_covering_31mers.fa maxkpp=1 maxpasses=200000 fastawrap=99999 -Xmx8g ow

Purpose: Creates the final comprehensive covering set by processing the shredded sequences and incorporating the initial covering set.

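Key Parameters:

  initial=ssu_covering_31mers.fa: seeds the run with the Stage 1 covering set
  k=31, rcomp=f, maxkpp=1: same kmer settings as Stage 1
  maxpasses=200000: upper limit on the number of greedy passes
  fastawrap=99999: FASTA line-wrap length, large enough to keep each record on one line
  -Xmx8g: 8 GB Java heap
  ow: overwrite an existing output file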

Algorithm: Iteratively processes shredded sequences using the greedy covering set algorithm, starting with the initial covering set and adding kmers until every shredded fragment contains at least one kmer from the final set.
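The stopping condition, that every fragment contains at least one kmer from the set, can be checked independently of the tool. The sketch below is an assumption-laden helper, not part of the pipeline: the name count_uncovered is hypothetical, both files are assumed to be uncompressed FASTA with each sequence on a single line, and matching is forward-strand and case-sensitive.

```shell
# Print the number of sequences that contain none of the covering kmers.
# Usage: count_uncovered <kmers.fa> <fragments.fa>
count_uncovered() {
  local kmers_fa="$1" frags_fa="$2" patterns
  patterns="$(mktemp)"
  # Keep only the kmer lines, dropping FASTA headers.
  grep -v '^>' "$kmers_fa" > "$patterns"
  # -F: fixed strings; -f: read patterns from file;
  # -v -c: count the sequence lines that match no pattern.
  grep -v '^>' "$frags_fa" | grep -v -c -F -f "$patterns" || true
  rm -f "$patterns"
}
```

A result of 0 means the covering property holds for the given files. The real shredsSsu.fa.gz would need to be decompressed, and unwrapped if its sequences span several lines, before being passed to a check like this.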

Algorithm Details

SSU covering set generation combines a greedy set-cover algorithm with a two-phase input strategy.

Greedy Covering Set Algorithm

The core algorithm is a greedy approach implemented in KmerFilterSetMaker.java.

Two-Phase Strategy

  1. Full-Length Phase: Generates the initial covering set from complete SSU sequences, capturing major taxonomic groups
  2. Fragment Phase: Extends the set using overlapping shreds, so that every fragment, not only every full-length sequence, contains a covering kmer

SSU-Specific Optimizations

Memory and Performance

Basic Usage

# 1. Ensure input file is present
ls -la ssu_deduped100pct.fa.gz

# 2. Run the covering set generation pipeline
bash pipelines/silva/makeCoveringSetSsu.sh

# 3. Verify outputs
ls -la ssu_covering_31mers.fa ssu_shred_covering_31mers.fa
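Steps 1 and 3 can be made slightly stricter than a directory listing. The helpers below are a sketch using only standard gzip and grep; the names check_gz and count_records are hypothetical and not part of the pipeline.

```shell
# Succeed only if the file exists, is non-empty, and passes gzip's integrity test.
check_gz() {
  [ -s "$1" ] && gzip -t "$1" 2>/dev/null
}

# Print the number of FASTA records (header lines) in an uncompressed FASTA file.
count_records() {
  grep -c '^>' "$1" || true
}
```

For example, `check_gz ssu_deduped100pct.fa.gz || echo "input missing or corrupt"` before running the pipeline, and `count_records ssu_shred_covering_31mers.fa` afterwards to see how many kmers the final set holds.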

Expected Inputs and Outputs

Input Files

Output Files

Temporary Files

The pipeline generates temporary files during kmerfilterset processing. These are automatically cleaned up upon completion.

Performance Characteristics

Memory Usage

Execution Time

Output Size Optimization

Applications

Taxonomic Classification

The covering set enables rapid taxonomic classification of unknown SSU sequences by guaranteeing that every known SSU sequence will have at least one kmer match in the covering set.

Database Indexing

Covering sets can be used as compact indices for large SSU databases, allowing fast preliminary screening before more detailed alignment-based analysis.
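A preliminary screen of this kind can be sketched with a hash lookup over each query's kmers. This is a toy illustration, not a BBTools command: the function name screen_queries is hypothetical, both inputs are assumed to be uncompressed FASTA with each sequence on a single line, the kmer file is assumed to be non-empty, and only the forward strand is checked.

```shell
# Print the FASTA records whose sequence contains at least one covering kmer.
# Usage: screen_queries <k> <kmers.fa> <queries.fa>
screen_queries() {
  local k="$1" kmers_fa="$2" queries_fa="$3"
  awk -v k="$k" '
    # First file: load the covering kmers into a hash table.
    FNR == NR { if ($0 !~ /^>/) cover[toupper($0)] = 1; next }
    # Second file: remember each header, then test the sequence line.
    /^>/ { header = $0; next }
    {
      s = toupper($0)
      for (p = 1; p + k - 1 <= length(s); p++)
        if (substr(s, p, k) in cover) { print header; print $0; break }
    }' "$kmers_fa" "$queries_fa"
}
```

Only the records that pass this screen would then go on to alignment-based analysis. Queries with no hit are not necessarily non-SSU, since the covering guarantee applies only to sequences in the database the set was built from.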

Quality Control

The covering set can identify sequences that may be chimeric or low-quality by checking for expected kmer representation patterns in SSU data.

Technical Details

Kmer Selection Strategy

Coverage Guarantee

SSU-Specific Considerations

Workflow Integration

Silva Database Preparation

This pipeline is typically used as part of Silva database processing workflows:

  1. Silva database download and formatting
  2. Sequence deduplication at 100% identity
  3. Covering set generation (this pipeline)
  4. Database indexing and deployment

Downstream Applications

Notes and Tips

Related Tools

Troubleshooting

Memory Issues

Performance Issues

File Issues