makeCoveringSetLsu.sh

Script: makeCoveringSetLsu.sh
Source Directory: pipelines/silva/
Author: Brian Bushnell

Generates a minimal covering set of 31-mers for Large Subunit (LSU) ribosomal RNA sequences from the Silva database. The result is an optimized kmer set in which every LSU sequence contains at least one kmer from the set, enabling efficient sequence classification and contamination filtering. The pipeline uses shredding and iterative optimization to minimize the final kmer set size.

Purpose

This pipeline creates a highly optimized kmer filter set specifically for Large Subunit (23S/28S) ribosomal RNA sequences. The covering set property ensures that any authentic LSU rRNA sequence will match at least one 31-mer in the set, while sequences lacking matches are likely non-ribosomal contaminants or different RNA types. This is essential for Silva database processing and ribosomal RNA classification workflows.
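
In practice the covering set acts as a compact k=31 reference for BBDuk-style matching: any read or sequence sharing at least one 31-mer with the set is flagged as likely LSU rRNA. A minimal sketch of using it to remove such reads from an RNA-seq library (the read filenames here are hypothetical; only the covering-set FASTA comes from this pipeline):

# Reads containing any covering-set 31-mer are dropped; the remainder go to out=
bbduk.sh in=rnaseq_reads.fq.gz out=rnaseq_minus_lsu.fq.gz ref=lsu_shred_covering_31mers.fa k=31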

Key Applications

Classifying unknown or mixed sequences as LSU rRNA versus non-LSU
Filtering ribosomal RNA contamination from sequence datasets
Supporting Silva database processing and rRNA classification workflows

Prerequisites

BBTools installed, with kmerfilterset.sh, reformat.sh, shred.sh, and bbduk.sh available
The deduplicated Silva LSU input file lsu_deduped100pct.fa.gz (see Input Requirements)
At least 2 GB of memory available for the Java heap (-Xmx2g)

Pipeline Stages

Stage 1: Initial Covering Set Generation

Creates a basic covering set directly from the deduplicated LSU sequences:

kmerfilterset.sh in=lsu_deduped100pct.fa.gz k=31 rcomp=f out=lsu_covering_31mers.fa maxkpp=1

Parameters:

in=lsu_deduped100pct.fa.gz: deduplicated LSU input sequences
k=31: kmer length
rcomp=f: do not treat reverse complements as equivalent
out=lsu_covering_31mers.fa: output FASTA of selected 31-mers
maxkpp=1: limits how many kmers are selected at a time, keeping the set minimal

Note: This stage is commented out in the shipped script; its output (lsu_covering_31mers.fa) serves as the initial= seed set for Stage 4.

Stage 2: Sequence Padding

Adds padding to sequence ends to ensure complete coverage during shredding:

reformat.sh in=lsu_deduped100pct.fa.gz out=lsu_deduped100pct_padded.fa.gz padleft=31 padright=31 ow

Parameters:

in=lsu_deduped100pct.fa.gz: deduplicated LSU input sequences
out=lsu_deduped100pct_padded.fa.gz: padded output
padleft=31, padright=31: pad each end of every sequence by 31 bases, so each sequence grows by 62 bp
ow: overwrite existing output files

Note: This stage is also commented out in the shipped script; its padded output is the input to the shredding step in Stage 3.

Stage 3: Sequence Shredding

Fragments padded sequences into overlapping reads to increase kmer diversity:

shred.sh in=lsu_deduped100pct_padded.fa.gz length=150 minlength=62 overlap=30 out=shreds.fa.gz ow

Parameters:

in=lsu_deduped100pct_padded.fa.gz: padded sequences from Stage 2
length=150: target shred length
minlength=62: minimum shred length (62 = 2 x 31, so even the shortest shred spans 32 distinct 31-mer start positions)
overlap=30: overlap between adjacent shreds, so consecutive shreds start about 120 bp apart
out=shreds.fa.gz: shredded output reads
ow: overwrite existing output files

Note: This stage is commented out but documents the shredding strategy; its output (shreds.fa.gz) is the input to Stages 4 and 5.
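
If the shredding stage is run, a quick length summary can confirm the parameters behaved as expected; a minimal sketch using stats.sh (the BBTools summary tool also used under Quality Control below):

# Shred lengths should fall between minlength=62 and length=150
stats.sh in=shreds.fa.gz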

Stage 4: Preliminary Shred Covering Set

Generates a covering set from the shredded sequences, seeded with the Stage 1 kmers (preparation step):

kmerfilterset.sh in=shreds.fa.gz initial=lsu_covering_31mers.fa k=31 rcomp=f out=lsu_shred_covering_31mers.fa maxkpp=1

Note: This stage is commented out, indicating it's preparatory for the final optimization.
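
The final stage reads its starting set from lsu_shred_covering_31mers_temp.fa, which is not produced under that name by any of the stages above. Presumably the Stage 4 output is renamed to the temp file so that Stage 5 can refine it while writing the final set back to the original name; a minimal sketch under that assumption:

# Assumed step: seed Stage 5 with the Stage 4 output under the _temp name
mv lsu_shred_covering_31mers.fa lsu_shred_covering_31mers_temp.fa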

Stage 5: Optimized Covering Set Generation (Active)

The core pipeline stage that generates the final optimized covering set:

nohup time kmerfilterset.sh in=shreds.fa.gz initial=lsu_shred_covering_31mers_temp.fa k=31 rcomp=f out=lsu_shred_covering_31mers.fa maxkpp=1 maxpasses=40000 fastawrap=99999 -Xmx2g ow

Active Parameters:

in=shreds.fa.gz: shredded input reads from Stage 3
initial=lsu_shred_covering_31mers_temp.fa: starting kmer set to refine
k=31: kmer length
rcomp=f: do not treat reverse complements as equivalent
out=lsu_shred_covering_31mers.fa: final optimized covering set
maxkpp=1: limits how many kmers are selected at a time, keeping the set minimal
maxpasses=40000: upper bound on optimization passes
fastawrap=99999: line-wrap length for the output FASTA, effectively keeping each record on a single line
-Xmx2g: cap the Java heap at 2 GB
ow: overwrite existing output files
nohup and time: keep the job running after logout and report total runtime

Algorithm Details

Covering Set Generation Strategy

The pipeline employs a multi-stage approach to minimize the final kmer set while maintaining complete sequence coverage; a consolidated sketch of all stages follows the two subsections below:

Shredding Strategy

Sequences are padded at both ends and then fragmented into 150 bp shreds overlapping by 30 bp. Because the covering set is built from the shreds rather than the full-length sequences, selected kmers end up distributed along each LSU sequence instead of relying on a single match per sequence, which makes the set more robust for classifying short reads and fragments.

Iterative Optimization

The final stage refines a preliminary covering set over many passes (maxpasses=40000), with kmer selection tightly limited (maxkpp=1), so that the set grows no more than necessary to keep every shred covered.
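
For reference, a consolidated sketch of the full pipeline with the commented-out stages enabled, assembled from the commands in the stage descriptions above (the renaming of the Stage 4 output to the _temp seed file is an assumption, as noted after Stage 4):

# Stage 1: initial covering set from full-length sequences
kmerfilterset.sh in=lsu_deduped100pct.fa.gz k=31 rcomp=f out=lsu_covering_31mers.fa maxkpp=1

# Stage 2: pad sequence ends
reformat.sh in=lsu_deduped100pct.fa.gz out=lsu_deduped100pct_padded.fa.gz padleft=31 padright=31 ow

# Stage 3: shred padded sequences
shred.sh in=lsu_deduped100pct_padded.fa.gz length=150 minlength=62 overlap=30 out=shreds.fa.gz ow

# Stage 4: preliminary shred-based covering set, seeded with the Stage 1 set
kmerfilterset.sh in=shreds.fa.gz initial=lsu_covering_31mers.fa k=31 rcomp=f out=lsu_shred_covering_31mers.fa maxkpp=1

# Rename the preliminary set as the seed for the final optimization (assumed step)
mv lsu_shred_covering_31mers.fa lsu_shred_covering_31mers_temp.fa

# Stage 5: final optimized covering set
nohup time kmerfilterset.sh in=shreds.fa.gz initial=lsu_shred_covering_31mers_temp.fa k=31 rcomp=f out=lsu_shred_covering_31mers.fa maxkpp=1 maxpasses=40000 fastawrap=99999 -Xmx2g ow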

31-mer Selection Rationale

The choice of 31-mers for LSU rRNA classification balances specificity and coverage:

Specificity: with roughly 4.6 x 10^18 possible 31-mers (4^31), chance matches between the covering set and unrelated sequences are extremely unlikely.
Coverage: 31 bp is short enough to be conserved across closely related LSU sequences and to fit comfortably within typical sequencing read lengths.
Compatibility: 31 is the maximum kmer length supported by BBDuk-style exact kmer matching, so the set can be used directly as a k=31 reference.

Covering Set Properties

The resulting kmer set has these mathematical properties:

Completeness: every sequence (and every shred) in the input contains at least one 31-mer from the set.
Compactness: iterative selection keeps the set far smaller than the full set of kmers present in the input, approaching a minimal covering set.
Strand orientation: kmers are stored as given (rcomp=f), without canonicalizing reverse complements.

Input Requirements

Required Input File

lsu_deduped100pct.fa.gz: Deduplicated Large Subunit rRNA sequences

File Preparation

Before running this pipeline, ensure you have:

The Silva LSU reference sequences downloaded and deduplicated at 100% identity into lsu_deduped100pct.fa.gz
BBTools installed, with kmerfilterset.sh, reformat.sh, and shred.sh on the PATH
Enough disk space for the shredded intermediate file and the final covering set
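
The 100% deduplication itself happens before this script runs. A minimal sketch of one way to produce the input, assuming BBTools dedupe.sh and a downloaded Silva LSU FASTA named silva_lsu.fa.gz (the actual Silva preparation steps may differ):

# Remove duplicate sequences; by default dedupe.sh collapses exact duplicates and absorbs contained sequences
dedupe.sh in=silva_lsu.fa.gz out=lsu_deduped100pct.fa.gz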

Output Files

Primary Output

lsu_shred_covering_31mers.fa: Final optimized covering set

Intermediate Files (from commented stages)

When stages 1-4 are active, additional intermediate files are generated:

lsu_covering_31mers.fa: initial covering set from full-length sequences (Stage 1)
lsu_deduped100pct_padded.fa.gz: padded sequences (Stage 2)
shreds.fa.gz: shredded reads (Stage 3)
lsu_shred_covering_31mers.fa: preliminary shred covering set (Stage 4), presumably renamed to lsu_shred_covering_31mers_temp.fa to seed Stage 5

Usage Example

# Prepare Silva LSU data (prerequisite steps)
# wget Silva LSU database and perform 100% deduplication
# Result should be: lsu_deduped100pct.fa.gz

# Run the pipeline (execute in directory containing input file)
cd /path/to/silva/data
nohup /path/to/bbtools/pipelines/silva/makeCoveringSetLsu.sh &

# Monitor progress
tail -f nohup.out

# Use the generated covering set for classification
bbduk.sh in=unknown_sequences.fa ref=lsu_shred_covering_31mers.fa k=31 stats=lsu_matches.txt

Pipeline Configuration

Memory Management

The pipeline is configured for efficient memory usage:

The active kmerfilterset.sh run caps the Java heap at 2 GB (-Xmx2g). If the run fails with an out-of-memory error, the cap can simply be raised, as sketched below.
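
A minimal sketch of the same Stage 5 command with a larger heap; the 8 GB value is an arbitrary example, not a setting from the original script:

nohup time kmerfilterset.sh in=shreds.fa.gz initial=lsu_shred_covering_31mers_temp.fa k=31 rcomp=f out=lsu_shred_covering_31mers.fa maxkpp=1 maxpasses=40000 fastawrap=99999 -Xmx8g ow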

Optimization Parameters

Performance Characteristics

Processing Time

Memory Requirements

Output Size Expectations

Implementation Details

Shredding Strategy (Commented Stages)

When enabled, the shredding approach enhances kmer set optimization: padding (Stage 2) ensures the ends of each sequence are fully represented in the shreds, and shredding (Stage 3) into 150 bp pieces with 30 bp overlap forces the covering set to place kmers at regular intervals along every sequence rather than relying on a single kmer per full-length sequence.

Kmer Selection Algorithm

The optimization process uses a greedy, pass-based approach: kmer selection is tightly limited in each pass (maxkpp=1) and repeated for up to 40,000 passes (maxpasses=40000), so the set grows only as far as needed to keep every shred covered. The detailed selection heuristics are internal to kmerfilterset.sh.

Silva Database Integration

This pipeline is specifically designed for Silva LSU processing:

Comparison with SSU Pipeline

This pipeline parallels makeCoveringSetSsu.sh with key differences:

Similarities

Both pipelines build a minimal 31-mer covering set with kmerfilterset.sh from shredded, deduplicated Silva sequences.

Differences

This pipeline operates on Large Subunit (23S/28S) sequences and uses lsu_-prefixed filenames, whereas makeCoveringSetSsu.sh targets Small Subunit (16S/18S) sequences.

Quality Control

Validation Steps

To verify the covering set quality:

# Test coverage on original sequences
bbduk.sh in=lsu_deduped100pct.fa.gz ref=lsu_shred_covering_31mers.fa k=31 stats=coverage_test.txt

# Check for uncovered sequences (should be zero)
bbduk.sh in=lsu_deduped100pct.fa.gz ref=lsu_shred_covering_31mers.fa k=31 outm=covered.fa outu=uncovered.fa

# Verify file size and kmer count
stats.sh in=lsu_shred_covering_31mers.fa

Expected Results

uncovered.fa from the second command should be empty: every original LSU sequence contains at least one covering-set 31-mer.
coverage_test.txt should show matches for all input sequences.
stats.sh should report a compact kmer set, far smaller than the deduplicated input.

Troubleshooting

Common Issues

Performance Tuning

Integration with Other Tools

Downstream Applications

The generated covering set can be used with various BBTools:

bbduk.sh: kmer matching against the set for contamination removal, LSU identification, and read classification (see the examples below)
stats.sh: quick inspection of the set's size and composition

Pipeline Integration

# Example: LSU contamination removal pipeline
bbduk.sh in=mixed_sequences.fa ref=lsu_shred_covering_31mers.fa k=31 outm=lsu_matches.fa outu=non_lsu.fa

# Example: LSU identification in metagenomes  
bbduk.sh in=metagenomic_reads.fq ref=lsu_shred_covering_31mers.fa k=31 stats=lsu_classification.txt

Related Tools

kmerfilterset.sh, reformat.sh, shred.sh: the BBTools programs invoked by this pipeline
bbduk.sh, stats.sh: used downstream for classification and validation
makeCoveringSetSsu.sh: the companion pipeline for Small Subunit (SSU) rRNA

Support

For questions and support: