makeCoveringSetLsu.sh
Generates a minimal covering set of 31-mers for Large Subunit (LSU) ribosomal RNA sequences from the Silva database. The result is an optimized kmer set in which every LSU sequence contains at least one kmer from the set, enabling efficient sequence classification and contamination filtering. The pipeline uses shredding and iterative optimization to minimize the final kmer set size.
Purpose
This pipeline creates a highly optimized kmer filter set specifically for Large Subunit (23S/28S) ribosomal RNA sequences. The covering set property ensures that any authentic LSU rRNA sequence will match at least one 31-mer in the set, while sequences lacking matches are likely non-ribosomal contaminants or different RNA types. This is essential for Silva database processing and ribosomal RNA classification workflows.
Key Applications
- Contamination Detection: Identify non-ribosomal sequences in rRNA datasets
- Quality Control: Filter spurious sequences from Silva database processing
- Classification: Distinguish LSU rRNA from other sequence types
- Database Curation: Support Silva database preparation and validation
Prerequisites
- Deduplicated LSU rRNA sequence file: lsu_deduped100pct.fa.gz
- High-quality, curated LSU sequences (typically from Silva database)
- Sufficient memory (minimum 2GB recommended, 8GB+ for large datasets)
- Adequate disk space for intermediate files and shredded sequences
Pipeline Stages
Stage 1: Initial Covering Set Generation
Creates a basic covering set directly from the deduplicated LSU sequences:
kmerfilterset.sh in=lsu_deduped100pct.fa.gz k=31 rcomp=f out=lsu_covering_31mers.fa maxkpp=1
Parameters:
- k=31 - Uses 31-mers for optimal specificity and sensitivity balance
- rcomp=f - Disables reverse complement to maintain strand specificity
- maxkpp=1 - Limits to maximum 1 kmer per sequence position
Note: This stage is commented out in the final pipeline, indicating it may be a preparatory step or alternative approach.
Stage 2: Sequence Padding
Adds padding to sequence ends to ensure complete coverage during shredding:
reformat.sh in=lsu_deduped100pct.fa.gz out=lsu_deduped100pct_padded.fa.gz padleft=31 padright=31 ow
Parameters:
- padleft=31 - Adds 31 bases of padding to sequence start
- padright=31 - Adds 31 bases of padding to sequence end
- ow - Overwrites existing output file
Note: This stage is also commented out, suggesting it's optional or used in alternative workflows.
Stage 3: Sequence Shredding
Fragments padded sequences into overlapping reads to increase kmer diversity:
shred.sh in=lsu_deduped100pct_padded.fa.gz length=150 minlength=62 overlap=30 out=shreds.fa.gz ow
Parameters:
- length=150 - Creates fragments of 150 base pairs
- minlength=62 - Minimum fragment length (must be ≥ 2×k for 31-mers)
- overlap=30 - 30bp overlap between consecutive fragments
Note: This stage is commented out but shows the shredding strategy used.
Stage 4: Preliminary Shred Covering Set
Generates covering set from shredded sequences (preparation step):
kmerfilterset.sh in=shreds.fa.gz initial=lsu_covering_31mers.fa k=31 rcomp=f out=lsu_shred_covering_31mers.fa maxkpp=1
Note: This stage is commented out, indicating it's preparatory for the final optimization.
Stage 5: Optimized Covering Set Generation (Active)
The core pipeline stage that generates the final optimized covering set:
nohup time kmerfilterset.sh in=shreds.fa.gz initial=lsu_shred_covering_31mers_temp.fa k=31 rcomp=f out=lsu_shred_covering_31mers.fa maxkpp=1 maxpasses=40000 fastawrap=99999 -Xmx2g ow
Active Parameters:
- in=shreds.fa.gz - Input shredded LSU sequences
- initial=lsu_shred_covering_31mers_temp.fa - Starting kmer set for optimization
- k=31 - 31-mer length for optimal rRNA classification
- rcomp=f - Strand-specific kmers (no reverse complement)
- maxkpp=1 - One kmer maximum per sequence position
- maxpasses=40000 - Up to 40,000 optimization iterations
- fastawrap=99999 - Long line wrapping for compact output
- -Xmx2g - 2GB memory allocation
- ow - Overwrite existing output
Algorithm Details
Covering Set Generation Strategy
The pipeline employs a sophisticated multi-stage approach to minimize the final kmer set while maintaining complete sequence coverage:
Shredding Strategy
- Fragment Generation: 150bp fragments with 30bp overlap create dense kmer sampling
- Coverage Enhancement: Multiple fragments per region increase kmer representation
- Boundary Handling: Padding ensures sequence ends are properly represented
- Minimum Length: 62bp minimum ensures at least 32 unique 31-mers per fragment
Iterative Optimization
- Greedy Selection: Each pass identifies most frequent uncovered kmers
- Coverage Tracking: Removes sequences matched by selected kmers
- Convergence: Continues until all sequences are covered or maxpasses is reached (a quick sanity check is sketched after this list)
- Memory Efficiency: 2GB allocation handles large Silva LSU datasets
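To confirm the covering and convergence claims once the active stage has finished, the shreds can be screened back against the finished set. This is an optional check that reuses the pipeline's own filenames; rcomp=f is passed so that matching stays strand-specific, mirroring the generation parameters.
# Optional sanity check after Stage 5 completes: any shred lacking a selected kmer lands in missed_shreds.fa
bbduk.sh in=shreds.fa.gz ref=lsu_shred_covering_31mers.fa k=31 rcomp=f outu=missed_shreds.fa ow
grep -c ">" missed_shreds.fa   # expect 0 if the run converged before hitting maxpasses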
31-mer Selection Rationale
The choice of 31-mers for LSU rRNA classification balances specificity and coverage:
- Specificity: 31bp provides 4^31 possible sequences (about 4.6×10^18), ensuring high discrimination
- Conservation: LSU rRNA has conserved regions that maintain 31-mer matches across species
- Variable Regions: Sufficient length to span variable regions without over-fragmentation
- Memory Efficiency: Manageable memory footprint for large-scale processing
Covering Set Properties
The resulting kmer set has these mathematical properties:
- Covering Property: Every input LSU sequence contains ≥1 kmer from the set
- Minimality: Iterative optimization reduces redundancy while maintaining coverage
- Strand Specificity: No reverse complement ensures directional sensitivity
- Position Independence: maxkpp=1 prevents positional bias in kmer selection
Input Requirements
Required Input File
lsu_deduped100pct.fa.gz: Deduplicated Large Subunit rRNA sequences
- Must be in FASTA format (gzipped acceptable)
- Should contain high-quality, curated LSU rRNA sequences
- Typically sourced from Silva database after deduplication
- 100% identity clustering removes redundant sequences
- Represents the full diversity of LSU rRNA sequences to be covered
File Preparation
Before running this pipeline, ensure you have:
- Downloaded Silva LSU rRNA database
- Performed 100% identity deduplication to remove exact duplicates
- Quality filtered to remove partial or low-quality sequences
- Converted to compressed FASTA format for efficiency (one possible command sequence is sketched below)
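One possible way to carry out these steps with BBTools is sketched below. The Silva filename and the reliance on dedupe.sh defaults are assumptions; adjust them to your Silva release and curation standards.
# Hypothetical preparation sketch; SILVA_LSURef.fasta.gz stands in for the Silva LSU release you downloaded
dedupe.sh in=SILVA_LSURef.fasta.gz out=lsu_deduped100pct.fa.gz ow
# Quick summary of the deduplicated file before starting the pipeline
stats.sh in=lsu_deduped100pct.fa.gz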
Output Files
Primary Output
lsu_shred_covering_31mers.fa: Final optimized covering set
- Contains minimal set of 31-mers that cover all input LSU sequences
- FASTA format with each kmer as a separate sequence
- Optimized through iterative refinement for minimal size
- Ready for use with BBDuk, BBMap, or other BBTools for classification
Intermediate Files (from commented stages)
When stages 1-4 are active, additional intermediate files are generated:
- lsu_covering_31mers.fa - Initial covering set from full sequences
- lsu_deduped100pct_padded.fa.gz - Padded sequences for edge coverage
- shreds.fa.gz - Shredded sequence fragments for optimization
- lsu_shred_covering_31mers_temp.fa - Temporary optimization input
Usage Example
# Prepare Silva LSU data (prerequisite steps)
# wget Silva LSU database and perform 100% deduplication
# Result should be: lsu_deduped100pct.fa.gz
# Run the pipeline (execute in directory containing input file)
cd /path/to/silva/data
nohup /path/to/bbtools/pipelines/silva/makeCoveringSetLsu.sh &
# Monitor progress
tail -f nohup.out
# Use the generated covering set for classification
bbduk.sh in=unknown_sequences.fa ref=lsu_shred_covering_31mers.fa k=31 stats=lsu_matches.txt
Pipeline Configuration
Memory Management
The pipeline is configured for efficient memory usage:
- Memory Allocation: 2GB (-Xmx2g) for the optimization stage
- Processing Mode: Runs with nohup for long-running execution
- Timing: Includes time measurement for performance monitoring
- Background Execution: Designed for unattended operation
Optimization Parameters
- maxpasses=40000: Extensive iteration limit for thorough optimization
- fastawrap=99999: Minimal line wrapping for compact kmer storage
- maxkpp=1: Prevents kmer redundancy within sequences
- ow: Automatic overwrite of existing output files
Performance Characteristics
Processing Time
- Initial Pass: Minutes to hours depending on LSU dataset size
- Optimization: May run for several hours to achieve minimal set
- Convergence: Algorithm stops when no further reduction is possible
- Scaling: Time increases with sequence diversity and dataset size
Memory Requirements
- Base Requirement: 2GB for the optimization process
- Input Buffering: Additional memory for compressed file processing
- Kmer Storage: Memory scales with intermediate kmer set size
- Temporary Data: Space for tracking sequence coverage during optimization
Output Size Expectations
- Initial Set: May contain hundreds of thousands of kmers
- Optimized Set: Typically reduces to 10-50% of initial size
- Final Size: Depends on LSU sequence diversity in input dataset
- Coverage: Maintains 100% coverage of input sequences despite reduction
Implementation Details
Shredding Strategy (Commented Stages)
When enabled, the shredding approach enhances kmer set optimization:
- Padding Strategy: 31bp padding on both ends ensures edge kmers are captured
- Fragment Size: 150bp fragments provide 120 unique 31-mers per fragment
- Overlap Strategy: 30bp overlap ensures continuous coverage across fragment boundaries
- Minimum Length: 62bp minimum allows at least 32 unique 31-mers (the arithmetic is shown below)
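These counts follow from the standard kmer arithmetic: a fragment of length L contains L - k + 1 kmers. The shell arithmetic below shows where 120 and 32 come from.
# kmers per fragment = length - k + 1
echo $(( 150 - 31 + 1 ))   # 120 31-mers in a full-length 150bp shred
echo $(( 62 - 31 + 1 ))    # 32 31-mers in a minimum-length 62bp shred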
Kmer Selection Algorithm
The optimization process uses a sophisticated greedy approach:
- Frequency Analysis: Identifies most common uncovered kmers in each pass
- Coverage Tracking: Maintains set of sequences covered by selected kmers
- Iterative Refinement: Removes covered sequences to focus on gaps
- Convergence Detection: Stops when all sequences are covered or no progress is made
Silva Database Integration
This pipeline is specifically designed for Silva LSU processing:
- Database Compatibility: Works with Silva LSU reference sequences
- Quality Standards: Assumes high-quality, manually curated input
- Taxonomic Coverage: Maintains representation across taxonomic groups
- Version Tracking: Output sets are tied to specific Silva releases
Comparison with SSU Pipeline
This pipeline parallels makeCoveringSetSsu.sh with key differences:
Similarities
- Same kmer length (k=31) for consistent resolution
- Identical optimization parameters (maxpasses, maxkpp)
- Same shredding strategy when enabled
- Equivalent output format and structure
Differences
- Target Sequences: LSU (23S/28S) vs SSU (16S/18S) rRNA
- Memory Allocation: 2GB vs 8GB (LSU typically smaller dataset)
- Iteration Limits: 40,000 vs 200,000 passes (different convergence expectations)
- Output Naming: lsu_* vs ssu_* prefixes for file organization
Quality Control
Validation Steps
To verify the covering set quality:
# Test coverage on original sequences
bbduk.sh in=lsu_deduped100pct.fa.gz ref=lsu_shred_covering_31mers.fa k=31 stats=coverage_test.txt
# Check for uncovered sequences (should be zero)
bbduk.sh in=lsu_deduped100pct.fa.gz ref=lsu_shred_covering_31mers.fa k=31 outm=covered.fa outu=uncovered.fa
# Verify file size and kmer count
stats.sh in=lsu_shred_covering_31mers.fa
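Specificity can also be spot-checked against a negative control of non-LSU sequences (for example SSU rRNA or mRNA). The control filename below is a placeholder for whatever negative set is available; only a small fraction of its reads should match.
# Optional specificity check; non_lsu_control.fa is a hypothetical negative control you supply
bbduk.sh in=non_lsu_control.fa ref=lsu_shred_covering_31mers.fa k=31 stats=specificity_test.txt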
Expected Results
- Coverage: 100% of input LSU sequences should match at least one kmer
- Efficiency: Covering set should be significantly smaller than exhaustive kmer enumeration
- Specificity: Minimal false positives when tested against non-LSU sequences
- Completeness: No legitimate LSU sequences should be uncovered
Troubleshooting
Common Issues
- Memory Errors: Increase -Xmx allocation if processing fails
- Long Runtime: High maxpasses may cause extended execution time
- Large Output: Diverse input sequences may produce large covering sets
- Disk Space: Shredding can generate large intermediate files
Performance Tuning
- Memory: Increase -Xmx2g to -Xmx8g for large datasets
- Iterations: Reduce maxpasses=40000 for faster completion
- Preprocessing: Enable commented stages for enhanced optimization
- Monitoring: Use nohup.out to track optimization progress (an illustrative tuned command follows)
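As a concrete illustration of these adjustments, the active Stage 5 command could be rerun with a larger heap and a smaller pass budget. The values below are examples, not recommendations.
# Illustrative tuned run: more memory, fewer optimization passes (example values only)
nohup time kmerfilterset.sh in=shreds.fa.gz initial=lsu_shred_covering_31mers_temp.fa k=31 rcomp=f out=lsu_shred_covering_31mers.fa maxkpp=1 maxpasses=10000 fastawrap=99999 -Xmx8g ow &
tail -f nohup.out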
Integration with Other Tools
Downstream Applications
The generated covering set can be used with various BBTools:
- BBDuk: Filter or classify sequences based on LSU matches
- BBMap: Reference-guided mapping with LSU specificity
- BBSketch: Taxonomic classification enhanced with LSU markers
- Seal: Sequence extraction based on LSU kmer presence
Pipeline Integration
# Example: LSU contamination removal pipeline
bbduk.sh in=mixed_sequences.fa ref=lsu_shred_covering_31mers.fa k=31 outm=lsu_matches.fa outu=non_lsu.fa
# Example: LSU identification in metagenomes
bbduk.sh in=metagenomic_reads.fq ref=lsu_shred_covering_31mers.fa k=31 stats=lsu_classification.txt
Related Tools
- kmerfilterset.sh - Core tool for generating covering kmer sets
- reformat.sh - Sequence padding and formatting for pipeline preparation
- shred.sh - Sequence fragmentation for enhanced kmer sampling
- makeCoveringSetSsu.sh - Parallel pipeline for Small Subunit (SSU) rRNA
- filtersilva.sh - Silva database filtering and preparation
- reducesilva.sh - Silva database size reduction and optimization
Support
For questions and support:
- Email: bbushnell@lbl.gov
- Documentation: bbmap.org
- Silva Database: Silva Project (https://www.arb-silva.de)