makeCoveringSetLsu.sh
Generates a minimal covering set of 31-mers for Large Subunit (LSU) ribosomal RNA sequences from the Silva database. The result is an optimized kmer set in which every LSU sequence contains at least one kmer from the set, enabling efficient sequence classification and contamination filtering. The pipeline uses shredding and iterative optimization to minimize the final kmer set size.
Purpose
This pipeline creates a highly optimized kmer filter set specifically for Large Subunit (23S/28S) ribosomal RNA sequences. The covering set property ensures that any authentic LSU rRNA sequence will match at least one 31-mer in the set, while sequences lacking matches are likely non-ribosomal contaminants or different RNA types. This is essential for Silva database processing and ribosomal RNA classification workflows.
Key Applications
- Contamination Detection: Identify non-ribosomal sequences in rRNA datasets
- Quality Control: Filter spurious sequences from Silva database processing
- Classification: Distinguish LSU rRNA from other sequence types
- Database Curation: Support Silva database preparation and validation
Prerequisites
- Deduplicated LSU rRNA sequence file: lsu_deduped100pct.fa.gz
- High-quality, curated LSU sequences (typically from Silva database)
- Sufficient memory (minimum 2GB recommended, 8GB+ for large datasets)
- Adequate disk space for intermediate files and shredded sequences
Pipeline Stages
Stage 1: Initial Covering Set Generation
Creates a basic covering set directly from the deduplicated LSU sequences:
kmerfilterset.sh in=lsu_deduped100pct.fa.gz k=31 rcomp=f out=lsu_covering_31mers.fa maxkpp=1
Parameters:
- k=31 - Uses 31-mers for optimal specificity and sensitivity balance
- rcomp=f - Disables reverse complement to maintain strand specificity
- maxkpp=1 - Limits to maximum 1 kmer per sequence position
Note: This stage is commented out in the final pipeline, indicating it may be a preparatory step or alternative approach.
Stage 2: Sequence Padding
Adds padding to sequence ends to ensure complete coverage during shredding:
reformat.sh in=lsu_deduped100pct.fa.gz out=lsu_deduped100pct_padded.fa.gz padleft=31 padright=31 ow
Parameters:
- padleft=31 - Adds 31 bases of padding to sequence start
- padright=31 - Adds 31 bases of padding to sequence end
- ow - Overwrites existing output file
Note: This stage is also commented out, suggesting it's optional or used in alternative workflows.
Stage 3: Sequence Shredding
Fragments padded sequences into overlapping reads to increase kmer diversity:
shred.sh in=lsu_deduped100pct_padded.fa.gz length=150 minlength=62 overlap=30 out=shreds.fa.gz ow
Parameters:
- length=150 - Creates fragments of 150 base pairs
- minlength=62 - Minimum fragment length (must be ≥ 2×k for 31-mers)
- overlap=30 - 30bp overlap between consecutive fragments
Note: This stage is commented out but shows the shredding strategy used.
Stage 4: Preliminary Shred Covering Set
Generates covering set from shredded sequences (preparation step):
kmerfilterset.sh in=shreds.fa.gz initial=lsu_covering_31mers.fa k=31 rcomp=f out=lsu_shred_covering_31mers.fa maxkpp=1
Note: This stage is commented out, indicating it's preparatory for the final optimization.
Stage 5: Optimized Covering Set Generation (Active)
The core pipeline stage that generates the final optimized covering set:
nohup time kmerfilterset.sh in=shreds.fa.gz initial=lsu_shred_covering_31mers_temp.fa k=31 rcomp=f out=lsu_shred_covering_31mers.fa maxkpp=1 maxpasses=40000 fastawrap=99999 -Xmx2g ow
Active Parameters:
- in=shreds.fa.gz - Input shredded LSU sequences
- initial=lsu_shred_covering_31mers_temp.fa - Starting kmer set for optimization
- k=31 - 31-mer length for optimal rRNA classification
- rcomp=f - Strand-specific kmers (no reverse complement)
- maxkpp=1 - One kmer maximum per sequence position
- maxpasses=40000 - Up to 40,000 optimization iterations
- fastawrap=99999 - Long line wrapping for compact output
- -Xmx2g - 2GB memory allocation
- ow - Overwrite existing output
Algorithm Details
Covering Set Generation Strategy
The pipeline employs a sophisticated multi-stage approach to minimize the final kmer set while maintaining complete sequence coverage:
Shredding Strategy
- Fragment Generation: 150bp fragments with 30bp overlap create dense kmer sampling
- Coverage Enhancement: Multiple fragments per region increase kmer representation
- Boundary Handling: Padding ensures sequence ends are properly represented
- Minimum Length: 62bp minimum ensures at least 32 unique 31-mers per fragment
Iterative Optimization
- Greedy Selection: Each pass identifies most frequent uncovered kmers
- Coverage Tracking: Removes sequences matched by selected kmers
- Convergence: Continues until all sequences are covered or maxpasses is reached (a quick sanity check is sketched after this list)
- Memory Efficiency: 2GB allocation handles large Silva LSU datasets
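To confirm the covering and convergence claims once the active stage has finished, the shreds can be screened back against the finished set. This is an optional check that reuses the pipeline's own filenames; rcomp=f is passed so that matching stays strand-specific, mirroring the generation parameters.
# Optional sanity check after Stage 5 completes: any shred lacking a selected kmer lands in missed_shreds.fa
bbduk.sh in=shreds.fa.gz ref=lsu_shred_covering_31mers.fa k=31 rcomp=f outu=missed_shreds.fa ow
grep -c ">" missed_shreds.fa   # expect 0 if the run converged before hitting maxpasses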
31-mer Selection Rationale
The choice of 31-mers for LSU rRNA classification balances specificity and coverage:
- Specificity: 31bp provides 4^31 possible sequences (about 4.6×10^18), ensuring high discrimination
- Conservation: LSU rRNA has conserved regions that maintain 31-mer matches across species
- Variable Regions: Sufficient length to span variable regions without over-fragmentation
- Memory Efficiency: Manageable memory footprint for large-scale processing
Covering Set Properties
The resulting kmer set has these mathematical properties:
- Covering Property: Every input LSU sequence contains ≥1 kmer from the set
- Minimality: Iterative optimization reduces redundancy while maintaining coverage
- Strand Specificity: No reverse complement ensures directional sensitivity
- Position Independence: maxkpp=1 prevents positional bias in kmer selection
Input Requirements
Required Input File
lsu_deduped100pct.fa.gz: Deduplicated Large Subunit rRNA sequences
- Must be in FASTA format (gzipped acceptable)
- Should contain high-quality, curated LSU rRNA sequences
- Typically sourced from Silva database after deduplication
- 100% identity clustering removes redundant sequences
- Represents the full diversity of LSU rRNA sequences to be covered
File Preparation
Before running this pipeline, ensure you have:
- Downloaded Silva LSU rRNA database
- Performed 100% identity deduplication to remove exact duplicates
- Quality filtered to remove partial or low-quality sequences
- Converted to compressed FASTA format for efficiency (one possible command sequence is sketched below)
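One possible way to carry out these steps with BBTools is sketched below. The Silva filename and the reliance on dedupe.sh defaults are assumptions; adjust them to your Silva release and curation standards.
# Hypothetical preparation sketch; SILVA_LSURef.fasta.gz stands in for the Silva LSU release you downloaded
dedupe.sh in=SILVA_LSURef.fasta.gz out=lsu_deduped100pct.fa.gz ow
# Quick summary of the deduplicated file before starting the pipeline
stats.sh in=lsu_deduped100pct.fa.gz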
Output Files
Primary Output
lsu_shred_covering_31mers.fa: Final optimized covering set
- Contains minimal set of 31-mers that cover all input LSU sequences
- FASTA format with each kmer as a separate sequence
- Optimized through iterative refinement for minimal size
- Ready for use with BBDuk, BBMap, or other BBTools for classification
Intermediate Files (from commented stages)
When stages 1-4 are active, additional intermediate files are generated:
- lsu_covering_31mers.fa - Initial covering set from full sequences
- lsu_deduped100pct_padded.fa.gz - Padded sequences for edge coverage
- shreds.fa.gz - Shredded sequence fragments for optimization
- lsu_shred_covering_31mers_temp.fa - Temporary optimization input
Usage Example
# Prepare Silva LSU data (prerequisite steps)
# wget Silva LSU database and perform 100% deduplication
# Result should be: lsu_deduped100pct.fa.gz
# Run the pipeline (execute in directory containing input file)
cd /path/to/silva/data
nohup /path/to/bbtools/pipelines/silva/makeCoveringSetLsu.sh &
# Monitor progress
tail -f nohup.out
# Use the generated covering set for classification
bbduk.sh in=unknown_sequences.fa ref=lsu_shred_covering_31mers.fa k=31 stats=lsu_matches.txt
Pipeline Configuration
Memory Management
The pipeline is configured for efficient memory usage:
- Memory Allocation: 2GB (-Xmx2g) for the optimization stage
- Processing Mode: Runs with nohup for long-running execution
- Timing: Includes time measurement for performance monitoring
- Background Execution: Designed for unattended operation
Optimization Parameters
- maxpasses=40000: Extensive iteration limit for thorough optimization
- fastawrap=99999: Minimal line wrapping for compact kmer storage
- maxkpp=1: Prevents kmer redundancy within sequences
- ow: Automatic overwrite of existing output files
Performance Characteristics
Processing Time
- Initial Pass: Minutes to hours depending on LSU dataset size
- Optimization: May run for several hours to achieve minimal set
- Convergence: Algorithm stops when no further reduction is possible
- Scaling: Time increases with sequence diversity and dataset size
Memory Requirements
- Base Requirement: 2GB for the optimization process
- Input Buffering: Additional memory for compressed file processing
- Kmer Storage: Memory scales with intermediate kmer set size
- Temporary Data: Space for tracking sequence coverage during optimization
Output Size Expectations
- Initial Set: May contain hundreds of thousands of kmers
- Optimized Set: Typically reduces to 10-50% of initial size
- Final Size: Depends on LSU sequence diversity in input dataset
- Coverage: Maintains 100% coverage of input sequences despite reduction
Implementation Details
Shredding Strategy (Commented Stages)
When enabled, the shredding approach enhances kmer set optimization:
- Padding Strategy: 31bp padding on both ends ensures edge kmers are captured
- Fragment Size: 150bp fragments provide 120 unique 31-mers per fragment
- Overlap Strategy: 30bp overlap ensures continuous coverage across fragment boundaries
- Minimum Length: 62bp minimum allows at least 32 unique 31-mers (the arithmetic is shown below)
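These counts follow from the standard kmer arithmetic: a fragment of length L contains L - k + 1 kmers. The shell arithmetic below shows where 120 and 32 come from.
# kmers per fragment = length - k + 1
echo $(( 150 - 31 + 1 ))   # 120 31-mers in a full-length 150bp shred
echo $(( 62 - 31 + 1 ))    # 32 31-mers in a minimum-length 62bp shred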
Kmer Selection Algorithm
The optimization process uses a sophisticated greedy approach:
- Frequency Analysis: Identifies most common uncovered kmers in each pass
- Coverage Tracking: Maintains set of sequences covered by selected kmers
- Iterative Refinement: Removes covered sequences to focus on gaps
- Convergence Detection: Stops when all sequences are covered or no progress is made
Silva Database Integration
This pipeline is specifically designed for Silva LSU processing:
- Database Compatibility: Works with Silva LSU reference sequences
- Quality Standards: Assumes high-quality, manually curated input
- Taxonomic Coverage: Maintains representation across taxonomic groups
- Version Tracking: Output sets are tied to specific Silva releases
Comparison with SSU Pipeline
This pipeline parallels makeCoveringSetSsu.sh with key differences:
Similarities
- Same kmer length (k=31) for consistent resolution
- Identical optimization parameters (maxpasses, maxkpp)
- Same shredding strategy when enabled
- Equivalent output format and structure
Differences
- Target Sequences: LSU (23S/28S) vs SSU (16S/18S) rRNA
- Memory Allocation: 2GB vs 8GB (LSU typically smaller dataset)
- Iteration Limits: 40,000 vs 200,000 passes (different convergence expectations)
- Output Naming: lsu_* vs ssu_* prefixes for file organization
Quality Control
Validation Steps
To verify the covering set quality:
# Test coverage on original sequences
bbduk.sh in=lsu_deduped100pct.fa.gz ref=lsu_shred_covering_31mers.fa k=31 stats=coverage_test.txt
# Check for uncovered sequences (should be zero)
bbduk.sh in=lsu_deduped100pct.fa.gz ref=lsu_shred_covering_31mers.fa k=31 outm=covered.fa outu=uncovered.fa
# Verify file size and kmer count
stats.sh in=lsu_shred_covering_31mers.fa
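Specificity can also be spot-checked against a negative control of non-LSU sequences (for example SSU rRNA or mRNA). The control filename below is a placeholder for whatever negative set is available; only a small fraction of its reads should match.
# Optional specificity check; non_lsu_control.fa is a hypothetical negative control you supply
bbduk.sh in=non_lsu_control.fa ref=lsu_shred_covering_31mers.fa k=31 stats=specificity_test.txt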
Expected Results
- Coverage: 100% of input LSU sequences should match at least one kmer
- Efficiency: Covering set should be significantly smaller than exhaustive kmer enumeration
- Specificity: Minimal false positives when tested against non-LSU sequences
- Completeness: No legitimate LSU sequences should be uncovered
Troubleshooting
Common Issues
- Memory Errors: Increase -Xmx allocation if processing fails
- Long Runtime: High maxpasses may cause extended execution time
- Large Output: Diverse input sequences may produce large covering sets
- Disk Space: Shredding can generate large intermediate files
Performance Tuning
- Memory: Increase -Xmx2g to -Xmx8g for large datasets
- Iterations: Reduce maxpasses=40000 for faster completion
- Preprocessing: Enable commented stages for enhanced optimization
- Monitoring: Use nohup.out to track optimization progress (an illustrative tuned command follows)
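As a concrete illustration of these adjustments, the active Stage 5 command could be rerun with a larger heap and a smaller pass budget. The values below are examples, not recommendations.
# Illustrative tuned run: more memory, fewer optimization passes (example values only)
nohup time kmerfilterset.sh in=shreds.fa.gz initial=lsu_shred_covering_31mers_temp.fa k=31 rcomp=f out=lsu_shred_covering_31mers.fa maxkpp=1 maxpasses=10000 fastawrap=99999 -Xmx8g ow &
tail -f nohup.out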
Integration with Other Tools
Downstream Applications
The generated covering set can be used with various BBTools:
- BBDuk: Filter or classify sequences based on LSU matches
- BBMap: Reference-guided mapping with LSU specificity
- BBSketch: Taxonomic classification enhanced with LSU markers
- Seal: Sequence extraction based on LSU kmer presence
Pipeline Integration
# Example: LSU contamination removal pipeline
bbduk.sh in=mixed_sequences.fa ref=lsu_shred_covering_31mers.fa k=31 outm=lsu_matches.fa outu=non_lsu.fa
# Example: LSU identification in metagenomes
bbduk.sh in=metagenomic_reads.fq ref=lsu_shred_covering_31mers.fa k=31 stats=lsu_classification.txt
Related Tools
- kmerfilterset.sh - Core tool for generating covering kmer sets
- reformat.sh - Sequence padding and formatting for pipeline preparation
- shred.sh - Sequence fragmentation for enhanced kmer sampling
- makeCoveringSetSsu.sh - Parallel pipeline for Small Subunit (SSU) rRNA
- filtersilva.sh - Silva database filtering and preparation
- reducesilva.sh - Silva database size reduction and optimization
Support
For questions and support:
- Email: bbushnell@lbl.gov
- Documentation: bbmap.org
- Silva Database: Silva Project (https://www.arb-silva.de)