make15mers.sh
Specialized Silva database processing pipeline for creating optimized 15-mer k-mer sets that can distinguish between ribosomal RNA subunits (23S/LSU and 16S/SSU). Uses differential k-mer frequency analysis and set subtraction to create subunit-specific k-mer references.
Overview
This pipeline processes deduplicated Silva ribosomal RNA databases to create highly specific 15-mer k-mer sets that can distinguish between large subunit (23S/LSU) and small subunit (16S/SSU) ribosomal RNAs. The approach uses differential minimum count thresholds and set subtraction to isolate subunit-specific k-mers while removing shared sequences that would cause classification ambiguity.
Prerequisites
Required Input Files
- lsu_deduped100pct.fa.gz - 100% deduplicated large subunit (23S) Silva sequences
- ssu_deduped100pct.fa.gz - 100% deduplicated small subunit (16S) Silva sequences
System Requirements
- BBTools suite with kmercountexact, bbduk, and kcompress
- Sufficient memory for k-mer counting on large Silva databases
- Adequate disk space for intermediate k-mer sets and final compressed outputs
Pipeline Strategy
The pipeline uses a sophisticated differential k-mer analysis approach:
Core Principle
- Generate initial k-mer sets from each subunit using high minimum count thresholds
- Create complementary sets from the opposite subunit using lower thresholds
- Subtract shared k-mers to isolate subunit-specific sequences
- Validate coverage against original sequences to ensure specificity
- Compress final sets for optimal storage and query performance
Threshold Strategy
- 23S Processing: Uses mincount=5000 for initial set, then subtracts SSU k-mers with mincount=10
- 16S Processing: Uses mincount=5000 for initial set, then subtracts LSU k-mers with mincount=8
- Asymmetric thresholds: Different subtraction thresholds (8 vs 10) account for natural abundance differences between subunits
Pipeline Stages
Stage 1: 23S/LSU K-mer Set Generation
1.1 Initial LSU K-mer Extraction
kmercountexact.sh in=lsu_deduped100pct.fa.gz mincount=5000 out=lsu_mincount5000_k15.fa k=15 rcomp=f ow
Extracts 15-mers from LSU sequences that appear at least 5000 times. The high threshold ensures only highly conserved LSU-specific regions are captured. Note: rcomp=f
disables reverse-complement processing to maintain directional specificity.
1.2 Initial Coverage Validation
bbduk.sh in=lsu_deduped100pct.fa.gz ref=lsu_mincount5000_k15.fa k=15 mm=f rcomp=f
Tests how well the initial k-mer set covers the original LSU sequences. The mm=f
flag requires exact matches.
1.3 SSU Contamination K-mer Set
kmercountexact.sh in=ssu_deduped100pct.fa.gz mincount=10 out=ssu_mincount10_k15.fa k=15 rcomp=f ow
Creates a k-mer set from SSU sequences using a much lower threshold (mincount=10) to capture k-mers that might cause cross-contamination with LSU classification.
1.4 Set Subtraction for LSU Specificity
bbduk.sh ref=ssu_mincount10_k15.fa in=lsu_mincount5000_k15.fa k=15 mm=f outm=shared_10_k15.fa outu=lsuonly_10_k15.fa ow
Critical step: removes LSU k-mers that also appear in SSU sequences. This creates a truly LSU-specific k-mer set by:
outm=shared_10_k15.fa
- K-mers found in both subunits (discarded)outu=lsuonly_10_k15.fa
- K-mers specific to LSU (kept)
1.5 Specificity Validation
bbduk.sh in=lsu_deduped100pct.fa.gz ref=lsuonly_10_k15.fa k=15 mm=f rcomp=f
Validates that the LSU-specific k-mer set still provides good coverage of original LSU sequences after subtraction.
1.6 Final LSU Compression
kcompress.sh in=lsuonly_10_k15.fa k=15 out=lsu_15mers.fa.gz ow zl=11 rcomp=f
Compresses the LSU-specific k-mer set with maximum compression level (zl=11) for efficient storage and faster loading.
1.7 Final Coverage Test
bbduk.sh in=lsu_deduped100pct.fa.gz ref=lsu_15mers.fa.gz k=15 mm=f rcomp=f
Final validation that the compressed LSU k-mer set maintains good coverage of the original sequences.
Stage 2: 16S/SSU K-mer Set Generation
2.1 Initial SSU K-mer Extraction
kmercountexact.sh in=ssu_deduped100pct.fa.gz mincount=5000 out=ssu_mincount5000_k15.fa k=15 rcomp=f ow
Extracts 15-mers from SSU sequences using the same high threshold (5000) as LSU processing to ensure only highly conserved SSU regions are captured.
2.2 Initial Coverage Validation
bbduk.sh in=ssu_deduped100pct.fa.gz ref=ssu_mincount5000_k15.fa k=15 mm=f rcomp=f
Validates coverage of the initial SSU k-mer set against original sequences.
2.3 LSU Contamination K-mer Set
kmercountexact.sh in=lsu_deduped100pct.fa.gz mincount=8 out=lsu_mincount8_k15.fa k=15 rcomp=f ow
Creates LSU k-mers using an even lower threshold (mincount=8) than used for SSU contamination detection, reflecting the different abundance patterns between subunits.
2.4 Set Subtraction for SSU Specificity
bbduk.sh ref=lsu_mincount8_k15.fa in=ssu_mincount5000_k15.fa k=15 mm=f outm=shared_8_k15.fa outu=ssuonly_8_k15.fa ow
Removes SSU k-mers that also appear in LSU sequences, creating a truly SSU-specific k-mer set:
outm=shared_8_k15.fa
- K-mers found in both subunits (discarded)outu=ssuonly_8_k15.fa
- K-mers specific to SSU (kept)
2.5 Specificity Validation
bbduk.sh in=ssu_deduped100pct.fa.gz ref=ssuonly_8_k15.fa k=15 mm=f rcomp=f
Confirms that the SSU-specific k-mer set maintains good coverage after subtraction.
2.6 Final SSU Compression
kcompress.sh in=ssuonly_8_k15.fa k=15 out=ssu_15mers.fa.gz ow zl=11 rcomp=f
Compresses the SSU-specific k-mer set with maximum compression for efficient storage.
2.7 Final Coverage Test
bbduk.sh in=ssu_deduped100pct.fa.gz ref=ssu_15mers.fa.gz k=15 mm=f rcomp=f
Final validation of the compressed SSU k-mer set coverage.
Stage 3: Cross-Validation Testing
3.1 LSU vs SSU K-mer Test
bbduk.sh in=lsu_deduped100pct.fa.gz ref=ssu_15mers.fa.gz k=15 mm=f rcomp=f
Critical validation: tests LSU sequences against the SSU k-mer set. Ideally, this should show minimal matches, confirming the k-mer sets are truly subunit-specific.
3.2 SSU vs LSU K-mer Test
bbduk.sh in=ssu_deduped100pct.fa.gz ref=lsu_15mers.fa.gz k=15 mm=f rcomp=f
Reciprocal test: validates SSU sequences against the LSU k-mer set to confirm specificity in both directions.
Algorithm Details
Differential Frequency Strategy
The pipeline employs an asymmetric approach that accounts for the natural differences between ribosomal subunits:
- Primary threshold (5000): Captures highly conserved regions specific to each subunit
- Contamination thresholds (8-10): Different values reflect abundance patterns in Silva database
- Set subtraction: Eliminates cross-reactive k-mers that could cause misclassification
K-mer Length Optimization
15-mer length provides optimal balance for ribosomal RNA analysis:
- Specificity: Long enough to avoid random matches (4^15 = ~1 billion possible k-mers)
- Sensitivity: Short enough to tolerate sequence variations within subunit groups
- Performance: Efficient for real-time classification while maintaining precision
Compression Strategy
Final k-mer sets use aggressive compression (zl=11) to:
- Minimize storage requirements for large k-mer databases
- Reduce memory footprint during classification
- Accelerate loading times for repeated analyses
- Enable distribution of compact reference databases
Usage
# Ensure Silva databases are present in working directory
ls -la lsu_deduped100pct.fa.gz ssu_deduped100pct.fa.gz
# Run the complete pipeline
bash make15mers.sh
# Monitor progress (pipeline includes multiple validation steps)
# Final outputs will be lsu_15mers.fa.gz and ssu_15mers.fa.gz
Output Files
Primary Outputs
- lsu_15mers.fa.gz - Final 23S/LSU-specific k-mer set
- ssu_15mers.fa.gz - Final 16S/SSU-specific k-mer set
Intermediate Files
- lsu_mincount5000_k15.fa - Initial high-frequency LSU k-mers
- ssu_mincount5000_k15.fa - Initial high-frequency SSU k-mers
- ssu_mincount10_k15.fa - Low-threshold SSU k-mers for LSU subtraction
- lsu_mincount8_k15.fa - Low-threshold LSU k-mers for SSU subtraction
- shared_10_k15.fa - K-mers found in both LSU and SSU (removed from LSU set)
- shared_8_k15.fa - K-mers found in both SSU and LSU (removed from SSU set)
- lsuonly_10_k15.fa - LSU-specific k-mers before compression
- ssuonly_8_k15.fa - SSU-specific k-mers before compression
Applications
Ribosomal Subunit Classification
# Classify unknown ribosomal sequences
bbduk.sh in=unknown_ribo.fq ref=lsu_15mers.fa.gz k=15 mm=f outm=classified_as_23S.fq outu=not_23S.fq
bbduk.sh in=not_23S.fq ref=ssu_15mers.fa.gz k=15 mm=f outm=classified_as_16S.fq outu=unclassified.fq
Quality Control for Ribosomal Assemblies
# Validate ribosomal gene assemblies
bbduk.sh in=assembled_contigs.fa ref=lsu_15mers.fa.gz,ssu_15mers.fa.gz k=15 stats=ribo_classification.txt
Contamination Detection
# Check for ribosomal contamination in non-ribosomal datasets
bbduk.sh in=genomic_reads.fq ref=lsu_15mers.fa.gz,ssu_15mers.fa.gz k=15 stats=contamination_check.txt
Performance Characteristics
- Memory Usage: Varies with Silva database size; k-mer counting is memory-intensive
- Runtime: Extended processing due to multiple k-mer counting and validation steps
- Disk I/O: Intensive due to large intermediate files and compression operations
- Final File Sizes: Highly compressed outputs significantly smaller than intermediate files
- Query Performance: Final k-mer sets optimized for fast classification queries
Technical Parameters
K-mer Counting Parameters
- k=15: K-mer length optimized for ribosomal RNA specificity
- rcomp=f: Disables reverse-complement to maintain directional information
- ow: Overwrites existing output files
Filtering Parameters
- mm=f: Requires exact matches during validation and subtraction
- k=15: Consistent k-mer length throughout pipeline
Compression Parameters
- zl=11: Maximum compression level for final output
- k=15: K-mer length for compression optimization
Future Development
The script includes a placeholder for 5S ribosomal RNA processing:
#5S: TODO
This indicates planned expansion to include 5S ribosomal RNA subunit classification, which would complete the full ribosomal RNA classification system (5S, 16S, 23S).
Quality Validation
The pipeline includes comprehensive validation at every major step:
- Initial coverage testing: Verifies k-mer sets capture source sequences effectively
- Post-subtraction validation: Confirms specificity doesn't compromise sensitivity
- Cross-contamination testing: Validates that subunit-specific sets don't cross-react
- Final compressed validation: Ensures compression doesn't affect classification performance
Related Tools
- kmercountexact.sh - Exact k-mer frequency counting
- bbduk.sh - K-mer filtering, matching, and set operations
- kcompress.sh - K-mer set compression and optimization
- filtersilva.sh - Silva database cleaning and filtering
- reducesilva.sh - Silva database size reduction
- makeRiboKmers.sh - Alternative ribosomal k-mer generation pipeline
Notes and Considerations
- Threshold Selection: Different minimum count thresholds (5000 vs 8-10) reflect empirically determined optimal values
- Directional Processing: rcomp=f throughout maintains strand directionality important for ribosomal classification
- Exact Matching: mm=f ensures precise k-mer matching without mismatches
- Overwrite Safety: ow flags allow pipeline re-runs by overwriting intermediate files
- Silva Dependency: Designed specifically for Silva database format and organization
- Memory Scaling: Performance scales with Silva database size and available system memory