makeRiboKmers.sh
Pipeline for building optimized ribosomal RNA kmer reference sets from the Silva database. Uses iterative refinement, synthetic read generation, and sensitivity testing to produce kmer sets at multiple depth thresholds for ribosomal RNA detection and classification.
Purpose
This pipeline creates optimized kmer sets for ribosomal RNA detection by processing the Silva database through multiple stages of deduplication, entropy masking, synthetic read generation, and iterative kmer set refinement. The resulting kmer sets offer different sensitivity levels for applications requiring ribosomal RNA classification.
Prerequisites
- Silva Database: Linked copy of the deduplicated Silva (LSU and SSU) database at both_deduped_sorted.fa.gz
- High Memory System: Requires up to 31GB RAM for processing
- Significant Disk Space: Multiple intermediate files and final kmer sets
- Processing Time: Extended runtime due to iterative refinement cycles
Pipeline Stages
Stage 1: Initial Sequence Processing
# Link Silva database
ln -s /global/projectb/sandbox/gaag/bbtools/silva/latest/both_deduped_sorted.fa.gz .
# Aggressive deduplication (up to 5 substitutions, edit distance 2)
dedupe.sh in=both_deduped_sorted.fa.gz out=dd2_e2_s5.fa.gz s=5 e=2 ordered zl=9 fastawrap=4000
# Conservative deduplication (up to 2 substitutions, 2 passes)
clumpify.sh in=both_deduped_sorted.fa.gz out=clumped_s2.fa.gz zl=9 s=2 dedupe fastawrap=4000 ow passes=2
Stage 2: Entropy Masking
Masks low-complexity regions to improve kmer set quality:
# Mask low-entropy regions in both datasets
bbduk.sh -Xmx1g in=dd2_e2_s5.fa.gz out=dd2_e2_s5_masked.fa.gz zl=9 entropy=0.6 entropyk=4 entropywindow=24 maskentropy ordered ow qtrim=rl trimq=1 fastawrap=4000
bbduk.sh -Xmx1g in=clumped_s2.fa.gz out=clumped_s2_masked.fa.gz zl=9 entropy=0.6 entropyk=4 entropywindow=24 maskentropy ordered ow qtrim=rl trimq=1 fastawrap=4000
Entropy Parameters:
- entropy=0.6 - Minimum entropy threshold
- entropyk=4 - Kmer size for entropy calculation
- entropywindow=24 - Window size for entropy assessment
Stage 3: Synthetic Read Generation
Creates synthetic reads for sensitivity testing:
# Generate 300M synthetic reads of length 100
randomreads.sh -Xmx31g adderrors=f ref=clumped_s2_masked.fa.gz reads=300m out=synth_s2.fa.gz len=100 zl=6 fastawrap=4000 illuminanames
randomreads.sh -Xmx8g adderrors=f ref=dd2_e2_s5_masked.fa.gz reads=300m out=synth_e2_s5.fa.gz len=100 zl=6 fastawrap=4000 illuminanames
Stage 4: Kmer Set Generation at Multiple Depths
Creates baseline kmer sets at different depth thresholds (1000, 500, 200, 100, 50, 40, 30, 20, 10, 8, 5, 3, 2, 1):
# Example for depth 1000 (pattern repeats for all depths)
kcompress.sh -Xmx31g ow zl=9 pigz=16 min=1000 in=dd2_e2_s5_masked.fa.gz out=stdout.fa | clumpify.sh -Xmx16g in=stdin.fa k=16 reorder groups=1 fastawrap=4000 ow zl=9 pigz=16 out=riboKmers1000A.fa.gz
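Since the same kcompress/clumpify pattern repeats at every depth threshold, the per-depth commands can be generated with a loop. This is a hypothetical dry-run sketch, not part of the original script: it only echoes each command (remove the echo wrapper to execute them):

```shell
#!/bin/sh
# Depth thresholds used by the pipeline, from most to least stringent.
DEPTHS="1000 500 200 100 50 40 30 20 10 8 5 3 2 1"

for depth in $DEPTHS; do
  # Dry run: echo each per-depth command instead of executing it.
  echo "kcompress.sh -Xmx31g ow zl=9 pigz=16 min=$depth in=dd2_e2_s5_masked.fa.gz out=stdout.fa" \
       "| clumpify.sh -Xmx16g in=stdin.fa k=16 reorder groups=1 fastawrap=4000 ow zl=9 pigz=16 out=riboKmers${depth}A.fa.gz"
done
```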
Stage 5: Sensitivity Testing
Tests each kmer set against synthetic reads to find missed sequences:
# Find reads missed by each kmer set
bbduk.sh -Xmx8g in=synth_e2_s5_clumped_s1.fa.gz ref=riboKmers1000A.fa.gz out=read_misses1000A.fa.gz zl=6 k=31 mm=f ordered fastawrap=4000 ow
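The miss rate at a given depth can be estimated by counting records in the miss file against the synthetic input. A hypothetical helper using only standard tools (count_reads is not part of BBTools):

```shell
#!/bin/sh
# Hypothetical helper (not part of BBTools): count FASTA records in a
# gzipped file by counting header lines.
count_reads() {
  gzip -cd "$1" | grep -c '^>'
}

# Usage after Stage 5, with file names from the pipeline:
#   missed=$(count_reads read_misses1000A.fa.gz)
#   total=$(count_reads synth_e2_s5_clumped_s1.fa.gz)
#   echo "depth 1000 miss rate: $missed / $total"
```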
Stage 6: Iterative Refinement
Iteratively adds kmers from missed reads to improve sensitivity:
# Add kmers from missed reads (B cycle)
kcompress.sh -Xmx31g ow min=2000 in=read_misses1000A.fa.gz out=riboKmers1000B.fa.gz fastawrap=4000 ow zl=9 pigz=16
# Test combined set and find remaining misses
bbduk.sh -Xmx8g in=read_misses1000A.fa.gz ref=riboKmers1000A.fa.gz,riboKmers1000B.fa.gz out=read_misses1000B.fa.gz zl=6 k=31 mm=f ordered fastawrap=4000 ow
# Continue through cycles C, D, E...
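The cycles beyond B follow the same add-then-retest pattern: each new kmer set is built from the previous cycle's misses, and retesting uses all sets built so far. A hypothetical dry-run loop for depth 1000 (the actual script writes each cycle out explicitly, and the min= threshold may differ per cycle):

```shell
#!/bin/sh
# Hypothetical dry-run loop for the depth-1000 refinement cycles.
# prev = miss file from the previous cycle
# refs = comma-separated list of all kmer sets built so far
prev="read_misses1000A.fa.gz"
refs="riboKmers1000A.fa.gz"
for cycle in B C D E; do
  # Build a new kmer set from the reads the current reference missed.
  echo "kcompress.sh -Xmx31g ow min=2000 in=$prev out=riboKmers1000${cycle}.fa.gz fastawrap=4000 zl=9 pigz=16"
  refs="$refs,riboKmers1000${cycle}.fa.gz"
  # Retest against the combined reference to find the remaining misses.
  echo "bbduk.sh -Xmx8g in=$prev ref=$refs out=read_misses1000${cycle}.fa.gz zl=6 k=31 mm=f ordered fastawrap=4000 ow"
  prev="read_misses1000${cycle}.fa.gz"
done
```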
Stage 7: Final Processing
Merges iterative kmer sets and creates final optimized references:
# Merge all cycles for each depth
kcompress.sh -Xmx31g ow in=riboKmers1000A.fa.gz,riboKmers1000B.fa.gz,riboKmers1000C.fa.gz,riboKmers1000D.fa.gz,riboKmers1000E.fa.gz out=riboKmers1000merged.fa.gz fastawrap=4000 ow zl=9 pigz=16
# Final clustering and optimization
clumpify.sh k=16 in=riboKmers1000merged.fa.gz out=riboKmers1000clumped.fa.gz g=1 zl=9 fastawrap=4000 reorder rcomp ow
# Create fused sequences for maximum efficiency
fuse.sh -Xmx1g ow in=riboKmers1000clumped.fa.gz out=riboKmers1000fused.fa.gz fastawrap=8000 ow zl=11 pigz=32 maxlen=4000 npad=1
Output Files
The pipeline generates multiple kmer sets with different sensitivity levels:
Primary Kmer Sets (per depth level)
- riboKmers[depth]A.fa.gz - Initial kmer set
- riboKmers[depth]B.fa.gz - First refinement cycle
- riboKmers[depth]C.fa.gz - Second refinement cycle
- riboKmers[depth]D.fa.gz - Third refinement cycle
- riboKmers[depth]E.fa.gz - Fourth refinement cycle
Final Optimized Sets
- riboKmers[depth]merged.fa.gz - Combined cycles
- riboKmers[depth]clumped.fa.gz - Clustered kmers
- riboKmers[depth]fused.fa.gz - Final optimized set
Sensitivity Analysis Files
- read_misses[depth][cycle].fa.gz - Reads missed by each kmer set
- synth_*.fa.gz - Synthetic test reads
Depth Levels and Applications
Sensitivity vs Specificity Trade-offs
- High Depth (1000, 500): High specificity, lower sensitivity - for precise classification
- Medium Depth (200, 100, 50): Balanced sensitivity/specificity - general purpose
- Low Depth (30, 20, 10): High sensitivity, lower specificity - broad detection
- Very Low Depth (5, 3, 2, 1): Maximum sensitivity - comprehensive screening
Usage Example
# Run the complete pipeline
./makeRiboKmers.sh
# Use generated kmer set for ribosomal RNA detection
bbduk.sh in=sequences.fq ref=riboKmers100fused.fa.gz k=31 stats=ribo_stats.txt outm=ribosomal.fq
# Compare sensitivities
bbduk.sh in=test.fq ref=riboKmers1000fused.fa.gz k=31 stats=high_spec.txt
bbduk.sh in=test.fq ref=riboKmers10fused.fa.gz k=31 stats=high_sens.txt
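To profile every depth level against one test set, the comparison can be scripted. A hypothetical dry-run sweep (commands are collected and printed rather than executed; assumes the fused sets are in the working directory):

```shell
#!/bin/sh
# Hypothetical sweep over every fused kmer set (dry run: commands are
# collected into $cmds and printed, not executed).
cmds=""
for depth in 1000 500 200 100 50 40 30 20 10 8 5 3 2 1; do
  cmds="$cmds
bbduk.sh in=test.fq ref=riboKmers${depth}fused.fa.gz k=31 stats=stats_${depth}.txt"
done
printf '%s\n' "$cmds"
```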
Performance Characteristics
- Memory Usage: Up to 31GB RAM during kmer compression phases
- Runtime: Several hours to complete all depth levels and refinement cycles
- Disk Space: Multiple GB for intermediate and final files
- CPU Intensive: Benefits from high core count systems
- I/O Intensive: Significant disk read/write operations
Algorithm Details
- Iterative Refinement: Each cycle identifies missed sequences and adds targeted kmers
- Entropy Masking: Removes low-complexity regions that could cause false positives
- Synthetic Testing: Uses artificial reads to objectively measure sensitivity
- Multi-depth Strategy: Different applications require different sensitivity/specificity balances
- Sequence Fusing: Final step combines kmers into longer sequences for efficiency
Related Tools
- dedupe.sh - Sequence deduplication
- clumpify.sh - Sequence clustering and deduplication
- bbduk.sh - Kmer filtering and quality control
- kcompress.sh - Kmer set generation and compression
- randomreads.sh - Synthetic read generation
- fuse.sh - Sequence fusion for optimization