makeRiboKmers.sh

Script: makeRiboKmers.sh
Author: Brian Bushnell
Package: Pipeline Scripts

Sophisticated pipeline for creating optimized ribosomal RNA kmer reference sets from the Silva database. It uses iterative refinement, synthetic read generation, and sensitivity testing to produce kmer sets at multiple depth thresholds for ribosomal RNA detection and classification.

Purpose

This advanced pipeline creates highly optimized kmer sets for ribosomal RNA detection by processing the Silva database through multiple stages of deduplication, masking, synthetic read generation, and iterative kmer set refinement. The resulting kmer sets provide different sensitivity levels for various applications requiring ribosomal RNA classification.

Prerequisites

Pipeline Stages

Stage 1: Initial Sequence Processing

# Link Silva database
ln -s /global/projectb/sandbox/gaag/bbtools/silva/latest/both_deduped_sorted.fa.gz .

# Aggressive deduplication (up to 5 substitutions, 2 edits)
dedupe.sh in=both_deduped_sorted.fa.gz out=dd2_e2_s5.fa.gz s=5 e=2 ordered zl=9 fastawrap=4000

# Conservative deduplication (up to 2 substitutions, 2 passes)
clumpify.sh in=both_deduped_sorted.fa.gz out=clumped_s2.fa.gz zl=9 s=2 dedupe fastawrap=4000 ow passes=2

Stage 2: Entropy Masking

Masks low-complexity regions to improve kmer set quality:

# Mask low-entropy regions in both datasets
bbduk.sh -Xmx1g in=dd2_e2_s5.fa.gz out=dd2_e2_s5_masked.fa.gz zl=9 entropy=0.6 entropyk=4 entropywindow=24 maskentropy ordered ow qtrim=rl trimq=1 fastawrap=4000

bbduk.sh -Xmx1g in=clumped_s2.fa.gz out=clumped_s2_masked.fa.gz zl=9 entropy=0.6 entropyk=4 entropywindow=24 maskentropy ordered ow qtrim=rl trimq=1 fastawrap=4000

Entropy Parameters (as used in the commands above):

  • entropy=0.6 - entropy threshold below which regions are masked
  • entropyk=4 - kmer length used for the entropy calculation
  • entropywindow=24 - sliding window size (in bases) over which entropy is computed

Stage 3: Synthetic Read Generation

Creates synthetic reads for sensitivity testing:

# Generate 300M error-free synthetic reads of length 100 (adderrors=f)
randomreads.sh -Xmx31g adderrors=f ref=clumped_s2_masked.fa.gz reads=300m out=synth_s2.fa.gz len=100 zl=6 fastawrap=4000 illuminanames

randomreads.sh -Xmx8g adderrors=f ref=dd2_e2_s5_masked.fa.gz reads=300m out=synth_e2_s5.fa.gz len=100 zl=6 fastawrap=4000 illuminanames

Stage 4: Kmer Set Generation at Multiple Depths

Creates baseline kmer sets at different depth thresholds (1000, 500, 200, 100, 50, 40, 30, 20, 10, 8, 5, 3, 2, 1):

# Example for depth 1000 (pattern repeats for all depths)
kcompress.sh -Xmx31g ow zl=9 pigz=16 min=1000 in=dd2_e2_s5_masked.fa.gz out=stdout.fa | clumpify.sh -Xmx16g in=stdin.fa k=16 reorder groups=1 fastawrap=4000 ow zl=9 pigz=16 out=riboKmers1000A.fa.gz
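Rather than writing out fourteen near-identical command pairs, the per-depth generation can be driven by a loop. This is a hypothetical sketch, not part of the original script; the `run` helper echoes each command so the loop is a dry run.

```shell
# Hypothetical Stage 4 driver (dry run: 'run' echoes instead of executing).
run() { echo "$@"; }

for depth in 1000 500 200 100 50 40 30 20 10 8 5 3 2 1; do
  run "kcompress.sh -Xmx31g ow zl=9 pigz=16 min=${depth} in=dd2_e2_s5_masked.fa.gz out=stdout.fa \
    | clumpify.sh -Xmx16g in=stdin.fa k=16 reorder groups=1 fastawrap=4000 ow zl=9 pigz=16 out=riboKmers${depth}A.fa.gz"
done
```

Removing the `run` wrapper would execute the real pipeline stage for every depth.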

Stage 5: Sensitivity Testing

Tests each kmer set against synthetic reads to find missed sequences:

# Find reads missed by each kmer set
bbduk.sh -Xmx8g in=synth_e2_s5_clumped_s1.fa.gz ref=riboKmers1000A.fa.gz out=read_misses1000A.fa.gz zl=6 k=31 mm=f ordered fastawrap=4000 ow

Stage 6: Iterative Refinement

Iteratively adds kmers from missed reads to improve sensitivity:

# Add kmers from missed reads (B cycle)
kcompress.sh -Xmx31g ow min=2000 in=read_misses1000A.fa.gz out=riboKmers1000B.fa.gz fastawrap=4000 zl=9 pigz=16

# Test combined set and find remaining misses
bbduk.sh -Xmx8g in=read_misses1000A.fa.gz ref=riboKmers1000A.fa.gz,riboKmers1000B.fa.gz out=read_misses1000B.fa.gz zl=6 k=31 mm=f ordered fastawrap=4000 ow

# Continue through cycles C, D, E...
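The remaining cycles follow the same pattern: each cycle compresses kmers from the previous cycle's misses, appends the new set to the reference list, and re-tests. A hypothetical sketch for depth 1000 (not part of the original script; the `run` helper echoes each command as a dry run):

```shell
# Hypothetical refinement loop for cycles B..E at depth 1000 (dry run).
run() { echo "$@"; }

refs="riboKmers1000A.fa.gz"
misses="read_misses1000A.fa.gz"
for cycle in B C D E; do
  # Compress kmers from the previous cycle's missed reads
  run kcompress.sh -Xmx31g ow min=2000 in="$misses" out="riboKmers1000${cycle}.fa.gz" fastawrap=4000 zl=9 pigz=16
  refs="$refs,riboKmers1000${cycle}.fa.gz"
  # Re-test the combined reference set against the same misses
  run bbduk.sh -Xmx8g in="$misses" ref="$refs" out="read_misses1000${cycle}.fa.gz" zl=6 k=31 mm=f ordered fastawrap=4000 ow
  misses="read_misses1000${cycle}.fa.gz"
done
```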

Stage 7: Final Processing

Merges iterative kmer sets and creates final optimized references:

# Merge all cycles for each depth
kcompress.sh -Xmx31g ow in=riboKmers1000A.fa.gz,riboKmers1000B.fa.gz,riboKmers1000C.fa.gz,riboKmers1000D.fa.gz,riboKmers1000E.fa.gz out=riboKmers1000merged.fa.gz fastawrap=4000 zl=9 pigz=16

# Final clustering and optimization
clumpify.sh k=16 in=riboKmers1000merged.fa.gz out=riboKmers1000clumped.fa.gz g=1 zl=9 fastawrap=4000 reorder rcomp ow

# Create fused sequences for maximum efficiency
fuse.sh -Xmx1g ow in=riboKmers1000clumped.fa.gz out=riboKmers1000fused.fa.gz fastawrap=8000 zl=11 pigz=32 maxlen=4000 npad=1
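Stage 7 repeats for every depth level. A hypothetical dry-run driver over all depths (again using an echoing `run` helper; not part of the original script):

```shell
# Hypothetical Stage 7 driver (dry run: 'run' echoes instead of executing).
run() { echo "$@"; }

for depth in 1000 500 200 100 50 40 30 20 10 8 5 3 2 1; do
  merged="riboKmers${depth}merged.fa.gz"
  # Merge cycles A..E for this depth
  run kcompress.sh -Xmx31g ow in="riboKmers${depth}A.fa.gz,riboKmers${depth}B.fa.gz,riboKmers${depth}C.fa.gz,riboKmers${depth}D.fa.gz,riboKmers${depth}E.fa.gz" out="$merged" fastawrap=4000 zl=9 pigz=16
  # Cluster, then fuse into long padded sequences
  run clumpify.sh k=16 in="$merged" out="riboKmers${depth}clumped.fa.gz" g=1 zl=9 fastawrap=4000 reorder rcomp ow
  run fuse.sh -Xmx1g ow in="riboKmers${depth}clumped.fa.gz" out="riboKmers${depth}fused.fa.gz" fastawrap=8000 zl=11 pigz=32 maxlen=4000 npad=1
done
```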

Output Files

The pipeline generates multiple kmer sets with different sensitivity levels:

Primary Kmer Sets (per depth level)

Final Optimized Sets

Sensitivity Analysis Files

Depth Levels and Applications

Sensitivity vs Specificity Trade-offs

  • High Depth (1000, 500): High specificity, lower sensitivity - for precise classification
  • Medium Depth (200, 100, 50): Balanced sensitivity/specificity - general purpose
  • Low Depth (30, 20, 10): High sensitivity, lower specificity - broad detection
  • Very Low Depth (5, 3, 2, 1): Maximum sensitivity - comprehensive screening

Usage Example

# Run the complete pipeline
./makeRiboKmers.sh

# Use generated kmer set for ribosomal RNA detection
bbduk.sh in=sequences.fq ref=riboKmers100fused.fa.gz k=31 stats=ribo_stats.txt outm=ribosomal.fq

# Compare sensitivities
bbduk.sh in=test.fq ref=riboKmers1000fused.fa.gz k=31 stats=high_spec.txt
bbduk.sh in=test.fq ref=riboKmers10fused.fa.gz k=31 stats=high_sens.txt

Performance Characteristics

Algorithm Details

Related Tools