make15mers.sh

Script: make15mers.sh Source Directory: pipelines/silva/ Author: Brian Bushnell

Specialized Silva database processing pipeline for creating optimized 15-mer k-mer sets that can distinguish between ribosomal RNA subunits (23S/LSU and 16S/SSU). Uses differential k-mer frequency analysis and set subtraction to create subunit-specific k-mer references.

Overview

This pipeline processes deduplicated Silva ribosomal RNA databases to create highly specific 15-mer k-mer sets that can distinguish between large subunit (23S/LSU) and small subunit (16S/SSU) ribosomal RNAs. The approach uses differential minimum count thresholds and set subtraction to isolate subunit-specific k-mers while removing shared sequences that would cause classification ambiguity.

Key Innovation: Rather than using raw frequency counts, this pipeline creates complementary k-mer sets by identifying k-mers that are abundant in one subunit but rare or absent in the other, enabling precise ribosomal subunit classification.

Prerequisites

Required Input Files

System Requirements

Pipeline Strategy

The pipeline uses a sophisticated differential k-mer analysis approach:

Core Principle

  1. Generate initial k-mer sets from each subunit using high minimum count thresholds
  2. Create complementary sets from the opposite subunit using lower thresholds
  3. Subtract shared k-mers to isolate subunit-specific sequences
  4. Validate coverage against original sequences to ensure specificity
  5. Compress final sets for optimal storage and query performance

Threshold Strategy

Pipeline Stages

Stage 1: 23S/LSU K-mer Set Generation

1.1 Initial LSU K-mer Extraction

kmercountexact.sh in=lsu_deduped100pct.fa.gz mincount=5000 out=lsu_mincount5000_k15.fa k=15 rcomp=f ow

Extracts 15-mers from LSU sequences that appear at least 5000 times. The high threshold ensures only highly conserved LSU-specific regions are captured. Note: rcomp=f disables reverse-complement processing to maintain directional specificity.

1.2 Initial Coverage Validation

bbduk.sh in=lsu_deduped100pct.fa.gz ref=lsu_mincount5000_k15.fa k=15 mm=f rcomp=f

Tests how well the initial k-mer set covers the original LSU sequences. The mm=f flag requires exact matches.

1.3 SSU Contamination K-mer Set

kmercountexact.sh in=ssu_deduped100pct.fa.gz mincount=10 out=ssu_mincount10_k15.fa k=15 rcomp=f ow

Creates a k-mer set from SSU sequences using a much lower threshold (mincount=10) to capture k-mers that might cause cross-contamination with LSU classification.

1.4 Set Subtraction for LSU Specificity

bbduk.sh ref=ssu_mincount10_k15.fa in=lsu_mincount5000_k15.fa k=15 mm=f outm=shared_10_k15.fa outu=lsuonly_10_k15.fa ow

Critical step: removes LSU k-mers that also appear in SSU sequences. This creates a truly LSU-specific k-mer set by:

1.5 Specificity Validation

bbduk.sh in=lsu_deduped100pct.fa.gz ref=lsuonly_10_k15.fa k=15 mm=f rcomp=f

Validates that the LSU-specific k-mer set still provides good coverage of original LSU sequences after subtraction.

1.6 Final LSU Compression

kcompress.sh in=lsuonly_10_k15.fa k=15 out=lsu_15mers.fa.gz ow zl=11 rcomp=f

Compresses the LSU-specific k-mer set with maximum compression level (zl=11) for efficient storage and faster loading.

1.7 Final Coverage Test

bbduk.sh in=lsu_deduped100pct.fa.gz ref=lsu_15mers.fa.gz k=15 mm=f rcomp=f

Final validation that the compressed LSU k-mer set maintains good coverage of the original sequences.

Stage 2: 16S/SSU K-mer Set Generation

2.1 Initial SSU K-mer Extraction

kmercountexact.sh in=ssu_deduped100pct.fa.gz mincount=5000 out=ssu_mincount5000_k15.fa k=15 rcomp=f ow

Extracts 15-mers from SSU sequences using the same high threshold (5000) as LSU processing to ensure only highly conserved SSU regions are captured.

2.2 Initial Coverage Validation

bbduk.sh in=ssu_deduped100pct.fa.gz ref=ssu_mincount5000_k15.fa k=15 mm=f rcomp=f

Validates coverage of the initial SSU k-mer set against original sequences.

2.3 LSU Contamination K-mer Set

kmercountexact.sh in=lsu_deduped100pct.fa.gz mincount=8 out=lsu_mincount8_k15.fa k=15 rcomp=f ow

Creates LSU k-mers using an even lower threshold (mincount=8) than used for SSU contamination detection, reflecting the different abundance patterns between subunits.

2.4 Set Subtraction for SSU Specificity

bbduk.sh ref=lsu_mincount8_k15.fa in=ssu_mincount5000_k15.fa k=15 mm=f outm=shared_8_k15.fa outu=ssuonly_8_k15.fa ow

Removes SSU k-mers that also appear in LSU sequences, creating a truly SSU-specific k-mer set:

2.5 Specificity Validation

bbduk.sh in=ssu_deduped100pct.fa.gz ref=ssuonly_8_k15.fa k=15 mm=f rcomp=f

Confirms that the SSU-specific k-mer set maintains good coverage after subtraction.

2.6 Final SSU Compression

kcompress.sh in=ssuonly_8_k15.fa k=15 out=ssu_15mers.fa.gz ow zl=11 rcomp=f

Compresses the SSU-specific k-mer set with maximum compression for efficient storage.

2.7 Final Coverage Test

bbduk.sh in=ssu_deduped100pct.fa.gz ref=ssu_15mers.fa.gz k=15 mm=f rcomp=f

Final validation of the compressed SSU k-mer set coverage.

Stage 3: Cross-Validation Testing

3.1 LSU vs SSU K-mer Test

bbduk.sh in=lsu_deduped100pct.fa.gz ref=ssu_15mers.fa.gz k=15 mm=f rcomp=f

Critical validation: tests LSU sequences against the SSU k-mer set. Ideally, this should show minimal matches, confirming the k-mer sets are truly subunit-specific.

3.2 SSU vs LSU K-mer Test

bbduk.sh in=ssu_deduped100pct.fa.gz ref=lsu_15mers.fa.gz k=15 mm=f rcomp=f

Reciprocal test: validates SSU sequences against the LSU k-mer set to confirm specificity in both directions.

Algorithm Details

Differential Frequency Strategy

The pipeline employs an asymmetric approach that accounts for the natural differences between ribosomal subunits:

K-mer Length Optimization

15-mer length provides optimal balance for ribosomal RNA analysis:

Compression Strategy

Final k-mer sets use aggressive compression (zl=11) to:

Usage

# Ensure Silva databases are present in working directory
ls -la lsu_deduped100pct.fa.gz ssu_deduped100pct.fa.gz

# Run the complete pipeline
bash make15mers.sh

# Monitor progress (pipeline includes multiple validation steps)
# Final outputs will be lsu_15mers.fa.gz and ssu_15mers.fa.gz

Output Files

Primary Outputs

Intermediate Files

Applications

Ribosomal Subunit Classification

# Classify unknown ribosomal sequences
bbduk.sh in=unknown_ribo.fq ref=lsu_15mers.fa.gz k=15 mm=f outm=classified_as_23S.fq outu=not_23S.fq
bbduk.sh in=not_23S.fq ref=ssu_15mers.fa.gz k=15 mm=f outm=classified_as_16S.fq outu=unclassified.fq

Quality Control for Ribosomal Assemblies

# Validate ribosomal gene assemblies
bbduk.sh in=assembled_contigs.fa ref=lsu_15mers.fa.gz,ssu_15mers.fa.gz k=15 stats=ribo_classification.txt

Contamination Detection

# Check for ribosomal contamination in non-ribosomal datasets
bbduk.sh in=genomic_reads.fq ref=lsu_15mers.fa.gz,ssu_15mers.fa.gz k=15 stats=contamination_check.txt

Performance Characteristics

Technical Parameters

K-mer Counting Parameters

Filtering Parameters

Compression Parameters

Future Development

The script includes a placeholder for 5S ribosomal RNA processing:

#5S: TODO

This indicates planned expansion to include 5S ribosomal RNA subunit classification, which would complete the full ribosomal RNA classification system (5S, 16S, 23S).

Quality Validation

The pipeline includes comprehensive validation at every major step:

Validation Strategy: Each bbduk validation command reports statistics including percentage of sequences matched, helping assess both sensitivity (coverage of target sequences) and specificity (lack of cross-reaction).

Related Tools

Notes and Considerations