make15mers.sh - BBTools Pipeline

Overview

This pipeline processes deduplicated Silva ribosomal RNA databases to create highly specific 15-mer k-mer sets that can distinguish between large subunit (23S/LSU) and small subunit (16S/SSU) ribosomal RNAs. The approach uses differential minimum count thresholds and set subtraction to isolate subunit-specific k-mers while removing shared sequences that would cause classification ambiguity.

Key Innovation: Rather than using raw frequency counts, this pipeline creates complementary k-mer sets by identifying k-mers that are abundant in one subunit but rare or absent in the other, enabling precise ribosomal subunit classification.

Prerequisites

Required Input Files

lsu_deduped100pct.fa.gz - 100% deduplicated large subunit (23S) Silva sequences
ssu_deduped100pct.fa.gz - 100% deduplicated small subunit (16S) Silva sequences

System Requirements

BBTools suite with kmercountexact, bbduk, and kcompress
Sufficient memory for k-mer counting on large Silva databases
Adequate disk space for intermediate k-mer sets and final compressed outputs

Pipeline Strategy

The pipeline uses a sophisticated differential k-mer analysis approach:

Core Principle

Generate initial k-mer sets from each subunit using high minimum count thresholds
Create complementary sets from the opposite subunit using lower thresholds
Subtract shared k-mers to isolate subunit-specific sequences
Validate coverage against original sequences to ensure specificity
Compress final sets for optimal storage and query performance

Threshold Strategy

23S Processing: Uses mincount=5000 for initial set, then subtracts SSU k-mers with mincount=10
16S Processing: Uses mincount=5000 for initial set, then subtracts LSU k-mers with mincount=8
Asymmetric thresholds: Different subtraction thresholds (8 vs 10) account for natural abundance differences between subunits

Pipeline Stages

Stage 1: 23S/LSU K-mer Set Generation

1.1 Initial LSU K-mer Extraction

kmercountexact.sh in=lsu_deduped100pct.fa.gz mincount=5000 out=lsu_mincount5000_k15.fa k=15 rcomp=f ow

Extracts 15-mers from LSU sequences that appear at least 5000 times. The high threshold ensures only highly conserved LSU-specific regions are captured. Note: rcomp=f disables reverse-complement processing to maintain directional specificity.

1.2 Initial Coverage Validation

bbduk.sh in=lsu_deduped100pct.fa.gz ref=lsu_mincount5000_k15.fa k=15 mm=f rcomp=f

Tests how well the initial k-mer set covers the original LSU sequences. The mm=f flag requires exact matches.

1.3 SSU Contamination K-mer Set

kmercountexact.sh in=ssu_deduped100pct.fa.gz mincount=10 out=ssu_mincount10_k15.fa k=15 rcomp=f ow

Creates a k-mer set from SSU sequences using a much lower threshold (mincount=10) to capture k-mers that might cause cross-contamination with LSU classification.

1.4 Set Subtraction for LSU Specificity

bbduk.sh ref=ssu_mincount10_k15.fa in=lsu_mincount5000_k15.fa k=15 mm=f outm=shared_10_k15.fa outu=lsuonly_10_k15.fa ow

Critical step: removes LSU k-mers that also appear in SSU sequences. This creates a truly LSU-specific k-mer set by:

outm=shared_10_k15.fa - K-mers found in both subunits (discarded)
outu=lsuonly_10_k15.fa - K-mers specific to LSU (kept)

1.5 Specificity Validation

bbduk.sh in=lsu_deduped100pct.fa.gz ref=lsuonly_10_k15.fa k=15 mm=f rcomp=f

Validates that the LSU-specific k-mer set still provides good coverage of original LSU sequences after subtraction.

1.6 Final LSU Compression

kcompress.sh in=lsuonly_10_k15.fa k=15 out=lsu_15mers.fa.gz ow zl=11 rcomp=f

Compresses the LSU-specific k-mer set with maximum compression level (zl=11) for efficient storage and faster loading.

1.7 Final Coverage Test

bbduk.sh in=lsu_deduped100pct.fa.gz ref=lsu_15mers.fa.gz k=15 mm=f rcomp=f

Final validation that the compressed LSU k-mer set maintains good coverage of the original sequences.

Stage 2: 16S/SSU K-mer Set Generation

2.1 Initial SSU K-mer Extraction

kmercountexact.sh in=ssu_deduped100pct.fa.gz mincount=5000 out=ssu_mincount5000_k15.fa k=15 rcomp=f ow

Extracts 15-mers from SSU sequences using the same high threshold (5000) as LSU processing to ensure only highly conserved SSU regions are captured.

2.2 Initial Coverage Validation

bbduk.sh in=ssu_deduped100pct.fa.gz ref=ssu_mincount5000_k15.fa k=15 mm=f rcomp=f

Validates coverage of the initial SSU k-mer set against original sequences.

2.3 LSU Contamination K-mer Set

kmercountexact.sh in=lsu_deduped100pct.fa.gz mincount=8 out=lsu_mincount8_k15.fa k=15 rcomp=f ow

Creates LSU k-mers using an even lower threshold (mincount=8) than used for SSU contamination detection, reflecting the different abundance patterns between subunits.

2.4 Set Subtraction for SSU Specificity

bbduk.sh ref=lsu_mincount8_k15.fa in=ssu_mincount5000_k15.fa k=15 mm=f outm=shared_8_k15.fa outu=ssuonly_8_k15.fa ow

Removes SSU k-mers that also appear in LSU sequences, creating a truly SSU-specific k-mer set:

outm=shared_8_k15.fa - K-mers found in both subunits (discarded)
outu=ssuonly_8_k15.fa - K-mers specific to SSU (kept)

2.5 Specificity Validation

bbduk.sh in=ssu_deduped100pct.fa.gz ref=ssuonly_8_k15.fa k=15 mm=f rcomp=f

Confirms that the SSU-specific k-mer set maintains good coverage after subtraction.

2.6 Final SSU Compression

kcompress.sh in=ssuonly_8_k15.fa k=15 out=ssu_15mers.fa.gz ow zl=11 rcomp=f

Compresses the SSU-specific k-mer set with maximum compression for efficient storage.

2.7 Final Coverage Test

bbduk.sh in=ssu_deduped100pct.fa.gz ref=ssu_15mers.fa.gz k=15 mm=f rcomp=f

Final validation of the compressed SSU k-mer set coverage.

Stage 3: Cross-Validation Testing

3.1 LSU vs SSU K-mer Test

bbduk.sh in=lsu_deduped100pct.fa.gz ref=ssu_15mers.fa.gz k=15 mm=f rcomp=f

Critical validation: tests LSU sequences against the SSU k-mer set. Ideally, this should show minimal matches, confirming the k-mer sets are truly subunit-specific.

3.2 SSU vs LSU K-mer Test

bbduk.sh in=ssu_deduped100pct.fa.gz ref=lsu_15mers.fa.gz k=15 mm=f rcomp=f

Reciprocal test: validates SSU sequences against the LSU k-mer set to confirm specificity in both directions.

Algorithm Details

Differential Frequency Strategy

The pipeline employs an asymmetric approach that accounts for the natural differences between ribosomal subunits:

Primary threshold (5000): Captures highly conserved regions specific to each subunit
Contamination thresholds (8-10): Different values reflect abundance patterns in Silva database
Set subtraction: Eliminates cross-reactive k-mers that could cause misclassification

K-mer Length Optimization

15-mer length provides optimal balance for ribosomal RNA analysis:

Specificity: Long enough to avoid random matches (4^15 = ~1 billion possible k-mers)
Sensitivity: Short enough to tolerate sequence variations within subunit groups
Performance: Efficient for real-time classification while maintaining precision

Compression Strategy

Final k-mer sets use aggressive compression (zl=11) to:

Minimize storage requirements for large k-mer databases
Reduce memory footprint during classification
Accelerate loading times for repeated analyses
Enable distribution of compact reference databases

Usage

# Ensure Silva databases are present in working directory
ls -la lsu_deduped100pct.fa.gz ssu_deduped100pct.fa.gz

# Run the complete pipeline
bash make15mers.sh

# Monitor progress (pipeline includes multiple validation steps)
# Final outputs will be lsu_15mers.fa.gz and ssu_15mers.fa.gz

Output Files

Primary Outputs

lsu_15mers.fa.gz - Final 23S/LSU-specific k-mer set
ssu_15mers.fa.gz - Final 16S/SSU-specific k-mer set

Intermediate Files

lsu_mincount5000_k15.fa - Initial high-frequency LSU k-mers
ssu_mincount5000_k15.fa - Initial high-frequency SSU k-mers
ssu_mincount10_k15.fa - Low-threshold SSU k-mers for LSU subtraction
lsu_mincount8_k15.fa - Low-threshold LSU k-mers for SSU subtraction
shared_10_k15.fa - K-mers found in both LSU and SSU (removed from LSU set)
shared_8_k15.fa - K-mers found in both SSU and LSU (removed from SSU set)
lsuonly_10_k15.fa - LSU-specific k-mers before compression
ssuonly_8_k15.fa - SSU-specific k-mers before compression

Applications

Ribosomal Subunit Classification

# Classify unknown ribosomal sequences
bbduk.sh in=unknown_ribo.fq ref=lsu_15mers.fa.gz k=15 mm=f outm=classified_as_23S.fq outu=not_23S.fq
bbduk.sh in=not_23S.fq ref=ssu_15mers.fa.gz k=15 mm=f outm=classified_as_16S.fq outu=unclassified.fq

Quality Control for Ribosomal Assemblies

# Validate ribosomal gene assemblies
bbduk.sh in=assembled_contigs.fa ref=lsu_15mers.fa.gz,ssu_15mers.fa.gz k=15 stats=ribo_classification.txt

Contamination Detection

# Check for ribosomal contamination in non-ribosomal datasets
bbduk.sh in=genomic_reads.fq ref=lsu_15mers.fa.gz,ssu_15mers.fa.gz k=15 stats=contamination_check.txt

Performance Characteristics

Memory Usage: Varies with Silva database size; k-mer counting is memory-intensive
Runtime: Extended processing due to multiple k-mer counting and validation steps
Disk I/O: Intensive due to large intermediate files and compression operations
Final File Sizes: Highly compressed outputs significantly smaller than intermediate files
Query Performance: Final k-mer sets optimized for fast classification queries

Technical Parameters

K-mer Counting Parameters

k=15: K-mer length optimized for ribosomal RNA specificity
rcomp=f: Disables reverse-complement to maintain directional information
ow: Overwrites existing output files

Filtering Parameters

mm=f: Requires exact matches during validation and subtraction
k=15: Consistent k-mer length throughout pipeline

Compression Parameters

zl=11: Maximum compression level for final output
k=15: K-mer length for compression optimization

Future Development

The script includes a placeholder for 5S ribosomal RNA processing:

#5S: TODO

This indicates planned expansion to include 5S ribosomal RNA subunit classification, which would complete the full ribosomal RNA classification system (5S, 16S, 23S).

Quality Validation

The pipeline includes comprehensive validation at every major step:

Initial coverage testing: Verifies k-mer sets capture source sequences effectively
Post-subtraction validation: Confirms specificity doesn't compromise sensitivity
Cross-contamination testing: Validates that subunit-specific sets don't cross-react
Final compressed validation: Ensures compression doesn't affect classification performance

Validation Strategy: Each bbduk validation command reports statistics including percentage of sequences matched, helping assess both sensitivity (coverage of target sequences) and specificity (lack of cross-reaction).

Related Tools

kmercountexact.sh - Exact k-mer frequency counting
bbduk.sh - K-mer filtering, matching, and set operations
kcompress.sh - K-mer set compression and optimization
filtersilva.sh - Silva database cleaning and filtering
reducesilva.sh - Silva database size reduction
makeRiboKmers.sh - Alternative ribosomal k-mer generation pipeline

Notes and Considerations

Threshold Selection: Different minimum count thresholds (5000 vs 8-10) reflect empirically determined optimal values
Directional Processing: rcomp=f throughout maintains strand directionality important for ribosomal classification
Exact Matching: mm=f ensures precise k-mer matching without mismatches
Overwrite Safety: ow flags allow pipeline re-runs by overwriting intermediate files
Silva Dependency: Designed specifically for Silva database format and organization
Memory Scaling: Performance scales with Silva database size and available system memory