makeRep Pipeline

Script: makeRep.sh Source Directory: pipelines/silva/ Author: Brian Bushnell

Silva database representative set generation pipeline that creates representative taxa sets at multiple ANI (Average Nucleotide Identity) thresholds. This pipeline processes all-to-all identity comparison results to generate minimal representative sets of sequences at different similarity levels.

Overview

This pipeline is designed to process Silva ribosomal RNA database sequences for creating representative sets at multiple taxonomic resolution levels. It takes pre-computed all-to-all ANI comparison results and generates minimal representative sets where all original sequences are within a specified distance of at least one representative.

The pipeline runs the representative.sh tool six times with progressively stringent thresholds:

Note: This pipeline is specifically designed for Silva database processing and expects input from CompareSketch with format=3 and usetaxidname options.

Prerequisites

Input Requirements

System Requirements

Pipeline Stages

Stage 1: 95% ANI Representative Set

representative.sh in=results_ani94_sr00_wkid.txt thresh=0.95 minratio=0.95 out=rep_95_95.txt ow -Xmx31g

Creates a representative set where sequences are clustered at 95% Average Nucleotide Identity with a minimum size ratio of 95%. This produces the most inclusive clustering, suitable for species-level grouping.

Stage 2: 97% ANI Representative Set

representative.sh in=results_ani94_sr00_wkid.txt thresh=0.97 minratio=0.97 out=rep_97_97.txt ow -Xmx31g

Creates a more stringent representative set at 97% ANI threshold with 97% size ratio requirement. This level is commonly used for bacterial species delimitation.

Stage 3: 98% ANI Representative Set

representative.sh in=results_ani94_sr00_wkid.txt thresh=0.98 minratio=0.98 out=rep_98_98.txt ow -Xmx31g

Higher confidence species clustering at 98% ANI threshold. Provides tighter taxonomic grouping with increased specificity.

Stage 4: 99% ANI Representative Set

representative.sh in=results_ani94_sr00_wkid.txt thresh=0.99 minratio=0.99 out=rep_99_99.txt ow -Xmx31g

Near-identical sequence clustering at 99% ANI. Suitable for identifying very closely related strains or subspecies.

Stage 5: 99.5% ANI Representative Set

representative.sh in=results_ani94_sr00_wkid.txt thresh=0.995 minratio=0.995 out=rep_995_995.txt ow -Xmx31g

Sub-strain level clustering at 99.5% ANI threshold. Captures minute sequence variations while maintaining representative structure.

Stage 6: 99.9% ANI Representative Set

representative.sh in=results_ani94_sr00_wkid.txt thresh=0.999 minratio=0.999 out=rep_999_999.txt ow -Xmx31g

Nearly perfect sequence clustering at 99.9% ANI. This produces the most fine-grained representative set, distinguishing sequences with minimal differences.

Algorithm Details

Representative Set Selection Algorithm

The underlying representative.sh tool implements a graph-based algorithm for minimal representative set selection:

Node Selection Strategy

Filtering Mechanisms

Scoring Function

Node priority is determined by: score = sum(edge_weights) + 0.25 * sum(edge_weights) * log(node_size)

This scoring favors nodes with many high-quality connections while giving a size bonus to larger sequences, ensuring representatives are well-connected and substantial.

Basic Usage

# 1. Ensure input file is available
# Input should be from CompareSketch with format=3 and usetaxidname
ls results_ani94_sr00_wkid.txt

# 2. Run the pipeline
bash makeRep.sh

# 3. Output files will be generated at each threshold level

Input File Format

The pipeline expects a tab-separated values file with the following columns:

  1. Query ID - NCBI taxonomic ID or sequence identifier
  2. Reference ID - NCBI taxonomic ID or sequence identifier
  3. ANI - Average Nucleotide Identity (0.0 to 1.0)
  4. Query Size - Query sequence size in unique k-mers (optional)
  5. Reference Size - Reference sequence size in unique k-mers (optional)
  6. Query Bases - Query sequence length in base pairs (optional)
  7. Reference Bases - Reference sequence length in base pairs (optional)

Example Input

Query	Ref	ANI	QSize	RSize	QBases	RBases
9606	10090	0.956	1500000	1400000	3000000000	2800000000
562	83333	0.982	850000	820000	4600000	4500000

Output Files

Generated Files

Output Format

Each output file contains representative taxa IDs with optional size and cluster information:

#Representative	Size	NodeCount	Nodes
9606	1500000	3	9606,10090,9598
562	850000	2	562,83333

Output Columns

Memory and Performance

Memory Allocation

The pipeline allocates 31GB of RAM for each representative.sh run. For smaller datasets, this can be reduced:

# For smaller datasets, reduce memory allocation
# Edit the script and change -Xmx31g to a smaller value like -Xmx8g

Processing Time

Processing time depends on:

For very large datasets (millions of sequences), consider running stages individually with different memory settings.

Silva Database Context

Purpose in Silva Processing

This pipeline is part of the Silva ribosomal RNA database processing workflow. It creates representative sets at multiple resolution levels to:

Integration with Other Silva Tools

This pipeline works in conjunction with other Silva processing scripts:

Threshold Selection Guidelines

Biological Significance of Thresholds

95% ANI
Genus-level clustering. Groups sequences from closely related species within the same genus.
97% ANI
Species-level clustering. Standard threshold for bacterial species delimitation in 16S rRNA analysis.
98% ANI
High-confidence species clustering. Reduces false grouping of distinct species.
99% ANI
Strain-level clustering. Groups sequences from the same or very closely related strains.
99.5% ANI
Sub-strain clustering. Captures minute variations within strain populations.
99.9% ANI
Near-identical clustering. Groups only sequences with minimal differences, likely from the same organism or laboratory strain.

Customization Options

Modifying Thresholds

To customize the ANI thresholds, edit the makeRep.sh script and modify the thresh and minratio parameters:

# Example: Add a 96% threshold
representative.sh in=results_ani94_sr00_wkid.txt thresh=0.96 minratio=0.96 out=rep_96_96.txt ow -Xmx31g

Memory Optimization

For smaller datasets or memory-constrained systems:

# Reduce memory allocation (example for 8GB systems)
representative.sh in=results_ani94_sr00_wkid.txt thresh=0.95 minratio=0.95 out=rep_95_95.txt ow -Xmx6g

Additional Filtering

The representative.sh tool supports additional filtering options that can be added to each stage:

Quality Control and Validation

Checking Output Quality

# Count representatives at each level
wc -l rep_*.txt

# Check coverage statistics by examining the log output
# Look for "Valid Nodes", "Invalid Nodes", and size statistics

# Compare representative counts across thresholds
# Higher thresholds should produce more representatives

Expected Behavior

Troubleshooting

Common Issues

Out of Memory Errors

If the pipeline runs out of memory:

Input File Format Errors

If representative.sh reports format errors:

Empty Output Files

If output files are empty or very small:

Integration with Downstream Analysis

Using Representative Sets

The generated representative sets can be used for:

Downstream BBTools Integration

# Extract representative sequences from Silva FASTA
filterbyname.sh in=silva_sequences.fasta out=silva_representatives_97.fasta names=rep_97_97.txt include=t

# Build sketch database from representatives
sketch.sh in=silva_representatives_97.fasta out=silva_sketch_97.sketch

Notes and Best Practices