makeRep Pipeline

Overview

This pipeline is designed to process Silva ribosomal RNA database sequences for creating representative sets at multiple taxonomic resolution levels. It takes pre-computed all-to-all ANI comparison results and generates minimal representative sets where all original sequences are within a specified distance of at least one representative.

The pipeline runs the representative.sh tool six times with progressively stringent thresholds:

95% ANI / 95% ratio - Species-level clustering
97% ANI / 97% ratio - High-confidence species clustering
98% ANI / 98% ratio - Very stringent species clustering
99% ANI / 99% ratio - Near-identical sequence clustering
99.5% ANI / 99.5% ratio - Sub-strain level clustering
99.9% ANI / 99.9% ratio - Nearly perfect sequence clustering

Note: This pipeline is specifically designed for Silva database processing and expects input from CompareSketch with format=3 and usetaxidname options.

Prerequisites

Input Requirements

results_ani94_sr00_wkid.txt - All-to-all identity comparison results in TSV format
Input file must contain at least 3 columns: query, reference, ANI
Optional columns: qsize, rsize, qbases, rbases (as produced by CompareSketch)
File should be generated using CompareSketch with format=3 and usetaxidname

System Requirements

BBTools suite installed with representative.sh available
Sufficient memory (31GB allocated by default)
Working directory with write permissions

Pipeline Stages

Stage 1: 95% ANI Representative Set

representative.sh in=results_ani94_sr00_wkid.txt thresh=0.95 minratio=0.95 out=rep_95_95.txt ow -Xmx31g

Creates a representative set where sequences are clustered at 95% Average Nucleotide Identity with a minimum size ratio of 95%. This produces the most inclusive clustering, suitable for species-level grouping.

Stage 2: 97% ANI Representative Set

representative.sh in=results_ani94_sr00_wkid.txt thresh=0.97 minratio=0.97 out=rep_97_97.txt ow -Xmx31g

Creates a more stringent representative set at 97% ANI threshold with 97% size ratio requirement. This level is commonly used for bacterial species delimitation.

Stage 3: 98% ANI Representative Set

representative.sh in=results_ani94_sr00_wkid.txt thresh=0.98 minratio=0.98 out=rep_98_98.txt ow -Xmx31g

Higher confidence species clustering at 98% ANI threshold. Provides tighter taxonomic grouping with increased specificity.

Stage 4: 99% ANI Representative Set

representative.sh in=results_ani94_sr00_wkid.txt thresh=0.99 minratio=0.99 out=rep_99_99.txt ow -Xmx31g

Near-identical sequence clustering at 99% ANI. Suitable for identifying very closely related strains or subspecies.

Stage 5: 99.5% ANI Representative Set

representative.sh in=results_ani94_sr00_wkid.txt thresh=0.995 minratio=0.995 out=rep_995_995.txt ow -Xmx31g

Sub-strain level clustering at 99.5% ANI threshold. Captures minute sequence variations while maintaining representative structure.

Stage 6: 99.9% ANI Representative Set

representative.sh in=results_ani94_sr00_wkid.txt thresh=0.999 minratio=0.999 out=rep_999_999.txt ow -Xmx31g

Nearly perfect sequence clustering at 99.9% ANI. This produces the most fine-grained representative set, distinguishing sequences with minimal differences.

Algorithm Details

Representative Set Selection Algorithm

The underlying representative.sh tool implements a graph-based algorithm for minimal representative set selection:

Node Selection Strategy

Score-based ranking: Nodes are sorted by a composite score combining edge weights and logarithmic size bonus
Greedy selection: Higher-scoring nodes are selected first as representatives
Coverage guarantee: Each original node must be within threshold distance of at least one representative
Singleton handling: Isolated nodes are automatically included as representatives

Filtering Mechanisms

Distance threshold: Edges below the ANI threshold are ignored
Size ratio filtering: Edges between sequences with size ratios below minratio are discarded
Size-based filtering: Nodes can be filtered by estimated size in unique k-mers or total bases
Taxonomic filtering: Optional filtering by NCBI taxonomic IDs or names

Scoring Function

Node priority is determined by: score = sum(edge_weights) + 0.25 * sum(edge_weights) * log(node_size)

This scoring favors nodes with many high-quality connections while giving a size bonus to larger sequences, ensuring representatives are well-connected and substantial.

Basic Usage

# 1. Ensure input file is available
# Input should be from CompareSketch with format=3 and usetaxidname
ls results_ani94_sr00_wkid.txt

# 2. Run the pipeline
bash makeRep.sh

# 3. Output files will be generated at each threshold level

Input File Format

The pipeline expects a tab-separated values file with the following columns:

Query ID - NCBI taxonomic ID or sequence identifier
Reference ID - NCBI taxonomic ID or sequence identifier
ANI - Average Nucleotide Identity (0.0 to 1.0)
Query Size - Query sequence size in unique k-mers (optional)
Reference Size - Reference sequence size in unique k-mers (optional)
Query Bases - Query sequence length in base pairs (optional)
Reference Bases - Reference sequence length in base pairs (optional)

Example Input

Query	Ref	ANI	QSize	RSize	QBases	RBases
9606	10090	0.956	1500000	1400000	3000000000	2800000000
562	83333	0.982	850000	820000	4600000	4500000

Output Files

Generated Files

rep_95_95.txt - Representative set at 95% ANI/ratio threshold
rep_97_97.txt - Representative set at 97% ANI/ratio threshold
rep_98_98.txt - Representative set at 98% ANI/ratio threshold
rep_99_99.txt - Representative set at 99% ANI/ratio threshold
rep_995_995.txt - Representative set at 99.5% ANI/ratio threshold
rep_999_999.txt - Representative set at 99.9% ANI/ratio threshold

Output Format

Each output file contains representative taxa IDs with optional size and cluster information:

#Representative	Size	NodeCount	Nodes
9606	1500000	3	9606,10090,9598
562	850000	2	562,83333

Output Columns

Representative - ID of the representative taxon/sequence
Size - Size of the representative in unique k-mers (if printsize=t)
NodeCount - Number of sequences represented by this representative (if printclusters=t)
Nodes - Comma-separated list of all sequence IDs represented (if printclusters=t)

Memory and Performance

Memory Allocation

The pipeline allocates 31GB of RAM for each representative.sh run. For smaller datasets, this can be reduced:

# For smaller datasets, reduce memory allocation
# Edit the script and change -Xmx31g to a smaller value like -Xmx8g

Processing Time

Processing time depends on:

Input size - Number of sequences in the comparison file
Connectivity - Number of edges above threshold
Memory availability - Sufficient RAM reduces I/O overhead

For very large datasets (millions of sequences), consider running stages individually with different memory settings.

Silva Database Context

Purpose in Silva Processing

This pipeline is part of the Silva ribosomal RNA database processing workflow. It creates representative sets at multiple resolution levels to:

Reduce redundancy - Eliminate highly similar sequences while maintaining diversity
Enable multi-level analysis - Provide different granularities for taxonomic analysis
Optimize search databases - Create smaller, more manageable reference sets
Support hierarchical clustering - Generate nested representative sets for different applications

Integration with Other Silva Tools

This pipeline works in conjunction with other Silva processing scripts:

fetchSilva.sh - Downloads Silva database sequences
makeCoveringSetSsu.sh / makeCoveringSetLsu.sh - Creates covering sets for small/large subunit rRNA
make15mers.sh - Generates k-mer databases for rapid searching

Threshold Selection Guidelines

Biological Significance of Thresholds

95% ANI: Genus-level clustering. Groups sequences from closely related species within the same genus.
97% ANI: Species-level clustering. Standard threshold for bacterial species delimitation in 16S rRNA analysis.
98% ANI: High-confidence species clustering. Reduces false grouping of distinct species.
99% ANI: Strain-level clustering. Groups sequences from the same or very closely related strains.
99.5% ANI: Sub-strain clustering. Captures minute variations within strain populations.
99.9% ANI: Near-identical clustering. Groups only sequences with minimal differences, likely from the same organism or laboratory strain.

Customization Options

Modifying Thresholds

To customize the ANI thresholds, edit the makeRep.sh script and modify the thresh and minratio parameters:

# Example: Add a 96% threshold
representative.sh in=results_ani94_sr00_wkid.txt thresh=0.96 minratio=0.96 out=rep_96_96.txt ow -Xmx31g

Memory Optimization

For smaller datasets or memory-constrained systems:

# Reduce memory allocation (example for 8GB systems)
representative.sh in=results_ani94_sr00_wkid.txt thresh=0.95 minratio=0.95 out=rep_95_95.txt ow -Xmx6g

Additional Filtering

The representative.sh tool supports additional filtering options that can be added to each stage:

Size filtering: minsize=X maxsize=Y (filter by unique k-mers)
Bases filtering: minbases=X maxbases=Y (filter by total base pairs)
Taxonomic filtering: level=species ids=9606,10090 (filter by NCBI taxonomy)

Quality Control and Validation

Checking Output Quality

# Count representatives at each level
wc -l rep_*.txt

# Check coverage statistics by examining the log output
# Look for "Valid Nodes", "Invalid Nodes", and size statistics

# Compare representative counts across thresholds
# Higher thresholds should produce more representatives

Expected Behavior

Decreasing representative count - Lower thresholds (95%) should have fewer representatives than higher thresholds (99.9%)
Hierarchical containment - Representatives at 99% should generally be a superset of representatives at 95%
Size preservation - Total represented size should be consistent across threshold levels

Troubleshooting

Common Issues

Out of Memory Errors

If the pipeline runs out of memory:

Reduce the -Xmx parameter (e.g., change -Xmx31g to -Xmx16g)
Process subsets of the input file separately
Ensure sufficient system RAM is available

Input File Format Errors

If representative.sh reports format errors:

Verify the input file has tab-separated columns
Ensure ANI values are between 0.0 and 1.0
Check that sequence IDs are numeric (for taxonomic analysis)
Remove any header lines that don't start with #

Empty Output Files

If output files are empty or very small:

Check that input ANI values meet the threshold requirements
Verify size ratios meet the minratio requirements
Ensure the input file contains valid comparison data

Integration with Downstream Analysis

Using Representative Sets

The generated representative sets can be used for:

Database construction - Building non-redundant search databases
Taxonomic assignment - Rapid classification using reduced sequence sets
Phylogenetic analysis - Representative sampling for tree construction
Comparative genomics - Selecting diverse representatives for analysis

Downstream BBTools Integration

# Extract representative sequences from Silva FASTA
filterbyname.sh in=silva_sequences.fasta out=silva_representatives_97.fasta names=rep_97_97.txt include=t

# Build sketch database from representatives
sketch.sh in=silva_representatives_97.fasta out=silva_sketch_97.sketch

Notes and Best Practices

Threshold selection - Choose thresholds based on your specific taxonomic resolution needs
Input quality - Ensure all-to-all comparisons are computed with appropriate parameters
Memory planning - Monitor memory usage, especially for large Silva releases
Validation - Compare output statistics across different threshold levels for consistency
Documentation - Keep records of which thresholds were used for reproducibility
Batch processing - For very large datasets, consider processing in smaller batches