makeRep Pipeline
Silva database representative set generation pipeline that creates representative taxa sets at multiple ANI (Average Nucleotide Identity) thresholds. This pipeline processes all-to-all identity comparison results to generate minimal representative sets of sequences at different similarity levels.
Overview
This pipeline is designed to process Silva ribosomal RNA database sequences for creating representative sets at multiple taxonomic resolution levels. It takes pre-computed all-to-all ANI comparison results and generates minimal representative sets where all original sequences are within a specified distance of at least one representative.
The pipeline runs the representative.sh
tool six times with progressively stringent thresholds:
- 95% ANI / 95% ratio - Species-level clustering
- 97% ANI / 97% ratio - High-confidence species clustering
- 98% ANI / 98% ratio - Very stringent species clustering
- 99% ANI / 99% ratio - Near-identical sequence clustering
- 99.5% ANI / 99.5% ratio - Sub-strain level clustering
- 99.9% ANI / 99.9% ratio - Nearly perfect sequence clustering
Prerequisites
Input Requirements
- results_ani94_sr00_wkid.txt - All-to-all identity comparison results in TSV format
- Input file must contain at least 3 columns: query, reference, ANI
- Optional columns: qsize, rsize, qbases, rbases (as produced by CompareSketch)
- File should be generated using CompareSketch with format=3 and usetaxidname
System Requirements
- BBTools suite installed with representative.sh available
- Sufficient memory (31GB allocated by default)
- Working directory with write permissions
Pipeline Stages
Stage 1: 95% ANI Representative Set
representative.sh in=results_ani94_sr00_wkid.txt thresh=0.95 minratio=0.95 out=rep_95_95.txt ow -Xmx31g
Creates a representative set where sequences are clustered at 95% Average Nucleotide Identity with a minimum size ratio of 95%. This produces the most inclusive clustering, suitable for species-level grouping.
Stage 2: 97% ANI Representative Set
representative.sh in=results_ani94_sr00_wkid.txt thresh=0.97 minratio=0.97 out=rep_97_97.txt ow -Xmx31g
Creates a more stringent representative set at 97% ANI threshold with 97% size ratio requirement. This level is commonly used for bacterial species delimitation.
Stage 3: 98% ANI Representative Set
representative.sh in=results_ani94_sr00_wkid.txt thresh=0.98 minratio=0.98 out=rep_98_98.txt ow -Xmx31g
Higher confidence species clustering at 98% ANI threshold. Provides tighter taxonomic grouping with increased specificity.
Stage 4: 99% ANI Representative Set
representative.sh in=results_ani94_sr00_wkid.txt thresh=0.99 minratio=0.99 out=rep_99_99.txt ow -Xmx31g
Near-identical sequence clustering at 99% ANI. Suitable for identifying very closely related strains or subspecies.
Stage 5: 99.5% ANI Representative Set
representative.sh in=results_ani94_sr00_wkid.txt thresh=0.995 minratio=0.995 out=rep_995_995.txt ow -Xmx31g
Sub-strain level clustering at 99.5% ANI threshold. Captures minute sequence variations while maintaining representative structure.
Stage 6: 99.9% ANI Representative Set
representative.sh in=results_ani94_sr00_wkid.txt thresh=0.999 minratio=0.999 out=rep_999_999.txt ow -Xmx31g
Nearly perfect sequence clustering at 99.9% ANI. This produces the most fine-grained representative set, distinguishing sequences with minimal differences.
Algorithm Details
Representative Set Selection Algorithm
The underlying representative.sh
tool implements a graph-based algorithm for minimal representative set selection:
Node Selection Strategy
- Score-based ranking: Nodes are sorted by a composite score combining edge weights and logarithmic size bonus
- Greedy selection: Higher-scoring nodes are selected first as representatives
- Coverage guarantee: Each original node must be within threshold distance of at least one representative
- Singleton handling: Isolated nodes are automatically included as representatives
Filtering Mechanisms
- Distance threshold: Edges below the ANI threshold are ignored
- Size ratio filtering: Edges between sequences with size ratios below minratio are discarded
- Size-based filtering: Nodes can be filtered by estimated size in unique k-mers or total bases
- Taxonomic filtering: Optional filtering by NCBI taxonomic IDs or names
Scoring Function
Node priority is determined by: score = sum(edge_weights) + 0.25 * sum(edge_weights) * log(node_size)
This scoring favors nodes with many high-quality connections while giving a size bonus to larger sequences, ensuring representatives are well-connected and substantial.
Basic Usage
# 1. Ensure input file is available
# Input should be from CompareSketch with format=3 and usetaxidname
ls results_ani94_sr00_wkid.txt
# 2. Run the pipeline
bash makeRep.sh
# 3. Output files will be generated at each threshold level
Input File Format
The pipeline expects a tab-separated values file with the following columns:
- Query ID - NCBI taxonomic ID or sequence identifier
- Reference ID - NCBI taxonomic ID or sequence identifier
- ANI - Average Nucleotide Identity (0.0 to 1.0)
- Query Size - Query sequence size in unique k-mers (optional)
- Reference Size - Reference sequence size in unique k-mers (optional)
- Query Bases - Query sequence length in base pairs (optional)
- Reference Bases - Reference sequence length in base pairs (optional)
Example Input
Query Ref ANI QSize RSize QBases RBases
9606 10090 0.956 1500000 1400000 3000000000 2800000000
562 83333 0.982 850000 820000 4600000 4500000
Output Files
Generated Files
- rep_95_95.txt - Representative set at 95% ANI/ratio threshold
- rep_97_97.txt - Representative set at 97% ANI/ratio threshold
- rep_98_98.txt - Representative set at 98% ANI/ratio threshold
- rep_99_99.txt - Representative set at 99% ANI/ratio threshold
- rep_995_995.txt - Representative set at 99.5% ANI/ratio threshold
- rep_999_999.txt - Representative set at 99.9% ANI/ratio threshold
Output Format
Each output file contains representative taxa IDs with optional size and cluster information:
#Representative Size NodeCount Nodes
9606 1500000 3 9606,10090,9598
562 850000 2 562,83333
Output Columns
- Representative - ID of the representative taxon/sequence
- Size - Size of the representative in unique k-mers (if printsize=t)
- NodeCount - Number of sequences represented by this representative (if printclusters=t)
- Nodes - Comma-separated list of all sequence IDs represented (if printclusters=t)
Memory and Performance
Memory Allocation
The pipeline allocates 31GB of RAM for each representative.sh run. For smaller datasets, this can be reduced:
# For smaller datasets, reduce memory allocation
# Edit the script and change -Xmx31g to a smaller value like -Xmx8g
Processing Time
Processing time depends on:
- Input size - Number of sequences in the comparison file
- Connectivity - Number of edges above threshold
- Memory availability - Sufficient RAM reduces I/O overhead
For very large datasets (millions of sequences), consider running stages individually with different memory settings.
Silva Database Context
Purpose in Silva Processing
This pipeline is part of the Silva ribosomal RNA database processing workflow. It creates representative sets at multiple resolution levels to:
- Reduce redundancy - Eliminate highly similar sequences while maintaining diversity
- Enable multi-level analysis - Provide different granularities for taxonomic analysis
- Optimize search databases - Create smaller, more manageable reference sets
- Support hierarchical clustering - Generate nested representative sets for different applications
Integration with Other Silva Tools
This pipeline works in conjunction with other Silva processing scripts:
- fetchSilva.sh - Downloads Silva database sequences
- makeCoveringSetSsu.sh / makeCoveringSetLsu.sh - Creates covering sets for small/large subunit rRNA
- make15mers.sh - Generates k-mer databases for rapid searching
Threshold Selection Guidelines
Biological Significance of Thresholds
- 95% ANI
- Genus-level clustering. Groups sequences from closely related species within the same genus.
- 97% ANI
- Species-level clustering. Standard threshold for bacterial species delimitation in 16S rRNA analysis.
- 98% ANI
- High-confidence species clustering. Reduces false grouping of distinct species.
- 99% ANI
- Strain-level clustering. Groups sequences from the same or very closely related strains.
- 99.5% ANI
- Sub-strain clustering. Captures minute variations within strain populations.
- 99.9% ANI
- Near-identical clustering. Groups only sequences with minimal differences, likely from the same organism or laboratory strain.
Customization Options
Modifying Thresholds
To customize the ANI thresholds, edit the makeRep.sh script and modify the thresh and minratio parameters:
# Example: Add a 96% threshold
representative.sh in=results_ani94_sr00_wkid.txt thresh=0.96 minratio=0.96 out=rep_96_96.txt ow -Xmx31g
Memory Optimization
For smaller datasets or memory-constrained systems:
# Reduce memory allocation (example for 8GB systems)
representative.sh in=results_ani94_sr00_wkid.txt thresh=0.95 minratio=0.95 out=rep_95_95.txt ow -Xmx6g
Additional Filtering
The representative.sh tool supports additional filtering options that can be added to each stage:
- Size filtering: minsize=X maxsize=Y (filter by unique k-mers)
- Bases filtering: minbases=X maxbases=Y (filter by total base pairs)
- Taxonomic filtering: level=species ids=9606,10090 (filter by NCBI taxonomy)
Quality Control and Validation
Checking Output Quality
# Count representatives at each level
wc -l rep_*.txt
# Check coverage statistics by examining the log output
# Look for "Valid Nodes", "Invalid Nodes", and size statistics
# Compare representative counts across thresholds
# Higher thresholds should produce more representatives
Expected Behavior
- Decreasing representative count - Lower thresholds (95%) should have fewer representatives than higher thresholds (99.9%)
- Hierarchical containment - Representatives at 99% should generally be a superset of representatives at 95%
- Size preservation - Total represented size should be consistent across threshold levels
Troubleshooting
Common Issues
Out of Memory Errors
If the pipeline runs out of memory:
- Reduce the -Xmx parameter (e.g., change -Xmx31g to -Xmx16g)
- Process subsets of the input file separately
- Ensure sufficient system RAM is available
Input File Format Errors
If representative.sh reports format errors:
- Verify the input file has tab-separated columns
- Ensure ANI values are between 0.0 and 1.0
- Check that sequence IDs are numeric (for taxonomic analysis)
- Remove any header lines that don't start with #
Empty Output Files
If output files are empty or very small:
- Check that input ANI values meet the threshold requirements
- Verify size ratios meet the minratio requirements
- Ensure the input file contains valid comparison data
Integration with Downstream Analysis
Using Representative Sets
The generated representative sets can be used for:
- Database construction - Building non-redundant search databases
- Taxonomic assignment - Rapid classification using reduced sequence sets
- Phylogenetic analysis - Representative sampling for tree construction
- Comparative genomics - Selecting diverse representatives for analysis
Downstream BBTools Integration
# Extract representative sequences from Silva FASTA
filterbyname.sh in=silva_sequences.fasta out=silva_representatives_97.fasta names=rep_97_97.txt include=t
# Build sketch database from representatives
sketch.sh in=silva_representatives_97.fasta out=silva_sketch_97.sketch
Notes and Best Practices
- Threshold selection - Choose thresholds based on your specific taxonomic resolution needs
- Input quality - Ensure all-to-all comparisons are computed with appropriate parameters
- Memory planning - Monitor memory usage, especially for large Silva releases
- Validation - Compare output statistics across different threshold levels for consistency
- Documentation - Keep records of which thresholds were used for reproducibility
- Batch processing - For very large datasets, consider processing in smaller batches