Fetch Silva Pipeline
Automated pipeline for downloading, processing, and preparing SILVA ribosomal RNA database files (release 132) for taxonomic classification with BBTools. This script creates optimized reference databases from SILVA SSU and LSU sequences.
Overview
The Fetch Silva pipeline automates the complex process of downloading and preparing SILVA ribosomal RNA reference sequences for use with BBTools taxonomic classification tools. SILVA (from Latin silva, meaning forest) is a comprehensive, quality-controlled database of aligned ribosomal RNA sequences from the three domains of life.
This pipeline specifically processes SILVA release 132, downloading both Small Subunit (SSU, 16S) and Large Subunit (LSU, 23S/28S) reference sequences, then applying quality filtering, deduplication, taxonomic annotation, and sketch database generation.
Prerequisites
System Requirements
- BBTools suite with taxonomic database support
- At least 31GB RAM for processing steps
- Network connectivity to ftp.arb-silva.de
- pigz for parallel compression
- wget for file downloads
- Sufficient disk space (several GB for raw files and processed outputs)
Database Requirements
- BBTools taxonomic database must be accessible (TAXPATH="auto")
- Write permissions in current directory for processed files
- Access to /global/projectb/sandbox/gaag/bbtools/jgi-bbtools/resources/ for final sketch placement (JGI systems)
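Before launching the pipeline, a quick pre-flight check along these lines can catch missing tools or insufficient resources; this is a supplementary sketch with illustrative commands, not part of the original script:
# Hypothetical pre-flight check; adjust thresholds to your environment
for tool in wget pigz reformat.sh clumpify.sh sketch.sh; do
  command -v "$tool" >/dev/null || echo "Missing required tool: $tool"
done
df -h .    # confirm several GB of free space in the working directory
free -g    # confirm roughly 31GB of RAM is available for the sorting steps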
Pipeline Stages
1. Download Phase
Downloads SILVA release 132 reference sequences from the official FTP server:
1.1 SSU Reference Download
wget -nv http://ftp.arb-silva.de/release_132/Exports/SILVA_132_SSURef_tax_silva_trunc.fasta.gz
Downloads Small Subunit (16S prokaryotic, 18S eukaryotic) reference sequences with taxonomic information.
1.2 LSU Reference Download
wget -nv http://ftp.arb-silva.de/release_132/Exports/SILVA_132_LSURef_tax_silva_trunc.fasta.gz
Downloads Large Subunit (23S prokaryotic, 28S eukaryotic) reference sequences with taxonomic information.
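If a transfer stalls or a later step reports a truncated archive, the downloads can be resumed and verified before processing begins (a supplementary sketch, not part of the pipeline script):
# Resume an interrupted download instead of restarting it
wget -c -nv http://ftp.arb-silva.de/release_132/Exports/SILVA_132_SSURef_tax_silva_trunc.fasta.gz
# Verify that both gzip streams are intact
gzip -t SILVA_132_SSURef_tax_silva_trunc.fasta.gz && echo "SSU archive OK"
gzip -t SILVA_132_LSURef_tax_silva_trunc.fasta.gz && echo "LSU archive OK"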
2. LSU Processing Chain
2.1 Format Conversion and Quality Control
reformat.sh in=SILVA_132_LSURef_tax_silva_trunc.fasta.gz out=lsu_utot.fa.gz \
fastawrap=4000 utot zl=9 qtrim=rl trimq=1 ow minlen=80 pigz=20
Converts RNA to DNA bases (utot changes U to T), trims ends at quality threshold 1, removes sequences shorter than 80 bp, and writes output at compression level 9 using 20 pigz threads.
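To confirm the filtering step behaved as expected, the output can be summarized with stats.sh, a standard BBTools utility; sequence counts and length distributions depend on the SILVA release:
# Summarize sequence count and length distribution after quality control
stats.sh in=lsu_utot.fa.gz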
2.2 Sequence Clustering and Reordering
clumpify.sh in=lsu_utot.fa.gz reorder groups=1 -Xmx31g \
fastawrap=4000 out=lsu_clumped.fa.gz zl=9 pigz=20
Clusters similar sequences together for better compression and processing efficiency, using 31GB memory.
2.3 Header Prefix Addition
rename.sh addprefix prefix="lcl|LSU" in=lsu_clumped.fa.gz \
out=lsu_prefix.fa.gz pigz=20 zl=9 fastawrap=4000
Adds "lcl|LSU" prefix to sequence headers for identification as Large Subunit sequences.
2.4 Taxonomic ID Conversion
gi2taxid.sh -Xmx8g tree=auto in=lsu_prefix.fa.gz out=lsu_renamed.fa.gz \
silva zl=9 pigz=20 taxpath=$TAXPATH
Converts SILVA taxonomic identifiers to NCBI taxonomy format using the BBTools taxonomic database.
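A quick way to spot-check the renaming is to inspect a few headers of the output; the exact header layout depends on the BBTools version, but renamed records should carry NCBI taxID tags rather than raw SILVA taxonomy strings:
# Inspect the first few renamed headers (format may vary by BBTools version)
zcat lsu_renamed.fa.gz | grep "^>" | head -5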
2.5 Taxonomic Sorting
sortbyname.sh -Xmx31g in=lsu_renamed.fa.gz out=lsu_sorted.fa.gz \
ow taxa tree=auto fastawrap=4000 zl=9 pigz=20 allowtemp=f taxpath=$TAXPATH
Sorts sequences by taxonomic hierarchy for efficient access patterns during classification.
2.6 Deduplication
dedupe.sh in=lsu_sorted.fa.gz out=lsu_deduped100pct.fa.gz \
zl=9 pigz=20 ordered fastawrap=4000
Removes 100% identical sequences while maintaining taxonomic sorting order.
3. SSU Processing Chain
3.1 Format Conversion and Quality Control
reformat.sh in=SILVA_132_SSURef_tax_silva_trunc.fasta.gz out=ssu_utot.fa.gz \
fastawrap=4000 utot zl=9 qtrim=rl trimq=1 ow minlen=80 pigz=20
Identical processing to the LSU chain: U-to-T conversion, end trimming, and minimum-length filtering.
3.2 Sequence Clustering and Reordering
clumpify.sh in=ssu_utot.fa.gz reorder groups=1 -Xmx31g \
fastawrap=4000 out=ssu_clumped.fa.gz zl=9 pigz=20
Clusters similar SSU sequences for processing efficiency.
3.3 Header Prefix Addition
rename.sh addprefix prefix="lcl|SSU" in=ssu_clumped.fa.gz \
out=ssu_prefix.fa.gz pigz=20 zl=9 fastawrap=4000
Adds "lcl|SSU" prefix to identify Small Subunit sequences.
3.4 Taxonomic ID Conversion
gi2taxid.sh -Xmx8g tree=auto in=ssu_prefix.fa.gz out=ssu_renamed.fa.gz \
silva zl=9 pigz=20 taxpath=$TAXPATH
Converts SILVA taxonomy to NCBI format for SSU sequences.
3.5 Taxonomic Sorting
sortbyname.sh -Xmx31g in=ssu_renamed.fa.gz out=ssu_sorted.fa.gz \
ow taxa tree=auto fastawrap=4000 zl=9 pigz=20 allowtemp=f taxpath=$TAXPATH
Sorts SSU sequences by taxonomic hierarchy.
3.6 Deduplication
dedupe.sh in=ssu_sorted.fa.gz out=ssu_deduped_sorted.fa.gz \
zl=9 pigz=20 ordered fastawrap=4000
Removes identical SSU sequences while preserving order.
4. Database Merging Phase
4.1 Combined Sorted Database
cat ssu_sorted.fa.gz lsu_sorted.fa.gz > both_sorted_temp.fa.gz
sortbyname.sh -Xmx31g in=both_sorted_temp.fa.gz out=both_sorted.fa.gz \
ow taxa tree=auto fastawrap=4000 zl=9 pigz=20 allowtemp=f taxpath=$TAXPATH
Combines SSU and LSU databases and re-sorts by taxonomy for unified access.
4.2 Combined Deduplicated Database
cat ssu_deduped_sorted.fa.gz lsu_deduped_sorted.fa.gz > both_deduped_sorted_temp.fa.gz
sortbyname.sh -Xmx31g in=both_deduped_sorted_temp.fa.gz out=both_deduped_sorted.fa.gz \
ow taxa tree=auto fastawrap=4000 zl=9 pigz=20 allowtemp=f taxpath=$TAXPATH
Creates the primary combined database with duplicates removed and taxonomic sorting maintained.
4.3 Taxonomic Filtering
filterbytaxa.sh include=f in=both_deduped_sorted.fa.gz \
id=2323,256318,12908 tree=auto requirepresent=f \
out=both_deduped_sorted_no_unclassified_bacteria.fa.gz zl=9 fastawrap=9999 ow
Excludes unclassified Bacteria (2323), metagenome (256318), and unclassified sequences (12908); removing these uninformative groups improves classification specificity.
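To audit exactly what was removed, the same command can be inverted with include=t so that only the excluded taxa are written out; this uses only flags already present in the pipeline, and the output filename is illustrative:
# Extract (rather than exclude) the filtered taxa for inspection
filterbytaxa.sh include=t in=both_deduped_sorted.fa.gz \
id=2323,256318,12908 tree=auto requirepresent=f \
out=excluded_taxa.fa.gz zl=9 fastawrap=9999 ow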
5. Sketch Database Generation
5.1 Species-Level Blacklist Generation
sketchblacklist.sh -Xmx16g in=both_deduped_sorted_no_unclassified_bacteria.fa.gz \
prefilter=f tree=auto taxa silva taxlevel=species ow \
out=blacklist_silva_species_500.sketch mincount=500
Builds a species-level blacklist of k-mers that occur in at least 500 different species; such widely shared k-mers provide little taxonomic discrimination and are ignored during sketch matching.
5.2 Genus-Level Blacklist Generation
sketchblacklist.sh -Xmx16g in=both_deduped_sorted_no_unclassified_bacteria.fa.gz \
prefilter=f tree=auto taxa silva taxlevel=genus ow \
out=blacklist_silva_genus_200.sketch mincount=200
Builds a genus-level blacklist with a lower threshold: k-mers shared by at least 200 different genera are blacklisted.
5.3 Blacklist Merging
mergesketch.sh -Xmx1g in=blacklist_silva_species_500.sketch,blacklist_silva_genus_200.sketch \
out=blacklist_silva_merged.sketch
Combines species and genus blacklists into a single comprehensive filter for sketch generation.
5.4 Taxonomic Sketch Database Creation
sketch.sh files=31 out=both_taxa#.sketch in=both_deduped_sorted.fa.gz \
size=200 maxgenomefraction=0.1 -Xmx8g tree=auto mode=taxa ow silva \
blacklist=blacklist_silva_merged.sketch autosize ow
Creates taxonomically organized sketch databases split across 31 files. In taxa mode one sketch is built per taxonomic unit, with autosizing capped so that no sketch uses more than 10% of a genome's k-mers (maxgenomefraction=0.1).
5.5 Sequence-Based Sketch Database Creation
sketch.sh files=31 out=both_seq#.sketch in=both_deduped_sorted.fa.gz \
size=200 maxgenomefraction=0.1 -Xmx8g tree=auto mode=sequence ow silva \
blacklist=blacklist_silva_merged.sketch autosize ow parsesubunit
Creates per-sequence sketch databases; parsesubunit records whether each record is an SSU or LSU sequence (from the lcl|SSU and lcl|LSU prefixes added earlier), enabling detailed sequence-level matching.
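Once both sketch sets exist, a local sanity check can be run with comparesketch.sh against the freshly built files; the query file below is a hypothetical placeholder, and the flags mirror the usage shown later in this document:
# Compare a small set of known rRNA sequences against the new taxonomic sketches
comparesketch.sh in=test_rrna.fa ref=both_taxa*.sketch tree=auto silva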
6. Installation Phase
cp blacklist_silva_merged.sketch /global/projectb/sandbox/gaag/bbtools/jgi-bbtools/resources/
Copies the merged blacklist sketch to the BBTools resources directory for system-wide availability (JGI-specific path).
Basic Usage
# 1. Navigate to a working directory with sufficient space
cd /path/to/working/directory
# 2. Run the pipeline
bash fetchSilva.sh
# 3. Monitor progress - each step is timed for performance tracking
# Pipeline will create multiple output files and databases
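Because the full run can take hours, capturing the timing output to a log file makes later review easier; this is an optional convention, not something the script does on its own:
# Run the pipeline and keep a dated log of all step timings
bash fetchSilva.sh 2>&1 | tee fetchSilva_$(date +%Y%m%d).log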
Output Files
Downloaded Raw Files
- SILVA_132_SSURef_tax_silva_trunc.fasta.gz - Raw SSU reference sequences
- SILVA_132_LSURef_tax_silva_trunc.fasta.gz - Raw LSU reference sequences
Processed LSU Files
- lsu_utot.fa.gz - Quality-filtered LSU sequences
- lsu_clumped.fa.gz - Clustered and reordered LSU sequences
- lsu_prefix.fa.gz - LSU sequences with "lcl|LSU" prefix
- lsu_renamed.fa.gz - LSU sequences with NCBI taxonomic IDs
- lsu_sorted.fa.gz - Taxonomically sorted LSU sequences
- lsu_deduped100pct.fa.gz - Deduplicated LSU sequences
Processed SSU Files
- ssu_utot.fa.gz - Quality-filtered SSU sequences
- ssu_clumped.fa.gz - Clustered and reordered SSU sequences
- ssu_prefix.fa.gz - SSU sequences with "lcl|SSU" prefix
- ssu_renamed.fa.gz - SSU sequences with NCBI taxonomic IDs
- ssu_sorted.fa.gz - Taxonomically sorted SSU sequences
- ssu_deduped_sorted.fa.gz - Deduplicated and sorted SSU sequences
Combined Database Files
- both_sorted.fa.gz - Combined SSU+LSU database, taxonomically sorted
- both_deduped_sorted.fa.gz - Combined database with duplicates removed
- both_deduped_sorted_no_unclassified_bacteria.fa.gz - Filtered database excluding unclassified bacterial groups
Sketch Database Files
- blacklist_silva_species_500.sketch - Species-level blacklist (min count 500)
- blacklist_silva_genus_200.sketch - Genus-level blacklist (min count 200)
- blacklist_silva_merged.sketch - Combined blacklist for filtering
- both_taxa#.sketch - 31 taxonomic sketch files for classification
- both_seq#.sketch - 31 sequence sketch files for detailed matching
Algorithm Details
Database Processing Strategy
The pipeline uses a multi-stage approach to optimize SILVA data for BBTools:
Quality Control Phase
Initial processing removes low-quality sequences and standardizes format:
- Quality trimming: Removes bases with quality ≤1 from both ends
- Length filtering: Discards sequences shorter than 80bp
- Base conversion: Converts U to T (utot) so the RNA sequences can be processed as DNA
- Compression optimization: Uses level 9 compression with 20 parallel threads
Clustering and Organization
Clumpify reorders sequences to group similar ones together:
- Similarity clustering: Groups sequences by k-mer content
- Memory efficiency: Reduces memory requirements for downstream processing
- Compression improvement: Similar sequences compress better when adjacent
- Cache optimization: Improves access patterns for taxonomic tools
Taxonomic Integration
The gi2taxid conversion process maps SILVA identifiers to NCBI taxonomy:
- Cross-reference mapping: Links SILVA accessions to NCBI taxonomic IDs
- Hierarchical organization: Maintains taxonomic tree structure
- Consistency enforcement: Ensures compatibility with BBTools taxonomic framework
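After conversion, lineage assignments can be spot-checked with the BBTools taxonomy.sh utility, assuming the taxonomic tree is installed; the taxID shown (2157, Archaea) is only an example:
# Print the NCBI lineage for an example taxID found in the renamed headers
taxonomy.sh tree=auto 2157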
Sketch Database Optimization
The sketch generation creates efficient k-mer databases for rapid classification:
- Blacklist filtering: Removes over-represented k-mers that provide poor discrimination
- Multi-level thresholds: Species-level (500 count) and genus-level (200 count) filtering
- Dual indexing: Separate taxonomic and sequence-based databases for different use cases
- Distributed architecture: 31-file splitting enables parallel access and memory management
Performance Characteristics
- Memory usage: Peak 31GB for sorting operations, 16GB for sketch generation
- Processing time: Each major step is timed for performance monitoring
- Disk efficiency: Aggressive compression (level 9) throughout pipeline
- Parallel processing: 20 pigz threads for compression optimization
- Download logging: wget runs non-verbose (-nv), keeping log output concise without affecting transfer speed
Database Background
SILVA Database
SILVA is one of the most comprehensive and widely-used ribosomal RNA sequence databases:
- SSU sequences: Small subunit rRNA (16S for prokaryotes, 18S for eukaryotes)
- LSU sequences: Large subunit rRNA (23S for prokaryotes, 28S for eukaryotes)
- Quality control: Manual curation and alignment verification
- Taxonomic annotation: Comprehensive taxonomic assignments
- Release 132: Specific version used in this pipeline for reproducibility
BBTools Integration
This pipeline optimizes SILVA data for BBTools taxonomic classification tools:
- sendsketch.sh: Uses sketch databases for rapid taxonomic identification
- bbmap.sh: Can use sorted databases for alignment-based classification
- filterbytaxa.sh: Uses the NCBI taxIDs written into the headers to include or exclude sequences by taxonomy
- taxonomy tools: All benefit from standardized NCBI taxonomic format
Customization Options
Memory Adjustment
For systems with different memory configurations, adjust the -Xmx parameters:
- sortbyname.sh: Currently uses 31GB, can be reduced for smaller systems
- clumpify.sh: Currently uses 31GB for clustering operations
- sketchblacklist.sh: Uses 16GB, relatively memory-efficient
- gi2taxid.sh: Uses 8GB, typically sufficient for most systems
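As an example, on a machine with roughly 16GB of RAM the most demanding steps could be run with reduced heaps; these are illustrative values rather than tested settings from the original script, and runtimes will be longer:
# Reduced-memory variants of the clumping and sorting steps (illustrative)
clumpify.sh in=lsu_utot.fa.gz reorder groups=1 -Xmx13g \
fastawrap=4000 out=lsu_clumped.fa.gz zl=9 pigz=8
sortbyname.sh -Xmx13g in=lsu_renamed.fa.gz out=lsu_sorted.fa.gz \
ow taxa tree=auto fastawrap=4000 zl=9 pigz=8 allowtemp=t taxpath=$TAXPATH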
Compression Settings
Compression parameters can be modified for different speed/size tradeoffs:
- zl=9: Maximum compression, slower but smallest files
- pigz=20: 20 parallel compression threads, adjust for CPU count
- fastawrap=4000: Wraps FASTA sequence lines at 4000 characters so long rRNA sequences span few lines
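For a faster run at the cost of larger intermediate files, the compression settings can be relaxed; the values below are illustrative, not taken from the original script:
# Example speed-oriented settings: moderate compression, fewer pigz threads
reformat.sh in=SILVA_132_LSURef_tax_silva_trunc.fasta.gz out=lsu_utot.fa.gz \
fastawrap=4000 utot zl=6 qtrim=rl trimq=1 ow minlen=80 pigz=8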
Taxonomic Path Configuration
The TAXPATH variable controls taxonomic database location:
- TAXPATH="auto": Automatic detection of BBTools taxonomic database
- Custom path: Set to specific directory if auto-detection fails
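If auto-detection fails, the variable inside the script can be pointed at an explicit taxonomy directory; the path below is a placeholder for wherever your BBTools taxtree files actually live:
# Inside fetchSilva.sh, replace the auto setting with an explicit directory
TAXPATH="/path/to/taxonomy"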
Troubleshooting
Download Issues
- Network timeouts: SILVA FTP server may be temporarily unavailable
- File corruption: Verify downloaded file integrity if processing fails
- Disk space: Ensure sufficient space for both raw and processed files
Processing Failures
- Memory errors: Reduce -Xmx values or use systems with more RAM
- Taxonomic database: Ensure BBTools taxonomic database is properly installed
- Tool availability: Verify all BBTools components are in PATH
Performance Issues
- Slow compression: Reduce pigz thread count for systems with fewer CPUs
- Disk I/O bottlenecks: Use fast storage (SSD) for temporary files
- Network bandwidth: Initial downloads may be slow depending on connection
Integration with BBTools Workflow
Using Generated Databases
The processed databases can be used with various BBTools for taxonomic analysis:
Rapid Classification with Sketch
# Query the remote SILVA sketch server for rapid identification
sendsketch.sh in=query.fa.gz silva
# Compare against the locally generated sequence sketches for detailed matching
comparesketch.sh in=query.fa.gz ref=both_seq*.sketch silva
Alignment-based Classification
# Map against combined sorted database
bbmap.sh in=query.fq.gz ref=both_sorted.fa.gz out=mapped.sam.gz
# Extract sequences for a particular taxon from the combined database
filterbytaxa.sh in=both_deduped_sorted.fa.gz out=subset.fa.gz include=t id=<taxID> tree=auto
Database Maintenance
To update to newer SILVA releases:
- Modify wget URLs to point to newer release
- Update file names to match new release numbering
- Re-run the complete pipeline
- Update resource file locations as needed
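A minimal way to parameterize the release number is sketched below; SILVA has changed its export naming and contents between releases, so confirm that these filenames exist on ftp.arb-silva.de before relying on this pattern:
# Hypothetical parameterization of the download step for a newer release
RELEASE=138
wget -nv http://ftp.arb-silva.de/release_${RELEASE}/Exports/SILVA_${RELEASE}_SSURef_tax_silva_trunc.fasta.gz
wget -nv http://ftp.arb-silva.de/release_${RELEASE}/Exports/SILVA_${RELEASE}_LSURef_tax_silva_trunc.fasta.gz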
Notes and Considerations
- Release specificity: This pipeline is hardcoded for SILVA release 132. URLs and filenames need updating for newer releases
- Resource requirements: Processing requires significant computational resources and time
- Reproducibility: Using a specific SILVA release ensures consistent results across analyses
- JGI integration: Final sketch copy step is specific to JGI computing infrastructure
- Error handling: Pipeline uses "set -e" to stop on any command failure
- Temporary files: Intermediate files are cleaned up automatically where possible
- Performance monitoring: Each major step is wrapped with "time" for execution tracking