Fetch Silva Pipeline
Automated pipeline for downloading, processing, and preparing SILVA ribosomal RNA database files (release 132) for taxonomic classification with BBTools. This script creates optimized reference databases from SILVA SSU and LSU sequences.
Overview
The Fetch Silva pipeline automates the complex process of downloading and preparing SILVA ribosomal RNA reference sequences for use with BBTools taxonomic classification tools. SILVA (from Latin silva, meaning forest) is a comprehensive, quality-controlled database of aligned ribosomal RNA sequences from the three domains of life.
This pipeline specifically processes SILVA release 132, downloading both Small Subunit (SSU, 16S) and Large Subunit (LSU, 23S/28S) reference sequences, then applying quality filtering, deduplication, taxonomic annotation, and sketch database generation.
Prerequisites
System Requirements
- BBTools suite with taxonomic database support
- At least 31GB RAM for processing steps
- Network connectivity to ftp.arb-silva.de
- pigz for parallel compression
- wget for file downloads
- Sufficient disk space (several GB for raw files and processed outputs)
Database Requirements
- BBTools taxonomic database must be accessible (TAXPATH="auto")
- Write permissions in current directory for processed files
- Access to /global/projectb/sandbox/gaag/bbtools/jgi-bbtools/resources/ for final sketch placement (JGI systems)
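Before launching the pipeline, a quick pre-flight check along these lines can catch missing tools or insufficient resources; this is a supplementary sketch with illustrative commands, not part of the original script:
# Hypothetical pre-flight check; adjust thresholds to your environment
for tool in wget pigz reformat.sh clumpify.sh sketch.sh; do
  command -v "$tool" >/dev/null || echo "Missing required tool: $tool"
done
df -h .    # confirm several GB of free space in the working directory
free -g    # confirm roughly 31GB of RAM is available for the sorting steps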
Pipeline Stages
1. Download Phase
Downloads SILVA release 132 reference sequences from the official FTP server:
1.1 SSU Reference Download
wget -nv http://ftp.arb-silva.de/release_132/Exports/SILVA_132_SSURef_tax_silva_trunc.fasta.gz
Downloads Small Subunit (16S prokaryotic, 18S eukaryotic) reference sequences with taxonomic information.
1.2 LSU Reference Download
wget -nv http://ftp.arb-silva.de/release_132/Exports/SILVA_132_LSURef_tax_silva_trunc.fasta.gz
Downloads Large Subunit (23S prokaryotic, 28S eukaryotic) reference sequences with taxonomic information.
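If a transfer stalls or a later step reports a truncated archive, the downloads can be resumed and verified before processing begins (a supplementary sketch, not part of the pipeline script):
# Resume an interrupted download instead of restarting it
wget -c -nv http://ftp.arb-silva.de/release_132/Exports/SILVA_132_SSURef_tax_silva_trunc.fasta.gz
# Verify that both gzip streams are intact
gzip -t SILVA_132_SSURef_tax_silva_trunc.fasta.gz && echo "SSU archive OK"
gzip -t SILVA_132_LSURef_tax_silva_trunc.fasta.gz && echo "LSU archive OK"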
2. LSU Processing Chain
2.1 Format Conversion and Quality Control
reformat.sh in=SILVA_132_LSURef_tax_silva_trunc.fasta.gz out=lsu_utot.fa.gz \
fastawrap=4000 utot zl=9 qtrim=rl trimq=1 ow minlen=80 pigz=20
Converts RNA to DNA bases (utot changes U to T), trims ends at quality threshold 1, removes sequences shorter than 80 bp, and writes output at compression level 9 using 20 pigz threads.
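To confirm the filtering step behaved as expected, the output can be summarized with stats.sh, a standard BBTools utility; sequence counts and length distributions depend on the SILVA release:
# Summarize sequence count and length distribution after quality control
stats.sh in=lsu_utot.fa.gz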
2.2 Sequence Clustering and Reordering
clumpify.sh in=lsu_utot.fa.gz reorder groups=1 -Xmx31g \
fastawrap=4000 out=lsu_clumped.fa.gz zl=9 pigz=20
Clusters similar sequences together for better compression and processing efficiency, using 31GB memory.
2.3 Header Prefix Addition
rename.sh addprefix prefix="lcl|LSU" in=lsu_clumped.fa.gz \
out=lsu_prefix.fa.gz pigz=20 zl=9 fastawrap=4000
Adds "lcl|LSU" prefix to sequence headers for identification as Large Subunit sequences.
2.4 Taxonomic ID Conversion
gi2taxid.sh -Xmx8g tree=auto in=lsu_prefix.fa.gz out=lsu_renamed.fa.gz \
silva zl=9 pigz=20 taxpath=$TAXPATH
Converts SILVA taxonomic identifiers to NCBI taxonomy format using the BBTools taxonomic database.
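A quick way to spot-check the renaming is to inspect a few headers of the output; the exact header layout depends on the BBTools version, but renamed records should carry NCBI taxID tags rather than raw SILVA taxonomy strings:
# Inspect the first few renamed headers (format may vary by BBTools version)
zcat lsu_renamed.fa.gz | grep "^>" | head -5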
2.5 Taxonomic Sorting
sortbyname.sh -Xmx31g in=lsu_renamed.fa.gz out=lsu_sorted.fa.gz \
ow taxa tree=auto fastawrap=4000 zl=9 pigz=20 allowtemp=f taxpath=$TAXPATH
Sorts sequences by taxonomic hierarchy for efficient access patterns during classification.
2.6 Deduplication
dedupe.sh in=lsu_sorted.fa.gz out=lsu_deduped100pct.fa.gz \
zl=9 pigz=20 ordered fastawrap=4000
Removes 100% identical sequences while maintaining taxonomic sorting order.
3. SSU Processing Chain
3.1 Format Conversion and Quality Control
reformat.sh in=SILVA_132_SSURef_tax_silva_trunc.fasta.gz out=ssu_utot.fa.gz \
fastawrap=4000 utot zl=9 qtrim=rl trimq=1 ow minlen=80 pigz=20
Identical processing to the LSU chain: U-to-T conversion, end trimming, and minimum-length filtering.
3.2 Sequence Clustering and Reordering
clumpify.sh in=ssu_utot.fa.gz reorder groups=1 -Xmx31g \
fastawrap=4000 out=ssu_clumped.fa.gz zl=9 pigz=20
Clusters similar SSU sequences for processing efficiency.
3.3 Header Prefix Addition
rename.sh addprefix prefix="lcl|SSU" in=ssu_clumped.fa.gz \
out=ssu_prefix.fa.gz pigz=20 zl=9 fastawrap=4000
Adds "lcl|SSU" prefix to identify Small Subunit sequences.
3.4 Taxonomic ID Conversion
gi2taxid.sh -Xmx8g tree=auto in=ssu_prefix.fa.gz out=ssu_renamed.fa.gz \
silva zl=9 pigz=20 taxpath=$TAXPATH
Converts SILVA taxonomy to NCBI format for SSU sequences.
3.5 Taxonomic Sorting
sortbyname.sh -Xmx31g in=ssu_renamed.fa.gz out=ssu_sorted.fa.gz \
ow taxa tree=auto fastawrap=4000 zl=9 pigz=20 allowtemp=f taxpath=$TAXPATH
Sorts SSU sequences by taxonomic hierarchy.
3.6 Deduplication
dedupe.sh in=ssu_sorted.fa.gz out=ssu_deduped_sorted.fa.gz \
zl=9 pigz=20 ordered fastawrap=4000
Removes identical SSU sequences while preserving order.
4. Database Merging Phase
4.1 Combined Sorted Database
cat ssu_sorted.fa.gz lsu_sorted.fa.gz > both_sorted_temp.fa.gz
sortbyname.sh -Xmx31g in=both_sorted_temp.fa.gz out=both_sorted.fa.gz \
ow taxa tree=auto fastawrap=4000 zl=9 pigz=20 allowtemp=f taxpath=$TAXPATH
Combines SSU and LSU databases and re-sorts by taxonomy for unified access.
4.2 Combined Deduplicated Database
cat ssu_deduped_sorted.fa.gz lsu_deduped_sorted.fa.gz > both_deduped_sorted_temp.fa.gz
sortbyname.sh -Xmx31g in=both_deduped_sorted_temp.fa.gz out=both_deduped_sorted.fa.gz \
ow taxa tree=auto fastawrap=4000 zl=9 pigz=20 allowtemp=f taxpath=$TAXPATH
Creates the primary combined database with duplicates removed and taxonomic sorting maintained.
4.3 Taxonomic Filtering
filterbytaxa.sh include=f in=both_deduped_sorted.fa.gz \
id=2323,256318,12908 tree=auto requirepresent=f \
out=both_deduped_sorted_no_unclassified_bacteria.fa.gz zl=9 fastawrap=9999 ow
Excludes unclassified Bacteria (2323), metagenome (256318), and unclassified sequences (12908); removing these uninformative groups improves classification specificity.
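To audit exactly what was removed, the same command can be inverted with include=t so that only the excluded taxa are written out; this uses only flags already present in the pipeline, and the output filename is illustrative:
# Extract (rather than exclude) the filtered taxa for inspection
filterbytaxa.sh include=t in=both_deduped_sorted.fa.gz \
id=2323,256318,12908 tree=auto requirepresent=f \
out=excluded_taxa.fa.gz zl=9 fastawrap=9999 ow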
5. Sketch Database Generation
5.1 Species-Level Blacklist Generation
sketchblacklist.sh -Xmx16g in=both_deduped_sorted_no_unclassified_bacteria.fa.gz \
prefilter=f tree=auto taxa silva taxlevel=species ow \
out=blacklist_silva_species_500.sketch mincount=500
Builds a species-level blacklist of k-mers that occur in at least 500 different species; such widely shared k-mers provide little taxonomic discrimination and are ignored during sketch matching.
5.2 Genus-Level Blacklist Generation
sketchblacklist.sh -Xmx16g in=both_deduped_sorted_no_unclassified_bacteria.fa.gz \
prefilter=f tree=auto taxa silva taxlevel=genus ow \
out=blacklist_silva_genus_200.sketch mincount=200
Builds a genus-level blacklist with a lower threshold: k-mers shared by at least 200 different genera are blacklisted.
5.3 Blacklist Merging
mergesketch.sh -Xmx1g in=blacklist_silva_species_500.sketch,blacklist_silva_genus_200.sketch \
out=blacklist_silva_merged.sketch
Combines species and genus blacklists into a single comprehensive filter for sketch generation.
5.4 Taxonomic Sketch Database Creation
sketch.sh files=31 out=both_taxa#.sketch in=both_deduped_sorted.fa.gz \
size=200 maxgenomefraction=0.1 -Xmx8g tree=auto mode=taxa ow silva \
blacklist=blacklist_silva_merged.sketch autosize ow
Creates taxonomically organized sketch databases split across 31 files. In taxa mode one sketch is built per taxonomic unit, with autosizing capped so that no sketch uses more than 10% of a genome's k-mers (maxgenomefraction=0.1).
5.5 Sequence-Based Sketch Database Creation
sketch.sh files=31 out=both_seq#.sketch in=both_deduped_sorted.fa.gz \
size=200 maxgenomefraction=0.1 -Xmx8g tree=auto mode=sequence ow silva \
blacklist=blacklist_silva_merged.sketch autosize ow parsesubunit
Creates per-sequence sketch databases; parsesubunit records whether each record is an SSU or LSU sequence (from the lcl|SSU and lcl|LSU prefixes added earlier), enabling detailed sequence-level matching.
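Once both sketch sets exist, a local sanity check can be run with comparesketch.sh against the freshly built files; the query file below is a hypothetical placeholder, and the flags mirror the usage shown later in this document:
# Compare a small set of known rRNA sequences against the new taxonomic sketches
comparesketch.sh in=test_rrna.fa ref=both_taxa*.sketch tree=auto silva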
6. Installation Phase
cp blacklist_silva_merged.sketch /global/projectb/sandbox/gaag/bbtools/jgi-bbtools/resources/
Copies the merged blacklist sketch to the BBTools resources directory for system-wide availability (JGI-specific path).
Basic Usage
# 1. Navigate to a working directory with sufficient space
cd /path/to/working/directory
# 2. Run the pipeline
bash fetchSilva.sh
# 3. Monitor progress - each step is timed for performance tracking
# Pipeline will create multiple output files and databases
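Because the full run can take hours, capturing the timing output to a log file makes later review easier; this is an optional convention, not something the script does on its own:
# Run the pipeline and keep a dated log of all step timings
bash fetchSilva.sh 2>&1 | tee fetchSilva_$(date +%Y%m%d).log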
Output Files
Downloaded Raw Files
- SILVA_132_SSURef_tax_silva_trunc.fasta.gz - Raw SSU reference sequences
- SILVA_132_LSURef_tax_silva_trunc.fasta.gz - Raw LSU reference sequences
Processed LSU Files
- lsu_utot.fa.gz - Quality-filtered LSU sequences
- lsu_clumped.fa.gz - Clustered and reordered LSU sequences
- lsu_prefix.fa.gz - LSU sequences with "lcl|LSU" prefix
- lsu_renamed.fa.gz - LSU sequences with NCBI taxonomic IDs
- lsu_sorted.fa.gz - Taxonomically sorted LSU sequences
- lsu_deduped100pct.fa.gz - Deduplicated LSU sequences
Processed SSU Files
- ssu_utot.fa.gz - Quality-filtered SSU sequences
- ssu_clumped.fa.gz - Clustered and reordered SSU sequences
- ssu_prefix.fa.gz - SSU sequences with "lcl|SSU" prefix
- ssu_renamed.fa.gz - SSU sequences with NCBI taxonomic IDs
- ssu_sorted.fa.gz - Taxonomically sorted SSU sequences
- ssu_deduped_sorted.fa.gz - Deduplicated and sorted SSU sequences
Combined Database Files
- both_sorted.fa.gz - Combined SSU+LSU database, taxonomically sorted
- both_deduped_sorted.fa.gz - Combined database with duplicates removed
- both_deduped_sorted_no_unclassified_bacteria.fa.gz - Filtered database excluding unclassified bacterial groups
Sketch Database Files
- blacklist_silva_species_500.sketch - Species-level blacklist (min count 500)
- blacklist_silva_genus_200.sketch - Genus-level blacklist (min count 200)
- blacklist_silva_merged.sketch - Combined blacklist for filtering
- both_taxa#.sketch - 31 taxonomic sketch files for classification
- both_seq#.sketch - 31 sequence sketch files for detailed matching
Algorithm Details
Database Processing Strategy
The pipeline uses a multi-stage approach to optimize SILVA data for BBTools:
Quality Control Phase
Initial processing removes low-quality sequences and standardizes format:
- Quality trimming: Removes bases with quality ≤1 from both ends
- Length filtering: Discards sequences shorter than 80bp
- Base conversion: Converts U to T (utot) so the RNA sequences can be processed as DNA
- Compression optimization: Uses level 9 compression with 20 parallel threads
Clustering and Organization
Clumpify reorders sequences to group similar ones together:
- Similarity clustering: Groups sequences by k-mer content
- Memory efficiency: Reduces memory requirements for downstream processing
- Compression improvement: Similar sequences compress better when adjacent
- Cache optimization: Improves access patterns for taxonomic tools
Taxonomic Integration
The gi2taxid conversion process maps SILVA identifiers to NCBI taxonomy:
- Cross-reference mapping: Links SILVA accessions to NCBI taxonomic IDs
- Hierarchical organization: Maintains taxonomic tree structure
- Consistency enforcement: Ensures compatibility with BBTools taxonomic framework
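After conversion, lineage assignments can be spot-checked with the BBTools taxonomy.sh utility, assuming the taxonomic tree is installed; the taxID shown (2157, Archaea) is only an example:
# Print the NCBI lineage for an example taxID found in the renamed headers
taxonomy.sh tree=auto 2157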
Sketch Database Optimization
The sketch generation creates efficient k-mer databases for rapid classification:
- Blacklist filtering: Removes over-represented k-mers that provide poor discrimination
- Multi-level thresholds: Species-level (500 count) and genus-level (200 count) filtering
- Dual indexing: Separate taxonomic and sequence-based databases for different use cases
- Distributed architecture: 31-file splitting enables parallel access and memory management
Performance Characteristics
- Memory usage: Peak 31GB for sorting operations, 16GB for sketch generation
- Processing time: Each major step is timed for performance monitoring
- Disk efficiency: Aggressive compression (level 9) throughout pipeline
- Parallel processing: 20 pigz threads for compression optimization
- Download logging: wget runs non-verbose (-nv), keeping log output concise without affecting transfer speed
Database Background
SILVA Database
SILVA is one of the most comprehensive and widely-used ribosomal RNA sequence databases:
- SSU sequences: Small subunit rRNA (16S for prokaryotes, 18S for eukaryotes)
- LSU sequences: Large subunit rRNA (23S for prokaryotes, 28S for eukaryotes)
- Quality control: Manual curation and alignment verification
- Taxonomic annotation: Comprehensive taxonomic assignments
- Release 132: Specific version used in this pipeline for reproducibility
BBTools Integration
This pipeline optimizes SILVA data for BBTools taxonomic classification tools:
- sendsketch.sh: Uses sketch databases for rapid taxonomic identification
- bbmap.sh: Can use sorted databases for alignment-based classification
- filterbytaxa.sh: Uses the NCBI taxIDs written into the headers to include or exclude sequences by taxonomy
- taxonomy tools: All benefit from standardized NCBI taxonomic format
Customization Options
Memory Adjustment
For systems with different memory configurations, adjust the -Xmx parameters:
- sortbyname.sh: Currently uses 31GB, can be reduced for smaller systems
- clumpify.sh: Currently uses 31GB for clustering operations
- sketchblacklist.sh: Uses 16GB, relatively memory-efficient
- gi2taxid.sh: Uses 8GB, typically sufficient for most systems
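As an example, on a machine with roughly 16GB of RAM the most demanding steps could be run with reduced heaps; these are illustrative values rather than tested settings from the original script, and runtimes will be longer:
# Reduced-memory variants of the clumping and sorting steps (illustrative)
clumpify.sh in=lsu_utot.fa.gz reorder groups=1 -Xmx13g \
fastawrap=4000 out=lsu_clumped.fa.gz zl=9 pigz=8
sortbyname.sh -Xmx13g in=lsu_renamed.fa.gz out=lsu_sorted.fa.gz \
ow taxa tree=auto fastawrap=4000 zl=9 pigz=8 allowtemp=t taxpath=$TAXPATH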
Compression Settings
Compression parameters can be modified for different speed/size tradeoffs:
- zl=9: Maximum compression, slower but smallest files
- pigz=20: 20 parallel compression threads, adjust for CPU count
- fastawrap=4000: Wraps FASTA sequence lines at 4000 characters so long rRNA sequences span few lines
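For a faster run at the cost of larger intermediate files, the compression settings can be relaxed; the values below are illustrative, not taken from the original script:
# Example speed-oriented settings: moderate compression, fewer pigz threads
reformat.sh in=SILVA_132_LSURef_tax_silva_trunc.fasta.gz out=lsu_utot.fa.gz \
fastawrap=4000 utot zl=6 qtrim=rl trimq=1 ow minlen=80 pigz=8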
Taxonomic Path Configuration
The TAXPATH variable controls taxonomic database location:
- TAXPATH="auto": Automatic detection of BBTools taxonomic database
- Custom path: Set to specific directory if auto-detection fails
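If auto-detection fails, the variable inside the script can be pointed at an explicit taxonomy directory; the path below is a placeholder for wherever your BBTools taxtree files actually live:
# Inside fetchSilva.sh, replace the auto setting with an explicit directory
TAXPATH="/path/to/taxonomy"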
Troubleshooting
Download Issues
- Network timeouts: SILVA FTP server may be temporarily unavailable
- File corruption: Verify downloaded file integrity if processing fails
- Disk space: Ensure sufficient space for both raw and processed files
Processing Failures
- Memory errors: Reduce -Xmx values or use systems with more RAM
- Taxonomic database: Ensure BBTools taxonomic database is properly installed
- Tool availability: Verify all BBTools components are in PATH
Performance Issues
- Slow compression: Reduce pigz thread count for systems with fewer CPUs
- Disk I/O bottlenecks: Use fast storage (SSD) for temporary files
- Network bandwidth: Initial downloads may be slow depending on connection
Integration with BBTools Workflow
Using Generated Databases
The processed databases can be used with various BBTools for taxonomic analysis:
Rapid Classification with Sketch
# Query the remote SILVA sketch server for rapid identification
sendsketch.sh in=query.fa.gz silva
# Compare against the locally generated sequence sketches for detailed matching
comparesketch.sh in=query.fa.gz ref=both_seq*.sketch silva
Alignment-based Classification
# Map against combined sorted database
bbmap.sh in=query.fq.gz ref=both_sorted.fa.gz out=mapped.sam.gz
# Extract sequences for a particular taxon from the combined database
filterbytaxa.sh in=both_deduped_sorted.fa.gz out=subset.fa.gz include=t id=<taxID> tree=auto
Database Maintenance
To update to newer SILVA releases:
- Modify wget URLs to point to newer release
- Update file names to match new release numbering
- Re-run the complete pipeline
- Update resource file locations as needed
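A minimal way to parameterize the release number is sketched below; SILVA has changed its export naming and contents between releases, so confirm that these filenames exist on ftp.arb-silva.de before relying on this pattern:
# Hypothetical parameterization of the download step for a newer release
RELEASE=138
wget -nv http://ftp.arb-silva.de/release_${RELEASE}/Exports/SILVA_${RELEASE}_SSURef_tax_silva_trunc.fasta.gz
wget -nv http://ftp.arb-silva.de/release_${RELEASE}/Exports/SILVA_${RELEASE}_LSURef_tax_silva_trunc.fasta.gz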
Notes and Considerations
- Release specificity: This pipeline is hardcoded for SILVA release 132. URLs and filenames need updating for newer releases
- Resource requirements: Processing requires significant computational resources and time
- Reproducibility: Using a specific SILVA release ensures consistent results across analyses
- JGI integration: Final sketch copy step is specific to JGI computing infrastructure
- Error handling: Pipeline uses "set -e" to stop on any command failure
- Temporary files: Intermediate files are cleaned up automatically where possible
- Performance monitoring: Each major step is wrapped with "time" for execution tracking