Fetch Silva Pipeline

Script: fetchSilva.sh
Source Directory: pipelines/silva/
Author: Brian Bushnell

Automated pipeline for downloading, processing, and preparing SILVA ribosomal RNA database files (release 132) for taxonomic classification with BBTools. This script creates optimized reference databases from SILVA SSU and LSU sequences.

Overview

The Fetch Silva pipeline automates the complex process of downloading and preparing SILVA ribosomal RNA reference sequences for use with BBTools taxonomic classification tools. SILVA (from Latin silva, meaning forest) is a comprehensive, quality-controlled database of aligned ribosomal RNA sequences from the three domains of life.

This pipeline specifically processes SILVA release 132, downloading both Small Subunit (SSU; 16S/18S) and Large Subunit (LSU; 23S/28S) reference sequences, then applying quality filtering, deduplication, taxonomic annotation, and sketch database generation.

Note: This pipeline requires substantial computational resources (up to 31GB RAM for some steps) and network bandwidth for the initial downloads. The final output includes sketch databases optimized for BBTools taxonomic classification.

Prerequisites

System Requirements

  - BBTools installed and available on the PATH
  - Up to 31 GB of RAM for the clumpify.sh and sortbyname.sh steps (-Xmx31g)
  - pigz installed for multithreaded compression (the commands use pigz=20)
  - Network access to ftp.arb-silva.de and sufficient disk space for intermediate files

Database Requirements

  - A BBTools taxonomy tree, located automatically (tree=auto) or via the TAXPATH variable

Pipeline Stages

1. Download Phase

Downloads SILVA release 132 reference sequences from the official FTP server:

1.1 SSU Reference Download

wget -nv http://ftp.arb-silva.de/release_132/Exports/SILVA_132_SSURef_tax_silva_trunc.fasta.gz

Downloads Small Subunit (16S prokaryotic, 18S eukaryotic) reference sequences with taxonomic information.

1.2 LSU Reference Download

wget -nv http://ftp.arb-silva.de/release_132/Exports/SILVA_132_LSURef_tax_silva_trunc.fasta.gz

Downloads Large Subunit (23S prokaryotic, 28S eukaryotic) reference sequences with taxonomic information.

2. LSU Processing Chain

2.1 Format Conversion and Quality Control

reformat.sh in=SILVA_132_LSURef_tax_silva_trunc.fasta.gz out=lsu_utot.fa.gz \
    fastawrap=4000 utot zl=9 qtrim=rl trimq=1 ow minlen=80 pigz=20

Converts U to T (the utot flag; SILVA distributes RNA-alphabet sequences), quality-trims both ends at a threshold of 1, removes sequences shorter than 80 bp, and writes output at gzip compression level 9 using 20 pigz threads.

2.2 Sequence Clustering and Reordering

clumpify.sh in=lsu_utot.fa.gz reorder groups=1 -Xmx31g \
    fastawrap=4000 out=lsu_clumped.fa.gz zl=9 pigz=20

Clusters similar sequences together for better compression and processing efficiency, using 31GB memory.

2.3 Header Prefix Addition

rename.sh addprefix prefix="lcl|LSU" in=lsu_clumped.fa.gz \
    out=lsu_prefix.fa.gz pigz=20 zl=9 fastawrap=4000

Adds "lcl|LSU" prefix to sequence headers for identification as Large Subunit sequences.

2.4 Taxonomic ID Conversion

gi2taxid.sh -Xmx8g tree=auto in=lsu_prefix.fa.gz out=lsu_renamed.fa.gz \
    silva zl=9 pigz=20 taxpath=$TAXPATH

Converts SILVA taxonomic identifiers to NCBI taxonomy format using the BBTools taxonomic database.

2.5 Taxonomic Sorting

sortbyname.sh -Xmx31g in=lsu_renamed.fa.gz out=lsu_sorted.fa.gz \
    ow taxa tree=auto fastawrap=4000 zl=9 pigz=20 allowtemp=f taxpath=$TAXPATH

Sorts sequences by taxonomic hierarchy for efficient access patterns during classification.

2.6 Deduplication

dedupe.sh in=lsu_sorted.fa.gz out=lsu_deduped100pct.fa.gz \
    zl=9 pigz=20 ordered fastawrap=4000

Removes 100% identical sequences while maintaining taxonomic sorting order.

3. SSU Processing Chain

3.1 Format Conversion and Quality Control

reformat.sh in=SILVA_132_SSURef_tax_silva_trunc.fasta.gz out=ssu_utot.fa.gz \
    fastawrap=4000 utot zl=9 qtrim=rl trimq=1 ow minlen=80 pigz=20

Identical processing to the LSU chain: U-to-T conversion, quality trimming, and minimum-length filtering.

3.2 Sequence Clustering and Reordering

clumpify.sh in=ssu_utot.fa.gz reorder groups=1 -Xmx31g \
    fastawrap=4000 out=ssu_clumped.fa.gz zl=9 pigz=20

Clusters similar SSU sequences for processing efficiency.

3.3 Header Prefix Addition

rename.sh addprefix prefix="lcl|SSU" in=ssu_clumped.fa.gz \
    out=ssu_prefix.fa.gz pigz=20 zl=9 fastawrap=4000

Adds "lcl|SSU" prefix to identify Small Subunit sequences.

3.4 Taxonomic ID Conversion

gi2taxid.sh -Xmx8g tree=auto in=ssu_prefix.fa.gz out=ssu_renamed.fa.gz \
    silva zl=9 pigz=20 taxpath=$TAXPATH

Converts SILVA taxonomy to NCBI format for SSU sequences.

3.5 Taxonomic Sorting

sortbyname.sh -Xmx31g in=ssu_renamed.fa.gz out=ssu_sorted.fa.gz \
    ow taxa tree=auto fastawrap=4000 zl=9 pigz=20 allowtemp=f taxpath=$TAXPATH

Sorts SSU sequences by taxonomic hierarchy.

3.6 Deduplication

dedupe.sh in=ssu_sorted.fa.gz out=ssu_deduped_sorted.fa.gz \
    zl=9 pigz=20 ordered fastawrap=4000

Removes identical SSU sequences while preserving order.

4. Database Merging Phase

4.1 Combined Sorted Database

cat ssu_sorted.fa.gz lsu_sorted.fa.gz > both_sorted_temp.fa.gz
sortbyname.sh -Xmx31g in=both_sorted_temp.fa.gz out=both_sorted.fa.gz \
    ow taxa tree=auto fastawrap=4000 zl=9 pigz=20 allowtemp=f taxpath=$TAXPATH

Combines SSU and LSU databases and re-sorts by taxonomy for unified access.
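Concatenating gzip files directly with cat is valid: the result is a multi-member gzip stream, which BBTools readers decompress transparently. A quick way to verify this behavior (filenames here are illustrative):

```shell
# Build two tiny gzip members and concatenate them.
printf 'ACGT' | gzip > a.fa.gz
printf 'TTTT' | gzip > b.fa.gz
cat a.fa.gz b.fa.gz > both.fa.gz

# Decompressing the concatenation yields both payloads in order.
gzip -dc both.fa.gz   # prints ACGTTTTT
```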

4.2 Combined Deduplicated Database

cat ssu_deduped_sorted.fa.gz lsu_deduped100pct.fa.gz > both_deduped_sorted_temp.fa.gz
sortbyname.sh -Xmx31g in=both_deduped_sorted_temp.fa.gz out=both_deduped_sorted.fa.gz \
    ow taxa tree=auto fastawrap=4000 zl=9 pigz=20 allowtemp=f taxpath=$TAXPATH

Creates the primary combined database with duplicates removed and taxonomic sorting maintained.

4.3 Taxonomic Filtering

filterbytaxa.sh include=f in=both_deduped_sorted.fa.gz \
    id=2323,256318,12908 tree=auto requirepresent=f \
    out=both_deduped_sorted_no_unclassified_bacteria.fa.gz zl=9 fastawrap=9999 ow

Excludes unclassified and environmental groups (NCBI taxIDs 2323 "unclassified Bacteria", 256318 "metagenome", and 12908 "unclassified sequences") to improve classification specificity.

5. Sketch Database Generation

5.1 Species-Level Blacklist Generation

sketchblacklist.sh -Xmx16g in=both_deduped_sorted_no_unclassified_bacteria.fa.gz \
    prefilter=f tree=auto taxa silva taxlevel=species ow \
    out=blacklist_silva_species_500.sketch mincount=500

Creates a blacklist sketch at species level with minimum count threshold of 500 to filter out low-abundance taxa from sketch matching.

5.2 Genus-Level Blacklist Generation

sketchblacklist.sh -Xmx16g in=both_deduped_sorted_no_unclassified_bacteria.fa.gz \
    prefilter=f tree=auto taxa silva taxlevel=genus ow \
    out=blacklist_silva_genus_200.sketch mincount=200

Creates a genus-level blacklist with lower threshold (200) for broader taxonomic filtering.

5.3 Blacklist Merging

mergesketch.sh -Xmx1g in=blacklist_silva_species_500.sketch,blacklist_silva_genus_200.sketch \
    out=blacklist_silva_merged.sketch

Combines species and genus blacklists into a single comprehensive filter for sketch generation.

5.4 Taxonomic Sketch Database Creation

sketch.sh files=31 out=both_taxa#.sketch in=both_deduped_sorted.fa.gz \
    size=200 maxgenomefraction=0.1 -Xmx8g tree=auto mode=taxa ow silva \
    blacklist=blacklist_silva_merged.sketch autosize

Creates taxonomically-organized sketch databases split across 31 files. Uses taxa mode for taxonomic classification with maximum genome fraction of 0.1 and automatic sizing.

5.5 Sequence-Based Sketch Database Creation

sketch.sh files=31 out=both_seq#.sketch in=both_deduped_sorted.fa.gz \
    size=200 maxgenomefraction=0.1 -Xmx8g tree=auto mode=sequence ow silva \
    blacklist=blacklist_silva_merged.sketch autosize parsesubunit

Creates sequence-based sketch databases with subunit parsing enabled for detailed sequence matching and classification.

6. Installation Phase

cp blacklist_silva_merged.sketch /global/projectb/sandbox/gaag/bbtools/jgi-bbtools/resources/

Copies the merged blacklist sketch to the BBTools resources directory for system-wide availability (JGI-specific path).

Basic Usage

# 1. Navigate to a working directory with sufficient space
cd /path/to/working/directory

# 2. Run the pipeline
bash fetchSilva.sh

# 3. Monitor progress - each step is timed for performance tracking
# Pipeline will create multiple output files and databases

Warning: This pipeline downloads large files and requires substantial processing time. Ensure adequate disk space and computational resources before starting.

Output Files

Downloaded Raw Files

  - SILVA_132_SSURef_tax_silva_trunc.fasta.gz
  - SILVA_132_LSURef_tax_silva_trunc.fasta.gz

Processed LSU Files

  - lsu_utot.fa.gz, lsu_clumped.fa.gz, lsu_prefix.fa.gz, lsu_renamed.fa.gz
  - lsu_sorted.fa.gz, lsu_deduped100pct.fa.gz

Processed SSU Files

  - ssu_utot.fa.gz, ssu_clumped.fa.gz, ssu_prefix.fa.gz, ssu_renamed.fa.gz
  - ssu_sorted.fa.gz, ssu_deduped_sorted.fa.gz

Combined Database Files

  - both_sorted.fa.gz
  - both_deduped_sorted.fa.gz
  - both_deduped_sorted_no_unclassified_bacteria.fa.gz

Sketch Database Files

  - blacklist_silva_species_500.sketch, blacklist_silva_genus_200.sketch, blacklist_silva_merged.sketch
  - both_taxa#.sketch (expanded to 31 taxonomic sketch files)
  - both_seq#.sketch (expanded to 31 sequence sketch files)

Algorithm Details

Database Processing Strategy

The pipeline uses a multi-stage approach to optimize SILVA data for BBTools:

Quality Control Phase

Initial processing (reformat.sh) removes low-quality sequence ends and short sequences, converts U to T, and standardizes the FASTA formatting.

Clustering and Organization

Clumpify reorders sequences so that similar ones sit together, improving compression ratios and downstream processing efficiency.

Taxonomic Integration

The gi2taxid conversion step maps SILVA identifiers onto NCBI taxonomy IDs using the BBTools taxonomy tree.

Sketch Database Optimization

The sketch generation steps build compact MinHash k-mer databases that support rapid classification.
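As a toy illustration of the MinHash idea behind these sketches (not the BBTools implementation): enumerate every k-mer, hash each one, and keep only the numerically smallest hash values as the sketch.

```shell
# Toy MinHash sketch: enumerate 4-mers of a short sequence, hash each
# with cksum, and keep the 3 smallest hash values as the "sketch".
SEQ=ACGTACGTAC
K=4
for i in $(seq 0 $(( ${#SEQ} - K )) ); do
    printf '%s' "${SEQ:$i:$K}" | cksum
done | sort -n | head -3
```

Two sequences sharing many k-mers will share many of their smallest hashes, so comparing sketches estimates sequence similarity without comparing full sequences.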

Performance Characteristics

Database Background

SILVA Database

SILVA is one of the most comprehensive and widely used ribosomal RNA sequence databases.

BBTools Integration

This pipeline optimizes SILVA data for the BBTools taxonomic classification tools.

Customization Options

Memory Adjustment

For systems with different memory configurations, adjust the -Xmx parameters.
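For example, the heap sizes could be threaded through shell variables (MEM_BIG and MEM_SMALL are hypothetical names, not part of the original script) and scaled to the host:

```shell
# Hypothetical parameterization: the original script hard-codes -Xmx31g
# for clumpify.sh/sortbyname.sh and -Xmx8g for gi2taxid.sh/sketch.sh.
# On a 16 GB node, roughly halve both (echo shows the resulting commands).
MEM_BIG=15g
MEM_SMALL=4g

echo "clumpify.sh in=lsu_utot.fa.gz reorder groups=1 -Xmx$MEM_BIG ..."
echo "gi2taxid.sh -Xmx$MEM_SMALL tree=auto in=lsu_prefix.fa.gz ..."
```

Note that clumpify.sh and sortbyname.sh benefit most from extra memory; reducing their heap too far may force slower multi-pass processing.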

Compression Settings

The compression parameters (zl for gzip level, pigz for thread count) can be tuned for different speed/size tradeoffs.
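A sketch of the tradeoff, with hypothetical variable names: lower zl values compress faster but produce larger files, and pigz sets how many compression threads are used.

```shell
# zl=9 (used throughout the pipeline) favors small files; zl=4-6 is
# noticeably faster. pigz=20 assumes a 20+ core node; reduce it on
# smaller machines. ZL and PIGZ_THREADS are illustrative names.
ZL=6
PIGZ_THREADS=8

echo "reformat.sh in=SILVA_132_LSURef_tax_silva_trunc.fasta.gz out=lsu_utot.fa.gz zl=$ZL pigz=$PIGZ_THREADS"
```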

Taxonomic Path Configuration

The TAXPATH variable controls where the BBTools taxonomic database is located.
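The script expects TAXPATH to point at a directory containing the BBTools taxonomy files; a minimal sketch (the path is illustrative):

```shell
# Point TAXPATH at the directory holding the BBTools taxonomy tree
# (e.g. tree.taxtree.gz); this path is purely illustrative.
export TAXPATH=/path/to/taxonomy

# Steps such as gi2taxid.sh and sortbyname.sh then receive taxpath=$TAXPATH,
# while tree=auto lets them locate the tree file within that directory.
echo "gi2taxid.sh tree=auto taxpath=$TAXPATH"
```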

Troubleshooting

Download Issues

Processing Failures

Performance Issues

Integration with BBTools Workflow

Using Generated Databases

The processed databases can be used with various BBTools for taxonomic analysis:

Rapid Classification with Sketch

# Query the SILVA sketch server with sendsketch
sendsketch.sh in=query.fa.gz silva

# Or compare locally against the generated sequence sketch databases
comparesketch.sh in=query.fa.gz ref=both_seq*.sketch silva

Alignment-based Classification

# Map against the combined sorted database
bbmap.sh in=query.fq.gz ref=both_sorted.fa.gz

# Filter sequences by taxonomy (taxID 2 = Bacteria, shown as an example)
filterbytaxa.sh in=sequences.fa.gz out=bacteria_only.fa.gz include=t id=2 tree=auto

Database Maintenance

To update to newer SILVA releases:

  1. Modify wget URLs to point to newer release
  2. Update file names to match new release numbering
  3. Re-run the complete pipeline
  4. Update resource file locations as needed
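Assuming a newer release keeps the same file-naming scheme as release 132 (this should be verified against the SILVA archive, as naming has changed between releases), the download URLs could be parameterized like this:

```shell
# Hypothetical parameterization of the release number; check the actual
# directory and file names on ftp.arb-silva.de before running.
RELEASE=132
BASE=http://ftp.arb-silva.de/release_${RELEASE}/Exports
SSU=SILVA_${RELEASE}_SSURef_tax_silva_trunc.fasta.gz
LSU=SILVA_${RELEASE}_LSURef_tax_silva_trunc.fasta.gz

echo "wget -nv $BASE/$SSU"
echo "wget -nv $BASE/$LSU"
```

The same variable would also need to be threaded through the reformat.sh input filenames in stages 2.1 and 3.1.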

Notes and Considerations