ReduceSilva

Script: reducesilva.sh
Package: driver
Class: ReduceSilva.java

Reduces Silva entries to one entry per taxon by keeping only the first occurrence of each taxonomic group. The tool splits semicolon-delimited taxonomic names and filters sequences at the specified taxonomic level, assuming the standard format kingdom;phylum;class;order;family;genus;species.

Basic Usage

reducesilva.sh in=<file> out=<file> column=<1>

The tool reads FASTA sequences with semicolon-delimited taxonomic information in the headers and outputs a reduced set containing only the first representative of each taxonomic group at the specified level.

Parameters

Parameters are organized by function in the reduction process.

Core Parameters

column: The taxonomic level to filter by. 0=species, 1=genus, 2=family, 3=order, 4=class, 5=phylum, 6=kingdom. The tool counts from the right end of the semicolon-delimited string. Default: 1 (genus level).
ow=f: (overwrite) Whether to overwrite output files that already exist. Default: false.
zl=4: (ziplevel) Set compression level for output files, 1 (low compression, fast) to 9 (maximum compression, slow). Default: 4.
fastawrap=70: Length of lines in FASTA output. Sequences longer than this value will be wrapped to multiple lines. Default: 70 characters.

Sampling Parameters

reads=-1: Set to a positive number to only process this many INPUT sequences, then quit. Useful for testing or processing large datasets in chunks. Default: -1 (process all sequences).

Java Parameters

-Xmx: This will set Java's memory usage, overriding autodetection. -Xmx20g will specify 20 gigs of RAM, and -Xmx200m will specify 200 megs. The max is typically 85% of physical memory. Default: 1g for this tool.
-eoom: This flag will cause the process to exit if an out-of-memory exception occurs. Requires Java 8u92+.
-da: Disable assertions. May provide a small performance boost in production environments.

Examples

Basic Genus-Level Reduction

reducesilva.sh in=silva_full.fasta out=silva_genus.fasta column=1

Reduces the Silva database to one representative per genus. For sequences with headers like "Bacteria;Proteobacteria;Gammaproteobacteria;Enterobacteriales;Enterobacteriaceae;Escherichia;E.coli", the genus field is Escherichia, so only the first Escherichia sequence encountered is kept; later sequences from the same genus, whatever their species, are discarded.

Species-Level Reduction

reducesilva.sh in=silva_full.fasta out=silva_species.fasta column=0

Keeps only one representative per species (finest taxonomic resolution).

Family-Level Reduction with Compression

reducesilva.sh in=silva_full.fasta out=silva_family.fasta.gz column=2 zl=9

Reduces to family level with maximum compression. Useful for creating compact reference databases.

Testing with Limited Input

reducesilva.sh in=silva_full.fasta out=test_output.fasta column=1 reads=1000

Processes only the first 1000 input sequences, which is useful for testing parameters or validating a workflow.

Algorithm Details

Taxonomic Processing Strategy

ReduceSilva uses a hash-based deduplication approach to efficiently filter taxonomic sequences:

Parsing Method

Header Splitting: Sequence headers are split on semicolons to extract taxonomic hierarchy
Level Selection: The taxonomic level is selected by counting from the right end (species=0, genus=1, family=2, etc.)
Reliability Limitation: The tool assumes standard Silva format (kingdom;phylum;class;order;family;genus;species) and may not work reliably with non-standard taxonomic strings
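
The level selection described above can be sketched in Java as follows. This is an illustration under stated assumptions, not the tool's source: the class TaxonLevelSketch and the method taxonAt are hypothetical names, and the header is assumed to contain only the semicolon-delimited taxonomy string.

```java
import java.util.Optional;

// Illustrative sketch only; names are hypothetical and not taken from ReduceSilva.java.
class TaxonLevelSketch {

    /**
     * Returns the field located `column` positions from the right end of a
     * semicolon-delimited header (0 = rightmost field), or an empty Optional
     * when the header has too few fields or the column is negative.
     */
    static Optional<String> taxonAt(String header, int column) {
        String[] fields = header.split(";");
        int index = fields.length - 1 - column;
        if (column < 0 || index < 0) {
            return Optional.empty();
        }
        return Optional.of(fields[index]);
    }

    public static void main(String[] args) {
        String header = "Bacteria;Proteobacteria;Gammaproteobacteria;"
                + "Enterobacteriales;Enterobacteriaceae;Escherichia;E.coli";
        System.out.println(taxonAt(header, 0).orElse("(none)")); // prints E.coli
        System.out.println(taxonAt(header, 1).orElse("(none)")); // prints Escherichia
        System.out.println(taxonAt(header, 6).orElse("(none)")); // prints Bacteria
        // A two-field header has no field five positions from the right.
        System.out.println(taxonAt("Bacteria;Proteobacteria", 5).orElse("(none)")); // prints (none)
    }
}
```

Counting from the right means the species field is always column 0, regardless of how many higher ranks precede it in the header.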

Deduplication Algorithm

Hash Table Tracking: Uses a HashSet<String> to track previously seen taxonomic names at the specified level
First-Occurrence Rule: Only the first sequence encountered for each taxonomic group is retained
Memory Efficiency: Stores only the taxonomic names (not full sequences) in memory for tracking
Streaming Processing: Processes sequences one at a time, allowing handling of very large databases
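
The first-occurrence rule above can be sketched with a HashSet<String> as shown below. This is a minimal illustration, not the tool's actual code: the class FirstOccurrenceSketch, the method reduce, and the abbreviated three-field headers are all hypothetical, and the sketch works on header strings only, whereas the real tool reads and writes FASTA records.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Illustrative sketch only; names are hypothetical and not taken from ReduceSilva.java.
class FirstOccurrenceSketch {

    /** Keeps a header only if its taxon at `column` (counted from the right) has not been seen. */
    static List<String> reduce(List<String> headers, int column) {
        if (column < 0) {
            throw new IllegalArgumentException("column must be non-negative");
        }
        Set<String> seen = new HashSet<>(); // stores taxon names only, never sequences
        List<String> kept = new ArrayList<>();
        for (String header : headers) {
            String[] fields = header.split(";");
            int index = fields.length - 1 - column;
            // Headers with too few fields are kept without recording a taxon.
            // Set.add returns false when the name is already present.
            if (index < 0 || seen.add(fields[index])) {
                kept.add(header);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Abbreviated, made-up headers (family;genus;species) for illustration.
        List<String> headers = Arrays.asList(
                "Enterobacteriaceae;Escherichia;E.coli",
                "Enterobacteriaceae;Escherichia;E.albertii",
                "Enterobacteriaceae;Salmonella;S.enterica",
                "Unclassified");
        // Genus level (column=1): the second Escherichia header is dropped,
        // and the single-field header is kept because it has no genus field.
        for (String header : reduce(headers, 1)) {
            System.out.println(header);
        }
    }
}
```

Because only the taxon names are held in the set, memory in this sketch grows with the number of distinct names at the chosen level rather than with the number of input records.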

Performance Characteristics

Memory Usage: Scales with the number of unique taxa at the specified level, not total sequence count
Processing Speed: Expected linear time, O(n), where n is the number of input sequences, since each hash-set lookup and insertion takes constant time on average
I/O Efficiency: Supports compressed input/output and streaming processing for large files
Default Memory: Allocates 1GB RAM by default, suitable for most Silva database sizes

Use Cases

Database Curation: Creating non-redundant reference databases from Silva
Phylogenetic Analysis: Reducing computational complexity by limiting to one representative per taxonomic group
Pipeline Preprocessing: Preparing simplified reference sets for downstream analysis
Storage Optimization: Reducing database size while maintaining taxonomic coverage

Limitations and Considerations

Format Dependency: Requires semicolon-delimited taxonomic strings in standard Silva format
Order Sensitivity: The "first occurrence" rule means results depend on input sequence order
Quality Bias: No quality assessment is performed, so the first sequence may not be the best representative
Incomplete Hierarchies: Sequences with fewer taxonomic levels than specified by 'column' are kept unchanged
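
The order sensitivity noted above can be demonstrated with a small Java sketch. It is an illustration only: the class OrderSensitivitySketch, the method representatives, and the abbreviated headers are hypothetical, and the genus is assumed to be the second field from the right.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch only; names are hypothetical and not taken from ReduceSilva.java.
class OrderSensitivitySketch {

    /** Maps each genus (second field from the right) to the first header seen for it. */
    static Map<String, String> representatives(List<String> headers) {
        Map<String, String> firstByGenus = new LinkedHashMap<>();
        for (String header : headers) {
            String[] fields = header.split(";");
            // Headers without a genus field are skipped in this map.
            if (fields.length >= 2) {
                // putIfAbsent leaves an existing entry untouched, so the first header wins.
                firstByGenus.putIfAbsent(fields[fields.length - 2], header);
            }
        }
        return firstByGenus;
    }

    public static void main(String[] args) {
        List<String> forward = Arrays.asList(
                "Enterobacteriaceae;Escherichia;E.coli",
                "Enterobacteriaceae;Escherichia;E.albertii");
        List<String> reversed = new ArrayList<>(forward);
        Collections.reverse(reversed);

        // Same records, different order, different representative for Escherichia.
        System.out.println(representatives(forward).get("Escherichia"));  // prints Enterobacteriaceae;Escherichia;E.coli
        System.out.println(representatives(reversed).get("Escherichia")); // prints Enterobacteriaceae;Escherichia;E.albertii
    }
}
```

If a particular representative is preferred for each group, the input needs to be ordered so that the preferred record appears first.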

Support

For questions and support:

Email: bbushnell@lbl.gov
Documentation: bbmap.org