FilterSilva

Script: filtersilva.sh
Package: prok
Class: FilterSilva.java

Removes unwanted sequences from a Silva database, particularly bacteria flagged as eukaryotes due to name misidentification.

Basic Usage

filtersilva.sh in=x.fa out=y.fa

FilterSilva is designed to clean Silva databases by removing sequences that are incorrectly classified due to taxonomic naming issues. The tool specifically targets bacterial sequences that have been misidentified as eukaryotes.

Parameters

FilterSilva takes only a few parameters: the input file, the output file, and the taxonomic tree.

Standard parameters

in=<file>
Input fasta file containing Silva sequences to be filtered. The sequences must have Silva-formatted headers containing taxonomic information.
out=<file>
Output file for filtered sequences. Only sequences that pass the filtering criteria will be written to this file.
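
Because the input must carry Silva-formatted headers, it can help to check what the headers actually contain before filtering. The following is a minimal sketch, not part of FilterSilva: the function name silva_domain_counts is hypothetical, and it assumes each header has the form ">ACCESSION.START.STOP Domain;Rank;...", with the taxonomy path starting after the first space.

```shell
# Hypothetical helper (not part of BBTools): tally FASTA records by the
# first field of the semicolon-delimited taxonomy path in each header.
# Assumes headers look like ">ACCESSION.START.STOP Domain;Rank2;...".
silva_domain_counts() {
    grep '^>' "$1" \
        | awk '{ split($2, lineage, ";"); counts[lineage[1]]++ }
               END { for (d in counts) print d "\t" counts[d] }' \
        | sort
}
```

Running silva_domain_counts x.fa prints one tab-separated line per top-level label with its record count. Headers that do not follow the assumed layout are counted under an empty label, which is a quick sign that the file may not be Silva-formatted.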

Additional file parameters

tree=auto
Path to TaxTree file used for taxonomic parsing and classification. When set to "auto", the tool will attempt to locate the appropriate taxonomic tree automatically. The tree is essential for determining whether sequences are correctly classified as eukaryotes, bacteria, or archaea.

Java Parameters

-Xmx
This will set Java's memory usage, overriding autodetection. -Xmx20g will specify 20 gigs of RAM, and -Xmx200m will specify 200 megs. The max is typically 85% of physical memory. Default: 4g
-da
Disable assertions.
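
As an illustration of the 85% figure mentioned for -Xmx, the arithmetic can be sketched in shell. This is only an assumption-laden example of the calculation, not how the script performs its own autodetection; the function name xmx_from_kb is hypothetical, and reading MemTotal from /proc/meminfo works on Linux only.

```shell
# Hypothetical helper: turn a physical memory size in kB into an -Xmx flag
# set to 85% of that size, expressed in megabytes (integer arithmetic).
xmx_from_kb() {
    echo "-Xmx$(( $1 * 85 / 100 / 1024 ))m"
}

# Example use on Linux (assumes /proc/meminfo is available):
#   total_kb=$(awk '/^MemTotal:/ { print $2 }' /proc/meminfo)
#   filtersilva.sh "$(xmx_from_kb "$total_kb")" in=x.fa out=y.fa
```

Leaving -Xmx unset and relying on the defaults is usually simpler; an explicit value is mainly useful when a fixed limit is required, for example on a shared machine.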

Examples

Basic Silva Filtering

filtersilva.sh in=silva_database.fa out=silva_filtered.fa

Filter a Silva database file to remove problematic sequences, using automatic taxonomic tree detection.
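
To see how many sequences a run removed, the record counts of the input and output files can be compared. This is a small sketch using standard tools only; the function names count_fasta_records and report_removed are hypothetical, and it assumes uncompressed FASTA files in which every record starts with a ">" line.

```shell
# Hypothetical helpers: count FASTA records and report the difference
# between an input file and its filtered output.
count_fasta_records() {
    # grep -c exits non-zero when the count is 0, so mask the status.
    grep -c '^>' "$1" || true
}

report_removed() {
    before=$(count_fasta_records "$1")
    after=$(count_fasta_records "$2")
    echo "kept=$after removed=$((before - after))"
}
```

After the command above, report_removed silva_database.fa silva_filtered.fa prints the number of records kept and removed.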

Custom Taxonomic Tree

filtersilva.sh in=silva_raw.fa out=silva_clean.fa tree=/path/to/custom/taxtree.txt

Use a specific taxonomic tree file for classification instead of the automatic detection.

Memory-Intensive Processing

filtersilva.sh -Xmx16g in=large_silva.fa out=filtered_silva.fa

Process a large Silva database with the Java memory limit raised to 16 GB.

Algorithm Details

FilterSilva implements taxonomic sequence filtering using specialized header parsing and taxonomic tree validation:

Taxonomic Classification Engine

The filtering algorithm uses a two-stage taxonomic validation system.

Filtering Logic Implementation

The process() method applies sequential string-based filtering criteria to eukaryotic sequences.
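
The general idea of sequential string checks on eukaryote-labelled headers can be sketched with awk. This is a hypothetical illustration only: the specific criteria FilterSilva applies are not listed here, so the keyword list is supplied by the caller and is purely an assumption, as are the function name filter_euk_by_keywords and the header layout ">ACCESSION.START.STOP Domain;Rank;...".

```shell
# Hypothetical sketch, NOT FilterSilva's actual criteria: drop records whose
# taxonomy path starts with "Eukaryota" and contains any keyword from a
# caller-supplied, semicolon-separated list. All other records are kept.
filter_euk_by_keywords() {
    awk -v keywords="$2" '
        BEGIN { n = split(keywords, kw, ";") }
        /^>/ {
            keep = 1
            # Text after the first space is treated as the taxonomy path.
            sp = index($0, " ")
            lineage = (sp > 0) ? substr($0, sp + 1) : ""
            if (index(lineage, "Eukaryota") == 1) {
                # Apply each string check in turn; the first match rejects.
                for (i = 1; i <= n; i++) {
                    if (kw[i] != "" && index(lineage, kw[i]) > 0) { keep = 0; break }
                }
            }
        }
        # Sequence lines follow the decision made for their header.
        keep { print }
    ' "$1"
}
```

For example, filter_euk_by_keywords x.fa "Bacillus;Escherichia" > y.fa would drop eukaryote-labelled records whose lineage mentions either genus. Plain substring matching like this is crude and can reject legitimate records, which is one reason the real tool also consults a taxonomic tree.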

Streaming I/O Architecture

Statistical Tracking

The implementation maintains processing metrics through dedicated counters.

Implementation Note: The code includes a comment acknowledging that distinguishing 16S from 18S sequences in eukaryotes requires additional downstream processing beyond header-based filtering.

Related Tools

For more comprehensive information on Silva database processing and related tools, see:

Support

For questions and support: