FilterSilva
Removes unwanted sequences from a Silva database, particularly bacteria flagged as eukaryotes due to name misidentification.
Basic Usage
filtersilva.sh in=x.fa out=y.fa
FilterSilva is designed to clean Silva databases by removing sequences that are incorrectly classified due to taxonomic naming issues. The tool specifically targets bacterial sequences that have been misidentified as eukaryotes.
Parameters
FilterSilva has minimal parameters focused on input/output and taxonomic tree specification.
Standard parameters
- in=<file>
- Input fasta file containing Silva sequences to be filtered. The sequences must have Silva-formatted headers containing taxonomic information.
- out=<file>
- Output file for filtered sequences. Only sequences that pass the filtering criteria will be written to this file.
Additional file parameters
- tree=auto
- Path to TaxTree file used for taxonomic parsing and classification. When set to "auto", the tool will attempt to locate the appropriate taxonomic tree automatically. The tree is essential for determining whether sequences are correctly classified as eukaryotes, bacteria, or archaea.
Java Parameters
- -Xmx
- This will set Java's memory usage, overriding autodetection. -Xmx20g will specify 20 gigs of RAM, and -Xmx200m will specify 200 megs. The max is typically 85% of physical memory. Default: 4g
- -da
- Disable assertions.
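As a rough illustration of the 85% guideline mentioned for -Xmx, the shell helper below turns a physical-memory figure in kilobytes into an -Xmx value. The name xmx_from_kb is hypothetical and not part of BBTools; the helper only reproduces the arithmetic and makes no claim about how the tool's own autodetection works.

```shell
# Hypothetical helper, not part of BBTools: print 85% of a memory size
# (given in kilobytes) as a whole number of megabytes with an "m" suffix.
xmx_from_kb() {
  total_kb=$1
  echo "$(( total_kb * 85 / 100 / 1024 ))m"
}

# Example (Linux only, assumes /proc/meminfo is available):
#   mem_kb=$(awk '/^MemTotal:/ {print $2}' /proc/meminfo)
#   filtersilva.sh -Xmx"$(xmx_from_kb "$mem_kb")" in=x.fa out=y.fa
```

Integer division rounds down, so the result never exceeds 85% of the figure passed in.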
Examples
Basic Silva Filtering
filtersilva.sh in=silva_database.fa out=silva_filtered.fa
Filter a Silva database file to remove problematic sequences, using automatic taxonomic tree detection.
Custom Taxonomic Tree
filtersilva.sh in=silva_raw.fa out=silva_clean.fa tree=/path/to/custom/taxtree.txt
Use a specific taxonomic tree file for classification instead of the automatic detection.
Memory-Intensive Processing
filtersilva.sh -Xmx16g in=large_silva.fa out=filtered_silva.fa
Process a large Silva database with the Java heap limit raised to 16 GB, overriding the default memory setting.
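After any of the runs above, it can be useful to see which records were dropped. The sketch below lists headers that occur in the input but not in the output. The name removed_headers is hypothetical, and the approach assumes that retained headers are written unchanged, which this page does not state explicitly.

```shell
# Hypothetical helper: list FASTA headers present in the input file ($1)
# but absent from the filtered output ($2).
# Assumes retained headers are written byte-for-byte unchanged.
removed_headers() {
  awk '
    FILENAME == ARGV[1] { if (/^>/) kept[$0] = 1; next }  # first file: filtered output
    /^>/ && !($0 in kept)                                 # second file: original input
  ' "$2" "$1"
}

# Example: removed_headers silva_database.fa silva_filtered.fa | head
```

Comparing on FILENAME rather than on record counts keeps the helper correct even when the filtered file is empty.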
Algorithm Details
FilterSilva implements taxonomic sequence filtering using specialized header parsing and taxonomic tree validation:
Taxonomic Classification Engine
The filtering algorithm validates each sequence's taxonomy in three steps:
- Silva Mode Activation: Sets TaxTree.SILVA_MODE=true to enable Silva-specific header parsing conventions and taxonomic hierarchy interpretation.
- Header Parsing via TaxTree.parseNodeFromHeader(): Extracts taxonomic information from Silva-formatted sequence headers, returning null for unparseable headers (which are immediately excluded).
- Eukaryote Classification via tree.isEukaryote(): Uses TaxNode ID-based lookup to determine if a sequence belongs to eukaryotic lineages.
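To show the kind of information the header parsing step works from, the sketch below splits Silva-style headers into accession, first lineage entry, and last lineage entry. It assumes the common Silva layout of an accession, a space, and a semicolon-separated lineage. silva_header_fields is a hypothetical helper and does not reproduce TaxTree.parseNodeFromHeader(), whose internals are not described here.

```shell
# Hypothetical helper: print "accession<TAB>first rank<TAB>last rank"
# for every FASTA header in the given files (or stdin).
# Assumed header layout: >ACCESSION rank1;rank2;...;final name
silva_header_fields() {
  awk '
    /^>/ {
      sp = index($0, " ")
      if (sp == 0) {                       # no lineage after the accession
        print substr($0, 2) "\tNA\tNA"
        next
      }
      acc = substr($0, 2, sp - 2)
      n = split(substr($0, sp + 1), rank, ";")
      if (n > 1 && rank[n] == "") n--      # tolerate a trailing semicolon
      print acc "\t" rank[1] "\t" rank[n]
    }
  ' "$@"
}

# Example: silva_header_fields silva_database.fa | cut -f2 | sort | uniq -c
```

Headers reported with NA carry no lineage text at all, so they are candidates for the unparseable case described above.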
Filtering Logic Implementation
The process() method applies sequential string-based filtering criteria to eukaryotic sequences:
- Organellar Sequence Detection: Excludes sequences whose headers contain ";Chloroplast;" or "Mitochondria", removing organellar sequences, which derive from bacterial endosymbionts.
- Prokaryotic Contamination Removal: Filters out sequences containing "Bacteria;" or "Archaea;" strings from eukaryotic classifications, addressing systematic misclassification issues.
- Unparseable Header Exclusion: Regardless of lineage, rejects any sequence for which TaxTree.parseNodeFromHeader() returns null, so every retained sequence has a valid taxonomic assignment.
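The rules above can be sketched in awk. This is an approximation under a loud assumption: a record is treated as eukaryotic when its lineage text starts with "Eukaryota;", and as unparseable when the header has no lineage text. The real tool asks a TaxTree instead, so it can catch name-based misclassifications that this stand-in cannot. The name silva_header_filter is hypothetical; the sketch shows only the order and shape of the string checks.

```shell
# Hypothetical helper approximating the header rules described above.
# ASSUMPTION: "eukaryotic" means the lineage starts with "Eukaryota;".
# FilterSilva itself decides this through a TaxTree lookup.
silva_header_filter() {
  awk '
    /^>/ {
      keep = 1
      sp = index($0, " ")
      lineage = (sp > 0) ? substr($0, sp + 1) : ""
      if (lineage == "") {
        keep = 0                                    # stand-in for a null parse result
      } else if (index(lineage, "Eukaryota;") == 1) {
        if (index($0, ";Chloroplast;") || index($0, "Mitochondria")) keep = 0
        else if (index($0, "Bacteria;") || index($0, "Archaea;")) keep = 0
      }
    }
    keep { print }                                  # header and its sequence lines
  ' "$@"
}

# Example: silva_header_filter silva_raw.fa > approx_filtered.fa
```

Records whose lineage does not start with "Eukaryota;" pass through untouched, mirroring the point that the string checks apply only to sequences classified as eukaryotes.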
Streaming I/O Architecture
- Concurrent Stream Processing: Uses ConcurrentReadInputStream and ConcurrentReadOutputStream for parallel read/write operations.
- Single-Pass Architecture: Writes each retained sequence as soon as it is processed, with no intermediate storage, so datasets larger than available memory can be processed.
- Buffer Management: Caps stream buffers at 4 via Shared.capBuffers(4) and uses a buffer size of 4 (buff=4) for the concurrent streams, which suits the single-threaded filtering loop.
Statistical Tracking
The implementation maintains comprehensive processing metrics through dedicated counters:
- Read-Level Statistics: Tracks the readsProcessed and readsOut counters, from which the retention rate can be derived.
- Base-Level Statistics: Tracks the basesProcessed and basesOut counters, which show how filtering affects the total sequence length.
- Performance Reporting: Uses Tools.timeReadsBasesProcessed() and Tools.readsBasesOut() for standardized processing reports.
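The counters can be cross-checked independently of the tool. The sketch below counts records and bases in a FASTA file. Run on the input and then on the output, its figures should correspond to readsProcessed/basesProcessed and readsOut/basesOut, assuming the counters cover every record read and written. fasta_counts is a hypothetical helper, not a BBTools command, and it expects uncompressed FASTA.

```shell
# Hypothetical helper: print "<records> <bases>" for FASTA files or stdin.
fasta_counts() {
  awk '
    /^>/ { reads++; next }                       # one header per record
    { gsub(/[ \t\r]/, ""); bases += length($0) } # ignore whitespace in sequence lines
    END { print reads + 0, bases + 0 }
  ' "$@"
}

# Example:
#   fasta_counts silva_database.fa
#   fasta_counts silva_filtered.fa
```

Dividing the second record count by the first gives the retention rate for the run.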
Implementation Note: The code includes a comment acknowledging that distinguishing 16S from 18S sequences in eukaryotes requires additional downstream processing beyond header-based filtering.
Related Tools
For more comprehensive information on Silva database processing and related tools, see:
- BBSketch Guide: /bbtools/docs/guides/BBSketchGuide.txt contains detailed information about Silva database processing workflows.
- Taxonomy Tools: Other BBTools taxonomy-related utilities for database processing and classification.
Support
For questions and support:
- Email: bbushnell@lbl.gov
- Documentation: bbmap.org