ReduceSilva
Reduces Silva entries to one entry per taxon by keeping only the first occurrence of each taxonomic group. The tool splits semicolon-delimited taxonomic names and filters sequences at the specified taxonomic level, assuming the standard format: kingdom;phylum;class;order;family;genus;species.
Basic Usage
reducesilva.sh in=<file> out=<file> column=<1>
The tool reads FASTA sequences with semicolon-delimited taxonomic information in the headers and outputs a reduced set containing only the first representative of each taxonomic group at the specified level.
Parameters
Parameters are organized by function in the reduction process.
Core Parameters
- column
- The taxonomic level to filter by. 0=species, 1=genus, 2=family, 3=order, 4=class, 5=phylum, 6=kingdom. The tool counts from the right end of the semicolon-delimited string. Default: 1 (genus level).
- ow=f
- (overwrite) Overwrites output files that already exist. Default: false.
- zl=4
- (ziplevel) Set compression level for output files, 1 (low compression, fast) to 9 (maximum compression, slow). Default: 4.
- fastawrap=70
- Length of lines in FASTA output. Sequences longer than this value will be wrapped to multiple lines. Default: 70 characters.
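To illustrate what the fastawrap setting means for the output, here is a minimal Python sketch of line wrapping. It is not the tool's own code; the function name wrap_fasta and the sample record are hypothetical, and it only assumes that wrapping breaks the sequence into lines of at most the given width.

```python
def wrap_fasta(header: str, sequence: str, width: int = 70) -> str:
    """Format one FASTA record with sequence lines of at most `width` characters."""
    lines = [">" + header]
    # Slice the sequence into consecutive chunks of `width` characters.
    for start in range(0, len(sequence), width):
        lines.append(sequence[start:start + width])
    return "\n".join(lines) + "\n"


# Hypothetical 160-base record, wrapped at the documented default of 70.
record = wrap_fasta("example", "ACGT" * 40, width=70)
print(record, end="")
```

With a 160-base sequence and a width of 70, the sketch emits two full lines followed by one shorter line.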
Sampling Parameters
- reads=-1
- Set to a positive number to only process this many INPUT sequences, then quit. Useful for testing or processing large datasets in chunks. Default: -1 (process all sequences).
Java Parameters
- -Xmx
- Sets Java's maximum heap size, overriding autodetection. -Xmx20g specifies 20 GB of RAM, and -Xmx200m specifies 200 MB. The maximum is typically 85% of physical memory. Default: 1g for this tool.
- -eoom
- This flag will cause the process to exit if an out-of-memory exception occurs. Requires Java 8u92+.
- -da
- Disable assertions. May provide a small performance boost in production environments.
Examples
Basic Genus-Level Reduction
reducesilva.sh in=silva_full.fasta out=silva_genus.fasta column=1
Reduces the Silva database to one representative per genus. For sequences with headers like "Bacteria;Proteobacteria;Gammaproteobacteria;Enterobacteriales;Enterobacteriaceae;Escherichia;E.coli", only the first Escherichia sequence encountered will be kept, regardless of its species.
Species-Level Reduction
reducesilva.sh in=silva_full.fasta out=silva_species.fasta column=0
Keeps only one representative per species (finest taxonomic resolution).
Family-Level Reduction with Compression
reducesilva.sh in=silva_full.fasta out=silva_family.fasta.gz column=2 zl=9
Reduces to family level with maximum compression. Useful for creating compact reference databases.
Testing with Limited Input
reducesilva.sh in=silva_full.fasta out=test_output.fasta column=1 reads=1000
Processes only the first 1000 input sequences, which is useful for testing parameters or validating a workflow before a full run.
Algorithm Details
Taxonomic Processing Strategy
ReduceSilva uses hash-based deduplication to keep one sequence per taxon at the chosen level:
Parsing Method
- Header Splitting: Sequence headers are split on semicolons to extract taxonomic hierarchy
- Level Selection: The taxonomic level is selected by counting from the right end (species=0, genus=1, family=2, etc.)
- Reliability Limitation: The tool assumes standard Silva format (kingdom;phylum;class;order;family;genus;species) and may not work reliably with non-standard taxonomic strings
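The right-to-left level selection described above can be sketched in Python as follows. This is an illustration under stated assumptions rather than the tool's implementation: the function name taxon_at_level is hypothetical, the header is assumed to contain only the taxonomy string, and returning None for headers with too few fields is a choice made for this sketch.

```python
from typing import Optional


def taxon_at_level(header: str, column: int) -> Optional[str]:
    """Return the field `column` positions from the right end of the header."""
    fields = header.split(";")
    if column < 0 or column >= len(fields):
        # Hypothetical handling: signal that the header has no such level.
        return None
    return fields[len(fields) - 1 - column]


header = ("Bacteria;Proteobacteria;Gammaproteobacteria;Enterobacteriales;"
          "Enterobacteriaceae;Escherichia;E.coli")

# column 0 selects the species field, column 1 the genus, and so on.
for column in range(7):
    print(column, taxon_at_level(header, column))
```

Because the count starts at the right end, a header with extra leading fields still yields the same species and genus, while a header with missing trailing fields shifts every level.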
Deduplication Algorithm
- Hash Table Tracking: Uses a HashSet<String> to track previously seen taxonomic names at the specified level
- First-Occurrence Rule: Only the first sequence encountered for each taxonomic group is retained
- Memory Efficiency: Stores only the taxonomic names (not full sequences) in memory for tracking
- Streaming Processing: Processes sequences one at a time, allowing handling of very large databases
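The first-occurrence rule can be sketched with a set and a generator. This is a minimal Python sketch, not the tool's Java code: the function name reduce_first_occurrence and the sample records are hypothetical, and short headers are passed through only because the limitations listed in this document say such sequences are kept.

```python
from typing import Iterable, Iterator, Set, Tuple

Record = Tuple[str, str]  # (header, sequence)


def reduce_first_occurrence(records: Iterable[Record], column: int = 1) -> Iterator[Record]:
    """Yield the first record seen for each taxon at the given level."""
    seen: Set[str] = set()  # holds taxon names only, never sequences
    for header, sequence in records:
        fields = header.split(";")
        if column >= len(fields):
            # Assumption taken from this document's limitations list:
            # headers with too few levels are kept.
            yield header, sequence
            continue
        taxon = fields[len(fields) - 1 - column]
        if taxon not in seen:
            seen.add(taxon)
            yield header, sequence


prefix = "Bacteria;Proteobacteria;Gammaproteobacteria;Enterobacteriales;Enterobacteriaceae;"
records = [
    (prefix + "Escherichia;E.coli", "ACGTACGT"),
    (prefix + "Escherichia;E.fergusonii", "ACGTACGA"),
    (prefix + "Salmonella;S.enterica", "TTGAACGT"),
]

kept_genus = list(reduce_first_occurrence(records, column=1))
kept_species = list(reduce_first_occurrence(records, column=0))
print(len(kept_genus), len(kept_species))
```

Since the generator handles one record at a time and stores only names, its memory footprint grows with the number of distinct taxa rather than with the size of the input.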
Performance Characteristics
- Memory Usage: Scales with the number of unique taxa at the specified level, not total sequence count
- Processing Speed: Linear time complexity O(n) where n is the number of input sequences
- I/O Efficiency: Supports compressed input/output and streaming processing for large files
- Default Memory: Allocates 1GB RAM by default, suitable for most Silva database sizes
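To get a feel for how the tracking set grows, one could count the distinct names at each level of a header sample before running the tool. The Python sketch below is a hypothetical helper, not part of ReduceSilva; the function name unique_taxa and the four sample headers are made up for illustration.

```python
from typing import Iterable, Set


def unique_taxa(headers: Iterable[str], column: int) -> Set[str]:
    """Collect the distinct names found `column` fields from the right end."""
    names: Set[str] = set()
    for header in headers:
        fields = header.split(";")
        if column < len(fields):
            names.add(fields[-1 - column])
    return names


enterobacteriaceae = ("Bacteria;Proteobacteria;Gammaproteobacteria;"
                      "Enterobacteriales;Enterobacteriaceae;")
headers = [
    enterobacteriaceae + "Escherichia;E.coli",
    enterobacteriaceae + "Escherichia;E.fergusonii",
    enterobacteriaceae + "Salmonella;S.enterica",
    "Bacteria;Firmicutes;Bacilli;Bacillales;Bacillaceae;Bacillus;B.subtilis",
]

# Number of names the tracking set would hold for each column value.
counts = {column: len(unique_taxa(headers, column)) for column in range(7)}
print(counts)
```

The count for a given column is also an estimate of how many records the reduced output would contain at that level.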
Use Cases
- Database Curation: Creating non-redundant reference databases from Silva
- Phylogenetic Analysis: Reducing computational complexity by limiting to one representative per taxonomic group
- Pipeline Preprocessing: Preparing simplified reference sets for downstream analysis
- Storage Optimization: Reducing database size while maintaining taxonomic coverage
Limitations and Considerations
- Format Dependency: Requires semicolon-delimited taxonomic strings in standard Silva format
- Order Sensitivity: The "first occurrence" rule means results depend on input sequence order
- Quality Bias: No quality assessment - the first sequence may not be the best representative
- Incomplete Hierarchies: Sequences with fewer taxonomic levels than specified by 'column' are kept unchanged
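The order sensitivity noted above is easy to demonstrate. In this hypothetical Python sketch (the function first_per_genus and both records are invented for illustration), the same two records produce a different genus representative depending on which comes first, and the longer sequence is not preferred.

```python
from typing import List, Set, Tuple


def first_per_genus(records: List[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Keep the first record seen for each genus (second field from the right)."""
    seen: Set[str] = set()
    kept = []
    for header, sequence in records:
        genus = header.split(";")[-2]
        if genus not in seen:
            seen.add(genus)
            kept.append((header, sequence))
    return kept


prefix = "Bacteria;Proteobacteria;Gammaproteobacteria;Enterobacteriales;Enterobacteriaceae;"
records = [
    (prefix + "Escherichia;E.coli", "ACGTACGTACGT"),
    (prefix + "Escherichia;E.fergusonii", "ACGT"),
]

forward = first_per_genus(records)
backward = first_per_genus(list(reversed(records)))
print(forward[0][0].split(";")[-1])
print(backward[0][0].split(";")[-1])
```

If a particular representative matters, for example the longest sequence per genus, the input would need to be ordered accordingly before reduction.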
Support
For questions and support:
- Email: bbushnell@lbl.gov
- Documentation: bbmap.org