ReduceSilva

Script: reducesilva.sh Package: driver Class: ReduceSilva.java

Reduces Silva entries to one entry per taxon by keeping only the first occurrence of each taxonomic group. The tool splits the semicolon-delimited taxonomic names in each header and filters sequences at the specified taxonomic level, assuming the standard format: kingdom;phylum;class;order;family;genus;species.

Basic Usage

reducesilva.sh in=<file> out=<file> column=<1>

The tool reads FASTA sequences with semicolon-delimited taxonomic information in the headers and outputs a reduced set containing only the first representative of each taxonomic group at the specified level.

Parameters

Parameters are organized by function in the reduction process.

Core Parameters

column=1
The taxonomic level to filter by. 0=species, 1=genus, 2=family, 3=order, 4=class, 5=phylum, 6=kingdom. The tool counts from the right end of the semicolon-delimited string. Default: 1 (genus level).
ow=f
(overwrite) Overwrites output files that already exist. Default: false.
zl=4
(ziplevel) Set compression level for output files, 1 (low compression, fast) to 9 (maximum compression, slow). Default: 4.
fastawrap=70
Length of lines in FASTA output. Sequences longer than this value will be wrapped to multiple lines. Default: 70 characters.
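
To make the fastawrap behavior concrete, here is a minimal Python sketch of wrapping a sequence at a fixed line length. The helper name format_fasta is hypothetical and is not part of the tool; the sketch assumes a positive width and only illustrates the wrapping described above.

```python
def format_fasta(header: str, sequence: str, width: int = 70) -> str:
    """Format one FASTA record, wrapping the sequence every `width` characters.

    Hypothetical helper for illustration; assumes width >= 1.
    """
    if width < 1:
        raise ValueError("width must be positive in this sketch")
    lines = [">" + header]
    # Slice the sequence into consecutive chunks of at most `width` characters.
    lines.extend(sequence[i:i + width] for i in range(0, len(sequence), width))
    return "\n".join(lines) + "\n"


record = format_fasta("example", "A" * 150)
# A 150-base sequence at width 70 yields lines of 70, 70, and 10 bases.
line_lengths = [len(line) for line in record.splitlines()[1:]]
print(line_lengths)
```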

Sampling Parameters

reads=-1
Set to a positive number to only process this many INPUT sequences, then quit. Useful for testing or processing large datasets in chunks. Default: -1 (process all sequences).

Java Parameters

-Xmx
This will set Java's memory usage, overriding autodetection. -Xmx20g will specify 20 gigs of RAM, and -Xmx200m will specify 200 megs. The max is typically 85% of physical memory. Default: 1g for this tool.
-eoom
This flag will cause the process to exit if an out-of-memory exception occurs. Requires Java 8u92+.
-da
Disable assertions. May provide a small performance boost in production environments.

Examples

Basic Genus-Level Reduction

reducesilva.sh in=silva_full.fasta out=silva_genus.fasta column=1

Reduces the Silva database to one representative per genus. For sequences with headers like "Bacteria;Proteobacteria;Gammaproteobacteria;Enterobacteriales;Enterobacteriaceae;Escherichia;E.coli", the genus field is Escherichia, so only the first Escherichia sequence encountered will be kept, regardless of species.
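
As a quick illustration of how the column value maps onto the header above, the following Python sketch counts fields from the right end of the semicolon-delimited string. The function name taxon_at_level is hypothetical and the code is not taken from ReduceSilva.java; it assumes the header holds only the lineage and that the name at the chosen level identifies the group on its own.

```python
from typing import Optional


def taxon_at_level(header: str, column: int) -> Optional[str]:
    """Return the field `column` positions from the right end of the header.

    column=0 is the last field (species), column=1 the one before it (genus),
    and so on. Returns None if the header has too few fields (an assumption
    made for this sketch).
    """
    fields = header.split(";")
    if column < 0 or column >= len(fields):
        return None
    return fields[len(fields) - 1 - column]


header = ("Bacteria;Proteobacteria;Gammaproteobacteria;Enterobacteriales;"
          "Enterobacteriaceae;Escherichia;E.coli")

print(taxon_at_level(header, 0))  # E.coli (species)
print(taxon_at_level(header, 1))  # Escherichia (genus)
print(taxon_at_level(header, 2))  # Enterobacteriaceae (family)
print(taxon_at_level(header, 6))  # Bacteria (kingdom)
```

Counting from the right means that fields at the start of the line do not shift the level being selected.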

Species-Level Reduction

reducesilva.sh in=silva_full.fasta out=silva_species.fasta column=0

Keeps only one representative per species (finest taxonomic resolution).

Family-Level Reduction with Compression

reducesilva.sh in=silva_full.fasta out=silva_family.fasta.gz column=2 zl=9

Reduces to family level with maximum compression. Useful for creating compact reference databases.

Testing with Limited Input

reducesilva.sh in=silva_full.fasta out=test_output.fasta column=1 reads=1000

Processes only the first 1000 input sequences, which is useful for testing parameters or validating a workflow before a full run.

Algorithm Details

Taxonomic Processing Strategy

ReduceSilva uses a hash-based deduplication approach to efficiently filter taxonomic sequences:
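
As a rough illustration, hash-based first-occurrence filtering can be sketched in Python as follows. This is not the tool's Java source: the function name reduce_by_taxon, the in-memory record tuples, and the choice to skip headers with too few fields are all assumptions made for the example.

```python
from typing import Iterable, Iterator, Tuple

Record = Tuple[str, str]  # (header, sequence); hypothetical record shape


def reduce_by_taxon(records: Iterable[Record], column: int = 1) -> Iterator[Record]:
    """Yield only the first record seen for each taxon at the given level."""
    seen = set()  # hash set of taxon names already emitted
    for header, sequence in records:
        fields = header.split(";")
        if column >= len(fields):
            continue  # assumption: drop headers too short for this level
        key = fields[len(fields) - 1 - column]  # count from the right end
        if key in seen:
            continue  # a representative for this taxon was already kept
        seen.add(key)
        yield header, sequence


records = [
    ("Bacteria;Proteobacteria;Gammaproteobacteria;Enterobacteriales;"
     "Enterobacteriaceae;Escherichia;E.coli", "ACGT"),
    ("Bacteria;Proteobacteria;Gammaproteobacteria;Enterobacteriales;"
     "Enterobacteriaceae;Escherichia;E.albertii", "ACGA"),
    ("Bacteria;Proteobacteria;Gammaproteobacteria;Enterobacteriales;"
     "Enterobacteriaceae;Salmonella;S.enterica", "TTGA"),
]

kept_genus = list(reduce_by_taxon(records, column=1))
kept_species = list(reduce_by_taxon(records, column=0))
print(len(kept_genus), len(kept_species))  # 2 3
```

Because membership tests on a hash set take roughly constant time, a single pass over the input is enough, and memory grows with the number of distinct taxa rather than the number of sequences.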

Parsing Method

Deduplication Algorithm

Performance Characteristics

Use Cases

Limitations and Considerations

Support

For questions and support: