mergeOTUs

Script: mergeOTUs.sh
Package: driver
Class: MergeCoverageOTU.java

Merges coverage stats lines (from pileup) for the same OTU, according to some custom naming scheme.

Basic Usage

mergeOTUs.sh in=<file> out=<file>

This tool reads a coverage statistics file generated by pileup and merges the lines that belong to the same Operational Taxonomic Unit (OTU). The OTU identifier is the text between the first space and the first tab character on each line, that is, the part of the sequence name that follows the first space.

Parameters

The documented parameters for mergeOTUs are limited to the input and output files.

Input/Output Parameters

in=<file>
Input file containing coverage statistics lines, as generated by pileup or a similar coverage analysis tool. Each data line must carry its OTU identifier between the first space and the first tab. The file must start with a header line beginning with '#'.
out=<file>
Output file for the merged coverage statistics. The output keeps the input format, with the statistics of all lines that share an OTU identifier combined into a single line.
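The input requirements above can be checked before running the tool. The following is a minimal sketch, not part of BBTools: the class name CovStatsPreflight and the method firstProblem are hypothetical, and the sample header is abbreviated for illustration. It only checks the two conditions stated above (a leading '#' header and a tab on every data line).

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical pre-check, not part of BBTools. */
final class CovStatsPreflight {
    private CovStatsPreflight() {}

    /** Returns null if the lines look mergeable, else a description of the first problem. */
    static String firstProblem(List<String> lines) {
        // The documentation requires a header line beginning with '#'.
        if (lines.isEmpty() || !lines.get(0).startsWith("#")) {
            return "line 1: expected a header line starting with '#'";
        }
        // Every data line needs a tab, because the OTU identifier ends at the first tab.
        for (int i = 1; i < lines.size(); i++) {
            if (lines.get(i).indexOf('\t') < 0) {
                return "line " + (i + 1) + ": no tab character found";
            }
        }
        return null;
    }

    public static void main(String[] args) {
        // Abbreviated, illustrative sample; real pileup output has more columns.
        List<String> sample = Arrays.asList(
            "#ID\tAvg_fold\tLength",
            "contig_1 OTU_42\t12.5\t1000",
            "contig_2 OTU_42\t3.0\t500");
        System.out.println(firstProblem(sample)); // prints null
    }
}
```

In practice the lines would come from the file passed as in=, for example via java.nio.file.Files.readAllLines.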

Examples

Basic OTU Merging

mergeOTUs.sh in=coverage_stats.txt out=merged_otus.txt

Merges coverage statistics from coverage_stats.txt, combining all lines that share the same OTU identifier into single merged entries in merged_otus.txt.

Processing Pileup Output

# First generate coverage statistics with pileup
pileup.sh in=mapped_reads.sam ref=reference.fa out=coverage.txt

# Then merge OTUs with identical identifiers
mergeOTUs.sh in=coverage.txt out=merged_coverage.txt

This workflow first generates detailed coverage statistics with pileup, then merges entries for the same OTU to get consolidated coverage data per taxonomic unit.

Alternative Parameter Order

mergeOTUs.sh out=merged_data.txt in=input_stats.txt

Parameters are key=value pairs and can be given in any order. The in= and out= keys, not their positions, determine which file is read and which is written.
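To illustrate why the order does not matter, here is a minimal sketch of key=value argument parsing. It is not the parser used by BBTools; the class name KeyValueArgs and its behavior (lower-casing keys, rejecting arguments without '=') are assumptions made for the example.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

/** Hypothetical key=value parser, for illustration only. */
final class KeyValueArgs {
    private KeyValueArgs() {}

    /** Splits each argument at its first '=' and stores it under the key. */
    static Map<String, String> parse(String[] args) {
        Map<String, String> parsed = new HashMap<>();
        for (String arg : args) {
            int eq = arg.indexOf('=');
            if (eq <= 0) {
                throw new IllegalArgumentException("Expected key=value, got: " + arg);
            }
            String key = arg.substring(0, eq).toLowerCase(Locale.ROOT);
            parsed.put(key, arg.substring(eq + 1));
        }
        return parsed;
    }

    public static void main(String[] args) {
        Map<String, String> a = parse(new String[] {"in=input_stats.txt", "out=merged_data.txt"});
        Map<String, String> b = parse(new String[] {"out=merged_data.txt", "in=input_stats.txt"});
        System.out.println(a.equals(b)); // prints true
    }
}
```

Because each value is looked up by key, both argument orders produce the same map.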

Algorithm Details

mergeOTUs consolidates OTUs with a hash map: records are stored in a LinkedHashMap<String, CovStatsLine> keyed by OTU identifier, and the statistics of lines that share a key are merged using weighted averaging.
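A minimal sketch of this kind of merge is shown here. OtuStats is a simplified, hypothetical stand-in for CovStatsLine with only two fields, and weighting the mean coverage by sequence length is an assumption for illustration; the real class tracks more statistics and may combine them differently.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Simplified stand-in for CovStatsLine; the fields are assumptions. */
final class OtuStats {
    final String otu;
    long length;      // total reference bases for this OTU
    double avgFold;   // length-weighted mean coverage

    OtuStats(String otu, long length, double avgFold) {
        this.otu = otu;
        this.length = length;
        this.avgFold = avgFold;
    }

    /** Folds another record into this one: lengths sum, mean coverage is length-weighted. */
    void add(OtuStats other) {
        long total = length + other.length;
        if (total > 0) {
            avgFold = (avgFold * length + other.avgFold * other.length) / total;
        }
        length = total;
    }
}

final class OtuMergeSketch {
    private OtuMergeSketch() {}

    /** Groups records by OTU; LinkedHashMap keeps OTUs in first-seen order. */
    static Map<String, OtuStats> merge(Iterable<OtuStats> records) {
        Map<String, OtuStats> byOtu = new LinkedHashMap<>();
        for (OtuStats r : records) {
            OtuStats seen = byOtu.get(r.otu);
            if (seen == null) {
                // Store a copy so the caller's records are not modified.
                byOtu.put(r.otu, new OtuStats(r.otu, r.length, r.avgFold));
            } else {
                seen.add(r);
            }
        }
        return byOtu;
    }

    public static void main(String[] args) {
        List<OtuStats> input = Arrays.asList(
            new OtuStats("OTU_1", 1000, 10.0),
            new OtuStats("OTU_2", 500, 4.0),
            new OtuStats("OTU_1", 3000, 2.0));
        for (OtuStats s : merge(input).values()) {
            System.out.println(s.otu + "\t" + s.length + "\t" + s.avgFold);
        }
        // OTU_1: (10.0*1000 + 2.0*3000) / 4000 = 4.0 over 4000 bases
    }
}
```

Using an insertion-ordered map means the merged output lists OTUs in the order they first appear in the input, with one lookup per line.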

Input Processing Implementation

Hash-Based Merging Implementation

Output Generation Implementation

Performance Characteristics

Input Format Requirements

The input file must follow this structure based on CovStatsLine parsing:

Technical Notes

OTU Identifier Extraction Implementation

The tool extracts the OTU identifier with two string operations (MergeCoverageOTU.java, lines 41-42):

int space = s.indexOf(' ');
String otu = s.substring(space+1, s.indexOf('\t'));

This assumes that each line consists of the sequence identifier, a space, the OTU identifier, a tab, and then the coverage statistics. Standard pileup output fits this layout when the reference sequences are named with the OTU information after the first space.
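The two-line extraction relies on that layout. In Java, indexOf returns -1 when the character is absent, so a line without a tab makes substring throw StringIndexOutOfBoundsException, and a line without a space yields the whole name up to the tab. The following defensive variant is a sketch only: OtuKey and extract are hypothetical names, and returning an empty Optional for lines that lack a space before the tab is a design choice of this example, not the tool's behavior.

```java
import java.util.Optional;

/** Hypothetical helper, not part of BBTools. */
final class OtuKey {
    private OtuKey() {}

    /**
     * Returns the text between the first space and the first tab,
     * or empty if the line has no tab or no space before that tab.
     */
    static Optional<String> extract(String line) {
        int tab = line.indexOf('\t');
        if (tab < 0) {
            return Optional.empty();   // no tab: not a stats line
        }
        int space = line.indexOf(' ');
        if (space < 0 || space > tab) {
            return Optional.empty();   // no space inside the name field
        }
        return Optional.of(line.substring(space + 1, tab));
    }

    public static void main(String[] args) {
        System.out.println(extract("contig_1 OTU_42\t12.5\t1000")); // Optional[OTU_42]
        System.out.println(extract("contig_2\t3.0\t500"));          // Optional.empty
    }
}
```

For well-formed lines this returns the same identifier as the original two-line code.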

Coverage Statistics Accumulation Algorithm

The CovStatsLine.add() method from lines 77-89 handles mathematical combination of coverage statistics:

Memory Management

With the default 1GB memory allocation from mergeOTUs.sh:

Use Cases

Compatibility

mergeOTUs is designed to work with coverage statistics files generated by:

Upstream Tools

Typically used after mapping and coverage analysis:

# Example workflow
bbmap.sh in=reads.fq ref=reference.fa out=mapped.sam
pileup.sh in=mapped.sam ref=reference.fa out=coverage.txt
mergeOTUs.sh in=coverage.txt out=merged.txt

Downstream Analysis

Output can be used with:

Support

For questions and support:

For issues specific to OTU analysis workflows or coverage statistics interpretation, please include sample input data and the complete command used.