mergeOTUs
Merges coverage stats lines (from pileup) for the same OTU, according to some custom naming scheme.
Basic Usage
mergeOTUs.sh in=<file> out=<file>
This tool processes coverage statistics files generated by pileup (or compatible tools) and merges lines that correspond to the same Operational Taxonomic Unit (OTU). The OTU identifier is the second space-delimited token of each line's name field: the text after the first space and before the first tab character.
Parameters
mergeOTUs has a simple parameter set focused on input/output file specification.
Input/Output Parameters
- in=<file>
- Input file containing coverage statistics lines. This should be a file generated by pileup or similar coverage analysis tools, with OTU identifiers in the second column (after the first space, before the first tab). The file must start with a header line beginning with '#'.
- out=<file>
- Output file where merged coverage statistics will be written. The output maintains the same format as the input but with coverage statistics summed for each unique OTU identifier.
Examples
Basic OTU Merging
mergeOTUs.sh in=coverage_stats.txt out=merged_otus.txt
Merges coverage statistics from coverage_stats.txt, combining all lines that share the same OTU identifier into single merged entries in merged_otus.txt.
Processing Pileup Output
# First generate coverage statistics with pileup
pileup.sh in=mapped_reads.sam ref=reference.fa out=coverage.txt
# Then merge OTUs with identical identifiers
mergeOTUs.sh in=coverage.txt out=merged_coverage.txt
This workflow first generates detailed coverage statistics with pileup, then merges entries for the same OTU to get consolidated coverage data per taxonomic unit.
Alternative Parameter Order
mergeOTUs.sh out=merged_data.txt in=input_stats.txt
Parameters can be specified in any order; each is identified by its key (in=, out=) rather than by its position on the command line.
Algorithm Details
mergeOTUs implements a hash-based OTU consolidation algorithm using LinkedHashMap<String, CovStatsLine> storage and weighted averaging for coverage statistics merging:
Input Processing Implementation
- Header Initialization: The first line must start with '#' and is passed to CovStatsLine.initializeHeader(), which parses column names into field-number mappings (id_FNUM, length_FNUM, etc.)
- OTU Extraction Algorithm: Uses indexOf(' ') and indexOf('\t') to locate OTU identifier between first space and first tab:
String otu = s.substring(space+1, s.indexOf('\t'))
- CovStatsLine Parsing: Each data line splits on tabs, parses fields by position using field number mappings from header initialization
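Taken together, the steps above condense to the short sketch below. It is illustrative only: BBTools' TextFile and the full CovStatsLine column mapping are replaced with plain java.io, and the class name is invented.

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

public class OtuExtractSketch {
    public static void main(String[] args) throws IOException {
        try (BufferedReader in = new BufferedReader(new FileReader(args[0]))) {
            String header = in.readLine();
            if (header == null || !header.startsWith("#")) {
                throw new IOException("Expected a '#' header line");
            }
            // The column-name-to-field-number mapping would be built from the header here.
            for (String s = in.readLine(); s != null; s = in.readLine()) {
                int space = s.indexOf(' ');                            // OTU starts after the first space
                String otu = s.substring(space + 1, s.indexOf('\t'));  // and ends at the first tab
                System.out.println(otu);                               // downstream: parse and merge this line
            }
        }
    }
}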
Hash-Based Merging Implementation
- LinkedHashMap<String, CovStatsLine>: Maintains insertion order while providing O(1) average lookup time for OTU identifiers
- Coverage Accumulation via CovStatsLine.add(): Weighted averaging for avgFold, refGC using length weighting:
avgFold=((avgFold*length)+(csl.avgFold*csl.length))*invlen2
- Count Summation: Direct addition for length, coveredBases, plusReads, minusReads, median, underMin fields
- Single-Pass Processing: File processed line-by-line via TextFile.nextLine() with no buffering of complete dataset
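The merge step itself is the standard get-or-put pattern over the map. In the sketch below a two-field stand-in replaces CovStatsLine, so it demonstrates the technique rather than reproducing the shipped code:

import java.util.LinkedHashMap;

public class MergeSketch {
    // Stand-in for CovStatsLine with just two of the real fields.
    static class Line {
        long length;
        double avgFold;
        Line(long length, double avgFold) { this.length = length; this.avgFold = avgFold; }
        // Length-weighted merge, mirroring CovStatsLine.add().
        void add(Line csl) {
            double invlen2 = 1d / (length + csl.length);
            avgFold = ((avgFold * length) + (csl.avgFold * csl.length)) * invlen2;
            length += csl.length;
        }
    }

    static void merge(LinkedHashMap<String, Line> map, String otu, Line l) {
        Line old = map.get(otu);
        if (old == null) { map.put(otu, l); } else { old.add(l); }
    }

    public static void main(String[] args) {
        LinkedHashMap<String, Line> map = new LinkedHashMap<>();
        merge(map, "OTU_1", new Line(100, 10.0));
        merge(map, "OTU_2", new Line(300, 2.0));
        merge(map, "OTU_1", new Line(300, 2.0));   // merged into the existing OTU_1 entry
        map.forEach((otu, l) -> System.out.println(otu + "\t" + l.length + "\t" + l.avgFold));
        // Prints OTU_1 first (insertion order), with avgFold (10*100 + 2*300)/400 = 4.0
    }
}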
Output Generation Implementation
- Insertion Order Preservation: LinkedHashMap maintains the order in which OTU identifiers were first encountered in the input
- ID Assignment via csl.id=s: Sets the final OTU identifier in each merged CovStatsLine before output
- TextStreamWriter Output: Uses TextStreamWriter with threaded writing via tsw.start() and tsw.poisonAndWait()
- Format Consistency: Output uses CovStatsLine.toString() with tab-delimited format matching input structure
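In simplified form, output is a single ordered pass over the map. A plain BufferedWriter stands in below for BBTools' threaded TextStreamWriter, and the map value stands in for a serialized CovStatsLine:

import java.io.BufferedWriter;
import java.io.FileWriter;
import java.io.IOException;
import java.util.LinkedHashMap;
import java.util.Map;

public class OutputSketch {
    // 'merged' maps OTU id -> a tab-delimited stats line (CovStatsLine.toString() in the real tool).
    static void writeMerged(String header, LinkedHashMap<String, String> merged, String path)
            throws IOException {
        try (BufferedWriter out = new BufferedWriter(new FileWriter(path))) {
            out.write(header);      // header line is echoed unchanged
            out.newLine();
            for (Map.Entry<String, String> e : merged.entrySet()) {  // insertion order preserved
                out.write(e.getValue());
                out.newLine();
            }
        }
    }
}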
Performance Characteristics
- Time Complexity: O(n) where n is the number of input lines, with O(1) average lookup time for OTU identifiers in LinkedHashMap
- Space Complexity: O(k) where k is the number of unique OTU identifiers
- Memory Usage: Default 1GB heap allocation (-Xmx1g) as specified in mergeOTUs.sh line 32
- Coverage Statistics Merging: Length-weighted averaging for continuous values (avgFold, refGC), read-count-weighted for readGC
Input Format Requirements
The input file must follow this structure based on CovStatsLine parsing:
- First line: Header starting with '#' (e.g., the tab-delimited pileup covstats header "#ID Avg_fold Length Ref_GC Covered_percent Covered_bases Plus_reads Minus_reads Read_GC Median_fold Std_Dev")
- Subsequent lines: Coverage data with OTU identifier in second column
- Column separation: Tab-delimited format expected by CovStatsLine.split("\t")
- OTU position: Located after first space and before first tab in each line
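A hypothetical excerpt in this format (all identifiers and values are invented for illustration; the gap after each contig name is a single space, and the remaining columns are tab-delimited):

#ID	Avg_fold	Length	Ref_GC	Covered_percent	Covered_bases	Plus_reads	Minus_reads	Read_GC	Median_fold	Std_Dev
contig_001 OTU_17	12.50	1500	0.5100	99.80	1497	90	85	0.5000	12	3.10
contig_002 OTU_17	8.00	900	0.4900	97.00	873	40	38	0.4800	8	2.20
contig_003 OTU_42	3.20	2000	0.5500	80.10	1602	30	29	0.5400	3	1.50

mergeOTUs would collapse the first two data lines into a single OTU_17 entry and leave the OTU_42 line untouched.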
Technical Notes
OTU Identifier Extraction Implementation
The tool extracts OTU identifiers using specific string parsing from MergeCoverageOTU.java lines 41-42:
int space = s.indexOf(' ');
String otu = s.substring(space+1, s.indexOf('\t'));
This assumes the input format has the sequence identifier, followed by a space, then the OTU identifier, then a tab, then the coverage statistics. This format is compatible with standard pileup output when sequences are named with OTU information.
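For example, given the hypothetical line "contig_001 OTU_17" followed by a tab and the stats columns, indexOf(' ') returns 10 and indexOf('\t') returns 17, so substring(11, 17) yields "OTU_17".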
Coverage Statistics Accumulation Algorithm
The CovStatsLine.add() method from lines 77-89 handles mathematical combination of coverage statistics:
- Length-weighted averaging: avgFold and refGC use ((field1*length1)+(field2*length2))*invlen2
- Read-count-weighted averaging: readGC is averaged with total read counts (plusReads+minusReads) as the weights
- Direct summation: length, coveredBases, plusReads, minusReads, median, underMin are added directly
- Inverse length calculation: invlen2=1d/(length+csl.length) is computed once so both weighted averages multiply by it instead of dividing twice
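A compact sketch of this arithmetic follows, with the field set deliberately reduced (the real CovStatsLine also carries median, underMin, and other fields):

public class AccumulateSketch {
    long length, coveredBases, plusReads, minusReads;
    double avgFold, refGC, readGC;

    void add(AccumulateSketch csl) {
        double invlen2 = 1d / (length + csl.length);   // computed once, reused by both averages
        avgFold = ((avgFold * length) + (csl.avgFold * csl.length)) * invlen2;   // length-weighted
        refGC = ((refGC * length) + (csl.refGC * csl.length)) * invlen2;         // length-weighted
        long r1 = plusReads + minusReads, r2 = csl.plusReads + csl.minusReads;
        if (r1 + r2 > 0) {
            readGC = ((readGC * r1) + (csl.readGC * r2)) / (r1 + r2);            // read-count-weighted
        }
        length += csl.length;              // the remaining fields sum directly
        coveredBases += csl.coveredBases;
        plusReads += csl.plusReads;
        minusReads += csl.minusReads;
    }

    public static void main(String[] args) {
        AccumulateSketch a = new AccumulateSketch(), b = new AccumulateSketch();
        a.length = 100; a.avgFold = 10; a.plusReads = 5;  a.minusReads = 5;  a.readGC = 0.40;
        b.length = 300; b.avgFold = 2;  b.plusReads = 10; b.minusReads = 10; b.readGC = 0.60;
        a.add(b);
        System.out.println(a.avgFold + " " + a.readGC);  // 4.0 and 0.5333...
    }
}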
Memory Management
With the default 1GB memory allocation from mergeOTUs.sh:
- LinkedHashMap overhead: ~32 bytes per entry plus key/value objects
- CovStatsLine objects: ~200 bytes per unique OTU (approximate, based on field count)
- Estimated capacity: 2-5 million unique OTUs depending on OTU identifier lengths
- TextFile and TextStreamWriter use minimal buffering for streaming I/O
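As a rough sanity check on that estimate: at ~32 bytes of map overhead plus ~200 bytes of CovStatsLine plus a few dozen bytes of key string, each entry costs on the order of 300 bytes, and 1 GiB / ~300 B ≈ 3.5 million entries, which sits inside the 2-5 million range above (actual headroom depends on GC overhead and identifier lengths).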
Use Cases
- Metagenomic Analysis: Consolidating coverage data for sequences assigned to the same taxonomic unit
- Amplicon Studies: Merging statistics for sequences clustered into the same OTU
- Comparative Genomics: Combining coverage data for sequences from the same species or strain
- Quality Control: Generating per-OTU summary statistics for large sequencing experiments
Compatibility
mergeOTUs is designed to work with coverage statistics files generated by:
- pileup.sh: Direct compatibility with pileup output format
- CovStats utilities: Any tool that generates CovStatsLine-compatible format
- Custom pipelines: Tab-delimited files following the expected column structure
Upstream Tools
Typically used after mapping and coverage analysis:
# Example workflow
bbmap.sh in=reads.fq ref=reference.fa out=mapped.sam
pileup.sh in=mapped.sam ref=reference.fa out=coverage.txt
mergeOTUs.sh in=coverage.txt out=merged.txt
Downstream Analysis
Output can be used with:
- Statistical analysis tools that work with coverage data
- Visualization tools for OTU abundance analysis
- Comparative analysis pipelines for metagenomic studies
- Quality assessment tools for sequencing experiments
Support
For questions and support:
- Email: bbushnell@lbl.gov
- Documentation: bbmap.org
For issues specific to OTU analysis workflows or coverage statistics interpretation, please include sample input data and the complete command used.