mergeOTUs
Merges coverage stats lines (from pileup) for the same OTU, according to some custom naming scheme.
Basic Usage
mergeOTUs.sh in=<file> out=<file>
This tool processes coverage statistics files generated by pileup (or compatible tools) and merges lines that correspond to the same Operational Taxonomic Unit (OTU). The OTU identifier is the second space-delimited token of each line's name field: the text after the first space and before the first tab character.
Parameters
mergeOTUs has a simple parameter set focused on input/output file specification.
Input/Output Parameters
- in=<file>
- Input file containing coverage statistics lines. This should be a file generated by pileup or similar coverage analysis tools, with OTU identifiers in the second column (after the first space, before the first tab). The file must start with a header line beginning with '#'.
- out=<file>
- Output file where merged coverage statistics will be written. The output maintains the same format as the input but with coverage statistics summed for each unique OTU identifier.
Examples
Basic OTU Merging
mergeOTUs.sh in=coverage_stats.txt out=merged_otus.txt
Merges coverage statistics from coverage_stats.txt, combining all lines that share the same OTU identifier into single merged entries in merged_otus.txt.
Processing Pileup Output
# First generate coverage statistics with pileup
pileup.sh in=mapped_reads.sam ref=reference.fa out=coverage.txt
# Then merge OTUs with identical identifiers
mergeOTUs.sh in=coverage.txt out=merged_coverage.txt
This workflow first generates detailed coverage statistics with pileup, then merges entries for the same OTU to get consolidated coverage data per taxonomic unit.
Alternative Parameter Order
mergeOTUs.sh out=merged_data.txt in=input_stats.txt
Parameters can be specified in any order; each is identified by its key (in=, out=) rather than by its position on the command line.
Algorithm Details
mergeOTUs implements a hash-based OTU consolidation algorithm using LinkedHashMap<String, CovStatsLine> storage and weighted averaging for coverage statistics merging:
Input Processing Implementation
- Header Initialization: The first line must start with '#' and is passed to CovStatsLine.initializeHeader(), which parses column names into field-number mappings (id_FNUM, length_FNUM, etc.)
- OTU Extraction Algorithm: Uses indexOf(' ') and indexOf('\t') to locate OTU identifier between first space and first tab:
String otu = s.substring(space+1, s.indexOf('\t'))
- CovStatsLine Parsing: Each data line splits on tabs, parses fields by position using field number mappings from header initialization
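Taken together, the steps above condense to the short sketch below. It is illustrative only: BBTools' TextFile and the full CovStatsLine column mapping are replaced with plain java.io, and the class name is invented.

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

public class OtuExtractSketch {
    public static void main(String[] args) throws IOException {
        try (BufferedReader in = new BufferedReader(new FileReader(args[0]))) {
            String header = in.readLine();
            if (header == null || !header.startsWith("#")) {
                throw new IOException("Expected a '#' header line");
            }
            // The column-name-to-field-number mapping would be built from the header here.
            for (String s = in.readLine(); s != null; s = in.readLine()) {
                int space = s.indexOf(' ');                            // OTU starts after the first space
                String otu = s.substring(space + 1, s.indexOf('\t'));  // and ends at the first tab
                System.out.println(otu);                               // downstream: parse and merge this line
            }
        }
    }
}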
Hash-Based Merging Implementation
- LinkedHashMap<String, CovStatsLine>: Maintains insertion order while providing O(1) average lookup time for OTU identifiers
- Coverage Accumulation via CovStatsLine.add(): Weighted averaging for avgFold, refGC using length weighting:
avgFold=((avgFold*length)+(csl.avgFold*csl.length))*invlen2
- Count Summation: Direct addition for length, coveredBases, plusReads, minusReads, median, underMin fields
- Single-Pass Processing: File processed line-by-line via TextFile.nextLine() with no buffering of complete dataset
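The merge step itself is the standard get-or-put pattern over the map. In the sketch below a two-field stand-in replaces CovStatsLine, so it demonstrates the technique rather than reproducing the shipped code:

import java.util.LinkedHashMap;

public class MergeSketch {
    // Stand-in for CovStatsLine with just two of the real fields.
    static class Line {
        long length;
        double avgFold;
        Line(long length, double avgFold) { this.length = length; this.avgFold = avgFold; }
        // Length-weighted merge, mirroring CovStatsLine.add().
        void add(Line csl) {
            double invlen2 = 1d / (length + csl.length);
            avgFold = ((avgFold * length) + (csl.avgFold * csl.length)) * invlen2;
            length += csl.length;
        }
    }

    static void merge(LinkedHashMap<String, Line> map, String otu, Line l) {
        Line old = map.get(otu);
        if (old == null) { map.put(otu, l); } else { old.add(l); }
    }

    public static void main(String[] args) {
        LinkedHashMap<String, Line> map = new LinkedHashMap<>();
        merge(map, "OTU_1", new Line(100, 10.0));
        merge(map, "OTU_2", new Line(300, 2.0));
        merge(map, "OTU_1", new Line(300, 2.0));   // merged into the existing OTU_1 entry
        map.forEach((otu, l) -> System.out.println(otu + "\t" + l.length + "\t" + l.avgFold));
        // Prints OTU_1 first (insertion order), with avgFold (10*100 + 2*300)/400 = 4.0
    }
}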
Output Generation Implementation
- Insertion Order Preservation: LinkedHashMap maintains the order in which OTU identifiers were first encountered in the input
- ID Assignment via csl.id=s: Sets the final OTU identifier in each merged CovStatsLine before output
- TextStreamWriter Output: Uses TextStreamWriter with threaded writing via tsw.start() and tsw.poisonAndWait()
- Format Consistency: Output uses CovStatsLine.toString() with tab-delimited format matching input structure
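In simplified form, output is a single ordered pass over the map. A plain BufferedWriter stands in below for BBTools' threaded TextStreamWriter, and the map value stands in for a serialized CovStatsLine:

import java.io.BufferedWriter;
import java.io.FileWriter;
import java.io.IOException;
import java.util.LinkedHashMap;
import java.util.Map;

public class OutputSketch {
    // 'merged' maps OTU id -> a tab-delimited stats line (CovStatsLine.toString() in the real tool).
    static void writeMerged(String header, LinkedHashMap<String, String> merged, String path)
            throws IOException {
        try (BufferedWriter out = new BufferedWriter(new FileWriter(path))) {
            out.write(header);      // header line is echoed unchanged
            out.newLine();
            for (Map.Entry<String, String> e : merged.entrySet()) {  // insertion order preserved
                out.write(e.getValue());
                out.newLine();
            }
        }
    }
}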
Performance Characteristics
- Time Complexity: O(n) where n is the number of input lines, with O(1) average lookup time for OTU identifiers in LinkedHashMap
- Space Complexity: O(k) where k is the number of unique OTU identifiers
- Memory Usage: Default 1GB heap allocation (-Xmx1g) as specified in mergeOTUs.sh line 32
- Coverage Statistics Merging: Length-weighted averaging for continuous values (avgFold, refGC), read-count-weighted for readGC
Input Format Requirements
The input file must follow this structure based on CovStatsLine parsing:
- First line: Header starting with '#' (e.g., the tab-delimited pileup covstats header "#ID Avg_fold Length Ref_GC Covered_percent Covered_bases Plus_reads Minus_reads Read_GC Median_fold Std_Dev")
- Subsequent lines: Coverage data with OTU identifier in second column
- Column separation: Tab-delimited format expected by CovStatsLine.split("\t")
- OTU position: Located after first space and before first tab in each line
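A hypothetical excerpt in this format (all identifiers and values are invented for illustration; the gap after each contig name is a single space, and the remaining columns are tab-delimited):

#ID	Avg_fold	Length	Ref_GC	Covered_percent	Covered_bases	Plus_reads	Minus_reads	Read_GC	Median_fold	Std_Dev
contig_001 OTU_17	12.50	1500	0.5100	99.80	1497	90	85	0.5000	12	3.10
contig_002 OTU_17	8.00	900	0.4900	97.00	873	40	38	0.4800	8	2.20
contig_003 OTU_42	3.20	2000	0.5500	80.10	1602	30	29	0.5400	3	1.50

mergeOTUs would collapse the first two data lines into a single OTU_17 entry and leave the OTU_42 line untouched.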
Technical Notes
OTU Identifier Extraction Implementation
The tool extracts OTU identifiers using specific string parsing from MergeCoverageOTU.java lines 41-42:
int space = s.indexOf(' ');
String otu = s.substring(space+1, s.indexOf('\t'));
This assumes the input format has the sequence identifier, followed by a space, then the OTU identifier, then a tab, then the coverage statistics. This format is compatible with standard pileup output when sequences are named with OTU information.
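For example, given the hypothetical line "contig_001 OTU_17" followed by a tab and the stats columns, indexOf(' ') returns 10 and indexOf('\t') returns 17, so substring(11, 17) yields "OTU_17".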
Coverage Statistics Accumulation Algorithm
The CovStatsLine.add() method from lines 77-89 handles mathematical combination of coverage statistics:
- Length-weighted averaging: avgFold and refGC use ((field1*length1)+(field2*length2))*invlen2
- Read-count-weighted averaging: readGC is averaged with total read counts (plusReads+minusReads) as the weights
- Direct summation: length, coveredBases, plusReads, minusReads, median, underMin are added directly
- Inverse length calculation: invlen2=1d/(length+csl.length) is computed once so both weighted averages multiply by it instead of dividing twice
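A compact sketch of this arithmetic follows, with the field set deliberately reduced (the real CovStatsLine also carries median, underMin, and other fields):

public class AccumulateSketch {
    long length, coveredBases, plusReads, minusReads;
    double avgFold, refGC, readGC;

    void add(AccumulateSketch csl) {
        double invlen2 = 1d / (length + csl.length);   // computed once, reused by both averages
        avgFold = ((avgFold * length) + (csl.avgFold * csl.length)) * invlen2;   // length-weighted
        refGC = ((refGC * length) + (csl.refGC * csl.length)) * invlen2;         // length-weighted
        long r1 = plusReads + minusReads, r2 = csl.plusReads + csl.minusReads;
        if (r1 + r2 > 0) {
            readGC = ((readGC * r1) + (csl.readGC * r2)) / (r1 + r2);            // read-count-weighted
        }
        length += csl.length;              // the remaining fields sum directly
        coveredBases += csl.coveredBases;
        plusReads += csl.plusReads;
        minusReads += csl.minusReads;
    }

    public static void main(String[] args) {
        AccumulateSketch a = new AccumulateSketch(), b = new AccumulateSketch();
        a.length = 100; a.avgFold = 10; a.plusReads = 5;  a.minusReads = 5;  a.readGC = 0.40;
        b.length = 300; b.avgFold = 2;  b.plusReads = 10; b.minusReads = 10; b.readGC = 0.60;
        a.add(b);
        System.out.println(a.avgFold + " " + a.readGC);  // 4.0 and 0.5333...
    }
}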
Memory Management
With the default 1GB memory allocation from mergeOTUs.sh:
- LinkedHashMap overhead: ~32 bytes per entry plus key/value objects
- CovStatsLine objects: ~200 bytes per unique OTU (approximate, based on field count)
- Estimated capacity: 2-5 million unique OTUs depending on OTU identifier lengths
- TextFile and TextStreamWriter use minimal buffering for streaming I/O
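As a rough sanity check on that estimate: at ~32 bytes of map overhead plus ~200 bytes of CovStatsLine plus a few dozen bytes of key string, each entry costs on the order of 300 bytes, and 1 GiB / ~300 B ≈ 3.5 million entries, which sits inside the 2-5 million range above (actual headroom depends on GC overhead and identifier lengths).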
Use Cases
- Metagenomic Analysis: Consolidating coverage data for sequences assigned to the same taxonomic unit
- Amplicon Studies: Merging statistics for sequences clustered into the same OTU
- Comparative Genomics: Combining coverage data for sequences from the same species or strain
- Quality Control: Generating per-OTU summary statistics for large sequencing experiments
Compatibility
mergeOTUs is designed to work with coverage statistics files generated by:
- pileup.sh: Direct compatibility with pileup output format
- CovStats utilities: Any tool that generates CovStatsLine-compatible format
- Custom pipelines: Tab-delimited files following the expected column structure
Upstream Tools
Typically used after mapping and coverage analysis:
# Example workflow
bbmap.sh in=reads.fq ref=reference.fa out=mapped.sam
pileup.sh in=mapped.sam ref=reference.fa out=coverage.txt
mergeOTUs.sh in=coverage.txt out=merged.txt
Downstream Analysis
Output can be used with:
- Statistical analysis tools that work with coverage data
- Visualization tools for OTU abundance analysis
- Comparative analysis pipelines for metagenomic studies
- Quality assessment tools for sequencing experiments
Support
For questions and support:
- Email: bbushnell@lbl.gov
- Documentation: bbmap.org
For issues specific to OTU analysis workflows or coverage statistics interpretation, please include sample input data and the complete command used.