SummarizeMerge
Summarizes the output of GradeMerge for comparing read-merging performance.
Basic Usage
summarizemerge.sh in=<file>
This tool processes GradeMerge output files and extracts key performance metrics into a tab-delimited summary format for easy comparison of different read-merging tools and parameters.
Parameters
SummarizeMerge takes a single parameter: the GradeMerge output file to process.
- in=<file>
- A file containing GradeMerge output. This should be the output from running GradeMerge to evaluate the accuracy of read merging tools. The file must contain timing information (real, user, sys), accuracy metrics (correct, incorrect reads), and signal-to-noise ratio data.
Examples
Basic Summary Generation
summarizemerge.sh in=grademerge_results.txt
Processes a GradeMerge output file and generates a tab-delimited summary with timing and accuracy metrics.
Processing Multiple Test Results
# Generate summaries for different parameter sets
summarizemerge.sh in=bbmerge_test1.out > bbmerge_summary1.txt
summarizemerge.sh in=bbmerge_test2.out > bbmerge_summary2.txt
summarizemerge.sh in=flash_test.out > flash_summary.txt
Creates a separate summary file for each merging tool evaluation so the results can be compared side by side.
Combined Analysis Pipeline
# Run GradeMerge evaluation and immediately summarize
grademerge.sh in=merged.fq > evaluation.out
summarizemerge.sh in=evaluation.out > summary.txt
Complete pipeline from evaluation to summary generation for read merging performance assessment.
Output Format
SummarizeMerge generates tab-delimited output with the following columns:
- real - Real time elapsed (seconds)
- user - User CPU time (seconds)
- sys - System CPU time (seconds)
- correct - Percentage of correctly merged reads
- incorrect - Percentage of incorrectly merged reads
- SNR - Signal-to-noise ratio
Example Output
#real user sys correct incorrect SNR
12.450 11.230 0.890 99.72 0.28 25.539
8.760 7.980 0.650 98.45 1.55 18.234
Header line followed by data rows showing performance metrics for each evaluated condition.
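Because the summary is plain tab-delimited text, it is easy to load for side-by-side comparison. The following is a minimal sketch, not part of BBTools: the class and method names are hypothetical, and it assumes every data row holds exactly the six numeric columns listed above. It skips the header line and picks the row with the highest SNR.

```java
import java.util.ArrayList;
import java.util.List;

class SummaryReaderSketch {
    /** Parses data rows into double arrays ordered real, user, sys, correct, incorrect, SNR. */
    static List<double[]> parseRows(List<String> lines) {
        List<double[]> rows = new ArrayList<>();
        for (String line : lines) {
            if (line.isEmpty() || line.startsWith("#")) continue; // skip blanks and the header
            String[] fields = line.split("\t");
            double[] row = new double[fields.length];
            for (int i = 0; i < fields.length; i++) {
                row[i] = Double.parseDouble(fields[i]);
            }
            rows.add(row);
        }
        return rows;
    }

    /** Returns the index of the row with the highest SNR (column 5), or -1 if there are no rows. */
    static int bestBySnr(List<double[]> rows) {
        int best = -1;
        for (int i = 0; i < rows.size(); i++) {
            if (best < 0 || rows.get(i)[5] > rows.get(best)[5]) best = i;
        }
        return best;
    }

    public static void main(String[] args) {
        // Rows copied from the example output above, with explicit tabs.
        List<String> lines = List.of(
            "#real\tuser\tsys\tcorrect\tincorrect\tSNR",
            "12.450\t11.230\t0.890\t99.72\t0.28\t25.539",
            "8.760\t7.980\t0.650\t98.45\t1.55\t18.234");
        System.out.println("best row index: " + bestBySnr(parseRows(lines)));
    }
}
```

If the real output carries extra columns, such as a label taken from a section marker, the numeric parsing would need to skip them.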
Algorithm Details
SummarizeMerge is implemented by the ProcessSpeed.main() method, which parses GradeMerge output line by line using TextFile:
Input Processing Strategy
The implementation classifies each line by its prefix using String.startsWith():
- Timing Extraction - String.startsWith("real\t"), String.startsWith("user\t"), and String.startsWith("sys\t") identify timing lines, followed by String.split("\t")[1] to extract time values
- Accuracy Parsing - String.startsWith("Correct:") and String.startsWith("Incorrect:") locate accuracy lines, with String.split("\\p{javaWhitespace}+")[2] extracting percentage values
- Quality Metrics - String.startsWith("SNR:") identifies signal-to-noise lines, using String.split("\\p{javaWhitespace}+")[1] for value extraction
- Section Identification - String.startsWith("***") detects section markers for test condition separation
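The prefix checks and split indices listed above can be sketched as follows. This is an illustration, not the ProcessSpeed source: the class and method names are hypothetical, and the sample lines are invented only to fit the indices described, so real GradeMerge output may be laid out differently.

```java
import java.util.Optional;

class LineFieldSketch {
    // Whitespace pattern quoted in the text above.
    private static final String WS = "\\p{javaWhitespace}+";

    /** Returns the raw value field of a recognized line, or empty for any other line. */
    static Optional<String> extractField(String line) {
        if (line.startsWith("real\t") || line.startsWith("user\t") || line.startsWith("sys\t")) {
            return Optional.of(line.split("\t")[1]);   // second tab-separated field
        } else if (line.startsWith("Correct:") || line.startsWith("Incorrect:")) {
            return Optional.of(line.split(WS)[2]);     // third whitespace-separated token
        } else if (line.startsWith("SNR:")) {
            return Optional.of(line.split(WS)[1]);     // second whitespace-separated token
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        // Hypothetical sample lines, shaped only to match the indices described above.
        String[] samples = {"real\t0m12.450s", "Correct: 99720 99.72", "SNR: 25.539", "Total: 100000"};
        for (String s : samples) {
            System.out.println(extractField(s).orElse("(ignored)"));
        }
    }
}
```

A line that matches a prefix but lacks the expected field would throw ArrayIndexOutOfBoundsException in this sketch, so it relies on well-formed input.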
Time Conversion Algorithm
The toSeconds() method converts the shell timing format to seconds:
- Input format: "XmYs" as printed by the shell time command (for example 0m12.450s), processed by String.replaceAll("s", "") and String.split("m")
- Conversion: (X * 60) + Y using Double.parseDouble() on split components
- Output formatting: Tools.format("%.3f\t", seconds) provides three decimal place precision
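The conversion steps above can be sketched as below. This is a minimal reconstruction from the description, not the BBTools source: the class name and formatField helper are hypothetical, and String.format with Locale.ROOT stands in for Tools.format, which is a BBTools utility not reproduced here.

```java
import java.util.Locale;

class ToSecondsSketch {
    /** Converts a shell timing string such as "1m30.5s" to seconds. */
    static double toSeconds(String s) {
        String[] parts = s.replaceAll("s", "").split("m"); // "1m30.5s" -> ["1", "30.5"]
        double minutes = Double.parseDouble(parts[0]);
        double seconds = Double.parseDouble(parts[1]);
        return minutes * 60 + seconds;
    }

    /** Formats a value with three decimals and a trailing tab (stand-in for Tools.format). */
    static String formatField(double seconds) {
        return String.format(Locale.ROOT, "%.3f\t", seconds);
    }

    public static void main(String[] args) {
        System.out.println(formatField(toSeconds("1m30.5s")).trim()); // prints 90.500
    }
}
```

For example, "1m30.5s" becomes (1 * 60) + 30.5 = 90.5 seconds, which is emitted as 90.500 followed by a tab.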
Data Flow Architecture
Processing follows a TextFile.nextLine() streaming pattern in ProcessSpeed.main():
- Line-by-line processing - TextFile tf = new TextFile(fname) followed by a for(String line=tf.nextLine(); line!=null; line=tf.nextLine()) loop
- Pattern matching - Direct String.startsWith() calls compare each line against a short fixed prefix, which takes constant time per line and avoids regex overhead for line classification
- State management - Sequential processing through an if-else chain maintains parsing context per line
- Immediate output - System.out.print() calls generate tab-delimited output as metrics are parsed
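The streaming pattern described above can be approximated with standard library classes. In this sketch BufferedReader stands in for the BBTools TextFile class, the class and method names are hypothetical, and two points are assumptions rather than documented behavior: that the timing lines precede the accuracy lines in each section, and that the SNR value closes a row. Section marker lines are simply skipped here.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.PrintStream;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Locale;

class StreamingSummarySketch {
    private static final String WS = "\\p{javaWhitespace}+";

    /** Converts "XmYs" to seconds. */
    static double toSeconds(String s) {
        String[] p = s.replaceAll("s", "").split("m");
        return Double.parseDouble(p[0]) * 60 + Double.parseDouble(p[1]);
    }

    /** Reads one line at a time and prints each metric as soon as it is parsed. */
    static void summarize(BufferedReader in, PrintStream out) {
        out.print("#real\tuser\tsys\tcorrect\tincorrect\tSNR\n");
        try {
            for (String line = in.readLine(); line != null; line = in.readLine()) {
                if (line.startsWith("real\t") || line.startsWith("user\t") || line.startsWith("sys\t")) {
                    out.print(String.format(Locale.ROOT, "%.3f\t", toSeconds(line.split("\t")[1])));
                } else if (line.startsWith("Correct:") || line.startsWith("Incorrect:")) {
                    out.print(line.split(WS)[2] + "\t");
                } else if (line.startsWith("SNR:")) {
                    out.print(line.split(WS)[1] + "\n"); // assumption: SNR ends the row
                }
                // Any other line, including "***" markers, is ignored in this sketch.
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Hypothetical input; real GradeMerge output may be laid out differently.
        String input = "*** run1 ***\nreal\t0m12.450s\nuser\t0m11.230s\nsys\t0m0.890s\n"
                + "Correct: 99720 99.72\nIncorrect: 280 0.28\nSNR: 25.539\n";
        summarize(new BufferedReader(new StringReader(input)), System.out);
    }
}
```

Only the current line is held at any moment, which is why memory use stays flat regardless of input size.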
Memory Efficiency
The implementation keeps memory use low:
- Stream processing - TextFile class provides buffered reading without loading the entire file into memory
- Small memory footprint - JVM allocation limited to 120MB via -Xmx120m parameter in shell script
- No accumulating data structures - Each line is processed directly without HashMap or ArrayList storage, so only short-lived strings from the current line are allocated
Parsing Implementation Details
- Whitespace parsing - String.split("\\p{javaWhitespace}+") handles variable spacing using Java regex whitespace character class
- Tab delimited parsing - String.split("\t") for precise tab-separated value extraction
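- String replacement - line.replace("\\*\\*\\*", "").trim() for section marker cleanup (String.replace() matches its argument literally, so this escaped pattern removes the asterisks only when passed to replaceAll())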
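The difference between literal and regex replacement on a section marker is easy to check. The sketch below uses a hypothetical marker line and hypothetical helper names; it relies only on the standard String.replace() and String.replaceAll() methods.

```java
class MarkerCleanupSketch {
    /** Literal replacement of the escaped text: looks for backslash-star sequences, so "***" is untouched. */
    static String stripEscapedLiteral(String line) {
        return line.replace("\\*\\*\\*", "").trim();
    }

    /** Literal replacement of three asterisks. */
    static String stripLiteral(String line) {
        return line.replace("***", "").trim();
    }

    /** Regex replacement, where each asterisk must be escaped. */
    static String stripRegex(String line) {
        return line.replaceAll("\\*\\*\\*", "").trim();
    }

    public static void main(String[] args) {
        String line = "*** bbmerge default ***"; // hypothetical section marker line
        System.out.println(stripEscapedLiteral(line)); // prints *** bbmerge default ***
        System.out.println(stripLiteral(line));        // prints bbmerge default
        System.out.println(stripRegex(line));          // prints bbmerge default
    }
}
```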
Use Cases
Read Merging Tool Comparison
Primary use case for comparing different read merging tools:
- BBMerge vs FLASH vs PEAR performance analysis
- Parameter optimization studies
- Accuracy vs speed trade-off evaluation
Pipeline Benchmarking
Integration into automated testing workflows:
- Continuous integration testing of merging algorithms
- Regression testing for BBTools releases
- Performance monitoring across different datasets
Research Applications
Supporting bioinformatics research activities:
- Method development and validation
- Publication-quality performance comparisons
- Dataset-specific optimization
Technical Notes
Input Requirements
- Input file must contain GradeMerge output format
- Required sections: timing data, accuracy metrics, SNR values
- Section markers ("***") are used to separate different test conditions
Performance Characteristics
- Memory usage - JVM heap limited to 120MB via -Xmx120m shell script parameter
- Processing speed - Single-pass TextFile.nextLine() iteration with O(1) String.startsWith() pattern matching per line
- Scalability - Linear O(n) complexity scaling with input file line count
Limitations
- Designed specifically for GradeMerge output format
- Cannot process other types of benchmarking data
- Requires specific line prefixes and patterns in input
Support
For questions and support:
- Email: bbushnell@lbl.gov
- Documentation: bbmap.org