ProcessFrag
Reformats script output into tab-delimited rows. Written to generate the data for the BBMerge paper.
Basic Usage
processfrags.sh <file>
Takes a single input file containing the script output to be reformatted. The file is expected to come from the comparison scripts used to collate data for the BBMerge research paper.
Parameters
ProcessFrag is a specialized utility with few parameters of its own. Beyond the input file, it mainly accepts the standard Java runtime parameters used for memory management.
Java Parameters
- -Xmx: Sets Java's memory usage, overriding autodetection. -Xmx20g specifies 20 GB of RAM, and -Xmx200m specifies 200 MB. The maximum is typically 85% of physical memory. Default: 100m
- -eoom: Causes the process to exit if an out-of-memory exception occurs. Requires Java 8u92+.
- -da: Disables assertions.
Examples
Basic Usage
# Process BBMerge comparison output
processfrags.sh comparison_results.txt
# Process with custom memory allocation
processfrags.sh -Xmx2g comparison_results.txt
The first command processes a file containing BBMerge comparison results. The second allocates 2 GB of memory, in place of the 100 MB default, for larger datasets.
Algorithm Details
Data Processing Strategy
ProcessFrag reads its input line by line and reformats BBMerge comparison output into a structured tabular format. It uses pattern matching to extract key metrics from several types of output line:
Extracted Metrics
- Timing Information: Parses "real" time entries and converts them to seconds, handling the minutes-and-seconds format (e.g., "2m30.5s" → 150.5 seconds)
- Read Statistics: Extracts read counts and percentages from "Reads Used:" lines
- Mapping Statistics: Processes "mapped:" entries to extract mapping counts and percentages
- Error Rate Analysis: Parses multiple error rate categories including overall error rate, substitution rate, deletion rate, and insertion rate
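The time conversion described above can be illustrated with a minimal Python sketch. It is not the tool's own code (ProcessFrag is a Java program); the function names `to_seconds` and `parse_real_line` are hypothetical, and the sketch assumes the time token always has the shape "XmY.ZZs", with the minutes part optional.

```python
import re

# Assumed token shape: optional "<minutes>m", then "<seconds>s", e.g. "2m30.5s".
_TIME_RE = re.compile(r"^(?:(\d+)m)?(\d+(?:\.\d+)?)s$")


def to_seconds(token: str) -> float:
    """Convert a token such as "2m30.5s" to a number of seconds."""
    match = _TIME_RE.match(token.strip())
    if match is None:
        raise ValueError(f"unrecognized time token: {token!r}")
    minutes = int(match.group(1) or 0)  # a missing minutes part counts as 0
    seconds = float(match.group(2))
    return minutes * 60 + seconds


def parse_real_line(line: str) -> float:
    """Extract the time in seconds from a line such as "real<TAB>2m30.5s"."""
    fields = line.split()
    if len(fields) != 2 or fields[0] != "real":
        raise ValueError(f"not a 'real' timing line: {line!r}")
    return to_seconds(fields[1])
```

For example, `to_seconds("2m30.5s")` returns 150.5, matching the conversion shown in the list above.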
Output Format
The algorithm produces tab-delimited output with the following structure:
- Dataset identifier (extracted from lines starting with "***")
- Processing time in seconds (converted from the "XmY.ZZs" minutes-and-seconds format)
- Read usage statistics (counts and percentages)
- Mapping statistics (counts and percentages)
- Error rate metrics (overall, substitution, deletion, and insertion rates, each as a percentage and an absolute count)
Processing Characteristics
- Memory Efficient: Processes files line by line, keeping the memory footprint small (default allocation: 100 MB)
- Pattern-Based Parsing: Uses regex pattern matching to identify and extract specific data types from mixed-format input
- Whitespace Handling: Splits fields on runs of whitespace using Java's \p{javaWhitespace}+ regex pattern (written "\\p{javaWhitespace}+" in Java source), so tabs and spaces are both accepted as separators
- Real-Time Output: Produces formatted output incrementally as input is processed
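The line-by-line processing and whitespace splitting described above can be sketched in Python as follows. This is an illustration rather than the tool's Java code: `split_fields` and `stream_fields` are hypothetical names, and Python's `\s+` is used as a stand-in for Java's `\p{javaWhitespace}+` (the two classes differ on a few Unicode characters, such as non-breaking spaces).

```python
import re
from typing import Iterable, Iterator, List

# Stand-in for Java's \p{javaWhitespace}+ (an approximation, not an exact match).
WHITESPACE = re.compile(r"\s+")


def split_fields(line: str) -> List[str]:
    """Split one line into fields on runs of whitespace."""
    stripped = line.strip()
    return WHITESPACE.split(stripped) if stripped else []


def stream_fields(lines: Iterable[str]) -> Iterator[List[str]]:
    """Yield the fields of each non-blank line as soon as it is read."""
    for line in lines:
        fields = split_fields(line)
        if fields:
            yield fields
```

Passing an open file handle or `sys.stdin` as `lines` means only the current line is held in memory, which is the property the list above attributes to the tool.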
Research Application
This tool was specifically developed for the BBMerge research paper to standardize the format of comparison data across multiple alignment tools and parameter sets. The consistent tabular output facilitates statistical analysis and visualization of tool performance metrics.
Input Format
ProcessFrag expects input files containing specific line patterns from BBMerge comparison scripts:
- Dataset markers: Lines starting with "***" followed by dataset name
- Timing data: Lines starting with "real", followed by a tab and a time in the format "XmY.ZZs" (for example, "2m30.5s")
- Read statistics: Lines starting with "Reads Used:" containing read counts and percentages
- Mapping data: Lines starting with "mapped:" containing mapping statistics
- Error rates: Lines starting with "Error Rate:", "Sub Rate:", "Del Rate:", or "Ins Rate:" containing rate percentages and counts
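As a rough illustration of how the line patterns above could be told apart, the Python sketch below classifies a line by its prefix. The prefixes are taken from the list above; the `classify` function and the category labels are hypothetical, and the sketch assumes the prefixes appear at the very start of the line with no leading whitespace.

```python
from typing import Optional

# (prefix, label) pairs; prefixes come from the input format list,
# labels are made up for this sketch.
PREFIXES = (
    ("***", "dataset"),
    ("real\t", "time"),  # assumes "real" is followed directly by a tab
    ("Reads Used:", "reads_used"),
    ("mapped:", "mapped"),
    ("Error Rate:", "error_rate"),
    ("Sub Rate:", "sub_rate"),
    ("Del Rate:", "del_rate"),
    ("Ins Rate:", "ins_rate"),
)


def classify(line: str) -> Optional[str]:
    """Return the label of the first matching prefix, or None if none match."""
    for prefix, label in PREFIXES:
        if line.startswith(prefix):
            return label
    return None
```

Lines that match none of the prefixes return `None`, so a caller can skip them.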
Output Format
The tool generates tab-delimited output suitable for spreadsheet import or further statistical analysis. Each dataset produces one row with the following columns:
- Dataset name
- Processing time (seconds)
- Reads used count
- Reads used percentage
- Mapped count
- Mapped percentage
- Overall error rate percentage
- Overall error count
- Substitution rate percentage
- Substitution count
- Deletion rate percentage
- Deletion count
- Insertion rate percentage
- Insertion count
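The column layout above can be sketched in Python as a fixed, ordered list of keys joined with tabs. The key names and the `format_row` function are hypothetical, chosen for this illustration. Emitting an empty field for a missing value is also an assumption, not documented behavior of the tool.

```python
from typing import Dict, List

# Hypothetical keys, in the same order as the column list above.
COLUMNS: List[str] = [
    "dataset", "seconds",
    "reads_used", "reads_used_pct",
    "mapped", "mapped_pct",
    "error_pct", "error_count",
    "sub_pct", "sub_count",
    "del_pct", "del_count",
    "ins_pct", "ins_count",
]


def format_row(record: Dict[str, str]) -> str:
    """Join one dataset's values into a single tab-delimited row.

    Missing values become empty fields (an assumption of this sketch),
    so every row has the same number of columns.
    """
    return "\t".join(record.get(column, "") for column in COLUMNS)
```

Keeping the column order in one list means the header line and every data row can be generated from the same source, for example `"\t".join(COLUMNS)` for a header.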
Support
For questions and support:
- Email: bbushnell@lbl.gov
- Documentation: bbmap.org