ProcessFrag

Basic Usage

processfrags.sh <file>

Takes a single input file containing script output to be reformatted. This tool was specifically designed for processing and collating data used in the BBMerge research paper.

Parameters

ProcessFrag is a specialized utility with few parameters of its own. Beyond the input file, it accepts the standard Java runtime flags listed below.

Java Parameters

-Xmx: This will set Java's memory usage, overriding autodetection. -Xmx20g will specify 20 gigs of RAM, and -Xmx200m will specify 200 megs. The max is typically 85% of physical memory. Default: 100m
-eoom: This flag will cause the process to exit if an out-of-memory exception occurs. Requires Java 8u92+.
-da: Disable assertions.

Examples

Basic Usage

# Process BBMerge comparison output
processfrags.sh comparison_results.txt

# Process with custom memory allocation
processfrags.sh -Xmx2g comparison_results.txt

The first command processes a file containing BBMerge comparison results. The second does the same but allocates 2 GB of memory instead of the 100 MB default, for larger datasets.

Algorithm Details

Data Processing Strategy

ProcessFrag implements a line-by-line text processing algorithm specifically designed for reformatting BBMerge comparison output into a structured tabular format. The tool uses pattern matching to extract key metrics from various types of output lines:

Extracted Metrics

Timing Information: Parses "real" time entries and converts them from minutes-and-seconds format to seconds (e.g., "2m30.5s" → 150.5 seconds)
Read Statistics: Extracts read counts and percentages from "Reads Used:" lines
Mapping Statistics: Processes "mapped:" entries to extract mapping counts and percentages
Error Rate Analysis: Parses multiple error rate categories including overall error rate, substitution rate, deletion rate, and insertion rate
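
The time conversion described above can be illustrated with a minimal sketch. This is not the tool's actual source; the class and method names are hypothetical, and it assumes every time token has the form XmY.ZZs.

```java
class TimeConversionSketch {

    /**
     * Converts a time token such as "2m30.5s" to seconds.
     * Assumption: the token always looks like <minutes>m<seconds>s.
     */
    static double toSeconds(String token) {
        String t = token.trim();
        int m = t.indexOf('m');      // end of the minutes part
        int s = t.lastIndexOf('s');  // end of the seconds part
        if (m < 0 || s <= m) {
            throw new IllegalArgumentException("Expected XmY.ZZs, got: " + token);
        }
        double minutes = Double.parseDouble(t.substring(0, m));
        double seconds = Double.parseDouble(t.substring(m + 1, s));
        return minutes * 60 + seconds;
    }

    public static void main(String[] args) {
        System.out.println(toSeconds("2m30.5s")); // prints 150.5
    }
}
```

Rejecting malformed tokens with an exception is a choice made for this sketch; how the real tool treats unexpected time strings is not stated here.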

Output Format

The algorithm produces tab-delimited output with the following structure:

Dataset identifier (extracted from lines starting with "***")
Processing time in seconds (converted from the XmY.ZZs format, e.g. "2m30.5s")
Read usage statistics (counts and percentages)
Mapping statistics (counts and percentages)
Comprehensive error rate metrics (overall, substitution, deletion, insertion rates with both percentages and absolute counts)

Processing Characteristics

Memory Efficient: Processes files line-by-line with minimal memory footprint (default 100MB allocation)
Pattern-Based Parsing: Uses regex pattern matching to identify and extract specific data types from mixed-format input
Whitespace Handling: Splits fields on runs of whitespace using Java's \p{javaWhitespace}+ regex (written as "\\p{javaWhitespace}+" in Java source), so tabs and repeated spaces are handled alike
Real-Time Output: Produces formatted output incrementally as input is processed
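
The characteristics above can be sketched roughly as follows. The class name, the helper method, and the sample lines are hypothetical and only illustrate streaming plus whitespace splitting; they are not taken from the tool's source.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.util.regex.Pattern;

class StreamingSplitSketch {

    // The whitespace class named in the text, compiled once and reused per line.
    static final Pattern WHITESPACE = Pattern.compile("\\p{javaWhitespace}+");

    /** Splits one line into fields on runs of whitespace (spaces, tabs, ...). */
    static String[] fields(String line) {
        return WHITESPACE.split(line.trim());
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical sample input; a real run would wrap a FileReader instead.
        String sample = "real\t2m30.5s\nSub Rate:   1.5%   42\n";
        try (BufferedReader in = new BufferedReader(new StringReader(sample))) {
            // Only the current line is held in memory, and each result is
            // printed before the next line is read.
            for (String line = in.readLine(); line != null; line = in.readLine()) {
                System.out.println(String.join("|", fields(line)));
            }
        }
    }
}
```

Note that a label containing a space, such as "Sub Rate:", splits into two tokens, so field positions depend on the label that starts the line.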

Research Application

This tool was developed specifically for the BBMerge research paper to standardize the format of comparison data across multiple tools and parameter sets. The consistent tabular output facilitates statistical analysis and visualization of tool performance metrics.

Input Format

ProcessFrag expects input files containing specific line patterns from BBMerge comparison scripts:

Dataset markers: Lines starting with "***" followed by dataset name
Timing data: Lines starting with "real" followed by tab-separated time in format "XmY.ZZs"
Read statistics: Lines starting with "Reads Used:" containing read counts and percentages
Mapping data: Lines starting with "mapped:" containing mapping statistics
Error rates: Lines starting with "Error Rate:", "Sub Rate:", "Del Rate:", or "Ins Rate:" containing rate percentages and counts
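
As an illustration of how these line patterns could be told apart, here is a minimal prefix-dispatch sketch. The enum and class names are hypothetical; only the prefixes come from the list above, and treating every other line as ignorable is an assumption.

```java
class LineClassifierSketch {

    /** One kind per documented line pattern, plus OTHER for everything else. */
    enum Kind { DATASET, TIME, READS_USED, MAPPED, ERROR_RATE, SUB_RATE, DEL_RATE, INS_RATE, OTHER }

    /** Classifies a line by the prefix it starts with. */
    static Kind classify(String line) {
        if (line.startsWith("***")) { return Kind.DATASET; }
        if (line.startsWith("real")) { return Kind.TIME; }
        if (line.startsWith("Reads Used:")) { return Kind.READS_USED; }
        if (line.startsWith("mapped:")) { return Kind.MAPPED; }
        if (line.startsWith("Error Rate:")) { return Kind.ERROR_RATE; }
        if (line.startsWith("Sub Rate:")) { return Kind.SUB_RATE; }
        if (line.startsWith("Del Rate:")) { return Kind.DEL_RATE; }
        if (line.startsWith("Ins Rate:")) { return Kind.INS_RATE; }
        return Kind.OTHER; // assumption: unrecognized lines are skipped
    }

    public static void main(String[] args) {
        // Hypothetical sample lines, not real tool output.
        System.out.println(classify("***sampleA"));    // prints DATASET
        System.out.println(classify("real\t2m30.5s")); // prints TIME
        System.out.println(classify("user\t1m2.0s"));  // prints OTHER
    }
}
```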

Output Format

The tool generates tab-delimited output suitable for spreadsheet import or further statistical analysis. Each dataset produces one row with the following columns:

Dataset name
Processing time (seconds)
Reads used count
Reads used percentage
Mapped count
Mapped percentage
Overall error rate percentage
Overall error count
Substitution rate percentage
Substitution count
Deletion rate percentage
Deletion count
Insertion rate percentage
Insertion count
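
For downstream analysis, a row in this layout can be read back into named fields. The sketch below assumes exactly the 14 columns listed above, in that order. The column keys and the sample values are invented for illustration, and whether percentages carry a "%" sign in real output is not specified here.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class RowReaderSketch {

    // Illustrative keys for the 14 documented columns, in documented order.
    static final String[] COLUMNS = {
        "dataset", "seconds",
        "readsUsed", "readsUsedPct",
        "mapped", "mappedPct",
        "errorPct", "errorCount",
        "subPct", "subCount",
        "delPct", "delCount",
        "insPct", "insCount"
    };

    /** Maps one tab-delimited row onto the column keys above. */
    static Map<String, String> parseRow(String row) {
        String[] values = row.split("\t", -1); // -1 keeps trailing empty fields
        if (values.length != COLUMNS.length) {
            throw new IllegalArgumentException(
                "Expected " + COLUMNS.length + " columns, got " + values.length);
        }
        Map<String, String> named = new LinkedHashMap<>();
        for (int i = 0; i < COLUMNS.length; i++) {
            named.put(COLUMNS[i], values[i]);
        }
        return named;
    }

    public static void main(String[] args) {
        // Hypothetical row; the values are made up for the example.
        String row = String.join("\t", "sampleA", "150.5", "1000", "100.0",
            "990", "99.0", "1.2", "12", "0.8", "8", "0.2", "2", "0.2", "2");
        Map<String, String> named = parseRow(row);
        System.out.println(named.get("dataset") + " took " + named.get("seconds") + " s");
    }
}
```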

Support

For questions and support:

Email: bbushnell@lbl.gov
Documentation: bbmap.org