BloomFilterParser

Script: bloomfilterparser.sh
Package: bloom
Class: ParseBloomFilter.java

Parses the verbose output of bloomfilter.sh and tabulates it. The tool was written for a specific paper, so it is irrelevant for most people but useful for reproducing the published results.

Basic Usage

bloomfilterparser.sh in=<input file> out=<output file>

The input file should contain whatever bloomfilter.sh printed to the screen, for example a SLURM log such as slurm-3249652.out. Adding the verbose flag to bloomfilter.sh includes details of calls to increment() in that output.

Parameters

This tool has a small set of parameters, all concerned with parsing bloomfilter.sh output.

Input/Output Parameters

in=file: Input file containing verbose output from bloomfilter.sh. This should be a text file with the screen output captured from bloomfilter.sh execution.
out=file: Output file for the parsed and tabulated results. Default is "stdout.txt" if not specified.
invalid=file: Optional output file for invalid lines that couldn't be parsed. Lines that don't match expected patterns are written here.

Processing Parameters

lines=<integer>: Maximum number of lines to process. If negative or not specified, processes all lines in the input file.
verbose=<boolean>: Enable verbose output during parsing. Shows detailed processing information. Default: false

Standard Parameters

overwrite=<boolean>: Allow overwriting of existing output files. Default: true
append=<boolean>: Append to existing output files instead of overwriting them. Default: false

Examples

Basic Parsing

bloomfilterparser.sh in=slurm-3249652.out out=summary.txt

Parses the output from a SLURM job that ran bloomfilter.sh and creates a summary table of the results.

With Invalid Line Capture

bloomfilterparser.sh in=bloomfilter_output.txt out=parsed_results.txt invalid=unparsed_lines.txt

Parses bloomfilter.sh output and writes any lines that could not be parsed to a separate file for review.

Limited Line Processing

bloomfilterparser.sh in=large_output.txt out=results.txt lines=1000

Only processes the first 1000 lines of the input file, useful for testing or when dealing with very large output files.
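
The line-limit rule can be illustrated with a short Python sketch. This is not the tool's Java code; the function name limited_lines and its max_lines argument are hypothetical and only mirror the documented rule that a negative or missing lines value means "process everything".

```python
import io
from itertools import islice


def limited_lines(handle, max_lines=-1):
    """Return an iterator over at most max_lines lines of handle.

    A negative max_lines (the default here) means no limit, matching
    the documented behavior of the lines parameter.
    """
    if max_lines < 0:
        return iter(handle)
    return islice(handle, max_lines)


if __name__ == "__main__":
    sample = io.StringIO("line 1\nline 2\nline 3\n")
    for line in limited_lines(sample, 2):
        print(line.rstrip("\n"))
```

Because the input is consumed lazily, a limit of 1000 reads only the first 1000 lines even when the file is very large.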

Algorithm Details

BloomFilterParser is a specialized text processing tool designed to extract structured data from bloomfilter.sh verbose output for research reproducibility.

Parsing Strategy

The parser uses pattern matching to identify and extract specific types of information from bloomfilter.sh output:

Header Detection: Lines beginning with '#' are treated as headers and preserved in the output
Command Parsing: Lines starting with "Executing bloom.BloomFilterWrapper" are parsed to extract thread count parameters (t=value)
Statistics Extraction: Lines containing "Keys Counted:" or "Increments:" have their numerical values extracted
Timing Information: Lines with "Filter creation:" have timing data extracted (second-to-last field)
Invalid Line Handling: Any lines that don't match these patterns are optionally written to a separate invalid file
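
The rules above can be sketched in Python as a single classification function. This is a minimal illustration, not the code in ParseBloomFilter.java. The function name, the kind labels, and the return shape are hypothetical. It also assumes that the number on a "Keys Counted:" or "Increments:" line is the last whitespace-separated field, and that the command line lists its arguments as comma-separated tokens inside brackets.

```python
def classify_line(line):
    """Classify one line of bloomfilter.sh output as a (kind, value) pair.

    The kind labels are hypothetical; only the matching rules follow
    the parsing strategy described in this document.
    """
    line = line.rstrip("\n")
    if line.startswith("#"):
        # Headers are kept verbatim.
        return ("header", line)
    if line.startswith("Executing bloom.BloomFilterWrapper"):
        # Find the thread-count token t=value. Brackets and commas are
        # stripped on the assumption that arguments look like "[a, t=8]".
        for token in line.split():
            token = token.strip("[],")
            if token.startswith("t="):
                return ("threads", token[2:])
        return ("threads", None)
    if "Keys Counted:" in line:
        # Assumption: the count is the last field on the line.
        return ("keys", line.split()[-1])
    if "Increments:" in line:
        # Assumption: the count is the last field on the line.
        return ("increments", line.split()[-1])
    if "Filter creation:" in line:
        # The timing value is the second-to-last field.
        return ("time", line.split()[-2])
    return ("invalid", line)


if __name__ == "__main__":
    # These sample lines are invented for illustration.
    samples = [
        "#Run A",
        "Executing bloom.BloomFilterWrapper [in=reads.fq, t=8, verbose]",
        "Keys Counted:   12345",
        "Filter creation: 12.345 seconds.",
        "Loading reads",
    ]
    for s in samples:
        print(classify_line(s))
```

Lines classified as invalid are the ones that would go to the file named by the invalid parameter, if one is given.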

Output Format

The tool converts verbose bloomfilter.sh output into a tabular format suitable for analysis:

Thread counts are extracted and tabulated
Key counts and increment statistics are organized into columns
Filter creation times are captured for performance analysis
Header information is preserved to maintain context
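
One way to picture the tabulation step is the Python sketch below. The actual column order and row layout written by the tool are not documented here, so everything about the layout is an assumption: each run is taken to start at a thread-count record and to become one tab-delimited row with the hypothetical columns threads, keys, increments, and time.

```python
# Hypothetical column order; the tool's real layout may differ.
COLUMNS = ("threads", "keys", "increments", "time")


def format_row(row):
    """Join one run's values with tabs, leaving missing values empty."""
    return "\t".join(str(row.get(name, "")) for name in COLUMNS)


def tabulate(records):
    """Convert (kind, value) pairs into a list of output lines.

    Header records pass through unchanged. A "threads" record starts a
    new row, and later keys/increments/time records fill that row in.
    """
    lines = []
    row = None
    for kind, value in records:
        if kind in ("header", "threads") and row is not None:
            # A new header or run closes the row in progress.
            lines.append(format_row(row))
            row = None
        if kind == "header":
            lines.append(value)
        elif kind == "threads":
            row = {"threads": value}
        elif kind in COLUMNS and row is not None:
            row[kind] = value
    if row is not None:
        lines.append(format_row(row))
    return lines


if __name__ == "__main__":
    # These records are invented for illustration.
    demo = [
        ("header", "#run1"),
        ("threads", "8"),
        ("keys", "100"),
        ("increments", "250"),
        ("time", "1.5"),
    ]
    print("\n".join(tabulate(demo)))
```

Keeping one row per run makes it easy to load the result into a spreadsheet or plotting tool and compare filter creation time against thread count.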

Memory Usage

The parser needs little memory (the default heap is 300 MB) because it processes files line by line, so it is suitable for large bloomfilter.sh output files. Memory usage is configured via standard Java heap parameters.

Use Case

This tool is designed for researchers who need to reproduce published results involving Bloom filters. It was created for a specific paper and extracts exactly the metrics needed for that research. While not generally useful, it demonstrates how to systematically parse complex bioinformatics tool output for downstream analysis.

Support

For questions and support:

Email: bbushnell@lbl.gov
Documentation: bbmap.org