BloomFilterParser
Parses verbose output from bloomfilter.sh and tabulates it. It was written for a specific paper, so it is irrelevant for most people but useful for reproducing the published results.
Basic Usage
bloomfilterparser.sh in=<input file> out=<output file>
The input file should contain whatever bloomfilter.sh prints to the screen, such as a captured SLURM log (e.g., in=slurm-3249652.out out=summary.txt). Adding the verbose flag to bloomfilter.sh makes it also report details of calls to increment().
Parameters
This tool has a limited set of parameters focused on parsing bloomfilter output.
Input/Output Parameters
- in=file
- Input file containing verbose output from bloomfilter.sh. This should be a text file with the screen output captured from bloomfilter.sh execution.
- out=file
- Output file for the parsed and tabulated results. Default is "stdout.txt" if not specified.
- invalid=file
- Optional output file for invalid lines that couldn't be parsed. Lines that don't match expected patterns are written here.
Processing Parameters
- lines=<integer>
- Maximum number of lines to process. If negative or not specified, processes all lines in the input file.
- verbose=<boolean>
- Enable verbose output during parsing. Shows detailed processing information. Default: false
Standard Parameters
- overwrite=true
- Allow overwriting of existing output files.
- append=false
- Append to existing output files instead of overwriting.
Examples
Basic Parsing
bloomfilterparser.sh in=slurm-3249652.out out=summary.txt
Parses the output from a SLURM job that ran bloomfilter.sh and creates a summary table of the results.
With Invalid Line Capture
bloomfilterparser.sh in=bloomfilter_output.txt out=parsed_results.txt invalid=unparsed_lines.txt
Parses bloomfilter output while capturing any lines that couldn't be parsed into a separate file for review.
Limited Line Processing
bloomfilterparser.sh in=large_output.txt out=results.txt lines=1000
Only processes the first 1000 lines of the input file, useful for testing or when dealing with very large output files.
Algorithm Details
BloomFilterParser is a specialized text processing tool designed to extract structured data from bloomfilter.sh verbose output for research reproducibility.
Parsing Strategy
The parser uses pattern matching to identify and extract specific types of information from bloomfilter.sh output:
- Header Detection: Lines beginning with '#' are treated as headers and preserved in the output
- Command Parsing: Lines starting with "Executing bloom.BloomFilterWrapper" are parsed to extract thread count parameters (t=value)
- Statistics Extraction: Lines containing "Keys Counted:" or "Increments:" have their numerical values extracted
- Timing Information: Lines with "Filter creation:" have timing data extracted (second-to-last field)
- Invalid Line Handling: Lines that match none of these patterns are written to the file given by invalid=, if one is specified
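The rules above can be sketched in Python as a single classification function. This is an illustration, not the tool's actual Java code: the function name classify_line, the returned (kind, value) pairs, and the assumption that the count follows directly after "Keys Counted:" or "Increments:" are all hypothetical. Only the line prefixes, the t=value parameter, and the second-to-last-field rule for timing come from the description above.

```python
import re

def classify_line(line):
    """Return a (kind, value) pair for one line of bloomfilter.sh output."""
    text = line.rstrip("\r\n")
    if text.startswith("#"):
        return ("header", text)                      # headers are kept as-is
    if text.startswith("Executing bloom.BloomFilterWrapper"):
        # \b keeps "out=5" from being mistaken for "t=5".
        match = re.search(r"\bt=(\d+)", text)
        return ("threads", int(match.group(1))) if match else ("invalid", text)
    for label, kind in (("Keys Counted:", "keys"), ("Increments:", "increments")):
        if label in text:
            # Assumption: the integer appears right after the label.
            match = re.search(re.escape(label) + r"\s*(\d+)", text)
            return (kind, int(match.group(1))) if match else ("invalid", text)
    if "Filter creation:" in text:
        fields = text.split()
        try:
            return ("seconds", float(fields[-2]))    # second-to-last field
        except (IndexError, ValueError):
            return ("invalid", text)
    return ("invalid", text)                         # candidate for invalid=
```

Returning an explicit "invalid" kind lets the caller decide whether to discard such lines or write them to a separate file.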
Output Format
The tool converts verbose bloomfilter output into a tabular format suitable for analysis:
- Thread counts are extracted and tabulated
- Key counts and increment statistics are organized into columns
- Filter creation times are captured for performance analysis
- Header information is preserved to maintain context
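As a rough illustration of this tabulation, the Python sketch below groups already-extracted values into tab-delimited rows. The column order, the (kind, value) input format, and the rule that each thread-count entry starts a new row are assumptions made for the example; the real tool's output layout may differ.

```python
COLUMNS = ("threads", "keys", "increments", "seconds")  # hypothetical column order

def format_row(record):
    """Join one record into a tab-delimited row, leaving missing cells empty."""
    return "\t".join(str(record.get(column, "")) for column in COLUMNS)

def tabulate(events):
    """Turn (kind, value) pairs into output lines: headers plus one row per run."""
    rows = []
    current = None
    for kind, value in events:
        if kind == "header":
            rows.append(str(value))            # preserve header lines for context
        elif kind == "threads":
            if current is not None:
                rows.append(format_row(current))
            current = {"threads": value}       # assumed: a new command starts a new row
        elif kind in COLUMNS:
            if current is None:
                current = {}
            current[kind] = value
    if current is not None:
        rows.append(format_row(current))       # flush the final record
    return rows
```

For example, tabulate([("header", "#run"), ("threads", 8), ("keys", 100)]) yields a header line followed by one row with the last two cells empty.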
Memory Usage
The parser needs little memory (the default heap is 300MB) because it processes files line by line, which makes it suitable for large bloomfilter.sh output files. Memory usage is configured via standard Java heap parameters.
Use Case
This tool is specifically designed for researchers who need to reproduce published results involving bloom filters. It was created for a specific paper and extracts exactly the metrics needed for that research. While not generally useful, it demonstrates how to systematically parse complex bioinformatics tool output for downstream analysis.
Support
For questions and support:
- Email: bbushnell@lbl.gov
- Documentation: bbmap.org