PlotHist

Basic Usage

plothist.sh in=<input file> bins=<number>

PlotHist reads tab-delimited numeric data files with headers and generates histogram data for each column. The tool is specifically designed for processing tile dump data but can handle any 2D numeric matrix format. Each column in the input generates a separate .tsv output file containing binned frequency data.

Parameters

Parameters control input processing, histogram generation, and output behavior. The tool reads the input twice: the first pass finds each column's maximum value, and the second pass assigns the data to bins.

Parameters

in=<file>: Input dump file. Must be tab-delimited with a header line starting with '#'. Each column should contain numeric data. The header defines the output filenames (column_name.tsv).
bins=1000: Bins per histogram. Determines the resolution of the output histograms. Each column's data range is divided into this many equal-width bins. Higher values provide more detailed distributions but larger output files.
overwrite=t: (ow) Set to false to force the program to abort rather than overwrite an existing file. When true, existing output files will be replaced without warning.
verbose=f: Print verbose messages during processing. Enables detailed logging of file operations and processing statistics. Useful for debugging or monitoring large dataset processing.

Java Parameters

-Xmx: This will set Java's memory usage, overriding autodetection. -Xmx20g will specify 20 gigs of RAM, and -Xmx200m will specify 200 megs. The max is typically 85% of physical memory. Default is 300m for PlotHist.
-eoom: This flag will cause the process to exit if an out-of-memory exception occurs. Requires Java 8u92+. Useful for automated pipelines to detect memory issues early.
-da: Disable assertions. Can provide minor performance improvements in production environments by skipping internal consistency checks.

Examples

Basic Histogram Generation

plothist.sh in=tile_data.txt bins=500

Creates histograms from tile_data.txt using 500 bins per column. Each column in the file will generate a separate .tsv file (e.g., quality.tsv, intensity.tsv).

High-Resolution Analysis

plothist.sh in=sequencing_metrics.txt bins=2000 overwrite=f

Generates high-resolution histograms with 2000 bins and prevents overwriting existing files. Useful for detailed distribution analysis.

Verbose Processing

plothist.sh in=large_dataset.txt bins=1000 verbose=t -Xmx8g

Processes a large dataset with verbose logging and increased memory allocation. The verbose output helps monitor progress on large files.

Input Format

PlotHist expects tab-delimited input files with a specific format:

Header line: Must start with '#' followed by tab-separated column names
Data lines: Tab-delimited numeric values, one per column
Comments: Lines starting with '#' (except the header) are ignored
Empty lines: Ignored during processing

Example Input Format

#quality	intensity	gc_content
28.5	1245.2	0.42
31.2	1389.7	0.38
# This is a comment
29.8	1156.3	0.45

This input would generate three output files: quality.tsv, intensity.tsv, and gc_content.tsv.
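The format rules above can be sketched as a small parser. This is an illustrative Python sketch, not PlotHist's Java code: the function name parse_dump is hypothetical, and it assumes that the first line starting with '#' is the header and that every data field is numeric.

```python
def parse_dump(lines):
    """Parse tab-delimited dump text into (column_names, rows_of_floats)."""
    header = None
    rows = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line.strip():
            continue  # empty lines are ignored
        if line.startswith("#"):
            if header is None:
                # assumption: the first '#' line is the header
                header = line[1:].split("\t")
            continue  # any later '#' line is a comment
        rows.append([float(field) for field in line.split("\t")])
    return header, rows


sample = [
    "#quality\tintensity\tgc_content\n",
    "28.5\t1245.2\t0.42\n",
    "31.2\t1389.7\t0.38\n",
    "# This is a comment\n",
    "\n",
    "29.8\t1156.3\t0.45\n",
]
names, rows = parse_dump(sample)
print(names)      # column names taken from the header
print(len(rows))  # number of data lines kept
```

Each name in the returned header corresponds to one output file, so the sample yields three histograms.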

Output Format

PlotHist generates one .tsv file per input column with histogram data:

Filename: [column_name].tsv
Format: Two columns - bin start value and count
Bins: Equal-width bins spanning 0 to maximum value in each column
Precision: Bin values formatted to 6 decimal places

Example Output (quality.tsv)

quality	count
0.000000	0
0.035000	5
0.070000	12
0.105000	23

Each row represents a bin with its starting value and the number of data points in that bin.
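As a rough illustration of this layout, the Python sketch below renders a list of bin counts as the two-column table shown above. The function name format_histogram and the tiny four-bin input are hypothetical; only the header row, the bin-start column, and the 6-decimal formatting are taken from the description.

```python
def format_histogram(column_name, counts, max_value):
    """Return the lines of a two-column histogram table as strings."""
    bins = len(counts)
    width = max_value / bins  # equal-width bins spanning 0..max_value
    lines = [f"{column_name}\tcount"]
    for index, count in enumerate(counts):
        # bin start value, formatted to 6 decimal places, then the count
        lines.append(f"{index * width:.6f}\t{count}")
    return lines


# Hypothetical data: 4 bins over 0..0.14 gives a bin width of 0.035.
lines = format_histogram("quality", [0, 5, 12, 23], max_value=0.14)
print("\n".join(lines))
```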

Algorithm Details

PlotHist uses a two-pass algorithm that keeps memory use low by never holding the full dataset in memory:

Two-Pass Processing Strategy

First Pass (Data Discovery): Reads the entire file to determine the maximum value in each column. This establishes the bin width for each histogram (max_value / bins).
Second Pass (Binning): Re-reads the file to assign each data point to its appropriate bin using the formula: bin = (value / max_value) * bins.
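The two passes can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the tool's Java implementation: the function name two_pass_histogram is hypothetical, the rows are held in a list rather than re-read from disk, the bin index is truncated to an integer, negative values are skipped, and a value equal to the column maximum is clamped into the last bin (bins-1) as described under Precision Handling. Because of that clamp, the sketch allocates exactly bins slots per column.

```python
def two_pass_histogram(rows, num_columns, bins):
    """Return (maxima, counts) for a list of numeric rows."""
    # Pass 1: find the maximum value in each column.
    maxima = [0.0] * num_columns
    for row in rows:
        for col, value in enumerate(row):
            if value > maxima[col]:
                maxima[col] = value

    # Pass 2: assign every value to a bin in 0..bins-1.
    counts = [[0] * bins for _ in range(num_columns)]
    for row in rows:
        for col, value in enumerate(row):
            if maxima[col] <= 0 or value < 0:
                continue  # assumption: nothing to bin outside 0..max
            index = int(value / maxima[col] * bins)
            index = min(index, bins - 1)  # value == max goes in the last bin
            counts[col][index] += 1
    return maxima, counts


rows = [
    [28.5, 1245.2, 0.42],
    [31.2, 1389.7, 0.38],
    [29.8, 1156.3, 0.45],
]
maxima, counts = two_pass_histogram(rows, num_columns=3, bins=10)
print(maxima)
print(counts[0])
```

With only 10 bins, all three quality values fall in the last bin because they lie within 10% of the column maximum; more bins spread them out.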

Memory Management

Count Matrix: Uses a 2D long array [terms][bins+1] to store histogram counts
Dynamic Sizing: Matrix size determined by actual number of columns and specified bin count
File Reset: Input file is reset between passes using ByteFile.reset() for efficient re-reading

Precision Handling

Double Precision: All calculations use double-precision arithmetic to maintain accuracy
Bin Boundary: Values exactly equal to the maximum are placed in the last valid bin (bins-1) rather than creating an overflow bin
Output Precision: Bin start values formatted to 6 decimal places for consistent representation

Performance Characteristics

Time Complexity: O(n), where n is the number of data points; the file is read twice, so each value is processed two times
Space Complexity: O(columns * bins) for the count matrix plus O(columns) for metadata
Memory Usage: Minimal during processing - only the count matrix and current line are held in memory
I/O Pattern: Sequential reads with file reset - efficient for large files

Error Handling

Invalid Data: Non-numeric values are skipped with assertion checking
Missing Headers: Files without proper headers will cause processing errors
Empty Columns: Columns with no valid data will have a maximum value of zero
File Access: Input file accessibility is checked before processing begins

Performance Guidelines

Memory Requirements

Default: 300MB should handle most typical use cases
Large Files: Increase -Xmx for files with many columns or very high bin counts
Formula: Approximate memory = columns × bins × 8 bytes + file buffering
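The formula can be worked through with a short Python helper. The function name and the example sizes are hypothetical, and the estimate covers only the count matrix (8 bytes per count), not file buffering or JVM overhead.

```python
def estimate_count_matrix_bytes(columns, bins):
    """Approximate count-matrix size: columns x bins x 8 bytes."""
    return columns * bins * 8


small = estimate_count_matrix_bytes(columns=3, bins=1000)
large = estimate_count_matrix_bytes(columns=500, bins=100_000)
print(small)        # 24000 bytes, far below the 300 MB default
print(large / 1e6)  # 400.0 MB, which would call for a larger -Xmx
```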

Optimization Tips

Bin Count: Balance resolution needs against memory usage and output size; 1000 bins is often sufficient
File Format: Ensure clean tab-delimited format for fastest processing
Column Order: Processing order follows input column order

Support

For questions and support:

Email: bbushnell@lbl.gov
Documentation: bbmap.org