PlotHist

Script: plothist.sh
Package: hiseq
Class: PlotHist.java

Generates histograms from a tile dump. Also works on other 2D numeric matrices with a header. Output files are automatically named from the header columns.

Basic Usage

plothist.sh in=<input file> bins=<number>

PlotHist reads tab-delimited numeric data files with headers and generates histogram data for each column. The tool is specifically designed for processing tile dump data but can handle any 2D numeric matrix format. Each column in the input generates a separate .tsv output file containing binned frequency data.

Parameters

Parameters control input processing, histogram generation, and output behavior. The tool uses a two-pass algorithm to first determine data ranges then generate binned histograms.

Parameters

in=<file>
Input dump file. Must be tab-delimited with a header line starting with '#'. Each column should contain numeric data. The header defines the output filenames (column_name.tsv).
bins=1000
Bins per histogram. Determines the resolution of the output histograms. Each column's data range is divided into this many equal-width bins. Higher values provide more detailed distributions but larger output files.
overwrite=t
(ow) Set to false to force the program to abort rather than overwrite an existing file. When true, existing output files will be replaced without warning.
verbose=f
Print verbose messages during processing. Enables detailed logging of file operations and processing statistics. Useful for debugging or monitoring large dataset processing.
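
To make the effect of bins concrete, here is a minimal Python sketch of equal-width binning as the parameter description states it. It is an illustration, not PlotHist's actual code: the function names bin_starts and bin_index are hypothetical, and the choice to place the column maximum in the last bin is an assumption.

```python
def bin_starts(lo: float, hi: float, bins: int) -> list[float]:
    """Return the starting value of each of `bins` equal-width bins over [lo, hi]."""
    if bins < 1:
        raise ValueError("bins must be at least 1")
    width = (hi - lo) / bins
    return [lo + i * width for i in range(bins)]


def bin_index(value: float, lo: float, hi: float, bins: int) -> int:
    """Map a value to its bin; the maximum falls in the last bin (assumption)."""
    if hi == lo:
        # Degenerate column: every value is identical, so use a single bin.
        return 0
    idx = int((value - lo) / (hi - lo) * bins)
    # Clamp so that value == hi (or rounding error) never indexes past the end.
    return min(max(idx, 0), bins - 1)


starts = bin_starts(0.0, 10.0, 5)       # five bins, each 2.0 wide
middle = bin_index(3.9, 0.0, 10.0, 5)   # lands in the bin starting at 2.0
```

Doubling bins halves the bin width and doubles the number of rows in each output file, which is the resolution versus file size trade-off noted for the bins parameter.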

Java Parameters

-Xmx
Sets Java's maximum heap size, overriding autodetection. -Xmx20g specifies 20 GB of RAM, and -Xmx200m specifies 200 MB. The maximum is typically 85% of physical memory. The default for PlotHist is 300m.
-eoom
This flag will cause the process to exit if an out-of-memory exception occurs. Requires Java 8u92+. Useful for automated pipelines to detect memory issues early.
-da
Disable assertions. Can provide minor performance improvements in production environments by skipping internal consistency checks.

Examples

Basic Histogram Generation

plothist.sh in=tile_data.txt bins=500

Creates histograms from tile_data.txt using 500 bins per column. Each column in the file will generate a separate .tsv file (e.g., quality.tsv, intensity.tsv).

High-Resolution Analysis

plothist.sh in=sequencing_metrics.txt bins=2000 overwrite=f

Generates high-resolution histograms with 2000 bins and prevents overwriting existing files. Useful for detailed distribution analysis.

Verbose Processing

plothist.sh in=large_dataset.txt bins=1000 verbose=t -Xmx8g

Processes a large dataset with verbose logging and increased memory allocation. The verbose output helps monitor progress on large files.

Input Format

PlotHist expects tab-delimited input files with a specific format:

Example Input Format

#quality	intensity	gc_content
28.5	1245.2	0.42
31.2	1389.7	0.38
# This is a comment
29.8	1156.3	0.45

This input would generate three output files: quality.tsv, intensity.tsv, and gc_content.tsv.
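
The following Python sketch shows one way to parse this layout and derive the output filenames. It is not PlotHist's parser. The function name read_matrix is hypothetical, and treating every '#' line after the header as a skipped comment is an assumption based on the example above.

```python
import io


def read_matrix(handle):
    """Parse a '#'-headed, tab-delimited numeric matrix into (names, columns)."""
    names = None
    columns = []
    for line in handle:
        line = line.rstrip("\r\n")
        if not line:
            continue
        if line.startswith("#"):
            if names is None:
                # The first '#' line is the header; drop the '#' and split on tabs.
                names = line[1:].split("\t")
                columns = [[] for _ in names]
            # Assumption: any later '#' line is a comment and is skipped.
            continue
        if names is None:
            raise ValueError("data row found before the '#' header line")
        fields = line.split("\t")
        if len(fields) != len(names):
            raise ValueError("row has %d fields, expected %d" % (len(fields), len(names)))
        for column, field in zip(columns, fields):
            column.append(float(field))
    if names is None:
        raise ValueError("no '#' header line found")
    return names, columns


sample = (
    "#quality\tintensity\tgc_content\n"
    "28.5\t1245.2\t0.42\n"
    "31.2\t1389.7\t0.38\n"
    "# This is a comment\n"
    "29.8\t1156.3\t0.45\n"
)
names, columns = read_matrix(io.StringIO(sample))
output_files = [name + ".tsv" for name in names]
```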

Output Format

PlotHist generates one .tsv file per input column with histogram data:

Example Output (quality.tsv)

quality	count
0.000000	0
0.035000	5
0.070000	12
0.105000	23

Each row represents a bin with its starting value and the number of data points in that bin.
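
For downstream use, a short Python sketch can load one of these files and summarize it. It assumes the two columns are tab-separated and that the first line is a header, as in the example above. The function name read_histogram is hypothetical.

```python
import io


def read_histogram(handle):
    """Read a histogram .tsv into (header, [(bin_start, count), ...])."""
    header = next(handle).rstrip("\r\n").split("\t")
    rows = []
    for line in handle:
        line = line.rstrip("\r\n")
        if not line:
            continue
        start, count = line.split("\t")
        rows.append((float(start), int(count)))
    return header, rows


sample = (
    "quality\tcount\n"
    "0.000000\t0\n"
    "0.035000\t5\n"
    "0.070000\t12\n"
    "0.105000\t23\n"
)
header, rows = read_histogram(io.StringIO(sample))

# Total number of data points across the bins that were read.
total = sum(count for _, count in rows)

# The bin holding the most data points (the mode of the binned distribution).
peak_start, peak_count = max(rows, key=lambda row: row[1])
```

With a real file, replace the io.StringIO(sample) handle with open("quality.tsv").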

Algorithm Details

PlotHist uses a two-pass algorithm optimized for memory efficiency and accurate binning:

Two-Pass Processing Strategy
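
As a rough illustration of the two-pass idea (first determine each column's data range, then bin the values), here is a minimal Python sketch. It is not PlotHist's implementation: the function name two_pass_histogram and the rows_factory argument are hypothetical, and placing the column maximum in the last bin is an assumption.

```python
def two_pass_histogram(rows_factory, bins):
    """Bin every column of a numeric matrix using two passes over the data.

    rows_factory() must return a fresh iterator over rows (lists of floats),
    so the data can be streamed twice instead of being held in memory.
    """
    # Pass 1: track the minimum and maximum of each column.
    lows = highs = None
    for row in rows_factory():
        if lows is None:
            lows, highs = list(row), list(row)
            continue
        for i, value in enumerate(row):
            if value < lows[i]:
                lows[i] = value
            if value > highs[i]:
                highs[i] = value
    if lows is None:
        return [], [], []

    # Pass 2: count values into equal-width bins spanning [low, high].
    counts = [[0] * bins for _ in lows]
    for row in rows_factory():
        for i, value in enumerate(row):
            span = highs[i] - lows[i]
            if span == 0:
                idx = 0  # constant column: everything goes in the first bin
            else:
                # Clamp so the column maximum lands in the last bin (assumption).
                idx = min(int((value - lows[i]) / span * bins), bins - 1)
            counts[i][idx] += 1
    return lows, highs, counts


data = [[28.5, 1245.2, 0.42], [31.2, 1389.7, 0.38], [29.8, 1156.3, 0.45]]
lows, highs, counts = two_pass_histogram(lambda: iter(data), bins=4)
```

In this sketch only the per-column ranges and count arrays are kept between passes, so memory use depends on the number of columns and bins, not on the number of rows.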

Memory Management

Precision Handling

Performance Characteristics

Error Handling

Performance Guidelines

Memory Requirements

Optimization Tips

Support

For questions and support: