PlotHist
Generates histograms from a tile dump. Also works on other 2D numeric matrices with a header. Output files are automatically named from the header columns.
Basic Usage
plothist.sh in=<input file> bins=<number>
PlotHist reads tab-delimited numeric data files with headers and generates histogram data for each column. The tool is specifically designed for processing tile dump data but can handle any 2D numeric matrix format. Each column in the input generates a separate .tsv output file containing binned frequency data.
Parameters
Parameters control input processing, histogram generation, and output behavior. The tool uses a two-pass algorithm to first determine data ranges then generate binned histograms.
Parameters
- in=<file>
  Input dump file. Must be tab-delimited with a header line starting with '#'. Each column should contain numeric data. The header defines the output filenames (column_name.tsv).
- bins=1000
  Bins per histogram. Determines the resolution of the output histograms. Each column's data range is divided into this many equal-width bins. Higher values provide more detailed distributions but larger output files.
- overwrite=t
  (ow) Set to false to force the program to abort rather than overwrite an existing file. When true, existing output files will be replaced without warning.
- verbose=f
  Print verbose messages during processing. Enables detailed logging of file operations and processing statistics. Useful for debugging or monitoring large dataset processing.
Java Parameters
- -Xmx
  This will set Java's memory usage, overriding autodetection. -Xmx20g will specify 20 gigs of RAM, and -Xmx200m will specify 200 megs. The max is typically 85% of physical memory. Default is 300m for PlotHist.
- -eoom
  This flag will cause the process to exit if an out-of-memory exception occurs. Requires Java 8u92+. Useful for automated pipelines to detect memory issues early.
- -da
  Disable assertions. Can provide minor performance improvements in production environments by skipping internal consistency checks.
Examples
Basic Histogram Generation
plothist.sh in=tile_data.txt bins=500
Creates histograms from tile_data.txt using 500 bins per column. Each column in the file will generate a separate .tsv file (e.g., quality.tsv, intensity.tsv).
High-Resolution Analysis
plothist.sh in=sequencing_metrics.txt bins=2000 overwrite=f
Generates high-resolution histograms with 2000 bins and prevents overwriting existing files. Useful for detailed distribution analysis.
Verbose Processing
plothist.sh in=large_dataset.txt bins=1000 verbose=t -Xmx8g
Processes a large dataset with verbose logging and increased memory allocation. The verbose output helps monitor progress on large files.
Input Format
PlotHist expects tab-delimited input files with a specific format:
- Header line: Must start with '#' followed by tab-separated column names
- Data lines: Tab-delimited numeric values, one per column
- Comments: Lines starting with '#' (except the header) are ignored
- Empty lines: Ignored during processing
Example Input Format
#quality intensity gc_content
28.5 1245.2 0.42
31.2 1389.7 0.38
# This is a comment
29.8 1156.3 0.45
This input would generate three output files: quality.tsv, intensity.tsv, and gc_content.tsv.
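The format rules above can be sketched in Python. This is a minimal illustration, not PlotHist's own parser: `parse_dump` is a hypothetical helper, and it assumes that the first line starting with '#' is the header and that every data field parses as a number.

```python
def parse_dump(lines):
    """Split dump lines into (column_names, numeric_rows).

    Assumption: the first '#' line is the header; later '#' lines are comments.
    """
    header = None
    rows = []
    for line in lines:
        line = line.rstrip("\r\n")
        if not line:
            continue  # empty lines are ignored
        if line.startswith("#"):
            if header is None:
                header = line[1:].split("\t")  # drop '#', split on tabs
            continue  # any other '#' line is a comment
        rows.append([float(field) for field in line.split("\t")])
    return header, rows


sample = [
    "#quality\tintensity\tgc_content\n",
    "28.5\t1245.2\t0.42\n",
    "31.2\t1389.7\t0.38\n",
    "# This is a comment\n",
    "\n",
    "29.8\t1156.3\t0.45\n",
]
header, rows = parse_dump(sample)
# Output filenames follow the [column_name].tsv convention.
output_files = [name + ".tsv" for name in header]
print(output_files)
```

The sample mirrors the example input, with explicit tab characters between fields.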
Output Format
PlotHist generates one .tsv file per input column with histogram data:
- Filename: [column_name].tsv
- Format: Two columns - bin start value and count
- Bins: Equal-width bins spanning 0 to the maximum value of each column
- Precision: Bin values formatted to 6 decimal places
Example Output (quality.tsv)
quality count
0.000000 0
0.035000 5
0.070000 12
0.105000 23
Each row represents a bin with its starting value and the number of data points in that bin.
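For downstream analysis, an output file can be loaded with a few lines of Python. The sketch below assumes the layout shown in the example (one header row, then tab-separated bin start and count); `read_histogram` is a hypothetical helper name, and the sample text stands in for a real quality.tsv.

```python
import csv
import io


def read_histogram(handle):
    """Read a two-column histogram file into (header, bin_starts, counts)."""
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)  # assumption: first row holds the column labels
    starts, counts = [], []
    for row in reader:
        if not row:
            continue  # tolerate blank lines
        starts.append(float(row[0]))
        counts.append(int(row[1]))
    return header, starts, counts


# In-memory stand-in for quality.tsv, using the example rows.
sample = "quality\tcount\n0.000000\t0\n0.035000\t5\n0.070000\t12\n0.105000\t23\n"
header, starts, counts = read_histogram(io.StringIO(sample))
total = sum(counts)  # number of data points covered by these bins
print(header, total)
```

To read a real file, pass `open("quality.tsv", newline="")` in place of the `io.StringIO` object.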
Algorithm Details
PlotHist uses a two-pass algorithm that keeps memory use low and assigns values to bins accurately.
Two-Pass Processing Strategy
- First Pass (Data Discovery): Reads the entire file to determine the maximum value in each column. This establishes the bin width for each histogram (max_value / bins).
- Second Pass (Binning): Re-reads the file to assign each data point to its appropriate bin using the formula: bin = (value / max_value) * bins.
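The two passes can be sketched as follows. This is a simplified Python illustration under stated assumptions, not the Java implementation: `two_pass_histogram` is a hypothetical name, `handle.seek(0)` stands in for the file reset between passes, values are assumed non-negative (negative values are clamped to bin 0 here), and the sketch allocates exactly `bins` counters per column.

```python
import io


def two_pass_histogram(handle, bins):
    """Return (per-column maxima, per-column bin counts) for a tab-delimited dump."""

    def data_rows():
        handle.seek(0)  # rewind so each pass starts at the top of the file
        for line in handle:
            line = line.rstrip("\r\n")
            if not line or line.startswith("#"):
                continue  # skip the header, comments, and empty lines
            yield [float(field) for field in line.split("\t")]

    # Pass 1: find the maximum value in each column.
    maxima = None
    for row in data_rows():
        if maxima is None:
            maxima = [0.0] * len(row)
        for i, value in enumerate(row):
            maxima[i] = max(maxima[i], value)
    if maxima is None:
        return [], []  # no data lines at all

    # Pass 2: bin = (value / max_value) * bins, clamped to the valid range.
    counts = [[0] * bins for _ in maxima]
    for row in data_rows():
        for i, value in enumerate(row):
            if maxima[i] <= 0:
                index = 0  # assumption: an all-zero column lands in bin 0
            else:
                index = int(value / maxima[i] * bins)
                index = max(0, min(index, bins - 1))
            counts[i][index] += 1
    return maxima, counts


dump = io.StringIO("#a\tb\n1\t10\n2\t20\n4\t40\n")
maxima, counts = two_pass_histogram(dump, bins=4)
print(maxima, counts)
```

Only the maxima and the count matrix persist between passes, which is why memory use does not grow with the number of rows.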
Memory Management
- Count Matrix: Uses a 2D long array [terms][bins+1] to store histogram counts
- Dynamic Sizing: Matrix size determined by actual number of columns and specified bin count
- File Reset: Input file is reset between passes using ByteFile.reset() for efficient re-reading
Precision Handling
- Double Precision: All calculations use double-precision arithmetic to maintain accuracy
- Bin Boundary: Values exactly equal to the maximum are placed in the last valid bin (bins-1) rather than creating an overflow bin
- Output Precision: Bin start values formatted to 6 decimal places for consistent representation
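The boundary rule and the output formatting can be checked with a small Python sketch. Both function names are hypothetical, and the maximum of 35.0 with 1000 bins is chosen only because it yields a bin width of 0.035, matching the example output; it is not taken from real data.

```python
def bin_index(value, max_value, bins):
    """Map a non-negative value to a bin, folding value == max into the last bin."""
    if max_value <= 0:
        return 0  # assumption: a column with no positive data uses bin 0
    return min(int(value / max_value * bins), bins - 1)


def bin_start_label(index, max_value, bins):
    """Format a bin's starting value to 6 decimal places."""
    return "%.6f" % (index * max_value / bins)


# The maximum itself lands in bin 999 rather than a separate overflow bin.
print(bin_index(35.0, 35.0, 1000))
print(bin_start_label(1, 35.0, 1000))
```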
Performance Characteristics
- Time Complexity: O(n), where n is the number of data points; the file is read twice, so the constant factor is roughly 2
- Space Complexity: O(columns * bins) for the count matrix plus O(columns) for metadata
- Memory Usage: Minimal during processing - only the count matrix and current line are held in memory
- I/O Pattern: Sequential reads with file reset - efficient for large files
Error Handling
- Invalid Data: Non-numeric values are skipped with assertion checking
- Missing Headers: Files without proper headers will cause processing errors
- Empty Columns: Columns with no valid data will have zero maximum values
- File Access: The input file is checked for accessibility before processing begins
Performance Guidelines
Memory Requirements
- Default: 300MB should handle most typical use cases
- Large Files: Increase -Xmx for files with many columns or very high bin counts
- Formula: Approximate memory = columns × bins × 8 bytes + file buffering
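As a quick worked instance of the formula, the Python sketch below estimates the count-matrix size only. The function name and the 50-column example are illustrative assumptions; file buffering and JVM overhead are not included.

```python
def count_matrix_bytes(columns, bins, bytes_per_count=8):
    """Approximate size of the count matrix: columns x bins x 8 bytes."""
    return columns * bins * bytes_per_count


# Hypothetical input: 50 columns at the default 1000 bins.
estimate = count_matrix_bytes(50, 1000)
print(estimate)  # 400000 bytes, about 0.4 MB
```

At this scale the matrix is far below the 300 MB default, so raising -Xmx matters mainly for very wide files or very high bin counts.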
Optimization Tips
- Bin Count: Balance resolution against memory usage; 1000 bins is often sufficient
- File Format: Ensure clean tab-delimited format for fastest processing
- Column Order: Processing order follows input column order
Support
For questions and support:
- Email: bbushnell@lbl.gov
- Documentation: bbmap.org