MatrixToColumns

Basic Usage

matrixtocolumns.sh in1=<matrix1> in2=<matrix2> out=<file>

This tool takes two input identity matrices and converts them into a two-column tab-delimited format where each row represents a paired entry from the same position in both matrices.

Parameters

Parameters are grouped by function: input/output, processing, and Java runtime options.

Input/Output Parameters

in1=<matrix1>: First input matrix file. Required parameter specifying the path to the first identity matrix.
in2=<matrix2>: Second input matrix file. Required parameter specifying the path to the second identity matrix.
out=<file>: Output file for the two-column format results. Required parameter specifying where to write the transformed matrix data.
overwrite=true: Allow overwriting of existing output files. Default: true

Processing Parameters

samplerate=1.0: Fraction of matrix entries to include in output (0.0-1.0). Default: 1.0 (include all entries). Note: Current implementation processes all entries regardless of this parameter value.
sampleseed=-1: Random seed parameter. Default: -1. Note: Current implementation uses Collections.shuffle() without explicit seeding.

Java Parameters

-Xmx: This will set Java's memory usage, overriding autodetection. -Xmx20g will specify 20 gigs of RAM, and -Xmx200m will specify 200 megs. The max is typically 85% of physical memory.
-eoom: This flag will cause the process to exit if an out-of-memory exception occurs. Requires Java 8u92+.
-da: Disable assertions.

Examples

Basic Matrix Transformation

matrixtocolumns.sh in1=identity_matrix1.txt in2=identity_matrix2.txt out=correlation_data.txt

Transforms two identity matrices into a two-column format suitable for correlation analysis or plotting.
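As one possible downstream step, the two-column output can be read back and summarized with a Pearson correlation coefficient. The sketch below is illustrative Python, not part of the tool; the sample rows are hypothetical stand-ins for the contents of correlation_data.txt, and it assumes every entry is numeric.

```python
import math

def read_pairs(lines):
    """Parse tab-delimited two-column lines into two lists of floats."""
    xs, ys = [], []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        a, b = line.split("\t")
        xs.append(float(a))
        ys.append(float(b))
    return xs, ys

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical rows standing in for the tool's output file.
sample = ["0.91\t0.89", "0.75\t0.70", "0.60\t0.66"]
xs, ys = read_pairs(sample)
r = pearson(xs, ys)
```

Because row order in the output is shuffled, order-independent statistics such as this one are a natural fit.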

Sampling Matrix Data

matrixtocolumns.sh in1=large_matrix1.txt in2=large_matrix2.txt out=sampled_data.txt samplerate=0.1 sampleseed=12345

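Requests 10% of the matrix entries with seed 12345. Per the notes under Processing Parameters, the current implementation writes all entries and shuffles without an explicit seed, so neither setting changes the output at present. For large matrices where a subset is sufficient, subsample the output file in a downstream step.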
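Since the notes under Processing Parameters say the current implementation does not apply samplerate or sampleseed, a reproducible subset can instead be drawn from the output file afterwards. The following is a minimal Python sketch of that idea, not the tool's own behavior; subsample_rows and the generated rows are hypothetical.

```python
import random

def subsample_rows(rows, rate, seed):
    """Keep each row independently with probability `rate`, using a fixed seed."""
    rng = random.Random(seed)  # seeded generator, so repeated calls agree
    return [row for row in rows if rng.random() < rate]

# Hypothetical two-column rows standing in for the tool's output.
rows = [f"{i / 1000:.3f}\t{(i + 1) / 1000:.3f}" for i in range(1000)]
first = subsample_rows(rows, 0.1, 12345)
second = subsample_rows(rows, 0.1, 12345)
```

Each row is kept or dropped as a unit, so the pairing between the two columns is preserved.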

Processing with Custom Memory

matrixtocolumns.sh -Xmx8g in1=matrix1.txt in2=matrix2.txt out=results.txt

Sets the Java maximum heap size to 8 GB for processing large identity matrices.

Algorithm Details

The matrixtocolumns tool implements a straightforward matrix transformation algorithm designed for processing identity matrices:

Processing Strategy

The algorithm uses a three-phase approach:

Matrix Loading: Reads both input matrices completely into memory using TextFile.doublesplitWhitespace() method. Matrices are stored as String[][] arrays for flexible handling of numeric and text data.
Entry Extraction: Iterates through the lower triangular portion using nested loops: outer loop (i=0; i<matrix1.length; i++) and inner loop (j=1; j<=i; j++). Each iteration extracts the pair matrix1[i][j] and matrix2[i][j]. Because j starts at 1, column 0 is never read and row 0 contributes no pairs.
Output Generation: Pairs are collected in an ArrayList<String[]>, randomized using Collections.shuffle(), then written via TextStreamWriter with tab delimiter (\t) between paired values.
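The three phases above can be sketched as follows. This is an illustrative Python rendering of the described loop bounds, not the tool's Java source; the function names and the 3x3 toy matrices are hypothetical, and the seeded generator is used only to make the example repeatable.

```python
import random

def split_rows(text):
    """Phase 1: split text into rows, then each row on whitespace (kept as strings)."""
    return [line.split() for line in text.splitlines() if line.strip()]

def matrix_to_pairs(matrix1, matrix2, rng=None):
    """Phase 2: collect (matrix1[i][j], matrix2[i][j]) for j in 1..i, then shuffle."""
    pairs = []
    for i in range(len(matrix1)):
        for j in range(1, i + 1):  # mirrors (j=1; j<=i; j++)
            pairs.append((matrix1[i][j], matrix2[i][j]))
    (rng or random).shuffle(pairs)
    return pairs

def format_pairs(pairs):
    """Phase 3: one tab-delimited line per pair."""
    return "".join(f"{a}\t{b}\n" for a, b in pairs)

# Hypothetical toy inputs.
m1 = split_rows("1.0 0.9 0.8\n0.9 1.0 0.7\n0.8 0.7 1.0\n")
m2 = split_rows("1.0 0.5 0.4\n0.5 1.0 0.3\n0.4 0.3 1.0\n")
pairs = matrix_to_pairs(m1, m2, random.Random(0))
output = format_pairs(pairs)
```

With three rows, the loops visit positions (1,1), (2,1), and (2,2), so three pairs are produced; in general, row i contributes i pairs.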

Memory Efficiency

The implementation loads both matrices entirely into memory, making it suitable for moderately sized identity matrices but potentially memory-intensive for very large datasets. Memory usage scales with the square of the matrix dimension (O(n²) for n×n matrices).
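To make the quadratic growth concrete, the arithmetic can be sketched in Python as below. The bytes_per_entry argument is a placeholder supplied by the caller: the real per-entry cost inside the JVM depends on string length and JVM settings and is not stated in this document.

```python
def cells_in_memory(n, matrices=2):
    """Number of string entries held when `matrices` n-by-n matrices are loaded."""
    return matrices * n * n

def estimated_bytes(n, bytes_per_entry, matrices=2):
    """Rough footprint: entry count times an assumed per-entry cost."""
    return cells_in_memory(n, matrices) * bytes_per_entry

small = cells_in_memory(1_000)   # two 1,000 x 1,000 matrices
large = cells_in_memory(10_000)  # two 10,000 x 10,000 matrices
```

A tenfold increase in dimension gives a hundredfold increase in entries, which is a reasonable guide when choosing an -Xmx value.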

Data Handling

The tool carries entries through as strings, so numeric and text values are written exactly as they appear in the source matrices, with no conversion or rounding. Shuffling changes only the row order; each output row still holds the two values read from the same position in both matrices.

File Format Compatibility

Input matrices should be whitespace-delimited text files where each row represents a matrix row and columns are separated by spaces or tabs. The tool expects both matrices to have identical dimensions and structure.
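Before running the tool, it can be useful to confirm that two files meet these expectations. The sketch below is a hypothetical Python pre-check, not something the tool provides; parse_matrix and check_same_shape are illustrative names.

```python
def parse_matrix(text):
    """Split text into rows and each row on runs of spaces or tabs."""
    return [line.split() for line in text.splitlines() if line.strip()]

def check_same_shape(matrix1, matrix2):
    """Raise ValueError if the row count or any row's column count differs."""
    if len(matrix1) != len(matrix2):
        raise ValueError(f"row count differs: {len(matrix1)} vs {len(matrix2)}")
    for i, (row1, row2) in enumerate(zip(matrix1, matrix2)):
        if len(row1) != len(row2):
            raise ValueError(f"row {i}: {len(row1)} vs {len(row2)} columns")

# Hypothetical inputs mixing tabs and spaces as delimiters.
a = parse_matrix("1.0\t0.9\n0.9  1.0\n")
b = parse_matrix("1.0 0.8\n0.8 1.0\n")
check_same_shape(a, b)  # passes silently when shapes match
```

Checking shapes up front gives a clearer error than a failure partway through entry extraction.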

Technical Notes

Matrix Requirements

Both input matrices must have identical dimensions
Matrices should be whitespace-delimited text format
The algorithm processes the lower triangular portion: for 0-based row i, columns 1 through i (column 0 is skipped)
Missing or malformed entries will cause processing errors

Performance Characteristics

Memory usage: O(n²) where n is matrix dimension
Processing time: Linear in the number of matrix entries
I/O: Single pass through input files, single write to output
Randomization: Uses Collections.shuffle() with default Random() constructor (time-seeded), so output row order is not reproducible across runs

Output Format

The output consists of tab-delimited rows where:

Column 1: Entry from first matrix (in1)
Column 2: Corresponding entry from second matrix (in2)
Row order: Randomized to eliminate position bias
Entry count: n(n-1)/2 for matrices with n rows, following the loop bounds under Entry Extraction (0-based row i contributes i pairs)

Support

For questions and support:

Email: bbushnell@lbl.gov
Documentation: bbmap.org