MatrixToColumns
Transforms two matched identity matrices into 2-column format, one row per entry, one column per matrix.
Basic Usage
matrixtocolumns.sh in1=<matrix1> in2=<matrix2> out=<file>
This tool takes two input identity matrices and converts them into a two-column tab-delimited format where each row represents a paired entry from the same position in both matrices.
Parameters
Parameters are grouped by function: input/output, processing, and Java runtime options.
Input/Output Parameters
- in1=<matrix1>
- First input matrix file. Required parameter specifying the path to the first identity matrix.
- in2=<matrix2>
- Second input matrix file. Required parameter specifying the path to the second identity matrix.
- out=<file>
- Output file for the two-column format results. Required parameter specifying where to write the transformed matrix data.
- overwrite=true
- Allow overwriting of existing output files. Default: true
Processing Parameters
- samplerate=1.0
- Fraction of matrix entries to include in output (0.0-1.0). Default: 1.0 (include all entries). Note: Current implementation processes all entries regardless of this parameter value.
- sampleseed=-1
- Random seed parameter. Default: -1. Note: Current implementation uses Collections.shuffle() without explicit seeding.
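For reference, the difference between unseeded and seeded shuffling in Java can be sketched as follows. This is a minimal illustration using only java.util, not the tool's code: the class SeededShuffleSketch and its method shuffled are hypothetical, and treating a negative seed as "no explicit seed" is an assumption suggested by the -1 default.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Hypothetical sketch; not part of MatrixToColumns.
final class SeededShuffleSketch {

    // Returns a shuffled copy of the input list, leaving the input untouched.
    // Assumption: a negative seed means "no explicit seed".
    static <T> List<T> shuffled(List<T> items, long seed) {
        List<T> copy = new ArrayList<>(items);
        if (seed < 0) {
            Collections.shuffle(copy);                   // order may differ between runs
        } else {
            Collections.shuffle(copy, new Random(seed)); // order is repeatable for a given seed
        }
        return copy;
    }

    public static void main(String[] args) {
        List<String> pairs = Arrays.asList("0.91\t0.88", "0.75\t0.79", "0.60\t0.58", "0.99\t0.97");
        System.out.println(shuffled(pairs, 12345L));
        System.out.println(shuffled(pairs, 12345L)); // same order as the line above
        System.out.println(shuffled(pairs, -1L));    // unseeded
    }
}
```

Passing an explicit Random to Collections.shuffle() is what makes the order reproducible; the one-argument form uses a default source of randomness.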
Java Parameters
- -Xmx
- This will set Java's memory usage, overriding autodetection. -Xmx20g will specify 20 gigs of RAM, and -Xmx200m will specify 200 megs. The max is typically 85% of physical memory.
- -eoom
- This flag will cause the process to exit if an out-of-memory exception occurs. Requires Java 8u92+.
- -da
- Disable assertions.
Examples
Basic Matrix Transformation
matrixtocolumns.sh in1=identity_matrix1.txt in2=identity_matrix2.txt out=correlation_data.txt
Transforms two identity matrices into a two-column format suitable for correlation analysis or plotting.
Sampling Matrix Data
matrixtocolumns.sh in1=large_matrix1.txt in2=large_matrix2.txt out=sampled_data.txt samplerate=0.1 sampleseed=12345
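Requests a 10% sample of the matrix entries with seed 12345. As noted under Processing Parameters, the current implementation processes all entries regardless of samplerate and shuffles without explicit seeding, so the output may contain every entry in a non-reproducible order.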
Processing with Custom Memory
matrixtocolumns.sh -Xmx8g in1=matrix1.txt in2=matrix2.txt out=results.txt
Allocates 8GB of memory for processing large identity matrices.
Algorithm Details
The matrixtocolumns tool implements a straightforward matrix transformation algorithm designed for processing identity matrices:
Processing Strategy
The algorithm uses a three-phase approach:
- Matrix Loading: Reads both input matrices completely into memory using the TextFile.doublesplitWhitespace() method. Matrices are stored as String[][] arrays, so entries are kept as text and can hold numeric or non-numeric values.
- Entry Extraction: Iterates through the lower triangular portion using nested loops: outer loop (i=0; i<matrix1.length; i++) and inner loop (j=1; j<=i; j++). Each iteration extracts matrix1[i][j] and matrix2[i][j] pairs.
- Output Generation: Pairs are collected in an ArrayList<String[]>, randomized using Collections.shuffle(), then written via TextStreamWriter with tab delimiter (\t) between paired values.
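The extraction and output phases described above can be sketched in plain Java. This is an illustration under assumptions, not the tool's source: the class LowerTrianglePairs and its methods are hypothetical, the matrices are passed in as already-parsed String[][] arrays, and a List<String> stands in for TextStreamWriter.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical sketch of the loop bounds described above; not the BBTools source.
final class LowerTrianglePairs {

    // Collects {matrix1[i][j], matrix2[i][j]} for i = 0..rows-1 and j = 1..i.
    static List<String[]> extract(String[][] matrix1, String[][] matrix2) {
        List<String[]> pairs = new ArrayList<>();
        for (int i = 0; i < matrix1.length; i++) {
            for (int j = 1; j <= i; j++) {
                pairs.add(new String[] {matrix1[i][j], matrix2[i][j]});
            }
        }
        return pairs;
    }

    // Shuffles the pairs in place, then joins each pair with a tab.
    static List<String> toLines(List<String[]> pairs) {
        Collections.shuffle(pairs);
        List<String> lines = new ArrayList<>();
        for (String[] pair : pairs) {
            lines.add(pair[0] + "\t" + pair[1]);
        }
        return lines;
    }

    public static void main(String[] args) {
        String[][] m1 = {{"1.0", "0.9", "0.8"}, {"0.9", "1.0", "0.7"}, {"0.8", "0.7", "1.0"}};
        String[][] m2 = {{"1.0", "0.8", "0.7"}, {"0.8", "1.0", "0.6"}, {"0.7", "0.6", "1.0"}};
        for (String line : toLines(extract(m1, m2))) {
            System.out.println(line); // three lines, in shuffled order
        }
    }
}
```

With these bounds, a matrix with n rows yields n(n-1)/2 pairs: row 0 contributes none and column 0 is never read.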
Memory Efficiency
The implementation loads entire matrices into memory, making it suitable for moderately-sized identity matrices but potentially memory-intensive for very large datasets. Memory usage scales with matrix size squared (O(n²) for n×n matrices).
Data Handling
The tool preserves the original data format from the matrices, handling both numeric and text entries without conversion. Output maintains the precision and format of the source data. The shuffling process ensures random distribution while preserving exact pairing between matrices.
File Format Compatibility
Input matrices should be whitespace-delimited text files where each row represents a matrix row and columns are separated by spaces or tabs. The tool expects both matrices to have identical dimensions and structure.
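A whitespace-delimited matrix of this kind can be parsed with the standard library alone. The sketch below is a hedged illustration, not the tool's TextFile.doublesplitWhitespace() implementation: the class MatrixParseSketch, its methods, and the choice to skip blank lines are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch; not the BBTools parser.
final class MatrixParseSketch {

    // Splits each non-blank line on runs of spaces or tabs.
    // Assumption: blank lines are skipped rather than treated as empty rows.
    static String[][] parse(List<String> lines) {
        List<String[]> rows = new ArrayList<>();
        for (String line : lines) {
            String trimmed = line.trim();
            if (trimmed.isEmpty()) {
                continue;
            }
            rows.add(trimmed.split("\\s+"));
        }
        return rows.toArray(new String[0][]);
    }

    // True when both matrices have the same row count and the same
    // number of columns in every row.
    static boolean sameShape(String[][] a, String[][] b) {
        if (a.length != b.length) {
            return false;
        }
        for (int i = 0; i < a.length; i++) {
            if (a[i].length != b[i].length) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        String[][] m1 = parse(Arrays.asList("1.0\t0.9", "0.9\t1.0"));
        String[][] m2 = parse(Arrays.asList("1.0 0.8", "0.8 1.0"));
        System.out.println(sameShape(m1, m2)); // prints true
    }
}
```

In practice the lines could come from java.nio.file.Files.readAllLines(Path); checking the shapes before pairing entries gives a clearer failure than an index error partway through extraction.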
Technical Notes
Matrix Requirements
- Both input matrices must have identical dimensions
- Matrices should be whitespace-delimited text format
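- The algorithm processes the lower triangular portion: for row i it reads column indices 1 through i, so row 0 contributes no entries and column 0 is never read. Whether the diagonal is covered therefore depends on whether column 0 holds a row label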
- Missing or malformed entries will cause processing errors
Performance Characteristics
- Memory usage: O(n²) where n is matrix dimension
- Processing time: Linear with number of matrix entries processed
- I/O: Single pass through input files, single write to output
- Randomization: Uses Collections.shuffle() with default Random() constructor (time-seeded) for unbiased ordering
Output Format
The output consists of tab-delimited rows where:
- Column 1: Entry from first matrix (in1)
- Column 2: Corresponding entry from second matrix (in2)
- Row order: Randomized to eliminate position bias
- Entry count: n(n-1)/2 pairs for matrices with n rows, following from the loop bounds (i from 0 to n-1, j from 1 to i)
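Because the output is a plain two-column table, a downstream analysis such as the correlation mentioned in the examples can read it directly. The following sketch assumes every row holds two values parseable by Double.parseDouble; PearsonSketch and its method are hypothetical names and are not part of BBTools.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical downstream consumer of the two-column output; not part of BBTools.
final class PearsonSketch {

    // Pearson correlation coefficient over tab-delimited "x<TAB>y" lines.
    // Returns NaN when either column has zero variance.
    static double pearson(List<String> lines) {
        int n = 0;
        double sumX = 0, sumY = 0, sumXX = 0, sumYY = 0, sumXY = 0;
        for (String line : lines) {
            String[] fields = line.split("\t");
            double x = Double.parseDouble(fields[0]);
            double y = Double.parseDouble(fields[1]);
            n++;
            sumX += x;
            sumY += y;
            sumXX += x * x;
            sumYY += y * y;
            sumXY += x * y;
        }
        double cov = sumXY - sumX * sumY / n;
        double varX = sumXX - sumX * sumX / n;
        double varY = sumYY - sumY * sumY / n;
        return cov / Math.sqrt(varX * varY);
    }

    public static void main(String[] args) {
        List<String> rows = Arrays.asList("98.5\t97.9", "91.2\t90.4", "85.0\t86.1");
        System.out.println(pearson(rows));
    }
}
```

The result does not depend on row order, so the shuffling applied to the output has no effect on this kind of summary.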
Support
For questions and support:
- Email: bbushnell@lbl.gov
- Documentation: bbmap.org