ReduceColumns

Script: reducecolumns.sh
Package: ml
Class: ReduceColumns.java

Utility for extracting specific columns from tab-delimited files while maintaining dimension metadata. Designed for reducing dimensionality in machine learning datasets.

Basic Usage

reducecolumns.sh <input_file> <output_file> column1 column2 column3 ...

This tool extracts specified columns from a tab-delimited input file and writes them to an output file. Column numbers are zero-indexed.
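As a quick illustration of the invocation above, the sketch below builds a small tab-delimited file and extracts columns 0 and 2 using awk as a stand-in (reducecolumns.sh itself is not invoked here; awk fields are 1-indexed, so zero-indexed column 0 maps to $1):

```shell
# Build a 3-column tab-delimited sample file.
printf 'id\tname\tscore\n1\talice\t90\n2\tbob\t85\n' > sample.tsv

# Hypothetical invocation of the tool:
#   reducecolumns.sh sample.tsv out.tsv 0 2

# Stand-in with awk (columns 0 and 2 become fields $1 and $3):
awk -F'\t' -v OFS='\t' '{print $1, $3}' sample.tsv > out.tsv

cat out.tsv
# id    score
# 1     90
# 2     85
```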

Parameters

This tool uses positional arguments rather than named parameters:

Positional Arguments

<input_file>
Path to the input tab-delimited file to process
<output_file>
Path to the output file where selected columns will be written
column1 column2 ...
Zero-indexed column numbers to extract from the input file. Multiple columns may be specified, separated by spaces

Memory Parameters

-Xmx
Maximum heap memory (default: 2g), calculated automatically from available system memory when not specified explicitly
-Xms
Initial heap memory (default: 2g). Set to match -Xmx for consistent performance
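If the wrapper follows the common BBTools convention of forwarding JVM flags passed on the command line (an assumption here, not confirmed by this page), the heap size can be overridden directly:

```shell
# Hypothetical: request a 4 GB heap explicitly, overriding the
# auto-calculated default. Verify against your installation before relying
# on this; the flag-forwarding behavior is assumed, not documented above.
reducecolumns.sh -Xmx4g input.txt output.txt 0 1 2
```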

Examples

Extract First Three Columns

reducecolumns.sh input.txt output.txt 0 1 2

Extracts columns 0, 1, and 2 from input.txt and writes them to output.txt

Extract Non-Sequential Columns

reducecolumns.sh data.tsv subset.tsv 0 3 7 12

Extracts columns 0, 3, 7, and 12 from data.tsv, maintaining their order in the output

Single Column Extraction

reducecolumns.sh matrix.txt column5.txt 5

Extracts only column 5 from matrix.txt and saves it to column5.txt

Algorithm Details

ReduceColumns implements column extraction using ByteFile and ByteStreamWriter classes for file I/O with LineParser1 for tab-delimited parsing:
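The class names above belong to the BBTools I/O layer; the overall strategy (stream one line at a time, split on tabs, emit the selected fields) can be sketched with awk. This is an illustrative emulation under those assumptions, not the actual Java implementation:

```shell
# Emulate streaming column extraction; `cols` holds the zero-indexed list.
printf 'a\tb\tc\td\ne\tf\tg\th\n' > input.tsv

awk -F'\t' -v OFS='\t' -v cols='0 3 1' '
  BEGIN { n = split(cols, c, " ") }
  {
    # Build the output line from the requested fields (awk is 1-indexed,
    # hence the +1 offset), then write it immediately.
    line = $(c[1] + 1)
    for (i = 2; i <= n; i++) line = line OFS $(c[i] + 1)
    print line
  }
' input.tsv > output.tsv

cat output.tsv
# a    d    b
# e    h    f
```

Because each line is written as soon as it is parsed, memory usage stays flat no matter how large the input is, which mirrors the streaming behavior described for this tool.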

Processing Strategy

Lines are processed in a streaming fashion: each input line is read, split on tab characters, the requested columns are selected, and the reduced line is written immediately. No more than one line is held in memory at a time.

File Format Handling

Input and output are both tab-delimited text. Selected columns are joined with tab characters in the output.

Performance Characteristics

Because processing is streamed, memory usage stays constant regardless of input size, so the tool can handle files larger than available RAM.

Use Cases

Typical uses include reducing the dimensionality of machine learning datasets, extracting feature subsets, and producing smaller files for downstream tools.

Input/Output Format

Input Requirements

A tab-delimited text file. The column numbers given on the command line must be valid zero-indexed positions for every line.

Output Format

A tab-delimited file containing only the selected columns.

Technical Notes

Column Indexing

Columns are zero-indexed, meaning the first column is 0, second column is 1, etc. This follows standard programming conventions.
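Many standard Unix tools number fields from 1, so keep the offset in mind when cross-checking results. For example, cut selects the same data as a zero-indexed request shifted by one (the reducecolumns.sh line below is a hypothetical invocation shown only for comparison):

```shell
printf 'x\ty\tz\n' > demo.tsv

# Zero-indexed request for columns 0 and 2:
#   reducecolumns.sh demo.tsv picked.tsv 0 2   (hypothetical)

# Equivalent 1-indexed cut fields are 1 and 3:
cut -f1,3 demo.tsv
# x    z
```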

Error Handling

Memory Management

Default memory allocation is 2GB, automatically adjusted based on system availability. For very large files, memory usage remains constant due to streaming processing.

Support

For questions and support: