ReduceColumns
Utility for extracting specific columns from tab-delimited files while maintaining dimension metadata. Designed for reducing dimensionality in machine learning datasets.
Basic Usage
reducecolumns.sh <input_file> <output_file> column1 column2 column3 ...
This tool extracts specified columns from a tab-delimited input file and writes them to an output file. Column numbers are zero-indexed.
Parameters
This tool uses positional arguments rather than named parameters:
Positional Arguments
- <input_file>
- Path to the input tab-delimited file to process
- <output_file>
- Path to the output file where selected columns will be written
- column1 column2 ...
- Zero-indexed column numbers to extract from the input file. Multiple columns may be specified, separated by spaces
Memory Parameters
- -Xmx
- Maximum heap memory (default: 2g). Automatically calculated based on available system memory
- -Xms
- Initial heap memory (default: 2g). Set to match -Xmx for consistent performance
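As with other BBTools wrapper scripts, the heap size can usually be overridden by passing a -Xmx flag on the command line; the 4g value below is only an example:

reducecolumns.sh -Xmx4g input.txt output.txt 0 1 2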
Examples
Extract First Three Columns
reducecolumns.sh input.txt output.txt 0 1 2
Extracts columns 0, 1, and 2 from input.txt and writes them to output.txt
Extract Non-Sequential Columns
reducecolumns.sh data.tsv subset.tsv 0 3 7 12
Extracts columns 0, 3, 7, and 12 from data.tsv, maintaining their order in the output
Single Column Extraction
reducecolumns.sh matrix.txt column5.txt 5
Extracts only column 5 from matrix.txt and saves it to column5.txt
Algorithm Details
ReduceColumns implements column extraction using the ByteFile and ByteStreamWriter classes for file I/O, with LineParser1 for tab-delimited parsing:
Processing Strategy
- Line-by-line parsing: Uses LineParser1 class with tab delimiter ('\t') for parsing each input line
- Column indexing: Maintains IntList data structure storing target column indices for array-based access
- Dimension preservation: Writes "#dims [columns.size()-1] 1" header to output file before processing data lines
- Memory management: Processes files via ListNum<byte[]> chunks using ByteFile.nextList() iteration
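The following standalone Java sketch mirrors the processing strategy described above. It is illustrative only: it uses standard-library classes (BufferedReader, String.split, PrintWriter) in place of the BBTools ByteFile, LineParser1, and ByteStreamWriter classes, and the class name ReduceColumnsSketch is hypothetical, not part of the actual implementation.

import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.IOException;
import java.io.PrintWriter;

public class ReduceColumnsSketch {
    public static void main(String[] args) throws IOException {
        String in = args[0], out = args[1];
        // Zero-indexed target columns, kept in command-line order
        int[] cols = new int[args.length - 2];
        for (int i = 2; i < args.length; i++) {
            cols[i - 2] = Integer.parseInt(args[i]);
        }
        try (BufferedReader reader = new BufferedReader(new FileReader(in));
             PrintWriter writer = new PrintWriter(new BufferedWriter(new FileWriter(out)))) {
            // Write the dimension header first, following the "#dims <columns-1> 1"
            // convention described above (an assumption based on this document)
            writer.println("#dims " + (cols.length - 1) + " 1");
            String line;
            while ((line = reader.readLine()) != null) {
                if (line.startsWith("#")) continue;      // skip comment/header lines
                String[] fields = line.split("\t", -1);  // tab-delimited parsing
                StringBuilder sb = new StringBuilder();
                for (int i = 0; i < cols.length; i++) {
                    if (i > 0) sb.append('\t');
                    sb.append(fields[cols[i]]);
                }
                writer.println(sb);
            }
        }
    }
}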
File Format Handling
- Tab-delimited input: LineParser1 splits input lines on tab characters for column extraction
- Header preservation: Skips lines starting with '#' character during data processing
- Dimension metadata: Outputs dimension count as "#dims [num_selected_columns-1] 1" format
- Order preservation: Iterates through IntList maintaining command-line argument order
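For example, the data.tsv command shown earlier, which selects four columns (0, 3, 7, and 12), would write the header line "#dims 3 1" at the top of subset.tsv before any data rows, following this convention.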
Performance Characteristics
- Linear complexity: O(n × m), where n is the number of rows processed and m is the number of selected columns
- Chunk-based processing: Uses ListNum<byte[]> for processing lines in batches rather than individual line reads
- I/O implementation: Uses ByteStreamWriter with print() and tab() methods for output generation
- File handling: ByteFile class handles input file reading with automatic resource management via poisonAndWait()
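The chunked read pattern can be approximated in plain Java as shown below. The batch size and class names are illustrative assumptions, not the actual ListNum/ByteFile implementation.

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

public class ChunkedReadSketch {
    static final int CHUNK_SIZE = 200;  // illustrative batch size

    public static void main(String[] args) throws IOException {
        try (BufferedReader reader = new BufferedReader(new FileReader(args[0]))) {
            ArrayList<String> chunk = new ArrayList<>(CHUNK_SIZE);
            String line;
            while ((line = reader.readLine()) != null) {
                chunk.add(line);
                if (chunk.size() >= CHUNK_SIZE) {
                    process(chunk);   // handle one batch of lines at a time
                    chunk.clear();
                }
            }
            if (!chunk.isEmpty()) process(chunk);  // final partial batch
        }
    }

    // Placeholder for per-batch work (column extraction and output in the real tool)
    static void process(List<String> lines) {
        System.out.println("Processed a batch of " + lines.size() + " lines");
    }
}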
Use Cases
- Feature selection: Reducing dimensionality in machine learning datasets
- Data preprocessing: Extracting relevant columns for downstream analysis
- Format conversion: Reformatting data files for specific tools or pipelines
- Dataset subsetting: Creating focused datasets from larger data matrices
Input/Output Format
Input Requirements
- Tab-delimited text file
- Consistent number of columns per row
- Optional dimension metadata in header lines starting with '#'
Output Format
- Tab-delimited text with selected columns
- Updated dimension metadata: "#dims [num_selected_columns-1] 1"
- Input comment lines (starting with '#') are not copied to the output; the newly written "#dims" header takes their place
- Same row order as input file
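A small worked example may help (file contents are hypothetical; columns are shown space-separated here for readability, but the actual files are tab-delimited). Given an input file in.txt with five columns:

#dims 4 1
1.0 2.0 3.0 4.0 5.0
6.0 7.0 8.0 9.0 10.0

the command

reducecolumns.sh in.txt out.txt 0 2 4

would produce an out.txt containing:

#dims 2 1
1.0 3.0 5.0
6.0 8.0 10.0

The new #dims line reflects the three selected columns (3 - 1 = 2 under the convention above).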
Technical Notes
Column Indexing
Columns are zero-indexed, meaning the first column is 0, second column is 1, etc. This follows standard programming conventions.
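For example, to extract the first and third columns of a file:

reducecolumns.sh in.txt out.txt 0 2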
Error Handling
- Invalid column numbers will cause the program to fail; in particular, column numbers greater than or equal to the number of columns in a row will result in errors
- Ensure input file is properly tab-delimited
Memory Management
Default memory allocation is 2GB, automatically adjusted based on system availability. For very large files, memory usage remains constant due to streaming processing.
Support
For questions and support:
- Email: bbushnell@lbl.gov
- Documentation: bbmap.org