FilterLines

Script: filterlines.sh Package: driver Class: FilterLines.java

Filters lines by exact match or substring. This tool processes text files line by line, comparing each line against a list of filter names to decide whether the line is kept or dropped. Matching can be exact, prefix-based, or substring-based, and can optionally ignore case.

Basic Usage

filterlines.sh in=<file> out=<file> names=<file> include=<t/f>

The tool reads lines from the input file and filters them based on names provided in the names parameter. By default, matching lines are excluded (filtered out), but this can be reversed with the include parameter.

Parameters

Parameters control how lines are matched against the filter names and how the filtering process operates.

Parameters

include=f
Set to 'true' to include lines that match the names rather than excluding them. When false (default), matching lines are excluded from the output. When true, only matching lines are written to the output.
prefix=f
Allow matching of only the line's prefix (all characters up to first whitespace). When enabled, only the portion of each line before the first whitespace character is used for matching comparisons.
substring=f
Allow one name to be a substring of the other, rather than a full match. Options:
  • f: No substring matching - requires exact matches
  • t: Bidirectional substring matching - allows both directions
  • line: Allow input lines to be substrings of names in list
  • name: Allow names in list to be substrings of input lines
case=t
(casesensitive) Match case. When true (default), matching is case-sensitive. When false, all comparisons are performed in lowercase, which makes matching case-insensitive.
ow=t
(overwrite) Overwrites files that already exist. When true (default), existing output files will be overwritten without warning.
app=f
(append) Append to files that already exist. When true, output is appended to existing files rather than overwriting them.
zl=4
(ziplevel) Set compression level, 1 (low) to 9 (max). Controls the compression level when writing compressed output files.
names=
A comma-delimited list of literal strings, file paths, or both. Each file must contain one name per line. These names are the filter criteria that input lines are matched against.
lines=
(maxlines) Maximum number of lines to process from the input file. Processing stops when this limit is reached, useful for testing or processing only a portion of large files.
replace=
Replace text in lines before matching. Format: old,new - replaces all instances of 'old' with 'new' in each line before performing the matching operation.
verbose=f
Enable verbose output for debugging and detailed processing information.

Java Parameters

-Xmx
This will set Java's memory usage, overriding autodetection. -Xmx20g will specify 20 gigs of RAM, and -Xmx200m will specify 200 megs. The max is typically 85% of physical memory.
-eoom
This flag will cause the process to exit if an out-of-memory exception occurs. Requires Java 8u92+.
-da
Disable assertions.

Examples

Basic Line Filtering

filterlines.sh in=data.txt out=filtered.txt names=unwanted.txt

Removes all lines from data.txt that match any name listed in unwanted.txt (one name per line).

Include Matching Lines

filterlines.sh in=data.txt out=wanted.txt names=keep.txt include=t

Keeps only lines that match names in keep.txt, discarding all others.

Substring Matching

filterlines.sh in=sequences.txt out=filtered.txt names=patterns.txt substring=name

Removes lines that contain any pattern from patterns.txt as a substring. Matching lines are excluded because include defaults to f.

Case-Insensitive Prefix Matching

filterlines.sh in=headers.txt out=clean.txt names=sample1,sample2,sample3 prefix=t case=f

Removes lines that match sample1, sample2, or sample3 by prefix (the text up to the first whitespace), ignoring case.

Text Replacement During Filtering

filterlines.sh in=data.txt out=clean.txt names=badwords.txt replace=old_term,new_term

Replaces every occurrence of "old_term" with "new_term" in each line, then filters the resulting line against the names in badwords.txt.

Algorithm Details

Core Filtering Implementation

FilterLines implements streaming text filtering using hash-based name lookup with configurable matching criteria. Input is processed line by line, so the whole file is never loaded into memory.

Name Storage and Lookup
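
As a rough illustration of how the names parameter could be turned into a lookup set, the sketch below splits the argument on commas, reads any token that is an existing file (one name per line), and treats everything else as a literal name. The class NameLoaderSketch and the method loadNames are hypothetical names chosen for this example. The file-detection rule and the lowercasing of stored names when case=f are assumptions, not taken from the FilterLines source. The LinkedHashSet matches the set type mentioned in the pipeline description.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.LinkedHashSet;
import java.util.Set;

/** Hypothetical sketch of name loading; not the FilterLines source. */
final class NameLoaderSketch {

    /** Builds the name set from a comma-delimited mix of literals and file paths. */
    static Set<String> loadNames(String namesArg, boolean ignoreCase) {
        Set<String> names = new LinkedHashSet<>();
        for (String token : namesArg.split(",")) {
            if (token.isEmpty()) {
                continue;
            }
            Path path = Paths.get(token);
            if (Files.isRegularFile(path)) {
                // Assumption: a token naming an existing file is read as one name per line.
                try {
                    for (String line : Files.readAllLines(path)) {
                        add(names, line, ignoreCase);
                    }
                } catch (IOException e) {
                    throw new UncheckedIOException(e);
                }
            } else {
                // Otherwise the token itself is used as a literal name.
                add(names, token, ignoreCase);
            }
        }
        return names;
    }

    /** Trims the name, skips blanks, and lowercases when matching ignores case. */
    private static void add(Set<String> names, String raw, boolean ignoreCase) {
        String name = raw.trim();
        if (name.isEmpty()) {
            return;
        }
        names.add(ignoreCase ? name.toLowerCase() : name);
    }

    public static void main(String[] args) {
        // Pass the same value you would give to names=, e.g. "unwanted.txt,sample1".
        System.out.println(loadNames(String.join(",", args), false));
    }
}
```

Storing names in a hash set keeps the exact-match check at constant expected time per line, which is why only the substring modes need to iterate over every name.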

Line Processing Pipeline

The filtering process follows this execution sequence for each input line (process() method, lines 157-196):

  1. Line Reading: TextFile.readLine() with trim enabled for whitespace normalization
  2. Preprocessing: Case conversion using toLowerCase() if ignoreCase is enabled, followed by String.replace() for text substitution
  3. Prefix Extraction: Character-by-character scan using Character.isWhitespace() to identify prefix boundaries (lines 164-173)
  4. Hash Lookup: LinkedHashSet.contains() for O(1) exact matching of line or prefix
  5. Substring Scanning: Enhanced for-each loop through name set with String.contains() calls for substring matching (lines 179-184)
  6. Boolean Logic: XOR operation (match != exclude) determines final line retention (line 185)
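
The exact-match path of the steps above can be sketched as follows. This is a minimal illustration, not the FilterLines source: the class LineFilterSketch, the methods prefixOf and keep, and their parameter names are hypothetical. It assumes that names were already lowercased when case-insensitive matching is requested, and it omits the replace step and substring scanning (step 5).

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.Set;

/** Hypothetical sketch of the per-line decision; not the FilterLines source. */
final class LineFilterSketch {

    /** Returns the characters before the first whitespace, or the whole line if there is none. */
    static String prefixOf(String line) {
        for (int i = 0; i < line.length(); i++) {
            if (Character.isWhitespace(line.charAt(i))) {
                return line.substring(0, i);
            }
        }
        return line;
    }

    /**
     * Decides whether a line is retained.
     * Assumes the caller stored lowercase names when ignoreCase is true.
     */
    static boolean keep(String rawLine, Set<String> names, boolean exclude,
                        boolean usePrefix, boolean ignoreCase) {
        String line = rawLine.trim();                       // step 1: trimmed line
        if (ignoreCase) {
            line = line.toLowerCase();                      // step 2: case conversion
        }
        boolean match = names.contains(line);               // step 4: exact match on the line
        if (!match && usePrefix) {
            match = names.contains(prefixOf(line));         // steps 3-4: exact match on the prefix
        }
        return match != exclude;                            // step 6: XOR of match and exclude
    }

    public static void main(String[] args) {
        Set<String> names = new LinkedHashSet<>(Arrays.asList("sample1", "sample2"));
        String[] lines = {"Sample1 description text", "sample9 other text"};
        for (String line : lines) {
            // exclude=true corresponds to include=f, the documented default.
            System.out.println(keep(line, names, true, true, true) + "\t" + line);
        }
    }
}
```

The final comparison is what lets a single code path serve both modes: with exclude set, a match drops the line, and with exclude cleared (include=t), a match is the only way a line is kept.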

Substring Matching Implementation
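
A minimal sketch of how the four documented substring options (f, t, line, name) could be evaluated against the name set, assuming an exact hash lookup is tried first and a loop over the names with String.contains() follows. The class SubstringModeSketch, the Mode enum, and the methods parse and matches are hypothetical names for this illustration, not the FilterLines source.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.Set;

/** Hypothetical sketch of the substring options; not the FilterLines source. */
final class SubstringModeSketch {

    /** One constant per documented value of the substring parameter. */
    enum Mode { NONE, BOTH, LINE_IN_NAME, NAME_IN_LINE }

    /** Maps the documented option spellings to a mode; other spellings are rejected here. */
    static Mode parse(String value) {
        switch (value) {
            case "f": return Mode.NONE;
            case "t": return Mode.BOTH;
            case "line": return Mode.LINE_IN_NAME;
            case "name": return Mode.NAME_IN_LINE;
            default: throw new IllegalArgumentException("Unknown substring mode: " + value);
        }
    }

    /** Returns true if the line matches any name under the given mode. */
    static boolean matches(String line, Set<String> names, Mode mode) {
        if (names.contains(line)) {
            return true;                      // exact match needs no scan
        }
        if (mode == Mode.NONE) {
            return false;
        }
        for (String name : names) {           // linear scan over all names
            if ((mode == Mode.NAME_IN_LINE || mode == Mode.BOTH) && line.contains(name)) {
                return true;                  // name is a substring of the line
            }
            if ((mode == Mode.LINE_IN_NAME || mode == Mode.BOTH) && name.contains(line)) {
                return true;                  // line is a substring of the name
            }
        }
        return false;
    }

    public static void main(String[] args) {
        Set<String> names = new LinkedHashSet<>(Arrays.asList("chr1", "scaffold_22"));
        for (String value : new String[] {"f", "t", "line", "name"}) {
            System.out.println(value + "\t" + matches("chr1 length=100", names, parse(value)));
        }
    }
}
```

Unlike the exact lookup, the scan visits every name for each non-matching line, so in a design like this the cost of substring modes grows with the size of the name list.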

Performance Characteristics

I/O Implementation Details

Statistics and Monitoring

Support

For questions and support: