FilterLines

Basic Usage

filterlines.sh in=<file> out=<file> names=<file> include=<t/f>

The tool reads lines from the input file and filters them based on names provided in the names parameter. By default, matching lines are excluded (filtered out), but this can be reversed with the include parameter.

Parameters

Parameters control how lines are matched against the filter names and how the filtering process operates.

include=f

Set to true to keep the lines that match the names rather than removing them. When false (default), matching lines are excluded from the output. When true, only matching lines are written to the output.

prefix=f

Allow matching of only the line's prefix (all characters up to first whitespace). When enabled, only the portion of each line before the first whitespace character is used for matching comparisons.
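
To make the prefix rule concrete, here is a minimal Python sketch of "all characters up to first whitespace". It is an illustration only, not the tool's Java code; the function name extract_prefix is hypothetical, and Python's str.isspace() may classify a few Unicode characters differently from Java's Character.isWhitespace().

```python
def extract_prefix(line: str) -> str:
    """Return the characters before the first whitespace character.

    If the line contains no whitespace, the whole line is its own prefix.
    """
    for i, ch in enumerate(line):
        if ch.isspace():
            return line[:i]
    return line


print(extract_prefix("read_17 length=150"))  # read_17
```

With prefix=t, a names entry of read_17 would therefore match the line "read_17 length=150" even though the full line differs.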

substring=f

Allow one name to be a substring of the other, rather than a full match. Options:

f: No substring matching - requires exact matches
t: Bidirectional substring matching - allows both directions
line: Allow input lines to be substrings of names in list
name: Allow names in list to be substrings of input lines
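
The four option values above can be sketched in Python as follows. This is a hedged illustration of the documented behavior, not the tool's implementation; the function name substring_match and its signature are assumptions made for the example.

```python
def substring_match(line: str, names: list[str], mode: str) -> bool:
    """Return True if `line` matches any name under the given substring mode.

    `mode` is one of "f", "t", "line", "name", mirroring the option values.
    """
    if line in names:                      # an exact match counts in every mode
        return True
    name_in_line = mode in ("t", "name")   # a name may be a substring of the line
    line_in_name = mode in ("t", "line")   # the line may be a substring of a name
    for name in names:
        if name_in_line and name in line:
            return True
        if line_in_name and line in name:
            return True
    return False


names = ["chr1", "scaffold_22_full"]
print(substring_match("chr1 extra", names, "name"))   # True
print(substring_match("chr1 extra", names, "line"))   # False
print(substring_match("scaffold_22", names, "line"))  # True
```

The direction matters: substring=name suits short patterns searched for inside long lines, while substring=line suits truncated lines checked against longer names.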

case=t

(casesensitive) Match case. When true (default), matching is case-sensitive. When false, all comparisons are performed in lowercase, making matching case-insensitive.

ow=t

(overwrite) Overwrites files that already exist. When true (default), existing output files will be overwritten without warning.

app=f

(append) Append to files that already exist. When true, output is appended to existing files rather than overwriting them.

zl=4

(ziplevel) Set compression level, 1 (low) to 9 (max). Controls the compression level when writing compressed output files.

names=

A comma-delimited list of literal strings, file paths, or both. Each file must contain one name per line, and multiple files can be specified. Together these names form the filter criteria.
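
As a rough illustration of how a names= value could be expanded, the Python sketch below splits on commas and reads any token that names an existing file. The helper load_names is hypothetical, and the rule "a token is a file if it exists on disk, otherwise a literal" is an assumption of this sketch rather than a statement about the tool's internals.

```python
import os


def load_names(spec: str) -> list[str]:
    """Expand a comma-delimited names= value into a list of names.

    Assumption: a token naming an existing file is read one name per line;
    any other token is used as a literal name.
    """
    names = []
    for token in spec.split(","):
        if not token:
            continue                      # skip empty tokens such as "a,,b"
        if os.path.isfile(token):
            with open(token, encoding="utf-8") as handle:
                for raw in handle:
                    name = raw.strip()
                    if name:              # ignore blank lines in the file
                        names.append(name)
        else:
            names.append(token)
    return names
```

Under these assumptions, names=sample1,extra.txt would yield sample1 followed by every non-blank line of extra.txt.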

lines=

(maxlines) Maximum number of lines to process from the input file. Processing stops when this limit is reached, which is useful for testing or for processing only part of a large file.

replace=

Replace text in lines before matching. Format: old,new - replaces all instances of 'old' with 'new' in each line before performing the matching operation.
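
The effect of replacing text before matching can be shown with a small Python sketch. The helper apply_replace is hypothetical and only models the old,new format described above; it says nothing about whether the tool writes the original or the modified line to the output.

```python
def apply_replace(line: str, replace_spec: str) -> str:
    """Apply an `old,new` replacement to a line before it is matched."""
    old, new = replace_spec.split(",", 1)   # split only on the first comma
    return line.replace(old, new)           # replaces every occurrence


names = {"sample_A"}
line = "sample-A"
print(line in names)                          # False
print(apply_replace(line, "-,_") in names)    # True
```

Here a line that would not match on its own becomes a match once the hyphen is normalized to an underscore.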

verbose=f

Enable verbose output for debugging and detailed processing information.

Java Parameters

-Xmx: This will set Java's memory usage, overriding autodetection. -Xmx20g will specify 20 gigs of RAM, and -Xmx200m will specify 200 megs. The max is typically 85% of physical memory.
-eoom: This flag will cause the process to exit if an out-of-memory exception occurs. Requires Java 8u92+.
-da: Disable assertions.

Examples

Basic Line Filtering

filterlines.sh in=data.txt out=filtered.txt names=unwanted.txt

Removes all lines from data.txt that match any name listed in unwanted.txt (one name per line).

Include Matching Lines

filterlines.sh in=data.txt out=wanted.txt names=keep.txt include=t

Keeps only lines that match names in keep.txt, discarding all others.

Substring Matching

filterlines.sh in=sequences.txt out=filtered.txt names=patterns.txt substring=name

Removes every line that contains any pattern from patterns.txt as a substring.

Case-Insensitive Prefix Matching

filterlines.sh in=headers.txt out=clean.txt names=sample1,sample2,sample3 prefix=t case=f

Removes lines whose prefix (the text up to the first whitespace) matches sample1, sample2, or sample3, ignoring case.

Text Replacement During Filtering

filterlines.sh in=data.txt out=clean.txt names=badwords.txt replace=old_term,new_term

Replaces "old_term" with "new_term" in each line before performing the filtering operation.

Algorithm Details

Core Filtering Implementation

FilterLines implements a streaming text filtering system using hash-based name lookup with configurable matching criteria. The implementation prioritizes memory conservation by processing files line-by-line without loading entire contents into memory.

Name Storage and Lookup

Filter names are stored in a LinkedHashSet<String> for O(1) hash-based lookup performance while preserving insertion order
Case-insensitive mode applies String.toLowerCase() transformation to all names during initialization (lines 109-115)
File-based name lists are processed using Tools.addNames() which parses comma-delimited inputs and handles both literal strings and file paths with line-by-line reading (lines 103-108)
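
The storage scheme described above can be approximated in Python, where a dict with None values plays the role of Java's LinkedHashSet (hash-based membership tests, iteration in insertion order). This is a sketch under that substitution; build_name_set is a hypothetical name.

```python
def build_name_set(names: list[str], case_sensitive: bool = True) -> dict[str, None]:
    """Store names in an insertion-ordered, hash-based container.

    A dict with None values stands in for Java's LinkedHashSet here:
    membership tests are hash lookups and iteration follows insertion order.
    """
    ordered = {}
    for name in names:
        key = name if case_sensitive else name.lower()
        ordered.setdefault(key, None)   # duplicates keep their first position
    return ordered


name_set = build_name_set(["ChrX", "chr1", "CHRX"], case_sensitive=False)
print(list(name_set))        # ['chrx', 'chr1']
print("chr1" in name_set)    # True
```

Lowercasing at load time means each input line only needs to be lowercased once before lookup, rather than comparing case-insensitively against every name.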

Line Processing Pipeline

The filtering process follows this execution sequence for each input line (process() method, lines 157-196):

Line Reading: TextFile.readLine() with trim enabled for whitespace normalization
Preprocessing: Case conversion using toLowerCase() if ignoreCase enabled, followed by String.replace() for text substitution
Prefix Extraction: Character-by-character scan using Character.isWhitespace() to identify prefix boundaries (lines 164-173)
Hash Lookup: LinkedHashSet.contains() for O(1) exact matching of line or prefix
Substring Scanning: Enhanced for-each loop through name set with String.contains() calls for substring matching (lines 179-184)
Boolean Logic: XOR operation (match != exclude) determines final line retention (line 185)
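
The sequence above can be sketched end to end in Python. This is an illustrative reconstruction from the steps listed, not the Java source: the function filter_lines and its keyword arguments are hypothetical, and it assumes that a line matches if either the full line or its prefix matches, and that the original (trimmed) line is what gets emitted.

```python
from typing import Iterable, Iterator, Optional


def filter_lines(lines: Iterable[str], names: set[str], *, include: bool = False,
                 prefix: bool = False, case_sensitive: bool = True,
                 name_in_line: bool = False, line_in_name: bool = False,
                 replace: Optional[tuple[str, str]] = None,
                 max_lines: Optional[int] = None) -> Iterator[str]:
    """Yield the lines that survive filtering, one at a time (streaming)."""
    if not case_sensitive:
        names = {n.lower() for n in names}
    exclude = not include
    processed = 0
    for raw in lines:
        if max_lines is not None and processed >= max_lines:
            break                                   # line limit reached
        processed += 1
        original = raw.strip()                      # trim, as in line reading
        line = original if case_sensitive else original.lower()
        if replace is not None:
            line = line.replace(replace[0], replace[1])
        parts = line.split(None, 1)                 # split at first whitespace
        head = parts[0] if (prefix and parts) else None
        candidates = [line] if head is None else [line, head]
        match = any(c in names for c in candidates)  # exact hash lookup
        if not match and (name_in_line or line_in_name):
            match = any((name_in_line and n in c) or (line_in_name and c in n)
                        for n in names for c in candidates)
        if match != exclude:                        # XOR decides retention
            yield original


data = ["chr1 len=100", "chr2 len=200", "chrM len=16"]
print(list(filter_lines(data, {"chr2"}, prefix=True)))
# ['chr1 len=100', 'chrM len=16']
print(list(filter_lines(data, {"CHR2"}, prefix=True, include=True,
                        case_sensitive=False)))
# ['chr2 len=200']
```

Writing the retention test as match != exclude lets one code path serve both modes: with include=f a line is kept when it does not match, and with include=t it is kept when it does.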

Substring Matching Implementation

Bidirectional Mode: Both nameSubstringOfLine and lineSubstringOfName flags enabled for comprehensive substring detection
Direction Control: lineSubstringOfName checks if line is contained within any name using String.contains()
Pattern Matching: nameSubstringOfLine checks if any name is contained within the line
Prefix Integration: Substring matching applies to both full lines and extracted prefixes when prefix mode is active

Performance Characteristics

Memory Usage: O(n) where n = number of unique filter names, independent of input file size due to streaming processing
Time Complexity: O(1) expected per line for exact matching; substring matching makes up to n String.contains() calls per line (n = name count), each of which costs more as the line and the names get longer
Processing Rate: Calculated as linesProcessed/(t.elapsed) with nanosecond precision timing using Timer class
Whitespace Detection: Character.isWhitespace() provides Unicode-compliant whitespace identification for prefix extraction

I/O Implementation Details

Input Processing: TextFile class with FileFormat.testInput() for format detection and compression handling
Output Writing: TextStreamWriter with concurrent processing using start() and poisonAndWait() methods
Buffer Management: Shared.capBuffers(4) limits buffer pool size to 4 buffers for memory control
Compression Support: ReadWrite.USE_PIGZ enables parallel gzip compression with configurable thread count

Statistics and Monitoring

Counters: Maintains linesProcessed, linesOut, and bytesOut counters for comprehensive processing metrics
Rate Calculation: Processing rate reported as "k reads/sec" using Tools.format() with 2 decimal precision
Error State Tracking: Boolean errorState flag aggregated from TextFile.close() and TextStreamWriter operations
Line Limiting: maxLines parameter enforced with a >= comparison, so processing stops as soon as the limit is reached
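
For readers who want to see how such counters and a rate figure fit together, here is a small Python sketch. It is not the tool's code: format_rate is a hypothetical helper, the keep/drop test is a stand-in, and counting one extra byte per line for the newline is an assumption of this example.

```python
import time


def format_rate(lines_processed: int, elapsed_ns: int) -> str:
    """Format throughput in thousands per second with two decimals."""
    seconds = elapsed_ns / 1e9
    k_per_sec = (lines_processed / seconds) / 1000 if seconds > 0 else 0.0
    return f"{k_per_sec:.2f}k reads/sec"


start = time.perf_counter_ns()
lines_processed = lines_out = bytes_out = 0
for line in ["a", "bb", "ccc"]:
    lines_processed += 1
    if line != "bb":                                  # stand-in keep/drop decision
        lines_out += 1
        bytes_out += len(line.encode("utf-8")) + 1    # +1 for the newline (assumed)
elapsed = time.perf_counter_ns() - start

print(lines_processed, lines_out, bytes_out)   # 3 2 6
print(format_rate(lines_processed, elapsed))   # value depends on the machine
print(format_rate(2_500_000, 2_000_000_000))   # 1250.00k reads/sec
```

Comparing linesProcessed with linesOut after a run is a quick sanity check that the names list matched roughly as many lines as expected.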

Support

For questions and support:

Email: bbushnell@lbl.gov
Documentation: bbmap.org