FilterLines
Filters lines by exact match or substring. This tool processes text files line by line, comparing each line against a list of filter names to determine whether the line should be included or excluded from the output based on various matching criteria.
Basic Usage
filterlines.sh in=<file> out=<file> names=<file> include=<t/f>
The tool reads lines from the input file and filters them based on names provided in the names parameter. By default, matching lines are excluded (filtered out), but this can be reversed with the include parameter.
Parameters
Parameters control how lines are matched against the filter names and how the filtering process operates.
Parameters
- include=f
- Set to 'true' to include the filtered names rather than excluding them. When false (default), matching lines are excluded from output. When true, only matching lines are included in output.
- prefix=f
- Allow matching of only the line's prefix (all characters up to first whitespace). When enabled, only the portion of each line before the first whitespace character is used for matching comparisons.
- substring=f
- Allow one name to be a substring of the other, rather than a full match. Options:
f: No substring matching; requires exact matches.
t: Bidirectional substring matching; allows both directions.
line: Allow input lines to be substrings of names in the list.
name: Allow names in the list to be substrings of input lines.
- case=t
- (casesensitive) Match case also. When true (default), matching is case-sensitive. When false, all comparisons are performed in lowercase for case-insensitive matching.
- ow=t
- (overwrite) Overwrites files that already exist. When true (default), existing output files will be overwritten without warning.
- app=f
- (append) Append to files that already exist. When true, output is appended to existing files rather than overwriting them.
- zl=4
- (ziplevel) Set compression level, 1 (low) to 9 (max). Controls the compression level when writing compressed output files.
- names=
- A list of strings or files, comma-delimited. Files must have one name per line. This parameter specifies the filter criteria - can be literal strings separated by commas, or file paths containing one name per line. Multiple files can be specified.
- lines=
- (maxlines) Maximum number of lines to process from the input file. Processing stops when this limit is reached, useful for testing or processing only a portion of large files.
- replace=
- Replace text in lines before matching. Format: old,new - replaces all instances of 'old' with 'new' in each line before performing the matching operation.
- verbose=f
- Enable verbose output for debugging and detailed processing information.
Java Parameters
- -Xmx
- This will set Java's memory usage, overriding autodetection. -Xmx20g will specify 20 gigs of RAM, and -Xmx200m will specify 200 megs. The max is typically 85% of physical memory.
- -eoom
- This flag will cause the process to exit if an out-of-memory exception occurs. Requires Java 8u92+.
- -da
- Disable assertions.
Examples
Basic Line Filtering
filterlines.sh in=data.txt out=filtered.txt names=unwanted.txt
Removes all lines from data.txt that match any name listed in unwanted.txt (one name per line).
Include Matching Lines
filterlines.sh in=data.txt out=wanted.txt names=keep.txt include=t
Keeps only lines that match names in keep.txt, discarding all others.
Substring Matching
filterlines.sh in=sequences.txt out=filtered.txt names=patterns.txt substring=name
Removes every line that contains any pattern from patterns.txt as a substring (include defaults to f, so matching lines are excluded).
Case-Insensitive Prefix Matching
filterlines.sh in=headers.txt out=clean.txt names=sample1,sample2,sample3 prefix=t case=f
Removes lines whose prefix (the text up to the first whitespace) matches sample1, sample2, or sample3, ignoring case.
Text Replacement During Filtering
filterlines.sh in=data.txt out=clean.txt names=badwords.txt replace=old_term,new_term
Replaces "old_term" with "new_term" in each line before performing the filtering operation.
Algorithm Details
Core Filtering Implementation
FilterLines implements a streaming text filtering system using hash-based name lookup with configurable matching criteria. The implementation prioritizes memory conservation by processing files line-by-line without loading entire contents into memory.
Name Storage and Lookup
- Filter names are stored in a LinkedHashSet<String> for O(1) hash-based lookup performance while preserving insertion order
- Case-insensitive mode applies String.toLowerCase() transformation to all names during initialization (lines 109-115)
- File-based name lists are processed using Tools.addNames() which parses comma-delimited inputs and handles both literal strings and file paths with line-by-line reading (lines 103-108)
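The name-storage behavior described above can be sketched roughly as follows. This is a minimal illustration, not the FilterLines source: NameSetSketch and buildNames are hypothetical names, and the sketch handles only literal comma-delimited strings, whereas the real tool (via Tools.addNames()) also reads names from files.

```java
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical sketch; NameSetSketch and buildNames are illustrative names
// and are not part of the FilterLines source.
class NameSetSketch {

    // Splits a comma-delimited list of literal names and stores them in
    // insertion order, lowercasing each name when matching ignores case.
    static Set<String> buildNames(String commaDelimited, boolean ignoreCase) {
        Set<String> names = new LinkedHashSet<>();
        for (String name : commaDelimited.split(",")) {
            if (name.isEmpty()) {
                continue; // skip empty entries such as "a,,b"
            }
            names.add(ignoreCase ? name.toLowerCase() : name);
        }
        return names;
    }

    public static void main(String[] args) {
        // With ignoreCase=true, "Sample1" and "SAMPLE1" collapse into one entry.
        Set<String> names = buildNames("Sample1,sample2,SAMPLE1", true);
        System.out.println(names);
    }
}
```

Lowercasing the names once at load time means each input line needs only a single toLowerCase() call before lookup, rather than a case-insensitive comparison against every name.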
Line Processing Pipeline
The filtering process follows this execution sequence for each input line (process() method, lines 157-196):
- Line Reading: TextFile.readLine() with trim enabled for whitespace normalization
- Preprocessing: Case conversion using toLowerCase() if ignoreCase enabled, followed by String.replace() for text substitution
- Prefix Extraction: Character-by-character scan using Character.isWhitespace() to identify prefix boundaries (lines 164-173)
- Hash Lookup: LinkedHashSet.contains() for O(1) exact matching of line or prefix
- Substring Scanning: Enhanced for-each loop through name set with String.contains() calls for substring matching (lines 179-184)
- Boolean Logic: XOR operation (match != exclude) determines final line retention (line 185)
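As a rough illustration of the sequence above, the sketch below walks one line through trimming, case conversion, replacement, prefix extraction, hash lookup, and the final XOR. It is written under assumptions: LinePipelineSketch, prefixOf, and keep are hypothetical names, the prefix is checked only when the full line does not match, and substring scanning is left out to keep the example short.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical sketch of the per-line decision; not the FilterLines source.
class LinePipelineSketch {

    // Returns the characters before the first whitespace character,
    // or the whole line if it contains no whitespace.
    static String prefixOf(String line) {
        for (int i = 0; i < line.length(); i++) {
            if (Character.isWhitespace(line.charAt(i))) {
                return line.substring(0, i);
            }
        }
        return line;
    }

    // Decides whether a line is written to the output.
    // 'names' must already be lowercased when ignoreCase is true.
    // 'exclude' is the inverse of the include= parameter.
    // Pass replaceOld == null to skip the replacement step.
    static boolean keep(String rawLine, Set<String> names, boolean ignoreCase,
                        boolean usePrefix, String replaceOld, String replaceNew,
                        boolean exclude) {
        String line = rawLine.trim();
        if (ignoreCase) {
            line = line.toLowerCase();
        }
        if (replaceOld != null) {
            line = line.replace(replaceOld, replaceNew);
        }
        boolean match = names.contains(line);
        if (!match && usePrefix) {
            match = names.contains(prefixOf(line));
        }
        // XOR: a match is dropped in exclude mode and kept in include mode.
        return match != exclude;
    }

    public static void main(String[] args) {
        Set<String> names = new LinkedHashSet<>(Arrays.asList("sample1"));
        // Exclude mode with prefix and case-insensitive matching.
        System.out.println(keep("Sample1 description", names, true, true, null, null, true));
        System.out.println(keep("other line", names, true, true, null, null, true));
    }
}
```

The single comparison match != exclude covers both modes, so no separate branch is needed for include=t and include=f.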
Substring Matching Implementation
- Bidirectional Mode: Both nameSubstringOfLine and lineSubstringOfName flags enabled for comprehensive substring detection
- Direction Control: lineSubstringOfName checks if line is contained within any name using String.contains()
- Pattern Matching: nameSubstringOfLine checks if any name is contained within the line
- Prefix Integration: Substring matching applies to both full lines and extracted prefixes when prefix mode is active
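The two direction flags can be illustrated with a short sketch. The flag names nameSubstringOfLine and lineSubstringOfName come from the description above; the class SubstringModeSketch and the method matches are hypothetical, and the mapping assumed here is substring=name enabling the first flag, substring=line the second, substring=t both, and substring=f neither.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical sketch of directional substring matching; not the FilterLines source.
class SubstringModeSketch {

    // Returns true if 'line' matches any name exactly, or via the enabled
    // substring direction(s).
    static boolean matches(String line, Set<String> names,
                           boolean nameSubstringOfLine, boolean lineSubstringOfName) {
        if (names.contains(line)) {
            return true; // exact match via hash lookup
        }
        for (String name : names) {
            if (nameSubstringOfLine && line.contains(name)) {
                return true; // a name occurs inside the line
            }
            if (lineSubstringOfName && name.contains(line)) {
                return true; // the line occurs inside a name
            }
        }
        return false;
    }

    public static void main(String[] args) {
        Set<String> names = new LinkedHashSet<>(Arrays.asList("chr1"));
        // substring=name: "chr1" is contained in "chr1_random".
        System.out.println(matches("chr1_random", names, true, false));
        // substring=line: "chr" is contained in the name "chr1".
        System.out.println(matches("chr", names, false, true));
    }
}
```

Because the loop visits every name, substring modes cost one pass over the name set per input line, unlike the constant-time exact lookup.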
Performance Characteristics
- Memory Usage: O(n) where n = number of unique filter names, independent of input file size due to streaming processing
- Time Complexity: O(1) expected per line for exact matching; substring matching scans every name for each line, roughly O(n*m) per line where n = name count, m = average name length
- Processing Rate: Calculated as linesProcessed/(t.elapsed) with nanosecond precision timing using Timer class
- Whitespace Detection: Character.isWhitespace() provides Unicode-compliant whitespace identification for prefix extraction
I/O Implementation Details
- Input Processing: TextFile class with FileFormat.testInput() for format detection and compression handling
- Output Writing: TextStreamWriter with concurrent processing using start() and poisonAndWait() methods
- Buffer Management: Shared.capBuffers(4) limits buffer pool size to 4 buffers for memory control
- Compression Support: ReadWrite.USE_PIGZ enables parallel gzip compression with configurable thread count
Statistics and Monitoring
- Counters: Maintains linesProcessed, linesOut, and bytesOut counters for comprehensive processing metrics
- Rate Calculation: Processing rate reported as "k reads/sec" using Tools.format() with 2 decimal precision
- Error State Tracking: Boolean errorState flag aggregated from TextFile.close() and TextStreamWriter operations
- Line Limiting: maxLines parameter enforced using >= comparison to enable precise processing truncation
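The counters, the line limit, and the rate calculation can be sketched as below. The counter names linesProcessed, linesOut, and bytesOut come from the description above; everything else is an assumption for illustration, including the class StatsSketch, the predicate argument, the use of Long.MAX_VALUE to mean "no limit", and counting one byte per character plus a one-byte newline.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;

// Hypothetical sketch of counters, line limiting, and rate reporting;
// not the FilterLines source.
class StatsSketch {
    long linesProcessed = 0;
    long linesOut = 0;
    long bytesOut = 0;

    // Processes lines until the list ends or maxLines have been processed.
    // Callers pass Long.MAX_VALUE for "no limit" (an assumption of this sketch).
    void run(List<String> lines, long maxLines, Predicate<String> keep) {
        for (String line : lines) {
            if (linesProcessed >= maxLines) {
                break; // >= stops before processing any line beyond the limit
            }
            linesProcessed++;
            if (keep.test(line)) {
                linesOut++;
                // Assumes single-byte characters and a one-byte newline.
                bytesOut += line.length() + 1;
            }
        }
    }

    // Converts a line count and elapsed nanoseconds to thousands of lines per second.
    static double kiloLinesPerSecond(long lines, long elapsedNanos) {
        return lines * 1_000_000.0 / elapsedNanos;
    }

    public static void main(String[] args) {
        StatsSketch stats = new StatsSketch();
        long start = System.nanoTime();
        stats.run(Arrays.asList("a", "bb", "ccc"), 2, line -> !line.equals("a"));
        long elapsed = Math.max(1, System.nanoTime() - start);
        System.out.println(String.format("%.2fk lines/sec",
                kiloLinesPerSecond(stats.linesProcessed, elapsed)));
    }
}
```

Checking the limit before incrementing the counter means exactly maxLines lines are processed, and a limit of 0 processes none.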
Support
For questions and support:
- Email: bbushnell@lbl.gov
- Documentation: bbmap.org