FilterByName
Filters reads by name using exact matching, substring matching, or prefix matching. Supports both inclusion and exclusion modes, paired reads, and positional trimming of output sequences.
Basic Usage
filterbyname.sh in=<file> in2=<file2> out=<outfile> out2=<outfile2> names=<string,string,string> include=<t/f>
Input files can be fasta, fastq, or sam format (optionally gzipped). Second input and output files are for paired reads and are optional. If input is paired and there is only one output file, it will be written interleaved.
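Leading > and @ symbols are part of the fasta and fastq formats, not part of the sequence name, so use names=e.coli_K12 (correct) rather than names=>e.coli_K12 or names=@e.coli_K12 (incorrect).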
Parameters
Parameters are organized by their function in the filtering process. The tool supports multiple matching modes and output options for flexible read selection.
Core Parameters
- include=f
- Set to 'true' to include the filtered names rather than excluding them. When false (default), reads matching the names list are excluded from output.
- names=
- A list of strings or files containing names to match. Can be comma-separated strings, or files with one name per line, or standard read files (fasta, fastq, or sam). Multiple files and strings can be combined.
- minlen=0
- Do not output reads shorter than this length. Applied after any positional trimming.
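The way a names= value could be expanded into a list is sketched below in Python. This is an illustration only, not the tool's code: load_names is a hypothetical helper, and it assumes that a token naming an existing file is read one name per line while any other token is taken as a literal name. Read files (fasta, fastq, sam) are not handled in this sketch.

```python
import os


def load_names(names_arg):
    """Expand a comma-separated names= value into an ordered list of unique names.

    Assumption: a token that is an existing file is read one name per line;
    anything else is treated as a literal name.
    """
    seen = {}  # dict keys keep insertion order and drop duplicates
    for token in names_arg.split(","):
        if not token:
            continue
        if os.path.isfile(token):
            with open(token) as handle:
                for line in handle:
                    name = line.rstrip("\n")
                    if name:
                        seen.setdefault(name, None)
        else:
            seen.setdefault(token, None)
    return list(seen)
```

Strings and files can be mixed in one call, for example load_names("readX,contaminants.txt").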
Matching Parameters
- substring=f
- Allow one name to be a substring of the other, rather than requiring a full match.
- f: No substring matching (exact match required)
- t: Bidirectional substring matching
- header: Allow input read headers to be substrings of names in list
- name: Allow names in list to be substrings of input read headers
- prefix=f
- Allow names to match read header prefixes. When true, read headers starting with any name in the list will match.
- case=t
- (casesensitive) Controls whether matching is case-sensitive. When true (default), names and headers must agree in case. Set to false for case-insensitive matching.
Name Processing Parameters
- ths=f
- (truncateheadersymbol) Ignore a leading @ or > symbol in the names file. Useful when the names file contains fasta/fastq headers.
- tws=f
- (truncatewhitespace) Ignore leading or trailing whitespace in the names file. Useful for names files with formatting variations.
- truncate=f
- Set both ths and tws at the same time. Convenient shorthand for cleaning up names from various file formats.
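The effect of ths and tws on a single entry from the names list can be sketched as follows. This is a minimal Python illustration, not the tool's code: clean_name is a hypothetical helper, and the order of the two steps (whitespace first, then the header symbol) is an assumption.

```python
def clean_name(name, ths=False, tws=False):
    """Normalize one entry from the names list.

    tws: drop leading and trailing whitespace.
    ths: drop a single leading '@' or '>' header symbol.
    Assumption: whitespace is trimmed before the symbol is removed.
    """
    if tws:
        name = name.strip()
    if ths and name[:1] in ("@", ">"):
        name = name[1:]
    return name
```

With both flags set (the equivalent of truncate=t), an entry such as " @read1 " becomes "read1".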
Positional Parameters
These parameters allow output of only a portion of matching sequences. Zero-based, inclusive coordinates. Intended for single sequence extraction with include=t mode.
- from=-1
- Only print bases starting at this position. Zero-based coordinate system. -1 means start from beginning.
- to=-1
- Only print bases up to this position. Zero-based, inclusive coordinate system. -1 means continue to end.
- range=
- Set from and to with a single flag using format "start-end". Example: range=100-500 sets from=100, to=500.
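The zero-based, inclusive coordinates described above can be illustrated with a short Python sketch. The helper names parse_range and trim_bases are hypothetical, and clamping an out-of-range end position to the sequence length is an assumption.

```python
def parse_range(value):
    """Split a 'start-end' string such as '100-500' into two integers."""
    start, end = value.split("-", 1)
    return int(start), int(end)


def trim_bases(bases, start=-1, end=-1):
    """Keep bases start..end, zero-based and inclusive; -1 means no limit."""
    lo = 0 if start < 0 else start
    # Inclusive end, so the slice stops one past it (assumed clamped to length).
    hi = len(bases) if end < 0 else min(end + 1, len(bases))
    return bases[lo:hi]
```

Because both ends are inclusive, range=100-500 keeps 401 bases.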
File I/O Parameters
- ow=t
- (overwrite) Overwrites files that already exist. Set to false to prevent accidental overwriting of existing output files.
- app=f
- (append) Append to files that already exist rather than overwriting them. Cannot be used with ow=t.
- zl=4
- (ziplevel) Set compression level for gzipped output files, from 1 (fastest, lowest compression) to 9 (slowest, highest compression).
- int=f
- (interleaved) Determines whether INPUT file is considered interleaved. When true, paired reads are expected in a single interleaved file.
Java Parameters
- -Xmx
- This will set Java's memory usage, overriding autodetection. -Xmx20g will specify 20 gigs of RAM, and -Xmx200m will specify 200 megs. The max is typically 85% of physical memory. Default: 800m.
- -eoom
- This flag will cause the process to exit if an out-of-memory exception occurs. Requires Java 8u92+.
- -da
- Disable assertions for better performance in production environments.
Examples
Basic Filtering - Exclude Specific Reads
filterbyname.sh in=reads.fq out=filtered.fq names=unwanted1,unwanted2,unwanted3
Excludes reads with names exactly matching "unwanted1", "unwanted2", or "unwanted3" from the input file.
Include Only Specific Reads
filterbyname.sh in=reads.fq out=selected.fq names=target1,target2 include=t
Outputs only reads with names exactly matching "target1" or "target2".
Paired Read Filtering
filterbyname.sh in1=reads_R1.fq in2=reads_R2.fq out1=clean_R1.fq out2=clean_R2.fq names=contaminants.txt
Filters paired reads, excluding any pairs where either read matches names from contaminants.txt file.
Substring Matching
filterbyname.sh in=reads.fq out=filtered.fq names=contamination substring=name
Excludes reads whose headers contain "contamination" as a substring anywhere in the name.
Prefix Matching
filterbyname.sh in=reads.fq out=filtered.fq names=HWI-,HWUSI- prefix=t
Excludes reads whose names start with "HWI-" or "HWUSI-" prefixes.
Extract and Trim Specific Sequence
filterbyname.sh in=genome.fa out=gene.fa names=gene_of_interest include=t from=100 to=500
Extracts the sequence named "gene_of_interest" and outputs only bases 100 through 500 (zero-based, inclusive coordinates, so 401 bases).
Case-Insensitive Filtering with Name Cleanup
filterbyname.sh in=reads.fq out=filtered.fq names=contamlist.txt case=f truncate=t
Performs case-insensitive filtering while cleaning up names (removing header symbols and whitespace) from the contamlist.txt file.
Algorithm Details
Matching Strategy
FilterReadsByName implements multiple matching strategies using LinkedHashSet for name storage and String methods for pattern matching:
- Exact Matching: Uses LinkedHashSet<String> names.contains() for O(1) lookup performance with hash-based exact name matching.
- Substring Matching: Implements for-each loop through names set using String.contains() method: headerSubstringOfName checks name.contains(header), nameSubstringOfHeader checks header.contains(name).
- Prefix Matching: Uses header.startsWith(name) for each name in the set, with TODO comment indicating potential trie optimization for large prefix lists.
- Case Sensitivity: When ignoreCase=true, converts all names to lowercase using String.toLowerCase() during preprocessing and applies same transformation to headers during comparison.
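The four strategies above can be approximated with the following Python sketch. It is an illustration, not a port of the Java implementation: make_matcher and its arguments are hypothetical names, and the order in which the modes are tried is an assumption.

```python
def make_matcher(names, substring="f", prefix=False, case_sensitive=True):
    """Build a predicate that tests one read header against a names list.

    substring: "f" (exact only), "t" (either direction),
               "header" (header inside a name), "name" (name inside a header).
    """
    fold = (lambda s: s) if case_sensitive else str.lower
    name_set = {fold(n) for n in names}  # hash set gives O(1) average exact lookup

    def matches(header):
        h = fold(header)
        if h in name_set:
            return True
        # The remaining modes scan every name, so they cost O(number of names).
        if substring in ("t", "name") and any(n in h for n in name_set):
            return True
        if substring in ("t", "header") and any(h in n for n in name_set):
            return True
        if prefix and any(h.startswith(n) for n in name_set):
            return True
        return False

    return matches
```

For example, make_matcher(["HWI-"], prefix=True) accepts the header "HWI-ST123", while the default exact mode would not.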
Header Processing
The tool implements header processing through IlluminaHeaderParser2 and ByteBuilder for coordinate extraction and name normalization:
- Coordinate Parsing: Uses IlluminaHeaderParser2.parse(r1.id) followed by ihp.appendCoordinates(bb.clear()) to extract Illumina sequencer coordinate information when coordinate=true.
- Prefix Extraction: Implements character-by-character scanning from position 1 through header.length(), detecting whitespace or "/1", "/2" patterns, plus "1:", "2:" patterns preceded by whitespace for paired read suffix identification.
- Name Normalization: Uses conditional string operations: truncateHeaderSymbol removes leading '@' or '>' via s.substring(1) when s.charAt(0) matches these symbols, trimWhitespace applies String.trim() for leading/trailing whitespace removal.
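The prefix extraction step can be pictured with the Python sketch below. It is a simplified reading of the description, not the tool's code: pair_prefix is a hypothetical helper, cutting at the first whitespace character is assumed to cover the " 1:" and " 2:" cases, and a "/1" or "/2" is assumed to count only at the end of the header or directly before whitespace.

```python
def pair_prefix(header):
    """Return the part of a read header that both mates of a pair share."""
    # Scan from position 1, so a header is never reduced to an empty string.
    for i in range(1, len(header)):
        c = header[i]
        if c.isspace():
            # Covers Illumina-style "name 1:N:0:ACGT" and "name 2:N:0:ACGT".
            return header[:i]
        if c == "/" and header[i + 1:i + 2] in ("1", "2"):
            # Accept "/1" or "/2" only as a suffix, not inside a longer token.
            if i + 2 == len(header) or header[i + 2].isspace():
                return header[:i]
    return header
```

Under these assumptions, "read7/1" and "read7/2" both reduce to "read7", so one name can match either mate.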
Memory Management
The implementation uses specific data structures and concurrent processing for memory efficiency:
- Streaming Processing: Uses ConcurrentReadInputStream.getReadInputStream() and ConcurrentReadOutputStream.getStream() with configurable buffer size (buff=4) for concurrent read/write operations with ListNum<Read> batch processing.
- Name Storage: Names are stored in LinkedHashSet<String> to maintain insertion order while providing O(1) contains() operations. Initial storage in ArrayList<String> names is converted to array via names.toArray(new String[names.size()]) for processing.
- Buffer Management: Uses Shared.capBuffers(4) to limit concurrent buffer allocation and ArrayList<Read> retain for filtered read accumulation per batch, with reads.size() capacity pre-allocation.
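The batch-wise flow described above, combined with the include flag, can be sketched in Python as follows. This is an illustration under stated assumptions, not the Java implementation: filter_batches is a hypothetical helper, each read is modelled as a (name, bases) tuple, each batch entry is a (read1, read2) pair with read2 set to None for single-end data, and a pair is assumed to match when either mate matches.

```python
def filter_batches(batches, matches, include=False):
    """Yield one retained list per input batch.

    matches: predicate taking a read name and returning True on a hit.
    include: False drops matching reads, True keeps only matching reads.
    """
    for batch in batches:
        retain = []  # per-batch accumulator for the reads that survive
        for r1, r2 in batch:
            hit = matches(r1[0]) or (r2 is not None and matches(r2[0]))
            if hit == include:
                retain.append((r1, r2))
        yield retain
```

Only one batch is held at a time, so memory use is governed by the batch size and the names set rather than by the size of the input file.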
Performance Characteristics
- Time Complexity: O(1) average per header for LinkedHashSet.contains() exact matching. Substring matching is O(n*m) per header, where n=names.size() and m=average header.length(), because every name is tested with a linear String.contains() scan.
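- Memory Usage: The LinkedHashSet of names and the read buffers live in the Java heap. When -Xmx is not given, the launcher script starts from a default of 800 MB (z="-Xmx800m") and calcXmx() calls freeRam 800m 84, which appears to size the heap from available memory (800 MB fallback, 84% share).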
- I/O Throughput: ReadWrite.USE_PIGZ=ReadWrite.USE_UNPIGZ=true enables parallel compression with ReadWrite.setZipThreads(Shared.threads()) for multithreaded file operations.
Input/Output Formats
Supported Input Formats
- FASTQ: Standard and compressed (.gz, .bz2) formats
- FASTA: Standard and compressed formats
- SAM: Sequence Alignment/Map format (headers preserved when possible)
- Interleaved: Paired reads in single interleaved file
Names File Formats
- Plain text: One name per line
- Comma-separated: Multiple names on command line
- FASTA/FASTQ: Extracts sequence names from headers
- SAM: Extracts read names from SAM records
Output Options
- Single-end: Filtered reads to single output file
- Paired-end: Maintains read pairing in separate files
- Interleaved: Outputs paired reads in interleaved format
- Compression: Automatic format detection and configurable compression levels
Common Use Cases
Quality Control
- Remove contaminating sequences by name
- Filter out low-quality reads identified by upstream tools
- Remove adapter or primer sequences with known names
Sequence Extraction
- Extract specific sequences of interest from large datasets
- Subset reads for detailed analysis or validation
- Create training/test datasets from larger collections
Data Cleaning
- Remove reads matching known contamination databases
- Filter reads from specific sequencing runs or lanes
- Exclude reads with problematic naming patterns
Tips and Troubleshooting
Performance Tips
- Use exact matching when possible for best performance
- For large names lists, prefer exact matching, which uses a hash lookup, over substring or prefix matching, which test every name against each read header
- Enable compression for large output files to save disk space
- Increase memory allocation (-Xmx) for very large names lists
Common Issues
- Names not matching: Check for extra whitespace, case differences, or header symbols (@, >)
- Memory errors: Reduce batch size or increase -Xmx setting
- Slow performance: Avoid substring matching with very large names lists
- Paired read issues: Ensure both reads in a pair have the same prefix for proper matching
Support
For questions and support:
- Email: bbushnell@lbl.gov
- Documentation: bbmap.org