FilterByName

Basic Usage

filterbyname.sh in=<file> in2=<file2> out=<outfile> out2=<outfile2> names=<string,string,string> include=<t/f>

Input files can be fasta, fastq, or sam format (optionally gzipped). Second input and output files are for paired reads and are optional. If input is paired and there is only one output file, it will be written interleaved.

Important: Leading > and @ symbols are NOT part of sequence names; they are part of the fasta, fastq, and sam specifications. Therefore, use names=e.coli_K12 (correct) rather than names=>e.coli_K12 or names=@e.coli_K12 (incorrect).

Parameters

Parameters are organized by their function in the filtering process. The tool supports multiple matching modes and output options for flexible read selection.

Core Parameters

include=f: Set to true to output only the reads whose names match the list, rather than excluding them. When false (default), reads matching the names list are excluded from output.
names=: A list of strings or files containing names to match. Can be comma-separated strings, or files with one name per line, or standard read files (fasta, fastq, or sam). Multiple files and strings can be combined.
minlen=0: Do not output reads shorter than this length. Applied after any positional trimming.
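
The interaction of include and minlen can be modeled in a few lines of Python. This is an illustrative sketch of the documented behavior, not the tool's code; the function keep_read and its arguments are hypothetical, and exact-name matching is assumed.

```python
def keep_read(name, bases, names, include=False, minlen=0):
    """Return True if a read should be written to the output.

    Hypothetical model: 'names' is a set of exact names and 'bases'
    is the sequence string after any positional trimming.
    """
    matched = name in names
    # include=f drops matching reads; include=t keeps only matching reads.
    if matched != include:
        return False
    # minlen is checked last, on the (possibly trimmed) sequence.
    return len(bases) >= minlen
```

With the defaults, keep_read("r1", "ACGT", {"r1"}) is False because the read is on the exclusion list; passing include=True flips the decision.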

Matching Parameters

substring=f

Allow one name to be a substring of the other, rather than requiring a full match.

f: No substring matching (exact match required)
t: Bidirectional substring matching
header: Allow input read headers to be substrings of names in list
name: Allow names in list to be substrings of input read headers
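
The four settings differ only in which string is allowed to contain the other. The sketch below models that in Python; substring_match is a hypothetical helper written for illustration, not part of the tool.

```python
def substring_match(header, names, mode="f"):
    """Model the substring= settings (f, t, header, name) for one read header."""
    names = list(names)
    if mode == "f":
        # Exact match only.
        return header in names
    for name in names:
        header_in_name = header in name   # what substring=header allows
        name_in_header = name in header   # what substring=name allows
        if mode == "header" and header_in_name:
            return True
        if mode == "name" and name_in_header:
            return True
        if mode == "t" and (header_in_name or name_in_header):
            return True
    return False
```

For example, with the single list entry "chr1_contamination_A", a read header "contamination" matches under substring=header and substring=t, but not under substring=name or substring=f.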

prefix=f

Allow names to match read header prefixes. When true, read headers starting with any name in the list will match.
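
As a minimal Python sketch of this rule (prefix_match is a hypothetical name, and case-sensitive comparison is assumed):

```python
def prefix_match(header, names):
    """Return True if the header starts with any name in the list."""
    # str.startswith accepts a tuple of candidate prefixes.
    return header.startswith(tuple(names))
```

A header such as "HWI-ST123:1:1101" matches the list ["HWI-", "HWUSI-"], while "SRR123.1" does not.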

case=t

(casesensitive) Controls whether matching is case-sensitive. When true (default), matching is case-sensitive. Set to false for case-insensitive matching.

Name Processing Parameters

ths=f: (truncateheadersymbol) Ignore a leading @ or > symbol in the names file. Useful when the names file contains fasta/fastq headers.
tws=f: (truncatewhitespace) Ignore leading or trailing whitespace in the names file. Useful for names files with formatting variations.
truncate=f: Set both ths and tws at the same time. Convenient shorthand for cleaning up names from various file formats.
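
The cleanup these flags perform on each entry of the names file can be sketched as follows. The helper names are hypothetical, and the order of operations (whitespace first, then the header symbol) is an assumption made for this example.

```python
def clean_name(raw, ths=False, tws=False):
    """Apply the ths/tws cleanup to one entry from a names file."""
    name = raw
    if tws:
        # Drop leading and trailing whitespace.
        name = name.strip()
    if ths and name[:1] in ("@", ">"):
        # Drop a single leading fastq/fasta header symbol.
        name = name[1:]
    return name


def clean_names(lines, truncate=False, ths=False, tws=False):
    """Clean every entry; truncate=t is shorthand for ths=t tws=t."""
    ths = ths or truncate
    tws = tws or truncate
    return [clean_name(line, ths, tws) for line in lines]
```

With truncate=True, the entries " @read1 " and "read2" both come out as plain names ("read1" and "read2"); with all flags off, entries are left untouched.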

Positional Parameters

These parameters allow output of only a portion of matching sequences. Zero-based, inclusive coordinates. Intended for single sequence extraction with include=t mode.

from=-1: Only print bases starting at this position. Zero-based coordinate system. -1 means start from beginning.
to=-1: Only print bases up to this position. Zero-based, inclusive coordinate system. -1 means continue to end.
range=: Set from and to with a single flag using format "start-end". Example: range=100-500 sets from=100, to=500.
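
Because the coordinates are zero-based and inclusive, range=100-500 selects 401 bases. The Python sketch below illustrates the arithmetic; parse_range and trim are hypothetical helpers, not the tool's functions.

```python
def parse_range(text):
    """Split a 'start-end' string into (from, to) integers."""
    start, end = text.split("-", 1)
    return int(start), int(end)


def trim(bases, from_pos=-1, to_pos=-1):
    """Return the zero-based, inclusive slice [from_pos, to_pos] of a sequence."""
    start = 0 if from_pos < 0 else from_pos
    # Python slices exclude the end index, so add 1 to make 'to' inclusive.
    end = len(bases) if to_pos < 0 else to_pos + 1
    return bases[start:end]
```

For instance, trim("ACGTACGT", 2, 4) returns "GTA" (positions 2, 3, and 4), and leaving both values at -1 returns the whole sequence.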

File I/O Parameters

ow=t: (overwrite) Overwrites files that already exist. Set to false to prevent accidental overwriting of existing output files.
app=f: (append) Append to files that already exist rather than overwriting them. Cannot be used with ow=t.
zl=4: (ziplevel) Set compression level for gzipped output files, from 1 (fastest, lowest compression) to 9 (slowest, highest compression).
int=f: (interleaved) Determines whether INPUT file is considered interleaved. When true, paired reads are expected in a single interleaved file.

Java Parameters

-Xmx: This will set Java's memory usage, overriding autodetection. -Xmx20g will specify 20 gigs of RAM, and -Xmx200m will specify 200 megs. The max is typically 85% of physical memory. Default: 800m.
-eoom: This flag will cause the process to exit if an out-of-memory exception occurs. Requires Java 8u92+.
-da: Disable assertions for better performance in production environments.

Examples

Basic Filtering - Exclude Specific Reads

filterbyname.sh in=reads.fq out=filtered.fq names=unwanted1,unwanted2,unwanted3

Excludes reads with names exactly matching "unwanted1", "unwanted2", or "unwanted3" from the input file.

Include Only Specific Reads

filterbyname.sh in=reads.fq out=selected.fq names=target1,target2 include=t

Outputs only reads with names exactly matching "target1" or "target2".

Paired Read Filtering

filterbyname.sh in1=reads_R1.fq in2=reads_R2.fq out1=clean_R1.fq out2=clean_R2.fq names=contaminants.txt

Filters paired reads, excluding any pair in which either read matches a name listed in the contaminants.txt file.
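
The pair-level decision can be sketched in Python as below. keep_pair is a hypothetical helper, and treating include=t symmetrically (keep the pair if either mate matches) is an assumption of this sketch rather than documented behavior.

```python
def keep_pair(name1, name2, names, include=False):
    """Decide the fate of a read pair as a unit, using exact name matching."""
    # The pair counts as matched if either mate's name is in the list.
    matched = name1 in names or name2 in names
    # Exclude mode drops matched pairs; include mode keeps only matched pairs.
    return matched == include
```

Both mates are kept or dropped together, so the output files stay in sync.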

Substring Matching

filterbyname.sh in=reads.fq out=filtered.fq names=contamination substring=name

Excludes reads whose headers contain "contamination" as a substring anywhere in the name.

Prefix Matching

filterbyname.sh in=reads.fq out=filtered.fq names=HWI-,HWUSI- prefix=t

Excludes reads whose names start with "HWI-" or "HWUSI-" prefixes.

Extract and Trim Specific Sequence

filterbyname.sh in=genome.fa out=gene.fa names=gene_of_interest include=t from=100 to=500

Extracts the sequence named "gene_of_interest" and outputs only bases 100 through 500 (zero-based, inclusive coordinates, so 401 bases).

Case-Insensitive Filtering with Name Cleanup

filterbyname.sh in=reads.fq out=filtered.fq names=contamlist.txt case=f truncate=t

Performs case-insensitive filtering while cleaning up names (removing header symbols and whitespace) from the contamlist.txt file.

Algorithm Details

Matching Strategy

FilterReadsByName implements multiple matching strategies using LinkedHashSet for name storage and String methods for pattern matching:

Exact Matching: Uses LinkedHashSet<String> names.contains() for O(1) lookup performance with hash-based exact name matching.
Substring Matching: Iterates over the names set and tests each entry with String.contains(): headerSubstringOfName checks name.contains(header), and nameSubstringOfHeader checks header.contains(name).
Prefix Matching: Tests header.startsWith(name) for each name in the set. A TODO comment in the source notes a potential trie optimization for large prefix lists.
Case Sensitivity: When ignoreCase=true, converts all names to lowercase using String.toLowerCase() during preprocessing and applies same transformation to headers during comparison.
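
The preprocessing idea behind exact and case-insensitive matching (fold the names once, then fold only the header on each lookup) can be sketched in Python. NameMatcher is a hypothetical class for illustration; a dict stands in for LinkedHashSet because it also preserves insertion order and offers average O(1) membership tests.

```python
class NameMatcher:
    """Preprocess names once, then test read headers for an exact match."""

    def __init__(self, names, ignore_case=False):
        self.ignore_case = ignore_case
        # Fold case once up front so each lookup only has to fold the header.
        self.names = dict.fromkeys(
            n.lower() if ignore_case else n for n in names
        )

    def matches(self, header):
        key = header.lower() if self.ignore_case else header
        # Hash lookup: average O(1), independent of the number of names.
        return key in self.names
```

NameMatcher(["Read_A"], ignore_case=True).matches("READ_A") is True, while the same call with ignore_case=False is False.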

Header Processing

The tool implements header processing through IlluminaHeaderParser2 and ByteBuilder for coordinate extraction and name normalization:

Coordinate Parsing: Uses IlluminaHeaderParser2.parse(r1.id) followed by ihp.appendCoordinates(bb.clear()) to extract Illumina sequencer coordinate information when coordinate=true.
Prefix Extraction: Scans the header character by character from position 1 through header.length(), stopping at whitespace or at a "/1" or "/2" pattern. "1:" and "2:" patterns preceded by whitespace are also recognized, so paired-read suffixes can be identified.
Name Normalization: Uses conditional string operations: truncateHeaderSymbol removes leading '@' or '>' via s.substring(1) when s.charAt(0) matches these symbols, trimWhitespace applies String.trim() for leading/trailing whitespace removal.
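
A simplified Python sketch of the prefix scan follows. It is not a port of the tool's logic: pair_prefix is a hypothetical function, it treats any whitespace as the end of the shared prefix (which covers the " 1:" and " 2:" cases), and it only recognizes "/1" or "/2" at the very end of the header.

```python
def pair_prefix(header):
    """Return the part of a read header that both mates of a pair share."""
    # Start at position 1, so a header is never reduced to an empty prefix.
    for i in range(1, len(header)):
        c = header[i]
        # Stop at the first whitespace, e.g. before a " 1:N:0:..." comment.
        if c.isspace():
            return header[:i]
        # Stop at a trailing "/1" or "/2" mate suffix.
        if c == "/" and header[i + 1:] in ("1", "2"):
            return header[:i]
    return header
```

Under these assumptions, "read7/1" and "read7/2" both reduce to "read7", and "M01:23:ABC 1:N:0:1" and "M01:23:ABC 2:N:0:1" both reduce to "M01:23:ABC".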

Memory Management

The implementation uses specific data structures and concurrent processing for memory efficiency:

Streaming Processing: Uses ConcurrentReadInputStream.getReadInputStream() and ConcurrentReadOutputStream.getStream() with configurable buffer size (buff=4) for concurrent read/write operations with ListNum<Read> batch processing.
Name Storage: Names are stored in a LinkedHashSet<String>, which preserves insertion order while providing O(1) contains() operations. Names are first collected in an ArrayList<String> and converted to an array via names.toArray(new String[names.size()]) for processing.
Buffer Management: Uses Shared.capBuffers(4) to limit concurrent buffer allocation and ArrayList<Read> retain for filtered read accumulation per batch, with reads.size() capacity pre-allocation.
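
The batch-wise streaming pattern described above can be illustrated with a single-threaded Python sketch. It does not model the concurrent input and output streams; the function names and the batch size of 200 are hypothetical, and reads are represented as (name, bases) tuples.

```python
from itertools import islice


def batches(reads, size=200):
    """Yield lists of up to 'size' items from any iterable."""
    it = iter(reads)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch


def filter_stream(reads, names, include=False, size=200):
    """Filter (name, bases) tuples batch by batch.

    Only one batch is held in memory at a time, so memory use depends on
    the names set and the batch size, not on the size of the input file.
    """
    for batch in batches(reads, size):
        # Per-batch list of survivors, analogous to a 'retain' list.
        retain = [r for r in batch if (r[0] in names) == include]
        yield from retain
```

Because filter_stream is a generator, it can be fed directly from a file parser and written out as results arrive.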

Performance Characteristics

Time Complexity: Per read, O(1) on average for exact matching with LinkedHashSet.contains(), and O(n*m) for substring matching, where n=names.size() and m=average header.length(), because String.contains() performs a linear scan for each name.
Memory Usage: LinkedHashSet storage for the names, plus the JVM allocation chosen by the launcher script: calcXmx() defaults to 800 MB (z="-Xmx800m") and calls freeRam with the arguments 800m and 84 (percent) when sizing memory.
I/O Throughput: ReadWrite.USE_PIGZ=ReadWrite.USE_UNPIGZ=true enables parallel compression with ReadWrite.setZipThreads(Shared.threads()) for multithreaded file operations.

Input/Output Formats

Supported Input Formats

FASTQ: Standard and compressed (.gz, .bz2) formats
FASTA: Standard and compressed formats
SAM: Sequence Alignment/Map format (headers preserved when possible)
Interleaved: Paired reads in single interleaved file

Names File Formats

Plain text: One name per line
Comma-separated: Multiple names on command line
FASTA/FASTQ: Extracts sequence names from headers
SAM: Extracts read names from SAM records
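
How names might be pulled from each of these formats can be sketched in Python. names_from_lines and its fmt argument are hypothetical, and the sketch assumes four-line FASTQ records and tab-separated SAM records; the tool's own parsers are more general.

```python
def names_from_lines(lines, fmt):
    """Extract names from the lines of a names file.

    'fmt' is one of 'text', 'fasta', 'fastq', or 'sam' (labels chosen
    for this sketch only).
    """
    names = []
    for i, line in enumerate(lines):
        line = line.rstrip("\n")
        if fmt == "text" and line:
            names.append(line)
        elif fmt == "fasta" and line.startswith(">"):
            names.append(line[1:])  # header without the '>' symbol
        elif fmt == "fastq" and i % 4 == 0 and line.startswith("@"):
            names.append(line[1:])  # header without the '@' symbol
        elif fmt == "sam" and line and not line.startswith("@"):
            names.append(line.split("\t", 1)[0])  # QNAME is the first column
    return names
```

Note that the leading > or @ is dropped in every case, consistent with the rule that these symbols are not part of the sequence name.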

Output Options

Single-end: Filtered reads to single output file
Paired-end: Maintains read pairing in separate files
Interleaved: Outputs paired reads in interleaved format
Compression: Automatic format detection and configurable compression levels

Common Use Cases

Quality Control

Remove contaminating sequences by name
Filter out low-quality reads identified by upstream tools
Remove adapter or primer sequences with known names

Sequence Extraction

Extract specific sequences of interest from large datasets
Subset reads for detailed analysis or validation
Create training/test datasets from larger collections

Data Cleaning

Remove reads matching known contamination databases
Filter reads from specific sequencing runs or lanes
Exclude reads with problematic naming patterns

Tips and Troubleshooting

Performance Tips

Use exact matching when possible for best performance
For large names lists, prefer exact matching (a hash-based lookup) over substring or prefix matching, which must scan the names list for every read
Enable compression for large output files to save disk space
Increase memory allocation (-Xmx) for very large names lists

Common Issues

Names not matching: Check for extra whitespace, case differences, or header symbols (@, >)
Memory errors: Reduce batch size or increase -Xmx setting
Slow performance: Avoid substring matching with very large names lists
Paired read issues: Ensure both reads in a pair have the same prefix for proper matching

Support

For questions and support:

Email: bbushnell@lbl.gov
Documentation: bbmap.org