FilterByName

Script: filterbyname.sh | Package: driver | Class: FilterReadsByName.java

Filters reads by name using exact matching, substring matching, or prefix matching. Supports both inclusion and exclusion modes, paired reads, and positional trimming of output sequences.

Basic Usage

filterbyname.sh in=<file> in2=<file2> out=<outfile> out2=<outfile2> names=<string,string,string> include=<t/f>

Input files can be fasta, fastq, or sam format (optionally gzipped). Second input and output files are for paired reads and are optional. If input is paired and there is only one output file, it will be written interleaved.

Important: Leading > and @ symbols are NOT part of sequence names; they are part of the fasta, fastq, and sam specifications. Therefore, use names=e.coli_K12 (correct) rather than names=>e.coli_K12 or names=@e.coli_K12 (incorrect).

Parameters

Parameters are organized by their function in the filtering process. The tool supports multiple matching modes and output options for flexible read selection.

Core Parameters

include=f
Set to true to output only the reads whose names match the list, rather than excluding them. When false (default), reads matching the names list are excluded from output.
names=
A list of names to match, given as comma-separated strings, files with one name per line, or standard read files (fasta, fastq, or sam). Multiple files and strings can be combined.
minlen=0
Do not output reads shorter than this length. Applied after any positional trimming.
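
The way include and minlen combine can be summarized in a few lines. The following Python sketch is illustrative only and is not the tool's Java code; the function name should_output and its arguments are hypothetical.

```python
def should_output(matched, length, include=False, minlen=0):
    """Decide whether a read is written, per the documented include/minlen behavior.

    Hypothetical helper: 'matched' says whether the read's name matched the
    names list, 'length' is the read length after any positional trimming.
    """
    # include=t keeps matching reads; include=f (default) keeps non-matching reads.
    keep = matched if include else not matched
    # Reads shorter than minlen are not written in either mode.
    return keep and length >= minlen
```

With the defaults, a matching read is dropped and a non-matching read is kept; include=t inverts that choice, and minlen is checked last in both cases.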

Matching Parameters

substring=f
Allow one name to be a substring of the other, rather than requiring a full match.
  • f: No substring matching (exact match required)
  • t: Bidirectional substring matching
  • header: Allow input read headers to be substrings of names in list
  • name: Allow names in list to be substrings of input read headers
prefix=f
Allow names to match read header prefixes. When true, read headers starting with any name in the list will match.
case=t
(casesensitive) Controls whether matching is case-sensitive. When true (default), names must match in case as well. Set to false for case-insensitive matching.
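
The matching modes above can be modeled with ordinary string operations. This Python sketch follows the parameter descriptions only; it is not the FilterReadsByName implementation, the function name name_matches is hypothetical, and it assumes that prefix and substring modes are additive (either one may produce a match).

```python
def name_matches(header, names, substring="f", prefix=False, case=True):
    """Return True if a read header matches any entry in names.

    Illustrative model of the documented modes, not the BBTools code.
    substring is one of "f", "t", "header", "name".
    Assumption: prefix and substring modes are additive.
    """
    if not case:
        # Case-insensitive mode: compare lowercase forms on both sides.
        header = header.lower()
        names = {n.lower() for n in names}
    if header in names:
        return True  # an exact match counts in every mode
    for name in names:
        if prefix and header.startswith(name):
            return True  # read header begins with a listed name
        if substring in ("t", "name") and name in header:
            return True  # listed name occurs inside the read header
        if substring in ("t", "header") and header in name:
            return True  # read header occurs inside a listed name
    return False
```

For example, name_matches("read1 extra", {"read1"}) is False under exact matching, but becomes True with substring="name" or prefix=True.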

Name Processing Parameters

ths=f
(truncateheadersymbol) Ignore a leading @ or > symbol in the names file. Useful when the names file contains fasta/fastq headers.
tws=f
(truncatewhitespace) Ignore leading or trailing whitespace in the names file. Useful for names files with formatting variations.
truncate=f
Set both ths and tws at the same time. Convenient shorthand for cleaning up names from various file formats.
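
As a rough illustration of what ths and tws do to each entry of a names file, consider the Python sketch below. The helper clean_name is hypothetical, and the order of operations (whitespace first, then the header symbol) is an assumption rather than something the parameter descriptions state.

```python
def clean_name(raw, ths=False, tws=False):
    """Normalize one line from a names file.

    Hypothetical helper modeling the documented ths/tws flags.
    Assumption: whitespace is trimmed before the header symbol is removed.
    """
    name = raw.rstrip("\r\n")  # drop the line terminator only
    if tws:
        name = name.strip()  # ignore leading and trailing whitespace
    if ths and name[:1] in (">", "@"):
        name = name[1:]  # ignore a single leading > or @
    return name
```

Setting both flags, as truncate=t does, turns a line such as "  @read1  " into "read1".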

Positional Parameters

These parameters restrict output to a portion of each matching sequence. Coordinates are zero-based and inclusive. They are intended for extracting a single sequence with include=t.

from=-1
Only print bases starting at this position. Zero-based coordinate system. -1 means start from beginning.
to=-1
Only print bases up to this position. Zero-based, inclusive coordinate system. -1 means continue to end.
range=
Set from and to with a single flag using format "start-end". Example: range=100-500 sets from=100, to=500.
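
Because the coordinates are zero-based and inclusive, it is easy to be off by one. The Python sketch below shows the arithmetic implied by the descriptions above; parse_range and trim are hypothetical names, not functions of the tool.

```python
def parse_range(value):
    """Split a 'start-end' string into (from, to) integers."""
    start, end = value.split("-", 1)
    return int(start), int(end)


def trim(seq, from_pos=-1, to_pos=-1):
    """Return the part of seq between from_pos and to_pos, zero-based and inclusive.

    -1 for either bound means "no limit" on that side.
    """
    start = 0 if from_pos < 0 else from_pos
    # 'to' is inclusive, so the Python slice end is to_pos + 1.
    end = len(seq) if to_pos < 0 else to_pos + 1
    return seq[start:end]
```

Under this reading, range=100-500 applied to a sequence of at least 501 bases yields 401 bases, not 400.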

File I/O Parameters

ow=t
(overwrite) Overwrites files that already exist. Set to false to prevent accidental overwriting of existing output files.
app=f
(append) Append to files that already exist rather than overwriting them. Cannot be used with ow=t.
zl=4
(ziplevel) Set compression level for gzipped output files, from 1 (fastest, lowest compression) to 9 (slowest, highest compression).
int=f
(interleaved) Determines whether INPUT file is considered interleaved. When true, paired reads are expected in a single interleaved file.

Java Parameters

-Xmx
This will set Java's memory usage, overriding autodetection. -Xmx20g will specify 20 gigs of RAM, and -Xmx200m will specify 200 megs. The max is typically 85% of physical memory. Default: 800m.
-eoom
This flag will cause the process to exit if an out-of-memory exception occurs. Requires Java 8u92+.
-da
Disable assertions for better performance in production environments.

Examples

Basic Filtering - Exclude Specific Reads

filterbyname.sh in=reads.fq out=filtered.fq names=unwanted1,unwanted2,unwanted3

Writes all reads from the input file except those whose names exactly match "unwanted1", "unwanted2", or "unwanted3".

Include Only Specific Reads

filterbyname.sh in=reads.fq out=selected.fq names=target1,target2 include=t

Outputs only reads with names exactly matching "target1" or "target2".

Paired Read Filtering

filterbyname.sh in1=reads_R1.fq in2=reads_R2.fq out1=clean_R1.fq out2=clean_R2.fq names=contaminants.txt

Filters paired reads, excluding any pair in which either read matches a name from the contaminants.txt file.
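
The pair-level rule described here can be sketched in Python as follows. This is a simplified model with exact matching only; the function filter_pairs and the tuple layout are hypothetical, and applying the same "either read" rule in include mode is an assumption.

```python
def filter_pairs(pairs, names, include=False):
    """Yield (read1, read2) pairs to write, treating each pair as a unit.

    pairs: iterable of ((name1, seq1), (name2, seq2)) tuples (hypothetical layout).
    names: set of names to match exactly.
    Assumption: include mode uses the same "either mate matches" rule.
    """
    for r1, r2 in pairs:
        # A pair counts as matching when either mate's name is in the list.
        matched = r1[0] in names or r2[0] in names
        if matched == include:
            yield r1, r2
```

Keeping or dropping both mates together preserves the pairing between the two output files.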

Substring Matching

filterbyname.sh in=reads.fq out=filtered.fq names=contamination substring=name

Excludes reads whose headers contain "contamination" anywhere as a substring.

Prefix Matching

filterbyname.sh in=reads.fq out=filtered.fq names=HWI-,HWUSI- prefix=t

Excludes reads whose names start with "HWI-" or "HWUSI-".

Extract and Trim Specific Sequence

filterbyname.sh in=genome.fa out=gene.fa names=gene_of_interest include=t from=100 to=500

Extracts the sequence named "gene_of_interest" and outputs only bases 100 through 500 (zero-based, inclusive), which is up to 401 bases.

Case-Insensitive Filtering with Name Cleanup

filterbyname.sh in=reads.fq out=filtered.fq names=contamlist.txt case=f truncate=t

Performs case-insensitive filtering while cleaning up names (removing header symbols and whitespace) from the contamlist.txt file.

Algorithm Details

Matching Strategy

FilterReadsByName implements multiple matching strategies using LinkedHashSet for name storage and String methods for pattern matching.

Header Processing

The tool implements header processing through IlluminaHeaderParser2 and ByteBuilder for coordinate extraction and name normalization.

Memory Management

The implementation uses specific data structures and concurrent processing for memory efficiency:

Performance Characteristics

Input/Output Formats

Supported Input Formats

Names File Formats

Output Options

Common Use Cases

Quality Control

Sequence Extraction

Data Cleaning

Tips and Troubleshooting

Performance Tips

Common Issues

Support

For questions and support: