FilterSubs

Script: filtersubs.sh
Package: jgi
Class: FilterReadsWithSubs.java

Filters a SAM file to select only reads with substitution errors at bases whose quality scores fall within a given interval. Intended for manually examining reads that may have incorrectly calibrated quality scores.

Basic Usage

filtersubs.sh in=<file> out=<file> minq=<number> maxq=<number>

FilterSubs is designed for quality control analysis of aligned reads, particularly for identifying reads with substitution errors within specific quality score ranges. This tool is especially useful for examining potential quality score calibration issues in sequencing data.

Parameters

The parameters below control which reads are selected, based on substitution errors within a specified quality score range, indels, and clipping operations.

Input/Output Parameters

in=<file>
Input SAM or BAM file. The input must be aligned and contain match strings (either from BBMap or from another aligner that produces detailed match information).
out=<file>
Output file. Filtered reads will be written to this file in the same format as the input (SAM or BAM).

Quality Score Filtering

minq=0
Keep only reads with substitutions of at least this quality. Substituted bases must have quality scores >= minq to contribute to the passing substitution count.
maxq=99
Keep only reads with substitutions of at most this quality. Substituted bases must have quality scores <= maxq to contribute to the passing substitution count.

Error Type Parameters

countindels=t
Also keep reads with indels in the quality range. When true, reads containing insertion or deletion errors will also pass the filter, in addition to reads with substitution errors.
minsubs=1
Require at least this many substitutions. Reads must have at least this number of total substitutions (regardless of quality) to be considered for filtering.
keepperfect=f
Also keep error-free reads. When true, reads with no substitutions or indels will be retained in the output, regardless of other filtering criteria.

Clipping Parameters

minclips=0
Discard reads with fewer clip operations than this.
maxclips=-1
If non-negative, discard reads with more clip operations than this. The default of -1 means no upper limit.
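The clip bounds described above can be sketched as a small predicate. This is a minimal illustration rather than the tool's actual code; the function name passes_clip_filter is hypothetical, and only the parameter names and defaults come from the list above.

```python
def passes_clip_filter(clips: int, minclips: int = 0, maxclips: int = -1) -> bool:
    """Return True if the clip count lies inside the allowed range.

    Hypothetical helper mirroring the documented minclips/maxclips semantics.
    """
    if clips < minclips:
        return False
    # A negative maxclips disables the upper bound entirely.
    if maxclips >= 0 and clips > maxclips:
        return False
    return True
```

With the defaults (minclips=0, maxclips=-1) every read passes, which matches the documented behavior of having no clipping restriction unless one is requested.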

Examples

Basic Quality Score Analysis

filtersubs.sh in=aligned.sam out=lowqual_errors.sam minq=10 maxq=20

Filter reads to find only those with substitution errors at bases having quality scores between 10 and 20. Useful for identifying potential quality calibration issues in this specific quality range.

High Quality Substitutions Only

filtersubs.sh in=mapped.bam out=highqual_subs.bam minq=30 maxq=40 minsubs=2

Find reads with at least 2 substitutions where the substituted bases have high quality scores (30-40). This can help identify systematic errors in high-confidence base calls.

Include Indels and Perfect Reads

filtersubs.sh in=input.sam out=comprehensive.sam minq=15 maxq=25 countindels=t keepperfect=t

Comprehensive filtering that includes reads with substitutions in the 15-25 quality range, reads with indels, and perfect reads with no errors.

Strict Clipping Filter

filtersubs.sh in=aligned.sam out=no_clips.sam minq=5 maxq=15 maxclips=0

Find reads with substitutions in low quality bases (5-15) but exclude any reads with clipping operations. Useful for analyzing unambiguous alignment regions.

Algorithm Details

Match String Processing

FilterSubs processes aligned reads by parsing the match string to identify the different types of sequence differences: substitutions, insertions, deletions, and clipping operations.
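As a rough illustration of this step, the sketch below tallies event types in a match string. It assumes a BBMap-style long match string with one symbol per position (m = match, S = substitution, I = insertion, D = deletion, C = clipped base); that alphabet and the helper name count_events are assumptions, not taken from this page. It also counts clipped bases individually, whereas the tool may count clip operations differently.

```python
from collections import Counter


def count_events(match: str) -> dict:
    """Tally per-position event symbols in a match string.

    Assumes one symbol per position: m, S, I, D, C (an assumed alphabet).
    """
    counts = Counter(match)
    return {
        "matches": counts["m"],
        "subs": counts["S"],
        "insertions": counts["I"],
        "deletions": counts["D"],
        "clips": counts["C"],  # counted per base here, not per operation
    }
```

For example, count_events("CCmmmSmmDmmImm") reports two clipped bases and one each of substitution, deletion, and insertion.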

Quality Score Analysis

The algorithm examines the quality scores of substituted bases to determine whether they fall within the specified range (minq to maxq, inclusive). This enables targeted analysis of potential quality calibration issues.
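One way to picture the quality check is to walk the match string and the quality string together. The sketch below is illustrative only: it assumes the same per-position match symbols as BBMap's long format, Phred+33 quality encoding, and that deletions (D) consume no read base; the function names are hypothetical.

```python
def substitution_qualities(match: str, quals: str, offset: int = 33) -> list:
    """Return the Phred quality of every substituted base.

    Assumes one match symbol per position and Phred+33 encoded qualities.
    """
    result = []
    qpos = 0  # index into the read's quality string
    for symbol in match:
        if symbol == "D":
            continue  # deleted bases are absent from the read
        if symbol == "S":
            result.append(ord(quals[qpos]) - offset)
        qpos += 1
    return result


def has_sub_in_range(match: str, quals: str, minq: int = 0, maxq: int = 99) -> bool:
    """True if any substituted base has quality in [minq, maxq]."""
    return any(minq <= q <= maxq for q in substitution_qualities(match, quals))
```

Both bounds are treated as inclusive, consistent with the >= minq and <= maxq wording in the parameter descriptions.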

Filtering Logic

Reads are retained if they meet any of these conditions:

  1. Substitution criteria: At least minsubs total substitutions AND at least one substitution with quality in [minq, maxq] range
  2. Indel criteria: Contains indels (if countindels=true)
  3. Perfect read criteria: No substitutions or indels (if keepperfect=true)

Additionally, reads are rejected if clipping operations fall outside the [minclips, maxclips] range.
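The decision rules above can be written out as a single predicate. This is a sketch of the logic as described on this page, not the tool's source: the ReadStats container and keep_read function are hypothetical names, and the indel rule follows the list above (any indel qualifies) without applying a quality range to indels.

```python
from dataclasses import dataclass


@dataclass
class ReadStats:
    subs: int           # total substitutions in the read
    subs_in_range: int  # substitutions with quality in [minq, maxq]
    indels: int         # insertions plus deletions
    clips: int          # clip operations


def keep_read(s: ReadStats, minsubs: int = 1, countindels: bool = True,
              keepperfect: bool = False, minclips: int = 0,
              maxclips: int = -1) -> bool:
    """Apply the documented retention rules to one read's statistics."""
    # Clipping limits reject a read regardless of the other criteria.
    if s.clips < minclips or (maxclips >= 0 and s.clips > maxclips):
        return False
    # 1. Enough substitutions overall, and at least one in the quality range.
    if s.subs >= minsubs and s.subs_in_range > 0:
        return True
    # 2. Reads with indels, when countindels is enabled.
    if countindels and s.indels > 0:
        return True
    # 3. Error-free reads, when keepperfect is enabled.
    if keepperfect and s.subs == 0 and s.indels == 0:
        return True
    return False
```

Checking the clip limits first keeps the three retention rules independent of one another, so any one of them is sufficient once the clip test has passed.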

Memory Efficiency

The tool uses streaming I/O processing with concurrent read input/output streams, making it memory-efficient even for large SAM/BAM files. Default memory allocation is only 120MB (-Xmx120m), as the algorithm processes reads individually without loading entire datasets into memory.
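The streaming idea can be illustrated with a generator that handles one SAM line at a time, so memory use does not grow with file size. This sketch is not how the tool itself is implemented (it omits the concurrent streams); filter_sam_lines and the keep callback are hypothetical, and passing header lines through unchanged is an assumption.

```python
import io
from typing import Callable, Iterable, Iterator


def filter_sam_lines(lines: Iterable[str],
                     keep: Callable[[str], bool]) -> Iterator[str]:
    """Lazily yield header lines plus the records accepted by keep()."""
    for line in lines:
        # Header lines start with '@'; assumed to be passed through as-is.
        if line.startswith("@") or keep(line):
            yield line


# Usage with an in-memory stream; a real file handle works the same way.
sam_text = (
    "@HD\tVN:1.6\n"
    "r1\t0\tchr1\t1\t60\t4M\t*\t0\t0\tACGT\tIIII\tNM:i:0\n"
    "r2\t0\tchr1\t9\t60\t4M\t*\t0\t0\tACGA\tIIII\tNM:i:1\n"
)
kept = list(filter_sam_lines(io.StringIO(sam_text),
                             lambda rec: "NM:i:0" not in rec))
```

The placeholder predicate here keeps records whose NM tag is non-zero; in practice it would be replaced by whatever per-read test is of interest.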

Performance Characteristics

Processing time scales linearly with the input file size and the number of aligned reads.

Use Cases

Quality Control Analysis

FilterSubs is particularly valuable for:

Integration with Other Tools

FilterSubs output can be used with:

Technical Notes

Input Requirements

Output Format

Limitations

Support

For questions and support: