FilterSubs
Filters a sam file to select only reads with substitution errors for bases with quality scores in a certain interval. Used for manually examining specific reads that may have incorrectly calibrated quality scores.
Basic Usage
filtersubs.sh in=<file> out=<file> minq=<number> maxq=<number>
FilterSubs is a quality-control tool for aligned reads. It selects reads whose substitution errors fall on bases within a chosen quality score range, which helps when investigating suspected quality score miscalibration in sequencing data.
Parameters
The parameters below control which reads are kept: the quality score range for substituted bases, whether reads with indels or error-free reads are also retained, and limits on the number of clip operations per read.
Input/Output Parameters
- in=<file>
- Input sam or bam file. The input must be an aligned file containing match strings (either from BBMap or another aligner that produces detailed match information).
- out=<file>
- Output file. Filtered reads will be written to this file in the same format as the input (SAM or BAM).
Quality Score Filtering
- minq=0
- Keep only reads with substitutions of at least this quality. Substituted bases must have quality scores >= minq to contribute to the passing substitution count.
- maxq=99
- Keep only reads with substitutions of at most this quality. Substituted bases must have quality scores <= maxq to contribute to the passing substitution count.
Error Type Parameters
- countindels=t
- Also keep reads with indels in the quality range. When true, reads containing insertion or deletion errors will also pass the filter, in addition to reads with substitution errors.
- minsubs=1
- Require at least this many substitutions. Reads must have at least this number of total substitutions (regardless of quality) to be considered for filtering.
- keepperfect=f
- Also keep error-free reads. When true, reads with no substitutions or indels are also written to the output, even though they fail the substitution and indel criteria.
Clipping Parameters
- minclips=0
- Discard reads with fewer clip operations than this. The default of 0 imposes no lower limit.
- maxclips=-1
- If nonnegative, discard reads with more clip operations than this. The default of -1 means no upper limit.
Examples
Basic Quality Score Analysis
filtersubs.sh in=aligned.sam out=lowqual_errors.sam minq=10 maxq=20
Keep only reads with substitution errors on bases whose quality scores are between 10 and 20. Useful for checking whether quality scores in this specific range are miscalibrated.
High Quality Substitutions Only
filtersubs.sh in=mapped.bam out=highqual_subs.bam minq=30 maxq=40 minsubs=2
Find reads with at least 2 substitutions where the substituted bases have high quality scores (30-40). This can help identify systematic errors in high-confidence base calls.
Include Indels and Perfect Reads
filtersubs.sh in=input.sam out=comprehensive.sam minq=15 maxq=25 countindels=t keepperfect=t
Keeps reads with substitutions in the 15-25 quality range, reads with indels, and error-free reads. countindels=t is already the default and is shown here only for clarity.
Strict Clipping Filter
filtersubs.sh in=aligned.sam out=no_clips.sam minq=5 maxq=15 maxclips=0
Find reads with substitutions on low-quality bases (5-15) and discard any read that contains clip operations, so that only reads without clipped bases remain.
Algorithm Details
Match String Processing
FilterSubs processes aligned reads by parsing the match string to identify different types of sequence differences:
- Substitutions ('S'): Single base mismatches between read and reference
- Insertions ('I'): Extra bases in the read not present in reference
- Deletions ('D'): Bases missing from read that are present in reference
- Clipping ('C'): Soft-clipped bases at read ends
- Matches ('m'): Bases that align perfectly to reference
- N-regions ('N'): Alignment gaps due to splicing or unknown sequence
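As a rough illustration of how these symbols can be tallied, the following Python sketch counts each operation in a match string. It assumes the long-form encoding (one symbol per aligned position, for example mmmSmmIIm). The function name count_match_symbols is hypothetical and is not part of BBTools.

```python
from collections import Counter

# Symbols described in the list above; anything else is counted as "other".
KNOWN_SYMBOLS = set("mSIDCN")

def count_match_symbols(match: str) -> dict:
    """Tally each operation in a long-form match string (one symbol per position)."""
    counts = Counter(ch if ch in KNOWN_SYMBOLS else "other" for ch in match)
    # Return an entry for every known symbol, even when its count is zero.
    return {sym: counts.get(sym, 0) for sym in sorted(KNOWN_SYMBOLS) + ["other"]}

if __name__ == "__main__":
    # Hypothetical match string: 2 clipped bases, then matches, subs, and indels.
    print(count_match_symbols("CCmmmSmmIImmDDDmmS"))
```

Counting per position keeps the sketch simple. A run-length-encoded match string would need to be expanded first.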
Quality Score Analysis
The algorithm specifically examines quality scores of substituted bases to determine if they fall within the specified range (minq to maxq). This enables targeted analysis of potential quality calibration issues:
- Counts total substitutions in each read regardless of quality
- Separately counts "passing substitutions" with quality scores in the target range
- Applies minimum substitution threshold before quality filtering
- Optionally includes indels in the filtering decision
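The counting steps above can be sketched in Python as follows. This is a minimal illustration, not the tool's source code. It assumes a long-form match string, quality scores already decoded to integers, and that every symbol except 'D' consumes one read base. Indels are counted per base rather than per event, which is also an assumption. The name count_substitutions is hypothetical.

```python
def count_substitutions(match: str, quals: list, minq: int = 0, maxq: int = 99):
    """Return (total_subs, passing_subs, indels) for one read.

    Assumption: every match-string symbol except 'D' consumes one read base,
    and therefore one entry of quals.
    """
    total_subs = passing_subs = indels = 0
    read_pos = 0  # index of the current read base in quals
    for sym in match:
        if sym == "D":
            indels += 1  # deleted reference base: no read base is consumed
            continue
        if sym == "I":
            indels += 1
        elif sym == "S":
            total_subs += 1
            # A substitution "passes" when its base quality is in [minq, maxq].
            if minq <= quals[read_pos] <= maxq:
                passing_subs += 1
        read_pos += 1
    return total_subs, passing_subs, indels

if __name__ == "__main__":
    # Hypothetical read: 7 read bases, two substitutions (Q12 and Q35).
    quals = [30, 30, 12, 30, 30, 30, 35]
    print(count_substitutions("mmSmDmIS", quals, minq=10, maxq=20))
```

Tracking the read position separately from the match-string position matters because deletions appear in the match string but have no quality score.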
Filtering Logic
Reads are retained if they meet any of these conditions:
- Substitution criteria: At least minsubs total substitutions AND at least one substitution with quality in the range [minq, maxq]
- Indel criteria: Contains indels (if countindels=true)
- Perfect read criteria: No substitutions or indels (if keepperfect=true)
Additionally, reads are rejected if their number of clip operations is below minclips or, when maxclips is nonnegative, above maxclips.
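The decision described in this section can be written as a small Python predicate. This is a sketch based only on the rules listed here. The function name keep_read is hypothetical, and applying the clipping limits before the other criteria is an assumption, since the order is not documented.

```python
def keep_read(total_subs, passing_subs, indels, clips,
              minsubs=1, countindels=True, keepperfect=False,
              minclips=0, maxclips=-1):
    """Return True if a read would be retained under the documented criteria."""
    # Clipping limits (assumed to be checked first).
    if clips < minclips:
        return False
    if maxclips >= 0 and clips > maxclips:
        return False
    # Substitution criteria: enough substitutions, at least one in [minq, maxq].
    if total_subs >= minsubs and passing_subs >= 1:
        return True
    # Indel criteria.
    if countindels and indels > 0:
        return True
    # Perfect read criteria.
    if keepperfect and total_subs == 0 and indels == 0:
        return True
    return False

if __name__ == "__main__":
    print(keep_read(total_subs=2, passing_subs=1, indels=0, clips=0))
    print(keep_read(total_subs=2, passing_subs=1, indels=0, clips=3, maxclips=0))
```

The counts passed in are per read, such as those produced by walking the match string and quality scores.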
Memory Efficiency
The tool uses streaming I/O processing with concurrent read input/output streams, making it memory-efficient even for large SAM/BAM files. Default memory allocation is only 120MB (-Xmx120m), as the algorithm processes reads individually without loading entire datasets into memory.
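To show why a streaming design needs little memory, here is a minimal Python sketch that filters SAM text one line at a time. It is not how FilterSubs is implemented internally. The function stream_filter and the placeholder predicate are hypothetical, and the sketch handles SAM text only, not BAM.

```python
import io
from typing import Callable, Iterable, Iterator

def stream_filter(lines: Iterable[str],
                  keep: Callable[[list], bool]) -> Iterator[str]:
    """Yield header lines unchanged and alignment lines for which keep() is true."""
    for line in lines:
        if line.startswith("@"):
            yield line  # SAM header lines pass through untouched
            continue
        fields = line.rstrip("\n").split("\t")
        if keep(fields):
            yield line  # only one record is held in memory at a time

if __name__ == "__main__":
    sam = io.StringIO(
        "@HD\tVN:1.6\n"
        "r1\t0\tchr1\t100\t40\t4M\t*\t0\t0\tACGT\tIIII\n"
        "r2\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII\n"
    )
    # Placeholder predicate: keep mapped reads only (FLAG bit 0x4 unset).
    for out_line in stream_filter(sam, lambda f: not (int(f[1]) & 0x4)):
        print(out_line, end="")
```

Because the generator never accumulates records, memory use stays flat regardless of input size.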
Performance Characteristics
Processing speed depends primarily on:
- File I/O speed (reading SAM/BAM and writing filtered output)
- Complexity of match strings (longer reads with more alignment details take longer to process)
- Fraction of reads meeting filtering criteria (affects output I/O time)
The algorithm scales linearly with input file size and number of aligned reads.
Use Cases
Quality Control Analysis
FilterSubs is particularly valuable for:
- Quality score calibration assessment: Identifying systematic errors in specific quality ranges
- Sequencer troubleshooting: Finding reads with unexpected error patterns
- Method comparison: Analyzing differences between alignment or base-calling approaches
- Training data preparation: Creating datasets for quality recalibration or error correction training
Integration with Other Tools
FilterSubs output can be used with:
- Visualization tools: IGV, Tablet, or other SAM viewers for manual inspection
- Statistics tools: samtools stats or similar alignment-statistics tools for quantitative analysis of filtered reads
- Quality recalibration: GATK BaseRecalibrator or similar tools
- Error correction: Tadpole or other BBTools error correction utilities
Technical Notes
Input Requirements
- Input must be SAM or BAM format with detailed match strings
- Reads must be aligned (unaligned reads will not be processed)
- Quality scores must be present in the input file
- BBMap alignment is recommended for optimal match string detail
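Before running the tool, the requirements above can be spot-checked on SAM text. The Python sketch below uses only fields defined by the SAM format (FLAG bit 0x4 for unmapped reads, "*" for a missing CIGAR or QUAL). The function check_record is hypothetical, and which of these conditions FilterSubs itself tests is not stated here.

```python
def check_record(fields: list) -> str:
    """Return 'ok' or a short reason why a SAM alignment record looks unusable."""
    if len(fields) < 11:
        return "fewer than 11 mandatory fields"
    if int(fields[1]) & 0x4:
        return "unmapped (FLAG 0x4 set)"
    if fields[5] == "*":
        return "no CIGAR"
    if fields[10] == "*":
        return "no quality scores"
    return "ok"

if __name__ == "__main__":
    record = "r1\t0\tchr1\t100\t40\t4M\t*\t0\t0\tACGT\t*"
    print(check_record(record.split("\t")))
```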
Output Format
- Output format matches input format (SAM or BAM)
- All read information is preserved in filtered output
- Reads are written in streaming fashion (no sorting applied)
- Paired reads are handled correctly when present
Limitations
- Requires aligned input with match strings
- Cannot process FASTQ files directly
- Match string format must be compatible with BBTools conventions
- Quality-based filtering only applies to substitutions, not indels
Support
For questions and support:
- Email: bbushnell@lbl.gov
- Documentation: bbmap.org