FilterSam
Filters a SAM file to remove reads with variations unsupported by other reads (bad vars, aka bad subs). Designed for improving data quality by filtering out reads containing likely sequencing errors or artifacts that are not supported by consensus evidence from other mapped reads.
Basic Usage
filtersam.sh in=<file> out=<file> vcf=<file>
FilterSam requires an input SAM/BAM file and a variant file (VCF format or CallVariants native format). For particularly bad data, it may be advisable to iteratively re-call variants and re-run FilterSam.
Parameters
Parameters control variant filtering thresholds, file handling, and processing options. The tool filters reads based on the number and support level of variants they contain.
Input/Output Parameters
- in=<file>
- Input SAM or BAM file containing aligned reads to be filtered.
- ref=<file>
- Optional fasta reference file. If not provided, scaffold information will be loaded from the SAM header.
- out=<file>
- Output file for good reads (reads that pass the variant filter).
- outb=<file>
- Output file for bad reads (reads that fail the variant filter).
- vcf=<file>
- VCF file of variants called from these reads. Used to determine which variants are supported by sufficient evidence.
- vars=<file>
- Alternatively, variants can be provided in CallVariants' native output format instead of VCF.
Variant Filtering Parameters
- mbv=2
- (maxbadvars) Discard reads with more bad variants than this threshold. Bad variants are those with insufficient support based on depth and allele fraction criteria.
- mbad=2
- (maxbadalleledepth) A variant is considered bad if the allele depth is at most this value. Works in conjunction with mbaf - the more stringent threshold applies.
- mbaf=0.01
- (maxbadallelefraction) A variant is considered bad if the allele fraction is at most this value. The more stringent of mbad or mbaf is used, so in low depth regions mbad is dominant while in high depth regions mbaf is more important. Variants are considered bad if they fail either threshold (meaning ad<=mbad or af<=mbaf).
- mbrd=2
- (minbadreaddepth) Substitutions may only be considered bad if the total read depth spanning the variant position is at least this much. Prevents filtering based on variants in very low-coverage regions.
- border=5
- (minenddist) Ignore variants within this distance of read ends. Read ends are more prone to sequencing errors and should be excluded from variant analysis.
Variant Type Selection
- sub=t
- Consider bad substitutions when filtering reads. Default is true.
- ins=f
- Consider bad insertions when filtering reads. Default is false.
- del=f
- Consider bad deletions when filtering reads. Default is false.
Processing Options
- verbose=f
- Print verbose status messages during processing.
- ordered=t
- Output reads in the same order as input. Set to false for faster processing if order doesn't matter.
- ss=t
- (streamer) Use SAM streamer for input processing. Generally provides better performance.
- ploidy=1
- Expected ploidy for variant calling if variants are called internally.
- prefilter=f
- Apply prefiltering during variant calling to improve performance.
Advanced Filtering Options
- maxsubs=-1
- (subfilter) Maximum allowed substitutions in a read. Reads exceeding this threshold are discarded regardless of variant support. -1 means no limit.
- maxvars=-1
- (varfilter) Maximum allowed total variations in a read. Reads exceeding this threshold are discarded regardless of variant support. -1 means no limit.
Examples
Basic Variant-Based Filtering
# Filter reads using an existing VCF file
filtersam.sh in=mapped.sam out=clean.sam vcf=variants.vcf
# Separate good and bad reads into different files
filtersam.sh in=mapped.bam out=good.bam outb=bad.bam vcf=variants.vcf
Basic usage with VCF file containing known variants. Reads with too many unsupported variants are filtered out.
Stringent Filtering Parameters
# More stringent filtering for high-quality data
filtersam.sh in=reads.sam out=filtered.sam vcf=vars.vcf \
mbv=1 mbad=3 mbaf=0.05 mbrd=5
# Include insertions and deletions in filtering
filtersam.sh in=reads.sam out=filtered.sam vcf=vars.vcf \
ins=t del=t border=10
Adjust filtering parameters for more stringent quality control, including indels.
Using CallVariants Native Format
# Use CallVariants output format instead of VCF
filtersam.sh in=mapped.sam out=clean.sam vars=variants.txt ref=reference.fa
Alternative using CallVariants' native output format with reference file.
Iterative Improvement Workflow
# Step 1: Call variants from original data
callvariants.sh in=mapped.sam ref=ref.fa out=vars1.vcf clearfilters minreads=2
# Step 2: Filter reads based on variants
filtersam.sh in=mapped.sam out=filtered1.sam vcf=vars1.vcf
# Step 3: Re-call variants from filtered data
callvariants.sh in=filtered1.sam ref=ref.fa out=vars2.vcf clearfilters minreads=2
# Step 4: Filter again with updated variants
filtersam.sh in=filtered1.sam out=final.sam vcf=vars2.vcf
Iterative approach for particularly noisy data, progressively improving variant calls and read quality.
Algorithm Details
Variant Support Analysis
FilterSam uses a dual-threshold approach to determine variant support quality. Each variant is evaluated against both allele depth (absolute count) and allele fraction (relative frequency) thresholds:
- Allele Depth Threshold: Variants with allele depth ≤ mbad are considered poorly supported
- Allele Fraction Threshold: Variants with allele fraction ≤ mbaf are considered poorly supported
- Combined Assessment: A variant fails if it meets either threshold (logical OR), making the system adaptive to both low and high-coverage regions
Read Filtering Process
The filtering algorithm processes each mapped read through multiple stages:
- Variant Counting: Count total variants in the read (substitutions, insertions, deletions based on settings)
- Quick Pass: Reads with zero variants or variants ≤ maxbadvars automatically pass
- Support Analysis: For reads exceeding maxbadvars, analyze each variant for support level using the dual-threshold approach
- End Distance Filtering: Ignore variants within 'border' bases of read ends to avoid end-effect artifacts
- Final Decision: Discard reads if unsupported variant count > maxbadvars threshold
Multi-threaded Processing
FilterSam uses parallel processing with the following architecture:
- Input Streaming: Concurrent read input using either ConcurrentReadInputStream or SamReadStreamer
- Worker Threads: Multiple ProcessThread instances handle read filtering independently
- Output Buffering: Separate output streams for good and bad reads with configurable buffer sizes
- Statistics Tracking: Per-thread counters for reads, bases, quality scores, mapping quality, and variant counts
Memory and Performance
The tool is designed for efficient processing of large SAM/BAM files:
- Default Memory: 8GB heap allocation (Xmx8g) with automatic adjustment based on available RAM
- Streaming Architecture: Processes reads in batches to minimize memory footprint
- Variant Map Loading: Variants are loaded into memory once and shared across all threads
- Scaffold Optimization: Reference scaffold information loaded from SAM header or reference file
Quality Metrics and Reporting
FilterSam provides comprehensive statistics including:
- Read and base processing rates (reads/sec, bases/sec)
- Retention vs. discard percentages
- Average quality scores for retained vs. discarded reads
- Average mapping quality comparison
- Average variant counts for filtered categories
Support
For questions and support:
- Email: bbushnell@lbl.gov
- Documentation: bbmap.org