FilterSam

Script: filtersam.sh Package: var2 Class: FilterSam.java

Filters a SAM file to remove reads with variations unsupported by other reads (bad vars, aka bad subs). Designed for improving data quality by filtering out reads containing likely sequencing errors or artifacts that are not supported by consensus evidence from other mapped reads.

Basic Usage

filtersam.sh in=<file> out=<file> vcf=<file>

FilterSam requires an input SAM/BAM file and a variant file (VCF format or CallVariants native format). For particularly bad data, it may be advisable to iteratively re-call variants and re-run FilterSam.

Parameters

Parameters control variant filtering thresholds, file handling, and processing options. The tool filters reads based on the number and support level of variants they contain.

Input/Output Parameters

in=<file>
Input SAM or BAM file containing aligned reads to be filtered.
ref=<file>
Optional fasta reference file. If not provided, scaffold information will be loaded from the SAM header.
out=<file>
Output file for good reads (reads that pass the variant filter).
outb=<file>
Output file for bad reads (reads that fail the variant filter).
vcf=<file>
VCF file of variants called from these reads. Used to determine which variants are supported by sufficient evidence.
vars=<file>
Alternatively, variants can be provided in CallVariants' native output format instead of VCF.

Variant Filtering Parameters

mbv=2
(maxbadvars) Discard reads with more bad variants than this threshold. Bad variants are those with insufficient support based on depth and allele fraction criteria.
mbad=2
(maxbadalleledepth) A variant is considered bad if the allele depth is at most this value. Works in conjunction with mbaf - the more stringent threshold applies.
mbaf=0.01
(maxbadallelefraction) A variant is considered bad if the allele fraction is at most this value. The more stringent of mbad or mbaf is used, so in low depth regions mbad is dominant while in high depth regions mbaf is more important. Variants are considered bad if they fail either threshold (meaning ad<=mbad or af<=mbaf).
mbrd=2
(minbadreaddepth) Substitutions may only be considered bad if the total read depth spanning the variant position is at least this much. Prevents filtering based on variants in very low-coverage regions.
border=5
(minenddist) Ignore variants within this distance of read ends. Read ends are more prone to sequencing errors and should be excluded from variant analysis.

Variant Type Selection

sub=t
Consider bad substitutions when filtering reads. Default is true.
ins=f
Consider bad insertions when filtering reads. Default is false.
del=f
Consider bad deletions when filtering reads. Default is false.

Processing Options

verbose=f
Print verbose status messages during processing.
ordered=t
Output reads in the same order as input. Set to false for faster processing if order doesn't matter.
ss=t
(streamer) Use SAM streamer for input processing. Generally provides better performance.
ploidy=1
Expected ploidy for variant calling if variants are called internally.
prefilter=f
Apply prefiltering during variant calling to improve performance.

Advanced Filtering Options

maxsubs=-1
(subfilter) Maximum allowed substitutions in a read. Reads exceeding this threshold are discarded regardless of variant support. -1 means no limit.
maxvars=-1
(varfilter) Maximum allowed total variations in a read. Reads exceeding this threshold are discarded regardless of variant support. -1 means no limit.

Examples

Basic Variant-Based Filtering

# Filter reads using an existing VCF file
filtersam.sh in=mapped.sam out=clean.sam vcf=variants.vcf

# Separate good and bad reads into different files
filtersam.sh in=mapped.bam out=good.bam outb=bad.bam vcf=variants.vcf

Basic usage with VCF file containing known variants. Reads with too many unsupported variants are filtered out.

Stringent Filtering Parameters

# More stringent filtering for high-quality data
filtersam.sh in=reads.sam out=filtered.sam vcf=vars.vcf \
    mbv=1 mbad=3 mbaf=0.05 mbrd=5

# Include insertions and deletions in filtering
filtersam.sh in=reads.sam out=filtered.sam vcf=vars.vcf \
    ins=t del=t border=10

Adjust filtering parameters for more stringent quality control, including indels.

Using CallVariants Native Format

# Use CallVariants output format instead of VCF
filtersam.sh in=mapped.sam out=clean.sam vars=variants.txt ref=reference.fa

Alternative using CallVariants' native output format with reference file.

Iterative Improvement Workflow

# Step 1: Call variants from original data
callvariants.sh in=mapped.sam ref=ref.fa out=vars1.vcf clearfilters minreads=2

# Step 2: Filter reads based on variants
filtersam.sh in=mapped.sam out=filtered1.sam vcf=vars1.vcf

# Step 3: Re-call variants from filtered data
callvariants.sh in=filtered1.sam ref=ref.fa out=vars2.vcf clearfilters minreads=2

# Step 4: Filter again with updated variants
filtersam.sh in=filtered1.sam out=final.sam vcf=vars2.vcf

Iterative approach for particularly noisy data, progressively improving variant calls and read quality.

Algorithm Details

Variant Support Analysis

FilterSam uses a dual-threshold approach to determine variant support quality. Each variant is evaluated against both allele depth (absolute count) and allele fraction (relative frequency) thresholds:

Read Filtering Process

The filtering algorithm processes each mapped read through multiple stages:

  1. Variant Counting: Count total variants in the read (substitutions, insertions, deletions based on settings)
  2. Quick Pass: Reads with zero variants or variants ≤ maxbadvars automatically pass
  3. Support Analysis: For reads exceeding maxbadvars, analyze each variant for support level using the dual-threshold approach
  4. End Distance Filtering: Ignore variants within 'border' bases of read ends to avoid end-effect artifacts
  5. Final Decision: Discard reads if unsupported variant count > maxbadvars threshold

Multi-threaded Processing

FilterSam uses parallel processing with the following architecture:

Memory and Performance

The tool is designed for efficient processing of large SAM/BAM files:

Quality Metrics and Reporting

FilterSam provides comprehensive statistics including:

Support

For questions and support: