FilterSam

Basic Usage

filtersam.sh in=<file> out=<file> vcf=<file>

FilterSam requires an input SAM/BAM file and a variant file (VCF format or CallVariants native format). For particularly bad data, it may be advisable to iteratively re-call variants and re-run FilterSam.

Parameters

Parameters control variant filtering thresholds, file handling, and processing options. The tool filters reads based on the number and support level of variants they contain.

Input/Output Parameters

in=<file>: Input SAM or BAM file containing aligned reads to be filtered.
ref=<file>: Optional fasta reference file. If not provided, scaffold information will be loaded from the SAM header.
out=<file>: Output file for good reads (reads that pass the variant filter).
outb=<file>: Output file for bad reads (reads that fail the variant filter).
vcf=<file>: VCF file of variants called from these reads. Used to determine which variants are supported by sufficient evidence.
vars=<file>: Alternatively, variants can be provided in CallVariants' native output format instead of VCF.

Variant Filtering Parameters

mbv=2: (maxbadvars) Discard reads with more bad variants than this threshold. Bad variants are those with insufficient support based on depth and allele fraction criteria.
mbad=2: (maxbadalleledepth) A variant is considered bad if the allele depth is at most this value. Works in conjunction with mbaf - the more stringent threshold applies.
mbaf=0.01: (maxbadallelefraction) A variant is considered bad if the allele fraction is at most this value. The more stringent of mbad or mbaf is used, so in low depth regions mbad is dominant while in high depth regions mbaf is more important. Variants are considered bad if they fail either threshold (meaning ad<=mbad or af<=mbaf).
mbrd=2: (minbadreaddepth) Substitutions may only be considered bad if the total read depth spanning the variant position is at least this much. Prevents filtering based on variants in very low-coverage regions.
border=5: (minenddist) Ignore variants within this distance of read ends. Read ends are more prone to sequencing errors and should be excluded from variant analysis.

Variant Type Selection

sub=t: Consider bad substitutions when filtering reads. Default is true.
ins=f: Consider bad insertions when filtering reads. Default is false.
del=f: Consider bad deletions when filtering reads. Default is false.

Processing Options

verbose=f: Print verbose status messages during processing.
ordered=t: Output reads in the same order as input. Set to false for faster processing if order doesn't matter.
ss=t: (streamer) Use SAM streamer for input processing. Generally provides better performance.
ploidy=1: Expected ploidy for variant calling if variants are called internally.
prefilter=f: Apply prefiltering during variant calling to improve performance.

Advanced Filtering Options

maxsubs=-1: (subfilter) Maximum allowed substitutions in a read. Reads exceeding this threshold are discarded regardless of variant support. -1 means no limit.
maxvars=-1: (varfilter) Maximum allowed total variations in a read. Reads exceeding this threshold are discarded regardless of variant support. -1 means no limit.

Examples

Basic Variant-Based Filtering

# Filter reads using an existing VCF file
filtersam.sh in=mapped.sam out=clean.sam vcf=variants.vcf

# Separate good and bad reads into different files
filtersam.sh in=mapped.bam out=good.bam outb=bad.bam vcf=variants.vcf

Basic usage with VCF file containing known variants. Reads with too many unsupported variants are filtered out.

Stringent Filtering Parameters

# More stringent filtering for high-quality data
filtersam.sh in=reads.sam out=filtered.sam vcf=vars.vcf \
    mbv=1 mbad=3 mbaf=0.05 mbrd=5

# Include insertions and deletions in filtering
filtersam.sh in=reads.sam out=filtered.sam vcf=vars.vcf \
    ins=t del=t border=10

Adjust filtering parameters for more stringent quality control, including indels.

Using CallVariants Native Format

# Use CallVariants output format instead of VCF
filtersam.sh in=mapped.sam out=clean.sam vars=variants.txt ref=reference.fa

Alternative using CallVariants' native output format with reference file.

Iterative Improvement Workflow

# Step 1: Call variants from original data
callvariants.sh in=mapped.sam ref=ref.fa out=vars1.vcf clearfilters minreads=2

# Step 2: Filter reads based on variants
filtersam.sh in=mapped.sam out=filtered1.sam vcf=vars1.vcf

# Step 3: Re-call variants from filtered data
callvariants.sh in=filtered1.sam ref=ref.fa out=vars2.vcf clearfilters minreads=2

# Step 4: Filter again with updated variants
filtersam.sh in=filtered1.sam out=final.sam vcf=vars2.vcf

Iterative approach for particularly noisy data, progressively improving variant calls and read quality.

Algorithm Details

Variant Support Analysis

FilterSam uses a dual-threshold approach to determine variant support quality. Each variant is evaluated against both allele depth (absolute count) and allele fraction (relative frequency) thresholds:

Allele Depth Threshold: Variants with allele depth ≤ mbad are considered poorly supported
Allele Fraction Threshold: Variants with allele fraction ≤ mbaf are considered poorly supported
Combined Assessment: A variant fails if it meets either threshold (logical OR), making the system adaptive to both low and high-coverage regions

Read Filtering Process

The filtering algorithm processes each mapped read through multiple stages:

Variant Counting: Count total variants in the read (substitutions, insertions, deletions based on settings)
Quick Pass: Reads with zero variants or variants ≤ maxbadvars automatically pass
Support Analysis: For reads exceeding maxbadvars, analyze each variant for support level using the dual-threshold approach
End Distance Filtering: Ignore variants within 'border' bases of read ends to avoid end-effect artifacts
Final Decision: Discard reads if unsupported variant count > maxbadvars threshold

Multi-threaded Processing

FilterSam uses parallel processing with the following architecture:

Input Streaming: Concurrent read input using either ConcurrentReadInputStream or SamReadStreamer
Worker Threads: Multiple ProcessThread instances handle read filtering independently
Output Buffering: Separate output streams for good and bad reads with configurable buffer sizes
Statistics Tracking: Per-thread counters for reads, bases, quality scores, mapping quality, and variant counts

Memory and Performance

The tool is designed for efficient processing of large SAM/BAM files:

Default Memory: 8GB heap allocation (Xmx8g) with automatic adjustment based on available RAM
Streaming Architecture: Processes reads in batches to minimize memory footprint
Variant Map Loading: Variants are loaded into memory once and shared across all threads
Scaffold Optimization: Reference scaffold information loaded from SAM header or reference file

Quality Metrics and Reporting

FilterSam provides comprehensive statistics including:

Read and base processing rates (reads/sec, bases/sec)
Retention vs. discard percentages
Average quality scores for retained vs. discarded reads
Average mapping quality comparison
Average variant counts for filtered categories

Support

For questions and support:

Email: bbushnell@lbl.gov
Documentation: bbmap.org