FilterVCF

Script: filtervcf.sh
Package: var2
Class: FilterVCF.java

Filters VCF files by position or other attributes. Filtering by optional fields (such as allele frequency) requires VCF files generated by CallVariants.

Basic Usage

filtervcf.sh in=<file> out=<file>

FilterVCF processes VCF files to remove variants that don't meet specified quality, position, or type criteria. It can operate in single-threaded or multithreaded mode for better performance on large files.

Parameters

Parameters are organized by their filtering function. The tool uses both VarFilter (variant-specific criteria) and SamFilter (position-specific criteria) systems for comprehensive variant evaluation.

I/O parameters

in=<file>
Input VCF file. Required parameter for variant processing.
out=<file>
Output VCF file. If not specified, output goes to stdout.
ref=<file>
Reference fasta file (optional). Used for variant validation and coordinate resolution through ScafMap.
overwrite=f
(ow) Set to false to force the program to abort rather than overwrite an existing file. Default: true.
bgzip=f
Use bgzip for gzip compression instead of standard gzip. Provides better performance for large files. Default: false.
splitalleles=f
Split multi-allelic lines into multiple lines, with each line containing a single alternate allele. Default: false.
splitsubs=f
Split multi-base substitutions into individual SNPs. Useful for downstream analysis requiring single-base variants. Default: false.
canonize=t
Trim variations down to a canonical representation by removing redundant bases. Default: true.
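
As a rough illustration of the output shape that splitalleles=t describes, the awk sketch below writes one record per alternate allele. It is not FilterVCF's implementation: the function name split_alleles is hypothetical, and the sketch leaves per-allele INFO and genotype values untouched, which a real tool would also have to adjust.

```shell
# Hypothetical helper: split a comma-delimited ALT column into one line per allele.
split_alleles() {
  awk 'BEGIN { FS = OFS = "\t" }
       /^#/ { print; next }            # pass header lines through unchanged
       {
         n = split($5, alts, ",")      # column 5 is ALT
         for (i = 1; i <= n; i++) { $5 = alts[i]; print }
       }'
}

# One multi-allelic record in, two single-allele records out.
printf 'chr1\t100\t.\tA\tC,G\t50\t.\t.\n' | split_alleles
```

Running the same check on a file written with splitalleles=t should show no commas in the ALT column.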

Position-filtering parameters

minpos=
Lower bound of the position range. Variants that do not overlap the range from minpos to maxpos are ignored. Coordinate-based filtering uses the SamFilter system.
maxpos=
Upper bound of the position range. Works together with minpos to define a coordinate window.
contigs=
Comma-delimited list of contig names to include. Names must not contain spaces; replace any spaces with underscores. Uses ScafMap for coordinate resolution.
invert=f
Invert position filters. When true, keeps variants outside the specified position range instead of inside. Default: false.
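
The overlap test behind a position window, and the effect of inverting it, can be sketched in awk. This is an illustration under stated assumptions, not FilterVCF's code: coordinates are taken as 1-based and inclusive, a record is assumed to cover POS through POS + length(REF) - 1, and the function name pos_filter is hypothetical. Contig selection is not modeled.

```shell
# Hypothetical helper: pos_filter MINPOS MAXPOS [INVERT]
# INVERT=0 keeps records overlapping the window, INVERT=1 keeps the rest.
pos_filter() {
  awk -v lo="$1" -v hi="$2" -v inv="${3:-0}" '
    BEGIN { FS = "\t"; lo += 0; hi += 0; inv += 0 }
    /^#/ { print; next }                 # keep header lines
    {
      start = $2 + 0                     # POS, assumed 1-based
      stop  = start + length($4) - 1     # last reference base covered by REF
      overlaps = (start <= hi && stop >= lo) ? 1 : 0
      if (overlaps != inv) print
    }'
}

# A deletion starting at 998 reaches into the window 1000-2000; position 2500 does not.
printf 'chr1\t998\t.\tACGT\tA\t40\t.\t.\nchr1\t2500\t.\tA\tG\t40\t.\t.\n' | pos_filter 1000 2000
```

Under these assumptions a variant that starts before minpos is still kept when its reference span extends into the window.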

Type-filtering parameters

sub=t
Keep substitutions (SNPs). Controls the Var.CALL_SUB global flag. Default: true.
del=t
Keep deletions. Controls the Var.CALL_DEL global flag for deletion variant types. Default: true.
ins=t
Keep insertions. Controls the Var.CALL_INS global flag for insertion variant types. Default: true.
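
To check which variant types a file contains, for example before and after applying sub, del, and ins, a small awk tally can help. The sketch assumes one alternate allele per line and classifies by allele length only (equal lengths count as a substitution, shorter ALT as a deletion, longer ALT as an insertion). The function name classify_variants is hypothetical, and the tool's own classification may differ in edge cases.

```shell
# Hypothetical helper: count substitutions, deletions, and insertions in a VCF stream.
classify_variants() {
  awk 'BEGIN { FS = "\t" }
       /^#/ { next }                     # skip header lines
       {
         r = length($4); a = length($5)  # REF and ALT lengths
         type = (r == a) ? "sub" : ((r > a) ? "del" : "ins")
         count[type]++
       }
       END { printf "sub=%d del=%d ins=%d\n", count["sub"], count["del"], count["ins"] }'
}

printf 'chr1\t10\t.\tA\tC\t40\t.\t.\nchr1\t20\t.\tACG\tA\t40\t.\t.\nchr1\t30\t.\tA\tATT\t40\t.\t.\n' | classify_variants
```

A file written with del=f ins=f should report del=0 ins=0 under this classification.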

Variant-quality filtering parameters

minreads=0
Ignore variants seen in fewer reads. Minimum coverage threshold for variant support. Default: 0.
minqualitymax=0
Ignore variants with lower max base quality. Filters based on the highest quality base supporting the variant. Default: 0.
minedistmax=0
Ignore variants with lower max distance from read ends. Filters variants too close to read termini, which are error-prone. Default: 0.
minmapqmax=0
Ignore variants with lower max mapq. Filters based on mapping quality of supporting reads. Default: 0.
minidmax=0
Ignore variants with lower max read identity. Filters variants in reads with poor overall identity to reference. Default: 0.
minpairingrate=0.0
Ignore variants with lower pairing rate. Requires proper pair information from alignment. Range: 0.0-1.0. Default: 0.0.
minstrandratio=0.0
Ignore variants with lower plus/minus strand ratio. Filters variants with strand bias, indicating potential artifacts. Default: 0.0.
minquality=0.0
Ignore variants with lower average base quality. Uses mean quality of all bases supporting the variant. Default: 0.0.
minedist=0.0
Ignore variants with lower average distance from ends. Mean distance from read termini for variant-supporting bases. Default: 0.0.
minavgmapq=0.0
Ignore variants with lower average mapq. Mean mapping quality across all reads supporting the variant. Default: 0.0.
minallelefraction=0.0
Ignore variants with lower allele fraction. Lower this threshold as ploidy increases (e.g., 0.1 for diploid, 0.05 for tetraploid). Default: 0.0.
minid=0
Ignore variants with lower average read identity. Mean identity score across supporting reads. Default: 0.
minscore=0.0
Ignore variants with lower Phred-scaled score. Simple quality threshold using VCF QUAL field. Default: 0.0.
clearfilters
Reset all variant filters to zero. Clears both VarFilter and SamFilter criteria to start fresh.
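
Of the thresholds above, minscore is the simplest to reason about because it reads the standard QUAL column. A minimal awk sketch of that comparison follows. The function name min_qual is hypothetical, and treating a missing QUAL (".") as 0 is an assumption made here so that a threshold of 0 keeps every record. How FilterVCF handles a missing QUAL is not stated above.

```shell
# Hypothetical helper: min_qual THRESHOLD < in.vcf
min_qual() {
  awk -v min="$1" '
    BEGIN { FS = "\t"; min += 0 }
    /^#/ { print; next }                 # keep header lines
    {
      q = ($6 == ".") ? 0 : $6 + 0       # column 6 is QUAL; "." means missing
      if (q >= min) print
    }'
}

# Keeps only the record with QUAL 35.5.
printf 'chr1\t10\t.\tA\tC\t10\t.\t.\nchr1\t20\t.\tA\tG\t35.5\t.\t.\n' | min_qual 20
```

The other thresholds depend on per-variant statistics that, as noted at the top of this page, require a VCF generated by CallVariants.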

Java Parameters

-Xmx
Sets Java's maximum memory usage, overriding autodetection. For example, -Xmx20g specifies 20 GB of RAM and -Xmx200m specifies 200 MB. The maximum is typically 85% of physical memory.
-eoom
This flag will cause the process to exit if an out-of-memory exception occurs. Requires Java 8u92+.
-da
Disable assertions. Can provide minor performance improvements in production environments.
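
The 85% guideline can be turned into an explicit -Xmx value with a little shell arithmetic. This sketch only illustrates that arithmetic and is not how the script's autodetection works. The function name xmx_from_kb is hypothetical, and reading /proc/meminfo assumes a Linux system.

```shell
# Hypothetical helper: xmx_from_kb TOTAL_KB
# Prints an -Xmx flag in megabytes equal to 85% of the given memory (integer math).
xmx_from_kb() {
  echo "-Xmx$(( $1 * 85 / 100 / 1024 ))m"
}

# On Linux, total physical memory in kB is listed in /proc/meminfo.
if [ -r /proc/meminfo ]; then
  total_kb=$(awk '/^MemTotal:/ { print $2 }' /proc/meminfo)
  xmx_from_kb "$total_kb"
fi
```

The printed flag can then be passed on the command line alongside the other parameters, in the same position as -Xmx20g in the description above.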

Examples

Basic Quality Filtering

filtervcf.sh in=variants.vcf out=filtered.vcf minscore=20.0 minallelefraction=0.1

Keeps only variants with a quality score of at least 20 and an allele fraction of at least 10%. Suitable for diploid organisms.

Position-based Filtering

filtervcf.sh in=variants.vcf out=region.vcf minpos=1000000 maxpos=2000000 contigs=chr1,chr2

Extract variants from chromosomes 1 and 2 within the coordinate range 1Mb to 2Mb.

Type-specific Filtering

filtervcf.sh in=variants.vcf out=snps_only.vcf del=f ins=f sub=t

Keep only substitutions (SNPs), filtering out insertions and deletions.

Comprehensive Quality Control

filtervcf.sh in=variants.vcf out=highqual.vcf minreads=5 minquality=25.0 minstrandratio=0.2 minedist=10.0 minallelefraction=0.15

Stringent filtering requiring: minimum 5 supporting reads, average quality ≥25, strand ratio ≥0.2, distance from ends ≥10bp, and allele fraction ≥15%.
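
When the same thresholds need to be applied to several files, a small wrapper keeps the settings in one place. The sketch below is one possible arrangement: the function name filter_all, the DRY_RUN switch, and the .highqual.vcf output naming are all hypothetical. With DRY_RUN=1 (the default here) it only prints the commands it would run.

```shell
# Thresholds copied from the example above.
FILTERS="minreads=5 minquality=25.0 minstrandratio=0.2 minedist=10.0 minallelefraction=0.15"

# Hypothetical helper: filter_all FILE.vcf [FILE.vcf ...]
filter_all() {
  for vcf in "$@"; do
    out="${vcf%.vcf}.highqual.vcf"       # hypothetical output naming scheme
    if [ "${DRY_RUN:-1}" = "1" ]; then
      echo "filtervcf.sh in=$vcf out=$out $FILTERS"
    else
      # FILTERS is left unquoted on purpose so it splits into separate arguments.
      filtervcf.sh in="$vcf" out="$out" $FILTERS
    fi
  done
}

filter_all sampleA.vcf sampleB.vcf
```

Setting DRY_RUN=0 runs the commands for real, which requires filtervcf.sh to be on the PATH.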

Multi-allelic Splitting

filtervcf.sh in=variants.vcf out=split.vcf splitalleles=t splitsubs=t canonize=t

Split multi-allelic variants and complex substitutions into simpler canonical forms for downstream analysis.
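
To make the effect of splitsubs=t concrete, the awk sketch below breaks a multi-base substitution into single-base records. It is an illustration, not FilterVCF's implementation: it assumes alleles have already been split (no commas in ALT), it skips positions where the REF and ALT bases are identical, and the function name split_subs is hypothetical.

```shell
# Hypothetical helper: turn equal-length multi-base substitutions into SNP records.
split_subs() {
  awk 'BEGIN { FS = OFS = "\t" }
       /^#/ { print; next }              # keep header lines
       length($4) > 1 && length($4) == length($5) && $5 !~ /,/ {
         pos = $2 + 0; ref = $4; alt = $5
         for (i = 1; i <= length(ref); i++) {
           r = substr(ref, i, 1); a = substr(alt, i, 1)
           if (r == a) continue          # unchanged base: nothing to emit
           $2 = pos + i - 1; $4 = r; $5 = a
           print
         }
         next
       }
       { print }'                        # everything else passes through
}

# ACG -> TCA at position 100 differs at the first and third bases.
printf 'chr1\t100\t.\tACG\tTCA\t50\t.\t.\n' | split_subs
```

In this sketch each emitted SNP inherits the QUAL and INFO values of the original record. Whether the tool recomputes them is not covered here.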

Algorithm Details

Filtering Architecture

FilterVCF implements a comprehensive dual-filter system combining statistical variant assessment with positional constraints:

Processing Strategy

The tool uses different optimization strategies based on threading mode:

Variant Splitting

The tool supports three types of variant decomposition:

Performance Characteristics

Coordinate System

FilterVCF uses the ScafMap system for efficient coordinate handling:
