FilterVCF
Filters VCF files by position or other attributes. Filtering by optional fields (such as allele frequency) requires VCF files generated by CallVariants.
Basic Usage
filtervcf.sh in=<file> out=<file>
FilterVCF processes VCF files to remove variants that don't meet specified quality, position, or type criteria. It can run single-threaded or multithreaded; the multithreaded mode is faster on large files.
Parameters
Parameters are organized by their filtering function. The tool uses both VarFilter (variant-specific criteria) and SamFilter (position-specific criteria) systems for comprehensive variant evaluation.
I/O parameters
- in=<file>
- Input VCF file. Required parameter for variant processing.
- out=<file>
- Output VCF file. If not specified, output goes to stdout.
- ref=<file>
- Reference fasta file (optional). Used for variant validation and coordinate resolution through ScafMap.
- overwrite=f
- (ow) Set to false to force the program to abort rather than overwrite an existing file. Default: true.
- bgzip=f
- Use bgzip for gzip compression instead of standard gzip. Provides better performance for large files. Default: false.
- splitalleles=f
- Split multi-allelic lines into multiple lines, with each line containing a single alternate allele. Default: false.
- splitsubs=f
- Split multi-base substitutions into individual SNPs. Useful for downstream analysis requiring single-base variants. Default: false.
- canonize=t
- Trim variations down to a canonical representation by removing redundant bases. Default: true.
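To make the effect of canonize=t concrete, the following Python sketch trims bases shared by REF and ALT. It is a generic illustration of variant trimming, not FilterVCF's own code: the function name canonize and the trimming order (shared suffix first, then shared prefix) are assumptions.

```python
def canonize(pos, ref, alt):
    """Trim bases shared by REF and ALT; return (pos, ref, alt).

    pos is 1-based, as in VCF. At least one base is kept on each side.
    """
    # Trim a shared suffix first; the position is unaffected.
    while len(ref) > 1 and len(alt) > 1 and ref[-1] == alt[-1]:
        ref, alt = ref[:-1], alt[:-1]
    # Then trim a shared prefix, advancing the position per base removed.
    while len(ref) > 1 and len(alt) > 1 and ref[0] == alt[0]:
        ref, alt = ref[1:], alt[1:]
        pos += 1
    return pos, ref, alt
```

For example, canonize(100, "ACGT", "ACTT") reduces the record to a single-base substitution at position 102.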
Position-filtering parameters
- minpos=
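- Lower bound of the coordinate range. Variants that do not overlap the range minpos to maxpos are ignored. Coordinate-based filtering using the SamFilter system.
- maxpos=
- Upper bound of the coordinate range. Works together with minpos to define a coordinate window; variants that do not overlap the window are ignored.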
- contigs=
- Comma-delimited list of contig names to include. These should have no spaces, or underscores instead of spaces. Uses ScafMap for coordinate resolution.
- invert=f
- Invert position filters. When true, keeps variants outside the specified position range instead of inside. Default: false.
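The way these four parameters combine can be sketched in Python as below. This is an assumption-laden illustration rather than the SamFilter logic itself: the function passes_position is hypothetical, and treating invert as a negation of the combined contig-and-range test is one plausible reading of the description above.

```python
def passes_position(contig, start, stop, contigs=None,
                    minpos=None, maxpos=None, invert=False):
    """Return True if a variant spanning [start, stop] is kept.

    Coordinates are 1-based and inclusive. A filter set to None is inactive.
    """
    in_contig = contigs is None or contig in contigs
    # Overlap test: the variant ends at or after minpos
    # and starts at or before maxpos.
    in_range = ((minpos is None or stop >= minpos) and
                (maxpos is None or start <= maxpos))
    inside = in_contig and in_range
    # invert flips the decision: keep what lies outside the selection.
    return inside != invert
```

Note that a deletion starting just before minpos is still kept when it extends into the window, because the test is for overlap rather than containment.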
Type-filtering parameters
- sub=t
- Keep substitutions (SNPs). Controls the Var.CALL_SUB global flag. Default: true.
- del=t
- Keep deletions. Controls the Var.CALL_DEL global flag for deletion variant types. Default: true.
- ins=t
- Keep insertions. Controls the Var.CALL_INS global flag for insertion variant types. Default: true.
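As a rough illustration of type filtering, the Python sketch below classifies a single-allele record by comparing REF and ALT lengths and then applies the three keep flags. The helper names and the length-based classification are assumptions made for this example; FilterVCF's own Var type logic may differ in detail.

```python
def variant_type(ref, alt):
    """Classify a single-allele variant as 'sub', 'del', or 'ins'."""
    if len(ref) == len(alt):
        return "sub"
    return "del" if len(ref) > len(alt) else "ins"

def passes_type(ref, alt, keep_sub=True, keep_del=True, keep_ins=True):
    """Mirror sub=, del=, ins=: keep the variant only if its type is enabled."""
    keep = {"sub": keep_sub, "del": keep_del, "ins": keep_ins}
    return keep[variant_type(ref, alt)]
```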
Variant-quality filtering parameters
- minreads=0
- Ignore variants seen in fewer reads. Minimum coverage threshold for variant support. Default: 0.
- minqualitymax=0
- Ignore variants with lower max base quality. Filters based on the highest quality base supporting the variant. Default: 0.
- minedistmax=0
- Ignore variants with lower max distance from read ends. Filters variants too close to read termini, which are error-prone. Default: 0.
- minmapqmax=0
- Ignore variants with lower max mapq. Filters based on mapping quality of supporting reads. Default: 0.
- minidmax=0
- Ignore variants with lower max read identity. Filters variants in reads with poor overall identity to reference. Default: 0.
- minpairingrate=0.0
- Ignore variants with lower pairing rate. Requires proper pair information from alignment. Range: 0.0-1.0. Default: 0.0.
- minstrandratio=0.0
- Ignore variants with lower plus/minus strand ratio. Filters variants with strand bias, indicating potential artifacts. Default: 0.0.
- minquality=0.0
- Ignore variants with lower average base quality. Uses mean quality of all bases supporting the variant. Default: 0.0.
- minedist=0.0
- Ignore variants with lower average distance from ends. Mean distance from read termini for variant-supporting bases. Default: 0.0.
- minavgmapq=0.0
- Ignore variants with lower average mapq. Mean mapping quality across all reads supporting the variant. Default: 0.0.
- minallelefraction=0.0
- Ignore variants with lower allele fraction. This should be adjusted for high ploidies (e.g., 0.1 for diploid, 0.05 for tetraploid). Default: 0.0.
- minid=0
- Ignore variants with lower average read identity. Mean identity score across supporting reads. Default: 0.
- minscore=0.0
- Ignore variants with lower Phred-scaled score. Simple quality threshold using VCF QUAL field. Default: 0.0.
- clearfilters
- Reset all variant filters to zero. Clears both VarFilter and SamFilter criteria to start fresh.
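The thresholds above all follow the same pattern: a variant is dropped when one of its statistics falls below the configured minimum. The Python sketch below shows that pattern for three of them. The VariantStats and Thresholds classes are hypothetical, and computing allele fraction as supporting reads divided by total depth is an assumption of this example.

```python
from dataclasses import dataclass

@dataclass
class VariantStats:
    reads: int     # reads supporting the variant allele
    depth: int     # total reads covering the position
    score: float   # Phred-scaled score

@dataclass
class Thresholds:
    minreads: int = 0
    minallelefraction: float = 0.0
    minscore: float = 0.0

def passes_quality(v, t):
    """Keep the variant only if every statistic meets its minimum."""
    allele_fraction = v.reads / v.depth if v.depth > 0 else 0.0
    return (v.reads >= t.minreads
            and allele_fraction >= t.minallelefraction
            and v.score >= t.minscore)
```

With all thresholds left at zero, as after clearfilters, every variant passes.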
Java Parameters
- -Xmx
- This will set Java's memory usage, overriding autodetection. -Xmx20g will specify 20 gigs of RAM, and -Xmx200m will specify 200 megs. The max is typically 85% of physical memory.
- -eoom
- This flag will cause the process to exit if an out-of-memory exception occurs. Requires Java 8u92+.
- -da
- Disable assertions. Can provide minor performance improvements in production environments.
Examples
Basic Quality Filtering
filtervcf.sh in=variants.vcf out=filtered.vcf minscore=20.0 minallelefraction=0.1
Keep only variants with a quality score of at least 20 and an allele fraction of at least 10%. Suitable for diploid organisms.
Position-based Filtering
filtervcf.sh in=variants.vcf out=region.vcf minpos=1000000 maxpos=2000000 contigs=chr1,chr2
Extract variants from chromosomes 1 and 2 within the coordinate range 1Mb to 2Mb.
Type-specific Filtering
filtervcf.sh in=variants.vcf out=snps_only.vcf del=f ins=f sub=t
Keep only substitutions (SNPs), filtering out insertions and deletions.
Comprehensive Quality Control
filtervcf.sh in=variants.vcf out=highqual.vcf minreads=5 minquality=25.0 minstrandratio=0.2 minedist=10.0 minallelefraction=0.15
Stringent filtering that requires at least 5 supporting reads, average base quality ≥25, strand ratio ≥0.2, average distance from read ends ≥10bp, and allele fraction ≥15%.
Multi-allelic Splitting
filtervcf.sh in=variants.vcf out=split.vcf splitalleles=t splitsubs=t canonize=t
Split multi-allelic variants and complex substitutions into simpler canonical forms for downstream analysis.
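To show what splitsubs=t produces, here is a minimal Python sketch that decomposes an equal-length multi-base substitution into per-base SNPs. The function split_substitution is hypothetical, and skipping positions where REF and ALT agree is an assumption of this example, not a documented behavior.

```python
def split_substitution(pos, ref, alt):
    """Decompose a multi-base substitution into single-base records.

    Records whose REF and ALT differ in length (indels) are returned unchanged.
    """
    if len(ref) != len(alt):
        return [(pos, ref, alt)]
    snps = [(pos + i, r, a)
            for i, (r, a) in enumerate(zip(ref, alt)) if r != a]
    # If nothing differs, return the record as it was.
    return snps or [(pos, ref, alt)]
```

For instance, a substitution of ACG by TCA at position 100 becomes two SNPs, at positions 100 and 102.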
Algorithm Details
Filtering Architecture
FilterVCF combines two filter systems, one for statistical variant assessment and one for positional constraints:
- VarFilter System: Evaluates variant-specific statistical criteria including coverage depth, base quality distributions, strand bias, mapping quality, and allele frequencies. Works with Var objects converted from VCF format for detailed statistical analysis.
- SamFilter System: Handles coordinate-based filtering using ScafMap for efficient coordinate resolution. Supports range queries, contig selection, and coordinate inversion.
Processing Strategy
The tool uses different optimization strategies based on threading mode:
- Single-threaded Mode: Uses VCFLine.toVar() for comprehensive variant parsing and validation. More thorough but slower parsing suitable for detailed analysis.
- Multithreaded Mode: Uses VcfToVar.fromVCF() for faster parsing optimized for parallel processing. Each thread maintains independent statistics that are aggregated at completion.
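The multithreaded pattern described above, with per-thread statistics merged at the end and output kept in input order, can be sketched in Python as follows. This is a generic illustration using the standard library, not a port of the tool's Java code; the chunking scheme and the function names are assumptions.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def process_chunk(lines, keep):
    """Filter one chunk; return kept lines plus this chunk's own counters."""
    stats = Counter()
    kept = []
    for line in lines:
        if keep(line):
            kept.append(line)
            stats["kept"] += 1
        else:
            stats["dropped"] += 1
    return kept, stats

def filter_parallel(lines, keep, threads=4, chunk_size=2):
    """Filter chunks in parallel, then merge results in input order."""
    chunks = [lines[i:i + chunk_size]
              for i in range(0, len(lines), chunk_size)]
    total = Counter()
    out = []
    with ThreadPoolExecutor(max_workers=threads) as pool:
        # map() yields results in submission order, so output stays ordered.
        for kept, stats in pool.map(lambda c: process_chunk(c, keep), chunks):
            out.extend(kept)
            total.update(stats)  # aggregate the independent counters
    return out, total
```

Because each chunk owns its counters, no locking is needed while filtering; the only shared state is touched during the final merge.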
Variant Splitting
The tool supports three types of variant decomposition:
- Allele Splitting: Multi-allelic variants (e.g., A→G,T) are split into separate biallelic records (A→G and A→T), preserving FORMAT field data appropriately.
- Substitution Splitting: Complex substitutions spanning multiple bases are decomposed into constituent SNPs when possible.
- Complex Splitting: More sophisticated decomposition of complex variants into simpler canonical forms.
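Allele splitting, the first item above, can be illustrated with the short Python sketch below. It operates on the tab-separated fields of one VCF data line and copies every column other than ALT unchanged, which is a simplification: a full implementation would also need to adjust per-allele INFO and FORMAT values. The function name is hypothetical.

```python
def split_alleles(fields):
    """Return one record per ALT allele for a VCF data line.

    fields is the list of tab-separated columns; column 5 (index 4) is ALT.
    """
    alts = fields[4].split(",")
    if len(alts) == 1:
        return [fields]
    out = []
    for alt in alts:
        rec = list(fields)  # copy so the input record is not modified
        rec[4] = alt
        out.append(rec)
    return out
```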
Performance Characteristics
- Memory Usage: Moderate memory footprint with configurable Java heap. Default 4GB handles most datasets efficiently.
- Threading: Up to 8 threads maximum for parallel processing. Thread-safe operation with independent thread-local statistics.
- I/O Optimization: Uses ByteFile system with configurable compression (standard gzip or bgzip). Maintains ordered output even in multithreaded mode.
- Scalability: Processes large VCF files (millions of variants) efficiently with streaming architecture that doesn't load entire files into memory.
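The streaming behavior noted under Scalability can be sketched in Python as a line-by-line pass that never holds more than one record in memory. This is an illustrative stand-in, not the ByteFile system: the helper names are hypothetical, and compression is chosen here simply by a .gz file extension.

```python
import gzip

def open_text(path, mode="rt"):
    """Open plain or gzip-compressed text, chosen by file extension."""
    if path.endswith(".gz"):
        return gzip.open(path, mode)
    return open(path, mode)

def stream_filter(in_path, out_path, keep):
    """Copy header lines unchanged; write data lines that pass keep(fields).

    Returns the number of data lines written.
    """
    kept = 0
    with open_text(in_path) as src, open_text(out_path, "wt") as dst:
        for line in src:  # one line in memory at a time
            if line.startswith("#"):
                dst.write(line)
            elif keep(line.rstrip("\n").split("\t")):
                dst.write(line)
                kept += 1
    return kept
```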
Coordinate System
FilterVCF uses the ScafMap system for efficient coordinate handling:
- Automatically extracts contig information from VCF headers (##contig lines)
- Supports reference genome loading for coordinate validation
- Handles both indexed and unindexed coordinate queries
- Compatible with standard VCF coordinate conventions (1-based, inclusive)
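As an illustration of the first point, the Python sketch below collects contig names and lengths from ##contig header lines. It is not the ScafMap implementation; the function name is hypothetical, and the simple comma split assumes header values contain no quoted commas.

```python
def parse_contigs(header_lines):
    """Return {contig_name: length or None} from ##contig=<...> lines."""
    prefix = "##contig=<"
    contigs = {}
    for line in header_lines:
        line = line.strip()
        if not (line.startswith(prefix) and line.endswith(">")):
            continue
        # Split "ID=chr1,length=1000" into key/value pairs.
        pairs = (item.split("=", 1)
                 for item in line[len(prefix):-1].split(","))
        fields = {p[0]: p[1] for p in pairs if len(p) == 2}
        if "ID" in fields:
            length = fields.get("length")
            contigs[fields["ID"]] = int(length) if length is not None else None
    return contigs
```

Contigs declared without a length are kept with a value of None, so they can still be used for name-based selection.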
Support
For questions and support:
- Email: bbushnell@lbl.gov
- Documentation: bbmap.org