CompareVCF
Performs set operations on VCF files: Union, intersection, and subtraction.
Basic Usage
comparevcf.sh in=<file,file,...> out=<file>
CompareVCF requires at least two input VCF files to perform set operations. By default, it performs subtraction (first file minus all others).
Parameters
Parameters are organized by their function in the VCF comparison process.
I/O parameters
- in=<file>
- Input VCF files; must be at least 2 files, comma-separated. Files can be gzip-compressed.
- out=<file>
- Output file for the result of the set operation. Will be in VCF format unless otherwise specified.
- ref=<file>
- Reference file; optional. Usually not needed as VCF files contain reference information in headers.
- shist=<file>
- (scorehist) Output file for variant score histogram. Useful for analyzing quality score distributions.
- overwrite=f
- (ow) Set to false to force the program to abort rather than overwrite existing output files.
- bgzip=f
- Use bgzip for gzip compression instead of standard gzip. Bgzip allows for indexing with tabix.
Mode Parameters (choose one only)
- subtract=t
- Subtract all other files from the first file. This is the default mode. Variants present in the first file but not in any subsequent files are retained.
- union=f
- Make a union of all files. All unique variants from all input files are included in the output.
- intersection=f
- Make an intersection of all files. Only variants present in ALL input files are retained in the output.
Processing Parameters
- addsamples=t
- Include all samples in the output lines. When enabled, sample information from all input files is preserved in the output.
- splitalleles=f
- Split multi-allelic lines into multiple lines. Each alternate allele gets its own line in the output.
- splitsubs=f
- Split multi-base substitutions into single nucleotide polymorphisms (SNPs). Complex substitutions are broken down into individual base changes.
- splitcomplex=f
- Split complex variations into simpler components. This includes insertions, deletions, and complex rearrangements.
- sass=f
- (split) Equivalent to setting both splitalleles=t and splitsubs=t. Convenience parameter for common splitting operations.
- splitall=f
- (sascsss) Equivalent to setting splitalleles=t, splitsubs=t, and splitcomplex=t. Splits all variant types into their simplest forms.
- canonize=t
- Trim variations down to a canonical representation. Removes redundant bases from insertions and deletions to create minimal representations.
- minscore=-99999
- (minqual, minq) Minimum quality score for variants to be included in the analysis. Variants with quality scores below this threshold are filtered out.
- lines=<number>
- Maximum number of lines to process. Set to -1 or omit for no limit. Useful for testing with large files.
- verbose=f
- Enable verbose output for debugging and monitoring progress through large datasets.
Java Parameters
- -Xmx
- This will set Java's memory usage, overriding autodetection. -Xmx20g will specify 20 gigs of RAM, and -Xmx200m will specify 200 megs. The max is typically 85% of physical memory.
- -eoom
- This flag will cause the process to exit if an out-of-memory exception occurs. Requires Java 8u92+.
- -da
- Disable assertions for slightly better performance in production environments.
Examples
Basic Subtraction (Default Mode)
comparevcf.sh in=sample1.vcf,sample2.vcf,sample3.vcf out=unique_to_sample1.vcf
Find variants that are unique to sample1.vcf (present in sample1 but not in sample2 or sample3).
Union Operation
comparevcf.sh in=cohort1.vcf,cohort2.vcf,cohort3.vcf out=all_variants.vcf union=t
Combine all unique variants from three cohort files into a single variant set.
Intersection Operation
comparevcf.sh in=cancer.vcf,normal.vcf,germline.vcf out=shared_variants.vcf intersection=t
Find variants that are present in all three input files (cancer, normal, and germline samples).
Quality Filtering with Splitting
comparevcf.sh in=raw1.vcf,raw2.vcf out=high_quality.vcf union=t minscore=30 splitall=t canonize=t
Create a union of high-quality variants (Q≥30), splitting complex variants into simple forms and canonizing the representation.
With Score Histogram Output
comparevcf.sh in=variants1.vcf,variants2.vcf out=difference.vcf shist=quality_distribution.txt
Perform subtraction while also generating a histogram of variant quality scores for analysis.
Algorithm Details
Set Operation Implementation
CompareVCF uses Java HashSet data structures to perform set operations on VCF variants. Each VCF file is parsed into VCFLine objects that serve as the atomic units for comparison. The implementation uses three separate methods: union() iteratively adds variants from each file to a single HashSet, intersection() uses HashSet.retainAll() to keep only variants present in all files, and difference() uses HashSet.removeAll() to subtract variants from subsequent files.
Variant Processing Pipeline
The algorithm follows this processing pipeline:
- VCF Parsing: Each input file is parsed using the VCFFile class, which handles header parsing, sample name extraction, and variant line processing.
- Variant Splitting: If splitting options are enabled, multi-allelic variants and complex variations are decomposed into simpler forms using the VCFLine.split() method.
- Quality Filtering: Variants below the minimum score threshold are filtered out during the loading phase.
- Set Operations: The core operations are implemented as:
- Union: Iterative addition of all variants from each file to a single HashSet
- Intersection: Uses HashSet.retainAll() to keep only variants present in all files
- Difference: Uses HashSet.removeAll() to subtract variants from subsequent files
- Canonicalization: If enabled, variants are trimmed to their minimal representation by removing redundant reference bases.
- Output Generation: Results are sorted and written in VCF format, preserving header information and sample data.
Memory and Performance Characteristics
CompareVCF loads all variants into memory as HashSet collections, making it very fast for set operations but requiring sufficient RAM for large variant datasets. Memory usage scales linearly with the total number of unique variants across all input files. The HashSet implementation provides O(1) average-case complexity for lookups, insertions, and deletions, making set operations highly efficient even with large datasets.
Variant Comparison Strategy
VCF variants are compared based on their genomic position (chromosome and position) and allelic content. The VCFLine class implements proper equals() and hashCode() methods to ensure variants are correctly identified as identical or different. When canonicalization is enabled, variants are normalized to their minimal representation before comparison, ensuring that equivalent variants with different representations (e.g., "ATCG→A" vs "TCG→" for the same deletion) are recognized as identical.
Header and Sample Management
The tool preserves VCF header information from the first input file and can optionally combine sample information from all input files when addsamples=t. This allows for proper tracking of variant calls across multiple samples in the output file.
Support
For questions and support:
- Email: bbushnell@lbl.gov
- Documentation: bbmap.org