ApplyVariants

Script: applyvariants.sh
Package: var2
Class: ApplyVariants.java

Mutates a reference by applying a set of variants. When two variants overlap, the one with the higher allele count is used.

Basic Usage

applyvariants.sh in=<input file> vcf=<vcf file> out=<output file>

ApplyVariants processes reference sequences (FASTA format) and applies variants from a VCF file to generate mutated sequences. The tool is particularly useful for creating consensus sequences or strain-specific references by applying known variants to a reference genome.

Parameters

Parameters are organized into functional groups for processing variants, handling coverage data, renaming sequences, and controlling Java execution settings.

Standard parameters

in=<file>
Reference fasta input file containing the sequences to be mutated.
vcf=<file>
VCF file containing variants to apply to the reference. Must be properly formatted with chromosome/scaffold names matching the reference.
basecov=<file>
Optional per-base coverage file from BBMap or Pileup. Used with the mincov parameter to mask low-coverage regions. Sequence names in the coverage file should match those in the reference.
out=<file>
Output fasta file containing the mutated sequences with variants applied.
overwrite=f
(ow) Controls whether an existing output file may be overwritten. With the default (false), the program aborts rather than overwrite an existing file; set overwrite=t to allow overwriting.
ziplevel=2
(zl) Compression level for output files, from 1 (lowest/fastest) through 9 (maximum/slowest). Lower compression is faster. Default is 2.

Processing parameters

mincov=0
Minimum coverage threshold. If positive, reference bases whose depth is below this value are changed to N. Requires a coverage file specified with the basecov parameter. Useful for masking unreliable low-coverage regions.
maxindel=-1
Maximum indel length to process. If positive, ignore indels longer than this value. Set to -1 (default) to process all indels regardless of size. Useful for filtering out very large structural variants.
noframeshifts=f
Filter out frameshifting indels. Set to true to ignore indels that are not a multiple of 3 in length, preserving reading frame in coding sequences. Useful for protein-coding regions.

Renaming parameters

name=
Optional new name for output sequences. If specified, all sequences will be renamed using this base name. Can be combined with addnumbers and prefix options.
addnumbers=f
Add sequential numbers (_1, _2, etc.) to ensure sequence names are unique when renaming. Default is false. Set to true when using the name parameter with multiple sequences.
prefix=t
Use the name parameter as a prefix to the original name, instead of completely replacing the original name. Default is true. When false, completely replaces existing names.
delimiter=_
Symbol to place between parts of the new name when using prefix mode. Default is underscore (_). For space or tab, use the literal words "space" or "tab".

Java Parameters

-Xmx
Set Java's maximum memory usage, overriding autodetection. Examples: -Xmx20g specifies 20 gigabytes of RAM, -Xmx200m specifies 200 megabytes. The maximum is typically 85% of physical memory.
-eoom
Exit on out-of-memory. This flag will cause the process to exit if an out-of-memory exception occurs. Requires Java 8u92 or later.
-da
Disable Java assertions. Can provide a small performance improvement in production use.

Examples

Basic Variant Application

applyvariants.sh in=reference.fasta vcf=variants.vcf out=mutated.fasta

Apply all variants from the VCF file to the reference genome, creating a consensus sequence.
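To make the idea of applying variants to a reference concrete, here is a minimal Python sketch. It is not the tool's Java implementation: the function name apply_variants and the (pos, ref, alt) tuple layout are hypothetical, positions are assumed to be 1-based as in VCF, and the variants are assumed not to overlap.

```python
from typing import List, Tuple

# (pos, ref, alt): 1-based position, reference allele, alternate allele.
# This tuple layout is an assumption made for illustration only.
Variant = Tuple[int, str, str]


def apply_variants(reference: str, variants: List[Variant]) -> str:
    """Apply non-overlapping VCF-style variants to a reference string."""
    seq = reference
    # Work from the highest position down so that an edit never shifts
    # the coordinates of variants that are still waiting to be applied.
    for pos, ref, alt in sorted(variants, key=lambda v: v[0], reverse=True):
        start = pos - 1
        end = start + len(ref)
        if seq[start:end].upper() != ref.upper():
            raise ValueError(f"REF allele mismatch at position {pos}")
        seq = seq[:start] + alt + seq[end:]
    return seq


reference = "ACGTACGTAC"
variants = [
    (2, "C", "T"),    # substitution
    (5, "A", "AGG"),  # insertion of GG after position 5
    (8, "TA", "T"),   # deletion of the A at position 9
]
mutated = apply_variants(reference, variants)
print(mutated)  # ATGTAGGCGTC
```

Applying edits from right to left is one simple way to keep coordinates valid when insertions and deletions change the sequence length.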

Coverage-Filtered Variant Application

applyvariants.sh in=reference.fasta vcf=variants.vcf basecov=coverage.txt mincov=10 out=filtered_consensus.fasta

Apply variants but mask regions with coverage below 10x as N bases, ensuring only well-supported regions are included in the consensus.
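The masking rule described for mincov can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the tool's code: the function name mask_low_coverage is hypothetical, and the per-base depths are passed as an in-memory list rather than parsed from a basecov file.

```python
from typing import Sequence


def mask_low_coverage(sequence: str, depth: Sequence[int], mincov: int) -> str:
    """Replace bases whose depth is below mincov with N."""
    if mincov <= 0:
        # A non-positive threshold disables masking.
        return sequence
    if len(depth) != len(sequence):
        raise ValueError("depth must have one value per base")
    return "".join(
        "N" if d < mincov else base for base, d in zip(sequence, depth)
    )


masked = mask_low_coverage(
    "ACGTACGT", [12, 15, 9, 3, 10, 22, 0, 11], mincov=10
)
print(masked)  # ACNNACNT
```

A depth equal to the threshold is kept, because only bases below mincov are masked.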

Coding Sequence Variant Application

applyvariants.sh in=cds.fasta vcf=coding_variants.vcf noframeshifts=t maxindel=50 out=mutated_cds.fasta

Apply variants to coding sequences while filtering out frameshift indels and very large indels (>50bp) to maintain protein-coding integrity.
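The two indel filters can be expressed as a small predicate. The sketch below is hypothetical Python, not the tool's implementation; in particular it assumes that indel length is the difference between the REF and ALT allele lengths, and that substitutions are never affected by either filter.

```python
def keep_variant(ref: str, alt: str, maxindel: int = -1,
                 noframeshifts: bool = False) -> bool:
    """Return True if a variant passes the maxindel and noframeshifts filters."""
    # Assumption: indel length is the REF/ALT length difference.
    indel_len = abs(len(alt) - len(ref))
    if indel_len == 0:
        return True  # substitution, not an indel
    if maxindel > 0 and indel_len > maxindel:
        return False  # longer than the allowed maximum
    if noframeshifts and indel_len % 3 != 0:
        return False  # would shift the reading frame
    return True


# Settings from the example command: noframeshifts=t maxindel=50
print(keep_variant("A", "T", maxindel=50, noframeshifts=True))     # True
print(keep_variant("A", "ACGT", maxindel=50, noframeshifts=True))  # True
print(keep_variant("A", "AC", maxindel=50, noframeshifts=True))    # False
```

With the defaults (maxindel=-1, noframeshifts=f), every variant passes.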

Renamed Output Sequences

applyvariants.sh in=reference.fasta vcf=strain_variants.vcf name=strain_X addnumbers=t out=strain_sequences.fasta

Apply variants and rename all output sequences with the "strain_X" prefix, adding sequential numbers for uniqueness.
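How the renaming options might combine is sketched below in Python. The function rename_sequences is hypothetical, and the order in which the name, the original header, and the number are joined is an assumption: this page does not state where the tool places the number, so the sketch simply appends it last.

```python
from typing import List, Optional

# The literal words "space" and "tab" stand for the characters themselves.
DELIMITER_WORDS = {"space": " ", "tab": "\t"}


def rename_sequences(originals: List[str], name: Optional[str] = None,
                     addnumbers: bool = False, prefix: bool = True,
                     delimiter: str = "_") -> List[str]:
    """Build new sequence names from the renaming options."""
    if not name:
        return list(originals)  # no renaming requested
    delim = DELIMITER_WORDS.get(delimiter, delimiter)
    renamed = []
    for i, original in enumerate(originals, start=1):
        parts = [name]
        if prefix:
            parts.append(original)  # keep the original name after the prefix
        if addnumbers:
            parts.append(str(i))    # assumed placement: number goes last
        renamed.append(delim.join(parts))
    return renamed


print(rename_sequences(["chr1", "chr2"], name="strain_X", addnumbers=True))
print(rename_sequences(["chr1", "chr2"], name="strain_X", addnumbers=True,
                       prefix=False))
```

Under these assumptions, prefix=f together with addnumbers=t is what keeps names unique once the original names are discarded.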

Algorithm Details

Variant Processing Strategy

ApplyVariants implements a HashMap-based variant application algorithm that handles overlapping variants using allele count comparison:

Conflict Resolution

When multiple variants overlap at the same genomic position, ApplyVariants resolves conflicts by selecting the variant with the highest allele count from the VCF file. This ensures that the most supported variant is applied when conflicts occur.
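The rule "highest allele count wins" can be illustrated with a greedy selection. This Python sketch is not the tool's HashMap-based implementation; it compares reference intervals directly, and the Variant fields, the function name resolve_overlaps, and the tie-breaking behaviour (input order) are assumptions made for the example.

```python
from typing import List, NamedTuple


class Variant(NamedTuple):
    pos: int           # 1-based start position on the reference
    ref: str           # reference allele
    alt: str           # alternate allele
    allele_count: int  # support for the alternate allele


def resolve_overlaps(variants: List[Variant]) -> List[Variant]:
    """Keep the best-supported variant wherever reference spans overlap."""
    kept: List[Variant] = []
    # Visit variants from most to least supported; a variant is kept only
    # if its reference span is disjoint from every span already kept.
    for v in sorted(variants, key=lambda x: x.allele_count, reverse=True):
        start, end = v.pos, v.pos + len(v.ref)
        if all(end <= k.pos or start >= k.pos + len(k.ref) for k in kept):
            kept.append(v)
    return sorted(kept, key=lambda x: x.pos)


candidates = [
    Variant(5, "ACG", "A", 12),  # deletion spanning positions 5-7
    Variant(6, "C", "T", 30),    # substitution inside that span
    Variant(20, "G", "A", 4),    # unrelated substitution
]
for v in resolve_overlaps(candidates):
    print(v.pos, v.ref, v.alt, v.allele_count)
```

Here the deletion at position 5 is dropped because the overlapping substitution at position 6 has the higher allele count, while the variant at position 20 is unaffected.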

Variant Type Handling

Coverage Integration

When coverage data is provided, the algorithm applies a dual-filtering approach:

Memory Management

The tool uses efficient data structures for large-scale variant processing:

Sequence Name Handling

The algorithm supports flexible sequence renaming with multiple strategies:

Performance Characteristics

ApplyVariants is designed for efficient processing of large genomic datasets:

Support

For questions and support: