ApplyVariants

Basic Usage

applyvariants.sh in=<input file> vcf=<vcf file> out=<output file>

ApplyVariants processes reference sequences (FASTA format) and applies variants from a VCF file to generate mutated sequences. The tool is particularly useful for creating consensus sequences or strain-specific references by applying known variants to a reference genome.

Parameters

Parameters are organized into functional groups for processing variants, handling coverage data, renaming sequences, and controlling Java execution settings.

Standard parameters

in=<file>: Reference fasta input file containing the sequences to be mutated.
vcf=<file>: VCF file containing variants to apply to the reference. Must be properly formatted with chromosome/scaffold names matching the reference.
basecov=<file>: Optional per-base coverage file from BBMap or Pileup. Used with mincov parameter to mask low-coverage regions. Format should match the reference sequence names.
out=<file>: Output fasta file containing the mutated sequences with variants applied.
overwrite=f: (ow) Set to false to force the program to abort rather than overwrite an existing file. Default is false; pass overwrite=t to allow an existing output file to be replaced.
ziplevel=2: (zl) Compression level for compressed output files, from 1 (lowest/fastest) through 9 (maximum/slowest). Default is 2.

Processing parameters

mincov=0: Minimum coverage threshold. If positive, reference bases whose depth is below this value are changed to N. Requires a coverage file specified with the basecov parameter. Useful for masking unreliable low-coverage regions.
maxindel=-1: Maximum indel length to process. If positive, ignore indels longer than this value. Set to -1 (default) to process all indels regardless of size. Useful for filtering out very large structural variants.
noframeshifts=f: Filter out frameshifting indels. Set to true to ignore indels whose length is not a multiple of 3, preserving the reading frame in coding sequences. Default is false.
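
The two indel filters above can be read as a single predicate. The following Python sketch is illustrative only and is not the tool's source code: the function name keep_indel is hypothetical, and it assumes that indel length means the net difference between the REF and ALT allele lengths.

```python
def keep_indel(ref: str, alt: str, maxindel: int = -1,
               noframeshifts: bool = False) -> bool:
    """Return True if a variant passes the maxindel and noframeshifts filters."""
    # Assumption: indel length is the net length change between the alleles.
    indel_len = abs(len(alt) - len(ref))
    if indel_len == 0:
        return True   # substitutions are not affected by either filter
    if maxindel > 0 and indel_len > maxindel:
        return False  # longer than the allowed maximum
    if noframeshifts and indel_len % 3 != 0:
        return False  # would shift the reading frame
    return True


# A 1 bp insertion is dropped when noframeshifts is on; a 3 bp one is kept.
print(keep_indel("A", "AT", noframeshifts=True))
print(keep_indel("A", "ATTT", noframeshifts=True))
```

With the default maxindel=-1 the length check is skipped entirely, matching the "process all indels" behavior described above.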

Renaming parameters

name=: Optional new name for output sequences. If specified, all sequences will be renamed using this base name. Can be combined with addnumbers and prefix options.
addnumbers=f: Add sequential numbers (_1, _2, etc.) to ensure sequence names are unique when renaming. Default is false. Set to true when using name parameter with multiple sequences.
prefix=t: Use the name parameter as a prefix to the original name rather than replacing it. Default is true; set to false to replace existing names entirely.
delimiter=_: Symbol to place between parts of the new name when using prefix mode. Default is underscore (_). For space or tab, use the literal words "space" or "tab".

Java Parameters

-Xmx: Set Java's maximum memory usage, overriding autodetection. Examples: -Xmx20g specifies 20 gigabytes of RAM, -Xmx200m specifies 200 megabytes. The maximum is typically 85% of physical memory.
-eoom: Exit on out-of-memory. This flag will cause the process to exit if an out-of-memory exception occurs. Requires Java 8u92 or later.
-da: Disable Java assertions. Can provide a small performance improvement in production use.

Examples

Basic Variant Application

applyvariants.sh in=reference.fasta vcf=variants.vcf out=mutated.fasta

Apply all variants from the VCF file to the reference genome, creating a consensus sequence.

Coverage-Filtered Variant Application

applyvariants.sh in=reference.fasta vcf=variants.vcf basecov=coverage.txt mincov=10 out=filtered_consensus.fasta

Apply variants but mask regions with coverage below 10x as N bases, ensuring only well-supported regions are included in the consensus.

Coding Sequence Variant Application

applyvariants.sh in=cds.fasta vcf=coding_variants.vcf noframeshifts=t maxindel=50 out=mutated_cds.fasta

Apply variants to coding sequences while filtering out frameshifting indels and indels longer than 50 bp to maintain protein-coding integrity.

Renamed Output Sequences

applyvariants.sh in=reference.fasta vcf=strain_variants.vcf name=strain_X addnumbers=t out=strain_sequences.fasta

Apply variants and prepend "strain_X" to each original sequence name (prefix=t is the default), adding sequential numbers for uniqueness.

Algorithm Details

Variant Processing Strategy

ApplyVariants stores variants in a HashMap keyed by genomic position and resolves overlapping variants by comparing their allele counts.

Conflict Resolution

When multiple variants overlap at the same genomic position, ApplyVariants resolves conflicts by selecting the variant with the highest allele count from the VCF file. This ensures that the most supported variant is applied when conflicts occur.
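
As a rough illustration of this rule, the Python sketch below keeps one variant per position in a dictionary. It is a simplified model rather than the tool's code: the Variant type and its count field are hypothetical, only variants sharing the same start position are treated as conflicting, and ties go to the first variant seen, which is an assumption.

```python
from typing import Dict, Iterable, NamedTuple, Tuple


class Variant(NamedTuple):
    chrom: str
    pos: int    # 0-based start position on the reference
    ref: str
    alt: str
    count: int  # allele count taken from the VCF (hypothetical field name)


def resolve_conflicts(variants: Iterable[Variant]) -> Dict[Tuple[str, int], Variant]:
    """Keep, for each (chrom, pos), the variant with the highest allele count."""
    best: Dict[Tuple[str, int], Variant] = {}
    for v in variants:
        key = (v.chrom, v.pos)
        current = best.get(key)
        # Assumption: on a tie, the variant seen first is kept.
        if current is None or v.count > current.count:
            best[key] = v
    return best


calls = [
    Variant("chr1", 10, "A", "G", 5),
    Variant("chr1", 10, "A", "T", 12),
    Variant("chr1", 20, "C", "CT", 3),
]
winners = resolve_conflicts(calls)
print(winners[("chr1", 10)].alt)
```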

Variant Type Handling

Substitutions: Direct base replacement preserving sequence length
Insertions: Inserted bases are added at the specified position
Deletions: Reference bases are removed from the specified range
Complex variants: Combined insertions/deletions handled as atomic operations
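
The four cases above can all be expressed as "replace the REF allele with the ALT allele at a position". The Python sketch below shows that idea on a plain string. It is a minimal model under stated assumptions, not the tool's implementation: apply_variants is a hypothetical name, positions are 0-based, alleles follow the VCF convention of including the anchor base for indels, and variants are assumed not to overlap.

```python
from typing import Dict, Tuple


def apply_variants(reference: str, variants: Dict[int, Tuple[str, str]]) -> str:
    """Apply non-overlapping variants to a reference sequence.

    variants maps a 0-based start position to a (ref_allele, alt_allele) pair.
    """
    out = []
    i = 0
    while i < len(reference):
        v = variants.get(i)
        # Apply only if the REF allele is non-empty and matches the reference.
        if v is not None and v[0] and reference.startswith(v[0], i):
            ref_allele, alt_allele = v
            out.append(alt_allele)  # emit the ALT allele
            i += len(ref_allele)    # skip the reference bases it replaces
        else:
            out.append(reference[i])
            i += 1
    return "".join(out)


ref = "ACGTACGT"
print(apply_variants(ref, {1: ("C", "T")}))    # substitution
print(apply_variants(ref, {3: ("T", "TGG")}))  # insertion after the anchor base
print(apply_variants(ref, {4: ("ACG", "A")}))  # deletion of two bases
```

Because each variant is consumed in one step, an insertion or deletion is applied as a unit, and a variant whose REF allele does not match the reference is left unapplied in this sketch.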

Coverage Integration

When coverage data is provided, low coverage affects both the reference bases and the variants applied to them:

Low-coverage regions (below mincov threshold) are masked as N bases
Indels in low-coverage regions are filtered out before application
Substitutions in low-coverage regions result in N bases rather than the variant base
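
The three rules above can be sketched in Python as follows. This is an illustration under stated assumptions, not the tool's code: apply_with_coverage is a hypothetical name, depth is a simple per-base list rather than a parsed basecov file, and a variant is treated as low-coverage based on the depth at its start position only.

```python
from typing import Dict, List, Tuple


def apply_with_coverage(reference: str,
                        variants: Dict[int, Tuple[str, str]],
                        depth: List[int],
                        mincov: int) -> str:
    """Apply variants, masking low-coverage positions as N."""
    out = []
    i = 0
    while i < len(reference):
        low = mincov > 0 and depth[i] < mincov
        v = variants.get(i)
        if v is not None and v[0] and reference.startswith(v[0], i):
            ref_allele, alt_allele = v
            is_indel = len(ref_allele) != len(alt_allele)
            if not low:
                out.append(alt_allele)             # well covered: apply as usual
                i += len(ref_allele)
                continue
            if not is_indel:
                out.append("N" * len(ref_allele))  # low-coverage substitution
                i += len(ref_allele)
                continue
            # Low-coverage indel: ignored, fall through to per-base handling.
        out.append("N" if low else reference[i])
        i += 1
    return "".join(out)


ref = "ACGTACGT"
depth = [20, 20, 3, 3, 20, 20, 20, 20]
calls = {1: ("C", "T"), 2: ("G", "A"), 3: ("T", "TGG"), 5: ("C", "CA")}
print(apply_with_coverage(ref, calls, depth, mincov=10))
```

With mincov=0 no position counts as low-coverage, so every variant is applied and nothing is masked.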

Memory Management

The tool uses efficient data structures for large-scale variant processing:

HashMap-based variant storage: O(1) lookup time for variants by genomic position
ByteBuilder for sequence construction: Efficient string building for mutated sequences
CoverageArray integration: Memory-efficient storage of per-base coverage data

Sequence Name Handling

The algorithm supports flexible sequence renaming with multiple strategies:

Prefix mode: Maintains original names with added prefix
Replacement mode: Completely replaces existing names
Numbering system: Adds sequential identifiers for uniqueness
Custom delimiters: Configurable separators between name components
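
To make the interaction of these options concrete, here is a small Python sketch of one way the name parts could be assembled. The exact output format is an assumption: the function rename is hypothetical, and the ordering (name, then original name, then number) and the use of the delimiter before the number are guesses based on the parameter descriptions, not on the tool's source.

```python
from typing import Optional


def rename(old_name: str, index: int, name: Optional[str] = None,
           addnumbers: bool = False, prefix: bool = True,
           delimiter: str = "_") -> str:
    """Build an output sequence name from the renaming options."""
    # The literal words "space" and "tab" stand in for whitespace delimiters.
    if delimiter == "space":
        delimiter = " "
    elif delimiter == "tab":
        delimiter = "\t"
    if name is None:
        return old_name            # no renaming requested
    parts = [name]
    if prefix:
        parts.append(old_name)     # prefix mode keeps the original name
    if addnumbers:
        parts.append(str(index))   # assumed: number joined with the delimiter
    return delimiter.join(parts)


print(rename("chr1", 1, name="strain_X", addnumbers=True))
print(rename("chr1", 2, name="strain_X", prefix=False, addnumbers=True))
```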

Performance Characteristics

ApplyVariants is designed for efficient processing of large genomic datasets:

Linear time complexity: O(n) where n is the total sequence length
Memory usage: Approximately 4-8GB for typical mammalian genomes
Scalability: Handles thousands of variants per chromosome efficiently
I/O optimization: Streaming processing minimizes memory footprint

Support

For questions and support:

Email: bbushnell@lbl.gov
Documentation: bbmap.org