ApplyVariants
Mutates a reference by applying a set of variants. When two variants overlap, the one with the higher allele count is used.
Basic Usage
applyvariants.sh in=<input file> vcf=<vcf file> out=<output file>
ApplyVariants processes reference sequences (FASTA format) and applies variants from a VCF file to generate mutated sequences. The tool is particularly useful for creating consensus sequences or strain-specific references by applying known variants to a reference genome.
Parameters
Parameters are organized into four groups: standard input/output parameters, processing parameters, renaming parameters, and Java settings.
Standard parameters
- in=<file>
- Reference fasta input file containing the sequences to be mutated.
- vcf=<file>
- VCF file containing variants to apply to the reference. Must be properly formatted with chromosome/scaffold names matching the reference.
- basecov=<file>
- Optional per-base coverage file from BBMap or Pileup. Used with the mincov parameter to mask low-coverage regions. Sequence names in the coverage file should match those in the reference.
- out=<file>
- Output fasta file containing the mutated sequences with variants applied.
- overwrite=f
- (ow) Controls whether existing output files may be overwritten. When false (the default), the program aborts rather than overwrite an existing file; set overwrite=t to allow overwriting.
- ziplevel=2
- (zl) Compression level for output files, from 1 (lowest compression, fastest) through 9 (maximum compression, slowest). Default is 2.
Processing parameters
- mincov=0
- Minimum coverage threshold. If positive, reference bases whose depth is below this value are changed to N. Requires a coverage file specified with the basecov parameter. Useful for masking unreliable low-coverage regions.
- maxindel=-1
- Maximum indel length to process. If positive, ignore indels longer than this value. Set to -1 (default) to process all indels regardless of size. Useful for filtering out very large structural variants.
- noframeshifts=f
- Filter out frameshifting indels. Set to true to ignore indels whose length is not a multiple of 3, preserving the reading frame in coding sequences. Useful for protein-coding regions.
Renaming parameters
- name=
- Optional new name for output sequences. If specified, all sequences will be renamed using this base name. Can be combined with addnumbers and prefix options.
- addnumbers=f
- Add sequential numbers (_1, _2, etc.) to ensure sequence names are unique when renaming. Default is false. Set to true when using name parameter with multiple sequences.
- prefix=t
- Use the value of the name parameter as a prefix to the original name rather than replacing it. Default is true; when false, the new name replaces the original name entirely.
- delimiter=_
- Symbol to place between parts of the new name when using prefix mode. Default is underscore (_). For space or tab, use the literal words "space" or "tab".
Java Parameters
- -Xmx
- Set Java's maximum memory usage, overriding autodetection. Examples: -Xmx20g specifies 20 gigabytes of RAM, -Xmx200m specifies 200 megabytes. The maximum is typically 85% of physical memory.
- -eoom
- Exit on out-of-memory. This flag will cause the process to exit if an out-of-memory exception occurs. Requires Java 8u92 or later.
- -da
- Disable Java assertions. Can provide a small performance improvement in production use.
Examples
Basic Variant Application
applyvariants.sh in=reference.fasta vcf=variants.vcf out=mutated.fasta
Apply all variants from the VCF file to the reference genome, creating a consensus sequence.
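Because the VCF must use the same chromosome/scaffold names as the reference, it can help to check for mismatches before running the tool. The following is a minimal Python sketch of such a check, not part of ApplyVariants itself. The function names are hypothetical, and it assumes the full FASTA header (everything after ">") is compared against the VCF CHROM column; whether the tool matches on the full header or only its first word is not stated here.

```python
def fasta_names(lines):
    """Return the set of sequence names from FASTA header lines (text after '>')."""
    return {line[1:].strip() for line in lines if line.startswith(">")}


def vcf_contigs(lines):
    """Return the set of CHROM values (first column) from VCF data lines."""
    return {line.split("\t", 1)[0] for line in lines
            if line.strip() and not line.startswith("#")}


def unmatched_contigs(fasta_lines, vcf_lines):
    """Contig names used in the VCF that have no matching reference sequence."""
    return sorted(vcf_contigs(vcf_lines) - fasta_names(fasta_lines))


# Tiny in-memory example; real files would be read line by line instead.
fasta = [">chr1", "ACGT", ">chr2", "GGCC"]
vcf = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "chr1\t2\t.\tC\tT\t30\tPASS\t.",
    "chr3\t1\t.\tG\tA\t30\tPASS\t.",
]
print(unmatched_contigs(fasta, vcf))
```

Any name reported here appears in the VCF but not in the reference, so its variants would have no sequence to apply to.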
Coverage-Filtered Variant Application
applyvariants.sh in=reference.fasta vcf=variants.vcf basecov=coverage.txt mincov=10 out=filtered_consensus.fasta
Apply variants but mask regions with coverage below 10x as N bases, ensuring only well-supported regions are included in the consensus.
Coding Sequence Variant Application
applyvariants.sh in=cds.fasta vcf=coding_variants.vcf noframeshifts=t maxindel=50 out=mutated_cds.fasta
Apply variants to coding sequences while filtering out frameshift indels and very large indels (>50bp) to maintain protein-coding integrity.
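The effect of combining noframeshifts and maxindel can be illustrated with a small Python sketch. This is an illustration of the filtering rules as described in the parameter list, not the tool's code; the function names are hypothetical, and it assumes an indel's length is the difference between the REF and ALT allele lengths.

```python
def indel_length(ref, alt):
    """Net length change between REF and ALT; 0 for substitutions."""
    return abs(len(alt) - len(ref))


def keep_variant(ref, alt, maxindel=-1, noframeshifts=False):
    """Return True if the variant passes the maxindel and noframeshifts filters."""
    n = indel_length(ref, alt)
    if n == 0:
        return True   # substitutions are not affected by either filter
    if maxindel > 0 and n > maxindel:
        return False  # indel longer than the allowed maximum
    if noframeshifts and n % 3 != 0:
        return False  # length not a multiple of 3: would shift the reading frame
    return True


print(keep_variant("A", "ACGT", maxindel=50, noframeshifts=True))  # 3 bp insertion
print(keep_variant("AC", "A", maxindel=50, noframeshifts=True))    # 1 bp deletion
print(keep_variant("A", "A" + "C" * 60, maxindel=50))              # 60 bp insertion
```

With maxindel left at -1 and noframeshifts at false, every variant passes, which matches the documented defaults.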
Renamed Output Sequences
applyvariants.sh in=reference.fasta vcf=strain_variants.vcf name=strain_X addnumbers=t out=strain_sequences.fasta
Apply variants and rename all output sequences with "strain_X" prefix, adding sequential numbers for uniqueness.
Algorithm Details
Variant Processing Strategy
ApplyVariants stores variants in a HashMap and resolves overlapping variants by comparing their allele counts.
Conflict Resolution
When multiple variants overlap at the same genomic position, ApplyVariants resolves conflicts by selecting the variant with the highest allele count from the VCF file. This ensures that the most supported variant is applied when conflicts occur.
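One way to picture this rule is a greedy pass that visits variants from highest to lowest allele count and keeps each one only if it does not overlap a variant already kept. The Python sketch below is an assumption-laden illustration, not the tool's implementation: the Variant class, its field names, and the tie-breaking behavior are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Variant:
    start: int  # 0-based start of the REF allele on the reference
    stop: int   # exclusive end of the REF allele (never empty in VCF-style records)
    alt: str    # replacement bases
    count: int  # allele count supporting this variant


def overlaps(a, b):
    """True if the reference spans of two variants share at least one base."""
    return a.start < b.stop and b.start < a.stop


def resolve_conflicts(variants):
    """Keep the best-supported variant from each group of overlapping variants."""
    kept = []
    # Visit variants from highest to lowest allele count (assumed tie order: input order).
    for v in sorted(variants, key=lambda v: -v.count):
        if not any(overlaps(v, k) for k in kept):
            kept.append(v)
    return sorted(kept, key=lambda v: v.start)


variants = [
    Variant(10, 11, "T", 12),  # substitution, overlaps the deletion below
    Variant(10, 13, "A", 30),  # deletion with higher allele count
    Variant(20, 21, "G", 5),   # unrelated substitution
]
for v in resolve_conflicts(variants):
    print(v)
```

In this example the substitution at position 10 is discarded because the overlapping deletion has the higher allele count, while the variant at position 20 is unaffected.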
Variant Type Handling
- Substitutions: Direct base replacement preserving sequence length
- Insertions: Inserted bases are added at the specified position
- Deletions: Reference bases are removed from the specified range
- Complex variants: Combined insertions/deletions handled as atomic operations
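The variant types above can all be treated as "replace the REF bases with the ALT bases", applied left to right while copying unchanged reference sequence in between. The following Python sketch shows that idea under stated assumptions: variants are hypothetical (pos, ref, alt) tuples with 0-based positions, overlaps have already been resolved, and the function name is illustrative rather than taken from the tool.

```python
def apply_variants(reference, variants):
    """Apply non-overlapping (pos, ref, alt) variants; pos is 0-based."""
    out = []
    cursor = 0  # next reference position that has not been copied yet
    for pos, ref, alt in sorted(variants):
        if pos < cursor:
            raise ValueError("overlapping variants must be resolved first")
        if reference[pos:pos + len(ref)] != ref:
            raise ValueError(f"REF allele does not match reference at {pos}")
        out.append(reference[cursor:pos])  # unchanged bases before the variant
        out.append(alt)                    # substituted, inserted, or retained bases
        cursor = pos + len(ref)            # skip the reference bases that were replaced
    out.append(reference[cursor:])         # unchanged tail of the sequence
    return "".join(out)


reference = "ACGTACGT"
variants = [
    (1, "C", "T"),    # substitution: length preserved
    (3, "T", "TGG"),  # insertion: two bases added after the anchor base
    (5, "CG", "C"),   # deletion: one base removed after the anchor base
]
print(apply_variants(reference, variants))  # prints ATGTGGACT
```

Collecting pieces in a list and joining once at the end plays a role similar to building the output in a growable buffer, which avoids repeatedly copying the whole sequence.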
Coverage Integration
When coverage data is provided and mincov is positive, the algorithm applies the following rules:
- Low-coverage regions (below mincov threshold) are masked as N bases
- Indels in low-coverage regions are filtered out before application
- Substitutions in low-coverage regions result in N bases rather than the variant base
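These rules can be sketched in Python as a masking step plus a variant filter. This is an illustration only: the function name is hypothetical, coverage is assumed to be an in-memory list with one depth per reference base, variants are hypothetical (pos, ref, alt) tuples with 0-based positions, and a variant is assumed to be judged by the depth at its first reference base.

```python
def mask_and_filter(reference, coverage, variants, mincov):
    """Return (masked_reference, kept_variants) given one depth value per base."""
    if len(coverage) != len(reference):
        raise ValueError("coverage must have one value per reference base")
    if mincov <= 0:
        return reference, list(variants)  # masking is disabled
    low = [depth < mincov for depth in coverage]
    # Low-coverage reference bases become N.
    masked = "".join("N" if is_low else base
                     for base, is_low in zip(reference, low))
    # Variants starting in a low-coverage region are dropped: indels are not
    # applied, and a dropped substitution leaves the masked N in place.
    kept = [v for v in variants if not low[v[0]]]
    return masked, kept


reference = "ACGTACGT"
coverage = [12, 12, 3, 3, 12, 12, 12, 12]
variants = [(1, "C", "T"), (2, "G", "GAA"), (3, "T", "A")]
masked, kept = mask_and_filter(reference, coverage, variants, mincov=10)
print(masked)  # prints ACNNACGT
print(kept)    # prints [(1, 'C', 'T')]
```

Only the substitution at position 1 survives, because the insertion at position 2 and the substitution at position 3 both fall where depth is below 10.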
Memory Management
The tool uses efficient data structures for large-scale variant processing:
- HashMap-based variant storage: O(1) lookup time for variants by genomic position
- ByteBuilder for sequence construction: Efficient string building for mutated sequences
- CoverageArray integration: Memory-efficient storage of per-base coverage data
Sequence Name Handling
The algorithm supports flexible sequence renaming with multiple strategies:
- Prefix mode: Maintains original names with added prefix
- Replacement mode: Completely replaces existing names
- Numbering system: Adds sequential identifiers for uniqueness
- Custom delimiters: Configurable separators between name components
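How these strategies might combine is sketched below in Python. The exact order in which the tool joins the new name, the original name, and the sequence number is not specified here, so the ordering in this sketch is an assumption, and the function and its index argument are hypothetical.

```python
def rename(original, name="", prefix=True, delimiter="_", addnumbers=False, index=1):
    """Build a new sequence name from the renaming options (assumed ordering)."""
    # The literal words "space" and "tab" stand in for the whitespace characters.
    delimiter = {"space": " ", "tab": "\t"}.get(delimiter, delimiter)
    if not name:
        return original                      # no renaming requested
    parts = [name, original] if prefix else [name]
    if addnumbers:
        parts.append(str(index))             # sequential number for uniqueness
    return delimiter.join(parts)


print(rename("chr1", name="strain_X"))                                      # prefix mode
print(rename("chr1", name="strain_X", prefix=False, addnumbers=True, index=2))  # replacement mode
print(rename("chr1", name="strain_X", delimiter="space"))                   # space delimiter
```

Under these assumptions the three calls produce "strain_X_chr1", "strain_X_2", and "strain_X chr1" respectively; replacement mode without addnumbers would give every sequence the same name, which is why numbering matters for multi-sequence files.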
Performance Characteristics
ApplyVariants is designed for efficient processing of large genomic datasets:
- Linear time complexity: O(n) where n is the total sequence length
- Memory usage: Approximately 4-8GB for typical mammalian genomes
- Scalability: Handles thousands of variants per chromosome efficiently
- I/O optimization: Streaming processing minimizes memory footprint
Support
For questions and support:
- Email: bbushnell@lbl.gov
- Documentation: bbmap.org