Mutate
Creates a mutant version of a genome with customizable substitution and indel rates. Also produces a VCF file listing the added mutations for downstream analysis.
Basic Usage
mutate.sh in=<input file> out=<output file> id=<identity>
Creates a mutated version of the input genome with the specified target identity level.
Parameters
Parameters are organized by their function in the mutation process. All parameters from the shell script are documented below.
I/O parameters
- in=<file>
- Input genome in FASTA format. Can be compressed with gzip.
- out=<file>
- Output mutant genome in FASTA format. Will contain the mutated sequences with the same headers (unless prefix is specified).
- vcf=<file>
- Output VCF file showing all variations added during mutation. Contains detailed information about each substitution, insertion, and deletion with precise genomic coordinates.
- overwrite=f
- (ow) Set to false to force the program to abort rather than overwrite an existing file. Default is true, allowing file overwriting.
- ziplevel=2
- (zl) Set compression level from 1 (lowest/fastest) through 9 (maximum/slowest). Lower compression is faster but produces larger files. Default: 2.
Processing parameters
- subrate=0
- Substitution rate, 0 to 1. Fraction of bases that will be changed to a different base. Can be specified as decimal (0.05) or percentage (5). Default: 0.
- insrate=0
- Insertion rate, 0 to 1. Fraction of positions where insertions will occur. Insertions add new bases to the sequence. Can be specified as decimal or percentage. Default: 0.
- delrate=0
- Deletion rate, 0 to 1. Fraction of positions where deletions will occur. Deletions remove bases from the sequence. Can be specified as decimal or percentage. Default: 0.
- indelrate=0
- Combined indel rate that sets both insertion and deletion rates. Each rate will be set to half of this value (e.g., indelrate=0.02 sets insrate=0.01 and delrate=0.01). Default: 0.
- maxindel=1
- Maximum length of individual insertions or deletions. The actual length is chosen randomly up to this maximum using a triple-minimum distribution for realistic size variation. Default: 1.
- indelspacing=3
- Minimum distance required between subsequent indels. Prevents clustering of insertions and deletions, which can make variant calling more difficult. Default: 3.
- id=1
- Target identity level, 0 to 1 (or 0-100 as percentage). When specified, this overrides subrate and indelrate settings. The mutation distribution is 99% substitutions and 1% indels. For example, id=0.95 creates 5% total mutations. Default: 1 (no mutations).
- fraction=1
- Genome fraction to retain, 0 to 1 (or 0-100 as percentage). A value less than 1 randomly selects that fraction of each sequence, possibly creating chimeric junctions. Not compatible with VCF output. Default: 1 (retain entire genome).
- period=-1
- If positive, places exactly one mutation every X bases in a regular pattern. This creates uniform mutation spacing instead of random distribution. Useful for creating predictable test cases. Default: -1 (random distribution).
- prefix=
- Rename output contigs with this prefix followed by a sequential number. For example, prefix=mutant_ creates contigs named mutant_1, mutant_2, etc. Default: empty (preserve original names).
- amino=f
- Treat the input sequence as amino acid rather than nucleotide sequence. Changes the mutation alphabet to amino acids instead of ATCG. Default: false.
- ploidy=1
- Set the ploidy level. Values greater than 1 enable heterozygous mutations and create multiple copies of each input sequence. For ploidy > 1, uses MutateGenome2 class. Default: 1.
- hetrate=0.5
- For polyploid genomes (ploidy > 1), fraction of mutations that are heterozygous versus homozygous. Only applies when ploidy > 1. Default: 0.5.
- nohomopolymers=f
- Prevent indels within homopolymers that could lead to ambiguous variant calls. For example, prevents inserting A between AA bases or deleting T from TTTT runs. Improves variant calling concordance by ~70% but is currently disabled. Default: false.
- pad=0
- Add this many random bases to both ends of input sequences. Individual control available with padleft and padright parameters. Useful for simulating incomplete assembly ends. Default: 0.
Java Parameters
- -Xmx
- Set Java's maximum memory usage, overriding automatic detection. Examples: -Xmx20g for 20 gigabytes, -Xmx200m for 200 megabytes. Maximum is typically 85% of physical memory. Default: 4g.
- -eoom
- Exit on out-of-memory exception rather than attempting recovery. Requires Java 8u92 or later. Useful for cluster environments where memory exhaustion should terminate the job.
- -da
- Disable Java assertions. Can provide minor performance improvement in production use but removes runtime safety checks during development.
Examples
Basic Mutation
# Create genome with 5% mutations (95% identity)
mutate.sh in=reference.fa out=mutant.fa id=0.95
# Specify mutation types explicitly
mutate.sh in=reference.fa out=mutant.fa subrate=0.04 indelrate=0.01
# Create mutations with VCF output
mutate.sh in=reference.fa out=mutant.fa vcf=mutations.vcf id=0.90
Advanced Mutation Control
# Large indels with minimum spacing
mutate.sh in=reference.fa out=mutant.fa maxindel=10 indelspacing=50 indelrate=0.005
# Regular mutation pattern every 100 bases
mutate.sh in=reference.fa out=mutant.fa period=100 subrate=0.01
# Rename contigs and add padding
mutate.sh in=reference.fa out=mutant.fa prefix=sim_ pad=1000 id=0.98
Amino Acid Sequences
# Mutate protein sequences
mutate.sh in=proteins.fa out=mutant_proteins.fa amino=t id=0.95
# Create partial genome with chimeric junctions
mutate.sh in=reference.fa out=partial.fa fraction=0.7 id=0.99
Algorithm Details
Mutation Engine Architecture
The mutation engine operates in two distinct modes depending on the period parameter:
Probabilistic Mode (period=-1, default)
Uses independent random evaluation at each position with float comparisons against error rate thresholds (lines 441-446 in MutateGenome.java). The algorithm:
- Conservation modeling: Uses ConservationModel class with sinewave patterns that can protect certain genomic regions from mutation based on shouldMutatePosition() method
- Indel spacing enforcement: Maintains minimum distance between indels using lastIndel tracking to prevent clustering
- Triple-minimum distribution: Indel lengths are selected using Tools.min(randy.nextInt(lim), randy.nextInt(lim), randy.nextInt(lim)) for realistic size distributions
- Homopolymer protection: Optional prevention of indels within homopolymer runs using isHomopolymerDel() and isHomopolymerIns() methods that create ambiguous variants
Period Mode (period > 0)
Places exactly one mutation every X bases in a regular pattern. Useful for:
- Creating predictable test datasets
- Uniform mutation density for benchmarking
- Systematic evaluation of variant callers
Mutation Type Distribution
When using the identity parameter (id), mutations are distributed as:
- 99% substitutions: Single base changes using AminoAcid.numberToBase[((AminoAcid.baseToNumber[b]+randy.nextInt(3)+1)&3)] for even distribution
- 1% indels: Equal split between insertions and deletions (insRate=delRate=x*0.005f)
VCF Generation
The VCF output includes:
- Precise coordinates: 1-based positioning following VCF 4.2 specification
- Variant condensation: Adjacent indels are automatically fused into complex variants when appropriate
- Quality scores: Fixed at 60.00 for simulated variants
- Metadata fields: Scaffold number, start/stop positions, and variant type annotations
Memory Efficiency
The algorithm processes sequences individually with streaming I/O, maintaining constant memory usage regardless of genome size. Default allocation of 4GB handles most genomes efficiently.
Amino Acid Support
For protein sequences, the mutation alphabet expands to all 21 amino acids including stop codons. Mutation selection uses modulo 21 arithmetic to ensure even distribution across amino acid types.
Performance Characteristics
Speed and Scalability
- Throughput: Processes ~10-50 Mbp/second depending on mutation rate and indel complexity
- Memory usage: Constant memory footprint, independent of genome size
- I/O efficiency: Streaming processing minimizes disk access
Recommended Settings
- Large genomes: Use streaming with moderate memory allocation (-Xmx8g)
- High mutation rates: Consider larger indelspacing values to maintain variant quality
- Variant calling evaluation: Enable nohomopolymers for cleaner variant calls
Support
For questions and support:
- Email: bbushnell@lbl.gov
- Documentation: bbmap.org