MakeContaminatedGenomes
Generates synthetic contaminated partial genomes from clean genomes. Output is formatted as (prefix)_bases1_fname1_bases2_fname2_counter_(suffix).
Basic Usage
makecontaminatedgenomes.sh in=<file> out=<pattern>
This tool creates synthetic contaminated genomes by fusing two randomly selected genomes together. It takes a file containing input file paths and generates chimeric sequences with customizable size fractions and mutation rates.
Parameters
Parameters are organized by their function in the contaminated genome generation process.
I/O Parameters
- in=<file>
- A file containing one input file path per line. Each line should specify the path to a FASTA file that will be used as a source genome for creating contaminated sequences.
- out=<pattern>
- An output file name pattern containing a # symbol (or whatever pattern is set with regex=). The matched symbol is replaced by source filenames and size information to create descriptive output file names.
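The list file expected by in= can be produced by hand or with a few lines of script. The following is a minimal Python sketch, not part of the tool; the function name write_genome_list, the directory layout, and the *.fasta glob are all assumptions made for illustration.

```python
from pathlib import Path


def write_genome_list(genome_dir, list_path, pattern="*.fasta"):
    """Write one absolute FASTA path per line, the layout described for in=<file>.

    genome_dir, list_path and pattern are hypothetical names for this sketch.
    """
    paths = sorted(str(p.resolve()) for p in Path(genome_dir).glob(pattern))
    Path(list_path).write_text("".join(p + "\n" for p in paths))
    return paths


# Example (paths are placeholders):
# write_genome_list("genomes", "genome_list.txt")
```

The resulting file can then be passed as in=genome_list.txt.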
Processing Parameters
- count=1
- Number of output files to make. Each output file will be a contaminated genome created from two randomly selected input genomes.
- seed=-1
- RNG seed; negative for a random seed. Controls the random number generator for reproducible results. Use a specific positive value to ensure identical output across runs.
- exp1=1
- Exponent for genome 1 size fraction. Controls the size distribution of the first genome fragment. Higher values bias toward smaller fractions (Math.pow(random, exponent)).
- exp2=1
- Exponent for genome 2 size fraction. Controls the size distribution of the second genome fragment. Higher values bias toward smaller fractions (Math.pow(random, exponent)).
- subrate=0
- Rate to add substitutions to new genomes (0-1). Probability of substituting each base with a different nucleotide. Value of 0 means no substitutions.
- indelrate=0
- Rate to add indels to new genomes (0-1). Probability of inserting or deleting a base at each position. Combined with subrate to give the total error rate.
- regex=#
- Use this substitution regex for replacement. The pattern in the output filename that will be replaced with the generated filename containing size and source information.
- delimiter=_
- Use this delimiter in the new file names. Character used to separate components in the automatically generated output filenames.
Additional Parameters
- verbose=f
- Print status messages during processing. Enables detailed output for debugging and monitoring progress.
- chimeras
- Alias for count parameter. Number of chimeric genomes to create.
- exp
- Set both exp1 and exp2 to the same value. Convenient way to apply the same exponent to both genome size fractions.
- id
- Target identity level, given as a fraction (for example 0.95). Automatically sets subrate to 99% of the error rate and indelrate to 1% of the error rate to achieve the specified identity.
- ani
- Alias for id parameter. Sets average nucleotide identity by calculating appropriate substitution and indel rates.
- identity
- Alias for id parameter. Sets sequence identity by calculating mutation rates.
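The arithmetic behind id, ani, and identity can be worked through in a short Python sketch. It assumes the error rate is simply 1 minus the identity, which the text implies but does not state; the function name rates_from_identity is hypothetical and is not part of the tool.

```python
def rates_from_identity(identity):
    """Split the error rate into 99% substitutions and 1% indels.

    Assumption: error rate = 1 - identity, with identity given as a fraction.
    """
    if not 0.0 <= identity <= 1.0:
        raise ValueError("identity must be between 0 and 1")
    error_rate = 1.0 - identity
    sub_rate = 0.99 * error_rate    # 99% of the error rate
    indel_rate = 0.01 * error_rate  # 1% of the error rate
    return sub_rate, indel_rate


sub_rate, indel_rate = rates_from_identity(0.95)
print(f"subrate={sub_rate:.4f} indelrate={indel_rate:.4f}")
# prints: subrate=0.0495 indelrate=0.0005
```

Under that assumption, identity=0.95 corresponds to subrate=0.0495 and indelrate=0.0005.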
Java Parameters
- -Xmx
- This will set Java's memory usage, overriding autodetection. -Xmx20g will specify 20 gigs of RAM, and -Xmx200m will specify 200 megs. The max is typically 85% of physical memory.
- -eoom
- This flag will cause the process to exit if an out-of-memory exception occurs. Requires Java 8u92+.
- -da
- Disable assertions.
Examples
Basic Contaminated Genome Generation
makecontaminatedgenomes.sh in=genome_list.txt out=contaminated_#.fasta
Creates one contaminated genome from randomly selected genomes listed in genome_list.txt. The output file will have a descriptive name containing size and source information.
Multiple Contaminated Genomes with Mutations
makecontaminatedgenomes.sh in=genome_list.txt out=contaminated_#.fasta count=10 subrate=0.01 indelrate=0.001
Creates 10 contaminated genomes with 1% substitution rate and 0.1% indel rate, adding realistic mutations to simulate evolutionary divergence.
Biased Fragment Sizes
makecontaminatedgenomes.sh in=genome_list.txt out=contaminated_#.fasta exp1=2 exp2=0.5 seed=12345
Creates one contaminated genome (count defaults to 1) with biased fragment sizes: exp1=2 favors smaller fragments from genome 1, while exp2=0.5 favors larger fragments from genome 2. The fixed seed makes the run reproducible.
Identity-Based Generation
makecontaminatedgenomes.sh in=genome_list.txt out=contaminated_#.fasta count=5 identity=0.95
Creates 5 contaminated genomes at 95% identity. The 5% error rate is split automatically into a substitution rate of 4.95% (99% of the error rate) and an indel rate of 0.05% (1% of the error rate).
Algorithm Details
Chimeric Genome Creation Process
MakeContaminatedGenomes creates synthetic contaminated genomes through a multi-step process that simulates realistic contamination scenarios:
Genome Selection and Pairing
The algorithm randomly selects two different genomes from the input file list for each contaminated genome. It ensures no genome is paired with itself by continuing to select until two different genomes are chosen.
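The reselect-until-different behavior described above might look like the following Python sketch. It is an illustration rather than the tool's Java code: the name pick_two_different is hypothetical, and comparing genomes by list index (rather than by path) is an assumption. The sketch also makes explicit that at least two input genomes are needed, since otherwise the loop could never finish.

```python
import random


def pick_two_different(paths, rng):
    """Pick two distinct entries from paths by redrawing the second until it differs."""
    if len(paths) < 2:
        raise ValueError("need at least two input genomes")
    a = rng.randrange(len(paths))
    b = rng.randrange(len(paths))
    while b == a:  # keep selecting until the two indices differ
        b = rng.randrange(len(paths))
    return paths[a], paths[b]


rng = random.Random(12345)  # a fixed seed gives a repeatable pairing
first, second = pick_two_different(["g1.fa", "g2.fa", "g3.fa"], rng)
```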
Fragment Size Calculation
For each genome pair, the algorithm calculates size fractions using power distributions:
- fracA = Math.pow(random(), exp1) - Controls genome A fragment size
- fracB = Math.pow(random(), exp2) - Controls genome B fragment size
- Higher exponents bias toward smaller fragments (more aggressive truncation)
- Lower exponents bias toward larger fragments (more complete genomes)
- Exponent of 1 provides uniform distribution
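The effect of the exponent is easy to check numerically. For a uniform random value raised to the power k, the expected fraction is 1/(k+1), so exp=2 averages about one third of the genome and exp=0.5 about two thirds. The Python sketch below samples this directly; the function names and sample count are illustrative choices, not taken from the tool.

```python
import random


def size_fraction(rng, exponent):
    """One size fraction: a uniform random value in [0, 1) raised to the exponent."""
    return rng.random() ** exponent


def mean_fraction(exponent, samples=100_000, seed=0):
    """Average fraction over many draws; should be close to 1 / (exponent + 1)."""
    rng = random.Random(seed)
    return sum(size_fraction(rng, exponent) for _ in range(samples)) / samples


for exponent in (0.5, 1, 2):
    print(exponent, round(mean_fraction(exponent), 3), "expected about", round(1 / (exponent + 1), 3))
```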
Fragment Extraction Strategy
When the size fraction for a genome (genomeFraction) is less than 1, the algorithm treats the genome as circular and extracts a contiguous fragment:
- Calculates retain length: bases_to_keep = original_length × genomeFraction
- Selects random starting position in the genome
- Extracts sequence from start position, wrapping around to beginning if needed
- Marks the wraparound junction as a chimeric break (mutationsAdded++)
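The steps above can be sketched in Python as follows. This is an illustration under stated assumptions, not the tool's implementation: the genome is treated as a single string, the retain length is truncated to an integer (the rounding rule is not documented here), and the name circular_fragment is hypothetical. The returned flag indicates whether the fragment wrapped past the end, which is the case the tool counts as a chimeric break.

```python
import random


def circular_fragment(bases, fraction, rng):
    """Return (fragment, wrapped) for a random circular slice of bases."""
    if fraction >= 1 or not bases:
        return bases, False                # nothing to trim
    keep = int(len(bases) * fraction)      # assumed rounding: truncate
    start = rng.randrange(len(bases))      # random starting position
    end = start + keep
    if end <= len(bases):
        return bases[start:end], False     # fits without wrapping
    # Wrap around: tail of the genome followed by its beginning.
    return bases[start:] + bases[:end - len(bases)], True


rng = random.Random(42)
fragment, wrapped = circular_fragment("ACGTTGCAAGGCTT", 0.5, rng)
```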
Mutation Application
If error rates are specified, mutations are applied base-by-base:
- Substitutions: Replace base with one of the other three nucleotides
- Deletions: Skip base in output (50% of indel events)
- Insertions: Add random nucleotide, reprocess current position (50% of indel events)
- Total error rate = subRate + indelRate
- Only fully-defined nucleotides (ACGT) are mutated
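A base-by-base mutation loop following the rules above could be sketched in Python like this. It is a simplified illustration, not the tool's code: the function name mutate is hypothetical, only uppercase A, C, G, and T are treated as fully defined (an assumption), and the way random draws are consumed will not match the Java implementation, so outputs are not comparable for a given seed.

```python
import random

BASES = "ACGT"


def mutate(seq, sub_rate, indel_rate, rng):
    """Apply substitutions and indels to seq; non-ACGT symbols pass through unchanged."""
    error_rate = sub_rate + indel_rate     # total error rate
    out = []
    i = 0
    while i < len(seq):
        base = seq[i]
        roll = rng.random()
        if base not in BASES or roll >= error_rate:
            out.append(base)               # keep the base as-is
            i += 1
        elif roll < sub_rate:
            # Substitution: one of the other three nucleotides.
            out.append(rng.choice([b for b in BASES if b != base]))
            i += 1
        elif rng.random() < 0.5:
            i += 1                         # deletion: skip the base
        else:
            out.append(rng.choice(BASES))  # insertion: do not advance, reprocess position
    return "".join(out)


rng = random.Random(12345)
mutated = mutate("ACGTNACGTACGT", 0.01, 0.001, rng)
```

Because an insertion does not advance the position, the same base is evaluated again on the next pass and may itself be mutated.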
Output File Naming
The algorithm generates descriptive filenames containing all relevant information:
Format: (prefix)_sizeA_fracA_nameA_sizeB_fracB_nameB_counter_(suffix)
- Genomes are ordered by size (larger genome first in filename)
- Sizes are actual base counts after fragment extraction
- Fractions are formatted to 3 decimal places
- Names are core filenames without paths/extensions
- Counter distinguishes multiple output files
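To make the naming scheme concrete, here is a Python sketch that assembles a name in the format shown above. Several details are assumptions rather than documented behavior: ordering uses the extracted fragment size, only the last file extension is stripped from each source name, only the first match of the pattern is replaced, and the function name output_name and the example paths are invented for illustration.

```python
import re
from pathlib import Path


def output_name(pattern, parts, counter, regex="#", delimiter="_"):
    """Build an output name from two (size, fraction, source_path) tuples."""
    # Larger genome first (assumed to mean the extracted size).
    first, second = sorted(parts, key=lambda p: p[0], reverse=True)
    fields = []
    for size, frac, path in (first, second):
        # Size, fraction to 3 decimal places, core filename without path or extension.
        fields += [str(size), f"{frac:.3f}", Path(path).stem]
    fields.append(str(counter))
    core = delimiter.join(fields)
    # Replace the first match of the pattern symbol (assumption).
    return re.sub(regex, lambda m: core, pattern, count=1)


print(output_name(
    "contaminated_#.fasta",
    [(1200, 0.25, "/data/b_subtilis.fa"), (3500, 0.7, "/data/e_coli.fa")],
    0,
))
# prints: contaminated_3500_0.700_e_coli_1200_0.250_b_subtilis_0.fasta
```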
Performance Characteristics
- Memory usage: Holds two complete genomes in memory simultaneously
- Processing time: Linear with total genome size and mutation rate
- Thread safety: Uses thread-local random number generators
- I/O efficiency: Streams output directly to files, no intermediate storage
Scientific Applications
This tool is valuable for:
- Creating training datasets for contamination detection algorithms
- Benchmarking assembly tools against known contamination levels
- Simulating horizontal gene transfer events
- Testing binning and classification methods
- Generating synthetic metagenomes with controlled contamination
Support
For questions and support:
- Email: bbushnell@lbl.gov
- Documentation: bbmap.org