MakeContaminatedGenomes

Basic Usage

makecontaminatedgenomes.sh in=<file> out=<pattern>

This tool creates synthetic contaminated genomes by fusing two randomly selected genomes together. It takes a file containing input file paths and generates chimeric sequences with customizable size fractions and mutation rates.

Parameters

Parameters are organized by their function in the contaminated genome generation process.

I/O Parameters

in=<file>: A file containing one input file path per line. Each line should specify the path to a FASTA file that will be used as a source genome for creating contaminated sequences.
out=<pattern>: An output file name containing a # symbol (or whatever pattern is set with regex=). The pattern is replaced with a generated string built from the source file names and size information, so each output file gets a descriptive name.
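
The list file passed to in= can be written by hand or generated. The following is a minimal Python sketch, not part of the tool, that collects FASTA files from a directory and writes one path per line. The function name write_genome_list and the extension list are assumptions made for illustration; compressed files such as .fa.gz are not matched by this version.

```python
from pathlib import Path

def write_genome_list(genome_dir, list_path, extensions=(".fa", ".fasta", ".fna")):
    """Write one FASTA path per line and return the paths written."""
    paths = sorted(
        str(p.resolve())
        for p in Path(genome_dir).iterdir()
        if p.is_file() and p.suffix.lower() in extensions
    )
    # One path per line, the layout described for in=<file>.
    Path(list_path).write_text("".join(p + "\n" for p in paths))
    return paths
```

The resulting file can then be passed as in=genome_list.txt. At least two genomes must be listed, since each output fuses two different inputs.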

Processing Parameters

count=1: Number of output files to make. Each output file will be a contaminated genome created from two randomly selected input genomes.
seed=-1: RNG seed; negative for a random seed. Controls the random number generator. Use a fixed non-negative value to get identical output across runs.
exp1=1: Exponent for genome 1 size fraction. Controls the size distribution of the first genome fragment. Higher values bias toward smaller fractions (Math.pow(random, exponent)).
exp2=1: Exponent for genome 2 size fraction. Controls the size distribution of the second genome fragment. Higher values bias toward smaller fractions (Math.pow(random, exponent)).
subrate=0: Rate to add substitutions to new genomes (0-1). Probability of substituting each base with a different nucleotide. Value of 0 means no substitutions.
indelrate=0: Rate to add indels to new genomes (0-1). Probability of inserting or deleting a base at each position. Combined with subrate to give the total error rate.
regex=#: Use this substitution regex for replacement. The pattern in the output filename that will be replaced with the generated filename containing size and source information.
delimiter=_: Use this delimiter in the new file names. Character used to separate components in the automatically generated output filenames.

Additional Parameters

verbose=f: Print status messages during processing. Enables detailed output for debugging and monitoring progress.
chimeras: Alias for count parameter. Number of chimeric genomes to create.
exp: Set both exp1 and exp2 to the same value. Convenient way to apply the same exponent to both genome size fractions.
id: Sets the target identity level (0-1). The error rate is 1 - identity; subrate is set to 99% of the error rate and indelrate to 1% of it.
ani: Alias for id parameter. Sets average nucleotide identity by calculating appropriate substitution and indel rates.
identity: Alias for id parameter. Sets sequence identity by calculating mutation rates.
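
The arithmetic behind the id, ani, and identity aliases can be sketched in Python as follows. This is an illustration of the 99%/1% split described above, not the tool's own code; the function name rates_from_identity and the sub_share parameter are hypothetical.

```python
def rates_from_identity(identity, sub_share=0.99):
    """Split the error rate (1 - identity) into substitution and indel rates."""
    if not 0.0 <= identity <= 1.0:
        raise ValueError("identity must be between 0 and 1")
    error_rate = 1.0 - identity
    sub_rate = error_rate * sub_share    # 99% of errors are substitutions
    indel_rate = error_rate - sub_rate   # the remaining 1% are indels
    return sub_rate, indel_rate

sub_rate, indel_rate = rates_from_identity(0.95)
print(f"subrate={sub_rate:.4f} indelrate={indel_rate:.4f}")
# prints: subrate=0.0495 indelrate=0.0005
```

So identity=0.95 corresponds to roughly subrate=0.0495 and indelrate=0.0005.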

Java Parameters

-Xmx: This will set Java's memory usage, overriding autodetection. -Xmx20g will specify 20 gigs of RAM, and -Xmx200m will specify 200 megs. The max is typically 85% of physical memory.
-eoom: This flag will cause the process to exit if an out-of-memory exception occurs. Requires Java 8u92+.
-da: Disable assertions.

Examples

Basic Contaminated Genome Generation

makecontaminatedgenomes.sh in=genome_list.txt out=contaminated_#.fasta

Creates one contaminated genome from randomly selected genomes listed in genome_list.txt. The output file will have a descriptive name containing size and source information.

Multiple Contaminated Genomes with Mutations

makecontaminatedgenomes.sh in=genome_list.txt out=contaminated_#.fasta count=10 subrate=0.01 indelrate=0.001

Creates 10 contaminated genomes with 1% substitution rate and 0.1% indel rate, adding realistic mutations to simulate evolutionary divergence.

Biased Fragment Sizes

makecontaminatedgenomes.sh in=genome_list.txt out=contaminated_#.fasta exp1=2 exp2=0.5 seed=12345

Creates one contaminated genome (the default count) with biased fragment sizes: exp1=2 favors smaller fragments from genome 1, and exp2=0.5 favors larger fragments from genome 2. The fixed seed makes the run reproducible.

Identity-Based Generation

makecontaminatedgenomes.sh in=genome_list.txt out=contaminated_#.fasta count=5 identity=0.95

Creates 5 contaminated genomes with 95% identity. The 5% error rate is split automatically into a substitution rate of 4.95% (99% of the error rate) and an indel rate of 0.05% (1% of the error rate).

Algorithm Details

Chimeric Genome Creation Process

MakeContaminatedGenomes creates synthetic contaminated genomes through a multi-step process that simulates realistic contamination scenarios:

Genome Selection and Pairing

The algorithm randomly selects two different genomes from the input file list for each contaminated genome. It ensures no genome is paired with itself by continuing to select until two different genomes are chosen.
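
The redraw-until-different selection described above can be sketched in Python. This is an illustrative sketch rather than the tool's implementation; pick_two_distinct and the example file names are hypothetical.

```python
import random

def pick_two_distinct(n, rng):
    """Return two different indices in range(n), redrawing the second until it differs."""
    if n < 2:
        raise ValueError("need at least two genomes to form a pair")
    a = rng.randrange(n)
    b = rng.randrange(n)
    while b == a:  # never pair a genome with itself
        b = rng.randrange(n)
    return a, b

rng = random.Random(12345)  # fixed seed, analogous to seed=12345
genomes = ["a.fa", "b.fa", "c.fa"]
i, j = pick_two_distinct(len(genomes), rng)
print(genomes[i], genomes[j])
```

With only two genomes in the list, every output pairs the same two inputs, and only the fragment sizes and mutations vary.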

Fragment Size Calculation

For each genome pair, the algorithm calculates size fractions using power distributions:

fracA = Math.pow(random(), exp1) - Controls genome A fragment size
fracB = Math.pow(random(), exp2) - Controls genome B fragment size
Higher exponents bias toward smaller fragments (more aggressive truncation)
Lower exponents bias toward larger fragments (more complete genomes)
Exponent of 1 provides uniform distribution
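
To see how the exponent shifts the size distribution, the following Python sketch draws many fractions and reports their mean. It mirrors the Math.pow(random(), exponent) formula above but is only an illustration; size_fraction and mean_fraction are hypothetical names. For a uniform random value raised to the power k, the expected value is 1/(k+1), so the means should land near 0.667, 0.5, and 0.333 for exponents 0.5, 1, and 2.

```python
import random

def size_fraction(rng, exponent):
    """Draw a fraction in [0, 1) as uniform_random ** exponent."""
    return rng.random() ** exponent

def mean_fraction(exponent, draws=100_000, seed=1):
    """Average fraction over many draws, to show the bias of an exponent."""
    rng = random.Random(seed)
    return sum(size_fraction(rng, exponent) for _ in range(draws)) / draws

for exponent in (0.5, 1, 2):
    print(f"exp={exponent}: mean fraction is about {mean_fraction(exponent):.2f}")
```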

Fragment Extraction Strategy

When genomeFraction < 1, the algorithm extracts a circular fragment:

Calculates retain length: bases_to_keep = original_length × genomeFraction
Selects random starting position in the genome
Extracts sequence from start position, wrapping around to beginning if needed
Marks the wraparound junction as a chimeric break (mutationsAdded++)
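
The extraction steps above can be sketched in Python on a single sequence string. This is a simplified illustration under stated assumptions: the retained length is truncated to an integer (the rounding used by the tool is not specified here), the genome is treated as one sequence, and the junction bookkeeping is omitted. The name circular_fragment is hypothetical.

```python
import random

def circular_fragment(seq, fraction, rng):
    """Keep a contiguous piece of seq, wrapping past the end as if seq were circular."""
    if fraction >= 1 or not seq:
        return seq
    keep = int(len(seq) * fraction)   # assumption: truncate to a whole number of bases
    start = rng.randrange(len(seq))   # random starting position
    end = start + keep
    if end <= len(seq):
        return seq[start:end]         # fragment fits without wrapping
    # Fragment runs past the end: join the tail to the head of the sequence.
    return seq[start:] + seq[:end - len(seq)]

rng = random.Random(7)
print(circular_fragment("ACGTTGCAAT", 0.5, rng))
```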

Mutation Application

If error rates are specified, mutations are applied base-by-base:

Substitutions: Replace base with one of the other three nucleotides
Deletions: Skip base in output (50% of indel events)
Insertions: Add random nucleotide, reprocess current position (50% of indel events)
Total error rate = subRate + indelRate
Only fully-defined nucleotides (ACGT) are mutated
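
The per-base rules above can be sketched in Python as follows. This is an illustrative sketch, not the tool's code: it assumes uppercase input, chooses between substitution and indel in proportion to the two rates, and leaves any character outside ACGT untouched. The function name mutate is hypothetical.

```python
import random

BASES = "ACGT"

def mutate(seq, sub_rate, indel_rate, rng):
    """Apply substitutions, deletions, and insertions base by base."""
    error_rate = sub_rate + indel_rate
    out = []
    i = 0
    while i < len(seq):
        base = seq[i]
        # Leave undefined bases alone, and most defined bases unchanged.
        if base not in BASES or rng.random() >= error_rate:
            out.append(base)
            i += 1
        elif rng.random() * error_rate < sub_rate:
            # Substitution: one of the three other nucleotides.
            out.append(rng.choice([b for b in BASES if b != base]))
            i += 1
        elif rng.random() < 0.5:
            i += 1  # Deletion: skip this base in the output.
        else:
            # Insertion: emit a random base, then reprocess the current position.
            out.append(rng.choice(BASES))
    return "".join(out)

rng = random.Random(42)
print(mutate("ACGTACGTACGTNNNN", sub_rate=0.1, indel_rate=0.05, rng=rng))
```

Because an insertion does not advance the position, the current base is evaluated again and may itself be mutated on the next pass.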

Output File Naming

The algorithm generates descriptive filenames containing all relevant information:

Format: (prefix)_sizeA_fracA_nameA_sizeB_fracB_nameB_counter_(suffix)

Genomes are ordered by size (larger genome first in filename)
Sizes are actual base counts after fragment extraction
Fractions are formatted to 3 decimal places
Names are core filenames without paths/extensions
Counter distinguishes multiple output files
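
The naming rules above can be sketched in Python. This is a hedged illustration only: output_name and core_name are hypothetical helpers, the counter's starting value and the exact extension stripping are assumptions, and the prefix and suffix are taken to be whatever surrounds the # in the out= pattern.

```python
import re
from pathlib import Path

def core_name(path):
    """File name without directory, compression suffix, or extension (assumed rule)."""
    name = Path(path).name
    for ext in (".gz", ".bz2"):
        if name.endswith(ext):
            name = name[: -len(ext)]
    return Path(name).stem

def output_name(pattern, a, b, counter, regex="#", delimiter="_"):
    """a and b are (path, size_after_extraction, fraction) tuples."""
    # Larger fragment first in the file name.
    first, second = sorted((a, b), key=lambda g: g[1], reverse=True)
    parts = []
    for path, size, frac in (first, second):
        parts += [str(size), f"{frac:.3f}", core_name(path)]
    parts.append(str(counter))
    core = delimiter.join(parts)
    # Replace the placeholder in the out= pattern with the generated string.
    return re.sub(regex, lambda m: core, pattern)

print(output_name(
    "contaminated_#.fasta",
    ("/data/ecoli.fa.gz", 1200000, 0.25),
    ("/data/bsub.fa", 3100000, 0.75),
    counter=0,
))
# prints: contaminated_3100000_0.750_bsub_1200000_0.250_ecoli_0.fasta
```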

Performance Characteristics

Memory usage: Holds two complete genomes in memory simultaneously
Processing time: Linear with total genome size and mutation rate
Thread safety: Uses thread-local random number generators
I/O efficiency: Streams output directly to files, no intermediate storage

Scientific Applications

This tool is valuable for:

Creating training datasets for contamination detection algorithms
Benchmarking assembly tools against known contamination levels
Simulating horizontal gene transfer events
Testing binning and classification methods
Generating synthetic metagenomes with controlled contamination

Support

For questions and support:

Email: bbushnell@lbl.gov
Documentation: bbmap.org