MakeContaminatedGenomes

Script: makecontaminatedgenomes.sh
Package: synth
Class: MakeContaminatedGenomes.java

Generates synthetic contaminated partial genomes from clean genomes. Output is formatted as (prefix)_bases1_fname1_bases2_fname2_counter_(suffix).

Basic Usage

makecontaminatedgenomes.sh in=<file> out=<pattern>

This tool creates synthetic contaminated genomes by fusing two randomly selected genomes together. It takes a file containing input file paths and generates chimeric sequences with customizable size fractions and mutation rates.

Parameters

Parameters are organized by their function in the contaminated genome generation process.

I/O Parameters

in=<file>
A file containing one input file path per line. Each line should specify the path to a FASTA file that will be used as a source genome for creating contaminated sequences.
out=<pattern>
A file name pattern containing a # symbol (or another pattern set with regex=). The matched pattern is replaced by source filenames and size information to create a descriptive name for each output file.
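The list file passed to in= can be written by hand or generated. The following is a minimal Python sketch of one way to build it; the helper name write_genome_list, the directory layout, and the *.fasta glob are assumptions for illustration and are not part of the tool.

```python
from pathlib import Path

def write_genome_list(genome_dir, list_path, pattern="*.fasta"):
    """Write one FASTA path per line, the layout described for in=."""
    # Sort so the list is stable between runs (hypothetical convention).
    paths = sorted(str(p) for p in Path(genome_dir).glob(pattern))
    Path(list_path).write_text("".join(p + "\n" for p in paths))
    return paths
```

The resulting file can then be passed as in=genome_list.txt.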

Processing Parameters

count=1
Number of output files to make. Each output file will be a contaminated genome created from two randomly selected input genomes.
seed=-1
RNG seed; negative for a random seed. Controls the random number generator for reproducible results. Use a specific positive value to ensure identical output across runs.
exp1=1
Exponent for genome 1 size fraction. Controls the size distribution of the first genome fragment. Higher values bias toward smaller fractions (Math.pow(random, exponent)).
exp2=1
Exponent for genome 2 size fraction. Controls the size distribution of the second genome fragment. Higher values bias toward smaller fractions (Math.pow(random, exponent)).
subrate=0
Rate to add substitutions to new genomes (0-1). Probability of substituting each base with a different nucleotide. Value of 0 means no substitutions.
indelrate=0
Rate to add indels to new genomes (0-1). Probability of inserting or deleting bases. Combined with subrate to create the total error rate.
regex=#
Use this substitution regex for replacement. The pattern in the output filename that will be replaced with the generated filename containing size and source information.
delimiter=_
Use this delimiter in the new file names. Character used to separate components in the automatically generated output filenames.

Additional Parameters

verbose=f
Print status messages during processing. Enables detailed output for debugging and monitoring progress.
chimeras
Alias for count parameter. Number of chimeric genomes to create.
exp
Set both exp1 and exp2 to the same value. Convenient way to apply the same exponent to both genome size fractions.
id
Alias for setting identity level. Automatically calculates subrate (99% of error rate) and indelrate (1% of error rate) to achieve the specified identity.
ani
Alias for id parameter. Sets average nucleotide identity by calculating appropriate substitution and indel rates.
identity
Alias for id parameter. Sets sequence identity by calculating mutation rates.
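
The 99%/1% split described for id can be illustrated with a short Python sketch. It assumes the error rate is simply 1 minus the identity; the function name rates_from_identity and the sub_share argument are hypothetical and do not come from the tool's source.

```python
def rates_from_identity(identity, sub_share=0.99):
    """Split the error rate (assumed to be 1 - identity) into two rates."""
    if not 0.0 <= identity <= 1.0:
        raise ValueError("identity must be between 0 and 1")
    error_rate = 1.0 - identity
    subrate = error_rate * sub_share            # 99% of the error rate
    indelrate = error_rate * (1.0 - sub_share)  # remaining 1%
    return subrate, indelrate

subrate, indelrate = rates_from_identity(0.95)
print(f"subrate={subrate:.4f} indelrate={indelrate:.4f}")
# prints: subrate=0.0495 indelrate=0.0005
```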

Java Parameters

-Xmx
This will set Java's memory usage, overriding autodetection. -Xmx20g will specify 20 gigs of RAM, and -Xmx200m will specify 200 megs. The max is typically 85% of physical memory.
-eoom
This flag will cause the process to exit if an out-of-memory exception occurs. Requires Java 8u92+.
-da
Disable assertions.

Examples

Basic Contaminated Genome Generation

makecontaminatedgenomes.sh in=genome_list.txt out=contaminated_#.fasta

Creates one contaminated genome from randomly selected genomes listed in genome_list.txt. The output file will have a descriptive name containing size and source information.

Multiple Contaminated Genomes with Mutations

makecontaminatedgenomes.sh in=genome_list.txt out=contaminated_#.fasta count=10 subrate=0.01 indelrate=0.001

Creates 10 contaminated genomes with 1% substitution rate and 0.1% indel rate, adding realistic mutations to simulate evolutionary divergence.

Biased Fragment Sizes

makecontaminatedgenomes.sh in=genome_list.txt out=contaminated_#.fasta exp1=2 exp2=0.5 seed=12345

Creates one contaminated genome (the default count) with biased fragment sizes: exp1=2 favors smaller fragments from genome 1, exp2=0.5 favors larger fragments from genome 2. The fixed seed makes the result reproducible.

Identity-Based Generation

makecontaminatedgenomes.sh in=genome_list.txt out=contaminated_#.fasta count=5 identity=0.95

Creates 5 contaminated genomes with 95% identity. The 5% error rate is split automatically into a substitution rate (99% of the error rate, i.e. 4.95%) and an indel rate (1% of the error rate, i.e. 0.05%).

Algorithm Details

Chimeric Genome Creation Process

MakeContaminatedGenomes creates synthetic contaminated genomes through a multi-step process that simulates realistic contamination scenarios:

Genome Selection and Pairing

The algorithm randomly selects two different genomes from the input file list for each contaminated genome. It ensures no genome is paired with itself by continuing to select until two different genomes are chosen.
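
The "select until different" behavior can be sketched in Python as simple rejection sampling. This is an illustration of the idea only; the function name pick_two_distinct is hypothetical, and the tool's actual Java code may differ.

```python
import random

def pick_two_distinct(paths, rng):
    """Pick two different entries by redrawing the second until it differs."""
    if len(paths) < 2:
        raise ValueError("need at least two input genomes")
    a = rng.randrange(len(paths))
    b = rng.randrange(len(paths))
    while b == a:  # reject self-pairing and draw again
        b = rng.randrange(len(paths))
    return paths[a], paths[b]

rng = random.Random(12345)  # fixed seed, analogous to seed=12345
first, second = pick_two_distinct(["a.fa", "b.fa", "c.fa"], rng)
```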

Fragment Size Calculation

For each genome pair, the algorithm calculates size fractions using power distributions:
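
A minimal Python sketch of the Math.pow(random, exponent) rule mentioned under exp1 and exp2 follows. The function names are hypothetical; only the power transform itself is taken from the parameter descriptions.

```python
import random

def size_fraction(rng, exponent):
    """Draw a uniform value in [0, 1) and raise it to the exponent."""
    return rng.random() ** exponent

def mean_fraction(exponent, n=100_000, seed=12345):
    """Average many draws to show how the exponent shifts the fractions."""
    rng = random.Random(seed)
    return sum(size_fraction(rng, exponent) for _ in range(n)) / n
```

For a uniform draw the expected fraction is 1/(exponent+1), so mean_fraction(1) is close to 0.5, mean_fraction(2) close to 0.33, and mean_fraction(0.5) close to 0.67. This matches the rule that higher exponents bias toward smaller fractions.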

Fragment Extraction Strategy

When genomeFraction < 1, the algorithm extracts a circular fragment:
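
One plausible reading of "circular fragment" is a window that starts at a random position and wraps around the end of the sequence. The Python sketch below illustrates that reading; the random start, the truncation of the length to an integer, and the treatment of the genome as a single string are all assumptions, not confirmed details of the tool.

```python
import random

def circular_fragment(seq, fraction, rng):
    """Return a fragment of about fraction * len(seq) bases, wrapping at the end."""
    if fraction >= 1 or not seq:
        return seq  # nothing to cut
    length = int(len(seq) * fraction)
    start = rng.randrange(len(seq))  # assumed: uniformly random start
    end = start + length
    if end <= len(seq):
        return seq[start:end]
    # The window runs past the end, so continue from the beginning.
    return seq[start:] + seq[:end - len(seq)]
```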

Mutation Application

If error rates are specified, mutations are applied base-by-base:
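
A base-by-base mutation pass can be sketched in Python as follows. The even split between insertions and deletions, the single-base indel length, and the limit of one event per base are assumptions made for illustration; the function name mutate is hypothetical.

```python
import random

BASES = "ACGT"

def mutate(seq, subrate, indelrate, rng):
    """Apply substitutions and single-base indels independently per base."""
    out = []
    for base in seq:
        r = rng.random()
        if r < subrate:
            # Substitution: choose a nucleotide different from the original.
            out.append(rng.choice([b for b in BASES if b != base.upper()]))
        elif r < subrate + indelrate:
            if rng.random() < 0.5:
                continue  # deletion: drop this base
            out.append(base)
            out.append(rng.choice(BASES))  # insertion after this base
        else:
            out.append(base)  # base is left unchanged
    return "".join(out)
```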

Output File Naming

The algorithm generates descriptive filenames containing all relevant information:

Format: (prefix)_sizeA_fracA_nameA_sizeB_fracB_nameB_counter_(suffix)
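
The naming scheme can be sketched in Python as below. The sketch treats the regex= value as a literal string and formats fractions to three decimals; both choices, like the function name output_name and the sample values, are assumptions for illustration.

```python
def output_name(pattern, size_a, frac_a, name_a, size_b, frac_b, name_b,
                counter, regex="#", delimiter="_"):
    """Build the descriptive core and substitute it into the out= pattern."""
    core = delimiter.join([
        str(size_a), f"{frac_a:.3f}", name_a,
        str(size_b), f"{frac_b:.3f}", name_b,
        str(counter),
    ])
    # Replace only the first occurrence of the placeholder.
    return pattern.replace(regex, core, 1)

print(output_name("contaminated_#.fasta", 1200, 0.25, "ecoli",
                  800, 0.5, "bsub", 0))
# prints: contaminated_1200_0.250_ecoli_800_0.500_bsub_0.fasta
```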

Performance Characteristics

Scientific Applications

This tool is valuable for:

Support

For questions and support: