RandomGenome

Script: randomgenome.sh Package: synth Class: RandomGenome.java

Generates a random, (probably) repeat-free genome with a specified total length, chromosome count, and GC content. It can also produce random amino acid sequences instead of nucleotide sequences.

Basic Usage

randomgenome.sh len=<total size> chroms=<int> gc=<float> out=<file>

Creates random genome sequences with customizable parameters for testing, benchmarking, or simulation purposes.

Parameters

RandomGenome supports parameters for output control, genome characteristics, sequence composition, and reproducibility.

Output Parameters

out=<file>
Output file for the generated genome sequences. Required parameter.
overwrite=f
(ow) Set to false to force the program to abort rather than overwrite an existing file. Default: f

Genome Structure Parameters

len=100000
Total genome size in bases. Can use K/M/G suffixes (e.g., 1M for 1 million bases). Default: 100000
chroms=1
Number of separate chromosomes/contigs to generate. The total length is divided evenly among chromosomes. Default: 1
pad=0
Add this many N characters to chromosome ends. These padding bases do not count toward the total genome size. Default: 0
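
The arithmetic implied by len, chroms, and pad can be sketched as follows. This is an illustrative Python sketch, not the tool's code: the function name chromosome_layout is hypothetical, and this page does not say how a remainder is handled when len is not divisible by chroms, so the sketch assumes the first few chromosomes each receive one extra base. It also assumes padding is applied at both ends of every chromosome.

```python
def chromosome_layout(total_len, chroms, pad=0):
    """Return a list of (sequence_len, padded_len), one per chromosome.

    Assumption: any remainder is spread one base at a time over the
    first chromosomes; the real tool may handle it differently.
    """
    if chroms < 1:
        raise ValueError("chroms must be at least 1")
    base, extra = divmod(total_len, chroms)
    layout = []
    for i in range(chroms):
        seq_len = base + (1 if i < extra else 0)
        # Padding sits at both ends and does not count toward total_len.
        layout.append((seq_len, seq_len + 2 * pad))
    return layout


if __name__ == "__main__":
    # Mirrors: len=2M chroms=10 pad=100
    for seq_len, padded_len in chromosome_layout(2_000_000, 10, pad=100):
        print(seq_len, padded_len)
```

Under these assumptions, len=2M chroms=10 pad=100 gives ten chromosomes of 200,000 random bases each, with 200 extra N characters per chromosome in the output.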

Sequence Composition Parameters

gc=0.5
GC fraction for nucleotide sequences (range 0.0-1.0). Controls the proportion of G and C bases versus A and T bases. Default: 0.5 (50% GC content)
nopoly=f
Ban homopolymers (consecutive identical bases/amino acids). When true, prevents runs of the same character. Default: f
amino=f
Produce random amino acid sequences instead of nucleotide sequences. Output will contain protein sequences rather than DNA. Default: f
includestop=f
Include stop codons in random amino acid sequences. Only applies when amino=t. Default: f
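
To illustrate how gc and nopoly interact, here is a minimal Python sketch of GC-weighted base sampling. It is not taken from RandomGenome.java. The function name random_bases and the redraw strategy for nopoly are assumptions made for illustration.

```python
import random


def random_bases(length, gc=0.5, nopoly=False, rng=None):
    """Draw `length` bases with G/C probability `gc` (hypothetical helper).

    With nopoly=True, a base equal to the previous one is redrawn, so no
    two adjacent bases are identical. Redrawing can pull the realized GC
    fraction slightly toward 0.5 when gc is far from 0.5.
    """
    if rng is None:
        rng = random.Random()
    out = []
    while len(out) < length:
        # Pick the G/C pair with probability gc, otherwise the A/T pair.
        pair = "GC" if rng.random() < gc else "AT"
        base = rng.choice(pair)
        if nopoly and out and base == out[-1]:
            continue  # reject a repeat of the previous base and redraw
        out.append(base)
    return "".join(out)


if __name__ == "__main__":
    seq = random_bases(60, gc=0.7, nopoly=True, rng=random.Random(12345))
    print(seq)
```

Splitting G versus C and A versus T evenly within each pair is also an assumption; this page only specifies the combined GC fraction.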

Reproducibility Parameters

seed=-1
Random number generator seed. Set to a positive number for deterministic, reproducible output. Default: -1 (random seed)
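
The seed convention above (negative means "pick a seed for me") can be sketched in Python as shown below. The helper name make_rng is hypothetical, and treating 0 as a valid deterministic seed is an assumption; the text above only guarantees determinism for positive values.

```python
import random


def make_rng(seed=-1):
    """Return a generator: unpredictable for seed < 0, deterministic otherwise."""
    if seed < 0:
        return random.Random()  # seeded from OS entropy or the clock
    return random.Random(seed)


if __name__ == "__main__":
    a = make_rng(12345)
    b = make_rng(12345)
    # Two generators built from the same seed produce the same stream.
    print(a.random() == b.random())
```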

Examples

Basic Genome Generation

randomgenome.sh len=1M chroms=5 gc=0.4 out=test_genome.fa

Generates a 1 megabase genome split into 5 chromosomes with 40% GC content.

High GC Content Genome

randomgenome.sh len=500K chroms=1 gc=0.7 nopoly=t out=high_gc_genome.fa

Creates a 500 kilobase single chromosome with 70% GC content and no homopolymer runs.

Reproducible Generation

randomgenome.sh len=100K chroms=3 gc=0.45 seed=12345 out=reproducible_genome.fa

Generates a reproducible 100 kilobase genome using a fixed seed for consistent results across runs.

Amino Acid Sequences

randomgenome.sh len=10K amino=t includestop=t chroms=1 out=random_proteins.fa

Creates a single random protein sequence of 10,000 residues (len=10K counts amino acids here, not bases), with stop codons included.
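
As a rough illustration of what amino=t with includestop=t produces, the following Python sketch draws residues uniformly from the 20 standard amino acids, optionally adding a stop symbol. The uniform distribution, the use of "*" for stops, and the function name random_protein are all assumptions; the tool's actual residue frequencies are not documented on this page.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues
STOP = "*"  # assumed stop symbol; the tool's choice is not documented here


def random_protein(length, include_stop=False, nopoly=False, rng=None):
    """Draw `length` residues uniformly at random (hypothetical helper)."""
    if rng is None:
        rng = random.Random()
    alphabet = AMINO_ACIDS + (STOP if include_stop else "")
    out = []
    while len(out) < length:
        residue = rng.choice(alphabet)
        if nopoly and out and residue == out[-1]:
            continue  # redraw instead of extending a homopolymer
        out.append(residue)
    return "".join(out)


if __name__ == "__main__":
    print(random_protein(60, include_stop=True, rng=random.Random(7)))
```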

Padded Contigs

randomgenome.sh len=2M chroms=10 pad=100 gc=0.5 out=padded_genome.fa

Generates a 2 megabase genome with 10 chromosomes, each padded with 100 N's at both ends.
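
Checking Generated Output

A generated file can be sanity-checked with a few lines of Python. The sketch below assumes the output is FASTA (as the .fa filenames in these examples suggest) and reports, per record, the number of A/C/G/T bases, the number of N padding characters, and the GC fraction. The function name fasta_stats is hypothetical and the code is independent of the tool itself.

```python
from collections import Counter


def fasta_stats(lines):
    """Return (name, acgt_count, n_count, gc_fraction) for each FASTA record."""
    records = []  # each entry is [header_text, Counter of characters]
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            records.append([line[1:], Counter()])
        elif records:
            records[-1][1].update(line.upper())
    result = []
    for name, counts in records:
        acgt = sum(counts[b] for b in "ACGT")
        # GC fraction is computed over A/C/G/T only, ignoring N padding.
        gc = (counts["G"] + counts["C"]) / acgt if acgt else 0.0
        result.append((name, acgt, counts["N"], gc))
    return result


if __name__ == "__main__":
    demo = [">chr1", "NNGGCC", "AATTNN", ">chr2", "GGGG"]
    for row in fasta_stats(demo):
        print(row)
```

To check a real file, pass an open file handle, for example fasta_stats(open("padded_genome.fa")). The realized GC fraction of a random genome will be close to the requested gc value but is unlikely to match it exactly.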

Algorithm Details

RandomGenome implements two distinct generation methods, one for nucleotide sequences and one for amino acid sequences. The Shared.AMINO_IN flag determines which one is used.

Nucleotide Generation Implementation

Amino Acid Generation Implementation

Output Formatting

Performance Characteristics

Randomization Quality

The output is probably repeat-free because two independently generated long stretches of random sequence are extremely unlikely to be identical. Short repeats can still occur by chance, and become more likely as the genome grows or as GC content moves away from 0.5. Sequences are drawn from a Java pseudorandom number generator, and the optional seed parameter makes generation reproducible when needed for testing or benchmarking.
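
The "probably repeat-free" claim can be made concrete with a back-of-the-envelope estimate. The Python sketch below computes the expected number of position pairs that share the same k-mer in a random sequence of n bases, treating positions as independent (an approximation for overlapping k-mers) and ignoring reverse complements. The function name expected_repeat_pairs is hypothetical.

```python
def expected_repeat_pairs(n, k, gc=0.5):
    """Approximate expected number of position pairs with identical k-mers.

    Two random bases match with probability
    2*(gc/2)**2 + 2*((1-gc)/2)**2 = (gc**2 + (1-gc)**2) / 2,
    so two k-mers match with that probability raised to the power k.
    """
    m = n - k + 1  # number of k-mer start positions
    if m < 2:
        return 0.0
    match = (gc ** 2 + (1 - gc) ** 2) / 2
    return m * (m - 1) / 2 * match ** k


if __name__ == "__main__":
    for k in (10, 20, 31):
        print(k, expected_repeat_pairs(100_000, k))
```

For the default 100,000-base genome at gc=0.5, this estimate gives several thousand repeated 10-mers but only about one in a billion for 31-mers, which is why long exact repeats are effectively absent while short ones are routine.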

Use Cases

Support

For questions and support: