RandomGenome

Basic Usage

randomgenome.sh len=<total size> chroms=<int> gc=<float> out=<file>

Creates random genome sequences with customizable parameters for testing, benchmarking, or simulation purposes.

Parameters

RandomGenome supports parameters for output control, genome characteristics, sequence composition, and reproducibility.

Input and Output Parameters

out=<file>: Output file for the generated genome sequences. Required parameter.
in=<file>: Optional input clade or FASTA file. If specified, the synthetic genome conserves the kmer frequencies of the input.
k=5: Kmer length for base frequencies; allowed range is 2-5. Default: 5
overwrite=f: (ow) Set to false to force the program to abort rather than overwrite an existing file. Default: f
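
The in= and k= options describe conserving the kmer frequencies of an input sequence. One plausible way to do that, shown here only as an illustration and not necessarily how RandomGenome does it, is to record which base follows each (k-1)-base context in the input and then sample new bases with those weights. The function names kmer_model and generate are hypothetical.

```python
import random
from collections import Counter, defaultdict


def kmer_model(seq, k):
    """For each (k-1)-base context in seq, count how often each base follows it."""
    model = defaultdict(Counter)
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if set(kmer) <= set("ACGT"):  # skip kmers containing N or other symbols
            model[kmer[:-1]][kmer[-1]] += 1
    return model


def generate(model, k, length, rng=None):
    """Sample a sequence whose kmer usage follows the counts in model.

    Requires a non-empty model, i.e. an input with at least one valid kmer.
    """
    rng = rng or random.Random()
    out = list(rng.choice(sorted(model)))  # start from an observed context
    while len(out) < length:
        context = "".join(out[-(k - 1):])
        followers = model.get(context)
        if not followers:
            # Context never seen in the input: fall back to a uniform draw.
            out.append(rng.choice("ACGT"))
            continue
        bases, weights = zip(*sorted(followers.items()))
        out.append(rng.choices(bases, weights=weights)[0])
    return "".join(out[:length])
```

With a strictly periodic input such as "ACGTACGTACGT" and k=3, every context has exactly one follower, so the generated sequence repeats the same ACGT period.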

Genome Structure Parameters

len=100000: Total genome size in bases. Can use K/M/G suffixes (e.g., 1M for 1 million bases). Default: 100000
chroms=1: Number of separate chromosomes/contigs to generate. The total length is divided evenly among chromosomes. Default: 1
pad=0: Add this many N characters to chromosome ends. These padding bases do not count toward the total genome size. Default: 0
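
The arithmetic behind len, chroms, and pad can be sketched as follows. This is a minimal illustration with hypothetical helper names (parse_size, chromosome_layout). It assumes decimal suffixes (K=1,000, M=1,000,000, G=1,000,000,000), which matches "1M for 1 million bases", and it assumes plain integer division; how RandomGenome handles a remainder is not documented here.

```python
def parse_size(text):
    """Parse '100000', '500K', '1M', or '2G' into a number of bases."""
    multipliers = {"K": 10**3, "M": 10**6, "G": 10**9}  # decimal units (assumed)
    text = text.strip().upper()
    if text and text[-1] in multipliers:
        return int(float(text[:-1]) * multipliers[text[-1]])
    return int(text)


def chromosome_layout(total_len, chroms, pad=0):
    """Return (random bases per chromosome, written length per chromosome)."""
    per_chrom = total_len // chroms  # remainder handling is an assumption
    # Padding is added at both ends and does not count toward total_len.
    return per_chrom, per_chrom + 2 * pad


print(chromosome_layout(parse_size("2M"), 10, pad=100))  # (200000, 200200)
```

For len=2M chroms=10 pad=100, each chromosome carries 200,000 random bases and is written as 200,200 characters once the padding is included.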

Sequence Composition Parameters

gc=0.5: GC fraction for nucleotide sequences (range 0.0-1.0). Controls the proportion of G and C bases versus A and T bases. Default: 0.5 (50% GC content)
nopoly=f: Ban homopolymers (consecutive identical bases/amino acids). When true, prevents runs of the same character. Default: f
amino=f: Produce random amino acid sequences instead of nucleotide sequences. Output will contain protein sequences rather than DNA. Default: f
includestop=f: Allow stop symbols (stop codons, conventionally written as '*') in random amino acid sequences. Only applies when amino=t. Default: f

Reproducibility Parameters

seed=-1: Random number generator seed. Set to a positive number for deterministic, reproducible output. Default: -1 (random seed)
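
The effect of a fixed seed can be illustrated with any seeded pseudorandom generator. The sketch below uses Python's random module rather than the tool's Java generator, so the actual bases will differ from RandomGenome's output; only the principle carries over. The helper names are hypothetical, and treating 0 as a fixed seed is an assumption of this sketch.

```python
import random


def make_rng(seed=-1):
    # Negative seed (the default -1): let the generator seed itself unpredictably.
    # Any other value: fixed seed, so the draw sequence is repeatable.
    return random.Random() if seed < 0 else random.Random(seed)


def random_bases(length, seed=-1):
    rng = make_rng(seed)
    return "".join(rng.choice("ACGT") for _ in range(length))


first = random_bases(60, seed=12345)
second = random_bases(60, seed=12345)
print(first == second)  # same seed, same sequence
```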

Examples

Basic Genome Generation

randomgenome.sh len=1M chroms=5 gc=0.4 out=test_genome.fa

Generates a 1 megabase genome split into 5 chromosomes with 40% GC content.
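
A quick way to confirm that a generated file matches the requested settings is to count records and bases. The helper below is a small sketch, not part of the tool; fasta_stats is a hypothetical name, and N padding is excluded from the GC calculation.

```python
def fasta_stats(lines):
    """Return (records, counted_bases, gc_fraction) for an iterable of FASTA lines."""
    records = 0
    gc = at = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            records += 1  # header line starts a new record
            continue
        for ch in line.upper():
            if ch in "GC":
                gc += 1
            elif ch in "AT":
                at += 1  # N and other symbols are ignored
    total = gc + at
    return records, total, (gc / total if total else 0.0)


demo = [">contig1", "ACGT", "GGCC", ">contig2", "NNAT"]
print(fasta_stats(demo))  # (2, 10, 0.6)
```

For a real file, pass an open file handle instead of a list; for the command above, the result should show 5 records, 1,000,000 bases, and a GC fraction close to 0.4.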

High GC Content Genome

randomgenome.sh len=500K chroms=1 gc=0.7 nopoly=t out=high_gc_genome.fa

Creates a 500 kilobase single chromosome with 70% GC content and no homopolymer runs.

Reproducible Generation

randomgenome.sh len=100K chroms=3 gc=0.45 seed=12345 out=reproducible_genome.fa

Generates a 100 kilobase genome in 3 chromosomes with 45% GC content. The fixed seed makes the output identical across runs.

Amino Acid Sequences

randomgenome.sh len=10K amino=t includestop=t chroms=1 out=random_proteins.fa

Creates a single random protein sequence of 10,000 amino acids, with stop symbols allowed.

Padded Contigs

randomgenome.sh len=2M chroms=10 pad=100 gc=0.5 out=padded_genome.fa

Generates a 2 megabase genome with 10 chromosomes, each padded with 100 N's at both ends.

Algorithm Details

RandomGenome implements two distinct generation methods determined by the Shared.AMINO_IN flag:

Nucleotide Generation Implementation

50% GC Content: Direct selection using AminoAcid.numberToBase[randy.nextInt(4)] for equal probability access to all four bases
Biased GC Content: Two-stage selection where randy.nextFloat()>=gc determines AT vs GC group, followed by randy.nextBoolean() for specific base selection within the group
Homopolymer Prevention: Iterative regeneration loop using while(noPoly && b==prev) until current base differs from previous byte value
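
The three rules above can be sketched in Python as follows. This mirrors the described logic rather than reproducing the Java source; the function name and the use of Python's random module are assumptions of the sketch.

```python
import random


def random_nucleotides(length, gc=0.5, no_poly=False, rng=None):
    """Return a random DNA string following the selection rules described above."""
    rng = rng or random.Random()
    out = []
    prev = None
    for _ in range(length):
        while True:
            if gc == 0.5:
                # Unbiased case: a single draw picks among all four bases.
                base = "ACGT"[rng.randrange(4)]
            else:
                # Stage 1: a draw >= gc selects the AT group, otherwise the GC group.
                in_at_group = rng.random() >= gc
                # Stage 2: a coin flip picks the base within the group.
                first = rng.random() < 0.5
                if in_at_group:
                    base = "A" if first else "T"
                else:
                    base = "C" if first else "G"
            # Homopolymer ban: redraw while the base repeats the previous one.
            if not (no_poly and base == prev):
                break
        out.append(base)
        prev = base
    return "".join(out)
```

In this sketch, redrawing on repeats rejects the more common bases more often, so with no_poly enabled and a biased gc the realized GC fraction drifts somewhat toward 50%. Whether the tool shows the same drift has not been verified here.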

Amino Acid Generation Implementation

Uses AminoAcid.numberToAcid array with limit=(includeStop ? acids.length : acids.length-1) for conditional stop codon inclusion
Selection via acids[randy.nextInt(limit)] from the standard amino acid array
Padding regions use 'X' character instead of 'N' for amino acid sequences
Headers generated as "gene"+chrom instead of "contig"+chrom for protein sequences
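
The limit trick above works when the stop symbol sits at the end of the amino acid array, so that shortening the range by one excludes it. A minimal sketch, assuming a 21-symbol alphabet with '*' last (the real ordering of AminoAcid.numberToAcid is not shown in this document, and random_protein is a hypothetical name):

```python
import random

# 20 standard one-letter codes followed by '*' for stop; this ordering is assumed.
ACIDS = "ACDEFGHIKLMNPQRSTVWY*"


def random_protein(length, include_stop=False, no_poly=False, rng=None):
    """Return a random amino acid string; stops appear only if include_stop is set."""
    rng = rng or random.Random()
    # Excluding the last index removes the stop symbol from the draw.
    limit = len(ACIDS) if include_stop else len(ACIDS) - 1
    out = []
    prev = None
    for _ in range(length):
        acid = ACIDS[rng.randrange(limit)]
        while no_poly and acid == prev:  # optional homopolymer ban
            acid = ACIDS[rng.randrange(limit)]
        out.append(acid)
        prev = acid
    return "".join(out)
```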

Output Formatting

Sequences are formatted according to FASTA standards with configurable line wrapping
Chromosome headers are automatically generated as "contig1", "contig2", etc. for nucleotides or "gene1", "gene2", etc. for amino acids
Padding regions (if specified) are added as N's (nucleotides) or X's (amino acids) at both ends of each chromosome
Total length is divided evenly among chromosomes, with padding not counting toward the specified length
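
The formatting rules above amount to a header line, padding on both ends, and fixed-width wrapping. A minimal sketch follows; format_record is a hypothetical helper, and the wrap width of 80 is only an illustrative default, not the tool's documented value.

```python
def format_record(index, seq, pad=0, amino=False, wrap=80):
    """Return one FASTA record as text, with padding and line wrapping applied."""
    name = ("gene" if amino else "contig") + str(index)
    filler = ("X" if amino else "N") * pad  # padding symbol depends on sequence type
    padded = filler + seq + filler          # padding goes on both ends
    lines = [">" + name]
    lines += [padded[i:i + wrap] for i in range(0, len(padded), wrap)]
    return "\n".join(lines) + "\n"


print(format_record(1, "ACGT", pad=2, wrap=5), end="")
# >contig1
# NNACG
# TNN
```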

Performance Characteristics

Memory Usage: Low memory footprint using streaming output with configurable wrap buffer
Speed: Optimized for equal GC content (50%) with direct random base selection
Scalability: Each chromosome is limited by the maximum Java array length (about 2^31 - 1 elements), so larger genomes must be split across more chromosomes
Thread Safety: Uses thread-local random number generation for consistent results

Randomization Quality

When bases are drawn at random, the generated sequences are very unlikely to contain long exact repeats, because the chance that two given windows are identical shrinks exponentially with window length. Random numbers come from Java's pseudorandom number generators, and the optional seed parameter makes the output reproducible when needed for testing or benchmarking.
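
The repeat-free claim can be given a rough number. The sketch below assumes independent draws with a fixed GC fraction, and it ignores correlations between overlapping windows as well as reverse-complement matches, so it is an order-of-magnitude estimate only. The function names are hypothetical.

```python
def match_probability(window, gc=0.5):
    """Probability that two independent random windows of this length are identical."""
    p_gc, p_at = gc / 2, (1 - gc) / 2
    # Chance that two independent bases agree: sum of squared base probabilities.
    per_base = 2 * p_gc**2 + 2 * p_at**2
    return per_base ** window


def expected_repeat_pairs(genome_len, window, gc=0.5):
    """Approximate expected number of identical window pairs in a random genome."""
    positions = max(genome_len - window + 1, 0)
    pairs = positions * (positions - 1) / 2
    return pairs * match_probability(window, gc)


print(expected_repeat_pairs(10**6, 50))  # far below 1 for a 1 Mb genome
```

At 50% GC the per-base match probability is 0.25, so for 50-base windows in a 1 megabase genome the expected number of identical pairs is many orders of magnitude below one. Short windows are a different matter: exact repeats of a dozen bases or so are expected in any genome of that size.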

Use Cases

Algorithm Testing: Generate test genomes with known characteristics for validating bioinformatics tools
Performance Benchmarking: Create large datasets for testing tool performance and memory usage
Simulation Studies: Provide background sequences for simulation of evolutionary processes
Method Development: Create controlled datasets for developing new analysis methods
Educational Purposes: Generate example genomes for teaching bioinformatics concepts

Support

For questions and support:

Email: bbushnell@lbl.gov
Documentation: bbmap.org