RandomGenome
Generates a random, (probably) repeat-free genome with a specified total size, GC content, and chromosome count, as either nucleotide or amino acid sequences.
Basic Usage
randomgenome.sh len=<total size> chroms=<int> gc=<float> out=<file>
Creates random genome sequences with customizable parameters for testing, benchmarking, or simulation purposes.
Parameters
RandomGenome supports parameters for output control, genome characteristics, sequence composition, and reproducibility.
Output Parameters
- out=<file>
- Output file for the generated genome sequences. Required parameter.
- overwrite=f
- (ow) Set to false to force the program to abort rather than overwrite an existing file. Default: f
Genome Structure Parameters
- len=100000
- Total genome size in bases. Can use K/M/G suffixes (e.g., 1M for 1 million bases). Default: 100000
- chroms=1
- Number of separate chromosomes/contigs to generate. The total length is divided evenly among chromosomes. Default: 1
- pad=0
- Add this many N characters (X for amino acid output) to both ends of each chromosome. These padding characters do not count toward the total genome size. Default: 0
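The size arithmetic described above can be sketched in a few lines of Python. This is only an illustration, not the tool's code: the helper names parse_len and chrom_layout are hypothetical, the suffixes are assumed to be decimal (consistent with "1M for 1 million bases"), and how the tool handles a total that does not divide evenly is not stated here, so the sketch simply uses integer division.

```python
# Hypothetical helpers; a sketch, not RandomGenome's actual parsing code.
SUFFIXES = {"K": 1_000, "M": 1_000_000, "G": 1_000_000_000}  # decimal multipliers (assumed)

def parse_len(text: str) -> int:
    """Parse a size such as '100000', '500K', or '1M' into a base count."""
    text = text.strip().upper()
    if text and text[-1] in SUFFIXES:
        return int(float(text[:-1]) * SUFFIXES[text[-1]])
    return int(text)

def chrom_layout(total_len: int, chroms: int, pad: int = 0) -> tuple:
    """Return (random bases per chromosome, characters written per chromosome)."""
    per_chrom = total_len // chroms          # assumption: integer division, remainder dropped
    return per_chrom, per_chrom + 2 * pad    # padding is added at both ends

per_chrom, written = chrom_layout(parse_len("2M"), chroms=10, pad=100)
print(per_chrom, written)  # 200000 200200
```

With len=2M chroms=10 pad=100, each chromosome holds 200,000 random bases and is written as 200,200 characters once the padding is included.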
Sequence Composition Parameters
- gc=0.5
- GC fraction for nucleotide sequences (range 0.0-1.0). Controls the proportion of G and C bases versus A and T bases. Default: 0.5 (50% GC content)
- nopoly=f
- Ban homopolymers (consecutive identical bases/amino acids). When true, prevents runs of the same character. Default: f
- amino=f
- Produce random amino acid sequences instead of nucleotide sequences. Output will contain protein sequences rather than DNA. Default: f
- includestop=f
- Include stop codons in random amino acid sequences. Only applies when amino=t. Default: f
Reproducibility Parameters
- seed=-1
- Random number generator seed. Set to a positive number for deterministic, reproducible output. Default: -1 (random seed)
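The effect of a fixed seed can be illustrated with a short Python sketch. It uses Python's random module rather than the Java generator the tool relies on, so it will not reproduce RandomGenome's actual output; the function names are hypothetical, and treating every negative seed as "random" is an assumption based on the -1 default.

```python
import random

def make_rng(seed: int = -1) -> random.Random:
    """Fixed generator for seed >= 0; negative seeds draw from system entropy (assumed)."""
    return random.Random() if seed < 0 else random.Random(seed)

def random_bases(n: int, seed: int = -1) -> str:
    """Return n uniformly random bases from a generator built for this call."""
    rng = make_rng(seed)
    return "".join(rng.choice("ACGT") for _ in range(n))

# The same seed gives the same sequence on every run.
print(random_bases(20, seed=12345) == random_bases(20, seed=12345))  # True
```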
Examples
Basic Genome Generation
randomgenome.sh len=1M chroms=5 gc=0.4 out=test_genome.fa
Generates a 1 megabase genome split into 5 chromosomes with 40% GC content.
High GC Content Genome
randomgenome.sh len=500K chroms=1 gc=0.7 nopoly=t out=high_gc_genome.fa
Creates a 500 kilobase single chromosome with 70% GC content and no homopolymer runs.
Reproducible Generation
randomgenome.sh len=100K chroms=3 gc=0.45 seed=12345 out=reproducible_genome.fa
Generates a reproducible 100 kilobase genome using a fixed seed for consistent results across runs.
Amino Acid Sequences
randomgenome.sh len=10K amino=t includestop=t chroms=1 out=random_proteins.fa
Creates a single random protein sequence of 10,000 amino acids, with stop codons included.
Padded Contigs
randomgenome.sh len=2M chroms=10 pad=100 gc=0.5 out=padded_genome.fa
Generates a 2 megabase genome with 10 chromosomes, each padded with 100 N's at both ends.
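To confirm that a generated file matches the requested gc value, the GC fraction can be measured directly from the FASTA text. The following Python sketch is one way to do that; gc_fraction is a hypothetical helper, not part of the tool, and it ignores header lines and N padding.

```python
def gc_fraction(fasta_text: str) -> float:
    """GC fraction over A/C/G/T only; headers and N padding are ignored."""
    gc = at = 0
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            continue                      # skip FASTA header lines
        seq = line.strip().upper()
        gc += seq.count("G") + seq.count("C")
        at += seq.count("A") + seq.count("T")
    return gc / (gc + at) if gc + at else 0.0

example = ">contig1\nNNGGCCATNN\n>contig2\nNNGCATATNN\n"
print(gc_fraction(example))  # 0.5
```

For a real file, pass its contents, for example gc_fraction(open("test_genome.fa").read()). For a random genome the measured value should be close to the requested one, not exactly equal to it.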
Algorithm Details
RandomGenome implements two distinct generation methods determined by the Shared.AMINO_IN flag:
Nucleotide Generation Implementation
- 50% GC Content: Direct selection using AminoAcid.numberToBase[randy.nextInt(4)], so all four bases are equally likely and each base needs a single random draw
- Biased GC Content: Two-stage selection where randy.nextFloat()>=gc selects the AT group (otherwise the GC group), followed by randy.nextBoolean() to pick one of the two bases within that group
- Homopolymer Prevention: Iterative regeneration loop using while(noPoly && b==prev) until current base differs from previous byte value
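The three rules above can be sketched in Python as follows. This is an illustration of the described logic, not a port of the Java source: the function names are hypothetical, Python's random module stands in for randy, and the "ACGT" ordering of the base table is an assumption.

```python
import random

BASES = "ACGT"  # stand-in for AminoAcid.numberToBase; this ordering is an assumption

def pick_base(rng: random.Random, gc: float) -> str:
    if gc == 0.5:
        return BASES[rng.randrange(4)]   # one draw, all four bases equally likely
    at_group = rng.random() >= gc        # mirrors randy.nextFloat()>=gc
    first = rng.random() < 0.5           # mirrors randy.nextBoolean()
    if at_group:
        return "A" if first else "T"
    return "C" if first else "G"

def random_sequence(length: int, gc: float = 0.5,
                    no_poly: bool = False, seed: int = 1) -> str:
    rng = random.Random(seed)
    out, prev = [], ""
    for _ in range(length):
        b = pick_base(rng, gc)
        while no_poly and b == prev:     # redraw until the base differs from the previous one
            b = pick_base(rng, gc)
        out.append(b)
        prev = b
    return "".join(out)

print(random_sequence(30, gc=0.7, no_poly=True))
```

Because each group contains two bases, the redraw loop always terminates, even at gc=0.0 or gc=1.0.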
Amino Acid Generation Implementation
- Uses the AminoAcid.numberToAcid array with limit=(includeStop ? acids.length : acids.length-1), so the final array element (the stop) is excluded unless includestop=t
- Selection via acids[randy.nextInt(limit)] from the standard amino acid array
- Padding regions use 'X' character instead of 'N' for amino acid sequences
- Headers generated as "gene"+chrom instead of "contig"+chrom for protein sequences
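A Python sketch of the amino acid path described above is shown below. The alphabet is an assumption: 20 standard residues with '*' as the final stop symbol, standing in for AminoAcid.numberToAcid, whose actual contents and order are not given here. The function names are hypothetical.

```python
import random

# Assumption: 20 standard residues with the stop symbol '*' as the last element.
ACIDS = "ACDEFGHIKLMNPQRSTVWY*"

def random_protein(length: int, include_stop: bool = False, seed: int = 1) -> str:
    rng = random.Random(seed)
    # Excluding the last index leaves the stop symbol out of the draw.
    limit = len(ACIDS) if include_stop else len(ACIDS) - 1
    return "".join(ACIDS[rng.randrange(limit)] for _ in range(length))

def protein_record(index: int, length: int, pad: int = 0, **kwargs) -> str:
    """One FASTA record with a 'gene' header and 'X' padding at both ends."""
    body = "X" * pad + random_protein(length, **kwargs) + "X" * pad
    return ">gene%d\n%s\n" % (index, body)

print(protein_record(1, 40, pad=3, include_stop=True), end="")
```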
Output Formatting
- Sequences are formatted according to FASTA standards with configurable line wrapping
- Chromosome headers are automatically generated as "contig1", "contig2", etc. for nucleotides or "gene1", "gene2", etc. for amino acids
- Padding regions (if specified) are added as N's (nucleotides) or X's (amino acids) at both ends of each chromosome
- Total length is divided evenly among chromosomes, with padding not counting toward the specified length
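The output layout described above can be sketched in Python. The helper names are hypothetical, the default wrap width of 70 is an assumption, and whether the tool wraps padding together with the sequence is also assumed here.

```python
def format_record(header: str, seq: str, pad: int = 0,
                  pad_char: str = "N", wrap: int = 70) -> str:
    """Build one FASTA record: header line, then the padded sequence in wrapped lines."""
    body = pad_char * pad + seq + pad_char * pad
    lines = [body[i:i + wrap] for i in range(0, len(body), wrap)]
    return ">" + header + "\n" + "\n".join(lines) + "\n"

def format_genome(seqs, amino: bool = False, pad: int = 0, wrap: int = 70) -> str:
    """Number records from 1 with 'contig' (nucleotide) or 'gene' (amino acid) headers."""
    prefix, pad_char = ("gene", "X") if amino else ("contig", "N")
    return "".join(
        format_record(f"{prefix}{i}", s, pad, pad_char, wrap)
        for i, s in enumerate(seqs, start=1)
    )

print(format_genome(["ACGTACGT", "GGCC"], pad=2, wrap=5), end="")
# >contig1
# NNACG
# TACGT
# NN
# >contig2
# NNGGC
# CNN
```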
Performance Characteristics
- Memory Usage: Low memory footprint using streaming output with configurable wrap buffer
- Speed: Fastest at 50% GC content, where each base is chosen with a single direct random selection instead of the two-stage selection used for biased GC content
- Scalability: Each chromosome can be as long as the maximum Java array length allows
- Thread Safety: Uses thread-local random number generation, so the generator is not shared between threads
Randomization Quality
The generated sequences are probably repeat-free because two independently generated long subsequences are extremely unlikely to be identical; short repeats can still occur by chance. Random numbers come from Java's pseudorandom number generator, and the optional seed parameter makes the output reproducible when needed for testing or benchmarking.
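A rough birthday-style estimate shows why long repeats are so unlikely. For a random sequence with a given GC fraction, two positions carry the same base with probability (gc^2 + (1-gc)^2)/2, so the expected number of position pairs sharing an identical k-mer is about C(n,2) times that probability raised to the power k, where n is the number of k-mers. The Python sketch below computes this; it is an approximation that considers the forward strand only and ignores correlations between overlapping k-mers, and the function names are hypothetical.

```python
def match_probability(gc: float) -> float:
    """Chance that two independent random bases are identical at a given GC fraction."""
    return (gc * gc + (1.0 - gc) * (1.0 - gc)) / 2.0

def expected_repeat_pairs(length: int, k: int, gc: float = 0.5) -> float:
    """Approximate expected number of position pairs sharing an identical k-mer."""
    n = max(length - k + 1, 0)           # number of k-mers in the sequence
    return n * (n - 1) / 2.0 * match_probability(gc) ** k

print(expected_repeat_pairs(100_000, 31))  # on the order of 1e-9
print(expected_repeat_pairs(100_000, 8))   # tens of thousands: short repeats are common
```

At the default len=100000 and gc=0.5, a repeated 31-mer is expected roughly once in a billion genomes, while repeated 8-mers are expected in large numbers.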
Use Cases
- Algorithm Testing: Generate test genomes with known characteristics for validating bioinformatics tools
- Performance Benchmarking: Create large datasets for testing tool performance and memory usage
- Simulation Studies: Provide background sequences for simulation of evolutionary processes
- Method Development: Create controlled datasets for developing new analysis methods
- Educational Purposes: Generate example genomes for teaching bioinformatics concepts
Support
For questions and support:
- Email: bbushnell@lbl.gov
- Documentation: bbmap.org