RandomReadsMG

Script: randomreadsmg.sh Package: synth Class: RandomReadsMG.java

Generates synthetic metagenomic sequencing reads from reference genomes with realistic coverage patterns, error profiles, and taxonomic diversity. Supports multiple sequencing platforms (Illumina, PacBio, ONT) with platform-specific read length distributions, error rates, and quality score modeling.

Basic Usage

randomreadsmg.sh *.fa out=reads.fq.gz

Generate synthetic reads from all FASTA assemblies with random coverage levels.

randomreadsmg.sh ecoli.fa=40 mruber.fa=0.1 phix.fa=10 adderrors out=reads.fq.gz

Generate synthetic reads with custom coverage levels and realistic errors for specific genomes.

randomreadsmg.sh in=M.ruber.fa.gz reads=20k adderrors pacbio out=pacbio_reads.fq

Generate exactly 20,000 PacBio reads with errors (demonstrates 'k' suffix support for read counts).

Parameters

Parameters are organized by their function in the synthetic read generation process. Each assembly can be assigned random or custom coverage levels, and reads contain taxonomic information when filenames follow the 'tid_x_' convention.

File parameters

in=<file,file>: Assembly input. Can be a single file, a directory of files, or comma-delimited list. Unrecognized arguments with no '=' sign will also be treated as input files.
out=<file>: Synthetic read output destination.
out2=<file>: Read 2 output if twin files are desired for paired reads.

Processing parameters

mindepth=1: Minimum assembly average depth.
maxdepth=256: Maximum assembly average depth.
depth=: Sets minimum and maximum to the same level.
reads=-1: If positive, ignore depth and make this many reads per contig.
variance=0.5: Coverage within an assembly will vary by up to this much; one region can be up to this fraction deeper than another.
mode=min4: Random depth distribution; can be min4, exp, root, or linear.
cov_x=: Set a custom coverage level for the file named x. x can alternatively be the taxID if the filename starts with tid_x_; e.g. cov_foo.fa=5 for foo.fa, or cov_7=5 for file tid_7_foo.fa
<file>=x: Alternate way to set custom depth; file will get depth x.
threads=: Set the max number of threads; default is logical core count. By default each input file uses 1 thread. This flag will also force multithreaded processing when there is exactly 1 input file, increasing speed for a complex simulation.
seed=-1: If positive, use the specified RNG seed. This will cause deterministic output if threads=1.

Artifact parameters

pcr=0.0: Add PCR duplicates at this rate (0-1).
randomkmer=f: Bias read start sites with random kmer priming.
kprime=6: Length for random kmer priming.
kpower=0.5: Raise linear primer distribution to this power (>0). Higher powers increase priming bias.
minkprob=0.1: Minimum primer kmer probability.

Platform parameters

illumina: Use Illumina length and error mode (default).
pacbio: Use PacBio length and error mode.
ont: Use ONT length and error mode.
paired=true: Generate paired reads in Illumina mode.
length=150: Read length; default is 150 for Illumina mode.
avginsert=300: Average insert size; only affects paired reads.

Long read parameters

minlen=1000: Minimum read length for PacBio/ONT modes.
meanlen=15000: Mean read length for PacBio/ONT modes.
maxlen=100000: Max read length for PacBio/ONT modes.
tailfactor=0.2: Controls heavy tail for ONT length distribution.
pbsigma=0.5: Log-normal standard deviation for PacBio length distribution.

Error parameters (all platforms)

adderrors=f: Set to true to add model-specific errors.
subrate=0.0: Add substitutions at this rate, independent of platform models.
indelrate=0.0: Add length-1 indels at this rate, independent of platform models.

Illumina-specific parameters

qavg=25: Average quality score, for generating Illumina errors.
qrange=0: Quality score range (+/- this much).
addadapters: Add adapter sequence to paired reads with insert size shorter than read length.
adapter1=: Optionally specify a custom R1 adapter (as observed in R1).
adapter2=: Optionally specify a custom R2 adapter (as observed in R2).

Long-read error parameters

srate=-1: Substitution rate; default 0.0025 ONT / 0.00015 PB.
irate=-1: Insertion rate; default 0.0055 ONT / 0.000055 PB.
drate=-1: Deletion rate; default 0.0045 ONT / 0.000045 PB.
hrate=-1: Homopolymer error boost; default 0.02 ONT / 0.000015 PB. The indel chance increases this much per homopolymer base.

Coverage variation parameters (only used with 'sinewave' flag)

sinewave: Enable realistic coverage variation within contigs.
numwaves=4: Number of sine waves to combine; more waves create more complex coverage patterns with irregular peaks and valleys.
waveamp=0.70: Controls the maximum variation in coverage due to the sine waves. Higher values (0-1) create more dramatic differences between high and low coverage regions.
oribias=0.25: Strength of the origin of replication bias. Controls the max linear decrease in coverage from start to end of contigs.
minprob=0.10: Sets the minimum coverage probability as a fraction of target. Makes it improbable for regions have coverage that drops below this level, preventing assembly gaps.
minperiod=2k: Minimum sine wave period, in bp.
maxperiod=80k: Maximum sine wave period, in bp.

Java Parameters

-Xmx: This will set Java's memory usage, overriding autodetection. -Xmx20g will specify 20 gigs of RAM, and -Xmx200m will specify 200 megs. The max is typically 85% of physical memory.
-eoom: This flag will cause the process to exit if an out-of-memory exception occurs. Requires Java 8u92+.
-da: Disable assertions.

Examples

Basic Synthetic Read Generation

# Generate Illumina paired-end reads from multiple assemblies
randomreadsmg.sh *.fa out=synthetic_reads.fq.gz

# Generate single-end reads with custom depth and errors
randomreadsmg.sh input.fa depth=50 paired=f adderrors out=single_reads.fq

Basic metagenomic read simulation with default Illumina parameters.

Platform-Specific Generation

# Generate exactly 20,000 PacBio reads with errors
randomreadsmg.sh in=M.ruber.fa.gz reads=20k adderrors pacbio out=pacbio_reads.fq

# Generate ONT reads with realistic error profile
randomreadsmg.sh genomes.fa ont meanlen=20000 adderrors=t out=ont_reads.fq.gz

Platform-specific read generation with appropriate length distributions and error models.

Custom Coverage and Error Modeling

# Custom coverage with specific genomes at different abundances
randomreadsmg.sh ecoli.fa=100 staph.fa=10 env_sample.fa=1 out=custom_coverage.fq.gz

# Add realistic coverage variation and PCR duplicates
randomreadsmg.sh genomes.fa sinewave numwaves=6 waveamp=0.8 pcr=0.05 out=realistic_reads.fq.gz

Simulation with custom abundance patterns and realistic artifacts including sine wave coverage variation and PCR duplicates.

Taxonomic Classification Setup

# Generate reads from taxonomically labeled genomes
randomreadsmg.sh tid_511145_ecoli_k12.fa tid_1280_staph_aureus.fa out=tax_reads.fq.gz

# Custom coverage by taxonomy ID
randomreadsmg.sh tid_*.fa cov_511145=50 cov_1280=25 out=custom_tax_reads.fq.gz

Generate synthetic reads with proper taxonomic labels for benchmarking metagenomic classifiers.

Multi-threaded Processing

# Process large dataset with multiple threads
randomreadsmg.sh large_genomes/ threads=16 seed=12345 out=parallel_reads.fq.gz

# Single complex genome with multi-threading
randomreadsmg.sh complex_genome.fa threads=8 sinewave numwaves=10 out=complex_reads.fq.gz

Efficient processing of large reference datasets with deterministic output.

Algorithm Details

Coverage Distribution Models

RandomReadsMG implements four coverage distribution modes to model realistic metagenomic abundance patterns:

MIN4 (default): Takes the minimum of 4 random square-root values, creating a heavily skewed distribution toward low-abundance organisms typical in environmental samples.
EXP: Uses exponential distribution with lambda=1/√(min_depth×max_depth), producing natural long-tailed abundance patterns.
ROOT: Square-root distribution providing moderate skew toward lower abundances.
LINEAR: Uniform distribution across the depth range for even coverage testing.

Platform-Specific Error Modeling

Each sequencing platform uses distinct error characteristics:

Illumina Error Model

Quality-based error introduction using PHRED scores. Generates random quality scores within the specified range (qavg ± qrange), then introduces substitutions probabilistically based on quality-to-error probability conversion. Adapter contamination is modeled when insert sizes are shorter than read length.

PacBio Error Model

Uses log-normal read length distribution with configurable sigma parameter (default 0.5). Error rates: substitutions 0.00015, insertions/deletions 0.000055/0.000045, homopolymer bonus 0.000015. Length generation applies bias correction: μ = log(mean) - 0.5σ².

ONT Error Model

Mixed distribution approach combining log-normal core with exponential heavy tail (controlled by tailfactor=0.2). Higher error rates: substitutions 0.0025, insertions/deletions 0.0055/0.0045, homopolymer bonus 0.02. Models the bimodal length patterns typical of ONT sequencing.

Sine Wave Coverage Modeling

The sinewave flag enables within-contig coverage variation modeling with realistic spatial bias patterns:

Multiple sine waves: Combines numwaves overlapping sine waves with random periods between minperiod and maxperiod.
Origin bias: Models replication origin effects with linear coverage decrease controlled by oribias parameter.
Coverage bounds: minprob prevents coverage from dropping below a threshold, avoiding unrealistic assembly gaps.
Wave amplitude: waveamp controls the maximum variation strength, with values near 0.7-0.8 producing realistic patterns.

Multi-threaded Architecture

ProcessThread workers use atomic coordination for work distribution. Each thread maintains independent random number generators seeded deterministically (seed + thread_id) for reproducible results. Thread-local statistics are accumulated atomically at completion. Single-file multi-threading splits depth targets across threads while maintaining total coverage.

Taxonomic Integration

Automatically parses taxonomy IDs from filenames following 'tid_x_' convention or read headers. Integrates with TaxTree for proper species classification. Read headers encode source information: f_[file]_c_[contig]_s_[strand]_p_[position]_i_[insert]_d_[duplicate]_tid_[taxid]. Enables benchmarking of metagenomic analysis tools with ground-truth data.

PCR Duplicate Modeling

Caches read start positions, insert sizes, and strand orientations to generate realistic PCR duplicates at the specified rate. Duplicate reads maintain identical genomic coordinates while receiving distinct read numbers. Single-end and paired-end duplicates are handled appropriately for each platform.

Notes

Reads headers contain the TaxID of the originating genome if the filename starts with 'tid_x_' where x is a positive integer.
Custom coverage can be set per file using cov_filename=depth or filename=depth syntax.
The tool supports automatic file format detection and compression (gzip, bzip2).
Memory usage scales with the number of concurrent genomes (maxConcurrentGenomes parameter).
For deterministic output, use seed parameter with threads=1 or ensure consistent thread counts.
Adapter contamination only occurs when insert size is shorter than read length in paired mode.
Long-read error rates can be overridden for any platform, including Illumina.
Coverage variation (variance parameter) is applied unless sine wave modeling is enabled.

Support

For questions and support:

Email: bbushnell@lbl.gov
Documentation: bbmap.org