RandomReads

Script: randomreads.sh Package: synth Class: RandomReads3.java

Generates random synthetic reads from a reference genome with precise customization of insert size, mutations, quality scores, and read naming. The generated read names encode genomic origin information for automated accuracy calculations with BBMap, BBMerge, and ROC curve generation tools.

Basic Usage

randomreads.sh ref=<file> out=<file> length=<number> reads=<number>

Generate synthetic reads from a reference genome for testing, benchmarking, and algorithm validation.

Parameters

Parameters are organized by function: basic configuration, read length control, paired-end settings, synthetic mutations, quality score modeling, and specialized applications.

Basic Parameters

out=null
Output file for synthetic reads. If reads are paired and a single file name is given, output will be interleaved. For paired reads in twin files, set out1= and out2=.
ref=null
Reference genome file in FASTA format. Not needed if the reference is already indexed with the specified build ID.
build=1
Build identifier for genome indexing. If multiple references are indexed in the same directory, each needs a unique build ID.
midpad=300
Number of padding bases between scaffolds in the packed genome index structure.
reads=0
Total number of reads (or read pairs if paired=true) to generate.
coverage=-1
If positive, automatically calculate the number of reads needed to achieve this coverage depth based on genome size and read length.
overwrite=t
Allow overwriting of existing output files. Set to false to prevent accidental overwriting.
replacenoref=f
Replace N bases in the reference sequence with random nucleotides (A, C, G, T) before read generation.
simplenames=f
Generate simplified read names that clearly indicate genomic origin without BBMap internal coordinate encoding.
illuminanames=f
Use Illumina-style naming convention with matching names for paired reads, rather than location-based naming.
renamebyinsert=f
Include insert size information directly in the read name for paired-end reads.
addpairnum=f
Append ' 1:' and ' 2:' to the end of paired read names to indicate read number.
addslash=f
Append '/1' and '/2' to the end of paired read names in traditional FASTQ format.
spaceslash=f
Add a space before the slash in paired read names (e.g., ' /1' instead of '/1').
prefix=null
Use this string as a prefix for all generated read names instead of genomic coordinate-based naming.
seed=0
Random number generator seed for reproducible output. Use -1 for a random seed based on system time.

Length Parameters

minlength=150
Minimum read length in base pairs.
maxlength=150
Maximum read length in base pairs.
gaussianlength=f
Use Gaussian (bell-shaped) length distribution suitable for PacBio long reads. Otherwise uses linear distribution.
midlength=-1
Peak position for Gaussian length distribution. Must be between minlength and maxlength when gaussianlength=true.
readlengthsd=-1
Standard deviation for Gaussian read length distribution. Default is set to 1/4 of the length range (maxlength-minlength)/4.

Pairing Parameters

paired=f
Generate paired-end reads with insert size control and mate positioning.
mininsert=
Minimum insert size for paired reads. Default value depends on read length settings.
maxinsert=
Maximum insert size for paired reads. Default value depends on read length settings.
triangle=f
Use triangular insert size distribution with peak at the middle of the range.
flat=f
Use approximately flat insert size distribution across the specified range.
superflat=f
Use perfectly uniform flat insert size distribution.
gaussian=t
Use Gaussian (bell-shaped) insert size distribution with standard deviation of (maxinsert-mininsert)/6.
samestrand=f
Generate paired reads on the same DNA strand instead of opposite strands.

Mutation Parameters

snprate=0
Probability (0-1) of adding SNPs to each read. With rate X, each read has X chance of ≥1 SNP, X² chance of ≥2 SNPs, etc.
insrate=0
Probability (0-1) of adding insertions to each read using the same probabilistic model as SNPs.
delrate=0
Probability (0-1) of adding deletions to each read using the same probabilistic model as SNPs.
subrate=0
Probability (0-1) of adding contiguous substitutions to each read using the same probabilistic model.
nrate=0
Probability (0-1) of adding N-call blocks to each read using the same probabilistic model.
maxsnps=3
Maximum number of SNPs that can be added to a single read.
maxinss=2
Maximum number of insertions that can be added to a single read.
maxdels=2
Maximum number of deletions that can be added to a single read.
maxsubs=2
Maximum number of contiguous substitution blocks that can be added to a single read.
maxns=0
Maximum number of N-call blocks that can be added to a single read.
maxinslen=12
Maximum length in bases of insertion mutations.
maxdellen=400
Maximum length in bases of deletion mutations.
maxsublen=12
Maximum length in bases of contiguous substitution blocks.
maxnlen=1
Maximum length in bases of N-call blocks.
mininslen=1
Minimum length in bases of insertion mutations.
mindellen=1
Minimum length in bases of deletion mutations.
minsublen=2
Minimum length in bases of contiguous substitution blocks.
minnlen=1
Minimum length in bases of N-call blocks.

Illumina Quality Parameters

maxq=36
Maximum quality score for synthetic reads in Phred scale.
midq=28
Target average quality score for the quality distribution.
minq=20
Minimum quality score for synthetic reads in Phred scale.
q=
Set maxq, midq, and minq to the same value for uniform quality scores.
adderrors=t
Add base substitution errors based on quality scores after applying synthetic mutations.
qv=4
Quality variance to simulate tile effects and position-dependent quality variation within reads.

PacBio Quality Parameters

pacbio=f
Use PacBio error model with higher error rates instead of Illumina model. Add PacBio-style errors after mutations.
pbmin=0.13
Minimum error rate per read for PacBio error model (13%).
pbmax=0.17
Maximum error rate per read for PacBio error model (17%).

Other Parameters

overlap=1
Minimum number of bases that reads must overlap scaffold ends to be considered valid.
banns=f
Prohibit generation of reads that span reference N bases.
metagenome=f
Simulate metagenomic data by assigning scaffolds random exponential coverage levels to mimic natural abundance distributions.
randomscaffold=f
Choose scaffolds randomly without weighting by length (uniform scaffold selection).
amp=1
Simulate highly-amplified single-cell MDA data by clustering reads spatially. Set to higher values like 1000 for strong amplification bias.
pbadapter=
Add PacBio adapter sequences to reads using this literal string. Adapters are added in reverse complement orientation.
fragadapter=
Add Illumina adapter sequences to paired reads when insert size is shorter than read length. Default is standard Illumina adapter sequence.
fragadapter2=
Use this adapter sequence for read 2 in paired-end runs with short inserts. Default uses standard Illumina read 2 adapter.

Java Parameters

-Xmx
Set Java heap memory limit. Format: -Xmx20g (20GB) or -Xmx800m (800MB). Maximum is typically 85% of physical memory.
-eoom
Exit when out-of-memory exception occurs instead of trying to continue. Requires Java 8u92+.
-da
Disable Java assertions for slightly improved performance.

Examples

Basic Single-End Reads

randomreads.sh ref=genome.fa out=reads.fq reads=100000 length=150

Generate 100,000 single-end reads of 150bp from genome.fa

Paired-End with Insert Size

randomreads.sh ref=genome.fa out1=reads_1.fq out2=reads_2.fq reads=50000 length=150 paired=t mininsert=300 maxinsert=500

Generate 50,000 paired-end reads with insert sizes from 300-500bp

With Mutations and Quality Variation

randomreads.sh ref=genome.fa out=mutated_reads.fq reads=10000 length=100 snprate=0.01 maxsnps=2 minq=20 maxq=40

Generate reads with 1% SNP rate (max 2 per read) and quality scores from 20-40

Coverage-Based Generation

randomreads.sh ref=genome.fa out=coverage_reads.fq coverage=30 length=150 paired=t

Generate paired-end reads to achieve 30x coverage of the reference genome

PacBio Long Reads

randomreads.sh ref=genome.fa out=pacbio.fq reads=5000 minlength=1000 maxlength=10000 gaussianlength=t pacbio=t

Generate PacBio-style long reads with Gaussian length distribution and PacBio error model

Metagenome Simulation

randomreads.sh ref=metagenome.fa out=meta_reads.fq reads=100000 length=150 paired=t metagenome=t

Simulate metagenomic reads with natural abundance distribution across scaffolds

Algorithm Details

Random Number Generation Architecture

RandomReads3 implements independent Random objects seeded from master seed+offset to ensure reproducible parallel generation:

Genome Index Implementation

Uses FastaToChromArrays2.main2() for packed chromosome storage with ChromosomeArray.getBytes() extraction:

Mutation Application Pipeline

Sequential mutation application using separate methods with precise algorithmic controls:

  1. Count Determination: while(randyMutationType.nextFloat()<rate){count++} up to maxType limits
  2. Length Bias Implementation: Tools.min(rand.nextInt(), rand.nextInt(), rand.nextInt()) for biasing toward shorter mutations
  3. Position Collision Avoidance: BitSet.get()/set() prevents overlapping SNPs and N-calls when USE_UNIQUE_SNPS=true
  4. SNP Selection Logic: AminoAcid.baseToNumber[] conversion with (oldNum+rand.nextInt(3)+1)&3 to avoid identity
  5. Substitution Algorithm: addSUB() ensures first/last bases differ from reference, middle bases use any nucleotide

Quality Score Generation

Uses QualityTools.makeQualityArray() with specific distribution parameters:

Insert Size Distribution Models

Configurable distribution algorithms for paired-end insert sizes:

Read Length Distribution Algorithm

Multi-tier selection system in genReadLen() using probability thresholds:

Adapter Contamination Algorithm

Fragment adapter addition using addFragAdapter() when insert < read length:

Specialized Generation Modes

Advanced simulation capabilities with specific algorithmic implementations:

Technical Notes

Memory Requirements

Memory usage scales with genome size and mutation complexity:

Performance Characteristics

Output Compatibility

Generated reads are compatible with standard bioinformatics tools:

Support

For questions and support: