RandomReads

Script: randomreads.sh Package: synth Class: RandomReads3.java

Generates random synthetic reads from a reference genome with precise customization of insert size, mutations, quality scores, and read naming. The generated read names encode genomic origin information for automated accuracy calculations with BBMap, BBMerge, and ROC curve generation tools.

Basic Usage

randomreads.sh ref=<file> out=<file> length=<number> reads=<number>

Generate synthetic reads from a reference genome for testing, benchmarking, and algorithm validation.

Parameters

Parameters are organized by function: basic configuration, read length control, paired-end settings, synthetic mutations, quality score modeling, and specialized applications.

Basic Parameters

out=null: Output file for synthetic reads. If reads are paired and a single file name is given, output will be interleaved. For paired reads in twin files, set out1= and out2=.
ref=null: Reference genome file in FASTA format. Not needed if the reference is already indexed with the specified build ID.
build=1: Build identifier for genome indexing. If multiple references are indexed in the same directory, each needs a unique build ID.
midpad=300: Number of padding bases between scaffolds in the packed genome index structure.
reads=0: Total number of reads (or read pairs if paired=true) to generate.
coverage=-1: If positive, automatically calculate the number of reads needed to achieve this coverage depth based on genome size and read length.
overwrite=t: Allow overwriting of existing output files. Set to false to prevent accidental overwriting.
replacenoref=f: Replace N bases in the reference sequence with random nucleotides (A, C, G, T) before read generation.
simplenames=f: Generate simplified read names that clearly indicate genomic origin without BBMap internal coordinate encoding.
illuminanames=f: Use Illumina-style naming convention with matching names for paired reads, rather than location-based naming.
renamebyinsert=f: Include insert size information directly in the read name for paired-end reads.
addpairnum=f: Append ' 1:' and ' 2:' to the end of paired read names to indicate read number.
addslash=f: Append '/1' and '/2' to the end of paired read names in traditional FASTQ format.
spaceslash=f: Add a space before the slash in paired read names (e.g., ' /1' instead of '/1').
prefix=null: Use this string as a prefix for all generated read names instead of genomic coordinate-based naming.
seed=0: Random number generator seed for reproducible output. Use -1 for a random seed based on system time.

Length Parameters

minlength=150: Minimum read length in base pairs.
maxlength=150: Maximum read length in base pairs.
gaussianlength=f: Use Gaussian (bell-shaped) length distribution suitable for PacBio long reads. Otherwise uses linear distribution.
midlength=-1: Peak position for Gaussian length distribution. Must be between minlength and maxlength when gaussianlength=true.
readlengthsd=-1: Standard deviation for Gaussian read length distribution. Default is set to 1/4 of the length range (maxlength-minlength)/4.

Pairing Parameters

paired=f: Generate paired-end reads with insert size control and mate positioning.
mininsert=: Minimum insert size for paired reads. Default value depends on read length settings.
maxinsert=: Maximum insert size for paired reads. Default value depends on read length settings.
triangle=f: Use triangular insert size distribution with peak at the middle of the range.
flat=f: Use approximately flat insert size distribution across the specified range.
superflat=f: Use perfectly uniform flat insert size distribution.
gaussian=t: Use Gaussian (bell-shaped) insert size distribution with standard deviation of (maxinsert-mininsert)/6.
samestrand=f: Generate paired reads on the same DNA strand instead of opposite strands.

Mutation Parameters

snprate=0: Probability (0-1) of adding SNPs to each read. With rate X, each read has X chance of ≥1 SNP, X² chance of ≥2 SNPs, etc.
insrate=0: Probability (0-1) of adding insertions to each read using the same probabilistic model as SNPs.
delrate=0: Probability (0-1) of adding deletions to each read using the same probabilistic model as SNPs.
subrate=0: Probability (0-1) of adding contiguous substitutions to each read using the same probabilistic model.
nrate=0: Probability (0-1) of adding N-call blocks to each read using the same probabilistic model.
maxsnps=3: Maximum number of SNPs that can be added to a single read.
maxinss=2: Maximum number of insertions that can be added to a single read.
maxdels=2: Maximum number of deletions that can be added to a single read.
maxsubs=2: Maximum number of contiguous substitution blocks that can be added to a single read.
maxns=0: Maximum number of N-call blocks that can be added to a single read.
maxinslen=12: Maximum length in bases of insertion mutations.
maxdellen=400: Maximum length in bases of deletion mutations.
maxsublen=12: Maximum length in bases of contiguous substitution blocks.
maxnlen=1: Maximum length in bases of N-call blocks.
mininslen=1: Minimum length in bases of insertion mutations.
mindellen=1: Minimum length in bases of deletion mutations.
minsublen=2: Minimum length in bases of contiguous substitution blocks.
minnlen=1: Minimum length in bases of N-call blocks.

Illumina Quality Parameters

maxq=36: Maximum quality score for synthetic reads in Phred scale.
midq=28: Target average quality score for the quality distribution.
minq=20: Minimum quality score for synthetic reads in Phred scale.
q=: Set maxq, midq, and minq to the same value for uniform quality scores.
adderrors=t: Add base substitution errors based on quality scores after applying synthetic mutations.
qv=4: Quality variance to simulate tile effects and position-dependent quality variation within reads.

PacBio Quality Parameters

pacbio=f: Use PacBio error model with higher error rates instead of Illumina model. Add PacBio-style errors after mutations.
pbmin=0.13: Minimum error rate per read for PacBio error model (13%).
pbmax=0.17: Maximum error rate per read for PacBio error model (17%).

Other Parameters

overlap=1: Minimum number of bases that reads must overlap scaffold ends to be considered valid.
banns=f: Prohibit generation of reads that span reference N bases.
metagenome=f: Simulate metagenomic data by assigning scaffolds random exponential coverage levels to mimic natural abundance distributions.
randomscaffold=f: Choose scaffolds randomly without weighting by length (uniform scaffold selection).
amp=1: Simulate highly-amplified single-cell MDA data by clustering reads spatially. Set to higher values like 1000 for strong amplification bias.
pbadapter=: Add PacBio adapter sequences to reads using this literal string. Adapters are added in reverse complement orientation.
fragadapter=: Add Illumina adapter sequences to paired reads when insert size is shorter than read length. Default is standard Illumina adapter sequence.
fragadapter2=: Use this adapter sequence for read 2 in paired-end runs with short inserts. Default uses standard Illumina read 2 adapter.

Java Parameters

-Xmx: Set Java heap memory limit. Format: -Xmx20g (20GB) or -Xmx800m (800MB). Maximum is typically 85% of physical memory.
-eoom: Exit when out-of-memory exception occurs instead of trying to continue. Requires Java 8u92+.
-da: Disable Java assertions for slightly improved performance.

Examples

Basic Single-End Reads

randomreads.sh ref=genome.fa out=reads.fq reads=100000 length=150

Generate 100,000 single-end reads of 150bp from genome.fa

Paired-End with Insert Size

randomreads.sh ref=genome.fa out1=reads_1.fq out2=reads_2.fq reads=50000 length=150 paired=t mininsert=300 maxinsert=500

Generate 50,000 paired-end reads with insert sizes from 300-500bp

With Mutations and Quality Variation

randomreads.sh ref=genome.fa out=mutated_reads.fq reads=10000 length=100 snprate=0.01 maxsnps=2 minq=20 maxq=40

Generate reads with 1% SNP rate (max 2 per read) and quality scores from 20-40

Coverage-Based Generation

randomreads.sh ref=genome.fa out=coverage_reads.fq coverage=30 length=150 paired=t

Generate paired-end reads to achieve 30x coverage of the reference genome

PacBio Long Reads

randomreads.sh ref=genome.fa out=pacbio.fq reads=5000 minlength=1000 maxlength=10000 gaussianlength=t pacbio=t

Generate PacBio-style long reads with Gaussian length distribution and PacBio error model

Metagenome Simulation

randomreads.sh ref=metagenome.fa out=meta_reads.fq reads=100000 length=150 paired=t metagenome=t

Simulate metagenomic reads with natural abundance distribution across scaffolds

Algorithm Details

Random Number Generation Architecture

RandomReads3 implements independent Random objects seeded from master seed+offset to ensure reproducible parallel generation:

randy (seed+1), randy2 (seed+2): Primary and secondary position selection
randyMutationType (seed+3): Mutation type and count determination using while(random.nextFloat()<rate) loops
randyQual (seed+5): Quality score generation via QualityTools.makeQualityArray()
randyMate (seed+6), randy2Mate (seed+7): Paired-end mate positioning algorithms
randyLength (seed+21): Read length distribution using multi-tier Gaussian selection
randyAmp (seed+22): MDA amplification bias clustering simulation

Genome Index Implementation

Uses FastaToChromArrays2.main2() for packed chromosome storage with ChromosomeArray.getBytes() extraction:

Concatenated Storage: Scaffolds stored with midpad=300 bases between via MID_PADDING parameter
Direct Array Access: ChromosomeArray.array[] for O(1) random position access
Scaffold Overlap Checking: Data.scaffoldOverlapLength() enforces MIN_SCAFFOLD_OVERLAP boundaries
Length-Weighted Selection: fillRandomChrom() creates array with scaffold representation proportional to length/8192

Mutation Application Pipeline

Sequential mutation application using separate methods with precise algorithmic controls:

Count Determination: while(randyMutationType.nextFloat()<rate){count++} up to maxType limits
Length Bias Implementation: Tools.min(rand.nextInt(), rand.nextInt(), rand.nextInt()) for biasing toward shorter mutations
Position Collision Avoidance: BitSet.get()/set() prevents overlapping SNPs and N-calls when USE_UNIQUE_SNPS=true
SNP Selection Logic: AminoAcid.baseToNumber[] conversion with (oldNum+rand.nextInt(3)+1)&3 to avoid identity
Substitution Algorithm: addSUB() ensures first/last bases differ from reference, middle bases use any nucleotide

Quality Score Generation

Uses QualityTools.makeQualityArray() with specific distribution parameters:

Base Distribution: Weighted selection between minq-maxq using midq as target average
Tile Effect Simulation: qVariance parameter adds ±variance to simulate spatial quality variation
Error Introduction: addErrorsFromQuality() uses QualityTools.PROB_ERROR[] lookup table for base substitutions
PacBio Error Model: addPacBioErrors() with 40% insertion, 35% deletion, 25% substitution probability distribution

Insert Size Distribution Models

Configurable distribution algorithms for paired-end insert sizes:

BELL_DIST (default): nextGaussian()*mateMiddleDev+middle0 with rejection sampling
SUPERFLAT_DIST: Deterministic (r0.numericID%midRange)+minMiddle for reproducible uniform distribution
FLAT_DIST: randyMate.nextInt(midRange)+minMiddle for random uniform distribution
Triangular: (rand.nextInt(range)+rand.nextInt(range))/2+minMiddle for triangle distribution
EXP_DIST: Tools.exponential(randy, EXP_LAMDA) with rejection sampling d<=64

Read Length Distribution Algorithm

Multi-tier selection system in genReadLen() using probability thresholds:

Uniform (1% probability): minLen+rand.nextInt(range) for flat distribution
Biased Short (4% probability): Tools.min(rand.nextInt(range), rand.nextInt(range)) favoring shorter reads
Wide Gaussian (15% probability): nextGaussian()*readLengthDev+midLen for broader distribution
Medium Gaussian (20% probability): nextGaussian()*(readLengthDev/2)+midLen for moderate spread
Narrow Gaussian (60% probability): nextGaussian()*(readLengthDev/8)+midLen for tight clustering around midLen

Adapter Contamination Algorithm

Fragment adapter addition using addFragAdapter() when insert < read length:

Insert Size Calculation: True insert from mate pair positioning coordinates
Adapter Selection: fragadapter1/fragadapter2 arrays with Tools.toAdapters() parsing
Sequence Replacement: Direct base array modification at calculated adapter start position
Random Fill: AminoAcid.numberToBase[rand.nextInt(4)] for post-adapter positions

Specialized Generation Modes

Advanced simulation capabilities with specific algorithmic implementations:

MDA Amplification: Spatial read clustering using randyAmp for simulating whole genome amplification bias patterns
Metagenome Simulation: makeMetagenomeProbs() creates exponential abundance distribution via Tools.exponential()
Chromosome Selection: randomChrom[] array with length-proportional weighting using Data.chromLengths/8192 quantization
Custom Naming Schemes: Multiple read ID formats supporting BBMap parsecustom flag and ROC curve generation

Technical Notes

Memory Requirements

Memory usage scales with genome size and mutation complexity:

Genome Loading: ~1 byte per base for reference sequence
Mutation Tracking: BitSet arrays for preventing overlapping mutations
Location Arrays: Position tracking scales with maximum deletion size
Recommended: 2-4GB for typical bacterial genomes, 8-16GB for mammalian genomes

Performance Characteristics

Read Generation Rate: ~100,000-1,000,000 reads per minute depending on complexity
Mutation Impact: Higher mutation rates and longer mutations reduce generation speed
Paired-End Overhead: Minimal additional cost for mate pair generation
I/O Limitations: Output compression can become bottleneck for very high throughput

Output Compatibility

Generated reads are compatible with standard bioinformatics tools:

FASTQ Format: Standard format with optional quality compression
Read Names: Configurable naming schemes for different analysis pipelines
Paired-End Support: Interleaved or separate file output options
Quality Encoding: Standard Phred+33 quality encoding

Support

For questions and support:

Email: bbushnell@lbl.gov
Documentation: bbmap.org