RandomReads
Generates random synthetic reads from a reference genome with precise customization of insert size, mutations, quality scores, and read naming. The generated read names encode genomic origin information for automated accuracy calculations with BBMap, BBMerge, and ROC curve generation tools.
Basic Usage
randomreads.sh ref=<file> out=<file> length=<number> reads=<number>
Generate synthetic reads from a reference genome for testing, benchmarking, and algorithm validation.
Parameters
Parameters are organized by function: basic configuration, read length control, paired-end settings, synthetic mutations, quality score modeling, and specialized applications.
Basic Parameters
- out=null
- Output file for synthetic reads. If reads are paired and a single file name is given, output will be interleaved. For paired reads in twin files, set out1= and out2=.
- ref=null
- Reference genome file in FASTA format. Not needed if the reference is already indexed with the specified build ID.
- build=1
- Build identifier for genome indexing. If multiple references are indexed in the same directory, each needs a unique build ID.
- midpad=300
- Number of padding bases between scaffolds in the packed genome index structure.
- reads=0
- Total number of reads (or read pairs if paired=true) to generate.
- coverage=-1
- If positive, automatically calculate the number of reads needed to achieve this coverage depth based on genome size and read length.
- overwrite=t
- Allow overwriting of existing output files. Set to false to prevent accidental overwriting.
- replacenoref=f
- Replace N bases in the reference sequence with random nucleotides (A, C, G, T) before read generation.
- simplenames=f
- Generate simplified read names that clearly indicate genomic origin without BBMap internal coordinate encoding.
- illuminanames=f
- Use Illumina-style naming convention with matching names for paired reads, rather than location-based naming.
- renamebyinsert=f
- Include insert size information directly in the read name for paired-end reads.
- addpairnum=f
- Append ' 1:' and ' 2:' to the end of paired read names to indicate read number.
- addslash=f
- Append '/1' and '/2' to the end of paired read names in traditional FASTQ format.
- spaceslash=f
- Add a space before the slash in paired read names (e.g., ' /1' instead of '/1').
- prefix=null
- Use this string as a prefix for all generated read names instead of genomic coordinate-based naming.
- seed=0
- Random number generator seed for reproducible output. Use -1 for a random seed based on system time.
Length Parameters
- minlength=150
- Minimum read length in base pairs.
- maxlength=150
- Maximum read length in base pairs.
- gaussianlength=f
- Use Gaussian (bell-shaped) length distribution suitable for PacBio long reads. Otherwise uses linear distribution.
- midlength=-1
- Peak position for Gaussian length distribution. Must be between minlength and maxlength when gaussianlength=true.
- readlengthsd=-1
- Standard deviation for Gaussian read length distribution. Default is set to 1/4 of the length range (maxlength-minlength)/4.
Pairing Parameters
- paired=f
- Generate paired-end reads with insert size control and mate positioning.
- mininsert=
- Minimum insert size for paired reads. Default value depends on read length settings.
- maxinsert=
- Maximum insert size for paired reads. Default value depends on read length settings.
- triangle=f
- Use triangular insert size distribution with peak at the middle of the range.
- flat=f
- Use approximately flat insert size distribution across the specified range.
- superflat=f
- Use perfectly uniform flat insert size distribution.
- gaussian=t
- Use Gaussian (bell-shaped) insert size distribution with standard deviation of (maxinsert-mininsert)/6.
- samestrand=f
- Generate paired reads on the same DNA strand instead of opposite strands.
Mutation Parameters
- snprate=0
- Probability (0-1) of adding SNPs to each read. With rate X, each read has X chance of ≥1 SNP, X² chance of ≥2 SNPs, etc.
- insrate=0
- Probability (0-1) of adding insertions to each read using the same probabilistic model as SNPs.
- delrate=0
- Probability (0-1) of adding deletions to each read using the same probabilistic model as SNPs.
- subrate=0
- Probability (0-1) of adding contiguous substitutions to each read using the same probabilistic model.
- nrate=0
- Probability (0-1) of adding N-call blocks to each read using the same probabilistic model.
- maxsnps=3
- Maximum number of SNPs that can be added to a single read.
- maxinss=2
- Maximum number of insertions that can be added to a single read.
- maxdels=2
- Maximum number of deletions that can be added to a single read.
- maxsubs=2
- Maximum number of contiguous substitution blocks that can be added to a single read.
- maxns=0
- Maximum number of N-call blocks that can be added to a single read.
- maxinslen=12
- Maximum length in bases of insertion mutations.
- maxdellen=400
- Maximum length in bases of deletion mutations.
- maxsublen=12
- Maximum length in bases of contiguous substitution blocks.
- maxnlen=1
- Maximum length in bases of N-call blocks.
- mininslen=1
- Minimum length in bases of insertion mutations.
- mindellen=1
- Minimum length in bases of deletion mutations.
- minsublen=2
- Minimum length in bases of contiguous substitution blocks.
- minnlen=1
- Minimum length in bases of N-call blocks.
Illumina Quality Parameters
- maxq=36
- Maximum quality score for synthetic reads in Phred scale.
- midq=28
- Target average quality score for the quality distribution.
- minq=20
- Minimum quality score for synthetic reads in Phred scale.
- q=
- Set maxq, midq, and minq to the same value for uniform quality scores.
- adderrors=t
- Add base substitution errors based on quality scores after applying synthetic mutations.
- qv=4
- Quality variance to simulate tile effects and position-dependent quality variation within reads.
PacBio Quality Parameters
- pacbio=f
- Use PacBio error model with higher error rates instead of Illumina model. Add PacBio-style errors after mutations.
- pbmin=0.13
- Minimum error rate per read for PacBio error model (13%).
- pbmax=0.17
- Maximum error rate per read for PacBio error model (17%).
Other Parameters
- overlap=1
- Minimum number of bases that reads must overlap scaffold ends to be considered valid.
- banns=f
- Prohibit generation of reads that span reference N bases.
- metagenome=f
- Simulate metagenomic data by assigning scaffolds random exponential coverage levels to mimic natural abundance distributions.
- randomscaffold=f
- Choose scaffolds randomly without weighting by length (uniform scaffold selection).
- amp=1
- Simulate highly-amplified single-cell MDA data by clustering reads spatially. Set to higher values like 1000 for strong amplification bias.
- pbadapter=
- Add PacBio adapter sequences to reads using this literal string. Adapters are added in reverse complement orientation.
- fragadapter=
- Add Illumina adapter sequences to paired reads when insert size is shorter than read length. Default is standard Illumina adapter sequence.
- fragadapter2=
- Use this adapter sequence for read 2 in paired-end runs with short inserts. Default uses standard Illumina read 2 adapter.
Java Parameters
- -Xmx
- Set Java heap memory limit. Format: -Xmx20g (20GB) or -Xmx800m (800MB). Maximum is typically 85% of physical memory.
- -eoom
- Exit when out-of-memory exception occurs instead of trying to continue. Requires Java 8u92+.
- -da
- Disable Java assertions for slightly improved performance.
Examples
Basic Single-End Reads
randomreads.sh ref=genome.fa out=reads.fq reads=100000 length=150
Generate 100,000 single-end reads of 150bp from genome.fa
Paired-End with Insert Size
randomreads.sh ref=genome.fa out1=reads_1.fq out2=reads_2.fq reads=50000 length=150 paired=t mininsert=300 maxinsert=500
Generate 50,000 paired-end reads with insert sizes from 300-500bp
With Mutations and Quality Variation
randomreads.sh ref=genome.fa out=mutated_reads.fq reads=10000 length=100 snprate=0.01 maxsnps=2 minq=20 maxq=40
Generate reads with 1% SNP rate (max 2 per read) and quality scores from 20-40
Coverage-Based Generation
randomreads.sh ref=genome.fa out=coverage_reads.fq coverage=30 length=150 paired=t
Generate paired-end reads to achieve 30x coverage of the reference genome
PacBio Long Reads
randomreads.sh ref=genome.fa out=pacbio.fq reads=5000 minlength=1000 maxlength=10000 gaussianlength=t pacbio=t
Generate PacBio-style long reads with Gaussian length distribution and PacBio error model
Metagenome Simulation
randomreads.sh ref=metagenome.fa out=meta_reads.fq reads=100000 length=150 paired=t metagenome=t
Simulate metagenomic reads with natural abundance distribution across scaffolds
Algorithm Details
Random Number Generation Architecture
RandomReads3 implements independent Random objects seeded from master seed+offset to ensure reproducible parallel generation:
- randy (seed+1), randy2 (seed+2): Primary and secondary position selection
- randyMutationType (seed+3): Mutation type and count determination using while(random.nextFloat()<rate) loops
- randyQual (seed+5): Quality score generation via QualityTools.makeQualityArray()
- randyMate (seed+6), randy2Mate (seed+7): Paired-end mate positioning algorithms
- randyLength (seed+21): Read length distribution using multi-tier Gaussian selection
- randyAmp (seed+22): MDA amplification bias clustering simulation
Genome Index Implementation
Uses FastaToChromArrays2.main2() for packed chromosome storage with ChromosomeArray.getBytes() extraction:
- Concatenated Storage: Scaffolds stored with midpad=300 bases between via MID_PADDING parameter
- Direct Array Access: ChromosomeArray.array[] for O(1) random position access
- Scaffold Overlap Checking: Data.scaffoldOverlapLength() enforces MIN_SCAFFOLD_OVERLAP boundaries
- Length-Weighted Selection: fillRandomChrom() creates array with scaffold representation proportional to length/8192
Mutation Application Pipeline
Sequential mutation application using separate methods with precise algorithmic controls:
- Count Determination: while(randyMutationType.nextFloat()<rate){count++} up to maxType limits
- Length Bias Implementation: Tools.min(rand.nextInt(), rand.nextInt(), rand.nextInt()) for biasing toward shorter mutations
- Position Collision Avoidance: BitSet.get()/set() prevents overlapping SNPs and N-calls when USE_UNIQUE_SNPS=true
- SNP Selection Logic: AminoAcid.baseToNumber[] conversion with (oldNum+rand.nextInt(3)+1)&3 to avoid identity
- Substitution Algorithm: addSUB() ensures first/last bases differ from reference, middle bases use any nucleotide
Quality Score Generation
Uses QualityTools.makeQualityArray() with specific distribution parameters:
- Base Distribution: Weighted selection between minq-maxq using midq as target average
- Tile Effect Simulation: qVariance parameter adds ±variance to simulate spatial quality variation
- Error Introduction: addErrorsFromQuality() uses QualityTools.PROB_ERROR[] lookup table for base substitutions
- PacBio Error Model: addPacBioErrors() with 40% insertion, 35% deletion, 25% substitution probability distribution
Insert Size Distribution Models
Configurable distribution algorithms for paired-end insert sizes:
- BELL_DIST (default): nextGaussian()*mateMiddleDev+middle0 with rejection sampling
- SUPERFLAT_DIST: Deterministic (r0.numericID%midRange)+minMiddle for reproducible uniform distribution
- FLAT_DIST: randyMate.nextInt(midRange)+minMiddle for random uniform distribution
- Triangular: (rand.nextInt(range)+rand.nextInt(range))/2+minMiddle for triangle distribution
- EXP_DIST: Tools.exponential(randy, EXP_LAMDA) with rejection sampling d<=64
Read Length Distribution Algorithm
Multi-tier selection system in genReadLen() using probability thresholds:
- Uniform (1% probability): minLen+rand.nextInt(range) for flat distribution
- Biased Short (4% probability): Tools.min(rand.nextInt(range), rand.nextInt(range)) favoring shorter reads
- Wide Gaussian (15% probability): nextGaussian()*readLengthDev+midLen for broader distribution
- Medium Gaussian (20% probability): nextGaussian()*(readLengthDev/2)+midLen for moderate spread
- Narrow Gaussian (60% probability): nextGaussian()*(readLengthDev/8)+midLen for tight clustering around midLen
Adapter Contamination Algorithm
Fragment adapter addition using addFragAdapter() when insert < read length:
- Insert Size Calculation: True insert from mate pair positioning coordinates
- Adapter Selection: fragadapter1/fragadapter2 arrays with Tools.toAdapters() parsing
- Sequence Replacement: Direct base array modification at calculated adapter start position
- Random Fill: AminoAcid.numberToBase[rand.nextInt(4)] for post-adapter positions
Specialized Generation Modes
Advanced simulation capabilities with specific algorithmic implementations:
- MDA Amplification: Spatial read clustering using randyAmp for simulating whole genome amplification bias patterns
- Metagenome Simulation: makeMetagenomeProbs() creates exponential abundance distribution via Tools.exponential()
- Chromosome Selection: randomChrom[] array with length-proportional weighting using Data.chromLengths/8192 quantization
- Custom Naming Schemes: Multiple read ID formats supporting BBMap parsecustom flag and ROC curve generation
Technical Notes
Memory Requirements
Memory usage scales with genome size and mutation complexity:
- Genome Loading: ~1 byte per base for reference sequence
- Mutation Tracking: BitSet arrays for preventing overlapping mutations
- Location Arrays: Position tracking scales with maximum deletion size
- Recommended: 2-4GB for typical bacterial genomes, 8-16GB for mammalian genomes
Performance Characteristics
- Read Generation Rate: ~100,000-1,000,000 reads per minute depending on complexity
- Mutation Impact: Higher mutation rates and longer mutations reduce generation speed
- Paired-End Overhead: Minimal additional cost for mate pair generation
- I/O Limitations: Output compression can become bottleneck for very high throughput
Output Compatibility
Generated reads are compatible with standard bioinformatics tools:
- FASTQ Format: Standard format with optional quality compression
- Read Names: Configurable naming schemes for different analysis pipelines
- Paired-End Support: Interleaved or separate file output options
- Quality Encoding: Standard Phred+33 quality encoding
Support
For questions and support:
- Email: bbushnell@lbl.gov
- Documentation: bbmap.org