SynthMDA
Generates synthetic reads by simulating Multiple Displacement Amplification (MDA) through iterative stochastic fragment selection and amplification, producing characteristic uneven coverage distributions found in single-cell genomics.
Basic Usage
synthmda.sh in=<reference> out=<reads out file>
Input may be fasta or fastq, compressed or uncompressed.
Parameters
Parameters control the MDA amplification simulation and read generation process.
Parameters
- reads=12000000
- Generate this many reads.
- paired=t
- Generate paired reads.
- length=150
- Reads should be this long.
- minlen=4000
- Min length of MDA contig.
- maxlen=150000
- Max length of MDA contig.
- cycles=9
- Number of MDA cycles; higher is more spiky. Each cycle calls amplify() method with the specified ratio parameter.
- initialratio=1.3
- Fraction of genome initially replicated in first amplification round; lower values create more spiky coverage patterns.
- ratio=1.7
- Fraction of genome replicated per cycle after initial round. Default differs from shell (1.7) vs Java code (2.0).
- refout=null
- Write MDA'd genome to this file. If null, generates temporary filename using ReadWrite.stripToCore() + random hex suffix.
- perfect=0
- Fraction of reads that will be error-free. Passed as 'perfect' parameter to RandomReads3.main().
- amp=200
- 'amp' flag sent to RandomReads3 for coverage distribution skewing (higher values create more uneven coverage).
- build=7
- Build index location for MDA'd genome. Parsed by Parser.build and passed to RandomReads3.
- prefix=null
- Generated reads will start with this prefix.
- overwrite=t
- (ow) Set to false to force the program to abort rather than overwrite an existing file.
Java Parameters
- -Xmx
- This will set Java's memory usage, overriding autodetection. -Xmx20g will specify 20 gigs of RAM, and -Xmx200m will specify 200 megs. The max is typically 85% of physical memory.
- -eoom
- This flag will cause the process to exit if an out-of-memory exception occurs. Requires Java 8u92+.
- -da
- Disable assertions.
Examples
Basic MDA Simulation
synthmda.sh in=genome.fa out=mda_reads.fq reads=10000000
Generate 10 million synthetic reads from a reference genome with default MDA parameters.
High Coverage Spiky Distribution
synthmda.sh in=reference.fa out=reads.fq cycles=12 initialratio=1.1 amp=400
Create a more uneven coverage distribution with higher cycles and amp values.
Single-End Reads with Custom Length
synthmda.sh in=genome.fa out=reads.fq paired=f length=250 reads=5000000
Generate 5 million single-end reads of 250bp length.
Algorithm Details
SynthMDA simulates Multiple Displacement Amplification (MDA) through the amplify() method using ByteBuilder data structures and Shared.threadLocalRandom() for stochastic fragment generation.
Reference Processing
- ConcurrentReadInputStream loads the reference genome through FileFormat.testInput()
- Sequences are concatenated into a single ByteBuilder with '$' delimiters as sequence boundaries
- The concatenated genome becomes the template for iterative amplification cycles
Amplification Algorithm
The amplify() method implements the core MDA simulation:
- Initial amplification: amplify(bb, false, minlen, maxlen, initialRatio) creates the first amplification round
- Cyclic amplification: For each of the specified cycles, amplify(dest, i<1, minlen, maxlen, ratio) is called
- Fragment generation: randy.nextInt(slen) selects random start positions in the source ByteBuilder
- Length determination: minlen+randy.nextInt(range) where range=maxlen-minlen+1
- Orientation selection: Tools.nextBoolean(randy) determines forward vs reverse-complement strand
- Boundary detection: Fragment extension stops when encountering '$' delimiters
- Reverse complement: AminoAcid.baseToComplementExtended[b] performs base complementation for reverse fragments
- Size control: Maximum amplified genome size limited to 1.5GB (1500000000 bytes)
Read Generation
After amplification cycles complete, synthetic reads are generated via RandomReads3.main():
- Error model parameters: Fixed error rates configured in the ArrayList<String> parameter list
- SNP rate: 0.02 (2%) via snprate parameter
- Deletion rate: 0.005 (0.5%) via delrate parameter
- Insertion rate: 0.005 (0.5%) via insrate parameter
- N rate: 0.005 (0.5%) via nrate parameter
- Quality scores: minq=16, midq=25, maxq=38 with Phred scaling
- Error clustering: maxinss=2, maxdels=2, maxns=2, maxsnps=2 limits consecutive errors
Memory Management
- ByteBuilder retention: retain=false for genomes ≥500MB to prevent memory overflow
- Fragment length checking: Fragments shorter than minlen2 (minimum 4000bp) are rejected
- Goal calculation: Target amplification size = min(slen*ratio, 600000000) to cap memory usage
- Temporary file cleanup: Amplified reference deleted automatically when deleteRef=true
The resulting coverage distribution exhibits the characteristic uneven, spiky pattern of MDA-amplified single-cell genomes through the stochastic fragment selection and variable amplification ratios per cycle.
Support
For questions and support:
- Email: bbushnell@lbl.gov
- Documentation: bbmap.org