IceCreamMaker
Generates synthetic PacBio reads to mimic the chimeric inverted repeats from 'triangle reads', aka 'ice cream cones' - reads missing one adapter.
Basic Usage
icecreammaker.sh in=<file> out=<file> reads=100k minlen=500 maxlen=5000
This tool simulates PacBio sequencing artifacts where adapter sequences are missing or hidden, creating chimeric reads with inverted repeats. The "ice cream cone" name refers to the triangular shape these reads form when plotted, resembling ice cream cones.
Parameters
Parameters control various aspects of synthetic read generation, including sequence properties, error simulation, and ice cream cone artifact simulation. Length parameters distinguish between subread lengths (genomic sequence) and movie lengths (full sequencing output with concatenated subreads).
Standard parameters
- in=<file>
- A reference genome fasta (optional). If not provided, a synthetic genome will be generated using the specified GC content and genome size.
- out=<file>
- Synthetic read output file. Reads are generated in FASTQ format with simulated quality scores.
- idhist=<file>
- Identity histogram output file. Writes a histogram showing the distribution of sequence identities (accuracy levels) across generated reads.
- overwrite=f
- (ow) Set to false to force the program to abort rather than overwrite an existing file. Default behavior prevents accidental data loss.
- ziplevel=2
- (zl) Set compression level from 1 (lowest/fastest) through 9 (maximum/slowest). Lower compression reduces CPU time at the cost of larger file sizes.
Length parameters
- minlen=500
- (minlength) Minimum length of genomic sequence molecules in base pairs. This controls the size of DNA fragments being sequenced, not the final read length.
- maxlen=5000
- (maxlength) Maximum length of genomic sequence molecules in base pairs. Fragment lengths are randomly selected between minlen and maxlen using a biased distribution favoring shorter fragments.
- len=
- (length) Set minlen and maxlen to the same number for uniform fragment lengths. Useful for controlled simulations where fragment size variation should be eliminated.
- minmovie=500
- (minmov) Minimum length of sequencing movies in base pairs. Movies represent the total sequencing output including multiple passes of the same fragment plus adapter sequences.
- maxmovie=40k
- (maxmov) Maximum length of sequencing movies. Longer movies allow more passes through the same fragment, potentially improving consensus accuracy but taking more sequencing time.
- movie=
- (mov) Set minmov and maxmov to the same number for uniform movie lengths. Creates consistent sequencing time per ZMW.
Ice cream parameters
- missingrate=0
- (missing) Fraction of reads missing an adapter sequence (0.0-1.0). Missing adapters cause the polymerase to continue through hairpin structures, creating chimeric reads with inverted repeats. This simulates the primary ice cream cone artifact.
- hiddenrate=0
- (hidden) Fraction of adapter sequences that are present but not detected by base calling or downstream processing (0.0-1.0). Hidden adapters can cause similar artifacts to missing adapters in downstream analysis.
- bothends=f
- Allow missing or hidden adapters on both ends of the same fragment. Currently not fully implemented (marked as TODO in source code). When true, would create more complex chimeric structures.
Other parameters
- zmws
- (reads) Number of Zero Mode Wavelength holes (ZMWs) to generate. Each ZMW produces multiple subreads from the same DNA fragment. This parameter controls the total number of sequencing reactions, not the total number of output reads.
- ccs=f
- Make Circular Consensus Sequence (CCS) reads producing one consensus read per ZMW from full pass reads only. When enabled, multiple subreads are combined into a single high-accuracy consensus sequence. Error rate parameters still apply to simulate imperfect consensus.
- gc=0.6
- GC content fraction (0.0-1.0) for synthetic genome generation. Only used when no input reference is provided. This affects base composition and can influence error patterns and sequencing behavior.
- genomesize=10m
- Size of synthetic genome to generate when no input reference is provided. Accepts standard suffixes (k, m, g). Must be large enough to accommodate the specified fragment lengths and inverted repeat requirements.
- irrate=0.0
- Add inverted repeats until this fraction of the genome consists of inverted repeats. Creates palindromic sequences that form hairpin structures, mimicking natural genomic features that can cause sequencing artifacts.
- irminlen=500
- Minimum length of inverted repeats to add to the reference genome. Shorter repeats may not form stable secondary structures under sequencing conditions.
- irmaxlen=5000
- Maximum length of inverted repeats. Longer repeats create stronger hairpin structures but are less common in natural genomes.
- irlen=
- Set irminlen and irmaxlen to the same number for uniform inverted repeat lengths. Creates consistent secondary structure strength across all added repeats.
- miner=0.05
- (minerrorrate) Minimum error rate (0.0-1.0) for base calling simulation. Represents the best-case accuracy achievable under optimal conditions. PacBio error rates are typically 5-30%.
- maxer=0.28
- (maxerrorrate) Maximum error rate (0.0-1.0) for base calling simulation. Represents worst-case accuracy under challenging conditions such as homopolymer regions or secondary structures.
- er=
- (errorrate) Set minerrorrate and maxerrorrate to the same value for uniform error rates across all reads. Useful for controlled simulations where error variation should be minimized.
Java Parameters
- -Xmx
- Set Java heap memory usage, overriding automatic detection. Format: -Xmx20g (20 gigabytes) or -Xmx200m (200 megabytes). Maximum should typically be 85% of available physical memory. Large genomes and high read counts require more memory.
- -eoom
- Exit on out-of-memory exceptions instead of attempting recovery. Requires Java 8u92 or later. Recommended for production pipelines to avoid corrupted output from memory-constrained runs.
- -da
- Disable Java assertions to improve runtime performance. Assertions provide additional error checking during development but add computational overhead in production use.
Examples
Basic Synthetic Read Generation
icecreammaker.sh out=synthetic.fq zmws=10000 minlen=1000 maxlen=8000 gc=0.4
Generate 10,000 ZMWs from a synthetic AT-rich genome (40% GC) with fragments ranging from 1-8kb. No ice cream artifacts are introduced (default missingrate=0).
Ice Cream Cone Simulation
icecreammaker.sh in=reference.fa out=icecream.fq zmws=5000 missingrate=0.15 hiddenrate=0.05 minlen=2000 maxlen=6000
Simulate ice cream cone artifacts with 15% missing adapters and 5% hidden adapters using a provided reference genome. Fragment sizes range from 2-6kb.
CCS Read Generation
icecreammaker.sh out=ccs_reads.fq zmws=1000 ccs=t minmovie=15000 maxmovie=25000 miner=0.02 maxer=0.08
Generate high-accuracy CCS reads with longer movies (15-25kb) and lower error rates (2-8%) suitable for consensus sequence generation.
Genome with Inverted Repeats
icecreammaker.sh out=repeat_reads.fq genomesize=1m irrate=0.1 irminlen=200 irmaxlen=2000 zmws=20000
Create a 1Mb synthetic genome where 10% consists of inverted repeats (200bp-2kb), then generate 20,000 ZMWs. Useful for testing repeat resolution algorithms.
Error Rate Analysis
icecreammaker.sh out=error_test.fq idhist=identity_dist.txt zmws=5000 miner=0.05 maxer=0.30 len=3000
Generate reads with variable error rates (5-30%) and uniform fragment length (3kb) while outputting identity histogram for accuracy analysis.
Algorithm Details
Read Generation Strategy
IceCreamMaker employs multi-threaded processing via ProcessThread workers that share an AtomicLong ZMW counter for thread-safe ID allocation. Each thread uses thread-local Random instances seeded from (seed + (tid+1)*tid*1000L) for reproducible parallel generation.
Reference Preparation
When no input reference is provided, the tool generates a synthetic genome using the specified GC content. Random bases are selected with weighted probability matching the target GC ratio. If inverted repeats are requested (irrate > 0), palindromic sequences are added by:
- Randomly selecting positions that can accommodate the repeat length
- Ensuring no N bases interfere with palindrome formation
- Creating perfect reverse complement sequences using AminoAcid.baseToComplementExtended[] at the target location
- Continuing until the desired fraction of inverted repeat content is achieved
Fragment Selection and Movie Simulation
For each ZMW (Zero Mode Wavelength hole), the algorithm:
- Selects fragment length using randomLength() with double random sampling (Tools.min(randy.nextInt(range), randy.nextInt(range))) to bias toward shorter fragments
- Extracts fragments via Arrays.copyOfRange() from random genome positions, with 50% probability reverse complement using AminoAcid.reverseComplementBasesInPlace()
- Determines movie length using randomRate() with weighted distribution (0.5f*Tools.min(b,c)) where b=(nextFloat()+nextFloat()) and c=(1.6f*nextFloat()+0.4f*nextFloat())
- Simulates multiple passes through fragments via baseCallAllPasses(), alternating strand orientation with AminoAcid.reverseComplementBasesInPlace() after each pass
- Models 44bp PacBio adapter sequence (ATCTCTCTCAACAACAACAACGGAGGAGGAGGAAAAGAGAGAT) between passes with baseCallAdapter()
Error Model Implementation
Base calling errors are simulated using a three-component model reflecting real PacBio error characteristics with thresholds calculated from configured fractions:
- Insertion errors (40% of total errors): Random bases inserted using AminoAcid.numberToBase[randy.nextInt(4)] independent of template sequence
- Deletion errors (35% of total errors): Template bases skipped during base calling with no output generated
- Substitution errors (25% of total errors): Incorrect base called using (baseToNumber[b]+randy.nextInt(3)+1) & 3 to avoid the correct base
Error rates vary per ZMW using randomRate() between the specified minimum and maximum with biased distribution favoring moderate error rates over extremes.
Ice Cream Cone Artifact Simulation
The characteristic "ice cream cone" artifacts arise from missing or hidden adapter sequences:
Missing Adapters (missingrate parameter):
- Randomly selects reads where adapters fail to ligate properly using randy.nextFloat() < missingRate
- Causes polymerase to read through hairpin structures without pausing
- Creates chimeric reads containing both forward and reverse complement sequence via ReadBuilder.add()
- Alternates between consecutive subreads using (i&1)==missingMod to simulate adapter-less connections
Hidden Adapters (hiddenrate parameter):
- Simulates adapters present in the molecule but not detected during processing
- Inserts error-prone adapter sequences between subreads using baseCallAdapter(errorRate)
- Creates reads where adapter calls are missed, leading to apparent chimeric structure
- Models downstream processing failures rather than library preparation issues
CCS Mode Processing
When CCS (Circular Consensus Sequence) mode is enabled:
- Multiple subreads from each ZMW are generated as usual via baseCallAllPasses()
- A median-length subread is selected as the consensus representative using IntList.sort() on lengths
- Only full-pass reads (those spanning the entire fragment) are eligible for consensus
- Error rates still apply to simulate imperfect consensus calling
- One final read is output per ZMW instead of multiple subreads
Quality Score Simulation
Quality scores are assigned using Shared.FAKE_QUAL=8 (Phred score 8) for all positions, representing the typical range of PacBio quality scores. The implementation uses a fixed quality value rather than position-specific or context-dependent scoring.
Performance Optimizations
The implementation includes several performance optimizations for large-scale simulation:
- Multi-threading: Parallel processing across available CPU cores with AtomicLong ZMW ID allocation
- Memory management: ByteBuilder for dynamic sequence construction and Arrays.copyOfRange() for fragment extraction
- Streaming I/O: ConcurrentReadOutputStream with configurable buffer sizes (buff=4) for concurrent read/write operations
- Batch processing: Reads generated in batches of Shared.bufferLen() to balance memory usage and throughput
Statistical Tracking
The tool maintains detailed statistics during generation via thread-local arrays:
- Identity histogram: Distribution of sequence accuracy levels across all generated reads using idHistT[ID_BINS] arrays
- Read counts: Total ZMWs processed and reads generated per thread
- Base counts: Total nucleotides generated for throughput estimation
- Error tracking: Per-thread error accumulation using Tools.add() for histogram merging
Technical Notes
Memory Requirements
Memory usage scales with genome size and read count. The reference genome is loaded entirely into memory (limited to MAX_GENOME_LENGTH=2GB), and each thread maintains buffers for read generation. For large simulations, consider:
- Increase heap size appropriately (-Xmx parameter)
- Monitor memory usage during long runs
- Use streaming processing for very large datasets
Random Number Generation
The tool uses thread-local random number generators seeded from the specified seed value (or system time if not provided). This ensures:
- Reproducible results when using the same seed
- Thread safety in multi-core environments
- Statistical independence between different simulation runs
Output Format
Generated reads use FASTQ format with descriptive headers containing:
- ZMW identifier for tracking read origins
- Subread numbers for multi-pass reads
- Pass counts for quality assessment
- Movie position information for temporal analysis
Validation and Quality Control
The implementation includes extensive parameter validation via validateParams() to prevent invalid configurations:
- Length parameter consistency (min ≤ max for all length ranges)
- Rate parameter bounds (0.0 ≤ rate ≤ 1.0 for all rates)
- Genome size compatibility with fragment and repeat requirements
- Memory sufficiency for the specified simulation scale
Support
For questions and support:
- Email: bbushnell@lbl.gov
- Documentation: bbmap.org