IceCreamMaker

Basic Usage

icecreammaker.sh in=<file> out=<file> reads=100k minlen=500 maxlen=5000

This tool simulates PacBio sequencing artifacts where adapter sequences are missing or hidden, creating chimeric reads with inverted repeats. The "ice cream cone" name refers to the triangular shape these reads form when plotted, resembling ice cream cones.

Parameters

Parameters control various aspects of synthetic read generation, including sequence properties, error simulation, and ice cream cone artifact simulation. Length parameters distinguish between subread lengths (genomic sequence) and movie lengths (full sequencing output with concatenated subreads).

Standard parameters

in=<file>: A reference genome fasta (optional). If not provided, a synthetic genome will be generated using the specified GC content and genome size.
out=<file>: Synthetic read output file. Reads are generated in FASTQ format with simulated quality scores.
idhist=<file>: Identity histogram output file. Writes a histogram showing the distribution of sequence identities (accuracy levels) across generated reads.
overwrite=f: (ow) Set to false to force the program to abort rather than overwrite an existing file. Default behavior prevents accidental data loss.
ziplevel=2: (zl) Set compression level from 1 (lowest/fastest) through 9 (maximum/slowest). Lower compression reduces CPU time at the cost of larger file sizes.

Length parameters

minlen=500: (minlength) Minimum length of genomic sequence molecules in base pairs. This controls the size of DNA fragments being sequenced, not the final read length.
maxlen=5000: (maxlength) Maximum length of genomic sequence molecules in base pairs. Fragment lengths are randomly selected between minlen and maxlen using a biased distribution favoring shorter fragments.
len=: (length) Set minlen and maxlen to the same number for uniform fragment lengths. Useful for controlled simulations where fragment size variation should be eliminated.
minmovie=500: (minmov) Minimum length of sequencing movies in base pairs. Movies represent the total sequencing output including multiple passes of the same fragment plus adapter sequences.
maxmovie=40k: (maxmov) Maximum length of sequencing movies. Longer movies allow more passes through the same fragment, potentially improving consensus accuracy but taking more sequencing time.
movie=: (mov) Set minmov and maxmov to the same number for uniform movie lengths. Creates consistent sequencing time per ZMW.

Ice cream parameters

missingrate=0: (missing) Fraction of reads missing an adapter sequence (0.0-1.0). Missing adapters cause the polymerase to continue through hairpin structures, creating chimeric reads with inverted repeats. This simulates the primary ice cream cone artifact.
hiddenrate=0: (hidden) Fraction of adapter sequences that are present but not detected by base calling or downstream processing (0.0-1.0). Hidden adapters can cause similar artifacts to missing adapters in downstream analysis.
bothends=f: Allow missing or hidden adapters on both ends of the same fragment. Currently not fully implemented (marked as TODO in source code). When true, would create more complex chimeric structures.

Other parameters

zmws: (reads) Number of Zero Mode Wavelength holes (ZMWs) to generate. Each ZMW produces multiple subreads from the same DNA fragment. This parameter controls the total number of sequencing reactions, not the total number of output reads.
ccs=f: Make Circular Consensus Sequence (CCS) reads producing one consensus read per ZMW from full pass reads only. When enabled, multiple subreads are combined into a single high-accuracy consensus sequence. Error rate parameters still apply to simulate imperfect consensus.
gc=0.6: GC content fraction (0.0-1.0) for synthetic genome generation. Only used when no input reference is provided. This affects base composition and can influence error patterns and sequencing behavior.
genomesize=10m: Size of synthetic genome to generate when no input reference is provided. Accepts standard suffixes (k, m, g). Must be large enough to accommodate the specified fragment lengths and inverted repeat requirements.
irrate=0.0: Add inverted repeats until this fraction of the genome consists of inverted repeats. Creates palindromic sequences that form hairpin structures, mimicking natural genomic features that can cause sequencing artifacts.
irminlen=500: Minimum length of inverted repeats to add to the reference genome. Shorter repeats may not form stable secondary structures under sequencing conditions.
irmaxlen=5000: Maximum length of inverted repeats. Longer repeats create stronger hairpin structures but are less common in natural genomes.
irlen=: Set irminlen and irmaxlen to the same number for uniform inverted repeat lengths. Creates consistent secondary structure strength across all added repeats.
miner=0.05: (minerrorrate) Minimum error rate (0.0-1.0) for base calling simulation. Represents the best-case accuracy achievable under optimal conditions. PacBio error rates are typically 5-30%.
maxer=0.28: (maxerrorrate) Maximum error rate (0.0-1.0) for base calling simulation. Represents worst-case accuracy under challenging conditions such as homopolymer regions or secondary structures.
er=: (errorrate) Set minerrorrate and maxerrorrate to the same value for uniform error rates across all reads. Useful for controlled simulations where error variation should be minimized.

Java Parameters

-Xmx: Set Java heap memory usage, overriding automatic detection. Format: -Xmx20g (20 gigabytes) or -Xmx200m (200 megabytes). Maximum should typically be 85% of available physical memory. Large genomes and high read counts require more memory.
-eoom: Exit on out-of-memory exceptions instead of attempting recovery. Requires Java 8u92 or later. Recommended for production pipelines to avoid corrupted output from memory-constrained runs.
-da: Disable Java assertions to improve runtime performance. Assertions provide additional error checking during development but add computational overhead in production use.

Examples

Basic Synthetic Read Generation

icecreammaker.sh out=synthetic.fq zmws=10000 minlen=1000 maxlen=8000 gc=0.4

Generate 10,000 ZMWs from a synthetic AT-rich genome (40% GC) with fragments ranging from 1-8kb. No ice cream artifacts are introduced (default missingrate=0).

Ice Cream Cone Simulation

icecreammaker.sh in=reference.fa out=icecream.fq zmws=5000 missingrate=0.15 hiddenrate=0.05 minlen=2000 maxlen=6000

Simulate ice cream cone artifacts with 15% missing adapters and 5% hidden adapters using a provided reference genome. Fragment sizes range from 2-6kb.

CCS Read Generation

icecreammaker.sh out=ccs_reads.fq zmws=1000 ccs=t minmovie=15000 maxmovie=25000 miner=0.02 maxer=0.08

Generate high-accuracy CCS reads with longer movies (15-25kb) and lower error rates (2-8%) suitable for consensus sequence generation.

Genome with Inverted Repeats

icecreammaker.sh out=repeat_reads.fq genomesize=1m irrate=0.1 irminlen=200 irmaxlen=2000 zmws=20000

Create a 1Mb synthetic genome where 10% consists of inverted repeats (200bp-2kb), then generate 20,000 ZMWs. Useful for testing repeat resolution algorithms.

Error Rate Analysis

icecreammaker.sh out=error_test.fq idhist=identity_dist.txt zmws=5000 miner=0.05 maxer=0.30 len=3000

Generate reads with variable error rates (5-30%) and uniform fragment length (3kb) while outputting identity histogram for accuracy analysis.

Algorithm Details

Read Generation Strategy

IceCreamMaker employs multi-threaded processing via ProcessThread workers that share an AtomicLong ZMW counter for thread-safe ID allocation. Each thread uses thread-local Random instances seeded from (seed + (tid+1)*tid*1000L) for reproducible parallel generation.

Reference Preparation

When no input reference is provided, the tool generates a synthetic genome using the specified GC content. Random bases are selected with weighted probability matching the target GC ratio. If inverted repeats are requested (irrate > 0), palindromic sequences are added by:

Randomly selecting positions that can accommodate the repeat length
Ensuring no N bases interfere with palindrome formation
Creating perfect reverse complement sequences using AminoAcid.baseToComplementExtended[] at the target location
Continuing until the desired fraction of inverted repeat content is achieved

Fragment Selection and Movie Simulation

For each ZMW (Zero Mode Wavelength hole), the algorithm:

Selects fragment length using randomLength() with double random sampling (Tools.min(randy.nextInt(range), randy.nextInt(range))) to bias toward shorter fragments
Extracts fragments via Arrays.copyOfRange() from random genome positions, with 50% probability reverse complement using AminoAcid.reverseComplementBasesInPlace()
Determines movie length using randomRate() with weighted distribution (0.5f*Tools.min(b,c)) where b=(nextFloat()+nextFloat()) and c=(1.6f*nextFloat()+0.4f*nextFloat())
Simulates multiple passes through fragments via baseCallAllPasses(), alternating strand orientation with AminoAcid.reverseComplementBasesInPlace() after each pass
Models 44bp PacBio adapter sequence (ATCTCTCTCAACAACAACAACGGAGGAGGAGGAAAAGAGAGAT) between passes with baseCallAdapter()

Error Model Implementation

Base calling errors are simulated using a three-component model reflecting real PacBio error characteristics with thresholds calculated from configured fractions:

Insertion errors (40% of total errors): Random bases inserted using AminoAcid.numberToBase[randy.nextInt(4)] independent of template sequence
Deletion errors (35% of total errors): Template bases skipped during base calling with no output generated
Substitution errors (25% of total errors): Incorrect base called using (baseToNumber[b]+randy.nextInt(3)+1) & 3 to avoid the correct base

Error rates vary per ZMW using randomRate() between the specified minimum and maximum with biased distribution favoring moderate error rates over extremes.

Ice Cream Cone Artifact Simulation

The characteristic "ice cream cone" artifacts arise from missing or hidden adapter sequences:

Missing Adapters (missingrate parameter):

Randomly selects reads where adapters fail to ligate properly using randy.nextFloat() < missingRate
Causes polymerase to read through hairpin structures without pausing
Creates chimeric reads containing both forward and reverse complement sequence via ReadBuilder.add()
Alternates between consecutive subreads using (i&1)==missingMod to simulate adapter-less connections

Hidden Adapters (hiddenrate parameter):

Simulates adapters present in the molecule but not detected during processing
Inserts error-prone adapter sequences between subreads using baseCallAdapter(errorRate)
Creates reads where adapter calls are missed, leading to apparent chimeric structure
Models downstream processing failures rather than library preparation issues

CCS Mode Processing

When CCS (Circular Consensus Sequence) mode is enabled:

Multiple subreads from each ZMW are generated as usual via baseCallAllPasses()
A median-length subread is selected as the consensus representative using IntList.sort() on lengths
Only full-pass reads (those spanning the entire fragment) are eligible for consensus
Error rates still apply to simulate imperfect consensus calling
One final read is output per ZMW instead of multiple subreads

Quality Score Simulation

Quality scores are assigned using Shared.FAKE_QUAL=8 (Phred score 8) for all positions, representing the typical range of PacBio quality scores. The implementation uses a fixed quality value rather than position-specific or context-dependent scoring.

Performance Optimizations

The implementation includes several performance optimizations for large-scale simulation:

Multi-threading: Parallel processing across available CPU cores with AtomicLong ZMW ID allocation
Memory management: ByteBuilder for dynamic sequence construction and Arrays.copyOfRange() for fragment extraction
Streaming I/O: ConcurrentReadOutputStream with configurable buffer sizes (buff=4) for concurrent read/write operations
Batch processing: Reads generated in batches of Shared.bufferLen() to balance memory usage and throughput

Statistical Tracking

The tool maintains detailed statistics during generation via thread-local arrays:

Identity histogram: Distribution of sequence accuracy levels across all generated reads using idHistT[ID_BINS] arrays
Read counts: Total ZMWs processed and reads generated per thread
Base counts: Total nucleotides generated for throughput estimation
Error tracking: Per-thread error accumulation using Tools.add() for histogram merging

Technical Notes

Memory Requirements

Memory usage scales with genome size and read count. The reference genome is loaded entirely into memory (limited to MAX_GENOME_LENGTH=2GB), and each thread maintains buffers for read generation. For large simulations, consider:

Increase heap size appropriately (-Xmx parameter)
Monitor memory usage during long runs
Use streaming processing for very large datasets

Random Number Generation

The tool uses thread-local random number generators seeded from the specified seed value (or system time if not provided). This ensures:

Reproducible results when using the same seed
Thread safety in multi-core environments
Statistical independence between different simulation runs

Output Format

Generated reads use FASTQ format with descriptive headers containing:

ZMW identifier for tracking read origins
Subread numbers for multi-pass reads
Pass counts for quality assessment
Movie position information for temporal analysis

Validation and Quality Control

The implementation includes extensive parameter validation via validateParams() to prevent invalid configurations:

Length parameter consistency (min ≤ max for all length ranges)
Rate parameter bounds (0.0 ≤ rate ≤ 1.0 for all rates)
Genome size compatibility with fragment and repeat requirements
Memory sufficiency for the specified simulation scale

Support

For questions and support:

Email: bbushnell@lbl.gov
Documentation: bbmap.org