IceCreamMaker

Script: icecreammaker.sh Package: icecream Class: IceCreamMaker.java

Generates synthetic PacBio reads to mimic the chimeric inverted repeats from 'triangle reads', aka 'ice cream cones' - reads missing one adapter.

Basic Usage

icecreammaker.sh in=<file> out=<file> reads=100k minlen=500 maxlen=5000

This tool simulates PacBio sequencing artifacts where adapter sequences are missing or hidden, creating chimeric reads with inverted repeats. The "ice cream cone" name refers to the triangular shape these reads form when plotted, resembling ice cream cones.

Parameters

Parameters control various aspects of synthetic read generation, including sequence properties, error simulation, and ice cream cone artifact simulation. Length parameters distinguish between subread lengths (genomic sequence) and movie lengths (full sequencing output with concatenated subreads).

Standard parameters

in=<file>
A reference genome fasta (optional). If not provided, a synthetic genome will be generated using the specified GC content and genome size.
out=<file>
Synthetic read output file. Reads are generated in FASTQ format with simulated quality scores.
idhist=<file>
Identity histogram output file. Writes a histogram showing the distribution of sequence identities (accuracy levels) across generated reads.
overwrite=f
(ow) Set to false to force the program to abort rather than overwrite an existing file. Default behavior prevents accidental data loss.
ziplevel=2
(zl) Set compression level from 1 (lowest/fastest) through 9 (maximum/slowest). Lower compression reduces CPU time at the cost of larger file sizes.

Length parameters

minlen=500
(minlength) Minimum length of genomic sequence molecules in base pairs. This controls the size of DNA fragments being sequenced, not the final read length.
maxlen=5000
(maxlength) Maximum length of genomic sequence molecules in base pairs. Fragment lengths are randomly selected between minlen and maxlen using a biased distribution favoring shorter fragments.
len=
(length) Set minlen and maxlen to the same number for uniform fragment lengths. Useful for controlled simulations where fragment size variation should be eliminated.
minmovie=500
(minmov) Minimum length of sequencing movies in base pairs. Movies represent the total sequencing output including multiple passes of the same fragment plus adapter sequences.
maxmovie=40k
(maxmov) Maximum length of sequencing movies. Longer movies allow more passes through the same fragment, potentially improving consensus accuracy but taking more sequencing time.
movie=
(mov) Set minmov and maxmov to the same number for uniform movie lengths. Creates consistent sequencing time per ZMW.

Ice cream parameters

missingrate=0
(missing) Fraction of reads missing an adapter sequence (0.0-1.0). Missing adapters cause the polymerase to continue through hairpin structures, creating chimeric reads with inverted repeats. This simulates the primary ice cream cone artifact.
hiddenrate=0
(hidden) Fraction of adapter sequences that are present but not detected by base calling or downstream processing (0.0-1.0). Hidden adapters can cause similar artifacts to missing adapters in downstream analysis.
bothends=f
Allow missing or hidden adapters on both ends of the same fragment. Currently not fully implemented (marked as TODO in source code). When true, would create more complex chimeric structures.

Other parameters

zmws
(reads) Number of Zero Mode Wavelength holes (ZMWs) to generate. Each ZMW produces multiple subreads from the same DNA fragment. This parameter controls the total number of sequencing reactions, not the total number of output reads.
ccs=f
Make Circular Consensus Sequence (CCS) reads producing one consensus read per ZMW from full pass reads only. When enabled, multiple subreads are combined into a single high-accuracy consensus sequence. Error rate parameters still apply to simulate imperfect consensus.
gc=0.6
GC content fraction (0.0-1.0) for synthetic genome generation. Only used when no input reference is provided. This affects base composition and can influence error patterns and sequencing behavior.
genomesize=10m
Size of synthetic genome to generate when no input reference is provided. Accepts standard suffixes (k, m, g). Must be large enough to accommodate the specified fragment lengths and inverted repeat requirements.
irrate=0.0
Add inverted repeats until this fraction of the genome consists of inverted repeats. Creates palindromic sequences that form hairpin structures, mimicking natural genomic features that can cause sequencing artifacts.
irminlen=500
Minimum length of inverted repeats to add to the reference genome. Shorter repeats may not form stable secondary structures under sequencing conditions.
irmaxlen=5000
Maximum length of inverted repeats. Longer repeats create stronger hairpin structures but are less common in natural genomes.
irlen=
Set irminlen and irmaxlen to the same number for uniform inverted repeat lengths. Creates consistent secondary structure strength across all added repeats.
miner=0.05
(minerrorrate) Minimum error rate (0.0-1.0) for base calling simulation. Represents the best-case accuracy achievable under optimal conditions. PacBio error rates are typically 5-30%.
maxer=0.28
(maxerrorrate) Maximum error rate (0.0-1.0) for base calling simulation. Represents worst-case accuracy under challenging conditions such as homopolymer regions or secondary structures.
er=
(errorrate) Set minerrorrate and maxerrorrate to the same value for uniform error rates across all reads. Useful for controlled simulations where error variation should be minimized.

Java Parameters

-Xmx
Set Java heap memory usage, overriding automatic detection. Format: -Xmx20g (20 gigabytes) or -Xmx200m (200 megabytes). Maximum should typically be 85% of available physical memory. Large genomes and high read counts require more memory.
-eoom
Exit on out-of-memory exceptions instead of attempting recovery. Requires Java 8u92 or later. Recommended for production pipelines to avoid corrupted output from memory-constrained runs.
-da
Disable Java assertions to improve runtime performance. Assertions provide additional error checking during development but add computational overhead in production use.

Examples

Basic Synthetic Read Generation

icecreammaker.sh out=synthetic.fq zmws=10000 minlen=1000 maxlen=8000 gc=0.4

Generate 10,000 ZMWs from a synthetic AT-rich genome (40% GC) with fragments ranging from 1-8kb. No ice cream artifacts are introduced (default missingrate=0).

Ice Cream Cone Simulation

icecreammaker.sh in=reference.fa out=icecream.fq zmws=5000 missingrate=0.15 hiddenrate=0.05 minlen=2000 maxlen=6000

Simulate ice cream cone artifacts with 15% missing adapters and 5% hidden adapters using a provided reference genome. Fragment sizes range from 2-6kb.

CCS Read Generation

icecreammaker.sh out=ccs_reads.fq zmws=1000 ccs=t minmovie=15000 maxmovie=25000 miner=0.02 maxer=0.08

Generate high-accuracy CCS reads with longer movies (15-25kb) and lower error rates (2-8%) suitable for consensus sequence generation.

Genome with Inverted Repeats

icecreammaker.sh out=repeat_reads.fq genomesize=1m irrate=0.1 irminlen=200 irmaxlen=2000 zmws=20000

Create a 1Mb synthetic genome where 10% consists of inverted repeats (200bp-2kb), then generate 20,000 ZMWs. Useful for testing repeat resolution algorithms.

Error Rate Analysis

icecreammaker.sh out=error_test.fq idhist=identity_dist.txt zmws=5000 miner=0.05 maxer=0.30 len=3000

Generate reads with variable error rates (5-30%) and uniform fragment length (3kb) while outputting identity histogram for accuracy analysis.

Algorithm Details

Read Generation Strategy

IceCreamMaker employs multi-threaded processing via ProcessThread workers that share an AtomicLong ZMW counter for thread-safe ID allocation. Each thread uses thread-local Random instances seeded from (seed + (tid+1)*tid*1000L) for reproducible parallel generation.

Reference Preparation

When no input reference is provided, the tool generates a synthetic genome using the specified GC content. Random bases are selected with weighted probability matching the target GC ratio. If inverted repeats are requested (irrate > 0), palindromic sequences are added by:

Fragment Selection and Movie Simulation

For each ZMW (Zero Mode Wavelength hole), the algorithm:

Error Model Implementation

Base calling errors are simulated using a three-component model reflecting real PacBio error characteristics with thresholds calculated from configured fractions:

Error rates vary per ZMW using randomRate() between the specified minimum and maximum with biased distribution favoring moderate error rates over extremes.

Ice Cream Cone Artifact Simulation

The characteristic "ice cream cone" artifacts arise from missing or hidden adapter sequences:

Missing Adapters (missingrate parameter):

Hidden Adapters (hiddenrate parameter):

CCS Mode Processing

When CCS (Circular Consensus Sequence) mode is enabled:

Quality Score Simulation

Quality scores are assigned using Shared.FAKE_QUAL=8 (Phred score 8) for all positions, representing the typical range of PacBio quality scores. The implementation uses a fixed quality value rather than position-specific or context-dependent scoring.

Performance Optimizations

The implementation includes several performance optimizations for large-scale simulation:

Statistical Tracking

The tool maintains detailed statistics during generation via thread-local arrays:

Technical Notes

Memory Requirements

Memory usage scales with genome size and read count. The reference genome is loaded entirely into memory (limited to MAX_GENOME_LENGTH=2GB), and each thread maintains buffers for read generation. For large simulations, consider:

Random Number Generation

The tool uses thread-local random number generators seeded from the specified seed value (or system time if not provided). This ensures:

Output Format

Generated reads use FASTQ format with descriptive headers containing:

Validation and Quality Control

The implementation includes extensive parameter validation via validateParams() to prevent invalid configurations:

Support

For questions and support: