IceCreamFinder

Basic Usage

icecreamfinder.sh in=<input file> out=<output file> outb=<bad reads>

IceCreamFinder is designed specifically for processing PacBio long-read sequencing data to identify and filter chimeric reads caused by missing or improperly positioned adapter sequences. These problematic reads form characteristic inverted repeat patterns that create "triangle" or "ice cream cone" shaped structures when aligned to themselves.

Parameters

Parameters are organized by their function in the ice cream detection and filtering process. The tool uses sophisticated alignment algorithms to detect inverted repeats that indicate chimeric reads.

File I/O parameters

in=<file>: Primary input file containing PacBio reads in FASTQ or FASTA format. Supports gzipped files.
out=<file>: (outgood) Output file for reads that pass all filters and are considered high-quality, non-chimeric reads suitable for downstream analysis.
outa=<file>: (outambig) Output file for reads with inverted repeats where it is unclear whether the pattern is natural (biological) or artifactual (sequencing artifact). These reads require manual inspection or additional filtering.
outb=<file>: (outbad) Output file for reads strongly suspected to be chimeric due to missing or mispositioned adapter sequences. These reads should typically be excluded from analysis.
outj=<file>: (outjunction) Output file for junction sequences extracted from inverted repeat reads. These represent the breakpoints where chimeric reads were formed and can be useful for troubleshooting sequencing protocols.
stats=<file>: Redirect statistical output to this file instead of printing to screen. Contains detailed metrics about read processing and filtering results.
json=f: Print statistics in JSON format instead of human-readable text format. Useful for automated parsing and integration with analysis pipelines.
asrhist=<file>: Output histogram of adapter alignment score ratios. Useful for understanding adapter detection sensitivity and optimizing thresholds.
irsist=<file>: Output histogram of inverted repeat alignment score ratios. Provides insights into the distribution of inverted repeat quality scores.
ambig=: Control routing of ambiguous reads that contain inverted repeats but uncertain classification. They are always sent to outa if specified. Options: good (send to outgood), bad (send to outbad), good,bad (send to both), null (don't send to outgood or outbad).
overwrite=f: (ow) Allow overwriting of existing output files. Set to false to force program termination if output files already exist.
ziplevel=2: (zl) Compression level for gzipped output files, from 1 (fastest, largest) to 9 (slowest, smallest). Level 2 provides good balance of speed and compression.

Processing parameters

alignrc=t: Align the reverse-complement of each read to itself to detect inverted repeats. This is the core algorithm for identifying ice cream cone structures. Disabling reduces sensitivity significantly.
alignadapter=t: Perform adapter sequence alignment to identify reads containing visible adapter sequences. Works in conjunction with inverted repeat detection to classify chimeric reads.
adapter=: Adapter sequence to search for in reads. Default: ATCTCTCTCAACAACAACAACGGAGGAGGAGGAAAAGAGAGAGAT (PacBio SMRTbell adapter sequence).
icecreamonly=t: (ico) Filter only suspected triangle reads (ice cream cones) with specific geometric properties. When false, removes all inverted repeats regardless of their classification as biological or artifactual.
ksr=t: (keepshortreads) Retain non-triangle reads from ZMWs that contain triangle reads. Preserves potentially useful short reads that may represent legitimate fragments.
kzt=f: (keepzmwstogether) Send all reads from the same Zero Mode Waveguide (ZMW) to the same output file, maintaining read family relationships for downstream analysis.
targetqlen=352: (qlen) Target length for query sequences extracted from read tips for inverted repeat detection. Optimized for PacBio read characteristics and adapter spacing.
qlenfraction=0.15: Maximum fraction of read length to use for query sequences. For short reads, this override targetqlen to prevent using excessively large query segments that reduce sensitivity.
minlen=40: Minimum read length after trimming operations. Reads shorter than this threshold are discarded as too short for reliable analysis or downstream applications.
minqlen=100: Minimum query sequence length for inverted repeat detection. Very short queries lack sufficient specificity for reliable alignment. Overrides qlenfraction for very short reads.
shortfraction=0.4: Minimum fraction of read length that the short half of an inverted repeat must represent to classify the read as a triangle/ice cream cone. Prevents misclassification of reads with small terminal repeats.
ccs=f: Indicate that input reads are Circular Consensus Sequences (CCS/HiFi reads), which are full-pass, high-accuracy reads. When true, increases minratio thresholds as these reads have higher expected accuracy.
trim=t: Trim adapter sequences detected at read termini. Removes adapter contamination that can interfere with downstream analysis while preserving the biological sequence content.
trimpolya=f: Trim terminal poly-A and poly-T sequences commonly found in some Iso-Seq libraries. Useful for transcript sequencing applications where these represent biological features.
minpolymer=5: Minimum length of poly-A sequence required for trimming. Prevents removal of short A/T runs that may be biological rather than technical artifacts.
polyerror=0.2: Maximum error rate tolerated when identifying poly-A sequences for trimming. Accommodates sequencing errors within genuine poly-A tails.

Speed and sensitivity parameters

jni=f: Enable optimized C code implementation through Java Native Interface for significantly higher processing speed with identical results. Requires compiled JNI libraries.
minratio=: Fraction of maximum possible alignment score required to consider alignments as significant matches. Controls the stringency of inverted repeat detection. Higher values reduce false positives but may miss degraded chimeric reads. Roughly equivalent to twice the per-read error rate.
minratio1=0.59: Alignment score ratio threshold for the initial inverted repeat detection pass. Lower threshold allows detection of lower-quality inverted repeats that may be confirmed in subsequent analysis.
minratio2=0.64: Alignment score ratio threshold for the refined alignment pass after initial junction identification. Higher threshold ensures high confidence in final classification decisions.
adapterratio=0.18: Initial adapter detection sensitivity threshold affecting processing speed. Lower values increase sensitivity but require more computation time for comprehensive adapter searching.
adapterratio2=.325: Final adapter detection sensitivity threshold used for definitive adapter identification. Higher threshold reduces false positive adapter calls.
minscore=-800: Minimum alignment score threshold for early termination of alignment calculations. Aggressive optimization that improves speed by abandoning poor alignments early. Value of -800 is conservative and safe.

Entropy parameters (recommended setting is 'entropy=t')

minentropy=-1: Minimum sequence complexity threshold for read retention (scale 0-1). Recommended value 0.55 removes low-complexity reads that may represent artifacts. Values above 0.7 are too stringent. Negative values disable entropy filtering.
entropyk=3: K-mer length used for entropy calculation. Value of 3 provides good balance between sensitivity to low-complexity regions and computational efficiency.
entropylen=350: Minimum length of consecutive low-entropy sequence required to trigger read removal. Prevents removal of reads with short low-complexity regions that may be biologically relevant.
entropyfraction=0.5: Alternative minimum length threshold expressed as fraction of read length. The shorter of entropylen and (entropyfraction × read length) determines the actual threshold for each read.
entropywindow=50: Sliding window size used for entropy calculation. Larger windows smooth over local complexity variations while smaller windows are more sensitive to localized low-complexity regions.
maxmonomerfraction=0.74: (mmf) Maximum fraction of bases in any entropy calculation window that can be the same nucleotide. Provides additional filter for homopolymer runs and other low-complexity sequences.

Java Parameters

-Xmx: Set maximum Java heap memory, overriding automatic detection. Examples: -Xmx20g (20 gigabytes), -Xmx2000m (2000 megabytes). Maximum is typically 85% of physical RAM. Default auto-detection usually sufficient.
-eoom: Exit immediately if out-of-memory exception occurs rather than attempting recovery. Useful for automated pipelines where memory exhaustion should trigger job termination. Requires Java 8u92 or later.
-da: Disable Java assertions for slight performance improvement in production environments. Assertions are debugging aids that verify internal program correctness but add minor computational overhead.

Examples

Basic Ice Cream Detection

icecreamfinder.sh in=pacbio_subreads.fastq out=clean_reads.fastq outb=chimeric_reads.fastq

Basic usage for PacBio subread filtering. Identifies and removes chimeric reads while preserving high-quality reads for downstream analysis.

Comprehensive Analysis with Junction Output

icecreamfinder.sh in=pacbio_raw.fastq out=good.fastq outa=ambiguous.fastq outb=bad.fastq outj=junctions.fasta stats=ice_cream_stats.txt

Complete analysis separating reads into good, ambiguous, and bad categories while extracting junction sequences and generating detailed statistics.

CCS/HiFi Read Processing

icecreamfinder.sh in=ccs_reads.fastq out=filtered_ccs.fastq ccs=t minratio=0.7 entropy=t minentropy=0.55

Configuration for high-accuracy CCS/HiFi reads with stricter alignment thresholds (minratio=0.7) and entropy filtering enabled (minentropy=0.55).

Iso-Seq Library Processing

icecreamfinder.sh in=isoseq_fl.fastq out=clean_isoseq.fastq trimpolya=t minpolymer=8 polyerror=0.15

Configuration for Iso-Seq full-length transcript libraries with poly-A tail trimming to remove technical artifacts while preserving biological sequences.

High-Sensitivity Detection

icecreamfinder.sh in=difficult_sample.fastq out=cleaned.fastq minratio1=0.55 minratio2=0.60 icecreamonly=f

Lower stringency settings for challenging samples where standard parameters may miss some chimeric reads. Removes all inverted repeats rather than just ice cream cones.

Algorithm Details

Ice Cream Detection Strategy

IceCreamFinder employs a dual-alignment strategy using alignLeft() and alignRight() methods from the IceCreamAligner class to identify chimeric PacBio reads caused by missing or mispositioned adapter sequences. The algorithm focuses on detecting characteristic "ice cream cone" or "triangle" patterns that form when sequencing proceeds through a missing adapter junction.

Inverted Repeat Detection

The core detection mechanism aligns query sequences extracted from read termini against the reverse-complement of the entire read. This identifies inverted repeats that indicate potential chimeric structures:

Query Extraction: Sequences of length targetqlen (default 352bp) are extracted from both 5' and 3' read termini
Reverse-Complement Alignment: Each query is reverse-complemented and aligned against the full read using optimized PacBio-specific alignment parameters
Junction Identification: Successful alignments indicate inverted repeats, with junction locations calculated based on alignment coordinates
Geometric Validation: Reads are classified as ice cream cones based on junction position and the relative size of repeat segments

Multi-Pass Alignment Protocol

The algorithm uses a two-pass alignment strategy for optimal sensitivity and specificity:

Initial Pass: Lower stringency alignment (minratio1=0.59) to detect potential inverted repeats
Refinement Pass: Higher stringency alignment (minratio2=0.64) after provisional junction identification for confirmation
Realignment: Optional realignment with adjusted query lengths based on junction position for improved accuracy

Adapter Detection Integration

Parallel adapter detection provides complementary evidence for chimeric read classification:

Sliding Window Search: Adapter sequences are aligned across reads using configurable window sizes and stride lengths
Score Threshold Cascade: Multiple sensitivity levels (adapterratio, adapterratio2) balance speed and sensitivity
Locality-Based Scoring: Nearby adapter hits are evaluated collectively to improve detection of degraded adapter sequences
Alternative Scoring: Secondary alignment algorithms provide additional validation for marginal adapter hits

ZMW-Level Analysis

Processing considers all reads from each Zero Mode Waveguide (ZMW) collectively:

Median Length Sampling: Representative reads are selected based on median length within each ZMW
Cross-Read Validation: Evidence from multiple subreads strengthens chimeric read classification
Family-Based Filtering: Options to process entire ZMW families together or independently based on analysis requirements

Quality Control Features

Multiple quality control mechanisms ensure robust filtering:

Entropy Filtering: Sequence complexity analysis removes low-complexity artifacts that may mimic chimeric patterns
Length Filtering: Minimum length thresholds prevent processing of degraded or fragment reads
Junction Validation: Geometric constraints ensure detected junctions represent genuine chimeric breakpoints
Ambiguity Handling: Sophisticated classification system separates high-confidence calls from uncertain cases

Performance Optimizations

The implementation includes several performance enhancements for large-scale processing:

Early Termination: Alignment calculations terminate early when scores fall below minimum thresholds
JNI Integration: Optional C code implementation provides significant speed improvements
Multithreading: ZMW-level parallelization scales efficiently across available CPU cores
Memory Management: Streaming processing minimizes memory requirements for large datasets

Output Interpretation

Read Classification Categories

Good Reads (out): High-quality reads with no evidence of chimeric structure, suitable for downstream analysis
Ambiguous Reads (outa): Reads containing inverted repeats where biological vs. artifactual origin is unclear
Bad Reads (outb): Reads classified as chimeric with high confidence, typically excluded from analysis
Junctions (outj): Extracted breakpoint sequences useful for protocol troubleshooting and quality assessment

Statistical Output Metrics

Processing Statistics: Read and base counts, processing time, filtering efficiency
ZMW Analysis: Bad ZMW identification, missing adapter detection, entropy filtering results
Quality Metrics: Average score ratios, alignment statistics, sensitivity analysis
Performance Data: Iteration counts, timing information for optimization

Support

For questions and support:

Email: bbushnell@lbl.gov
Documentation: bbmap.org