CallPeaks

Script: callpeaks.sh Package: jgi Class: CallPeaks.java

Calls peaks from a 2-column (x, y) tab-delimited histogram. Designed primarily for analyzing k-mer frequency histograms to estimate genome characteristics including size, ploidy, heterozygosity rate, and repeat content.

Basic Usage

callpeaks.sh in=<histogram file> out=<output file>

CallPeaks reads a tab-delimited histogram file with x-values (typically k-mer counts) in the first column and frequencies in the second column. It identifies peaks in the histogram and provides detailed analysis including genome size estimation, ploidy detection, and repeat content calculation.

Parameters

Parameters control the finite state automaton behavior, preprocessing algorithms, and genome analysis calculations.

Peak-calling parameters

in=<file>
'in=stdin.fq' will pipe from standard in. Input should be a 2-column tab-delimited histogram file.
out=<file>
Write the peaks to this file. Default is stdout. Output includes peak coordinates and genome statistics.
minHeight=2
(h) Ignore peaks shorter than this. Minimum peak height threshold for detection.
minVolume=5
(v) Ignore peaks with less area than this. Minimum peak area (volume under curve) for inclusion.
minWidth=3
(w) Ignore peaks narrower than this. Minimum peak width in x-coordinate units.
minPeak=2
(minp) Ignore peaks with an X-value below this. Useful when low-count kmers are filtered. Sets minimum x-coordinate for peak detection.
maxPeak=BIG
(maxp) Ignore peaks with an X-value above this. Sets maximum x-coordinate for peak detection.
maxPeakCount=10
(maxpc) Print up to this many peaks (prioritizing height). Limits the number of peaks reported in output.
countColumn=1
(col) For multi-column input, this column, zero-based, contains the counts. Specifies which column contains the y-values (frequencies).
ploidy=-1
Specify ploidy; otherwise it will be autodetected. Use -1 for automatic detection, or specify expected ploidy (1, 2, 3, etc.).
logscale=f
Transform to log-scale prior to peak-calling. Useful for kmer-frequency histograms where peaks span multiple orders of magnitude.
k=31
K-mer size for genome analysis calculations. Used for estimating genome size and heterozygosity rates.
logwidth=0.1
Width parameter for log-scale transformation. Sets halfWidth value in window calculation: center±halfWidth*pos for weighted averaging boundaries.
logpasses=1
Number of passes for log-scale transformation. Each pass applies logScale() method iteratively using previous pass output as input array.
byheight
Sets CALL_MODE=BY_HEIGHT for peak condensation. callMetric() returns maxHeight instead of volume for priority ranking.
byvolume
Sets CALL_MODE=BY_VOLUME for peak condensation. callMetric() returns volume for priority ranking. Default behavior.
weightByRelief=f
Weight peaks by relief (prominence) in addition to volume or height.
ploidyLogic=2
Algorithm version for ploidy detection. Version 2 uses second.volume*minVolumeFraction threshold to distinguish contamination peaks from legitimate heterozygous peaks.

Smoothing parameters

smoothradius=0
Integer radius of triangle filter. sumPoint() method applies triangular weighting from 1 at edges to radius at center. Zero disables smoothing.
smoothprogressive=f
Set to true to widen the filter as the x-coordinate increases. Uses next=Math.ceil(1+next*progressiveMult) to increment radius at depth intervals.
maxradius=10
Maximum radius of progressive smoothing function. When radius exceeds maxradius, next=Integer.MAX_VALUE prevents further increases.
progressivemult=2
Increment radius each time depth increases by this factor. Default multiplier in next*progressiveMult calculation for progressive smoothing.

Examples

Basic Peak Calling

callpeaks.sh in=kmer_histogram.txt out=peaks.txt

Analyzes a k-mer frequency histogram and outputs detected peaks with genome statistics.

Sensitive Peak Detection

callpeaks.sh in=histogram.txt out=peaks.txt minHeight=1 minVolume=3 minWidth=2

Uses lower thresholds for more sensitive peak detection, useful for low-coverage samples.

Smoothed Analysis with Log Scale

callpeaks.sh in=kmer_hist.txt out=peaks.txt logscale=t smoothradius=3 smoothprogressive=t

Applies log-scale transformation and progressive smoothing for noisy k-mer histograms.

Diploid Genome Analysis

callpeaks.sh in=diploid_hist.txt out=diploid_peaks.txt ploidy=2 k=21

Analyzes a diploid genome using 21-mers, providing heterozygosity rate and genome size estimates.

Algorithm Details

CallPeaks implements a finite state automaton with UP/DOWN mode tracking for peak detection in k-mer frequency histograms:

Peak Detection State Machine

The core algorithm (lines 722-846 in CallPeaks.java) uses a two-state finite automaton:

Data Preprocessing Methods

Two preprocessing algorithms are available before peak detection:

Peak Condensation Algorithm

The condense() method implements a two-phase merging strategy when peaks exceed maxPeakCount:

Genome Analysis Calculations

Genome characteristic estimation uses specific mathematical formulations:

Performance Implementation

Memory allocation uses 120MB default (line 72) with 200MB total allocation. The algorithm processes histograms in single-pass O(n) time complexity. Progressive smoothing uses adaptive radius calculation (next=Math.ceil(1+next*progressiveMult)) to balance computational cost with accuracy. Peak processing scales linearly with detected peak count rather than histogram size.

Output Format

CallPeaks produces detailed output with genome statistics and peak coordinates:

Header Statistics

Peak Data

Tab-delimited columns for each detected peak:

Support

For questions and support: