CallPeaks

Basic Usage

callpeaks.sh in=<histogram file> out=<output file>

CallPeaks reads a tab-delimited histogram file with x-values (typically k-mer counts) in the first column and frequencies in the second column. It identifies peaks in the histogram and provides detailed analysis including genome size estimation, ploidy detection, and repeat content calculation.

Parameters

Parameters control the finite state automaton behavior, preprocessing algorithms, and genome analysis calculations.

Peak-calling parameters

in=<file>: 'in=stdin.fq' will pipe from standard in. Input should be a 2-column tab-delimited histogram file.
out=<file>: Write the peaks to this file. Default is stdout. Output includes peak coordinates and genome statistics.
minHeight=2: (h) Ignore peaks shorter than this. Minimum peak height threshold for detection.
minVolume=5: (v) Ignore peaks with less area than this. Minimum peak area (volume under curve) for inclusion.
minWidth=3: (w) Ignore peaks narrower than this. Minimum peak width in x-coordinate units.
minPeak=2: (minp) Ignore peaks with an X-value below this. Useful when low-count kmers are filtered. Sets minimum x-coordinate for peak detection.
maxPeak=BIG: (maxp) Ignore peaks with an X-value above this. Sets maximum x-coordinate for peak detection.
maxPeakCount=10: (maxpc) Print up to this many peaks (prioritizing height). Limits the number of peaks reported in output.
countColumn=1: (col) For multi-column input, this column, zero-based, contains the counts. Specifies which column contains the y-values (frequencies).
ploidy=-1: Specify ploidy; otherwise it will be autodetected. Use -1 for automatic detection, or specify expected ploidy (1, 2, 3, etc.).
logscale=f: Transform to log-scale prior to peak-calling. Useful for kmer-frequency histograms where peaks span multiple orders of magnitude.
k=31: K-mer size for genome analysis calculations. Used for estimating genome size and heterozygosity rates.
logwidth=0.1: Width parameter for log-scale transformation. Sets halfWidth value in window calculation: center±halfWidth*pos for weighted averaging boundaries.
logpasses=1: Number of passes for log-scale transformation. Each pass applies logScale() method iteratively using previous pass output as input array.
byheight: Sets CALL_MODE=BY_HEIGHT for peak condensation. callMetric() returns maxHeight instead of volume for priority ranking.
byvolume: Sets CALL_MODE=BY_VOLUME for peak condensation. callMetric() returns volume for priority ranking. Default behavior.
weightByRelief=f: Weight peaks by relief (prominence) in addition to volume or height.
ploidyLogic=2: Algorithm version for ploidy detection. Version 2 uses second.volume*minVolumeFraction threshold to distinguish contamination peaks from legitimate heterozygous peaks.

Smoothing parameters

smoothradius=0: Integer radius of triangle filter. sumPoint() method applies triangular weighting from 1 at edges to radius at center. Zero disables smoothing.
smoothprogressive=f: Set to true to widen the filter as the x-coordinate increases. Uses next=Math.ceil(1+next*progressiveMult) to increment radius at depth intervals.
maxradius=10: Maximum radius of progressive smoothing function. When radius exceeds maxradius, next=Integer.MAX_VALUE prevents further increases.
progressivemult=2: Increment radius each time depth increases by this factor. Default multiplier in next*progressiveMult calculation for progressive smoothing.

Examples

Basic Peak Calling

callpeaks.sh in=kmer_histogram.txt out=peaks.txt

Analyzes a k-mer frequency histogram and outputs detected peaks with genome statistics.

Sensitive Peak Detection

callpeaks.sh in=histogram.txt out=peaks.txt minHeight=1 minVolume=3 minWidth=2

Uses lower thresholds for more sensitive peak detection, useful for low-coverage samples.

Smoothed Analysis with Log Scale

callpeaks.sh in=kmer_hist.txt out=peaks.txt logscale=t smoothradius=3 smoothprogressive=t

Applies log-scale transformation and progressive smoothing for noisy k-mer histograms.

Diploid Genome Analysis

callpeaks.sh in=diploid_hist.txt out=diploid_peaks.txt ploidy=2 k=21

Analyzes a diploid genome using 21-mers, providing heterozygosity rate and genome size estimates.

Algorithm Details

CallPeaks implements a finite state automaton with UP/DOWN mode tracking for peak detection in k-mer frequency histograms:

Peak Detection State Machine

The core algorithm (lines 722-846 in CallPeaks.java) uses a two-state finite automaton:

UP state: Transitions to DOWN when x<prev, recording center position
DOWN state: Transitions to UP when x>prev, completing peak boundary detection
Mesa handling: Flat peak regions are centered using center=(center+j+2)/2 calculation
Valley centering: Valley boundaries are adjusted using similar midpoint calculation for zero and non-zero valleys

Data Preprocessing Methods

Two preprocessing algorithms are available before peak detection:

Log-scale transformation (logScale method): Implements weighted averaging within variable windows using center±halfWidth*pos boundaries. Multiple passes apply iterative smoothing with configurable scale factors.
Triangle smoothing: Uses sumPoint() method with triangular weighting: weight increases linearly from edges (x=1) to center (x=radius). Progressive mode increases radius using next*progressiveMult formula, capped at maxRadius.

Peak Condensation Algorithm

The condense() method implements a two-phase merging strategy when peaks exceed maxPeakCount:

Threshold calculation: Uses sorted arrays to determine height limit (heights[length-maxCount]) and volume limit for middle fraction
Peak compatibility: compatibleWith() method checks min*maxWidthMult≥max constraint for merging eligibility
Absorption process: absorb() method combines peaks by updating boundaries, volumes (volume+=p.volume), and GC content
Width capping: capWidth() enforces maximum width using center*maxWidthMult boundaries

Genome Analysis Calculations

Genome characteristic estimation uses specific mathematical formulations:

Ploidy detection: calcPloidy() uses peak volume ratios with second.volume*4<biggest.volume threshold and Math.round(max/(float)min) ratio calculation
Genome size estimation: Two methods: genomeSizeInPeaks() uses p.volume*Math.round(p.center*mult), while genomeSize2() sums histogram values with copy number correction
Heterozygosity rate: calcHetLocations() sums volumes for peaks below ploidy/2 threshold, divided by k-mer size and haploid genome size
Repeat analysis: repeatSize() calculates repetitive content using p.volume*(Math.round(p.center*mult)-1) formula for peaks beyond homozygous position
GC content: gcContent() methods weight GC counts by estimated copy number and normalize by k-mer size
Error k-mer detection: errorKmers() sums histogram values before the first genomic peak start position

Performance Implementation

Memory allocation uses 120MB default (line 72) with 200MB total allocation. The algorithm processes histograms in single-pass O(n) time complexity. Progressive smoothing uses adaptive radius calculation (next=Math.ceil(1+next*progressiveMult)) to balance computational cost with accuracy. Peak processing scales linearly with detected peak count rather than histogram size.

Output Format

CallPeaks produces detailed output with genome statistics and peak coordinates:

Header Statistics

#k - K-mer size used for analysis
#unique_kmers - Total unique k-mers in histogram
#error_kmers - Estimated erroneous k-mers
#genomic_kmers - Legitimate genomic k-mers
#main_peak - Center position of largest peak
#genome_size_in_peaks - Genome size estimated from peaks
#genome_size - Total genome size estimate
#haploid_genome_size - Size divided by ploidy
#fold_coverage - Average sequencing depth
#ploidy - Detected or specified ploidy
#het_rate - Heterozygosity rate (diploid+)
#percent_repeat - Percentage of repetitive sequence

Peak Data

Tab-delimited columns for each detected peak:

start - Starting x-coordinate of peak
center - Center x-coordinate of peak
stop - Ending x-coordinate of peak
max - Maximum height of peak
volume - Area under the peak curve
gc - GC content (if available)

Support

For questions and support:

Email: bbushnell@lbl.gov
Documentation: bbmap.org