CallPeaks
Calls peaks from a 2-column (x, y) tab-delimited histogram. Designed primarily for analyzing k-mer frequency histograms to estimate genome characteristics including size, ploidy, heterozygosity rate, and repeat content.
Basic Usage
callpeaks.sh in=<histogram file> out=<output file>
CallPeaks reads a tab-delimited histogram file with x-values (typically k-mer counts) in the first column and frequencies in the second column. It identifies peaks in the histogram and provides detailed analysis including genome size estimation, ploidy detection, and repeat content calculation.
Parameters
Parameters control the finite state automaton behavior, preprocessing algorithms, and genome analysis calculations.
Peak-calling parameters
- in=<file>
- 'in=stdin.fq' will pipe from standard in. Input should be a 2-column tab-delimited histogram file.
- out=<file>
- Write the peaks to this file. Default is stdout. Output includes peak coordinates and genome statistics.
- minHeight=2
- (h) Ignore peaks shorter than this. Minimum peak height threshold for detection.
- minVolume=5
- (v) Ignore peaks with less area than this. Minimum peak area (volume under curve) for inclusion.
- minWidth=3
- (w) Ignore peaks narrower than this. Minimum peak width in x-coordinate units.
- minPeak=2
- (minp) Ignore peaks with an X-value below this. Useful when low-count kmers are filtered. Sets minimum x-coordinate for peak detection.
- maxPeak=BIG
- (maxp) Ignore peaks with an X-value above this. Sets maximum x-coordinate for peak detection.
- maxPeakCount=10
- (maxpc) Print up to this many peaks (prioritizing height). Limits the number of peaks reported in output.
- countColumn=1
- (col) For multi-column input, this column, zero-based, contains the counts. Specifies which column contains the y-values (frequencies).
- ploidy=-1
- Specify ploidy; otherwise it will be autodetected. Use -1 for automatic detection, or specify expected ploidy (1, 2, 3, etc.).
- logscale=f
- Transform to log-scale prior to peak-calling. Useful for kmer-frequency histograms where peaks span multiple orders of magnitude.
- k=31
- K-mer size for genome analysis calculations. Used for estimating genome size and heterozygosity rates.
- logwidth=0.1
- Width parameter for log-scale transformation. Sets halfWidth value in window calculation:
center±halfWidth*pos
for weighted averaging boundaries. - logpasses=1
- Number of passes for log-scale transformation. Each pass applies
logScale()
method iteratively using previous pass output as input array. - byheight
- Sets
CALL_MODE=BY_HEIGHT
for peak condensation.callMetric()
returns maxHeight instead of volume for priority ranking. - byvolume
- Sets
CALL_MODE=BY_VOLUME
for peak condensation.callMetric()
returns volume for priority ranking. Default behavior. - weightByRelief=f
- Weight peaks by relief (prominence) in addition to volume or height.
- ploidyLogic=2
- Algorithm version for ploidy detection. Version 2 uses
second.volume*minVolumeFraction
threshold to distinguish contamination peaks from legitimate heterozygous peaks.
Smoothing parameters
- smoothradius=0
- Integer radius of triangle filter.
sumPoint()
method applies triangular weighting from 1 at edges to radius at center. Zero disables smoothing. - smoothprogressive=f
- Set to true to widen the filter as the x-coordinate increases. Uses
next=Math.ceil(1+next*progressiveMult)
to increment radius at depth intervals. - maxradius=10
- Maximum radius of progressive smoothing function. When radius exceeds maxradius,
next=Integer.MAX_VALUE
prevents further increases. - progressivemult=2
- Increment radius each time depth increases by this factor. Default multiplier in
next*progressiveMult
calculation for progressive smoothing.
Examples
Basic Peak Calling
callpeaks.sh in=kmer_histogram.txt out=peaks.txt
Analyzes a k-mer frequency histogram and outputs detected peaks with genome statistics.
Sensitive Peak Detection
callpeaks.sh in=histogram.txt out=peaks.txt minHeight=1 minVolume=3 minWidth=2
Uses lower thresholds for more sensitive peak detection, useful for low-coverage samples.
Smoothed Analysis with Log Scale
callpeaks.sh in=kmer_hist.txt out=peaks.txt logscale=t smoothradius=3 smoothprogressive=t
Applies log-scale transformation and progressive smoothing for noisy k-mer histograms.
Diploid Genome Analysis
callpeaks.sh in=diploid_hist.txt out=diploid_peaks.txt ploidy=2 k=21
Analyzes a diploid genome using 21-mers, providing heterozygosity rate and genome size estimates.
Algorithm Details
CallPeaks implements a finite state automaton with UP/DOWN mode tracking for peak detection in k-mer frequency histograms:
Peak Detection State Machine
The core algorithm (lines 722-846 in CallPeaks.java) uses a two-state finite automaton:
- UP state: Transitions to DOWN when
x<prev
, recording center position - DOWN state: Transitions to UP when
x>prev
, completing peak boundary detection - Mesa handling: Flat peak regions are centered using
center=(center+j+2)/2
calculation - Valley centering: Valley boundaries are adjusted using similar midpoint calculation for zero and non-zero valleys
Data Preprocessing Methods
Two preprocessing algorithms are available before peak detection:
- Log-scale transformation (logScale method): Implements weighted averaging within variable windows using
center±halfWidth*pos
boundaries. Multiple passes apply iterative smoothing with configurable scale factors. - Triangle smoothing: Uses
sumPoint()
method with triangular weighting: weight increases linearly from edges (x=1) to center (x=radius). Progressive mode increases radius usingnext*progressiveMult
formula, capped at maxRadius.
Peak Condensation Algorithm
The condense()
method implements a two-phase merging strategy when peaks exceed maxPeakCount:
- Threshold calculation: Uses sorted arrays to determine height limit (
heights[length-maxCount]
) and volume limit for middle fraction - Peak compatibility:
compatibleWith()
method checksmin*maxWidthMult≥max
constraint for merging eligibility - Absorption process:
absorb()
method combines peaks by updating boundaries, volumes (volume+=p.volume
), and GC content - Width capping:
capWidth()
enforces maximum width usingcenter*maxWidthMult
boundaries
Genome Analysis Calculations
Genome characteristic estimation uses specific mathematical formulations:
- Ploidy detection:
calcPloidy()
uses peak volume ratios withsecond.volume*4<biggest.volume
threshold andMath.round(max/(float)min)
ratio calculation - Genome size estimation: Two methods:
genomeSizeInPeaks()
usesp.volume*Math.round(p.center*mult)
, whilegenomeSize2()
sums histogram values with copy number correction - Heterozygosity rate:
calcHetLocations()
sums volumes for peaks below ploidy/2 threshold, divided by k-mer size and haploid genome size - Repeat analysis:
repeatSize()
calculates repetitive content usingp.volume*(Math.round(p.center*mult)-1)
formula for peaks beyond homozygous position - GC content:
gcContent()
methods weight GC counts by estimated copy number and normalize by k-mer size - Error k-mer detection:
errorKmers()
sums histogram values before the first genomic peak start position
Performance Implementation
Memory allocation uses 120MB default (line 72) with 200MB total allocation. The algorithm processes histograms in single-pass O(n) time complexity. Progressive smoothing uses adaptive radius calculation (next=Math.ceil(1+next*progressiveMult)
) to balance computational cost with accuracy. Peak processing scales linearly with detected peak count rather than histogram size.
Output Format
CallPeaks produces detailed output with genome statistics and peak coordinates:
Header Statistics
#k
- K-mer size used for analysis#unique_kmers
- Total unique k-mers in histogram#error_kmers
- Estimated erroneous k-mers#genomic_kmers
- Legitimate genomic k-mers#main_peak
- Center position of largest peak#genome_size_in_peaks
- Genome size estimated from peaks#genome_size
- Total genome size estimate#haploid_genome_size
- Size divided by ploidy#fold_coverage
- Average sequencing depth#ploidy
- Detected or specified ploidy#het_rate
- Heterozygosity rate (diploid+)#percent_repeat
- Percentage of repetitive sequence
Peak Data
Tab-delimited columns for each detected peak:
start
- Starting x-coordinate of peakcenter
- Center x-coordinate of peakstop
- Ending x-coordinate of peakmax
- Maximum height of peakvolume
- Area under the peak curvegc
- GC content (if available)
Support
For questions and support:
- Email: bbushnell@lbl.gov
- Documentation: bbmap.org