KHist

Script: khist.sh Package: jgi Class: KmerNormalize.java

Generates kmer frequency histograms for genome size estimation and coverage analysis. Uses Count-Min Sketch data structures for memory-bounded kmer counting with probabilistic accuracy guarantees.

Basic Usage

khist.sh in=<input> hist=<histogram output>

KHist generates kmer depth histograms from sequencing data or assemblies to analyze coverage distribution, estimate genome size, and assess data quality. It is a specialized wrapper around BBNorm (KmerNormalize) that uses predefined parameters optimized for histogram generation rather than read normalization.

Relationship to BBNorm

KHist is one of three shell scripts (bbnorm.sh, ecc.sh, khist.sh) that call KmerNormalize with different default parameters. For histogram generation, KHist uses these fixed parameters:

All other BBNorm parameters remain available and can be overridden on the command line. This equivalent command produces the same result as the basic usage above:

bbnorm.sh in=reads.fq khist=khist.txt peaks=peaks.txt passes=1 prefilter minprob=0 minqual=0 mindepth=0

Parameters

KHist accepts all BBNorm parameters. The most commonly used parameters for histogram generation are listed below. For the complete parameter list, see bbnorm.sh documentation.

Input/Output Parameters

in=null
Primary input file. Use in2 for paired reads in a second file.
in2=null
Second input file for paired reads in two files.
hist=stdout
Output file for the kmer depth histogram. Default is stdout.
peaks=null
Output file for peak analysis and genome size estimation. Contains peak locations, genome size estimates, and ploidy predictions.
extra=null
Additional files to use for input (generating hash table) but not for output.
tablereads=-1
Use at most this many reads when building the hashtable (-1 means all).
reads=-1
Only process this number of reads, then quit (-1 means all).

Hashing Parameters

k=31
Kmer length (values under 32 provide best performance, but arbitrarily high values are supported).
bits=32
Bits per cell in bloom filter; must be 2, 4, 8, 16, or 32. Maximum kmer depth recorded is 2^bits-1. Fixed at 32 for KHist to capture high-depth kmers.
hashes=3
Number of times each kmer is hashed and stored. Higher is slower but more accurate if there is enough memory.
cells=-1
Number of cells in bloom filter. Auto-calculated based on available memory if not specified.
prefilter=true
Store low-count kmers in smaller cells (2-bit) and high-count kmers in larger cells (32-bit) for memory efficiency. Enabled by default for KHist.
prefilterbits=2
Bits per cell in prefilter bloom filter. Default 2-bit allows counts up to 3.
minq=6
Ignore kmers containing bases with quality below this.
minprob=0
Ignore kmers with overall probability of correctness below this. Set to 0 for KHist to include all kmers.
canonical=true
Use canonical kmers (count forward and reverse-complement kmers together).

Histogram Parameters

histlen=1048576
Maximum kmer depth displayed in histogram. Also affects statistics displayed.
histcol=3
(histogramcolumns) Number of histogram columns, 2 or 3.
pzc=f
(printzerocoverage) Print lines in the histogram with zero coverage.
zerobin=f
Set to true if you want kmers with a count of 0 to go in the 0 bin instead of the 1 bin.
minheight=1
Minimum peak height for genome size estimation.
minvolume=5
Minimum peak volume for genome size estimation.
minwidth=3
Minimum peak width for genome size estimation.
minpeak=2
Minimum depth for peak detection.
maxpeak=100000
Maximum depth for peak detection.
ploidy=2
Expected ploidy for genome size estimation (1-4 supported).

Performance Parameters

threads=auto
Spawn exactly X hashing threads (default is number of logical processors). Total active threads may exceed X due to I/O threads.
-Xmx
Set Java's memory usage, overriding autodetection. Example: -Xmx20g for 20 GB RAM.
deterministic=t
Generate random numbers deterministically to ensure identical output between runs. May decrease speed with many threads.

Quality and Filtering Parameters

qin=auto
ASCII offset for input quality. May be 33 (Sanger), 64 (Illumina), or auto.
rdk=t
(removeduplicatekmers) When true, a kmer's count will only be incremented once per read pair.
kmersample=1
Process every nth kmer, and skip the rest.
readsample=1
Process every nth read, and skip the rest.

Algorithm Details

Count-Min Sketch Data Structure

KHist uses a Count-Min Sketch (CMS), also called a "counting Bloom filter," as its core data structure. This probabilistic hash table stores only values (counts), not keys (kmers), and handles collisions through multiple independent hash functions. The key advantages:

1. Hash Table Construction

Creates Count-Min Sketch with configurable parameters:

  • Cell Size: 1, 2, 4, 8, 16, or 32 bits per cell (KHist uses 32)
  • Hash Functions: Default 3 independent hash functions for collision avoidance
  • Capacity: 1GB RAM accommodates 4 billion 2-bit cells or 500 million 16-bit cells

For 32-bit cells with expected coverage of 200x, KHist can handle counts up to 4.3 billion per kmer.

2. Memory Management and Load Factor

KHist calculates usable memory with the formula:

usable_memory = max(((memory-96MB)*0.73), memory*0.45)

Cell allocation considers histogram overhead:

cells = (usable_memory * 8) / bits_per_cell

Load Factor Warnings

KHist monitors hash table load factor and issues warnings:

  • Under 60%: Maximum accuracy with minimal hash collisions
  • 60-90%: Good accuracy for normalization, reduced histogram precision
  • Over 90%: Warning issued - use prefilter, increase memory, or filter low-quality reads

3. Prefilter Strategy

KHist enables prefilter by default for memory efficiency. The two-tier system:

  • Primary Filter: 32-bit cells for high-count kmers (accurate counts)
  • Prefilter: 2-bit cells for low-count kmers (counts up to 3)

This hybrid approach maximizes memory utilization while maintaining accuracy for the most important high-depth kmers.

4. Kmer Processing

Processes each read by:

  • Extracting all k-length subsequences
  • Converting to canonical form (lexicographically smaller of forward/reverse-complement)
  • Applying quality and probability filters
  • Hashing with multiple independent hash functions
  • Incrementing counts in hash table using atomic operations
  • Using minimum count across hash functions to handle collisions

5. Histogram Generation and Peak Analysis

After processing all reads, generates histogram and identifies peaks for genome analysis:

  • Histogram: Distribution of kmer depths across the dataset
  • Peak Detection: Identifies coverage peaks corresponding to single-copy, repetitive, and error kmers
  • Genome Size Estimation: Uses peak analysis to estimate haploid genome size
  • Ploidy Prediction: Analyzes peak ratios to infer genome ploidy (1-4x supported)

Examples

Basic Histogram Generation

khist.sh in=reads.fastq hist=kmer_histogram.txt

Generates a kmer count histogram from sequencing reads and saves it to a file.

Genome Size Estimation

khist.sh in=reads.fq khist=khist.txt peaks=peaks.txt

Generates both histogram and peaks file for genome size estimation. The peaks file contains peak locations, genome size estimates, and ploidy predictions based on coverage analysis.

Paired-End Reads

khist.sh in1=reads_R1.fastq in2=reads_R2.fastq hist=kmer_histogram.txt peaks=peaks.txt

Processes paired-end reads to generate a combined kmer histogram with genome size estimation.

High-Memory Analysis with Custom Parameters

khist.sh in=reads.fastq hist=histogram.txt -Xmx64g histlen=10000000 prefilter

Allocates 64GB memory, extends histogram length to capture very high-depth kmers, and explicitly enables prefilter for maximum accuracy.

Quality-Filtered Analysis

khist.sh in=reads.fastq hist=histogram.txt minq=10 minprob=0.8 k=21

Uses quality filtering to exclude unreliable kmers and shorter kmers (21-mers) for highly repetitive genomes or low-coverage data.

Multiple Input Files

khist.sh in=reads1.fastq,reads2.fastq,reads3.fastq hist=combined_histogram.txt

Generates a combined histogram from multiple sequencing files.

Output Format

Histogram File

KHist produces a tab-delimited histogram with the following columns:

#Depth	Count	Fraction
1	15234567	0.45123
2	8901234	0.26234
3	5432109	0.16345
...

Column Descriptions:

  • Depth: Kmer count (how many times the kmer appears)
  • Count: Number of unique kmers with this depth
  • Fraction: Proportion of total unique kmers at this depth

Peaks File

When peaks=file is specified, KHist generates a peaks analysis file containing:

Note: Peak analysis is only accurate for randomly-sheared isolate genomic DNA with minimal contamination and ploidy ≤4.

Statistics Summary

Before the histogram, KHist outputs summary statistics including:

Applications

Genome Size Estimation

Primary use case for determining haploid genome size from sequencing data. KHist identifies the coverage peak corresponding to single-copy genomic sequences and calculates total genome size using the formula: genome_size = total_bases / peak_coverage.

Coverage Distribution Analysis

Reveals sequencing depth uniformity and identifies regions of unusual coverage that may indicate repetitive elements, contamination, or technical artifacts.

Assembly Planning

Kmer depth distributions help choose appropriate parameters for genome assemblers by revealing the expected coverage, error rates, and repetitive content structure.

Data Quality Assessment

The shape of low-depth regions indicates sequencing errors, adapter contamination, or other data quality issues. High error rates appear as elevated counts at depth 1-2.

Contamination Detection

Unexpected peaks may indicate contaminating DNA from other organisms, adapter sequences, or PCR duplicates that deviate from expected genomic coverage patterns.

Repetitive Element Analysis

High-depth peaks in the histogram indicate repetitive elements, allowing quantitative assessment of genome complexity and repeat content distribution.

Performance

Memory Requirements

Memory usage depends on the hash table size, which is automatically calculated based on available system memory. KHist uses approximately 73% of available RAM for the hash table, with the remainder reserved for Java overhead and histogram arrays.

Memory Efficiency Strategy

The Count-Min Sketch approach provides several advantages:

Speed Optimization

KHist is optimized for speed through:

Typical Performance

Processing rates vary by system but typically achieve:

Notes

Memory and Accuracy

KHist requires substantial memory for large datasets. For whole genome sequencing data, allocate at least 16-32 GB RAM. Monitor the hash table load factor in the output:

  • Under 60%: Excellent accuracy
  • 60-90%: Good for most applications
  • Over 90%: Warning issued - consider increasing memory or using quality filters

Limitations

KHist cannot dump individual kmers and their counts because the Count-Min Sketch stores only counts, not the kmers themselves. For exact kmer enumeration, use KmerCountExact instead, which explicitly tracks both kmers and their exact counts but requires more memory.

Parameter Selection Guidelines

For most applications, the default parameters work well. Consider adjusting:

  • k: Smaller values (21-25) for highly repetitive genomes or low-coverage data
  • minq: Higher values (10-15) for noisy sequencing data
  • prefilter: Always enable for maximum accuracy when memory is limited
  • histlen: Increase for data with very high coverage or amplification

Input Compatibility: KHist works with FASTQ and FASTA files (gzipped or uncompressed). For multiple samples, process each separately to maintain genome size estimation accuracy, or concatenate input files for combined analysis.

When NOT to Use KHist

KHist and normalization tools are not appropriate for:

  • Quantitative Analysis: ChIP-seq, RNA-seq expression profiling, or other applications requiring precise read counts
  • Variant Discovery: Mapping for variant calling where coverage bias could affect results
  • High-Error Platforms: PacBio or Nanopore data designed for short, low-error reads
  • Low Coverage Data: Cannot inflate coverage below assembly thresholds

Related Tools