KHist

Basic Usage

khist.sh in=<input> hist=<histogram output>

KHist generates kmer depth histograms from sequencing data or assemblies to analyze coverage distribution, estimate genome size, and assess data quality. It is a specialized wrapper around BBNorm (KmerNormalize) that uses predefined parameters optimized for histogram generation rather than read normalization.

Relationship to BBNorm

KHist is one of three shell scripts (bbnorm.sh, ecc.sh, khist.sh) that call KmerNormalize with different default parameters. For histogram generation, KHist uses these fixed parameters:

bits=32: Uses 32-bit cells for maximum count range (up to 4.3 billion)
ecc=f: Disables error correction (not needed for histograms)
passes=1: Single pass processing for speed
keepall=true: Retains all reads (no normalization)
dr=f: Disables deterministic random numbers for faster processing
prefilter=true: Uses prefilter for better memory efficiency
hist=stdout: Outputs histogram to standard output
minprob=0: Minimum probability threshold set to 0
minqual=0: Minimum quality threshold set to 0
mindepth=0: Minimum depth set to 0
minkmers=1: Minimum kmers per read set to 1
hashes=3: Uses 3 hash functions for collision resistance

All other BBNorm parameters remain available and can be overridden on the command line. This equivalent command produces the same result as the basic usage above:

bbnorm.sh in=reads.fq khist=khist.txt peaks=peaks.txt passes=1 prefilter minprob=0 minqual=0 mindepth=0

Parameters

KHist accepts all BBNorm parameters. The most commonly used parameters for histogram generation are listed below. For the complete parameter list, see bbnorm.sh documentation.

Input/Output Parameters

in=null: Primary input file. Use in2 for paired reads in a second file.
in2=null: Second input file for paired reads in two files.
hist=stdout: Output file for the kmer depth histogram. Default is stdout.
peaks=null: Output file for peak analysis and genome size estimation. Contains peak locations, genome size estimates, and ploidy predictions.
extra=null: Additional files to use for input (generating hash table) but not for output.
tablereads=-1: Use at most this many reads when building the hashtable (-1 means all).
reads=-1: Only process this number of reads, then quit (-1 means all).

Hashing Parameters

k=31: Kmer length (values under 32 provide best performance, but arbitrarily high values are supported).
bits=32: Bits per cell in bloom filter; must be 2, 4, 8, 16, or 32. Maximum kmer depth recorded is 2^bits-1. Fixed at 32 for KHist to capture high-depth kmers.
hashes=3: Number of times each kmer is hashed and stored. Higher is slower but more accurate if there is enough memory.
cells=-1: Number of cells in bloom filter. Auto-calculated based on available memory if not specified.
prefilter=true: Store low-count kmers in smaller cells (2-bit) and high-count kmers in larger cells (32-bit) for memory efficiency. Enabled by default for KHist.
prefilterbits=2: Bits per cell in prefilter bloom filter. Default 2-bit allows counts up to 3.
minq=6: Ignore kmers containing bases with quality below this.
minprob=0: Ignore kmers with overall probability of correctness below this. Set to 0 for KHist to include all kmers.
canonical=true: Use canonical kmers (count forward and reverse-complement kmers together).

Histogram Parameters

histlen=1048576: Maximum kmer depth displayed in histogram. Also affects statistics displayed.
histcol=3: (histogramcolumns) Number of histogram columns, 2 or 3.
pzc=f: (printzerocoverage) Print lines in the histogram with zero coverage.
zerobin=f: Set to true if you want kmers with a count of 0 to go in the 0 bin instead of the 1 bin.
minheight=1: Minimum peak height for genome size estimation.
minvolume=5: Minimum peak volume for genome size estimation.
minwidth=3: Minimum peak width for genome size estimation.
minpeak=2: Minimum depth for peak detection.
maxpeak=100000: Maximum depth for peak detection.
ploidy=2: Expected ploidy for genome size estimation (1-4 supported).

Performance Parameters

threads=auto: Spawn exactly X hashing threads (default is number of logical processors). Total active threads may exceed X due to I/O threads.
-Xmx: Set Java's memory usage, overriding autodetection. Example: -Xmx20g for 20 GB RAM.
deterministic=t: Generate random numbers deterministically to ensure identical output between runs. May decrease speed with many threads.

Quality and Filtering Parameters

qin=auto: ASCII offset for input quality. May be 33 (Sanger), 64 (Illumina), or auto.
rdk=t: (removeduplicatekmers) When true, a kmer's count will only be incremented once per read pair.
kmersample=1: Process every nth kmer, and skip the rest.
readsample=1: Process every nth read, and skip the rest.

Algorithm Details

Count-Min Sketch Data Structure

KHist uses a Count-Min Sketch (CMS), also called a "counting Bloom filter," as its core data structure. This probabilistic hash table stores only values (counts), not keys (kmers), and handles collisions through multiple independent hash functions. The key advantages:

Memory Bounded: Never runs out of memory regardless of dataset size
Graceful Degradation: As unique kmer count increases, accuracy gradually declines rather than failing
Collision Resistant: Uses multiple hash functions to minimize false positives
Thread Safe: Atomic counters enable lock-free parallel processing

1. Hash Table Construction

Creates Count-Min Sketch with configurable parameters:

Cell Size: 1, 2, 4, 8, 16, or 32 bits per cell (KHist uses 32)
Hash Functions: Default 3 independent hash functions for collision avoidance
Capacity: 1GB RAM accommodates 4 billion 2-bit cells or 500 million 16-bit cells

For 32-bit cells with expected coverage of 200x, KHist can handle counts up to 4.3 billion per kmer.

2. Memory Management and Load Factor

KHist calculates usable memory with the formula:

usable_memory = max(((memory-96MB)*0.73), memory*0.45)

Cell allocation considers histogram overhead:

cells = (usable_memory * 8) / bits_per_cell

Load Factor Warnings

KHist monitors hash table load factor and issues warnings:

Under 60%: Maximum accuracy with minimal hash collisions
60-90%: Good accuracy for normalization, reduced histogram precision
Over 90%: Warning issued - use prefilter, increase memory, or filter low-quality reads

3. Prefilter Strategy

KHist enables prefilter by default for memory efficiency. The two-tier system:

Primary Filter: 32-bit cells for high-count kmers (accurate counts)
Prefilter: 2-bit cells for low-count kmers (counts up to 3)

This hybrid approach maximizes memory utilization while maintaining accuracy for the most important high-depth kmers.

4. Kmer Processing

Processes each read by:

Extracting all k-length subsequences
Converting to canonical form (lexicographically smaller of forward/reverse-complement)
Applying quality and probability filters
Hashing with multiple independent hash functions
Incrementing counts in hash table using atomic operations
Using minimum count across hash functions to handle collisions

5. Histogram Generation and Peak Analysis

After processing all reads, generates histogram and identifies peaks for genome analysis:

Histogram: Distribution of kmer depths across the dataset
Peak Detection: Identifies coverage peaks corresponding to single-copy, repetitive, and error kmers
Genome Size Estimation: Uses peak analysis to estimate haploid genome size
Ploidy Prediction: Analyzes peak ratios to infer genome ploidy (1-4x supported)

Examples

Basic Histogram Generation

khist.sh in=reads.fastq hist=kmer_histogram.txt

Generates a kmer count histogram from sequencing reads and saves it to a file.

Genome Size Estimation

khist.sh in=reads.fq khist=khist.txt peaks=peaks.txt

Generates both histogram and peaks file for genome size estimation. The peaks file contains peak locations, genome size estimates, and ploidy predictions based on coverage analysis.

Paired-End Reads

khist.sh in1=reads_R1.fastq in2=reads_R2.fastq hist=kmer_histogram.txt peaks=peaks.txt

Processes paired-end reads to generate a combined kmer histogram with genome size estimation.

High-Memory Analysis with Custom Parameters

khist.sh in=reads.fastq hist=histogram.txt -Xmx64g histlen=10000000 prefilter

Allocates 64GB memory, extends histogram length to capture very high-depth kmers, and explicitly enables prefilter for maximum accuracy.

Quality-Filtered Analysis

khist.sh in=reads.fastq hist=histogram.txt minq=10 minprob=0.8 k=21

Uses quality filtering to exclude unreliable kmers and shorter kmers (21-mers) for highly repetitive genomes or low-coverage data.

Multiple Input Files

khist.sh in=reads1.fastq,reads2.fastq,reads3.fastq hist=combined_histogram.txt

Generates a combined histogram from multiple sequencing files.

Output Format

Histogram File

KHist produces a tab-delimited histogram with the following columns:

#Depth	Count	Fraction
1	15234567	0.45123
2	8901234	0.26234
3	5432109	0.16345
...

Column Descriptions:

Depth: Kmer count (how many times the kmer appears)
Count: Number of unique kmers with this depth
Fraction: Proportion of total unique kmers at this depth

Peaks File

When peaks=file is specified, KHist generates a peaks analysis file containing:

Peak Locations: Coverage depths corresponding to genomic peaks
Genome Size Estimates: Calculated haploid genome size based on peak analysis
Ploidy Predictions: Estimated genome ploidy from peak ratios
Quality Metrics: Peak confidence and coverage statistics

Note: Peak analysis is only accurate for randomly-sheared isolate genomic DNA with minimal contamination and ploidy ≤4.

Statistics Summary

Before the histogram, KHist outputs summary statistics including:

Hash table configuration (cells, bits, hashes, load factor)
Memory usage and capacity warnings
Estimated number of unique kmers
Processing time and throughput
Load factor warnings if table becomes too full

Applications

Genome Size Estimation

Primary use case for determining haploid genome size from sequencing data. KHist identifies the coverage peak corresponding to single-copy genomic sequences and calculates total genome size using the formula: genome_size = total_bases / peak_coverage.

Coverage Distribution Analysis

Reveals sequencing depth uniformity and identifies regions of unusual coverage that may indicate repetitive elements, contamination, or technical artifacts.

Assembly Planning

Kmer depth distributions help choose appropriate parameters for genome assemblers by revealing the expected coverage, error rates, and repetitive content structure.

Data Quality Assessment

The shape of low-depth regions indicates sequencing errors, adapter contamination, or other data quality issues. High error rates appear as elevated counts at depth 1-2.

Contamination Detection

Unexpected peaks may indicate contaminating DNA from other organisms, adapter sequences, or PCR duplicates that deviate from expected genomic coverage patterns.

Repetitive Element Analysis

High-depth peaks in the histogram indicate repetitive elements, allowing quantitative assessment of genome complexity and repeat content distribution.

Performance

Memory Requirements

Memory usage depends on the hash table size, which is automatically calculated based on available system memory. KHist uses approximately 73% of available RAM for the hash table, with the remainder reserved for Java overhead and histogram arrays.

Memory Efficiency Strategy

The Count-Min Sketch approach provides several advantages:

Bounded Memory: Never exceeds allocated memory regardless of input size
Graceful Degradation: Accuracy decreases gradually as load factor increases
Prefilter Optimization: Two-tier system maximizes accuracy for high-count kmers
Collision Management: Multiple hash functions minimize false positives

Speed Optimization

KHist is optimized for speed through:

Single-pass processing (no iterative refinement)
Lock-free atomic counter operations for thread safety
Automatic thread scaling based on CPU cores
Canonical kmer processing
Disabled error correction (not needed for histograms)

Typical Performance

Processing rates vary by system but typically achieve:

100-500 MB/sec on modern multi-core systems
Linear scaling with thread count up to memory bandwidth limits
Memory-bound rather than CPU-bound for most datasets
Constant memory usage regardless of input dataset size

Notes

Memory and Accuracy

KHist requires substantial memory for large datasets. For whole genome sequencing data, allocate at least 16-32 GB RAM. Monitor the hash table load factor in the output:

Under 60%: Excellent accuracy
60-90%: Good for most applications
Over 90%: Warning issued - consider increasing memory or using quality filters

Limitations

KHist cannot dump individual kmers and their counts because the Count-Min Sketch stores only counts, not the kmers themselves. For exact kmer enumeration, use KmerCountExact instead, which explicitly tracks both kmers and their exact counts but requires more memory.

Parameter Selection Guidelines

For most applications, the default parameters work well. Consider adjusting:

k: Smaller values (21-25) for highly repetitive genomes or low-coverage data
minq: Higher values (10-15) for noisy sequencing data
prefilter: Always enable for maximum accuracy when memory is limited
histlen: Increase for data with very high coverage or amplification

Input Compatibility: KHist works with FASTQ and FASTA files (gzipped or uncompressed). For multiple samples, process each separately to maintain genome size estimation accuracy, or concatenate input files for combined analysis.

When NOT to Use KHist

KHist and normalization tools are not appropriate for:

Quantitative Analysis: ChIP-seq, RNA-seq expression profiling, or other applications requiring precise read counts
Variant Discovery: Mapping for variant calling where coverage bias could affect results
High-Error Platforms: PacBio or Nanopore data designed for short, low-error reads
Low Coverage Data: Cannot inflate coverage below assembly thresholds

Related Tools

bbnorm.sh: Full normalization pipeline with integrated histogram generation
kmercountexact.sh: Exact kmer counting without probabilistic approximation
loglog.sh: Memory-bounded cardinality estimation for very large datasets
bbcountunique.sh: Count unique kmers without generating full histograms
ecc.sh: Error correction variant of KmerNormalize