KCompress

Script: kcompress.sh Package: assemble Class: KmerCompressor.java

Compresses sequence data into a fasta file containing each kmer exactly once. Allows arbitrary kmer set operations via multiple passes.

Basic Usage

kcompress.sh in=<reads> out=<contigs> min=<1> max=<2147483647>

KCompress processes sequence data to extract and compress kmers into unique sequences, performing kmer counting and contig assembly through a multi-threaded approach.

Parameters

Parameters are organized by their function in the kmer compression and assembly process.

Input Parameters

in=file: Primary input file for reads to use as kmer data.
in2=file: Second input file for paired data.
reads=-1: Only process this number of reads, then quit (-1 means all).

Output Parameters

out=file: Write contigs (in contig mode).
showstats=t: Print assembly statistics after writing contigs.
fuse=0: Fuse output sequences into chunks at least this long, padded with 1 N between sequences.

Prefiltering Parameters

prefilter=0: If set to a positive integer, use a countmin sketch to ignore kmers with depth of that value or lower.
prehashes=2: Number of hashes for prefilter.
prefiltersize=0.2: (pff) Fraction of memory to use for prefilter.
minprobprefilter=t: (mpp) Use minprob for the prefilter.
prepasses=1: Use this many prefiltering passes; higher be more thorough if the filter is very full. Set to 'auto' to iteratively prefilter until the remaining kmers will fit in memory.

Hashing Parameters

k=31: Kmer length (1 to 31).
prealloc=t: Pre-allocate memory rather than dynamically growing; faster and more memory-efficient. A float fraction (0-1) may be specified; default is 1.
minprob=0.5: Ignore kmers with overall probability of correctness below this.
minprobmain=t: (mpm) Use minprob for the primary kmer counts.
threads=X: Spawn X threads (default is number of logical processors).

Assembly Parameters

mincount=1: (min) Only retain kmers that occur at least this many times.
maxcount=BIG: (max) Only retain kmers that occur at most this many times.
requiresamecount: (rsc) Only build contigs from kmers with exactly the same count.
rcomp=t: Store forward and reverse kmers together. Setting this to false will only use forward kmers.

Java Parameters

-Xmx: This will set Java's memory usage, overriding autodetection. -Xmx20g will specify 20 gigs of RAM, and -Xmx200m will specify 200 megs. The max is typically 85% of physical memory.
-eoom: This flag will cause the process to exit if an out-of-memory exception occurs. Requires Java 8u92+.
-da: Disable assertions.

Examples

Basic Kmer Compression

kcompress.sh in=reads.fq out=compressed.fa

Compress all kmers from reads.fq into unique sequences written to compressed.fa.

Filtered Kmer Compression

kcompress.sh in=reads.fq out=compressed.fa mincount=2 maxcount=1000

Only retain kmers that occur between 2 and 1000 times, filtering out singletons and highly repetitive sequences.

Memory-Efficient Processing with Prefilter

kcompress.sh in=reads.fq out=compressed.fa prefilter=1 prefiltersize=0.3 k=25

Use a countmin sketch prefilter to ignore singleton kmers, using 30% of memory for the prefilter and k=25.

Fused Output

kcompress.sh in=reads.fq out=fused.fa fuse=10000

Fuse compressed sequences into chunks of at least 10kb, separated by single N characters.

Same-Count Assembly

kcompress.sh in=reads.fq out=contigs.fa requiresamecount mincount=5

Build contigs only from kmers with exactly the same count, starting with kmers that occur at least 5 times.

Algorithm Details

Kmer Compression Strategy

KCompress implements kmer-based sequence compression using KmerTableSet hash tables with atomic work distribution across multiple BuildThread workers:

Hash Table-Based Storage: Uses specialized hash tables (HashArray1D) with overflow handling via HashForest data structures for efficient kmer storage and retrieval.
Memory-Efficient Processing: Supports prefiltering with countmin sketches to reduce memory usage by ignoring low-frequency kmers before main processing.
Dual Kmer Representation: Maintains both forward and reverse-complement representations when rcomp=true, allowing bidirectional contig extension.
Thread-Safe Operations: Implements atomic ownership claiming for kmers to enable parallel contig building without race conditions.

Contig Building Process

The contig assembly algorithm uses a greedy extension approach:

Seed Selection: Identifies unclaimed kmers within the specified count range (mincount to maxcount) as starting points.
Ownership Claiming: Uses atomic operations to claim exclusive ownership of kmers during extension to prevent conflicts between threads.
Bidirectional Extension: Extends contigs in both directions using the extendToRight method, then reverse-complements and extends again.
Count Consistency: When requiresamecount=true, only extends through kmers with identical occurrence counts.
Termination Conditions: Stops extension at junctions (multiple possible next kmers), dead ends (no valid next kmer), or when maximum length is reached.

Memory Management

KCompress optimizes memory usage through several strategies:

Pre-allocation: When prealloc=true, allocates hash table space upfront for better performance and memory efficiency.
Prefiltering: Optional countmin sketch prefilter removes low-frequency kmers before main processing, reducing memory requirements.
Iterative Processing: Supports auto mode for prepasses, iteratively filtering until remaining kmers fit in available memory.
Thread-Local Storage: Uses thread-local data structures to minimize memory allocation overhead during parallel processing.

Quality Control

The algorithm incorporates quality-aware processing:

Probability Filtering: minprob parameter filters kmers based on overall correctness probability calculated from quality scores.
Count-Based Filtering: mincount and maxcount parameters remove kmers likely to represent errors (too rare) or repetitive elements (too common).
Consistent Assembly: requiresamecount option ensures contigs are built from kmers with uniform coverage, improving assembly quality.

Performance Characteristics

Scalability: Multi-threaded design scales with available CPU cores for both kmer counting and contig building phases.
Memory Efficiency: Typical memory usage is ~12 bytes per unique kmer plus overhead for hash table structure.
I/O Optimization: Uses buffered I/O and supports compressed input files for efficient data processing.
Adaptive Processing: Auto mode for prefiltering adapts to available memory, making the tool usable on systems with varying memory constraints.

Support

For questions and support:

Email: bbushnell@lbl.gov
Documentation: bbmap.org