KCompress

Script: kcompress.sh
Package: assemble
Class: KmerCompressor.java

Compresses sequence data into a fasta file containing each kmer exactly once. Allows arbitrary kmer set operations via multiple passes.

Basic Usage

kcompress.sh in=<reads> out=<contigs> min=<1> max=<2147483647>

KCompress processes sequence data to extract and compress kmers into unique sequences, performing kmer counting and contig assembly through a multi-threaded approach.

Parameters

Parameters are organized by their function in the kmer compression and assembly process.

Input Parameters

in=file
Primary input file for reads to use as kmer data.
in2=file
Second input file for paired data.
reads=-1
Only process this number of reads, then quit (-1 means all).

Output Parameters

out=file
Write contigs (in contig mode).
showstats=t
Print assembly statistics after writing contigs.
fuse=0
Fuse output sequences into chunks at least this long, padded with 1 N between sequences.

Prefiltering Parameters

prefilter=0
If set to a positive integer, use a countmin sketch to ignore kmers with depth of that value or lower.
prehashes=2
Number of hashes for prefilter.
prefiltersize=0.2
(pff) Fraction of memory to use for prefilter.
minprobprefilter=t
(mpp) Use minprob for the prefilter.
prepasses=1
Use this many prefiltering passes; higher values are more thorough if the filter is very full. Set to 'auto' to iteratively prefilter until the remaining kmers will fit in memory.
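
To illustrate what the prefilter parameters control, here is a minimal single-pass sketch of a count-min sketch used to drop low-depth kmers. It is not taken from KmerCompressor.java: the class CountMinSketch, the function prefiltered_kmers, the table width, and the blake2b hashing are all assumptions made for illustration. The `hashes` argument plays the role of prehashes, and `prefilter` is the depth threshold.

```python
import hashlib


class CountMinSketch:
    """Approximate counter: estimates can be too high, never too low."""

    def __init__(self, width=1 << 16, hashes=2, cap=255):
        self.width = width      # cells per row (hypothetical size)
        self.hashes = hashes    # number of hash functions, like prehashes
        self.cap = cap          # saturating counter limit
        self.rows = [[0] * width for _ in range(hashes)]

    def _cells(self, kmer):
        # One independent hash per row, derived by salting blake2b.
        for i in range(self.hashes):
            digest = hashlib.blake2b(
                kmer.encode(), digest_size=8, salt=bytes([i])
            ).digest()
            yield i, int.from_bytes(digest, "big") % self.width

    def add(self, kmer):
        for row, col in self._cells(kmer):
            if self.rows[row][col] < self.cap:
                self.rows[row][col] += 1

    def estimate(self, kmer):
        # The minimum over rows is the least-inflated count.
        return min(self.rows[row][col] for row, col in self._cells(kmer))


def kmers(seq, k):
    return (seq[i:i + k] for i in range(len(seq) - k + 1))


def prefiltered_kmers(reads, k, prefilter, hashes=2):
    """Keep kmers whose estimated depth is above `prefilter`."""
    sketch = CountMinSketch(hashes=hashes)
    for read in reads:                      # pass 1: fill the sketch
        for km in kmers(read, k):
            sketch.add(km)
    kept = set()
    for read in reads:                      # pass 2: keep deep kmers only
        for km in kmers(read, k):
            if sketch.estimate(km) > prefilter:
                kept.add(km)
    return kept
```

Because the sketch only overestimates, a kmer whose true depth is above the threshold is never discarded; a few low-depth kmers may slip through when cells collide, which is why a fuller filter benefits from more hashes or passes.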

Hashing Parameters

k=31
Kmer length (1 to 31).
prealloc=t
Pre-allocate memory rather than dynamically growing; faster and more memory-efficient. A float fraction (0-1) may be specified; default is 1.
minprob=0.5
Ignore kmers with overall probability of correctness below this.
minprobmain=t
(mpm) Use minprob for the primary kmer counts.
threads=X
Spawn X threads (default is number of logical processors).
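
As a rough illustration of what minprob measures, the sketch below estimates a kmer's probability of correctness as the product of per-base probabilities derived from Phred quality scores. This formula is a common convention and is assumed here rather than confirmed from the KCompress source; the function names base_correct_prob, kmer_probs, and passing_kmers are hypothetical.

```python
import math


def base_correct_prob(q):
    """Probability that a base with Phred score q was called correctly."""
    return 1.0 - 10.0 ** (-q / 10.0)


def kmer_probs(quals, k):
    """Yield (start, probability) for every kmer window in a read."""
    probs = [base_correct_prob(q) for q in quals]
    for i in range(len(probs) - k + 1):
        # Assumed model: base errors are independent, so multiply.
        yield i, math.prod(probs[i:i + k])


def passing_kmers(seq, quals, k, minprob=0.5):
    """Return kmers whose probability of correctness is at least minprob."""
    return [seq[i:i + k] for i, p in kmer_probs(quals, k) if p >= minprob]
```

For example, with seq="ACGTA", quals=[40, 40, 40, 2, 40], and k=3, only the first window avoids the Q2 base, so only "ACG" passes at minprob=0.5.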

Assembly Parameters

mincount=1
(min) Only retain kmers that occur at least this many times.
maxcount=BIG
(max) Only retain kmers that occur at most this many times.
requiresamecount
(rsc) Only build contigs from kmers with exactly the same count.
rcomp=t
Store forward and reverse kmers together. Setting this to false will only use forward kmers.
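
The interaction of rcomp, mincount, and maxcount can be sketched as follows. This is an illustrative model, not the tool's implementation: the helper names are hypothetical, and choosing the lexicographically smaller of a kmer and its reverse complement as the shared key is an assumption (the actual tie-breaking rule in KCompress may differ).

```python
from collections import Counter

_COMP = str.maketrans("ACGT", "TGCA")


def revcomp(seq):
    """Reverse complement of a DNA string."""
    return seq.translate(_COMP)[::-1]


def canonical(kmer):
    """Single key shared by a kmer and its reverse complement (assumed rule)."""
    rc = revcomp(kmer)
    return kmer if kmer <= rc else rc


def count_kmers(reads, k, rcomp=True):
    """Count kmers; with rcomp=True, both strands count toward one key."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            km = read[i:i + k]
            counts[canonical(km) if rcomp else km] += 1
    return counts


def retain(counts, mincount=1, maxcount=2**31 - 1):
    """Keep kmers whose count lies in [mincount, maxcount]."""
    return {km for km, c in counts.items() if mincount <= c <= maxcount}
```

With reads "ACGG" and "CCGT" (reverse complements of each other) and k=4, rcomp=True yields one kmer with count 2, while rcomp=False yields two kmers with count 1 each, so mincount=2 retains a kmer only in the first case.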

Java Parameters

-Xmx
This will set Java's memory usage, overriding autodetection. -Xmx20g will specify 20 gigs of RAM, and -Xmx200m will specify 200 megs. The max is typically 85% of physical memory.
-eoom
This flag will cause the process to exit if an out-of-memory exception occurs. Requires Java 8u92+.
-da
Disable assertions.

Examples

Basic Kmer Compression

kcompress.sh in=reads.fq out=compressed.fa

Compress all kmers from reads.fq into unique sequences written to compressed.fa.

Filtered Kmer Compression

kcompress.sh in=reads.fq out=compressed.fa mincount=2 maxcount=1000

Only retain kmers that occur between 2 and 1000 times, filtering out singletons and highly repetitive sequences.

Memory-Efficient Processing with Prefilter

kcompress.sh in=reads.fq out=compressed.fa prefilter=1 prefiltersize=0.3 k=25

Use a countmin sketch prefilter to ignore singleton kmers, using 30% of memory for the prefilter and k=25.

Fused Output

kcompress.sh in=reads.fq out=fused.fa fuse=10000

Fuse compressed sequences into chunks of at least 10kb, separated by single N characters.
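
The effect of fuse can be modeled with a short sketch. The function name fuse_sequences is hypothetical, and emitting a final chunk that is shorter than the requested length is an assumption about how leftover sequences are handled.

```python
def fuse_sequences(seqs, min_len):
    """Join sequences with single N separators into chunks >= min_len."""
    chunks = []
    parts = []
    length = 0
    for s in seqs:
        if parts:
            length += 1              # account for the N separator
        parts.append(s)
        length += len(s)
        if length >= min_len:        # chunk is long enough, emit it
            chunks.append("N".join(parts))
            parts, length = [], 0
    if parts:
        # Assumption: leftovers are written as a final, shorter chunk.
        chunks.append("N".join(parts))
    return chunks
```

For instance, fusing ["AAAA", "CC", "GGG", "T"] with min_len=6 gives ["AAAANCC", "GGGNT"]. The single N prevents kmers from spanning two unrelated sequences when the fused file is later processed.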

Same-Count Assembly

kcompress.sh in=reads.fq out=contigs.fa requiresamecount mincount=5

Build contigs only from kmers with exactly the same count, starting with kmers that occur at least 5 times.

Algorithm Details

Kmer Compression Strategy

KCompress implements kmer-based sequence compression using KmerTableSet hash tables with atomic work distribution across multiple BuildThread workers.

Contig Building Process

The contig assembly algorithm uses a greedy extension approach:

  1. Seed Selection: Identifies unclaimed kmers within the specified count range (mincount to maxcount) as starting points.
  2. Ownership Claiming: Uses atomic operations to claim exclusive ownership of kmers during extension to prevent conflicts between threads.
  3. Bidirectional Extension: Extends the contig rightward using the extendToRight method, then reverse-complements it and extends rightward again, so that both ends are covered.
  4. Count Consistency: When requiresamecount=true, only extends through kmers with identical occurrence counts.
  5. Termination Conditions: Stops extension at junctions (multiple possible next kmers), dead ends (no valid next kmer), or when maximum length is reached.
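
The five steps above can be sketched in a simplified, single-threaded form. This is an assumption-laden illustration and not the code in KmerCompressor.java: a plain set stands in for atomic ownership claims, `counts` is assumed to already contain only kmers within the mincount to maxcount range, canonical keys use the lexicographically smaller strand, k is assumed to be at least 2, and the names extend_right and build_contigs are hypothetical.

```python
_COMP = str.maketrans("ACGT", "TGCA")


def revcomp(seq):
    return seq.translate(_COMP)[::-1]


def canonical(kmer):
    rc = revcomp(kmer)
    return kmer if kmer <= rc else rc


def extend_right(contig, counts, claimed, k, same_count=None, max_len=10**6):
    """Append bases while exactly one unclaimed next kmer exists."""
    while len(contig) < max_len:
        suffix = contig[-(k - 1):]
        candidates = [suffix + b for b in "ACGT"
                      if canonical(suffix + b) in counts]
        if len(candidates) != 1:
            break                    # dead end (0) or junction (>1)
        key = canonical(candidates[0])
        if key in claimed:
            break                    # already owned by some contig
        if same_count is not None and counts[key] != same_count:
            break                    # requiresamecount-style check
        claimed.add(key)             # claim ownership, then extend
        contig += candidates[0][-1]
    return contig


def build_contigs(counts, k, require_same_count=False):
    """Greedy assembly: seed, extend right, reverse-complement, extend again."""
    claimed = set()
    contigs = []
    for seed in sorted(counts):      # deterministic seed order
        if seed in claimed:
            continue
        claimed.add(seed)
        target = counts[seed] if require_same_count else None
        contig = extend_right(seed, counts, claimed, k, target)
        contig = extend_right(revcomp(contig), counts, claimed, k, target)
        contigs.append(contig)
    return contigs
```

Since each extension claims one previously unclaimed kmer and every unclaimed kmer eventually becomes a seed, each kmer in `counts` ends up in exactly one contig in this sketch, which mirrors the tool's stated goal of emitting every kmer once. In a multi-threaded setting the `claimed.add` step would need to be an atomic test-and-set.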

Memory Management

KCompress optimizes memory usage through several strategies:

Quality Control

The algorithm incorporates quality-aware processing:

Performance Characteristics

Support

For questions and support: