KCompress
Compresses sequence data into a fasta file containing each kmer exactly once. Allows arbitrary kmer set operations via multiple passes.
Basic Usage
kcompress.sh in=<reads> out=<contigs> min=<1> max=<2147483647>
KCompress processes sequence data to extract and compress kmers into unique sequences, performing kmer counting and contig assembly through a multi-threaded approach.
Parameters
Parameters are organized by their function in the kmer compression and assembly process.
Input Parameters
- in=file
- Primary input file for reads to use as kmer data.
- in2=file
- Second input file for paired data.
- reads=-1
- Only process this number of reads, then quit (-1 means all).
Output Parameters
- out=file
- Write contigs (in contig mode).
- showstats=t
- Print assembly statistics after writing contigs.
- fuse=0
- Fuse output sequences into chunks at least this long, padded with 1 N between sequences.
Prefiltering Parameters
- prefilter=0
- If set to a positive integer, use a countmin sketch to ignore kmers with depth of that value or lower.
- prehashes=2
- Number of hashes for prefilter.
- prefiltersize=0.2
- (pff) Fraction of memory to use for prefilter.
- minprobprefilter=t
- (mpp) Use minprob for the prefilter.
- prepasses=1
- Use this many prefiltering passes; higher be more thorough if the filter is very full. Set to 'auto' to iteratively prefilter until the remaining kmers will fit in memory.
Hashing Parameters
- k=31
- Kmer length (1 to 31).
- prealloc=t
- Pre-allocate memory rather than dynamically growing; faster and more memory-efficient. A float fraction (0-1) may be specified; default is 1.
- minprob=0.5
- Ignore kmers with overall probability of correctness below this.
- minprobmain=t
- (mpm) Use minprob for the primary kmer counts.
- threads=X
- Spawn X threads (default is number of logical processors).
Assembly Parameters
- mincount=1
- (min) Only retain kmers that occur at least this many times.
- maxcount=BIG
- (max) Only retain kmers that occur at most this many times.
- requiresamecount
- (rsc) Only build contigs from kmers with exactly the same count.
- rcomp=t
- Store forward and reverse kmers together. Setting this to false will only use forward kmers.
Java Parameters
- -Xmx
- This will set Java's memory usage, overriding autodetection. -Xmx20g will specify 20 gigs of RAM, and -Xmx200m will specify 200 megs. The max is typically 85% of physical memory.
- -eoom
- This flag will cause the process to exit if an out-of-memory exception occurs. Requires Java 8u92+.
- -da
- Disable assertions.
Examples
Basic Kmer Compression
kcompress.sh in=reads.fq out=compressed.fa
Compress all kmers from reads.fq into unique sequences written to compressed.fa.
Filtered Kmer Compression
kcompress.sh in=reads.fq out=compressed.fa mincount=2 maxcount=1000
Only retain kmers that occur between 2 and 1000 times, filtering out singletons and highly repetitive sequences.
Memory-Efficient Processing with Prefilter
kcompress.sh in=reads.fq out=compressed.fa prefilter=1 prefiltersize=0.3 k=25
Use a countmin sketch prefilter to ignore singleton kmers, using 30% of memory for the prefilter and k=25.
Fused Output
kcompress.sh in=reads.fq out=fused.fa fuse=10000
Fuse compressed sequences into chunks of at least 10kb, separated by single N characters.
Same-Count Assembly
kcompress.sh in=reads.fq out=contigs.fa requiresamecount mincount=5
Build contigs only from kmers with exactly the same count, starting with kmers that occur at least 5 times.
Algorithm Details
Kmer Compression Strategy
KCompress implements kmer-based sequence compression using KmerTableSet hash tables with atomic work distribution across multiple BuildThread workers:
- Hash Table-Based Storage: Uses specialized hash tables (HashArray1D) with overflow handling via HashForest data structures for efficient kmer storage and retrieval.
- Memory-Efficient Processing: Supports prefiltering with countmin sketches to reduce memory usage by ignoring low-frequency kmers before main processing.
- Dual Kmer Representation: Maintains both forward and reverse-complement representations when rcomp=true, allowing bidirectional contig extension.
- Thread-Safe Operations: Implements atomic ownership claiming for kmers to enable parallel contig building without race conditions.
Contig Building Process
The contig assembly algorithm uses a greedy extension approach:
- Seed Selection: Identifies unclaimed kmers within the specified count range (mincount to maxcount) as starting points.
- Ownership Claiming: Uses atomic operations to claim exclusive ownership of kmers during extension to prevent conflicts between threads.
- Bidirectional Extension: Extends contigs in both directions using the extendToRight method, then reverse-complements and extends again.
- Count Consistency: When requiresamecount=true, only extends through kmers with identical occurrence counts.
- Termination Conditions: Stops extension at junctions (multiple possible next kmers), dead ends (no valid next kmer), or when maximum length is reached.
Memory Management
KCompress optimizes memory usage through several strategies:
- Pre-allocation: When prealloc=true, allocates hash table space upfront for better performance and memory efficiency.
- Prefiltering: Optional countmin sketch prefilter removes low-frequency kmers before main processing, reducing memory requirements.
- Iterative Processing: Supports auto mode for prepasses, iteratively filtering until remaining kmers fit in available memory.
- Thread-Local Storage: Uses thread-local data structures to minimize memory allocation overhead during parallel processing.
Quality Control
The algorithm incorporates quality-aware processing:
- Probability Filtering: minprob parameter filters kmers based on overall correctness probability calculated from quality scores.
- Count-Based Filtering: mincount and maxcount parameters remove kmers likely to represent errors (too rare) or repetitive elements (too common).
- Consistent Assembly: requiresamecount option ensures contigs are built from kmers with uniform coverage, improving assembly quality.
Performance Characteristics
- Scalability: Multi-threaded design scales with available CPU cores for both kmer counting and contig building phases.
- Memory Efficiency: Typical memory usage is ~12 bytes per unique kmer plus overhead for hash table structure.
- I/O Optimization: Uses buffered I/O and supports compressed input files for efficient data processing.
- Adaptive Processing: Auto mode for prefiltering adapts to available memory, making the tool usable on systems with varying memory constraints.
Support
For questions and support:
- Email: bbushnell@lbl.gov
- Documentation: bbmap.org