KmerCountShort

Script: kmercountshort.sh Package: jgi Class: KmerCountShort.java

Counts the occurrences of each kmer in a file and prints a fasta or tsv file containing all kmers and their counts. Supports K=1 to 15, though values above 8 should use KmerCountExact. Options include reverse-complement merging and minimum count filtering.

Basic Usage

kmercountshort.sh in=<file> out=<file> k=4

KmerCountShort processes sequencing data to count k-mer occurrences. Input may be fasta or fastq, compressed or uncompressed. Output may be stdout or a file.

Parameters

Parameters control input/output locations, k-mer counting behavior, and output filtering.

Input Parameters

in=<file>: Primary input file.
in2=<file>: Second input file for paired reads.

Output Parameters

out=<file>: Print kmers and their counts. Extension sensitive; .fa or .fasta will produce fasta, otherwise tsv.
mincount=0: Only print kmers with at least this depth.
reads=-1: Only process this number of reads, then quit (-1 means all).
rcomp=t: Store and count each kmer together with its reverse-complement.
comment=: Denotes start of the tsv header. E.g. 'comment=#'
skip=1: Count every Nth kmer. If skip=2, count every 2nd kmer, etc.

Counting Parameters

k=4: Kmer length; needs at least (threads+1)*8*4^k bytes of memory.

Java Parameters

-Xmx: This will set Java's memory usage, overriding autodetection. -Xmx20g will specify 20 gigs of RAM, and -Xmx200m will specify 200 megs. The max is typically 85% of physical memory.
-eoom: This flag will cause the process to exit if an out-of-memory exception occurs. Requires Java 8u92+.
-da: Disable assertions.

Output Format

TSV Format (default)

Kmer    Count
AAAA    1523
AAAC    842
AAAG    1105
AAAT    978

Tab-delimited format with k-mer sequences in the first column and counts in the second column. Optional comment header can be specified with the comment parameter.

Fasta Format (.fa or .fasta extension)

>1523
AAAA
>842
AAAC
>1105
AAAG
>978
AAAT

Fasta format with counts in the header line and k-mer sequences as the sequence lines.
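
The two layouts above can be sketched in Java as follows. This is an illustration only, not the tool's own writer: the class KmerOutputSketch and its methods decode and format are hypothetical names, the TSV header line is left out, and treating mincount as a simple "count >= mincount" test is an assumption based on the parameter description.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of BBTools: turns a counts array indexed by
// 2-bit-encoded k-mer (A=0, C=1, G=2, T=3) into TSV or fasta lines.
final class KmerOutputSketch {

    // Decode an integer k-mer back to its base string, leftmost base first.
    static String decode(int kmer, int k) {
        final char[] bases = {'A', 'C', 'G', 'T'};
        char[] out = new char[k];
        for (int i = k - 1; i >= 0; i--) {
            out[i] = bases[kmer & 3]; // lowest 2 bits hold the rightmost base
            kmer >>>= 2;
        }
        return new String(out);
    }

    // Emit one record per k-mer whose count is at least mincount (assumed rule).
    static List<String> format(long[] counts, int k, long mincount, boolean fasta) {
        List<String> lines = new ArrayList<>();
        for (int kmer = 0; kmer < counts.length; kmer++) {
            long c = counts[kmer];
            if (c < mincount) { continue; }
            String seq = decode(kmer, k);
            if (fasta) {
                lines.add(">" + c); // count in the header line
                lines.add(seq);     // k-mer as the sequence line
            } else {
                lines.add(seq + "\t" + c);
            }
        }
        return lines;
    }

    public static void main(String[] args) {
        long[] counts = new long[16]; // 4^2 slots for k=2
        counts[0] = 5; // AA
        counts[1] = 2; // AC
        for (String line : format(counts, 2, 1, false)) { System.out.println(line); }
        for (String line : format(counts, 2, 3, true)) { System.out.println(line); }
    }
}
```

Because the array index is the encoded k-mer, iterating from 0 upward yields k-mers in alphabetical order, which matches the ordering in the sample output.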

Examples

Basic K-mer Counting

kmercountshort.sh in=reads.fq out=kmers.txt k=4

Count all 4-mers in a fastq file and output results in TSV format.

Fasta Output with Filtering

kmercountshort.sh in=genome.fa out=kmers.fasta k=6 mincount=10

Count 6-mers and output only those with at least 10 occurrences in fasta format.

Paired-End Reads with Reverse-Complement Merging

kmercountshort.sh in=read1.fq in2=read2.fq out=kmers.txt k=5 rcomp=t

Count 5-mers from paired reads, merging each k-mer with its reverse complement.

Sampling with Skip Parameter

kmercountshort.sh in=large.fq out=sample_kmers.txt k=4 skip=10

Count every 10th k-mer for faster processing of large datasets.

Limited Read Processing

kmercountshort.sh in=reads.fq out=kmers.txt k=4 reads=100000

Process only the first 100,000 reads for quick k-mer profiling.

Custom TSV Header

kmercountshort.sh in=reads.fq out=kmers.txt k=4 comment=#

Add a comment character '#' to the beginning of the TSV header line.

Algorithm Details

K-mer Encoding

KmerCountShort uses efficient bit-shifting to encode k-mers as integers:

Base Encoding: A=0, C=1, G=2, T=3 (2 bits per base)
Bit-Shifting: kmer=((kmer<<2)|x)&mask, which shifts the k-mer left by 2 bits and appends the new base code x
Masking: mask=~((-1)<<bits), where bits=2*k, which keeps only the low 2*k bits (the most recent k bases)
Array Indexing: K-mers directly index into counts array of size 4^k
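
The encoding steps above can be combined into a small rolling counter. The following is a minimal sketch rather than the actual KmerCountShort source: the class name KmerEncodingSketch and the helper baseToNumber are hypothetical, and resetting the window on a non-ACGT base (such as N) is an assumption about how undefined bases are handled.

```java
// Sketch only: class and method names are hypothetical, not BBTools API.
final class KmerEncodingSketch {

    // Map a base to its 2-bit code, or -1 for anything else (e.g. N).
    static int baseToNumber(byte b) {
        switch (b) {
            case 'A': case 'a': return 0;
            case 'C': case 'c': return 1;
            case 'G': case 'g': return 2;
            case 'T': case 't': return 3;
            default: return -1;
        }
    }

    // Count every k-mer of one sequence into counts, which must have length 4^k.
    static void addKmers(byte[] bases, long[] counts, int k) {
        final int bits = 2 * k;
        final int mask = ~((-1) << bits); // low 2*k bits set; works for k <= 15
        int kmer = 0;
        int len = 0; // consecutive valid bases seen so far
        for (byte b : bases) {
            int x = baseToNumber(b);
            if (x < 0) { // assumed behavior: an undefined base restarts the window
                kmer = 0;
                len = 0;
                continue;
            }
            kmer = ((kmer << 2) | x) & mask; // shift in the new base, drop the oldest
            len++;
            if (len >= k) { counts[kmer]++; } // the k-mer itself is the array index
        }
    }

    public static void main(String[] args) {
        int k = 2;
        long[] counts = new long[1 << (2 * k)]; // 4^k slots
        addKmers("ACGTAC".getBytes(), counts, k);
        System.out.println("AC count = " + counts[1]); // prints "AC count = 2"
    }
}
```

With a 32-bit int, the mask expression only works while 2*k is below 32, which is consistent with the documented upper limit of K=15.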

Memory Requirements

Memory usage is determined by the k-mer size and number of threads:

Formula: (threads+1) * 8 * 4^k bytes minimum
Example for k=4 with 8 threads: (8+1) * 8 * 256 = ~18 KB total
Example for k=8 with 8 threads: (8+1) * 8 * 65,536 = ~4.5 MB total
Example for k=15 with 8 threads: (8+1) * 8 * 1,073,741,824 = ~77 GB total
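
The formula can be evaluated directly. The sketch below assumes the factor 8 is the size in bytes of one counter and that the result is the total across all count arrays (one per thread plus one combined array); the class and method names are hypothetical.

```java
// Sketch only: evaluates (threads+1) * 8 * 4^k for a few values of k.
final class KmerMemorySketch {

    // Minimum bytes for the count arrays: (threads+1) arrays of 4^k 8-byte slots.
    static long minBytes(int threads, int k) {
        long slots = 1L << (2 * k); // 4^k
        return (threads + 1L) * 8L * slots;
    }

    public static void main(String[] args) {
        for (int k : new int[] {4, 8, 15}) {
            System.out.println("k=" + k + ": " + minBytes(8, k) + " bytes");
        }
    }
}
```

Each increment of k multiplies the requirement by 4, so with 8 threads the total grows from 18,432 bytes at k=4 to 77,309,411,328 bytes at k=15.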

Note: For k > 8, use KmerCountExact which uses a different data structure optimized for larger k-mer sizes.

Reverse-Complement Merging

When rcomp=t (default), k-mers are merged with their reverse complements:

Canonical K-mers: Each k-mer and its reverse complement are counted together
Implementation: Uses AminoAcid.reverseComplementBinaryFast(kmer, k) for efficient reverse complementation
Output Filtering: Only the lexicographically smaller of a k-mer/reverse-complement pair is printed
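Count Summation: count=(rcomp && kmer!=rkmer) ? a+b : a, where a and b are the counts of the k-mer and of its reverse complement; the two are summed unless the k-mer is its own reverse complement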
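
The merging logic can be illustrated with a plain-loop reverse complement. This is a sketch under stated assumptions: reverseComplement is a hypothetical stand-in for AminoAcid.reverseComplementBinaryFast (whose internals are not shown here), and isPrinted assumes that "lexicographically smaller" is decided by comparing the integer codes, which is equivalent for fixed-length k-mers under the A=0, C=1, G=2, T=3 encoding.

```java
// Sketch only: names are hypothetical, not BBTools API.
final class RcompMergeSketch {

    // Reverse-complement a 2-bit-encoded k-mer. With A=0, C=1, G=2, T=3 the
    // complement of base code x is 3-x.
    static int reverseComplement(int kmer, int k) {
        int rkmer = 0;
        for (int i = 0; i < k; i++) {
            int x = kmer & 3;               // take the rightmost base
            rkmer = (rkmer << 2) | (3 - x); // append its complement
            kmer >>>= 2;
        }
        return rkmer;
    }

    // Merged count for a k-mer, following the expression quoted above.
    static long mergedCount(long[] counts, int kmer, int k, boolean rcomp) {
        int rkmer = reverseComplement(kmer, k);
        long a = counts[kmer];
        long b = counts[rkmer];
        return (rcomp && kmer != rkmer) ? a + b : a;
    }

    // Assumption: each pair is printed once, under the smaller integer code.
    static boolean isPrinted(int kmer, int k, boolean rcomp) {
        return !rcomp || kmer <= reverseComplement(kmer, k);
    }

    public static void main(String[] args) {
        long[] counts = new long[16]; // k=2
        counts[1] = 4;  // AC
        counts[11] = 3; // GT, the reverse complement of AC
        System.out.println(mergedCount(counts, 1, 2, true)); // prints 7
    }
}
```

The kmer!=rkmer check matters for palindromic k-mers such as AT, which equal their own reverse complement and would otherwise be counted twice.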

Multi-threaded Processing

KmerCountShort uses parallel processing for efficiency:

Per-Thread Counts: Each thread maintains its own counts array to avoid synchronization
Thread Assignment: Reads distributed across threads via ConcurrentReadInputStream
Accumulation: Thread-local counts merged into global counts array after processing
Synchronized Merging: Tools.add(counts, pt.countsT) safely combines thread results
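
The per-thread pattern described above can be sketched with plain Java threads. This is not the tool's implementation: reads are split round-robin from an in-memory list instead of being streamed through ConcurrentReadInputStream, add is a hypothetical stand-in for Tools.add, and only uppercase ACGT is recognized.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch only: names are hypothetical, not BBTools API.
final class ThreadedCountSketch {

    // Element-wise add of src into dest; stand-in for Tools.add.
    static void add(long[] dest, long[] src) {
        for (int i = 0; i < dest.length; i++) { dest[i] += src[i]; }
    }

    // Worker that counts the k-mers of its share of reads into a private array.
    static final class Worker extends Thread {
        final List<byte[]> reads;
        final int k;
        final long[] countsT;

        Worker(List<byte[]> reads, int k) {
            this.reads = reads;
            this.k = k;
            this.countsT = new long[1 << (2 * k)];
        }

        @Override
        public void run() {
            final int mask = ~((-1) << (2 * k));
            for (byte[] bases : reads) {
                int kmer = 0, len = 0;
                for (byte b : bases) {
                    int x = "ACGT".indexOf(b); // -1 for anything but uppercase ACGT
                    if (x < 0) { kmer = 0; len = 0; continue; }
                    kmer = ((kmer << 2) | x) & mask;
                    if (++len >= k) { countsT[kmer]++; } // no locking: array is private
                }
            }
        }
    }

    static long[] count(List<byte[]> reads, int k, int threads) {
        List<Worker> workers = new ArrayList<>();
        for (int t = 0; t < threads; t++) {
            List<byte[]> share = new ArrayList<>();
            for (int i = t; i < reads.size(); i += threads) { share.add(reads.get(i)); }
            Worker w = new Worker(share, k);
            workers.add(w);
            w.start();
        }
        long[] counts = new long[1 << (2 * k)]; // the extra, combined array
        for (Worker w : workers) {
            try {
                w.join(); // wait for the worker before reading its array
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new RuntimeException(e);
            }
            add(counts, w.countsT);
        }
        return counts;
    }
}
```

One private array per worker plus the combined array gives threads+1 arrays in total, which is consistent with the (threads+1) factor in the memory formula.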

Skip Parameter Optimization

The skip parameter provides controlled subsampling:

Default (skip=1): Counts every k-mer using addKmers(bases, counts, k)
Skip Mode: Uses addKmers(bases, counts, k, skip) with modulo check: len%skip==0
Use Case: Faster profiling of large datasets when approximate counts are sufficient
Performance: skip=10 counts roughly one in ten k-mers, so processing is up to ~10x faster and counts are proportionally smaller
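
A minimal sketch of the skip variant follows, assuming the modulo check is applied to the running length of valid bases as the len%skip==0 expression suggests. The class name is hypothetical, and only uppercase ACGT is recognized.

```java
// Sketch only: illustrates the len%skip==0 check, not the actual BBTools source.
final class SkipCountSketch {

    // Count k-mers of one sequence, incrementing only at every skip-th position.
    static void addKmers(byte[] bases, long[] counts, int k, int skip) {
        final int mask = ~((-1) << (2 * k));
        int kmer = 0, len = 0;
        for (byte b : bases) {
            int x = "ACGT".indexOf(b); // -1 for anything but uppercase ACGT
            if (x < 0) { kmer = 0; len = 0; continue; }
            kmer = ((kmer << 2) | x) & mask; // the rolling k-mer is always updated
            len++;
            if (len >= k && len % skip == 0) { counts[kmer]++; }
        }
    }

    public static void main(String[] args) {
        long[] all = new long[16];
        long[] sampled = new long[16];
        addKmers("ACGTACGT".getBytes(), all, 2, 1);
        addKmers("ACGTACGT".getBytes(), sampled, 2, 2);
        System.out.println(all[1] + " " + sampled[1]); // AC counts with skip=1 and skip=2
    }
}
```

In this sketch the rolling k-mer is still updated for every base and only the counter increment is skipped, so the sampled k-mers fall at regular intervals along each read.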

Support

Please contact Brian Bushnell at bbushnell@lbl.gov if you encounter any problems.