KmerCountShort

Script: kmercountshort.sh Package: jgi Class: KmerCountShort.java

Counts the kmers in a file and prints a fasta or tsv file containing all kmers and their counts. Supports K=1 to 15, though KmerCountExact should be used for values above 8. Options include reverse-complement merging and minimum count filtering.

Basic Usage

kmercountshort.sh in=<file> out=<file> k=4

KmerCountShort processes sequencing data to count k-mer occurrences. Input may be fasta or fastq, compressed or uncompressed. Output may be stdout or a file.

Parameters

Parameters control input/output locations, k-mer counting behavior, and output filtering.

Input Parameters

in=<file>
Primary input file.
in2=<file>
Second input file for paired reads.

Output Parameters

out=<file>
Print kmers and their counts. Extension sensitive; .fa or .fasta will produce fasta, otherwise tsv.
mincount=0
Only print kmers with at least this depth.
reads=-1
Only process this number of reads, then quit (-1 means all).
rcomp=t
Store and count each kmer together with its reverse-complement.
comment=
String placed at the start of the tsv header line, e.g. 'comment=#'.
skip=1
Count every Nth kmer. If skip=2, count every 2nd kmer, etc.

Counting Parameters

k=4
Kmer length. Needs at least (threads+1)*8*4^k bytes of memory.

Java Parameters

-Xmx
This will set Java's memory usage, overriding autodetection. -Xmx20g will specify 20 gigs of RAM, and -Xmx200m will specify 200 megs. The max is typically 85% of physical memory.
-eoom
This flag will cause the process to exit if an out-of-memory exception occurs. Requires Java 8u92+.
-da
Disable assertions.

Output Format

TSV Format (default)

Kmer    Count
AAAA    1523
AAAC    842
AAAG    1105
AAAT    978

Tab-delimited format with k-mer sequences in the first column and counts in the second column. Optional comment header can be specified with the comment parameter.
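The TSV layout above is simple to load for downstream analysis. The following is a minimal sketch, not part of the tool: the function name read_kmer_tsv is hypothetical, and it assumes that any row whose second column is not an integer (such as the header, with or without a comment prefix) can be skipped.

```python
import csv

def read_kmer_tsv(lines):
    """Parse tab-delimited k-mer counts into a dict {kmer: count}.

    `lines` is any iterable of text lines (an open file works).
    Rows whose second column is not an integer, such as the
    'Kmer<TAB>Count' header, are skipped (an assumption of this sketch).
    """
    counts = {}
    for row in csv.reader(lines, delimiter="\t"):
        if len(row) < 2:
            continue  # blank or malformed line
        try:
            counts[row[0]] = int(row[1])
        except ValueError:
            continue  # header line or non-numeric count
    return counts
```

With a file on disk, this could be called as read_kmer_tsv(open("kmers.txt")).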

Fasta Format (.fa or .fasta extension)

>1523
AAAA
>842
AAAC
>1105
AAAG
>978
AAAT

Fasta format with counts in the header line and k-mer sequences as the sequence lines.
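The fasta layout can be read back in the same spirit. This is a sketch under the assumption that each header holds only the count and each record has exactly one sequence line, as in the example above; the function name read_kmer_fasta is hypothetical.

```python
def read_kmer_fasta(lines):
    """Parse count-in-header fasta records into a dict {kmer: count}.

    `lines` is any iterable of text lines (an open file works).
    Assumes each '>' header contains only an integer count and is
    followed by a single sequence line.
    """
    counts = {}
    pending = None  # count from the most recent header line
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            pending = int(line[1:])
        elif pending is not None:
            counts[line] = pending
            pending = None
    return counts
```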

Examples

Basic K-mer Counting

kmercountshort.sh in=reads.fq out=kmers.txt k=4

Count all 4-mers in a fastq file and output results in TSV format.

Fasta Output with Filtering

kmercountshort.sh in=genome.fa out=kmers.fasta k=6 mincount=10

Count 6-mers and output only those with at least 10 occurrences in fasta format.

Paired-End Reads with Reverse-Complement Merging

kmercountshort.sh in=read1.fq in2=read2.fq out=kmers.txt k=5 rcomp=t

Count 5-mers from paired reads, merging each k-mer with its reverse complement.

Sampling with Skip Parameter

kmercountshort.sh in=large.fq out=sample_kmers.txt k=4 skip=10

Count every 10th k-mer for faster processing of large datasets.

Limited Read Processing

kmercountshort.sh in=reads.fq out=kmers.txt k=4 reads=100000

Process only the first 100,000 reads for quick k-mer profiling.

Custom TSV Header

kmercountshort.sh in=reads.fq out=kmers.txt k=4 comment=#

Add a comment character '#' to the beginning of the TSV header line.

Algorithm Details

K-mer Encoding

KmerCountShort uses efficient bit-shifting to encode k-mers as integers.
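To illustrate the general technique, here is a minimal sketch of a rolling 2-bit encoder. The base mapping A=0, C=1, G=2, T=3, the window reset on ambiguous bases, and the function names are assumptions made for this example; they are not taken from KmerCountShort.java.

```python
# Assumed 2-bit codes for this sketch; the tool's actual mapping may differ.
BASE_TO_BITS = {"A": 0, "C": 1, "G": 2, "T": 3}

def encode_kmers(seq, k):
    """Yield (start_position, integer_code) for each k-mer in seq."""
    mask = (1 << (2 * k)) - 1  # keeps only the lowest 2*k bits
    code = 0
    valid = 0  # consecutive unambiguous bases seen so far
    for i, base in enumerate(seq.upper()):
        bits = BASE_TO_BITS.get(base)
        if bits is None:
            # Ambiguous base such as N: restart the window.
            code, valid = 0, 0
            continue
        # Shift the previous k-mer left by one base and append the new base.
        code = ((code << 2) | bits) & mask
        valid += 1
        if valid >= k:
            yield i - k + 1, code

def decode_kmer(code, k):
    """Convert an integer code back to its k-mer string."""
    return "".join("ACGT"[(code >> (2 * (k - 1 - j))) & 3] for j in range(k))
```

Because each code lies in the range 0 to 4^k - 1, it can index directly into an array of 4^k counters, which is consistent with the 4^k term in the memory formula.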

Memory Requirements

Memory usage is determined by the k-mer size and number of threads: at least (threads+1)*8*4^k, as stated for the k parameter.

Note: For k > 8, use KmerCountExact, which uses a different data structure optimized for larger k-mer sizes.
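A quick calculation shows how fast the requirement grows with k. This sketch evaluates the documented formula (threads+1)*8*4^k, assuming the result is in bytes; the function name and the choice of 8 threads are illustrative only.

```python
def min_memory_bytes(k, threads):
    """Lower bound from the documented formula, assumed to be in bytes."""
    return (threads + 1) * 8 * 4 ** k

for k in (4, 8, 12, 15):
    needed = min_memory_bytes(k, threads=8)
    print(f"k={k:2d}  {needed:>14,d} bytes  ({needed / 2**30:.3f} GiB)")
```

With 8 threads, k=4 needs 18,432 bytes and k=8 about 4.7 million bytes, while k=15 needs 77,309,411,328 bytes (72 GiB).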

Reverse-Complement Merging

When rcomp=t (default), each k-mer is stored and counted together with its reverse complement.
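The effect of merging can be sketched by mapping each k-mer and its reverse complement to a single representative before counting. Which member of the pair the tool reports is not stated here; choosing the lexicographically smaller one is an assumption of this example, as are the function names.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(kmer):
    """Complement each base, then reverse the string."""
    return kmer.translate(COMPLEMENT)[::-1]

def canonical(kmer):
    """Pick one representative per k-mer/reverse-complement pair.

    Assumption: the lexicographically smaller string is used.
    """
    return min(kmer, reverse_complement(kmer))

def count_canonical(seq, k):
    """Count k-mers in seq with each k-mer merged with its reverse complement."""
    counts = {}
    for i in range(len(seq) - k + 1):
        key = canonical(seq[i:i + k])
        counts[key] = counts.get(key, 0) + 1
    return counts
```

For example, in the sequence AAACGTTT the 4-mers AAAC and GTTT are reverse complements of each other, so they share one counter.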

Multi-threaded Processing

KmerCountShort uses parallel processing for efficiency. Memory use scales with the thread count, per the (threads+1) factor in the memory requirement.
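One design consistent with the (threads+1) factor is for each worker to count into a private array of 4^k counters, with the private arrays summed into one final array at the end. That reading is an inference, not something the source confirms, and the sketch below (including all function names and the 2-bit base mapping) is illustrative only. In CPython, threads show the structure of the approach rather than a real speedup.

```python
from concurrent.futures import ThreadPoolExecutor

# Assumed 2-bit base codes for this sketch.
BASE_TO_BITS = {"A": 0, "C": 1, "G": 2, "T": 3}

def count_chunk(reads, k):
    """Count k-mers from one chunk of reads into a private array."""
    counts = [0] * (4 ** k)
    mask = (1 << (2 * k)) - 1
    for read in reads:
        code, valid = 0, 0
        for base in read.upper():
            bits = BASE_TO_BITS.get(base)
            if bits is None:
                code, valid = 0, 0  # restart the window on ambiguous bases
                continue
            code = ((code << 2) | bits) & mask
            valid += 1
            if valid >= k:
                counts[code] += 1
    return counts

def count_parallel(reads, k, threads=4):
    """Split reads across workers, then merge the per-worker arrays."""
    chunks = [reads[i::threads] for i in range(threads)]
    with ThreadPoolExecutor(max_workers=threads) as pool:
        partials = list(pool.map(lambda chunk: count_chunk(chunk, k), chunks))
    # Merge step: sum the private arrays into one result array.
    total = [0] * (4 ** k)
    for partial in partials:
        for code, n in enumerate(partial):
            total[code] += n
    return total
```

Private arrays avoid contention between workers at the cost of one extra array per thread, which is where a memory term proportional to the thread count would come from.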

Skip Parameter Optimization

The skip parameter provides controlled subsampling: with skip=N, only every Nth k-mer is counted.
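As a rough illustration of subsampling by position, the sketch below counts only k-mers whose start index is a multiple of skip. Which k-mer in each group of N the tool actually counts is not specified here, so starting at position 0 is an assumption, and count_with_skip is a hypothetical name.

```python
def count_with_skip(seq, k, skip=1):
    """Count every skip-th k-mer of seq, starting at position 0 (assumed)."""
    counts = {}
    for i in range(0, len(seq) - k + 1, skip):
        kmer = seq[i:i + k]
        counts[kmer] = counts.get(kmer, 0) + 1
    return counts
```

With skip=1 every k-mer is counted; with skip=2 roughly half of them are, so reported counts shrink accordingly and a mincount threshold may need adjusting.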

Support

Please contact Brian Bushnell at bbushnell@lbl.gov if you encounter any problems.