KmerCountShort
Counts how often each kmer occurs in a file and prints a fasta or tsv file containing all kmers and their counts. Supports k=1 to 15, though KmerCountExact is recommended for values above 8. Options include reverse-complement merging and minimum count filtering.
Basic Usage
kmercountshort.sh in=<file> out=<file> k=4
KmerCountShort processes sequencing data to count k-mer occurrences. Input may be fasta or fastq, compressed or uncompressed. Output may be stdout or a file.
Parameters
Parameters control input/output locations, k-mer counting behavior, and output filtering.
Input Parameters
- in=<file>
- Primary input file.
- in2=<file>
- Second input file for paired reads.
Output Parameters
- out=<file>
- Print kmers and their counts. Extension sensitive; .fa or .fasta will produce fasta, otherwise tsv.
- mincount=0
- Only print kmers with at least this depth.
- reads=-1
- Only process this number of reads, then quit (-1 means all).
- rcomp=t
- Store and count each kmer together with its reverse-complement.
- comment=
- Denotes start of the tsv header. E.g. 'comment=#'
- skip=1
- Count every Nth kmer. If skip=2, count every 2nd kmer, etc.
Counting Parameters
- k=4
- Kmer length. Requires at least (threads+1)*8*4^k bytes of memory.
Java Parameters
- -Xmx
- This will set Java's memory usage, overriding autodetection. -Xmx20g will specify 20 gigs of RAM, and -Xmx200m will specify 200 megs. The max is typically 85% of physical memory.
- -eoom
- This flag will cause the process to exit if an out-of-memory exception occurs. Requires Java 8u92+.
- -da
- Disable assertions.
Output Format
TSV Format (default)
Kmer	Count
AAAA	1523
AAAC	842
AAAG	1105
AAAT	978
Tab-delimited format with k-mer sequences in the first column and counts in the second column. Optional comment header can be specified with the comment parameter.
Fasta Format (.fa or .fasta extension)
>1523
AAAA
>842
AAAC
>1105
AAAG
>978
AAAT
Fasta format with counts in the header line and k-mer sequences as the sequence lines.
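The two layouts above can be illustrated with a minimal Java sketch. It is not the tool's source: the class and method names (KmerOutputSketch, isFasta, decode, render) are hypothetical, the extension check ignores compressed suffixes such as .gz, and the TSV header line is omitted.

```java
// Illustrative sketch only; all names here are hypothetical, not taken from the tool.
final class KmerOutputSketch {

    /** Assumed rule: .fa or .fasta selects fasta, anything else selects tsv. */
    static boolean isFasta(String path) {
        String p = path.toLowerCase();
        return p.endsWith(".fa") || p.endsWith(".fasta");
    }

    /** Turns a 2-bit-per-base integer (A=0, C=1, G=2, T=3) back into text. */
    static String decode(int kmer, int k) {
        char[] seq = new char[k];
        for (int i = k - 1; i >= 0; i--) {
            seq[i] = "ACGT".charAt(kmer & 3); // lowest 2 bits hold the last base
            kmer >>>= 2;
        }
        return new String(seq);
    }

    /** Renders every k-mer whose count is at least mincount. */
    static String render(long[] counts, int k, long mincount, boolean fasta) {
        StringBuilder sb = new StringBuilder();
        for (int kmer = 0; kmer < counts.length; kmer++) {
            long count = counts[kmer];
            if (count < mincount) continue;
            String seq = decode(kmer, k);
            if (fasta) {
                sb.append('>').append(count).append('\n').append(seq).append('\n');
            } else {
                sb.append(seq).append('\t').append(count).append('\n');
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        long[] counts = {5, 0, 2, 1}; // counts for A, C, G, T with k=1
        System.out.print(render(counts, 1, 1, isFasta("kmers.fasta")));
    }
}
```

Calling render with fasta=false on the same array yields one tab-separated line per k-mer instead.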
Examples
Basic K-mer Counting
kmercountshort.sh in=reads.fq out=kmers.txt k=4
Count all 4-mers in a fastq file and output results in TSV format.
Fasta Output with Filtering
kmercountshort.sh in=genome.fa out=kmers.fasta k=6 mincount=10
Count 6-mers and output only those with at least 10 occurrences in fasta format.
Paired-End Reads with Reverse-Complement Merging
kmercountshort.sh in=read1.fq in2=read2.fq out=kmers.txt k=5 rcomp=t
Count 5-mers from paired reads, merging each k-mer with its reverse complement.
Sampling with Skip Parameter
kmercountshort.sh in=large.fq out=sample_kmers.txt k=4 skip=10
Count every 10th k-mer for faster processing of large datasets.
Limited Read Processing
kmercountshort.sh in=reads.fq out=kmers.txt k=4 reads=100000
Process only the first 100,000 reads for quick k-mer profiling.
Custom TSV Header
kmercountshort.sh in=reads.fq out=kmers.txt k=4 comment=#
Add a comment character '#' to the beginning of the TSV header line.
Algorithm Details
K-mer Encoding
KmerCountShort uses efficient bit-shifting to encode k-mers as integers:
- Base Encoding: A=0, C=1, G=2, T=3 (2 bits per base)
- Bit-Shifting: kmer=((kmer<<2)|x)&mask shifts left 2 bits and adds the new base
- Masking: mask=~((-1)<<bits) where bits=2*k; keeps only the rightmost k bases
- Array Indexing: K-mers directly index into a counts array of size 4^k
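The encoding steps above can be sketched in Java as follows. This is an illustration under stated assumptions, not the tool's code: the names KmerEncodingSketch, baseToNumber and countKmers are hypothetical, and restarting the window at a non-ACGT base is an assumed behavior. The int arithmetic is only valid for k from 1 to 15.

```java
// Illustrative sketch only; names and N-handling are assumptions, not the tool's source.
final class KmerEncodingSketch {

    /** Maps A/C/G/T (either case) to 0..3 and anything else to -1. */
    static int baseToNumber(char b) {
        switch (Character.toUpperCase(b)) {
            case 'A': return 0;
            case 'C': return 1;
            case 'G': return 2;
            case 'T': return 3;
            default: return -1;
        }
    }

    /** Counts every k-mer of seq in an array of size 4^k, indexed by the encoded k-mer. */
    static long[] countKmers(String seq, int k) {
        final int bits = 2 * k;
        final int mask = ~((-1) << bits);      // lowest 2k bits set
        final long[] counts = new long[1 << bits];
        int kmer = 0;
        int len = 0;                           // valid bases currently in the window
        for (int i = 0; i < seq.length(); i++) {
            int x = baseToNumber(seq.charAt(i));
            if (x < 0) {                       // assumed: an undefined base restarts the window
                kmer = 0;
                len = 0;
                continue;
            }
            kmer = ((kmer << 2) | x) & mask;   // shift in the new base, drop the oldest
            len++;
            if (len >= k) counts[kmer]++;
        }
        return counts;
    }

    public static void main(String[] args) {
        long[] counts = countKmers("ACGTACGT", 4);
        System.out.println(counts[27]);        // ACGT encodes to 27 and occurs twice: prints 2
    }
}
```

Because the k-mer value is itself the array index, no hashing or lookup structure is needed, which is why the approach suits small k.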
Memory Requirements
Memory usage is determined by the k-mer size and number of threads:
- Formula: (threads+1) * 8 * 4^k bytes minimum
- Example for k=4 with 8 threads: (8+1) * 8 * 256 = 18,432 bytes (~18 KB) total
- Example for k=8 with 8 threads: (8+1) * 8 * 65,536 = 4,718,592 bytes (~4.7 MB) total
- Example for k=15 with 8 threads: (8+1) * 8 * 1,073,741,824 = 77,309,411,328 bytes (~77 GB) total
Note: For k > 8, use KmerCountExact which uses a different data structure optimized for larger k-mer sizes.
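The formula is easy to evaluate before launching a job. The following Java sketch simply computes (threads+1) * 8 * 4^k; the class and method names are hypothetical, and the result is a lower bound that excludes JVM overhead and I/O buffers.

```java
// Illustrative sketch only; evaluates the documented minimum-memory formula.
final class KmerMemorySketch {

    /** Minimum bytes: (threads+1) arrays, each holding 4^k counters of 8 bytes. */
    static long minBytes(int threads, int k) {
        long arrayLength = 1L << (2 * k);   // 4^k possible k-mers
        return (threads + 1L) * 8L * arrayLength;
    }

    public static void main(String[] args) {
        for (int k : new int[] {4, 8, 15}) {
            System.out.printf("k=%d: %,d bytes%n", k, minBytes(8, k));
        }
    }
}
```

Each increment of k multiplies the requirement by four, so the threshold between a comfortable and an impractical run is reached quickly.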
Reverse-Complement Merging
When rcomp=t (default), k-mers are merged with their reverse complements:
- Canonical K-mers: Each k-mer and its reverse complement are counted together
- Implementation: Uses AminoAcid.reverseComplementBinaryFast(kmer, k) for efficient reverse complementation
- Output Filtering: Only the lexicographically smaller of a k-mer/reverse-complement pair is printed
- Count Summation: count=(rcomp && kmer!=rkmer) ? a+b : a adds both counts if the k-mer and its reverse complement differ
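A minimal Java sketch of the merging logic follows. The reverseComplement method here is a plain loop written for illustration and is not the implementation of AminoAcid.reverseComplementBinaryFast; the names RcompSketch, mergedCount and isPrinted are likewise hypothetical. It relies on the A=0, C=1, G=2, T=3 encoding, under which the complement of base x is 3-x and numeric order matches lexicographic order.

```java
// Illustrative sketch only; a simple stand-in for the tool's reverse-complement routine.
final class RcompSketch {

    /** Reverse-complements a 2-bit-encoded k-mer (A=0, C=1, G=2, T=3). */
    static int reverseComplement(int kmer, int k) {
        int r = 0;
        for (int i = 0; i < k; i++) {
            int x = kmer & 3;           // take the last remaining base
            r = (r << 2) | (3 - x);     // append its complement
            kmer >>>= 2;
        }
        return r;
    }

    /** Sums a k-mer's count with its reverse complement's, unless they are the same k-mer. */
    static long mergedCount(long[] counts, int kmer, int k, boolean rcomp) {
        int rkmer = reverseComplement(kmer, k);
        long a = counts[kmer];
        long b = counts[rkmer];
        return (rcomp && kmer != rkmer) ? a + b : a;
    }

    /** With rcomp on, only the smaller member of each pair is reported. */
    static boolean isPrinted(int kmer, int k, boolean rcomp) {
        return !rcomp || kmer <= reverseComplement(kmer, k);
    }

    public static void main(String[] args) {
        long[] counts = new long[16];   // k=2
        counts[0] = 3;                  // AA
        counts[15] = 4;                 // TT, the reverse complement of AA
        System.out.println(mergedCount(counts, 0, 2, true)); // prints 7
    }
}
```

The kmer!=rkmer check matters for palindromic k-mers such as AT or ACGT, which would otherwise be double-counted.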
Multi-threaded Processing
KmerCountShort uses parallel processing for efficiency:
- Per-Thread Counts: Each thread maintains its own counts array to avoid synchronization
- Thread Assignment: Reads distributed across threads via ConcurrentReadInputStream
- Accumulation: Thread-local counts merged into global counts array after processing
- Synchronized Merging: Tools.add(counts, pt.countsT) safely combines thread results
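The per-thread pattern described above can be sketched with plain Java threads. This is a simplified illustration, not the tool's code: reads are dealt out by index rather than through ConcurrentReadInputStream, input is assumed to be uppercase, and the names PerThreadCountsSketch, addKmers, add and countParallel are hypothetical stand-ins.

```java
import java.util.List;

// Illustrative sketch only; thread assignment and helper names are assumptions.
final class PerThreadCountsSketch {

    /** Adds all k-mers of one read to counts (uppercase ACGT; other symbols restart the window). */
    static void addKmers(String bases, long[] counts, int k) {
        final int mask = ~((-1) << (2 * k));
        int kmer = 0;
        int len = 0;
        for (int i = 0; i < bases.length(); i++) {
            int x = "ACGT".indexOf(bases.charAt(i));
            if (x < 0) { kmer = 0; len = 0; continue; }
            kmer = ((kmer << 2) | x) & mask;
            if (++len >= k) counts[kmer]++;
        }
    }

    /** Element-wise addition of src into dest. */
    static void add(long[] dest, long[] src) {
        for (int i = 0; i < dest.length; i++) dest[i] += src[i];
    }

    static long[] countParallel(List<String> reads, int k, int threads) {
        final long[] counts = new long[1 << (2 * k)];
        Thread[] workers = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            final int id = t;
            workers[t] = new Thread(() -> {
                long[] countsT = new long[counts.length];   // private array: no locking while counting
                for (int i = id; i < reads.size(); i += threads) {
                    addKmers(reads.get(i), countsT, k);
                }
                synchronized (counts) {                     // lock only for the final merge
                    add(counts, countsT);
                }
            });
            workers[t].start();
        }
        for (Thread w : workers) {
            try {
                w.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        long[] counts = countParallel(List.of("ACGT", "ACGT", "AAAA"), 2, 2);
        System.out.println(counts[0]); // AA occurs three times in AAAA: prints 3
    }
}
```

The extra array per thread is what produces the (threads+1) factor in the memory formula: one array per worker plus the shared total.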
Skip Parameter Optimization
The skip parameter provides controlled subsampling:
- Default (skip=1): Counts every k-mer using addKmers(bases, counts, k)
- Skip Mode: Uses addKmers(bases, counts, k, skip) with modulo check: len%skip==0
- Use Case: Faster profiling of large datasets when approximate counts are sufficient
- Performance: skip=10 processes ~10x faster with proportionally smaller counts
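The modulo check can be illustrated with a short Java sketch. It assumes that len is the number of valid bases seen so far in the current window and that a k-mer is counted only when len%skip==0. The source states the check but not how len is defined, so treat this as one plausible reading rather than the tool's exact behavior. The class name SkipSketch is hypothetical.

```java
// Illustrative sketch only; the definition of len is an assumption.
final class SkipSketch {

    /** Counts k-mers of an uppercase read, keeping only those where len % skip == 0. */
    static void addKmers(String bases, long[] counts, int k, int skip) {
        final int mask = ~((-1) << (2 * k));
        int kmer = 0;
        int len = 0;                               // valid bases seen in the current window
        for (int i = 0; i < bases.length(); i++) {
            int x = "ACGT".indexOf(bases.charAt(i));
            if (x < 0) { kmer = 0; len = 0; continue; }
            kmer = ((kmer << 2) | x) & mask;
            len++;
            if (len >= k && len % skip == 0) {     // skip=1 counts every k-mer
                counts[kmer]++;
            }
        }
    }

    public static void main(String[] args) {
        long[] all = new long[16];
        long[] sampled = new long[16];
        addKmers("AAAAAAAAAA", all, 2, 1);
        addKmers("AAAAAAAAAA", sampled, 2, 3);
        System.out.println(all[0] + " " + sampled[0]); // prints 9 3
    }
}
```

Note that the k-mer value is still updated for every base; only the array increment is skipped, so sampled counts should be scaled by skip when compared with full counts.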
Support
Please contact Brian Bushnell at bbushnell@lbl.gov if you encounter any problems.