CommonKmers

Script: commonkmers.sh
Package: jgi
Class: SmallKmerFrequency.java

Prints the most common k-mers in each sequence. This is intended for short k-mers only!

Basic Usage

commonkmers.sh in=<file> out=<file>

This tool analyzes input sequences and identifies the most frequently occurring k-mers within each sequence. The output shows the sequence ID followed by the most common k-mers, optionally with their occurrence counts.

Parameters

Parameters control k-mer analysis settings, output formatting, and file handling options.

Core Parameters

k=2
K-mer length to analyze. Valid range: 0-12. Short k-mers (2-6) are most useful for compositional analysis, while longer k-mers provide more specific sequence patterns. Default is 2 (dinucleotides).
display=3
Number of most common k-mers to print per sequence. The tool will display this many of the highest-frequency k-mers for each input sequence, ranked by occurrence count.
count=f
Print the k-mer counts as well as the k-mer sequences. When set to true, output format becomes "kmer=count" instead of just "kmer". Useful for quantitative analysis of k-mer frequencies.

File Handling Parameters

ow=f
(overwrite) Overwrites files that already exist. Set to true to allow overwriting of existing output files without prompting.
app=f
(append) Append to files that already exist. When true, new results are added to the end of existing files rather than overwriting them.
zl=4
(ziplevel) Set compression level for output files. Range: 1 (fastest, least compression) to 9 (slowest, maximum compression). Level 4 provides a good balance of speed and compression ratio.
qin=auto
ASCII offset for input quality scores in FASTQ files. Options: 33 (Sanger/Illumina 1.8+), 64 (Illumina 1.3-1.7), or auto (automatic detection). Only affects FASTQ input processing.
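
To make the qin offsets concrete, the following Python sketch shows how a FASTQ quality string maps to Phred scores under offset 33 versus offset 64. It only illustrates the encoding convention; the function name phred_scores is hypothetical and is not part of commonkmers.sh.

```python
# Hypothetical helper illustrating ASCII quality offsets; not part of BBTools.

def phred_scores(quality, offset=33):
    """Convert a FASTQ quality string to a list of Phred scores."""
    return [ord(ch) - offset for ch in quality]


# The same Phred score of 40 is written as 'I' with offset 33
# and as 'h' with offset 64.
print(phred_scores("I", offset=33))
print(phred_scores("h", offset=64))
```

Reading offset-64 data as offset 33 would inflate every score by 31, which is why an explicit qin value can help when automatic detection is not trusted.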

Java Parameters

-Xmx
This will set Java's memory usage, overriding autodetection. -Xmx20g will specify 20 gigs of RAM, and -Xmx200m will specify 200 megs. The max is typically 85% of physical memory. Default allocation is 800MB for this tool.
-eoom
This flag will cause the process to exit if an out-of-memory exception occurs. Requires Java 8u92+. Useful for automated pipelines to prevent hanging on memory exhaustion.
-da
Disable assertions. Can provide minor performance improvement in production environments by skipping internal consistency checks.

Examples

Basic K-mer Analysis

commonkmers.sh in=sequences.fasta out=kmers.txt k=3 display=5

Analyze 3-mer frequencies in sequences, showing the top 5 most common trimers for each sequence.

Dinucleotide Composition with Counts

commonkmers.sh in=reads.fastq out=dinuc_profile.txt k=2 count=t display=4

Generate dinucleotide composition profiles with occurrence counts, showing the 4 most frequent dinucleotides per sequence.

Short K-mer Screening

commonkmers.sh in=contigs.fasta out=composition.txt k=4 display=10 count=t

Screen contigs for 4-mer composition bias by displaying the top 10 tetramers with their frequencies, useful for detecting repetitive elements or contamination.

Quality-aware Analysis

commonkmers.sh in=raw_reads.fq out=kmer_analysis.txt k=3 qin=33 display=6

Analyze k-mer composition in FASTQ data with explicit quality encoding specification, displaying top 6 trimers per read.

Algorithm Details

K-mer Indexing Strategy

CommonKmers uses an efficient k-mer indexing approach optimized for short k-mers (0-12 nucleotides).
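
As an illustration of what indexing short k-mers can look like, the Python sketch below packs each base into 2 bits so that a k-mer maps to an integer in the range 0 to 4^k - 1, and maps a k-mer and its reverse complement to one shared slot. This is an assumption about the general technique, not a transcription of SmallKmerFrequency.java; all function names are hypothetical.

```python
# Hypothetical sketch: 2-bit k-mer encoding with a canonical index.
# Not taken from SmallKmerFrequency.java; names are illustrative only.

BASE_TO_BITS = {"A": 0, "C": 1, "G": 2, "T": 3}
BITS_TO_BASE = "ACGT"


def encode(kmer):
    """Pack a k-mer into an integer, 2 bits per base (A=0, C=1, G=2, T=3)."""
    value = 0
    for base in kmer:
        value = (value << 2) | BASE_TO_BITS[base]
    return value


def decode(value, k):
    """Inverse of encode: unpack an integer back into a k-mer string."""
    bases = []
    for _ in range(k):
        bases.append(BITS_TO_BASE[value & 3])
        value >>= 2
    return "".join(reversed(bases))


def reverse_complement(value, k):
    """Reverse-complement an encoded k-mer.

    With this encoding the complement of a base is 3 minus its 2-bit code.
    """
    rc = 0
    for _ in range(k):
        rc = (rc << 2) | (3 - (value & 3))
        value >>= 2
    return rc


def canonical(value, k):
    """Use the smaller of a k-mer and its reverse complement as the shared slot."""
    return min(value, reverse_complement(value, k))
```

With this scheme a count array of length 4^k can be addressed directly by the encoded value, which is practical only while k stays small.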

Memory Management

The algorithm maintains separate data structures for k-mer counting and result formatting.

Processing Workflow

  1. Initialization: Create k-mer index mapping and allocate count arrays based on k-mer length
  2. Sequence Processing: For each sequence, slide a window of size k and count occurrences of each canonical k-mer
  3. Frequency Ranking: Sort k-mers by frequency count in descending order
  4. Output Generation: Format and output the top N most frequent k-mers per sequence
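
The four steps above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the tool's implementation: the canonical form is taken to be the lexicographically smaller of a k-mer and its reverse complement, windows containing bases other than A, C, G, T are skipped, and ties are broken alphabetically. The function names are hypothetical.

```python
from collections import Counter

# Translation table used to complement a DNA string.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical(kmer):
    """Return the smaller of a k-mer and its reverse complement (assumed convention)."""
    rc = kmer.translate(COMPLEMENT)[::-1]
    return min(kmer, rc)


def most_common_kmers(seq, k=2, display=3):
    """Slide a window of size k over seq and return the top `display` k-mers."""
    seq = seq.upper()
    counts = Counter()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        # Assumption: windows with ambiguous bases (e.g. N) are ignored.
        if set(kmer) <= set("ACGT"):
            counts[canonical(kmer)] += 1
    # Rank by descending count; alphabetical tie-breaking is an assumption.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[:display]


def format_line(seq_id, ranked, count=False):
    """Format one tab-separated output line, with or without counts."""
    fields = [f"{kmer}={n}" if count else kmer for kmer, n in ranked]
    return "\t".join([seq_id] + fields)


print(format_line("seq1", most_common_kmers("ATATAT", k=2), count=True))
```

Here k, display, and count mirror the parameters documented above; everything else is illustrative.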

Limitations and Considerations

Performance Characteristics

Counting takes O(L) time per sequence, where L is the sequence length, and ranking adds an O(4^k log(4^k)) sorting cost per sequence. Memory usage is dominated by the O(4^k) count array, so both the sort and the array grow exponentially with k. This makes the tool most efficient for short k-mers where detailed compositional analysis is needed.
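
To put numbers on the 4^k terms, the short Python sketch below tabulates the count-array size and an n log2(n) sort estimate for a few values of k. The 4 bytes per counter is an assumption chosen for illustration, not a documented property of the tool, and the helper names are hypothetical.

```python
# Back-of-the-envelope arithmetic for the 4^k terms; helper names are hypothetical.

def slots(k):
    """Number of distinct k-mers over the alphabet A, C, G, T."""
    return 4 ** k


def array_bytes(k, bytes_per_count=4):
    """Size of a dense count array, assuming 4-byte counters."""
    return slots(k) * bytes_per_count


def sort_steps(k):
    """n * log2(n) with n = 4^k, using log2(4^k) = 2k."""
    return slots(k) * 2 * k


for k in (2, 4, 6, 8, 12):
    print(k, slots(k), array_bytes(k), sort_steps(k))
```

At k=12 there are 16,777,216 slots, which is 64 MiB under the 4-byte assumption, and the sort estimate applies to every input sequence.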

Output Format

The output format depends on the count parameter setting:

Without Counts (count=f, default)

sequence_id    kmer1    kmer2    kmer3    ...

Each line contains the sequence identifier followed by tab-separated k-mers in descending frequency order.

With Counts (count=t)

sequence_id    kmer1=count1    kmer2=count2    kmer3=count3    ...

Each k-mer is followed by an equals sign and its occurrence count within that sequence.

Example Output

# Without counts (count=f)
seq1    AT    GC    TA
seq2    CG    GC    AT

# With counts (count=t)
seq1    AT=15    GC=12    TA=8
seq2    CG=22    GC=18    AT=7
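
For downstream processing, output lines in either format can be read back with a few lines of Python. The sketch below assumes tab-separated fields as described above; parse_line is a hypothetical helper, not something shipped with the tool.

```python
# Hypothetical parser for commonkmers.sh output lines (tab-separated, assumed).

def parse_line(line):
    """Split one output line into (sequence_id, [(kmer, count_or_None), ...])."""
    fields = line.rstrip("\n").split("\t")
    seq_id, entries = fields[0], fields[1:]
    kmers = []
    for entry in entries:
        if "=" in entry:
            # count=t format: kmer=count
            kmer, n = entry.split("=", 1)
            kmers.append((kmer, int(n)))
        else:
            # count=f format: bare k-mer, no count available
            kmers.append((entry, None))
    return seq_id, kmers


print(parse_line("seq1\tAT=15\tGC=12\tTA=8"))
print(parse_line("seq2\tCG\tGC\tAT"))
```

Returning None for missing counts lets the same function handle both settings of the count parameter.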

Use Cases
