CommonKmers

Basic Usage

commonkmers.sh in=<file> out=<file>

This tool analyzes input sequences and identifies the most frequently occurring k-mers within each sequence. The output shows the sequence ID followed by the most common k-mers, optionally with their occurrence counts.

Parameters

Parameters control k-mer analysis settings, output formatting, and file handling options.

Core Parameters

k=2: K-mer length to analyze. Valid range: 0-12. Short k-mers (2-6) are most useful for compositional analysis, while longer k-mers provide more specific sequence patterns. Default is 2 (dinucleotides).
display=3: Number of most common k-mers to print per sequence. The tool will display this many of the highest-frequency k-mers for each input sequence, ranked by occurrence count.
count=f: Print the k-mer counts as well as the k-mer sequences. When set to true, output format becomes "kmer=count" instead of just "kmer". Useful for quantitative analysis of k-mer frequencies.

File Handling Parameters

ow=f: (overwrite) Overwrites files that already exist. Set to true to allow overwriting of existing output files without prompting.
app=f: (append) Append to files that already exist. When true, new results are added to the end of existing files rather than overwriting them.
zl=4: (ziplevel) Set compression level for output files. Range: 1 (fastest, least compression) to 9 (slowest, maximum compression). Level 4 provides a good balance of speed and compression ratio.
qin=auto: ASCII offset for input quality scores in FASTQ files. Options: 33 (Sanger/Illumina 1.8+), 64 (Illumina 1.3-1.7), or auto (automatic detection). Only affects FASTQ input processing.

Java Parameters

-Xmx: This will set Java's memory usage, overriding autodetection. -Xmx20g will specify 20 gigs of RAM, and -Xmx200m will specify 200 megs. The max is typically 85% of physical memory. Default allocation is 800MB for this tool.
-eoom: This flag will cause the process to exit if an out-of-memory exception occurs. Requires Java 8u92+. Useful for automated pipelines to prevent hanging on memory exhaustion.
-da: Disable assertions. Can provide minor performance improvement in production environments by skipping internal consistency checks.

Examples

Basic K-mer Analysis

commonkmers.sh in=sequences.fasta out=kmers.txt k=3 display=5

Analyze 3-mer frequencies in sequences, showing the top 5 most common trimers for each sequence.

Dinucleotide Composition with Counts

commonkmers.sh in=reads.fastq out=dinuc_profile.txt k=2 count=t display=4

Generate dinucleotide composition profiles with occurrence counts, showing the 4 most frequent dinucleotides per sequence.

Short K-mer Screening

commonkmers.sh in=contigs.fasta out=composition.txt k=4 display=10 count=t

Screen contigs for 4-mer composition bias by displaying the top 10 tetramers with their frequencies, useful for detecting repetitive elements or contamination.

FASTQ Input with Explicit Quality Offset

commonkmers.sh in=raw_reads.fq out=kmer_analysis.txt k=3 qin=33 display=6

Analyze k-mer composition in FASTQ data while explicitly setting the input quality offset to 33 (Sanger/Illumina 1.8+) instead of relying on auto-detection, displaying the top 6 trimers per read.

Algorithm Details

K-mer Indexing Strategy

CommonKmers uses an efficient k-mer indexing approach optimized for short k-mers (0-12 nucleotides). The algorithm employs several key optimizations:

Reverse Complement Normalization: K-mers and their reverse complements are treated as equivalent by always using the lexicographically smaller representation. This roughly halves the number of distinct k-mers that must be counted and gives consistent results regardless of strand orientation.
Binary Encoding: K-mers are encoded as binary integers using 2 bits per nucleotide (A=00, C=01, G=10, T=11), enabling fast bitwise operations for k-mer manipulation and comparison.
Pre-computed Index Arrays: The tool pre-computes index mappings that group canonical k-mers with their reverse complements, eliminating redundant counting and ensuring consistent indexing.
Dual Sorting Strategy: Results are sorted twice, first by numerical k-mer index to reset count arrays efficiently, then by frequency count using custom comparators to identify the most common k-mers.
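
The encoding and normalization steps above can be illustrated with a minimal Python sketch. This is not the tool's actual Java code; the function names (encode, reverse_complement, canonical_code, decode) are hypothetical and only the 2-bit mapping (A=00, C=01, G=10, T=11) is taken from the description.

```python
# Illustrative sketch only; names are hypothetical, not taken from CommonKmers.
ENCODE = {"A": 0, "C": 1, "G": 2, "T": 3}  # 2 bits per nucleotide
DECODE = "ACGT"

def encode(kmer):
    """Pack a k-mer into an integer, first base in the highest bits."""
    value = 0
    for base in kmer:
        value = (value << 2) | ENCODE[base]
    return value

def reverse_complement(value, k):
    """Reverse-complement an encoded k-mer using bitwise operations."""
    rc = 0
    for _ in range(k):
        # With this encoding, the complement of a base code b is 3 - b.
        rc = (rc << 2) | (3 - (value & 3))
        value >>= 2
    return rc

def canonical_code(kmer):
    """Return the smaller of the forward and reverse-complement codes."""
    forward = encode(kmer)
    return min(forward, reverse_complement(forward, len(kmer)))

def decode(value, k):
    """Unpack an integer back into a k-mer string of length k."""
    return "".join(DECODE[(value >> (2 * (k - 1 - i))) & 3] for i in range(k))

print(decode(canonical_code("GT"), 2))  # GT and AC share one canonical form
```

Because the codes follow the order A < C < G < T, taking the numerically smaller code is the same as taking the lexicographically smaller string for k-mers of equal length.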

Memory Management

The algorithm maintains separate data structures for k-mer counting and result formatting:

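Count Arrays: Fixed-size integer arrays store occurrence frequencies for all possible canonical k-mers. The size is approximately 4^k / 2; for even k the number of canonical k-mers is slightly higher, because palindromic k-mers such as AT are their own reverse complement.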
Kmer Objects: Lightweight objects containing string representation, count, and numerical index for efficient sorting
StringBuilder Optimization: Reused StringBuilder instance minimizes memory allocation during output formatting
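
As a rough check on how many canonical k-mers a count array has to cover, the following Python sketch counts them directly. The function name canonical_kmer_count is hypothetical, and the tool's actual array size may differ from this mathematical count.

```python
# Illustrative sketch only; the real array allocation may be sized differently.
def canonical_kmer_count(k):
    """Number of distinct k-mers after merging each with its reverse complement."""
    total = 4 ** k
    # Only even-length k-mers can equal their own reverse complement;
    # there are 4^(k/2) of them, fixed by choosing the first half freely.
    palindromes = 4 ** (k // 2) if k % 2 == 0 else 0
    return (total + palindromes) // 2

for k in range(1, 7):
    print(k, canonical_kmer_count(k))
```

For k=2 this gives 10 canonical dinucleotides rather than 8, since AT, TA, CG, and GC are each their own reverse complement.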

Processing Workflow

Initialization: Create k-mer index mapping and allocate count arrays based on k-mer length
Sequence Processing: For each sequence, slide a window of size k and count occurrences of each canonical k-mer
Frequency Ranking: Sort k-mers by frequency count in descending order
Output Generation: Format and output the top N most frequent k-mers per sequence
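
The workflow above can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the tool's implementation: the names top_kmers and format_line are hypothetical, windows containing non-ACGT characters are skipped, ties are broken alphabetically, and only k-mers that actually occur are reported. The real tool may handle each of these differently.

```python
# Illustrative sketch only; behavior for ties and ambiguous bases is assumed.
from collections import Counter

COMPLEMENT = str.maketrans("ACGT", "TGCA")

def canonical(kmer):
    """Lexicographically smaller of a k-mer and its reverse complement."""
    return min(kmer, kmer.translate(COMPLEMENT)[::-1])

def top_kmers(seq, k=2, display=3):
    """Slide a window of size k, count canonical k-mers, return the top ones."""
    seq = seq.upper()
    counts = Counter()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if set(kmer) <= set("ACGT"):  # assumption: skip ambiguous windows
            counts[canonical(kmer)] += 1
    # Rank by count descending; alphabetical order breaks ties (assumption).
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:display]

def format_line(seq_id, ranked, count=False):
    """Build one tab-separated output line, optionally as kmer=count."""
    fields = [f"{kmer}={n}" if count else kmer for kmer, n in ranked]
    return "\t".join([seq_id] + fields)

print(format_line("seq1", top_kmers("ACGTACGT", k=2, display=3), count=True))
```

In this example the windows AC and GT are counted together, because GT is the reverse complement of AC.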

Limitations and Considerations

K-mer Length Restriction: Limited to k-mers of length 0-12 to maintain reasonable memory usage (4^12 = 16,777,216 possible k-mers before reverse-complement merging)
Memory Scaling: Memory usage scales exponentially with k-mer length: the count array grows in proportion to 4^k (about 4^k / 2 entries once reverse complements are merged)
Single-threaded Processing: Current implementation processes sequences sequentially rather than in parallel

Performance Characteristics

Time complexity is O(L) per sequence where L is sequence length, with additional O(4^k log(4^k)) sorting cost per sequence. Memory usage is dominated by the O(4^k) count array, making this tool most efficient for short k-mers where detailed compositional analysis is needed.
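
To give a feel for the O(4^k) term, the Python sketch below computes an upper bound on the count array size for a few values of k. It assumes 4-byte integer counters and ignores the savings from reverse-complement merging; the function name count_array_upper_bound is hypothetical.

```python
# Illustrative sketch only; assumes 32-bit counters and no canonical merging.
BYTES_PER_INT = 4

def count_array_upper_bound(k):
    """Return (entries, bytes) for an array with one counter per possible k-mer."""
    entries = 4 ** k
    return entries, entries * BYTES_PER_INT

for k in (2, 4, 8, 12):
    entries, size = count_array_upper_bound(k)
    print(f"k={k:>2}  entries={entries:>10,}  bytes={size:>11,}")
```

Under these assumptions the k=12 case needs at most 67,108,864 bytes (64 MiB) for counters, while k=2 needs only 64 bytes.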

Output Format

The output format depends on the count parameter setting:

Without Counts (count=f, default)

sequence_id    kmer1    kmer2    kmer3    ...

Each line contains the sequence identifier followed by tab-separated k-mers in descending frequency order.

With Counts (count=t)

sequence_id    kmer1=count1    kmer2=count2    kmer3=count3    ...

Each k-mer is followed by an equals sign and its occurrence count within that sequence.

Example Output

# Without counts (count=f)
seq1    AT    GC    TA
seq2    CG    GC    AT

# With counts (count=t)
seq1    AT=15    GC=12    TA=8
seq2    CG=22    GC=18    AT=7
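
For downstream processing, lines in either format can be read back with a small parser. The Python sketch below assumes tab-separated fields as described above; the function name parse_line is hypothetical and not part of the tool.

```python
# Illustrative sketch only; assumes one tab-separated record per line.
def parse_line(line):
    """Split an output line into (sequence_id, [(kmer, count_or_None), ...])."""
    fields = line.rstrip("\n").split("\t")
    seq_id, entries = fields[0], fields[1:]
    kmers = []
    for entry in entries:
        kmer, sep, n = entry.partition("=")
        # Without count=t there is no "=", so the count is reported as None.
        kmers.append((kmer, int(n) if sep else None))
    return seq_id, kmers

print(parse_line("seq1\tAT=15\tGC=12\tTA=8"))
print(parse_line("seq2\tCG\tGC\tAT"))
```

Returning None for missing counts lets the same function handle output produced with count=f or count=t.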

Use Cases

Composition Analysis: Characterize nucleotide composition bias in sequences or sequence collections
Quality Assessment: Detect adapter sequences, contamination, or unusual sequence patterns
Comparative Genomics: Compare k-mer profiles between different samples or treatments
Sequence Classification: Use k-mer signatures for rapid sequence type identification
Preprocessing Step: Generate k-mer profiles for downstream machine learning or statistical analysis
Repeat Element Detection: Identify highly repetitive short motifs within sequences

Support

For questions and support:

Email: bbushnell@lbl.gov
Documentation: bbmap.org