CommonKmers
Prints the most common k-mers in each sequence. This is intended for short k-mers only!
Basic Usage
commonkmers.sh in=<file> out=<file>
This tool analyzes input sequences and identifies the most frequently occurring k-mers within each sequence. The output shows the sequence ID followed by the most common k-mers, optionally with their occurrence counts.
Parameters
Parameters control k-mer analysis settings, output formatting, and file handling options.
Core Parameters
- k=2
- K-mer length to analyze. Valid range: 0-12. Short k-mers (2-6) are most useful for compositional analysis, while longer k-mers provide more specific sequence patterns. Default is 2 (dinucleotides).
- display=3
- Number of most common k-mers to print per sequence. The tool will display this many of the highest-frequency k-mers for each input sequence, ranked by occurrence count.
- count=f
- Print the k-mer counts as well as the k-mer sequences. When set to true, output format becomes "kmer=count" instead of just "kmer". Useful for quantitative analysis of k-mer frequencies.
File Handling Parameters
- ow=f
- (overwrite) Overwrites files that already exist. Set to true to allow overwriting of existing output files without prompting.
- app=f
- (append) Append to files that already exist. When true, new results are added to the end of existing files rather than overwriting them.
- zl=4
- (ziplevel) Set compression level for output files. Range: 1 (fastest, least compression) to 9 (slowest, maximum compression). Level 4 provides a good balance of speed and compression ratio.
- qin=auto
- ASCII offset for input quality scores in FASTQ files. Options: 33 (Sanger/Illumina 1.8+), 64 (Illumina 1.3-1.7), or auto (automatic detection). Only affects FASTQ input processing.
Java Parameters
- -Xmx
- This will set Java's memory usage, overriding autodetection. -Xmx20g will specify 20 gigs of RAM, and -Xmx200m will specify 200 megs. The max is typically 85% of physical memory. Default allocation is 800MB for this tool.
- -eoom
- This flag will cause the process to exit if an out-of-memory exception occurs. Requires Java 8u92+. Useful for automated pipelines to prevent hanging on memory exhaustion.
- -da
- Disable assertions. Can provide minor performance improvement in production environments by skipping internal consistency checks.
Examples
Basic K-mer Analysis
commonkmers.sh in=sequences.fasta out=kmers.txt k=3 display=5
Analyze 3-mer frequencies in sequences, showing the top 5 most common trimers for each sequence.
Dinucleotide Composition with Counts
commonkmers.sh in=reads.fastq out=dinuc_profile.txt k=2 count=t display=4
Generate dinucleotide composition profiles with occurrence counts, showing the 4 most frequent dinucleotides per sequence.
Short K-mer Screening
commonkmers.sh in=contigs.fasta out=composition.txt k=4 display=10 count=t
Screen contigs for 4-mer composition bias by displaying the top 10 tetramers with their frequencies, useful for detecting repetitive elements or contamination.
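Explicit Quality Encoding
commonkmers.sh in=raw_reads.fq out=kmer_analysis.txt k=3 qin=33 display=6
Analyze k-mer composition in FASTQ data while stating the quality encoding explicitly (qin=33) instead of relying on autodetection, displaying the top 6 trimers per read. The qin setting only affects how the FASTQ input is parsed.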
Algorithm Details
K-mer Indexing Strategy
CommonKmers uses an efficient k-mer indexing approach optimized for short k-mers (0-12 nucleotides). The algorithm employs several key optimizations:
- Reverse Complement Normalization: K-mers and their reverse complements are treated as equivalent by always using the lexicographically smaller representation. This roughly halves the number of distinct k-mers to track (palindromic k-mers, which equal their own reverse complement, are the exception) and gives consistent results regardless of strand orientation.
- Binary Encoding: K-mers are encoded as binary integers using 2 bits per nucleotide (A=00, C=01, G=10, T=11), enabling fast bitwise operations for k-mer manipulation and comparison.
- Pre-computed Index Arrays: The tool pre-computes index mappings that group canonical k-mers with their reverse complements, eliminating redundant counting and ensuring consistent indexing.
- Dual Sorting Strategy: Results are sorted twice - first by numerical k-mer index to reset count arrays efficiently, then by frequency count using custom comparators to identify the most common k-mers.
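The binary encoding and reverse-complement normalization described above can be sketched as follows. This is a minimal illustration, not the tool's source code; the function names (encode, decode, reverse_complement, canonical) are hypothetical. With A=00, C=01, G=10, T=11, the numerically smaller of a k-mer and its reverse complement is also the lexicographically smaller one.

```python
# Illustrative sketch only; helper names are assumptions, not CommonKmers internals.
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}

def encode(kmer):
    """Pack a k-mer into an integer, 2 bits per base (A=00, C=01, G=10, T=11)."""
    value = 0
    for base in kmer:
        value = (value << 2) | CODE[base]
    return value

def decode(value, k):
    """Unpack an encoded k-mer of length k back into a string."""
    return "".join("ACGT"[(value >> (2 * i)) & 3] for i in reversed(range(k)))

def reverse_complement(value, k):
    """Reverse-complement an encoded k-mer with bitwise operations."""
    rc = 0
    for _ in range(k):
        rc = (rc << 2) | (3 - (value & 3))  # complement of base b is 3 - b
        value >>= 2
    return rc

def canonical(value, k):
    """Return the smaller of a k-mer and its reverse complement."""
    return min(value, reverse_complement(value, k))

print(decode(canonical(encode("GT"), 2), 2))
```

Here GT and its reverse complement AC collapse to the same canonical value, so both strands contribute to one count.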
Memory Management
The algorithm maintains separate data structures for k-mer counting and result formatting:
- Count Arrays: Fixed-size integer arrays (about 4^k / 2 entries, since each k-mer shares a slot with its reverse complement) store occurrence counts for all possible canonical k-mers
- Kmer Objects: Lightweight objects containing string representation, count, and numerical index for efficient sorting
- StringBuilder Optimization: Reused StringBuilder instance minimizes memory allocation during output formatting
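To see why the count array needs only about half of 4^k slots, the sketch below builds a compact index in which each k-mer shares a slot with its reverse complement. It is an assumption-laden illustration (build_index is a hypothetical name, and the real tool may lay out its arrays differently). Palindromic k-mers keep a slot of their own, so for even k the slot count is slightly above 4^k / 2.

```python
# Illustrative sketch only; build_index is a hypothetical helper.
def reverse_complement(value, k):
    """Reverse-complement a 2-bit-encoded k-mer (A=0, C=1, G=2, T=3)."""
    rc = 0
    for _ in range(k):
        rc = (rc << 2) | (3 - (value & 3))
        value >>= 2
    return rc

def build_index(k):
    """Map every encoded k-mer to a slot shared with its reverse complement.

    Returns (index, slots), where index[value] is the slot for that k-mer
    and slots is the number of canonical k-mers (the count-array length).
    """
    index = [-1] * (4 ** k)
    slots = 0
    for value in range(4 ** k):
        rc = reverse_complement(value, k)
        if value <= rc:
            index[value] = slots       # canonical form gets a new slot
            slots += 1
        else:
            index[value] = index[rc]   # rc < value, so its slot already exists
    return index, slots

for k in (1, 2, 3, 4):
    print(k, 4 ** k, build_index(k)[1])
```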
Processing Workflow
- Initialization: Create k-mer index mapping and allocate count arrays based on k-mer length
- Sequence Processing: For each sequence, slide a window of size k and count occurrences of each canonical k-mer
- Frequency Ranking: Sort k-mers by frequency count in descending order
- Output Generation: Format and output the top N most frequent k-mers per sequence
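The workflow above can be approximated in a few lines. This is a simplified sketch that uses strings and a Counter rather than the fixed count arrays described earlier; the function name common_kmers, the alphabetical tie-breaking, and the skipping of windows containing non-ACGT characters are all assumptions, not documented behavior of the tool.

```python
from collections import Counter

# Illustrative sketch only; not the CommonKmers implementation.
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def canonical(kmer):
    """Return the lexicographically smaller of a k-mer and its reverse complement."""
    rc = kmer.translate(COMPLEMENT)[::-1]
    return min(kmer, rc)

def common_kmers(seq, k=2, display=3, count=False):
    """Count canonical k-mers in one sequence and return the top `display`."""
    counts = Counter()
    for i in range(len(seq) - k + 1):          # slide a window of size k
        kmer = seq[i:i + k]
        if set(kmer) <= set("ACGT"):           # assumption: skip ambiguous bases
            counts[canonical(kmer)] += 1
    # Rank by descending count; ties broken alphabetically (an assumption).
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [f"{kmer}={n}" if count else kmer for kmer, n in ranked[:display]]

print("seq1", *common_kmers("ACGTACGT", k=2, display=3, count=True), sep="\t")
```

For ACGTACGT the windows AC and GT fold into the canonical k-mer AC, which is why it ranks first with a count of 4.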
Limitations and Considerations
- K-mer Length Restriction: Limited to k-mers of length 0-12 to maintain reasonable memory usage (4^12 = 16,777,216 possible k-mers at k=12)
- Memory Scaling: Memory usage grows exponentially with k-mer length: the count array needs on the order of 4^k integers (about half that when reverse complements share a slot)
- Single-threaded Processing: Current implementation processes sequences sequentially rather than in parallel
Performance Characteristics
Time complexity is O(L) per sequence where L is sequence length, with additional O(4^k log(4^k)) sorting cost per sequence. Memory usage is dominated by the O(4^k) count array, making this tool most efficient for short k-mers where detailed compositional analysis is needed.
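As a rough worked example of the O(4^k) memory term, the sketch below computes an upper bound on the count-array size, assuming one 4-byte integer per possible k-mer (Java's int is 4 bytes; the helper name count_array_bytes is hypothetical). At k=12 this comes to 16,777,216 slots, or 64 MiB, before any savings from merging reverse complements.

```python
# Illustrative arithmetic only; count_array_bytes is a hypothetical helper.
def count_array_bytes(k, bytes_per_int=4):
    """Upper bound on count-array size: one integer per possible k-mer (4^k)."""
    return (4 ** k) * bytes_per_int

for k in (2, 4, 8, 12):
    slots = 4 ** k
    print(f"k={k}: {slots:,} slots, {count_array_bytes(k):,} bytes")
```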
Output Format
The output format depends on the count parameter setting:
Without Counts (count=f, default)
sequence_id kmer1 kmer2 kmer3 ...
Each line contains the sequence identifier followed by tab-separated k-mers in descending frequency order.
With Counts (count=t)
sequence_id kmer1=count1 kmer2=count2 kmer3=count3 ...
Each k-mer is followed by an equals sign and its occurrence count within that sequence.
Example Output
# Without counts (count=f)
seq1 AT GC TA
seq2 CG GC AT
# With counts (count=t)
seq1 AT=15 GC=12 TA=8
seq2 CG=22 GC=18 AT=7
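For downstream analysis, output lines in either format can be read back with a small parser. This is a sketch that assumes fields are tab-separated as described above; parse_line is a hypothetical helper, not part of BBTools.

```python
# Illustrative sketch only; assumes tab-separated fields.
def parse_line(line):
    """Split one output line into (sequence_id, [(kmer, count_or_None), ...])."""
    fields = line.rstrip("\n").split("\t")
    seq_id = fields[0]
    kmers = []
    for entry in fields[1:]:
        if not entry:                      # tolerate a trailing tab
            continue
        kmer, sep, n = entry.partition("=")
        kmers.append((kmer, int(n) if sep else None))
    return seq_id, kmers

print(parse_line("seq1\tAT=15\tGC=12\tTA=8\n"))
print(parse_line("seq2\tCG\tGC\tAT\n"))
```

Counts come back as integers when count=t was used and as None otherwise, so the same parser handles both layouts.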
Use Cases
- Composition Analysis: Characterize nucleotide composition bias in sequences or sequence collections
- Quality Assessment: Detect adapter sequences, contamination, or unusual sequence patterns
- Comparative Genomics: Compare k-mer profiles between different samples or treatments
- Sequence Classification: Use k-mer signatures for rapid sequence type identification
- Preprocessing Step: Generate k-mer profiles for downstream machine learning or statistical analysis
- Repeat Element Detection: Identify highly repetitive short motifs within sequences
Support
For questions and support:
- Email: bbushnell@lbl.gov
- Documentation: bbmap.org