KmerPosition
Counts positional occurrences of reference kmers in reads. This tool analyzes high-throughput sequencing reads and reports the positions where matching kmers are found relative to a reference sequence, enabling identification of over-representation of kmers at particular positions in reads.
Description
KmerPosition reads files of high-throughput read sequences and a reference sequence file, then reports the positions in the reads that are the start of a matching kmer sequence between the read and the reference sequence. This is useful for identifying over-representation of kmers at a particular position in reads.
Example Analysis
For a read sequence ACGTA
and reference sequence ATGTACC
with kmer length 3, the matching kmer GTA
begins in the read at position 2 (zero-indexed). The tool returns information about positions, number of kmers beginning at that position, and percentage of reads with kmers beginning at those positions.
Basic Usage
kmerposition.sh in=<input file> out=<output file> ref=<reference file> k=<kmer length>
Input may be FASTA or FASTQ format, compressed or uncompressed.
Parameters
Parameters are organized by their function in the kmer positioning analysis process.
Standard Parameters
- in=<file>
- Primary input file for high-throughput read sequences, or read 1 input if using paired-end reads.
- in2=<file>
- Read 2 input file. Only use if reads are in two separate files for paired-end sequencing.
- ref=<file>
- Reference sequence file. This file should be in FASTA format and contain reference sequences you wish to be identified in the read files.
- out=<file>
- Output file name. This file will contain all output statistics of kmer positioning and counts in tab-delimited format with columns: position, read1_count, read1_percentage, read2_count, read2_percentage.
Processing Parameters
- k=19
- Kmer length for analysis. This determines the size of sequence fragments that will be compared between reads and reference.
- rcomp=t
- If true, also match for reverse-complements of kmers. This ensures that kmers are found regardless of strand orientation.
Java Parameters
- -Xmx
- This will set Java's memory usage, overriding autodetection. -Xmx20g will specify 20 gigs of RAM, and -Xmx200m will specify 200 megs. The max is typically 85% of physical memory.
- -eoom
- This flag will cause the process to exit if an out-of-memory exception occurs. Requires Java 8u92+.
- -da
- Disable assertions.
Output Format
The output file contains tab-delimited columns with the following information:
- #pos - Position in the read (0-indexed)
- read1_count - Number of kmers found at this position in read 1
- read1_perc - Percentage of reads with kmers at this position in read 1
- read2_count - Number of kmers found at this position in read 2 (if paired-end)
- read2_perc - Percentage of reads with kmers at this position in read 2 (if paired-end)
Examples
Basic Single-End Analysis
kmerposition.sh in=reads.fastq ref=reference.fasta out=positions.txt k=21
Analyzes single-end reads against a reference using 21-mers, outputting positional kmer statistics.
Paired-End Analysis
kmerposition.sh in=reads_R1.fastq in2=reads_R2.fastq ref=reference.fasta out=positions.txt k=15
Analyzes paired-end reads against a reference using 15-mers, providing separate statistics for both read pairs.
Forward Strand Only Analysis
kmerposition.sh in=reads.fastq ref=reference.fasta out=positions.txt k=19 rcomp=f
Analyzes reads using default 19-mer length but only matches forward strand kmers (no reverse-complement matching).
Algorithm Details
KmerPosition uses a 2-bit binary encoding system for kmer comparison implemented through the AminoAcid.baseToNumber lookup table and bit manipulation operations.
Binary Kmer Encoding
The tool converts nucleotide sequences into compact binary representations using the following fixed encoding scheme:
- A → 00 (binary)
- C → 01 (binary)
- G → 10 (binary)
- T → 11 (binary)
- N → -1 (degenerate base marker)
Sliding Window Processing
The processRead() method implements sliding window kmer construction using three bit operations:
- Left Shift: kmer=((kmer<<2) - shifts current kmer left by 2 bits, dropping the oldest nucleotide
- OR Operation: |x) - ORs new nucleotide value into the rightmost 2 bits
- Masking: &mask - applies bitmask to maintain exactly k*2 bits for the kmer length
Reference Kmer Loading
The loadReference() method processes reference sequences through the addToSet() method:
- All kmers of length k are extracted from reference sequences using sliding window
- Degenerate bases (N) cause len=0 and kmer=0 to restart construction
- If rcomp=true, AminoAcid.reverseComplementBinaryFast() adds reverse-complement kmers
- LongHashSet.add() stores encoded kmers as long integers for hash table lookup
Read Processing Strategy
The processRead() method processes each read through the following steps:
- Iterates over bases array converting sequence to binary kmers using sliding window
- Uses LongHashSet.contains() for hash table lookup against reference kmer set
- Calls totalCounts.increment(i - k + 1) to record read coverage at kmer start positions
- Calls matchCounts.increment(i - k + 1) when LongHashSet.contains() returns true
- Maintains separate LongList objects (matchCounts1, matchCounts2) for paired-end data
Performance Characteristics
The binary encoding approach provides:
- Memory Efficiency: 2 bits per nucleotide vs 8 bits for character representation
- Fast Comparison: Integer comparison vs string operations
- Cache Friendly: Compact representation improves cache performance
- Scalability: Efficient processing of large datasets with minimal memory overhead
Concurrent Processing
The tool uses ConcurrentReadInputStream for efficient I/O with configurable batch sizes (default 200 reads per batch) to minimize inter-thread communication while maintaining high throughput.
Technical Notes
- Positions are reported as 0-indexed starting positions of kmers
- Degenerate bases (N) in sequences cause kmer construction to restart
- The tool handles sequences of varying lengths by tracking read coverage at each position
- Output percentages are calculated as (matches at position) / (total reads covering position) × 100
- For paired-end data, statistics are calculated independently for each read in the pair
- Maximum kmer length is limited by Java long integer size (31 nucleotides)
Applications
KmerPosition is particularly useful for:
- Adapter Detection: Identifying adapter sequences enriched at read ends
- Contamination Analysis: Finding contaminating sequences at specific read positions
- Quality Assessment: Detecting positional biases in sequencing data
- Primer Analysis: Locating primer sequences in amplicon sequencing
- Insert Size Estimation: Analyzing paired-end read overlap patterns
- Motif Discovery: Identifying enriched sequence motifs at specific positions
Support
For questions and support:
- Email: bbushnell@lbl.gov
- Documentation: bbmap.org