KmerPosition

Script: kmerposition.sh Package: jasper Class: KmerPosition3.java Author: Jasper Toscani Field

Counts positional occurrences of reference kmers in reads. This tool analyzes high-throughput sequencing reads and reports the positions where matching kmers are found relative to a reference sequence, enabling identification of over-representation of kmers at particular positions in reads.

Description

KmerPosition reads files of high-throughput read sequences and a reference sequence file, then reports the positions in the reads that are the start of a matching kmer sequence between the read and the reference sequence. This is useful for identifying over-representation of kmers at a particular position in reads.

Example Analysis

For a read sequence ACGTA and reference sequence ATGTACC with kmer length 3, the matching kmer GTA begins in the read at position 2 (zero-indexed). The tool returns information about positions, number of kmers beginning at that position, and percentage of reads with kmers beginning at those positions.

Basic Usage

kmerposition.sh in=<input file> out=<output file> ref=<reference file> k=<kmer length>

Input may be FASTA or FASTQ format, compressed or uncompressed.

Parameters

Parameters are organized by their function in the kmer positioning analysis process.

Standard Parameters

in=<file>: Primary input file for high-throughput read sequences, or read 1 input if using paired-end reads.
in2=<file>: Read 2 input file. Only use if reads are in two separate files for paired-end sequencing.
ref=<file>: Reference sequence file. This file should be in FASTA format and contain reference sequences you wish to be identified in the read files.
out=<file>: Output file name. This file will contain all output statistics of kmer positioning and counts in tab-delimited format with columns: position, read1_count, read1_percentage, read2_count, read2_percentage.

Processing Parameters

k=19: Kmer length for analysis. This determines the size of sequence fragments that will be compared between reads and reference.
rcomp=t: If true, also match for reverse-complements of kmers. This ensures that kmers are found regardless of strand orientation.

Java Parameters

-Xmx: This will set Java's memory usage, overriding autodetection. -Xmx20g will specify 20 gigs of RAM, and -Xmx200m will specify 200 megs. The max is typically 85% of physical memory.
-eoom: This flag will cause the process to exit if an out-of-memory exception occurs. Requires Java 8u92+.
-da: Disable assertions.

Output Format

The output file contains tab-delimited columns with the following information:

#pos - Position in the read (0-indexed)
read1_count - Number of kmers found at this position in read 1
read1_perc - Percentage of reads with kmers at this position in read 1
read2_count - Number of kmers found at this position in read 2 (if paired-end)
read2_perc - Percentage of reads with kmers at this position in read 2 (if paired-end)

Examples

Basic Single-End Analysis

kmerposition.sh in=reads.fastq ref=reference.fasta out=positions.txt k=21

Analyzes single-end reads against a reference using 21-mers, outputting positional kmer statistics.

Paired-End Analysis

kmerposition.sh in=reads_R1.fastq in2=reads_R2.fastq ref=reference.fasta out=positions.txt k=15

Analyzes paired-end reads against a reference using 15-mers, providing separate statistics for both read pairs.

Forward Strand Only Analysis

kmerposition.sh in=reads.fastq ref=reference.fasta out=positions.txt k=19 rcomp=f

Analyzes reads using default 19-mer length but only matches forward strand kmers (no reverse-complement matching).

Algorithm Details

KmerPosition uses a 2-bit binary encoding system for kmer comparison implemented through the AminoAcid.baseToNumber lookup table and bit manipulation operations.

Binary Kmer Encoding

The tool converts nucleotide sequences into compact binary representations using the following fixed encoding scheme:

A → 00 (binary)
C → 01 (binary)
G → 10 (binary)
T → 11 (binary)
N → -1 (degenerate base marker)

Sliding Window Processing

The processRead() method implements sliding window kmer construction using three bit operations:

Left Shift: kmer=((kmer<<2) - shifts current kmer left by 2 bits, dropping the oldest nucleotide
OR Operation: |x) - ORs new nucleotide value into the rightmost 2 bits
Masking: &mask - applies bitmask to maintain exactly k*2 bits for the kmer length

Reference Kmer Loading

The loadReference() method processes reference sequences through the addToSet() method:

All kmers of length k are extracted from reference sequences using sliding window
Degenerate bases (N) cause len=0 and kmer=0 to restart construction
If rcomp=true, AminoAcid.reverseComplementBinaryFast() adds reverse-complement kmers
LongHashSet.add() stores encoded kmers as long integers for hash table lookup

Read Processing Strategy

The processRead() method processes each read through the following steps:

Iterates over bases array converting sequence to binary kmers using sliding window
Uses LongHashSet.contains() for hash table lookup against reference kmer set
Calls totalCounts.increment(i - k + 1) to record read coverage at kmer start positions
Calls matchCounts.increment(i - k + 1) when LongHashSet.contains() returns true
Maintains separate LongList objects (matchCounts1, matchCounts2) for paired-end data

Performance Characteristics

The binary encoding approach provides:

Memory Efficiency: 2 bits per nucleotide vs 8 bits for character representation
Fast Comparison: Integer comparison vs string operations
Cache Friendly: Compact representation improves cache performance
Scalability: Efficient processing of large datasets with minimal memory overhead

Concurrent Processing

The tool uses ConcurrentReadInputStream for efficient I/O with configurable batch sizes (default 200 reads per batch) to minimize inter-thread communication while maintaining high throughput.

Technical Notes

Positions are reported as 0-indexed starting positions of kmers
Degenerate bases (N) in sequences cause kmer construction to restart
The tool handles sequences of varying lengths by tracking read coverage at each position
Output percentages are calculated as (matches at position) / (total reads covering position) × 100
For paired-end data, statistics are calculated independently for each read in the pair
Maximum kmer length is limited by Java long integer size (31 nucleotides)

Applications

KmerPosition is particularly useful for:

Adapter Detection: Identifying adapter sequences enriched at read ends
Contamination Analysis: Finding contaminating sequences at specific read positions
Quality Assessment: Detecting positional biases in sequencing data
Primer Analysis: Locating primer sequences in amplicon sequencing
Insert Size Estimation: Analyzing paired-end read overlap patterns
Motif Discovery: Identifying enriched sequence motifs at specific positions

Support

For questions and support:

Email: bbushnell@lbl.gov
Documentation: bbmap.org