KmerPosition

Script: kmerposition.sh
Package: jasper
Class: KmerPosition3.java
Author: Jasper Toscani Field

Counts positional occurrences of reference kmers in reads. This tool analyzes high-throughput sequencing reads and reports the positions where matching kmers are found relative to a reference sequence, enabling identification of over-representation of kmers at particular positions in reads.

Description

KmerPosition takes one or two files of high-throughput reads and a reference sequence file, then reports every read position at which a kmer shared by the read and the reference begins. This makes it possible to see whether reference kmers are over-represented at a particular position in the reads.

Example Analysis

For a read sequence ACGTA and reference sequence ATGTACC with kmer length 3, the only shared kmer is GTA, which begins in the read at position 2 (zero-indexed). For each position, the tool reports the number of matching kmers beginning there and the percentage of reads with a matching kmer beginning there.
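The worked example above can be reproduced with a short Python sketch. It is string-based and forward-strand only, and it is not the tool's Java implementation; the function name matching_positions is hypothetical.

```python
def matching_positions(read, ref, k):
    """Return zero-indexed read positions where a k-mer shared with ref begins."""
    # Every k-mer present in the reference (forward strand only in this sketch).
    ref_kmers = {ref[i:i + k] for i in range(len(ref) - k + 1)}
    # Slide over the read and keep the start index of each matching k-mer.
    return [i for i in range(len(read) - k + 1) if read[i:i + k] in ref_kmers]

positions = matching_positions("ACGTA", "ATGTACC", 3)
print(positions)  # [2]
```

The read kmers are ACG, CGT and GTA; only GTA also occurs in the reference, so position 2 is the single hit.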

Basic Usage

kmerposition.sh in=<input file> out=<output file> ref=<reference file> k=<kmer length>

Input may be FASTA or FASTQ format, compressed or uncompressed.

Parameters

Parameters are organized by their function in the kmer positioning analysis process.

Standard Parameters

in=<file>
Primary input file for high-throughput read sequences, or read 1 input if using paired-end reads.
in2=<file>
Read 2 input file. Only use if reads are in two separate files for paired-end sequencing.
ref=<file>
Reference sequence file. This file should be in FASTA format and contain the reference sequences you want to find in the reads.
out=<file>
Output file name. This file will contain the kmer position statistics in tab-delimited format with columns: position, read1_count, read1_percentage, read2_count, read2_percentage.

Processing Parameters

k=19
Kmer length for analysis. This determines the size of sequence fragments that will be compared between reads and reference.
rcomp=t
If true, also match for reverse-complements of kmers. This ensures that kmers are found regardless of strand orientation.
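The effect of rcomp can be illustrated with a small Python sketch. It assumes the reverse complements are stored alongside the forward reference kmers; whether the tool does this at load time or at lookup time is not stated here. The helper names are hypothetical.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]

def reference_kmers(ref, k, rcomp=True):
    """Collect reference k-mers, optionally with their reverse complements."""
    kmers = set()
    for i in range(len(ref) - k + 1):
        kmer = ref[i:i + k]
        kmers.add(kmer)
        if rcomp:
            kmers.add(reverse_complement(kmer))
    return kmers

# GTA is in the reference; TAC is its reverse complement.
print("TAC" in reference_kmers("GGTAG", 3, rcomp=True))   # True
print("TAC" in reference_kmers("GGTAG", 3, rcomp=False))  # False
```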

Java Parameters

-Xmx
This will set Java's memory usage, overriding autodetection. -Xmx20g will specify 20 gigs of RAM, and -Xmx200m will specify 200 megs. The max is typically 85% of physical memory.
-eoom
This flag will cause the process to exit if an out-of-memory exception occurs. Requires Java 8u92+.
-da
Disable assertions.

Output Format

The output file contains tab-delimited columns with the following information: position, read1_count, read1_percentage, read2_count, read2_percentage.
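A downstream script might load the output along these lines. This is a hedged sketch: the column names come from the out= parameter description, but the presence of header lines beginning with '#' and the exact number formatting are assumptions, so values are kept as strings. The sample text contains made-up values that exist only to exercise the parser.

```python
import csv
import io

COLUMNS = ["position", "read1_count", "read1_percentage",
           "read2_count", "read2_percentage"]

def parse_positions(handle):
    """Parse tab-delimited rows into dicts keyed by the documented column names."""
    rows = []
    for fields in csv.reader(handle, delimiter="\t"):
        # Assumption: blank lines and lines starting with '#' carry no data.
        if not fields or fields[0].startswith("#"):
            continue
        # zip() stops at the shorter sequence, so rows with fewer columns still parse.
        rows.append(dict(zip(COLUMNS, fields)))
    return rows

# Made-up sample, not real tool output.
sample = io.StringIO("#pos\tr1\tr1pct\tr2\tr2pct\n0\t5\t2.5\t3\t1.5\n")
rows = parse_positions(sample)
print(rows[0]["position"], rows[0]["read1_count"])  # 0 5
```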

Examples

Basic Single-End Analysis

kmerposition.sh in=reads.fastq ref=reference.fasta out=positions.txt k=21

Analyzes single-end reads against a reference using 21-mers, outputting positional kmer statistics.

Paired-End Analysis

kmerposition.sh in=reads_R1.fastq in2=reads_R2.fastq ref=reference.fasta out=positions.txt k=15

Analyzes paired-end reads against a reference using 15-mers, providing separate statistics for read 1 and read 2.

Forward Strand Only Analysis

kmerposition.sh in=reads.fastq ref=reference.fasta out=positions.txt k=19 rcomp=f

Analyzes reads using 19-mers (the default length) but matches only forward-strand kmers (no reverse-complement matching).

Algorithm Details

KmerPosition compares kmers using a 2-bit binary encoding, implemented through the AminoAcid.baseToNumber lookup table and bit manipulation operations.

Binary Kmer Encoding

The tool converts nucleotide sequences into compact binary representations using the following fixed encoding scheme:
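A 2-bit encoding can be sketched in Python as follows. The mapping A=0, C=1, G=2, T=3 is an assumption made for illustration, since the scheme itself is not listed here, and encode_kmer and decode_kmer are hypothetical helpers rather than methods of the tool.

```python
# Assumed mapping for illustration; the tool's actual table is AminoAcid.baseToNumber.
BASE_TO_NUMBER = {"A": 0, "C": 1, "G": 2, "T": 3}
NUMBER_TO_BASE = "ACGT"

def encode_kmer(kmer):
    """Pack a k-mer into an integer, 2 bits per base, first base in the highest bits."""
    value = 0
    for base in kmer:
        value = (value << 2) | BASE_TO_NUMBER[base]
    return value

def decode_kmer(value, k):
    """Unpack an integer produced by encode_kmer back into a k-length string."""
    return "".join(
        NUMBER_TO_BASE[(value >> (2 * (k - 1 - i))) & 3] for i in range(k)
    )

code = encode_kmer("GTA")
print(code, bin(code))       # 44 0b101100
print(decode_kmer(code, 3))  # GTA
```

Packing kmers into integers lets a set lookup compare whole kmers in one operation instead of comparing strings character by character.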

Sliding Window Processing

The processRead() method builds each kmer incrementally with the expression kmer=((kmer<<2)|x)&mask, which combines three bit operations:

  1. Left Shift: kmer<<2 shifts the current kmer left by 2 bits, making room for the next nucleotide
  2. OR Operation: |x ORs the new nucleotide value x into the rightmost 2 bits
  3. Masking: &mask keeps only the lowest k*2 bits, which discards the oldest nucleotide and holds the kmer at exactly length k
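The three operations can be seen working together in the Python sketch below. It assumes the mapping A=0, C=1, G=2, T=3 and assumes that a non-ACGT base restarts the window; both are illustrative choices, and rolling_kmers is a hypothetical name.

```python
BASE_TO_NUMBER = {"A": 0, "C": 1, "G": 2, "T": 3}  # assumed mapping

def rolling_kmers(seq, k):
    """Yield (start_position, encoded_kmer) for every complete k-mer in seq."""
    mask = (1 << (2 * k)) - 1  # k*2 low bits set
    kmer = 0
    length = 0                 # number of valid bases currently in the window
    for i, base in enumerate(seq):
        x = BASE_TO_NUMBER.get(base, -1)
        if x < 0:
            # Assumption: an ambiguous base such as N resets the window.
            kmer, length = 0, 0
            continue
        kmer = ((kmer << 2) | x) & mask  # shift, OR in the new base, mask
        length += 1
        if length >= k:
            yield i - k + 1, kmer

print(list(rolling_kmers("ACGTA", 3)))  # [(0, 6), (1, 27), (2, 44)]
```

Each new kmer costs a constant number of operations regardless of k, because the previous kmer is updated rather than rebuilt.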

Reference Kmer Loading

The loadReference() method processes reference sequences through the addToSet() method:
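Reference loading might look roughly like the Python sketch below. The functions load_reference and add_to_set only mirror the method names mentioned above; their bodies are assumptions, including the choice to skip kmers containing non-ACGT characters and to store reverse complements in the same set. String kmers are used instead of packed integers to keep the sketch short.

```python
import io

_COMPLEMENT = str.maketrans("ACGT", "TGCA")

def read_fasta(handle):
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line.upper())
    if header is not None:
        yield header, "".join(chunks)

def add_to_set(seq, k, kmer_set, rcomp=True):
    """Add every fully defined k-mer of seq (and optionally its reverse complement)."""
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if set(kmer) <= set("ACGT"):  # assumption: skip k-mers with N or other symbols
            kmer_set.add(kmer)
            if rcomp:
                kmer_set.add(kmer.translate(_COMPLEMENT)[::-1])

def load_reference(handle, k, rcomp=True):
    """Build the set of reference k-mers from all sequences in a FASTA handle."""
    kmer_set = set()
    for _, seq in read_fasta(handle):
        add_to_set(seq, k, kmer_set, rcomp)
    return kmer_set

ref = io.StringIO(">ref\nATGTACC\n")
print(sorted(load_reference(ref, 3, rcomp=False)))
# ['ACC', 'ATG', 'GTA', 'TAC', 'TGT']
```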

Read Processing Strategy

The processRead() method processes each read through the following steps:
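The per-position tallying can be sketched in Python as below. It assumes the percentage is the count at a position divided by the total number of reads, following the description in the Example Analysis; count_positions and as_percentages are hypothetical names, and the reference kmers are passed in as a ready-made set of strings.

```python
from collections import Counter

def count_positions(reads, ref_kmers, k):
    """Count, per zero-indexed start position, the reads with a matching k-mer there."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            if read[i:i + k] in ref_kmers:
                counts[i] += 1
    return counts

def as_percentages(counts, total_reads):
    """Convert position counts to percentages of all reads (assumed denominator)."""
    return {pos: 100.0 * n / total_reads for pos, n in sorted(counts.items())}

reads = ["ACGTA", "TTGTA", "CCCCC"]
counts = count_positions(reads, {"GTA"}, 3)
print(dict(counts))  # {2: 2}
print(as_percentages(counts, len(reads)))
```

Two of the three reads contain GTA starting at position 2, so that position accounts for two thirds of the reads.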

Performance Characteristics

The binary encoding approach provides:

Concurrent Processing

The tool uses ConcurrentReadInputStream for efficient I/O with configurable batch sizes (default 200 reads per batch) to minimize inter-thread communication while maintaining high throughput.
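The idea of handing reads to workers in fixed-size batches can be illustrated with the Python sketch below. It does not reproduce the ConcurrentReadInputStream API; the function name batches is hypothetical and only the default of 200 reads per batch is taken from the text.

```python
from itertools import islice

def batches(reads, batch_size=200):
    """Yield successive lists of up to batch_size reads from any iterable."""
    iterator = iter(reads)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch

sizes = [len(b) for b in batches(range(450))]
print(sizes)  # [200, 200, 50]
```

Passing a list of reads per hand-off, rather than one read at a time, means a worker synchronizes with the reader once per batch instead of once per read.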

Technical Notes

Applications

KmerPosition is particularly useful for:

Support

For questions and support: