TetramerFreq

Script: tetramerfreq.sh Package: jgi Class: TetramerFrequencies.java

DNA Tetramer analysis. DNA tetramers are counted for each sub-sequence of window size in the sequence. The window slides along the sequence by the step length. Sub-sequence shorter than the window size is ignored. Tetramers containing N are ignored.

Basic Usage

tetramerfreq.sh in=<input file> out=<output file> step=500 window=2000

Input may be fasta or fastq, compressed or uncompressed.

Parameters

TetramerFreq parameters are organized into standard parameters for controlling the analysis and Java parameters for memory management.

Standard parameters

in=<file>
DNA sequence input file. Accepts fasta or fastq format, compressed or uncompressed.
out=<file>
Output file name. Tab-delimited file containing tetramer frequencies for each window. Defaults to stdout if not specified.
step=INT
Step size - distance between the start positions of consecutive windows (default 500). Smaller step sizes provide higher resolution but increase computation time and output size.
window=INT
Window size - length of sequence analyzed in each window (default 2000). Values ≤0 turn windowing off and analyze entire sequences. For short reads, use window=0.
short=T/F
Print lines for sequences shorter than window size (default false). When true, sequences shorter than the window are still analyzed and output.
k=INT
Kmer length for analysis (default 4). Standard tetramer analysis uses k=4, but other kmer lengths are supported. Larger k values require exponentially more memory.
gc
Print a GC column in the output showing GC content for each analyzed window.
float
Output kmer frequencies as decimal fractions instead of raw counts. Frequencies sum to 1.0 for each window.
comp
Output GC-compensated kmer frequencies. Adjusts frequencies to account for GC bias, normalizing for compositional differences.

Java Parameters

-Xmx
This will set Java's memory usage, overriding autodetection. -Xmx20g will specify 20 gigs of RAM, and -Xmx200m will specify 200 megs. The max is typically 85% of physical memory.
-eoom
This flag will cause the process to exit if an out-of-memory exception occurs. Requires Java 8u92+.
-da
Disable assertions.

Examples

Basic Tetramer Analysis

tetramerfreq.sh in=genome.fasta out=tetramers.txt

Analyzes genome.fasta using default 2kb windows with 500bp steps, outputting tetramer counts for each window.

High Resolution Analysis

tetramerfreq.sh in=sequence.fq out=freq.txt step=100 window=1000

Uses smaller windows (1kb) and steps (100bp) for higher resolution analysis of sequence composition changes.

Whole Sequence Analysis

tetramerfreq.sh in=reads.fq out=composition.txt window=0 short=t

Analyzes entire sequences without windowing - useful for short reads or when you want overall composition per sequence.

GC-Compensated Frequencies

tetramerfreq.sh in=genome.fa out=normalized.txt comp=t float=t gc=t

Outputs GC-compensated tetramer frequencies as decimals, with GC content column for bias correction analysis.

Hexamer Analysis

tetramerfreq.sh in=sequence.fa out=hexamers.txt k=6

Analyzes 6-mers instead of tetramers. Note that hexamer analysis requires significantly more memory (4^6 = 4,096 kmers vs 4^4 = 256 tetramers).

Algorithm Details

TetramerFrequencies implements a sliding window kmer counting algorithm using ArrayBlockingQueue-based producer-consumer pattern for genome composition analysis:

Sliding Window Strategy

The algorithm slides a window of specified size across each input sequence, advancing by the step size. For each window position:

Canonical Kmer Representation

Uses canonical kmers to avoid double-counting reverse complements. For each kmer, the lexicographically smaller of the kmer and its reverse complement is stored. This reduces memory usage by ~50% and provides consistent results regardless of sequence orientation.

2-Bit Kmer Encoding

Kmers are encoded as integers using 2-bit-per-base representation:

Memory Management Strategy

The implementation uses thread-local int arrays and canonical kmer mapping:

GC Compensation Algorithm

When GC compensation is enabled, Vector.compensate(counts, k, gcmap) adjusts frequencies using BinObject.gcmap() mapping to normalize for compositional bias. This is particularly useful for:

Multithreading Architecture

Uses ArrayBlockingQueue producer-consumer pattern with Shared.capThreads(16) limit:

Performance Characteristics

Output Format

The output is a tab-delimited file with the following structure:

Output Columns

Output Modes

Sample Output

scaffold    range       GC      AAAA    AAAC    AAAG    AAAT    ...
seq1        1-2000      0.4200  45      23      67      89      ...
seq1        501-2500    0.4150  43      25      65      91      ...

Support

For questions and support: