TetramerFreq

Script: tetramerfreq.sh Package: jgi Class: TetramerFrequencies.java

DNA Tetramer analysis. DNA tetramers are counted for each sub-sequence of window size in the sequence. The window slides along the sequence by the step length. Sub-sequence shorter than the window size is ignored. Tetramers containing N are ignored.

Basic Usage

tetramerfreq.sh in=<input file> out=<output file> step=500 window=2000

Input may be fasta or fastq, compressed or uncompressed.

Parameters

TetramerFreq parameters are organized into standard parameters for controlling the analysis and Java parameters for memory management.

Standard parameters

in=<file>: DNA sequence input file. Accepts fasta or fastq format, compressed or uncompressed.
out=<file>: Output file name. Tab-delimited file containing tetramer frequencies for each window. Defaults to stdout if not specified.
step=INT: Step size - distance between the start positions of consecutive windows (default 500). Smaller step sizes provide higher resolution but increase computation time and output size.
window=INT: Window size - length of sequence analyzed in each window (default 2000). Values ≤0 turn windowing off and analyze entire sequences. For short reads, use window=0.
short=T/F: Print lines for sequences shorter than window size (default false). When true, sequences shorter than the window are still analyzed and output.
k=INT: Kmer length for analysis (default 4). Standard tetramer analysis uses k=4, but other kmer lengths are supported. Larger k values require exponentially more memory.
gc: Print a GC column in the output showing GC content for each analyzed window.
float: Output kmer frequencies as decimal fractions instead of raw counts. Frequencies sum to 1.0 for each window.
comp: Output GC-compensated kmer frequencies. Adjusts frequencies to account for GC bias, normalizing for compositional differences.

Java Parameters

-Xmx: This will set Java's memory usage, overriding autodetection. -Xmx20g will specify 20 gigs of RAM, and -Xmx200m will specify 200 megs. The max is typically 85% of physical memory.
-eoom: This flag will cause the process to exit if an out-of-memory exception occurs. Requires Java 8u92+.
-da: Disable assertions.

Examples

Basic Tetramer Analysis

tetramerfreq.sh in=genome.fasta out=tetramers.txt

Analyzes genome.fasta using default 2kb windows with 500bp steps, outputting tetramer counts for each window.

High Resolution Analysis

tetramerfreq.sh in=sequence.fq out=freq.txt step=100 window=1000

Uses smaller windows (1kb) and steps (100bp) for higher resolution analysis of sequence composition changes.

Whole Sequence Analysis

tetramerfreq.sh in=reads.fq out=composition.txt window=0 short=t

Analyzes entire sequences without windowing - useful for short reads or when you want overall composition per sequence.

GC-Compensated Frequencies

tetramerfreq.sh in=genome.fa out=normalized.txt comp=t float=t gc=t

Outputs GC-compensated tetramer frequencies as decimals, with GC content column for bias correction analysis.

Hexamer Analysis

tetramerfreq.sh in=sequence.fa out=hexamers.txt k=6

Analyzes 6-mers instead of tetramers. Note that hexamer analysis requires significantly more memory (4^6 = 4,096 kmers vs 4^4 = 256 tetramers).

Algorithm Details

TetramerFrequencies implements a sliding window kmer counting algorithm using ArrayBlockingQueue-based producer-consumer pattern for genome composition analysis:

Sliding Window Strategy

The algorithm slides a window of specified size across each input sequence, advancing by the step size. For each window position:

Extracts the subsequence within the window boundaries
Counts all k-mers (default tetramers) using canonical representation
Calculates GC content if requested
Outputs results for the current window

Canonical Kmer Representation

Uses canonical kmers to avoid double-counting reverse complements. For each kmer, the lexicographically smaller of the kmer and its reverse complement is stored. This reduces memory usage by ~50% and provides consistent results regardless of sequence orientation.

2-Bit Kmer Encoding

Kmers are encoded as integers using 2-bit-per-base representation:

A=0, C=1, G=2, T=3
Tetramers fit in 8 bits (4 bases × 2 bits)
Bit-shifting operations enable rapid kmer updates
N bases reset the kmer counter to handle ambiguous nucleotides

Memory Management Strategy

The implementation uses thread-local int arrays and canonical kmer mapping:

BinObject.makeRemap() creates canonical kmer mapping arrays
Thread-local int[canonicalKmers] count arrays in PrintThread instances
ByteStreamWriter with ByteBuilder buffers for streaming output
ArrayBlockingQueue(threads+1) coordinates work distribution

GC Compensation Algorithm

When GC compensation is enabled, Vector.compensate(counts, k, gcmap) adjusts frequencies using BinObject.gcmap() mapping to normalize for compositional bias. This is particularly useful for:

Comparing regions with different GC content
Identifying unusual compositional patterns
Normalizing for sequencing or PCR bias

Multithreading Architecture

Uses ArrayBlockingQueue producer-consumer pattern with Shared.capThreads(16) limit:

Main thread reads sequences via ConcurrentReadInputStream
PrintThread worker threads process Line objects from ArrayBlockingQueue
ArrayBlockingQueue<Line>(threads+1) buffers work distribution
ByteStreamWriter.add(ByteBuilder, id) maintains output ordering

Performance Characteristics

Time Complexity: O(n) where n is total sequence length
Memory Usage: parseJavaArgs("--percent=42") sets 42% of available RAM
Thread Scaling: Tools.mid(1, Shared.threads(), 32) with capThreads(16) limit
Throughput: Tools.format("%.2fk reads/sec", (readsProcessed/elapsed)*1000000)

Output Format

The output is a tab-delimited file with the following structure:

Output Columns

scaffold: Sequence name from input file header
range: Window coordinates (1-based, inclusive)
GC: GC content fraction (if gc=t)
Tetramer columns: One column per canonical tetramer (AAAA, AAAC, AAAG, etc.)

Output Modes

Counts (default): Raw tetramer counts for each window
Frequencies (float=t): Tetramer frequencies (counts/total) summing to 1.0
GC-compensated (comp=t): Bias-corrected frequencies or counts

Sample Output

scaffold    range       GC      AAAA    AAAC    AAAG    AAAT    ...
seq1        1-2000      0.4200  45      23      67      89      ...
seq1        501-2500    0.4150  43      25      65      91      ...

Support

For questions and support:

Email: bbushnell@lbl.gov
Documentation: bbmap.org