TetramerFreq
DNA Tetramer analysis. DNA tetramers are counted for each sub-sequence of window size in the sequence. The window slides along the sequence by the step length. Sub-sequence shorter than the window size is ignored. Tetramers containing N are ignored.
Basic Usage
tetramerfreq.sh in=<input file> out=<output file> step=500 window=2000
Input may be fasta or fastq, compressed or uncompressed.
Parameters
TetramerFreq parameters are organized into standard parameters for controlling the analysis and Java parameters for memory management.
Standard parameters
- in=<file>
- DNA sequence input file. Accepts fasta or fastq format, compressed or uncompressed.
- out=<file>
- Output file name. Tab-delimited file containing tetramer frequencies for each window. Defaults to stdout if not specified.
- step=INT
- Step size - distance between the start positions of consecutive windows (default 500). Smaller step sizes provide higher resolution but increase computation time and output size.
- window=INT
- Window size - length of sequence analyzed in each window (default 2000). Values ≤0 turn windowing off and analyze entire sequences. For short reads, use window=0.
- short=T/F
- Print lines for sequences shorter than window size (default false). When true, sequences shorter than the window are still analyzed and output.
- k=INT
- Kmer length for analysis (default 4). Standard tetramer analysis uses k=4, but other kmer lengths are supported. Larger k values require exponentially more memory.
- gc
- Print a GC column in the output showing GC content for each analyzed window.
- float
- Output kmer frequencies as decimal fractions instead of raw counts. Frequencies sum to 1.0 for each window.
- comp
- Output GC-compensated kmer frequencies. Adjusts frequencies to account for GC bias, normalizing for compositional differences.
Java Parameters
- -Xmx
- This will set Java's memory usage, overriding autodetection. -Xmx20g will specify 20 gigs of RAM, and -Xmx200m will specify 200 megs. The max is typically 85% of physical memory.
- -eoom
- This flag will cause the process to exit if an out-of-memory exception occurs. Requires Java 8u92+.
- -da
- Disable assertions.
Examples
Basic Tetramer Analysis
tetramerfreq.sh in=genome.fasta out=tetramers.txt
Analyzes genome.fasta using default 2kb windows with 500bp steps, outputting tetramer counts for each window.
High Resolution Analysis
tetramerfreq.sh in=sequence.fq out=freq.txt step=100 window=1000
Uses smaller windows (1kb) and steps (100bp) for higher resolution analysis of sequence composition changes.
Whole Sequence Analysis
tetramerfreq.sh in=reads.fq out=composition.txt window=0 short=t
Analyzes entire sequences without windowing - useful for short reads or when you want overall composition per sequence.
GC-Compensated Frequencies
tetramerfreq.sh in=genome.fa out=normalized.txt comp=t float=t gc=t
Outputs GC-compensated tetramer frequencies as decimals, with GC content column for bias correction analysis.
Hexamer Analysis
tetramerfreq.sh in=sequence.fa out=hexamers.txt k=6
Analyzes 6-mers instead of tetramers. Note that hexamer analysis requires significantly more memory (4^6 = 4,096 kmers vs 4^4 = 256 tetramers).
Algorithm Details
TetramerFrequencies implements a sliding window kmer counting algorithm using ArrayBlockingQueue-based producer-consumer pattern for genome composition analysis:
Sliding Window Strategy
The algorithm slides a window of specified size across each input sequence, advancing by the step size. For each window position:
- Extracts the subsequence within the window boundaries
- Counts all k-mers (default tetramers) using canonical representation
- Calculates GC content if requested
- Outputs results for the current window
Canonical Kmer Representation
Uses canonical kmers to avoid double-counting reverse complements. For each kmer, the lexicographically smaller of the kmer and its reverse complement is stored. This reduces memory usage by ~50% and provides consistent results regardless of sequence orientation.
2-Bit Kmer Encoding
Kmers are encoded as integers using 2-bit-per-base representation:
- A=0, C=1, G=2, T=3
- Tetramers fit in 8 bits (4 bases × 2 bits)
- Bit-shifting operations enable rapid kmer updates
- N bases reset the kmer counter to handle ambiguous nucleotides
Memory Management Strategy
The implementation uses thread-local int arrays and canonical kmer mapping:
- BinObject.makeRemap() creates canonical kmer mapping arrays
- Thread-local int[canonicalKmers] count arrays in PrintThread instances
- ByteStreamWriter with ByteBuilder buffers for streaming output
- ArrayBlockingQueue(threads+1) coordinates work distribution
GC Compensation Algorithm
When GC compensation is enabled, Vector.compensate(counts, k, gcmap) adjusts frequencies using BinObject.gcmap() mapping to normalize for compositional bias. This is particularly useful for:
- Comparing regions with different GC content
- Identifying unusual compositional patterns
- Normalizing for sequencing or PCR bias
Multithreading Architecture
Uses ArrayBlockingQueue producer-consumer pattern with Shared.capThreads(16) limit:
- Main thread reads sequences via ConcurrentReadInputStream
- PrintThread worker threads process Line objects from ArrayBlockingQueue
- ArrayBlockingQueue<Line>(threads+1) buffers work distribution
- ByteStreamWriter.add(ByteBuilder, id) maintains output ordering
Performance Characteristics
- Time Complexity: O(n) where n is total sequence length
- Memory Usage: parseJavaArgs("--percent=42") sets 42% of available RAM
- Thread Scaling: Tools.mid(1, Shared.threads(), 32) with capThreads(16) limit
- Throughput: Tools.format("%.2fk reads/sec", (readsProcessed/elapsed)*1000000)
Output Format
The output is a tab-delimited file with the following structure:
Output Columns
- scaffold: Sequence name from input file header
- range: Window coordinates (1-based, inclusive)
- GC: GC content fraction (if gc=t)
- Tetramer columns: One column per canonical tetramer (AAAA, AAAC, AAAG, etc.)
Output Modes
- Counts (default): Raw tetramer counts for each window
- Frequencies (float=t): Tetramer frequencies (counts/total) summing to 1.0
- GC-compensated (comp=t): Bias-corrected frequencies or counts
Sample Output
scaffold range GC AAAA AAAC AAAG AAAT ...
seq1 1-2000 0.4200 45 23 67 89 ...
seq1 501-2500 0.4150 43 25 65 91 ...
Support
For questions and support:
- Email: bbushnell@lbl.gov
- Documentation: bbmap.org