Translate6Frames

Basic Usage

translate6frames.sh in=<input file> out=<output file>

This tool can operate in two modes: translating nucleotide sequences to amino acids (6-frame translation) or converting amino acids back to canonical nucleotide representations.

Parameters

Parameters are organized by function into input handling, output formatting, and Java runtime configuration.

Input Parameters

in=<file>: Main input file. Use in=stdin.fa to pipe from stdin. Accepts fasta or fastq format, compressed or uncompressed.
in2=<file>: Input for 2nd read of pairs in a different file. Used for paired-end data stored in separate files.
int=auto: (interleaved) Set to t/f to override interleaved autodetection. Auto-detection examines file structure to determine if reads are interleaved.
qin=auto: Input quality offset: 33 (Sanger), 64 (Illumina 1.3+), or auto. Auto-detection examines quality scores to determine encoding.
aain=f: Set to true if input sequences are amino acids instead of nucleotides. When true, performs reverse translation to canonical nucleotides.
reads=-1: If positive, quit after processing this many reads or pairs. Useful for testing or processing subsets of large files.
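
To make the qin offsets concrete, here is a minimal Python sketch that decodes a FASTQ quality string under either offset. The names decode_quality and guess_offset are illustrative, and the guessing heuristic is an assumption made for this example, not the tool's actual autodetection logic.

```python
def decode_quality(qual, offset=33):
    """Convert an ASCII quality string to Phred scores using the given offset."""
    if offset not in (33, 64):
        raise ValueError("offset must be 33 (Sanger) or 64 (Illumina 1.3+)")
    scores = [ord(c) - offset for c in qual]
    if scores and min(scores) < 0:
        raise ValueError("quality character below offset; wrong encoding?")
    return scores


def guess_offset(qual):
    """Naive heuristic (an assumption, not BBTools' logic): a character
    below '@' (ASCII 64) cannot occur in offset-64 data."""
    return 33 if any(ord(c) < 64 for c in qual) else 64


print(decode_quality("II5!", offset=guess_offset("II5!")))
```

A string made up only of high characters is ambiguous under this heuristic, which is one reason to set qin explicitly when the encoding is known.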

Output Parameters

out=<file>: Write output here. Use 'out=stdout.fa' to write to standard output. Output format matches input format unless overridden.
out2=<file>: Use this to write 2nd read of pairs to a different file. Needed only when paired output should go to separate files.
overwrite=t: (ow) Grant permission to overwrite existing output files. Set to false to prevent accidental overwrites.
append=f: Append to existing files instead of overwriting. Useful for combining results from multiple runs.
ziplevel=2: (zl) Compression level for gzipped output; 1 (fastest compression) through 9 (best compression). Higher values use more CPU time.
fastawrap=80: Length of lines in fasta output. Sequences longer than this value are wrapped to multiple lines.
qout=auto: Output quality offset: 33 (Sanger), 64 (Illumina 1.3+), or auto. Auto uses same encoding as input.
aaout=t: Set to false to output nucleotides, true for amino acids. When translating nucleotides, this determines final output format.
tag=t: Tag read ID with the frame number, adding suffixes like ' fr1', ' fr2', etc. Helps identify which frame each translated sequence came from.
frames=6: Only print this many frames (1-6). If you already know the strand (sense), set 'frames=3' to translate only the three forward frames. The default of 6 translates all forward and reverse frames.
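
The effect of tag and fastawrap on a single output record can be sketched in Python as follows. The function format_fasta is hypothetical and only mimics the documented behavior; the tag strings are the ones listed under Algorithm Details.

```python
FRAME_TAGS = [" fr1", " fr2", " fr3", " fr4", " fr5", " fr6"]


def format_fasta(name, seq, frame, wrap=80, tag=True):
    """Return one FASTA record as text.

    frame is a 0-based frame index; wrap is the maximum line length.
    """
    if wrap < 1:
        raise ValueError("wrap must be a positive line length")
    header = ">" + name + (FRAME_TAGS[frame] if tag else "")
    # Split the sequence into lines of at most `wrap` characters.
    lines = [seq[i:i + wrap] for i in range(0, len(seq), wrap)] or [""]
    return "\n".join([header] + lines) + "\n"


print(format_fasta("read1", "MKV" * 40, frame=0), end="")
```

With the defaults, the 120-residue sequence above is written as one 80-character line followed by one 40-character line, under the header ">read1 fr1".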

Java Parameters

-Xmx: Set Java's memory usage, overriding autodetection. -Xmx20g specifies 20 gigabytes of RAM, -Xmx200m specifies 200 megabytes. The maximum is typically 85% of physical memory.
-eoom: Exit if an out-of-memory exception occurs. Requires Java 8u92 or later. Prevents incomplete output when memory is exhausted.
-da: Disable Java assertions. May provide minor performance improvement in production use.
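
As a quick illustration of how -Xmx values are read, the Python sketch below converts a flag such as -Xmx20g into a byte count. The helper parse_xmx is hypothetical and handles only the k, m, and g suffixes, which the JVM interprets as binary multiples.

```python
import re

# JVM size suffixes are binary multiples of bytes.
_UNITS = {"": 1, "k": 1024, "m": 1024 ** 2, "g": 1024 ** 3}


def parse_xmx(flag):
    """Return the heap size in bytes requested by a flag such as '-Xmx20g'."""
    match = re.fullmatch(r"-Xmx(\d+)([kKmMgG]?)", flag)
    if match is None:
        raise ValueError("not a valid -Xmx flag: %r" % flag)
    number, unit = match.groups()
    return int(number) * _UNITS[unit.lower()]


print(parse_xmx("-Xmx20g"), parse_xmx("-Xmx200m"))
```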

Examples

Basic 6-frame Translation

translate6frames.sh in=genes.fasta out=proteins.fasta

Translates nucleotide sequences to amino acids in all 6 reading frames. Each input sequence generates 6 output sequences tagged with frame identifiers (fr1-fr6).

Forward Frames Only

translate6frames.sh in=orfs.fasta out=proteins.fasta frames=3

Translates only the three forward reading frames (fr1-fr3), useful when the strand orientation is known.

Reverse Translation

translate6frames.sh in=proteins.fasta out=nucleotides.fasta aain=t aaout=f

Converts amino acid sequences back to canonical nucleotide representations using the genetic code.

Paired-End Processing

translate6frames.sh in=reads_1.fq in2=reads_2.fq out=proteins_1.fq out2=proteins_2.fq

Processes paired-end reads, translating both read 1 and read 2 to amino acids while maintaining pairing information.

No Frame Tagging

translate6frames.sh in=sequences.fasta out=translated.fasta tag=f

Translates sequences without adding frame identifiers to sequence names, producing cleaner output for downstream analysis.

Algorithm Details

Translation Implementation

The core translation functionality is implemented in the toFrames() method (lines 341-359), which performs nucleotide-to-amino acid conversion using the AminoAcid class methods:

6-frame processing: Calls AminoAcid.toAAsSixFrames(r1.bases) to generate byte arrays for all six reading frames (three forward frames at offsets 0, 1, and 2, plus three reverse-complement frames)
Quality translation: Uses AminoAcid.toQualitySixFrames(r1.quality, 0) to compress nucleotide quality scores to amino acid quality scores, maintaining the 3:1 mapping ratio
Frame tagging system: Applies predefined frame identifiers using the frametag[] array: {" fr1", " fr2", " fr3", " fr4", " fr5", " fr6"} (line 404)
Read object construction: Creates new Read objects with the Read.AAMASK flag to mark amino acid sequences, preserving original read metadata (chromosome, start, stop positions)
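
The six-frame step described above can be sketched in Python. This is an independent illustration, not the Java code: it assumes the standard genetic code, writes stop codons as '*' and codons containing ambiguous bases as 'X', and assumes that fr4 through fr6 are offsets 0 through 2 of the reverse complement. The function names are illustrative.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, with codons ordered TTT, TTC, TTA, TTG, TCT, ...
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")
FRAME_TAGS = [" fr1", " fr2", " fr3", " fr4", " fr5", " fr6"]


def translate(seq):
    """Translate complete codons from position 0; trailing bases are dropped."""
    seq = seq.upper()
    return "".join(CODON_TABLE.get(seq[i:i + 3], "X")
                   for i in range(0, len(seq) - 2, 3))


def reverse_complement(seq):
    return seq.upper().translate(COMPLEMENT)[::-1]


def six_frames(seq, frames=6):
    """Return up to six translations: forward offsets 0-2, then reverse."""
    rc = reverse_complement(seq)
    strands = [seq[o:] for o in range(3)] + [rc[o:] for o in range(3)]
    return [translate(s) for s in strands[:frames]]


def tagged_frames(name, seq, frames=6):
    """Pair each translation with the read name plus its frame tag."""
    return [(name + FRAME_TAGS[i], aa)
            for i, aa in enumerate(six_frames(seq, frames))]


for header, protein in tagged_frames("read1", "ATGGCCTAA"):
    print(header, protein)
```

Passing frames=3 keeps only the forward translations, which mirrors the frames=3 setting on the command line.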

Bidirectional Processing Architecture

The tool implements bidirectional translation controlled by boolean flags NT_IN and NT_OUT (lines 387-388):

Forward translation (NT_IN=true): Processes nucleotide sequences through the toFrames() method to generate amino acid translations
Reverse translation (lines 279-291): Converts amino acids back to nucleotides using Read.aminoToNucleic(), maintaining mate-pair relationships through explicit mate assignment
Mode detection: The aain parameter sets NT_IN=!Parse.parseBoolean(b) to toggle input interpretation, while aaout controls NT_OUT for output format
Global state coordination: Updates Shared.AMINO_IN=!NT_IN to coordinate with other BBTools components
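
Reverse translation can be illustrated with a small Python sketch. Which codon counts as "canonical" for each amino acid is not stated here, so this example simply takes the first codon of the standard genetic code in TCAG order; that choice, the 'NNN' fallback for unknown residues, and the function name are assumptions and may differ from what Read.aminoToNucleic() emits.

```python
from itertools import product

BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

# Assumed "canonical" codon: the first one listed for each amino acid.
CANONICAL = {}
for codon, aa in zip(product(BASES, repeat=3), AMINO):
    CANONICAL.setdefault(aa, "".join(codon))


def amino_to_nucleic(protein):
    """Map each residue to one codon; unknown residues become 'NNN'."""
    return "".join(CANONICAL.get(aa, "NNN") for aa in protein.upper())


print(amino_to_nucleic("MKW"))
```

Because most amino acids have several codons, this mapping is not an inverse of translation: the original nucleotide sequence cannot be recovered, only a sequence that translates to the same protein.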

Memory and Performance Architecture

The implementation uses concurrent streaming patterns for memory efficiency:

ConcurrentReadInputStream: Handles parallel read parsing with configurable sample rates (cris.setSampleRate(samplerate, sampleseed)) and shared header optimization
ConcurrentReadOutputStream: Manages parallel output writing with buffering (buffer size=4, line 207) and format-specific optimization
ListNum processing: Processes reads in batches through ListNum<Read> containers, enabling concurrent read/write operations while maintaining order
Memory allocation: The shell script sets the default heap through its calcXmx() function, which calls freeRam 2000m 42, giving a base allocation of roughly 2 GB
Format-specific optimization: FASTA input automatically sets skipquality=true (line 190) to bypass unnecessary quality processing
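
The batch-oriented flow can be pictured with a Python generator that hands out numbered lists of reads, loosely analogous to ListNum<Read>. The batch size of 200 and the function names are hypothetical, and the sketch is single-threaded; it only shows how numbering batches lets output stay in input order.

```python
from itertools import islice


def batches(reads, size=200):
    """Yield (batch_number, list_of_reads) until the input is exhausted."""
    it = iter(reads)
    num = 0
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield num, chunk
        num += 1


def process(reads, transform, size=200):
    """Apply transform to every read, one batch at a time, preserving order."""
    out = []
    for _num, chunk in batches(reads, size):
        out.extend(transform(r) for r in chunk)
    return out


print(process(["acgt", "ttga", "ccat"], str.upper, size=2))
```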

Quality Score Processing Details

Quality score handling works as follows:

Conditional processing: The skipquality flag (line 385) allows bypassing quality translation entirely, using QNULL array instead of calculated scores
3:1 compression ratio: AminoAcid.toQualitySixFrames() maps three nucleotide quality scores to one amino acid quality score, typically using the first position
Format autodetection: Input/output quality encoding is handled by Parser.processQuality() with automatic Sanger/Illumina detection
Paired-read handling: Quality processing maintains separate quality arrays (qm1, qm2) for mate pairs throughout the translation pipeline
FASTA optimization: When FASTA format is detected, quality processing is automatically disabled to improve performance and reduce memory usage
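
The 3:1 quality mapping can be sketched in Python. The default here keeps the first score of each codon, following the description above, but the combining rule is left pluggable because the exact rule used by AminoAcid.toQualitySixFrames() is not confirmed; codon_qualities and its parameters are hypothetical.

```python
def codon_qualities(quals, offset=0, combine=None):
    """Collapse per-base scores to one score per complete codon.

    offset is the frame offset (0, 1, or 2); combine maps a list of
    three scores to one score and defaults to taking the first.
    """
    if combine is None:
        combine = lambda triple: triple[0]
    q = quals[offset:]
    return [combine(q[i:i + 3]) for i in range(0, len(q) - 2, 3)]


base_quals = [30, 20, 10, 40, 40, 40, 5]
print(codon_qualities(base_quals))
print(codon_qualities(base_quals, offset=1, combine=min))
```

Passing combine=min gives a more conservative per-residue score, which may be preferable if any low-quality base in a codon should lower confidence in the amino acid.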

Input/Output Stream Management

File handling uses the BBTools streaming architecture:

FileFormat validation: FileFormat.testInput() and FileFormat.testOutput() methods validate file paths and determine optimal I/O strategies
Compression handling: Automatic detection and processing of compressed inputs through ReadWrite.USE_PIGZ and ReadWrite.USE_UNPIGZ flags
Threading coordination: ReadWrite.setZipThreads(Shared.threads()) coordinates compression threads with available CPU cores
Buffer management: Shared.capBuffers(4) limits memory usage by constraining the number of concurrent buffers
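
For readers who want to experiment with compressed input and the ziplevel setting outside the tool, the Python sketch below reads data that may or may not be gzipped and writes gzip output at a chosen level. It uses Python's gzip module and checks the gzip magic bytes; how BBTools itself decides that a file is compressed is not covered here, so treat the detection rule and function names as assumptions.

```python
import gzip

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream


def read_maybe_gzipped(raw):
    """Decode bytes as text, decompressing first if they look gzipped."""
    if raw[:2] == GZIP_MAGIC:
        raw = gzip.decompress(raw)
    return raw.decode("ascii")


def write_gzipped(text, ziplevel=2):
    """Compress text at the given level (1 = fastest, 9 = smallest)."""
    if not 1 <= ziplevel <= 9:
        raise ValueError("ziplevel must be between 1 and 9")
    return gzip.compress(text.encode("ascii"), compresslevel=ziplevel)


record = ">read1 fr1\nMKV\n"
packed = write_gzipped(record, ziplevel=2)
print(read_maybe_gzipped(packed) == record)
```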

Support

For questions and support:

Email: bbushnell@lbl.gov
Documentation: bbmap.org