Translate6Frames
Translates nucleotide sequences to all 6 amino acid frames, or amino acids to a canonical nucleotide representation. Input may be fasta or fastq, compressed or uncompressed.
Basic Usage
translate6frames.sh in=<input file> out=<output file>
This tool can operate in two modes: translating nucleotide sequences to amino acids (6-frame translation) or converting amino acids back to canonical nucleotide representations.
Parameters
Parameters are organized by function into input handling, output formatting, and Java runtime configuration.
Input parameters
- in=<file>
- Main input file. Use in=stdin.fa to pipe from stdin. Accepts fasta or fastq format, compressed or uncompressed.
- in2=<file>
- Input for 2nd read of pairs in a different file. Used for paired-end data stored in separate files.
- int=auto
- (interleaved) Set to t/f to override interleaved autodetection. Auto-detection examines file structure to determine if reads are interleaved.
- qin=auto
- Input quality offset: 33 (Sanger), 64 (Illumina 1.3+), or auto. Auto-detection examines quality scores to determine encoding.
- aain=f
- Set to true if input sequences are amino acids instead of nucleotides. When true, performs reverse translation to canonical nucleotides.
- reads=-1
- If positive, quit after processing this many reads or pairs. Useful for testing or processing subsets of large files.
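To make the qin offsets concrete, here is a minimal Python sketch of how an ASCII quality string maps to Phred scores under offset 33 or 64. The function names decode_quality and guess_offset are hypothetical, and the guessing heuristic is a common rule of thumb, not necessarily the autodetection that BBTools performs.

```python
def decode_quality(qual_string, offset=33):
    """Convert an ASCII quality string to a list of Phred scores."""
    scores = [ord(c) - offset for c in qual_string]
    if any(s < 0 for s in scores):
        raise ValueError("quality character below offset %d" % offset)
    return scores


def guess_offset(qual_string):
    """Rule of thumb (assumption): a character below ASCII 64 rules out offset 64."""
    return 33 if any(ord(c) < 64 for c in qual_string) else 64


print(decode_quality("II5!", 33))
print(guess_offset("hhhh"))
```

A string made only of high characters is ambiguous under this heuristic, which is one reason to set qin explicitly when the encoding is known.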
Output parameters
- out=<file>
- Write output here. Use 'out=stdout.fa' to write to standard output. Output format matches input format unless overridden.
- out2=<file>
- Use this to write 2nd read of pairs to a different file. If omitted, paired output is typically written interleaved to the file given by out.
- overwrite=t
- (ow) Grant permission to overwrite existing output files. Set to false to prevent accidental overwrites.
- append=f
- Append to existing files instead of overwriting. Useful for combining results from multiple runs.
- ziplevel=2
- (zl) Compression level for gzipped output; 1 (fastest compression) through 9 (best compression). Higher values use more CPU time.
- fastawrap=80
- Length of lines in fasta output. Sequences longer than this value are wrapped to multiple lines.
- qout=auto
- Output quality offset: 33 (Sanger), 64 (Illumina 1.3+), or auto. Auto uses same encoding as input.
- aaout=t
- Set to false to output nucleotides, true for amino acids. When translating nucleotides, this determines final output format.
- tag=t
- Tag read ID with the frame number, adding suffixes like ' fr1', ' fr2', etc. Helps identify which frame each translated sequence came from.
- frames=6
- Only print this many frames (1-6). If you already know the strand (sense) of the sequences, set 'frames=3' to translate only the three forward frames. The default of 6 translates all forward and reverse frames.
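As an illustration of what fastawrap controls, the following Python sketch wraps a sequence at a fixed line length when writing a fasta record. The helper format_fasta is hypothetical, and treating a width below 1 as "no wrapping" is an assumption of this sketch, not documented behavior of the tool.

```python
def format_fasta(name, seq, wrap=80):
    """Return one fasta record, with the sequence wrapped at `wrap` characters."""
    lines = [">" + name]
    if wrap < 1:
        # Assumption for this sketch: a non-positive width disables wrapping.
        lines.append(seq)
    else:
        lines.extend(seq[i:i + wrap] for i in range(0, len(seq), wrap))
    return "\n".join(lines) + "\n"


print(format_fasta("seq1 fr1", "MKV" * 40, wrap=80), end="")
```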
Java Parameters
- -Xmx
- Set Java's memory usage, overriding autodetection. -Xmx20g specifies 20 gigabytes of RAM, -Xmx200m specifies 200 megabytes. The maximum is typically 85% of physical memory.
- -eoom
- Exit if an out-of-memory exception occurs. Requires Java 8u92 or later. Prevents incomplete output when memory is exhausted.
- -da
- Disable Java assertions. May provide minor performance improvement in production use.
Examples
Basic 6-frame Translation
translate6frames.sh in=genes.fasta out=proteins.fasta
Translates nucleotide sequences to amino acids in all 6 reading frames. Each input sequence generates 6 output sequences tagged with frame identifiers (fr1-fr6).
Forward Frames Only
translate6frames.sh in=orfs.fasta out=proteins.fasta frames=3
Translates only the three forward reading frames (fr1-fr3), useful when the strand orientation is known.
Reverse Translation
translate6frames.sh in=proteins.fasta out=nucleotides.fasta aain=t aaout=f
Converts amino acid sequences back to canonical nucleotide representations using the genetic code.
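Because the genetic code is degenerate, reverse translation has to pick one codon per amino acid. The Python sketch below shows the idea using the standard genetic code and, as an assumption, the first codon in T, C, A, G enumeration order as the "canonical" one. The codon that the tool actually emits for each amino acid is not specified here, and reverse_translate is a hypothetical name.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons enumerated in TCAG order (TTT, TTC, TTA, ...).
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

# Assumption: the canonical codon is the first one listed for each amino acid.
CANONICAL = {}
for codon, aa in zip(product(BASES, repeat=3), AMINO):
    CANONICAL.setdefault(aa, "".join(codon))


def reverse_translate(protein, unknown="NNN"):
    """Map each amino acid to one fixed codon; unknown symbols become NNN."""
    return "".join(CANONICAL.get(aa, unknown) for aa in protein.upper())


print(reverse_translate("MW*"))
```

The result is a valid encoding of the protein, but not a recovery of the original nucleotide sequence, since the codon choice made by the organism is lost in translation.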
Paired-End Processing
translate6frames.sh in=reads_1.fq in2=reads_2.fq out=proteins_1.fq out2=proteins_2.fq
Processes paired-end reads, translating both read 1 and read 2 to amino acids while maintaining pairing information.
No Frame Tagging
translate6frames.sh in=sequences.fasta out=translated.fasta tag=f
Translates sequences without adding frame identifiers to sequence names, producing cleaner output for downstream analysis.
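The effect of tag and frames on output names can be sketched in a few lines of Python. The suffixes match the ' fr1' through ' fr6' tags described above; the function output_names is hypothetical and only models the naming, not the translation.

```python
FRAME_TAGS = (" fr1", " fr2", " fr3", " fr4", " fr5", " fr6")


def output_names(read_id, frames=6, tag=True):
    """Return the names of the translated records produced for one input read."""
    if not 1 <= frames <= 6:
        raise ValueError("frames must be between 1 and 6")
    return [read_id + FRAME_TAGS[i] if tag else read_id for i in range(frames)]


print(output_names("gene1", frames=3))
print(output_names("gene1", frames=2, tag=False))
```

With tagging disabled, all frames of a read share one name, so downstream tools that require unique identifiers may need the tags left on.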
Algorithm Details
Translation Implementation
The core translation functionality is implemented in the toFrames() method (lines 341-359), which performs nucleotide-to-amino acid conversion using the AminoAcid class methods:
- 6-frame processing: Calls AminoAcid.toAAsSixFrames(r1.bases) to generate byte arrays for all six reading frames (three forward frames at offsets 0, 1, and 2; three reverse-complement frames)
- Quality translation: Uses AminoAcid.toQualitySixFrames(r1.quality, 0) to compress nucleotide quality scores to amino acid quality scores, maintaining the 3:1 mapping ratio
- Frame tagging system: Applies predefined frame identifiers using the frametag[] array: {" fr1", " fr2", " fr3", " fr4", " fr5", " fr6"} (line 404)
- Read object construction: Creates new Read objects with the Read.AAMASK flag to mark amino acid sequences, preserving original read metadata (chromosome, start, stop positions)
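The six-frame step can be sketched generically in Python as follows. This is not the AminoAcid.toAAsSixFrames() code: it uses the standard genetic code, emits X for codons containing ambiguous bases, and assumes frames 4 to 6 are offsets 0, 1, and 2 of the reverse complement, all of which are assumptions of the sketch.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons enumerated in TCAG order (TTT, TTC, TTA, ...).
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Complement each base, then reverse the string."""
    return seq.translate(COMPLEMENT)[::-1]


def translate(seq, offset=0):
    """Translate complete codons starting at `offset`; unknown codons become X."""
    return "".join(CODON_TABLE.get(seq[i:i + 3], "X")
                   for i in range(offset, len(seq) - 2, 3))


def six_frames(seq):
    """Return [fr1, fr2, fr3, fr4, fr5, fr6] for one nucleotide sequence."""
    seq = seq.upper()
    rc = reverse_complement(seq)
    return [translate(seq, o) for o in range(3)] + [translate(rc, o) for o in range(3)]


print(six_frames("ATGGCC"))
```

Trailing bases that do not fill a codon are dropped, so the frames of one read can differ in length by one residue.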
Bidirectional Processing Architecture
The tool implements bidirectional translation controlled by boolean flags NT_IN and NT_OUT (lines 387-388):
- Forward translation (NT_IN=true): Processes nucleotide sequences through the toFrames() method to generate amino acid translations
- Reverse translation (lines 279-291): Converts amino acids back to nucleotides using Read.aminoToNucleic(), maintaining mate-pair relationships through explicit mate assignment
- Mode detection: The aain parameter sets NT_IN=!Parse.parseBoolean(b) to toggle input interpretation, while aaout controls NT_OUT for output format
- Global state coordination: Updates Shared.AMINO_IN=!NT_IN to coordinate with other BBTools components
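The flag logic above can be mirrored in a short Python sketch. The helper names parse_bool and parse_modes are hypothetical, the set of accepted boolean spellings is an assumption, and the sketch assumes aaout is inverted into NT_OUT the same way aain is inverted into NT_IN.

```python
def parse_bool(value):
    """Accept a few common spellings of true/false (assumed, not exhaustive)."""
    v = value.strip().lower()
    if v in ("t", "true", "1"):
        return True
    if v in ("f", "false", "0"):
        return False
    raise ValueError("not a boolean: %r" % value)


def parse_modes(args):
    """Return (nt_in, nt_out) from key=value arguments; defaults are aain=f, aaout=t."""
    nt_in, nt_out = True, False
    for arg in args:
        key, sep, value = arg.partition("=")
        if not sep:
            continue  # bare flags are ignored in this sketch
        if key == "aain":
            nt_in = not parse_bool(value)
        elif key == "aaout":
            nt_out = not parse_bool(value)
    return nt_in, nt_out


print(parse_modes(["in=proteins.fasta", "aain=t", "aaout=f"]))
```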
Memory and Performance Architecture
The implementation uses concurrent streaming patterns for memory efficiency:
- ConcurrentReadInputStream: Handles parallel read parsing with configurable sample rates (cris.setSampleRate(samplerate, sampleseed)) and shared header optimization
- ConcurrentReadOutputStream: Manages parallel output writing with buffering (buffer size=4, line 207) and format-specific optimization
- ListNum processing: Processes reads in batches through ListNum<Read> containers, enabling concurrent read/write operations while maintaining order
- Memory allocation: Default heap settings use calcXmx() function with 2GB base allocation (freeRam 2000m 42 in shell script)
- Format-specific optimization: FASTA input automatically sets skipquality=true (line 190) to bypass unnecessary quality processing
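The batching idea behind ListNum<Read> can be illustrated with a small Python generator that yields numbered chunks of reads, so a consumer can restore order from the batch number. The name numbered_batches and the batch size of 200 are assumptions of this sketch; the real streams are multithreaded Java classes.

```python
from itertools import islice


def numbered_batches(reads, size=200):
    """Yield (batch_id, list_of_reads) tuples with at most `size` reads each."""
    it = iter(reads)
    batch_id = 0
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield batch_id, chunk
        batch_id += 1


for batch_id, chunk in numbered_batches(["r1", "r2", "r3", "r4", "r5"], size=2):
    print(batch_id, chunk)
```

Only one batch is held in memory at a time, which is why streaming tools of this kind can process files much larger than the heap.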
Quality Score Processing Details
Quality scores are translated alongside the bases as follows:
- Conditional processing: The skipquality flag (line 385) allows bypassing quality translation entirely, using QNULL array instead of calculated scores
- 3:1 compression ratio: AminoAcid.toQualitySixFrames() maps three nucleotide quality scores to one amino acid quality score, typically using the first position
- Format autodetection: Input/output quality encoding is handled by Parser.processQuality() with automatic Sanger/Illumina detection
- Paired-read handling: Quality processing maintains separate quality arrays (qm1, qm2) for mate pairs throughout the translation pipeline
- FASTA optimization: When FASTA format is detected, quality processing is automatically disabled to improve performance and reduce memory usage
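The 3:1 quality mapping for a forward frame can be sketched in Python as below. Taking the first score of each codon follows the description above; the pluggable pick argument and the function names are assumptions added for illustration, and reverse frames are left out of this sketch.

```python
def first_position(triple):
    """Default rule: use the quality of the first base of the codon."""
    return triple[0]


def compress_quality(quals, offset=0, pick=first_position):
    """Collapse per-base scores to one score per complete codon in a forward frame."""
    return [pick(quals[i:i + 3]) for i in range(offset, len(quals) - 2, 3)]


quals = [30, 20, 10, 40, 35, 5, 12]
print(compress_quality(quals))
print(compress_quality(quals, offset=1))
print(compress_quality(quals, pick=min))
```

Passing pick=min gives a more conservative score per residue, at the cost of departing from the first-position rule described above.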
Input/Output Stream Management
File handling uses the BBTools streaming architecture:
- FileFormat validation: FileFormat.testInput() and FileFormat.testOutput() methods validate file paths and determine optimal I/O strategies
- Compression handling: Automatic detection and processing of compressed inputs through ReadWrite.USE_PIGZ and ReadWrite.USE_UNPIGZ flags
- Threading coordination: ReadWrite.setZipThreads(Shared.threads()) coordinates compression threads with available CPU cores
- Buffer management: Shared.capBuffers(4) limits memory usage by constraining the number of concurrent buffers
Support
For questions and support:
- Email: bbushnell@lbl.gov
- Documentation: bbmap.org