SeqToVec

Script: seqtovec.sh Package: ml Class: SequenceToVector.java

Generates vectors from sequences. These can be one-hot encoded vectors (4 bits per base) or k-mer frequency spectra.

Basic Usage

seqtovec.sh in=<sequence data> out=<text vectors>

Input may be fasta or fastq, compressed or uncompressed. Output will be vectors in TSV format with the last column as the result.

Parameters

Parameters are organized by their function in the vector generation process. seqtovec.sh operates in two primary modes: raw mode (one-hot encoding, k=0) and spectrum mode (k-mer frequencies, k>0).

Standard parameters

in=<file>
Sequence data input file. Accepts FASTA or FASTQ format, compressed or uncompressed.
out=<file>
Vectors in TSV form, with the last column as the result. Each row represents one sequence vector with features followed by the target value.
overwrite=f
(ow) Set to false to force the program to abort rather than overwrite an existing file. Default: false

Processing parameters

parse=f
Set to true to parse the result from individual sequence headers, from a tab-delimited 'result=X' term. This allows different target values for each sequence. Default: false
result=-1
Set the desired result for all vectors when not parsing from headers. This becomes the last column value for all output vectors. Default: -1
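
As a rough illustration of the header convention used by parse, the following Python sketch pulls a 'result=X' term out of a tab-delimited header. The function name parse_result and the fallback to a fixed default when no term is present are assumptions made for illustration; they are not taken from the tool's source.

```python
def parse_result(header, default=-1.0):
    """Return the value of a tab-delimited 'result=X' term in a header.

    Falls back to `default` when no such term exists (an assumption for
    this sketch; the tool's behavior in that case is not documented here).
    """
    for term in header.split("\t"):
        if term.startswith("result="):
            return float(term[len("result="):])
    return default


if __name__ == "__main__":
    print(parse_result("seq1\tresult=1"))  # header carries its own target
    print(parse_result("seq2"))            # no term, so the default is used
```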

Raw mode parameters

width=55
Maximum vector width, in bases; the actual vector size will be 4+4*width+1 (where the leading 4 are the sequence features and the +1 is the desired output). For longer sequences, only the first 'width' bases will be used; shorter sequences will be zero-padded. Default: 55
rcomp=f
If true, also output vectors for the reverse-complement of each sequence, effectively doubling the output size. Useful for training models that should be strand-agnostic. Default: false
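
The width arithmetic above can be sketched in Python. This is a minimal illustration, not the tool's code: the names one_hot and total_columns are hypothetical, the A, C, G, T column order is taken from the sample output header, and leaving ambiguous bases such as N as all zeros is an assumption.

```python
BASE_INDEX = {"A": 0, "C": 1, "G": 2, "T": 3}


def one_hot(seq, width=55):
    """One-hot encode the first `width` bases, 4 values per base, zero-padded."""
    vec = [0.0] * (4 * width)
    for i, base in enumerate(seq[:width].upper()):
        idx = BASE_INDEX.get(base)
        if idx is not None:  # assumption: ambiguous bases stay all-zero
            vec[4 * i + idx] = 1.0
    return vec


def total_columns(width):
    """4 sequence features + 4 values per base + 1 result column."""
    return 4 + 4 * width + 1


if __name__ == "__main__":
    print(one_hot("ACGT", width=2))  # only the first two bases are encoded
    print(total_columns(55))         # 225 columns, i.e. 224 inputs + 1 output
```

With the default width=55 this gives 224 input columns, which agrees with the "#dims 224 1" line in the raw mode sample output.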

Spectrum mode parameters

k=0
If k is positive, generate vectors of kmer frequencies instead of raw sequence. Range is 1-8; recommended range is 4-6. Setting k>0 automatically enables spectrum mode. Default: 0 (raw mode)
dimensions=0
If positive, restrict the vector size in spectrum mode to dimensions+5. The first 4 columns are reserved for sequence features (length, GC, entropy, homopolymer) and the last column for the result. Default: 0 (unlimited)
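
To make the idea of a k-mer frequency spectrum concrete, here is a minimal Python sketch that counts k-mers with a 2-bit-per-base rolling window and a bit mask. The function name kmer_frequencies is hypothetical, and the choices to reset the window on ambiguous bases and to normalize by the total k-mer count are assumptions. The sketch does not model the dimensions option, and the tool itself may count or normalize differently.

```python
BASE_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def kmer_frequencies(seq, k):
    """Return 4**k frequencies, indexed by the 2-bit-packed k-mer value."""
    mask = (1 << (2 * k)) - 1   # keeps only the lowest 2*k bits
    counts = [0] * (4 ** k)
    kmer = 0                    # rolling 2-bit encoding of the last k bases
    valid = 0                   # consecutive unambiguous bases seen so far
    for base in seq.upper():
        code = BASE_CODE.get(base)
        if code is None:        # assumption: ambiguous base resets the window
            kmer, valid = 0, 0
            continue
        kmer = ((kmer << 2) | code) & mask
        valid += 1
        if valid >= k:
            counts[kmer] += 1
    total = sum(counts)
    if total == 0:
        return [0.0] * len(counts)
    return [c / total for c in counts]


if __name__ == "__main__":
    freqs = kmer_frequencies("ACGT", 2)
    print(len(freqs))  # 16 possible 2-mers
    print(freqs[1])    # frequency of AC (packed as 0b0001)
```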

Java Parameters

-Xmx
This will set Java's memory usage, overriding autodetection. -Xmx20g will specify 20 gigs of RAM, and -Xmx200m will specify 200 megs. The max is typically 85% of physical memory.
-eoom
This flag will cause the process to exit if an out-of-memory exception occurs. Requires Java 8u92+.
-da
Disable assertions for improved performance in production environments.

Examples

Basic Raw Mode Vector Generation

seqtovec.sh in=sequences.fasta out=vectors.tsv width=50 result=1

Generate one-hot encoded vectors with width 50 for each sequence, assigning result value 1 to all vectors.

K-mer Frequency Vectors

seqtovec.sh in=sequences.fasta out=kmer_vectors.tsv k=4 dimensions=100

Generate 4-mer frequency vectors limited to 100 k-mer dimensions, plus the 4 standard sequence features and the result column (105 columns in total).

Parsing Results from Headers

seqtovec.sh in=labeled_sequences.fasta out=vectors.tsv parse=true k=5

Generate 5-mer vectors while parsing target values from sequence headers (expects a tab-delimited 'result=X' term in each header).

Reverse Complement Generation

seqtovec.sh in=sequences.fasta out=both_strands.tsv width=40 rcomp=true result=0

Generate raw vectors for both forward and reverse complement sequences, useful for strand-agnostic machine learning.
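
For readers unfamiliar with the term, the reverse complement can be sketched in a few lines of Python. The helper names are hypothetical, and the ordering shown (each forward sequence immediately followed by its reverse complement) is an assumption about how the doubled output might be arranged, not a statement about the tool.

```python
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Complement each base, then reverse; other symbols (e.g. N) are kept."""
    return seq.translate(COMPLEMENT)[::-1]


def with_rcomp(seqs):
    """Yield each sequence followed by its reverse complement."""
    for s in seqs:
        yield s
        yield reverse_complement(s)


if __name__ == "__main__":
    print(reverse_complement("AACG"))
    print(list(with_rcomp(["AC", "GGT"])))
```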

Algorithm Details

Vector Generation Modes

SequenceToVector implements two distinct vectorization strategies through the toVector() and fillVector() methods:

Raw Mode (k=0)

Raw mode uses the fillRaw() method to convert sequences into one-hot encoded vectors using the AminoAcid.baseToNumber0[] lookup table. The implementation:

Spectrum Mode (k>0)

Spectrum mode uses the fillSpectrum() method, which applies a bit-masked k-mer sliding-window algorithm. The implementation:

Sequence Feature Calculation

All vectors begin with 4 standardized features calculated by specific methods:

Memory and Performance Implementation

The implementation includes specific memory management strategies:

K-mer Space Management Implementation

Spectrum mode uses precise algorithmic k-mer space handling:

Output Format

Output is tab-separated values with the following structure:

Raw Mode Output

#dims	224	1
length_ratio	GC	entropy	homopolymer	base1_A	base1_C	base1_G	base1_T	...	baseN_T	result
0.1818	0.4545	1.2345	0.1667	1	0	0	0	...	0	1

Spectrum Mode Output

#dims	20	1
length_ratio	GC	entropy	homopolymer	kmer1_freq	kmer2_freq	...	kmerN_freq	result
0.2727	0.4545	1.2345	0.1667	0.05	0.03	...	0.02	1

The first line indicates the number of input dimensions and output dimensions. Each subsequent line represents one sequence vector.
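
A downstream script might load this format roughly as follows. This Python sketch assumes the "#dims" line and tab-separated numeric rows shown above. The function name read_vectors is hypothetical, and skipping any non-numeric line (such as a column-name row, if the real output contains one) is an assumption.

```python
import io


def read_vectors(handle):
    """Parse '#dims' plus numeric rows into (n_in, n_out, [(features, result)])."""
    n_in = n_out = None
    rows = []
    for line in handle:
        line = line.rstrip("\n")
        if not line:
            continue
        fields = line.split("\t")
        if fields[0] == "#dims":
            n_in, n_out = int(fields[1]), int(fields[2])
            continue
        try:
            values = [float(x) for x in fields]
        except ValueError:
            continue  # assumption: non-numeric lines are labels, not data
        rows.append((values[:-1], values[-1]))  # last column is the result
    return n_in, n_out, rows


if __name__ == "__main__":
    sample = "#dims\t3\t1\n0.1\t0.2\t0.3\t1\n"
    print(read_vectors(io.StringIO(sample)))
```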

Support

For questions and support: