SeqToVec

Basic Usage

seqtovec.sh in=<sequence data> out=<text vectors>

Input may be FASTA or FASTQ, compressed or uncompressed. Output is a TSV file of vectors, with the result (target value) in the last column.

Parameters

Parameters are organized by their function in the vector generation process. SeqToVec operates in two primary modes: raw mode (one-hot encoding) and spectrum mode (k-mer frequencies).

Standard parameters

in=<file>: Sequence data input file. Accepts FASTA or FASTQ format, compressed or uncompressed.
out=<file>: Vectors in TSV form, with the last column as the result. Each row represents one sequence vector with features followed by the target value.
overwrite=f: (ow) Set to false to force the program to abort rather than overwrite an existing file. Default: false

Processing parameters

parse=f: Set to true to parse the result from individual sequence headers, from a tab-delimited 'result=X' term. This allows different target values for each sequence. Default: false
result=-1: Set the desired result for all vectors when not parsing from headers. This becomes the last column value for all output vectors. Default: -1
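
The header parsing described for parse=t can be illustrated with a minimal Python sketch. It is not the tool's code: the function name parse_result is hypothetical, and falling back to the default when no 'result=X' term is present is an assumption.

```python
def parse_result(header, default=-1.0):
    """Return X from a tab-delimited 'result=X' term in a sequence header.

    Assumption: when the term is missing, fall back to `default`
    (the tool's actual behavior in that case is not documented here).
    """
    for term in header.split("\t"):
        if term.startswith("result="):
            return float(term[len("result="):])
    return default


# Hypothetical headers, shown without the leading '>' or '@'.
labeled = parse_result("seq1\tresult=0.75")
unlabeled = parse_result("seq2")
```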

Raw mode parameters

width=55: Maximum vector width, in bases; the actual vector size will be 4+4*width+1 (where the +1 is the desired output). For longer sequences, only the first 'width' bases will be used; shorter sequences will be zero-padded. Default: 55
rcomp=f: If true, also output vectors for the reverse-complement of each sequence, effectively doubling the output size. Useful for training models that should be strand-agnostic. Default: false

Spectrum mode parameters

k=0: If k is positive, generate vectors of kmer frequencies instead of raw sequence. Range is 1-8; recommended range is 4-6. Setting k>0 automatically enables spectrum mode. Default: 0 (raw mode)
dimensions=0: If positive, restrict the vector size in spectrum mode to dimensions+5. The first 4 and last 1 columns are reserved for sequence features (length, GC, entropy, homopolymer) and result. Default: 0 (unlimited)

Java Parameters

-Xmx: This will set Java's memory usage, overriding autodetection. -Xmx20g will specify 20 gigs of RAM, and -Xmx200m will specify 200 megs. The max is typically 85% of physical memory.
-eoom: This flag will cause the process to exit if an out-of-memory exception occurs. Requires Java 8u92+.
-da: Disable assertions for improved performance in production environments.

Examples

Basic Raw Mode Vector Generation

seqtovec.sh in=sequences.fasta out=vectors.tsv width=50 result=1

Generate one-hot encoded vectors with width 50 for each sequence, assigning result value 1 to all vectors.

K-mer Frequency Vectors

seqtovec.sh in=sequences.fasta out=kmer_vectors.tsv k=4 dimensions=100

Generate 4-mer frequency vectors limited to 100 k-mer dimensions, plus the 4 standard sequence features and the result column (dimensions+5 = 105 columns).

Parsing Results from Headers

seqtovec.sh in=labeled_sequences.fasta out=vectors.tsv parse=true k=5

Generate 5-mer vectors while parsing target values from sequence headers (expects a tab-delimited 'result=X' term in each header).

Reverse Complement Generation

seqtovec.sh in=sequences.fasta out=both_strands.tsv width=40 rcomp=true result=0

Generate raw vectors for both forward and reverse complement sequences, useful for strand-agnostic machine learning.

Algorithm Details

Vector Generation Modes

SequenceToVector implements two distinct vectorization strategies through the toVector() and fillVector() methods:

Raw Mode (k=0)

Raw mode uses the fillRaw() method to convert sequences into one-hot encoded vectors using the AminoAcid.baseToNumber0[] lookup table. The implementation:

Processes sequences up to 'width' bases using the appendRaw() method with zero-padding via "0\t0\t0\t0" strings
Creates vectors of size 4 + 4*width + 1 where hotCodes[] array provides one-hot encoding templates
Encodes bases through AminoAcid.baseToNumber4[] mapping: A=0→"1\t0\t0\t0", C=1→"0\t1\t0\t0", G=2→"0\t0\t1\t0", T=3→"0\t0\t0\t1"
First 4 positions: sequence features, namely the length ratio computed in toVector() plus values from the Tools.calcGC(), EntropyTracker.averageEntropy(), and Read.longestHomopolymer() methods
Vector assembly handled by ByteBuilder.append() with tab-separated formatting
Final position: target result parsed by Parse.parseFloat() from headers or default result0 value
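
The one-hot layout listed above can be sketched in Python as follows. This is an illustration under stated assumptions, not the Java implementation: the names HOT, one_hot, and raw_vector_size are hypothetical, and encoding non-ACGT bases as all zeros is an assumption.

```python
# One-hot templates in the order A, C, G, T, as described for hotCodes[].
HOT = {"A": (1, 0, 0, 0), "C": (0, 1, 0, 0), "G": (0, 0, 1, 0), "T": (0, 0, 0, 1)}


def one_hot(seq, width):
    """Return 4*width values: codes for the first `width` bases, zero-padded."""
    out = []
    for base in seq[:width].upper():
        # Assumption: bases other than A/C/G/T are encoded as all zeros.
        out.extend(HOT.get(base, (0, 0, 0, 0)))
    out.extend([0] * (4 * width - len(out)))  # pad short sequences
    return out


def raw_vector_size(width):
    """4 sequence features + 4 columns per base + 1 result column."""
    return 4 + 4 * width + 1
```

With the default width=55 this gives 225 columns, i.e. 224 inputs and 1 output.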

Spectrum Mode (k>0)

Spectrum mode is implemented by the fillSpectrum() method, which uses a bit-masked sliding-window k-mer algorithm. The implementation:

Counts k-mers using ((kmer<<2)|x)&mask bit operations where mask=~((-1)<<(2*k))
Uses kmapArray[k] pre-computed mapping tables generated by kmap() method for canonical k-mer reduction
Implements calcKSpace(k) method: (4^k + palindromes)/2 dimensions, where palindromes=1<<k for even k and 0 for odd k (no odd-length k-mer equals its own reverse complement)
Normalizes frequencies using mult=(kspace*0.25f)/count to achieve relative proportions with 0.25 average
Dimension limiting uses Random(k) seeded projection when maxDimensions < fullSpace via randy.nextInt(maxDimensions)
K-mer processing through AminoAcid.reverseComplementBinaryFast() for canonical representation
Vector output via appendSpectrum() with ByteBuilder formatting at 5-decimal precision
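
The sliding-window counting and normalization steps above can be sketched in Python. This is a simplified illustration rather than the tool's code: it keys counts by the canonical k-mer value instead of a kmapArray index, all function names are hypothetical, and resetting the window at a non-ACGT base is an assumption.

```python
BASE_TO_NUM = {"A": 0, "C": 1, "G": 2, "T": 3}


def reverse_complement_bits(kmer, k):
    """Reverse-complement a k-mer packed at 2 bits per base."""
    rc = 0
    for _ in range(k):
        rc = (rc << 2) | (3 - (kmer & 3))  # complement of code x is 3-x
        kmer >>= 2
    return rc


def canonical_counts(seq, k):
    """Count canonical k-mers (the smaller of forward and reverse complement)."""
    mask = ~((-1) << (2 * k))  # keeps the low 2*k bits
    counts, kmer, valid = {}, 0, 0
    for ch in seq.upper():
        x = BASE_TO_NUM.get(ch)
        if x is None:  # assumption: an ambiguous base restarts the window
            kmer, valid = 0, 0
            continue
        kmer = ((kmer << 2) | x) & mask
        valid += 1
        if valid >= k:
            canon = min(kmer, reverse_complement_bits(kmer, k))
            counts[canon] = counts.get(canon, 0) + 1
    return counts


def normalize(counts, kspace):
    """Scale counts so the mean over all kspace dimensions is 0.25."""
    total = sum(counts.values())
    if total == 0:
        return {}
    mult = (kspace * 0.25) / total
    return {key: n * mult for key, n in counts.items()}
```

For example, "ACGT" with k=2 yields the k-mers AC, CG, and GT; AC and GT are reverse complements and share one canonical slot, while CG is its own reverse complement.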

Sequence Feature Calculation

All vectors begin with 4 standardized features calculated by specific methods:

Length ratio: len/(width+5) normalization in toVector() method
GC content: Tools.calcGC(bases) method calculating G+C fraction
Entropy: EntropyTracker.averageEntropy() using k=3 sliding windows with ThreadLocal localETrackers
Homopolymer ratio: Read.longestHomopolymer() normalized as poly/(poly+5)
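
Three of the four features can be sketched directly from the formulas above. The Python below is an approximation with hypothetical function names; it assumes GC is computed over all bases in the sequence, and it omits entropy because the details of EntropyTracker are not given here.

```python
def gc_fraction(seq):
    """Fraction of bases that are G or C (assumption: denominator is all bases)."""
    s = seq.upper()
    return (s.count("G") + s.count("C")) / len(s) if s else 0.0


def longest_homopolymer(seq):
    """Length of the longest run of one repeated base."""
    best, run, prev = 0, 0, None
    for ch in seq.upper():
        run = run + 1 if ch == prev else 1
        best = max(best, run)
        prev = ch
    return best


def sequence_features(seq, width):
    """Length ratio, GC fraction, and homopolymer ratio as described above."""
    poly = longest_homopolymer(seq)
    return {
        "length_ratio": len(seq) / (width + 5),
        "gc": gc_fraction(seq),
        "homopolymer": poly / (poly + 5),
    }
```

The poly/(poly+5) form keeps the homopolymer feature between 0 and 1 regardless of run length.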

Memory and Performance Implementation

The implementation includes specific memory management strategies:

ThreadLocal<EntropyTracker[]> localETrackers with pre-allocated arrays for sequence lengths 16-40 bp (minWindow to maxWindow constants)
Static kmapArray[][] and kspaceArray[] lookup tables generated during class initialization via fillArrays() method
Bit-shifting k-mer encoding using mask operations and AminoAcid.baseToNumber0[] direct array access
ConcurrentReadInputStream with streaming I/O through ListNum<Read> processing batches
Default 1GB memory via z="-Xmx1g" and z2="-Xms1g" heap settings in shell wrapper

K-mer Space Management Implementation

Spectrum mode uses precise algorithmic k-mer space handling:

K-mer range 1-8 is set by the kMax=8 constant; a looser assert(k<16 && k>=0) provides additional validation
Canonical k-mer reduction via AminoAcid.reverseComplementBinaryFast(kmer, k) with kmer<=rcomp comparison
Random projection uses a generator seeded as new Random(k) when count>=maxDims, mapping k-mers to dimensions with randy.nextInt(maxDims)
Bit manipulation through mask=~((-1)<<(2*k)) and kmer=((kmer<<2)|x)&mask operations for sliding window k-mer extraction
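
The k-mer space formula and the seeded projection can be sketched in Python. This follows one reading of the description, in which the first maxDims canonical k-mers keep their own slot and later ones are assigned a random slot; that reading, and the function names, are assumptions. Python's random.Random also produces a different sequence than Java's Random, so the actual indices will not match the tool's.

```python
import random


def calc_kspace(k):
    """Number of canonical k-mers: (4^k + palindromes) / 2."""
    palindromes = (1 << k) if k % 2 == 0 else 0  # odd k has no palindromes
    return (4 ** k + palindromes) // 2


def build_projection(k, max_dims):
    """Map each canonical k-mer index to a slot in [0, max_dims).

    Assumption: indices below max_dims map to themselves; the rest are
    assigned by a generator seeded with k, so the mapping is reproducible.
    """
    rng = random.Random(k)
    return [i if i < max_dims else rng.randrange(max_dims)
            for i in range(calc_kspace(k))]
```

Seeding with k means every run with the same k and dimension limit produces the same mapping, so vectors generated at different times stay comparable.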

Output Format

Output is tab-separated values with the following structure:

Raw Mode Output

#dims	224	1
length_ratio	GC	entropy	homopolymer	base1_A	base1_C	base1_G	base1_T	...	baseN_T	result
0.1818	0.4545	1.2345	0.1667	1	0	0	0	...	0	1

Spectrum Mode Output

#dims	20	1
length_ratio	GC	entropy	homopolymer	kmer1_freq	kmer2_freq	...	kmerN_freq	result
0.2727	0.4545	1.2345	0.1667	0.05	0.03	...	0.02	1

The first line indicates the number of input dimensions and output dimensions. Each subsequent line represents one sequence vector.
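
For downstream use, the output can be loaded with a short Python reader such as the sketch below. It is not part of the tool: the function name is hypothetical, and it assumes that lines starting with '#' and any row containing non-numeric fields (such as a column-name row) can be skipped.

```python
import csv
import io


def read_vectors(text):
    """Split TSV vector output into feature rows and target values."""
    features, targets = [], []
    for row in csv.reader(io.StringIO(text), delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and the '#dims' line
        try:
            values = [float(v) for v in row]
        except ValueError:
            continue  # assumption: non-numeric rows are column names
        features.append(values[:-1])  # everything except the last column
        targets.append(values[-1])    # last column is the result
    return features, targets


# Hypothetical three-feature sample in the same layout.
sample = "#dims\t3\t1\na\tb\tc\tresult\n0.1\t0.2\t0.3\t1\n"
X, y = read_vectors(sample)
```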

Support

For questions and support:

Email: bbushnell@lbl.gov
Documentation: bbmap.org