SeqToVec
Generates vectors from sequences. These can be one-hot encoded vectors (4 values per base) or k-mer frequency spectra.
Basic Usage
seqtovec.sh in=<sequence data> out=<text vectors>
Input may be fasta or fastq, compressed or uncompressed. Output will be vectors in TSV format with the last column as the result.
Parameters
Parameters are organized by their function in the vector generation process. SeqToVec operates in two primary modes: raw mode (one-hot encoding) and spectrum mode (k-mer frequencies).
Standard parameters
- in=<file>
- Sequence data input file. Accepts FASTA or FASTQ format, compressed or uncompressed.
- out=<file>
- Vectors in TSV form, with the last column as the result. Each row represents one sequence vector with features followed by the target value.
- overwrite=f
- (ow) Set to false to force the program to abort rather than overwrite an existing file. Default: false
Processing parameters
- parse=f
- Set to true to parse the result from individual sequence headers, from a tab-delimited 'result=X' term. This allows different target values for each sequence. Default: false
- result=-1
- Set the desired result for all vectors when not parsing from headers. This becomes the last column value for all output vectors. Default: -1
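The header convention used by parse=t can be sketched in Python. This is only an illustration of the described behavior, not the tool's code: the function name parse_result is hypothetical, and falling back to the default when no 'result=X' term is present is an assumption.

```python
def parse_result(header, default=-1.0):
    """Return the value of a tab-delimited 'result=X' term, else the default.

    Assumption: headers without a 'result=' term fall back to the default,
    mirroring the result= parameter.
    """
    for term in header.split("\t"):
        if term.startswith("result="):
            return float(term[len("result="):])
    return default
```

For example, a header of "seq1<TAB>result=0.75" would yield 0.75, while a header with no such term would yield the default of -1.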
Raw mode parameters
- width=55
- Maximum vector width, in bases; the actual vector size will be 4+4*width+1 (where the +1 is the desired output). For longer sequences, only the first 'width' bases will be used; shorter sequences will be zero-padded. Default: 55
- rcomp=f
- If true, also output vectors for the reverse-complement of each sequence, effectively doubling the output size. Useful for training models that should be strand-agnostic. Default: false
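As an illustration of what rcomp=t adds, the following Python sketch computes the reverse complement of a sequence; with the flag enabled, each input sequence would contribute one vector for the original and one for this reversed strand. The helper name is hypothetical, and leaving characters other than ACGT unchanged is an assumption.

```python
# Translation table for the four standard bases, upper and lower case.
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")

def reverse_complement(seq):
    """Complement each base, then reverse the string.

    Assumption: characters other than ACGT (such as N) are left unchanged.
    """
    return seq.translate(_COMPLEMENT)[::-1]
```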
Spectrum mode parameters
- k=0
- If k is positive, generate vectors of kmer frequencies instead of raw sequence. Range is 1-8; recommended range is 4-6. Setting k>0 automatically enables spectrum mode. Default: 0 (raw mode)
- dimensions=0
- If positive, restrict the vector size in spectrum mode to dimensions+5. The first 4 columns are reserved for the sequence features (length, GC, entropy, homopolymer) and the last column for the result. Default: 0 (unlimited)
Java Parameters
- -Xmx
- This will set Java's memory usage, overriding autodetection. -Xmx20g will specify 20 gigs of RAM, and -Xmx200m will specify 200 megs. The max is typically 85% of physical memory.
- -eoom
- This flag will cause the process to exit if an out-of-memory exception occurs. Requires Java 8u92+.
- -da
- Disable assertions for improved performance in production environments.
Examples
Basic Raw Mode Vector Generation
seqtovec.sh in=sequences.fasta out=vectors.tsv width=50 result=1
Generate one-hot encoded vectors with width 50 for each sequence, assigning result value 1 to all vectors.
K-mer Frequency Vectors
seqtovec.sh in=sequences.fasta out=kmer_vectors.tsv k=4 dimensions=100
Generate 4-mer frequency vectors limited to 100 dimensions plus the 4 standard sequence features.
Parsing Results from Headers
seqtovec.sh in=labeled_sequences.fasta out=vectors.tsv parse=true k=5
Generate 5-mer vectors while parsing target values from sequence headers (expects 'result=X' in header).
Reverse Complement Generation
seqtovec.sh in=sequences.fasta out=both_strands.tsv width=40 rcomp=true result=0
Generate raw vectors for both forward and reverse complement sequences, useful for strand-agnostic machine learning.
Algorithm Details
Vector Generation Modes
SequenceToVector implements two distinct vectorization strategies through the toVector() and fillVector() methods:
Raw Mode (k=0)
Raw mode uses the fillRaw() method to convert sequences into one-hot encoded vectors using the AminoAcid.baseToNumber0[] lookup table. The implementation:
- Processes sequences up to 'width' bases using the appendRaw() method with zero-padding via "0\t0\t0\t0" strings
- Creates vectors of size 4 + 4*width + 1 where hotCodes[] array provides one-hot encoding templates
- Encodes bases through AminoAcid.baseToNumber4[] mapping: A=0→"1\t0\t0\t0", C=1→"0\t1\t0\t0", G=2→"0\t0\t1\t0", T=3→"0\t0\t0\t1"
- First 4 positions: length ratio, GC content, entropy, and homopolymer ratio; the last three are calculated via the Tools.calcGC(), EntropyTracker.averageEntropy(), and Read.longestHomopolymer() methods
- Vector assembly handled by ByteBuilder.append() with tab-separated formatting
- Final position: target result parsed by Parse.parseFloat() from headers or default result0 value
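The one-hot portion of a raw-mode vector can be sketched in Python as follows. This is a minimal illustration of the truncate-and-pad behavior described above, not the tool's implementation. The names HOT, ZERO, and one_hot are hypothetical, and encoding bases other than A, C, G, T as all zeros is an assumption.

```python
# One-hot templates in A, C, G, T column order.
HOT = {"A": (1, 0, 0, 0), "C": (0, 1, 0, 0), "G": (0, 0, 1, 0), "T": (0, 0, 0, 1)}
ZERO = (0, 0, 0, 0)

def one_hot(seq, width=55):
    """Return 4*width values: one-hot codes for the first 'width' bases,
    followed by zero padding for sequences shorter than 'width'."""
    used = seq[:width].upper()
    cols = []
    for base in used:
        # Assumption: ambiguous bases such as N become four zeros.
        cols.extend(HOT.get(base, ZERO))
    cols.extend(ZERO * (width - len(used)))
    return cols
```

With the default width=55 this yields 220 values; adding the 4 leading features gives the 224 input dimensions, plus 1 column for the result.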
Spectrum Mode (k>0)
Spectrum mode uses the fillSpectrum() method, which counts k-mers with a bit-masked sliding window. The implementation:
- Counts k-mers using ((kmer<<2)|x)&mask bit operations where mask=~((-1)<<(2*k))
- Uses kmapArray[k] pre-computed mapping tables generated by kmap() method for canonical k-mer reduction
- Sizes the spectrum with the calcKSpace(k) method: (4^k + palindromes)/2 dimensions, where palindromes=1<<k for even k and 0 for odd k (an odd-length k-mer cannot equal its own reverse complement)
- Normalizes frequencies using mult=(kspace*0.25f)/count to achieve relative proportions with 0.25 average
- Dimension limiting uses Random(k) seeded projection when maxDimensions < fullSpace via randy.nextInt(maxDimensions)
- K-mer processing through AminoAcid.reverseComplementBinaryFast() for canonical representation
- Vector output via appendSpectrum() with ByteBuilder formatting at 5-decimal precision
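The k-mer space formula and the normalization step can be checked with a short Python sketch. The function names are hypothetical, and treating count as the total number of k-mers counted in the sequence is an assumption about the formula's inputs.

```python
def calc_kspace(k):
    """Number of canonical k-mers: (4^k + palindromes) / 2.

    Only even k admits k-mers equal to their own reverse complement,
    and there are 2^k of them.
    """
    palindromes = (1 << k) if k % 2 == 0 else 0
    return (4 ** k + palindromes) // 2

def normalize(counts):
    """Scale counts by mult = (kspace * 0.25) / count so the mean is 0.25.

    Assumption: 'count' is the total number of k-mers observed.
    """
    total = sum(counts)
    if total == 0:
        return [0.0] * len(counts)
    mult = (len(counts) * 0.25) / total
    return [c * mult for c in counts]
```

For k=4 the canonical space has 136 entries, so an invocation such as k=4 dimensions=100 falls into the case where the space exceeds the dimension limit.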
Sequence Feature Calculation
All vectors begin with 4 standardized features calculated by specific methods:
- Length ratio: len/(width+5) normalization in toVector() method
- GC content: Tools.calcGC(bases) method calculating G+C fraction
- Entropy: EntropyTracker.averageEntropy() using k=3 sliding windows with ThreadLocal localETrackers
- Homopolymer ratio: Read.longestHomopolymer() normalized as poly/(poly+5)
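Three of these four features follow directly from the formulas given and can be sketched in Python. Entropy is omitted because its windowed calculation is not specified here in enough detail to reproduce. The function names are hypothetical, and dividing the G+C count by the full sequence length is an assumption about how Tools.calcGC() treats ambiguous bases.

```python
def gc_fraction(seq):
    """Fraction of bases that are G or C (assumption: denominator is length)."""
    s = seq.upper()
    if not s:
        return 0.0
    return sum(1 for b in s if b in "GC") / len(s)

def longest_homopolymer(seq):
    """Length of the longest run of a single repeated base."""
    best = run = 0
    prev = None
    for b in seq.upper():
        run = run + 1 if b == prev else 1
        prev = b
        best = max(best, run)
    return best

def features(seq, width=55):
    """Return (length ratio, GC fraction, homopolymer ratio)."""
    poly = longest_homopolymer(seq)
    return (len(seq) / (width + 5), gc_fraction(seq), poly / (poly + 5))
```

Both ratios stay below 1 for typical inputs: the homopolymer ratio poly/(poly+5) approaches 1 only for very long runs, and the length ratio reaches 1 at width+5 bases.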
Memory and Performance Implementation
The implementation includes specific memory management strategies:
- ThreadLocal<EntropyTracker[]> localETrackers with pre-allocated arrays for sequence lengths 16-40 bp (minWindow to maxWindow constants)
- Static kmapArray[][] and kspaceArray[] lookup tables generated during class initialization via fillArrays() method
- Bit-shifting k-mer encoding using mask operations and AminoAcid.baseToNumber0[] direct array access
- ConcurrentReadInputStream with streaming I/O through ListNum<Read> processing batches
- Default 1GB memory via z="-Xmx1g" and z2="-Xms1g" heap settings in shell wrapper
K-mer Space Management Implementation
Spectrum mode uses precise algorithmic k-mer space handling:
- K-mer range 1-8 enforced by kMax=8 constant with assert(k<16 && k>=0) validation
- Canonical k-mer reduction via AminoAcid.reverseComplementBinaryFast(kmer, k) with kmer<=rcomp comparison
- Random projection implements new Random(k) seeded generator when count>=maxDims, using randy.nextInt(maxDims) mapping
- Bit manipulation through mask=~((-1)<<(2*k)) and kmer=((kmer<<2)|x)&mask operations for sliding window k-mer extraction
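The sliding-window and canonical-reduction steps above can be sketched in Python. This is an illustration under stated assumptions rather than the tool's code: all names are hypothetical, resetting the window at ambiguous bases is an assumption, and projection_table shows only one plausible reading of the seeded mapping (indices below the limit keep their slot, the rest are assigned pseudo-randomly). Python's random.Random does not reproduce the number sequence of Java's Random, so the actual assignments would differ.

```python
import random

BASE_TO_NUM = {"A": 0, "C": 1, "G": 2, "T": 3}

def reverse_complement_bits(kmer, k):
    """Reverse-complement a 2-bit-per-base encoded k-mer."""
    rc = 0
    for _ in range(k):
        rc = (rc << 2) | (3 - (kmer & 3))  # complement of x is 3 - x
        kmer >>= 2
    return rc

def canonical_kmers(seq, k):
    """Yield min(kmer, reverse complement) for each k-mer window."""
    mask = ~((-1) << (2 * k))  # keeps the low 2*k bits
    kmer = 0
    length = 0
    out = []
    for base in seq.upper():
        x = BASE_TO_NUM.get(base)
        if x is None:
            # Assumption: an ambiguous base restarts the window.
            kmer, length = 0, 0
            continue
        kmer = ((kmer << 2) | x) & mask
        length += 1
        if length >= k:
            out.append(min(kmer, reverse_complement_bits(kmer, k)))
    return out

def projection_table(kspace, max_dims, seed):
    """Map each of 'kspace' indices to a slot in [0, max_dims).

    Assumption: the first max_dims indices map to themselves and the
    remainder are placed by a generator seeded with 'seed'.
    """
    rng = random.Random(seed)
    return [i if i < max_dims else rng.randrange(max_dims) for i in range(kspace)]
```

Seeding the generator with a fixed value makes the mapping repeatable, so vectors produced in separate runs with the same k and dimension limit remain comparable.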
Output Format
Output is tab-separated values with the following structure:
Raw Mode Output
#dims 224 1
length_ratio GC entropy homopolymer base1_A base1_C base1_G base1_T ... baseN_T result
0.1818 0.4545 1.2345 0.1667 1 0 0 0 ... 0 1
Spectrum Mode Output
#dims 20 1
length_ratio GC entropy homopolymer kmer1_freq kmer2_freq ... kmerN_freq result
0.2727 0.4545 1.2345 0.1667 0.05 0.03 ... 0.02 1
The first line indicates the number of input dimensions and output dimensions. Each subsequent line represents one sequence vector.
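A downstream script could load this format with a small reader like the Python sketch below. It is a hypothetical helper, not part of the tool. It assumes that the #dims line is whitespace-delimited, that data fields are tab-separated, and that any line whose fields are not all numeric (such as a column-label line) can be skipped.

```python
import io

def read_vectors(handle):
    """Parse SeqToVec-style TSV into (dims, rows).

    dims is (inputs, outputs) from the '#dims' line, or None if absent.
    Each row is an (inputs, outputs) pair of float lists.
    """
    dims = None
    rows = []
    for line in handle:
        line = line.rstrip("\n")
        if not line:
            continue
        if line.startswith("#"):
            parts = line.split()
            if parts[0] == "#dims" and len(parts) >= 3:
                dims = (int(parts[1]), int(parts[2]))
            continue
        try:
            values = [float(v) for v in line.split("\t")]
        except ValueError:
            # Assumption: non-numeric lines are column labels; skip them.
            continue
        n_out = dims[1] if dims else 1
        rows.append((values[:-n_out], values[-n_out:]))
    return dims, rows
```

The function accepts any iterable of lines, so it works with an open file or with io.StringIO for quick checks.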
Support
For questions and support:
- Email: bbushnell@lbl.gov
- Documentation: bbmap.org