Seal

Script: seal.sh Package: jgi Class: Seal.java

SEAL (Sequence Expression AnaLyzer) performs high-speed alignment-free sequence quantification by counting long kmers that match between reads and reference sequences. Unlike BBDuk, which associates one kmer with one reference, SEAL can associate kmers with multiple references, making it ideal for RNA-seq with alternative splicing, related organisms, and isoform quantification.

Key Features

Multi-Reference Kmer Mapping

Unlike BBDuk's one-kmer-to-one-reference limitation, SEAL allows kmers to be associated with unlimited references, enabling analysis of related organisms, gene isoforms, and sequences sharing common regions.

RNA-seq Specialization

Designed specifically for RNA-seq applications with alternative splicing support. Handles splice junctions naturally through kmer-based matching without alignment gaps.

High-Speed Processing

Generally faster than BBSplit for read binning and quantification, though uses approximately 3x memory (20 bytes/base vs 6 bytes/base) with unlimited read length support.

Flexible Output Modes

Supports expression quantification, read binning, statistics generation, and taxonomic classification with multiple ambiguity handling strategies.

Basic Usage

seal.sh in=<file> ref=<references> pattern=out_%.fq outu=unmapped.fq stats=stats.txt

RNA-seq Expression Quantification

# Basic transcript quantification with RPKM output
seal.sh in=reads.fq ref=transcripts.fa stats=sealstats.txt rpkm=sealrpkm.txt ambig=random

# Multiple reference files with per-file statistics
seal.sh in=reads.fq ref=ref1.fa,ref2.fa,ref3.fa refstats=refstats.txt

Read Binning by Best Match

# Split reads into files by organism match
seal.sh in=reads.fq ref=bacterial_genomes.fa pattern=out_%.fq outu=unmapped.fq ambig=all

# With compressed output
seal.sh in=reads.fq *.fasta.gz pattern=out_%.fq.gz outu=unmapped.fq.gz

Taxonomic Classification

# Display taxonomic information at phylum level
seal.sh in=reads.fq ref=organisms.fasta minlevel=phylum maxlevel=phylum \
  tax=tax_results.txt tree=tree.taxtree.gz gi=gitable.int1d.gz

Statistics Summarization

# Summarize multiple SEAL runs for cross-contamination analysis
summarizeseal.sh sealstats*.txt out=summary.txt

Understanding SEAL's Core Concepts

Ambiguity Modes

SEAL handles reads that match multiple references equally well through four strategies:

  • random (default): Randomly selects one best-matching reference. Every read assigned to exactly one reference.
  • first: Uses the first best-matching reference sequence encountered.
  • toss: Considers ambiguously-mapped reads as unmapped.
  • all: Assigns reads to all best-matching references (total assignments > input reads if ambiguous matches exist).

Example: If a read shares 2 kmers with reference A, 2 with B, and 1 with C, it will choose randomly between A and B (both better than C) with ambig=random.

Clearzone Parameter

Controls ambiguity sensitivity by setting the maximum kmer match difference between best and worst references considered equivalent:

  • clearzone=0 (default): Only exact ties are ambiguous
  • clearzone=2: References within 2 kmer hits of the best are considered tied
  • clearzone=large number: All references sharing any kmers are considered tied

Example: Read shares 10 kmers with ref A, 8 with B, 3 with C. At clearzone=2, A and B are ambiguous. At clearzone=7, A, B, and C are all ambiguous.

Match Modes

Determines kmer counting strategy with speed/accuracy tradeoffs:

  • all (default): Count all kmers in each read (slowest, most accurate)
  • first: Stop after first matching kmer (fastest, less accurate)
  • unique: Stop after first uniquely matching kmer (intermediate speed/accuracy)

Reference Name Handling

SEAL tracks statistics per sequence by default. For multi-contig references:

  • Default behavior: One statistics line per reference sequence (can be many lines per genome)
  • refnames=t: Statistics reported per reference file rather than per sequence
  • fuse.sh preprocessing: Concatenate sequences into single sequence per file (recommended for genomes)

Parameters

SEAL parameters are organized into functional groups matching the shell script organization. All parameters from the original implementation are preserved to maintain compatibility.

Input parameters

in=<file>
Main input file. Use in=stdin.fq to pipe from stdin. Accepts fasta or fastq, compressed or uncompressed.
in2=<file>
Input for 2nd read of pairs in a different file.
ref=<file,file>
Comma-delimited list of reference files or directories. Filenames may also be used without ref=, e.g. *.fa. Supports keywords: adapters, artifacts, phix, lambda, pjet, mtst, kapa.
literal=<seq,seq>
Comma-delimited list of literal reference sequences for direct sequence matching.
touppercase=f
(tuc) Change all bases to upper-case during processing.
interleaved=auto
(int) Override interleaved autodetection with t/f. Must be set manually when streaming fastq input.
qin=auto
Input quality offset: 33 (Sanger), 64 (Illumina 1.3+), or auto for automatic detection.
reads=-1
If positive, quit after processing X reads or pairs. Useful for testing on subsets.
copyundefined=f
(cu) Process non-AGCT IUPAC reference bases by making all possible unambiguous copies. Intended for short motifs or adapter barcodes, as time/memory use is exponential.

Output parameters

out=<file>
(outmatch) Write reads containing kmers matching the reference. Use 'out=stdout.fq' to pipe to standard out.
out2=<file>
(outmatch2) Write 2nd read of pairs to a different file when using separate output files.
outu=<file>
(outunmatched) Write reads that do not contain kmers matching the database.
outu2=<file>
(outunmatched2) Write 2nd read of unmatched pairs to a different file.
pattern=<file>
Write reads to one stream per reference sequence match, replacing % with sequence name. Example: pattern=%.fq creates dog.fq and cat.fq for references named dog and cat.
stats=<file>
Write statistics about which references were detected and their hit counts.
refstats=<file>
Write statistics on a per-reference-file basis rather than per-sequence.
rpkm=<file>
Write RPKM (Reads Per Kilobase Million) for each reference sequence, essential for RNA-seq quantification.
dump=<file>
Dump kmer tables to a file in fasta format for debugging or analysis.
nzo=t
Only write statistics about reference sequences with nonzero hits to reduce output size.
overwrite=t
(ow) Grant permission to overwrite existing output files.
showspeed=t
(ss) Set to 'f' to suppress display of processing speed statistics.
ziplevel=2
(zl) Compression level for output; 1 (minimum/fastest) through 9 (maximum/slowest).
fastawrap=80
Length of lines in fasta output files.
qout=auto
Output quality offset: 33 (Sanger), 64 (Illumina 1.3+), or auto to match input.
statscolumns=5
(cols) Number of columns for stats output: 3 (basic) or 5 (includes base counts).
rename=f
Append matched reference sequence names to read headers for tracking.
addcount=t
If renaming, include the reference hit counts in the header.
tophitonly=f
If renaming, only add the top hit rather than all hits.
refnames=f
Use names of reference files rather than scaffold IDs. More efficient for multiple reference files than tracking statistics on a per-sequence basis.
trd=f
Truncate read and reference names at the first whitespace character.
ordered=f
Set to true to output reads in the same order as input (slower but preserves order).
kpt=t
(keepPairsTogether) Paired reads will always be assigned to the same reference sequence.

Processing parameters

k=31
Kmer length used for finding matches. References shorter than k will not be found. Must be at least 1, maximum 31. Longer kmers are more specific but less sensitive.
rcomp=t
Look for reverse-complements of kmers in addition to forward kmers, essential for most applications.
maskmiddle=t
(mm) Treat the middle base(s) of a kmer as wildcards to increase sensitivity with sequencing errors. Can be set to a number (e.g., mm=3) to mask that many bp. Default mm=t corresponds to mm=1 for odd k, mm=2 for even k.
mm
Alias for maskmiddle. Treat the middle base(s) of a kmer as wildcards to increase sensitivity with sequencing errors.
minkmerhits=1
(mkh) Minimum number of kmer hits required for a read to be considered a match.
minkmerfraction=0.0
(mkf) Minimum fraction of read kmers that must hit a reference for the read to be considered a match.
hammingdistance=0
(hdist) Maximum Hamming distance for reference kmers (substitutions only). Memory use is proportional to (3*K)^hdist.
qhdist=0
Hamming distance for query kmers; impacts speed but not memory usage.
editdistance=0
(edist) Maximum edit distance from reference kmers (substitutions and indels). Memory use is proportional to (8*K)^edist.
forbidn=f
(fn) Forbid matching of read kmers containing N. By default, N-containing kmers will match reference 'A' if hdist>0 or edist>0 to increase sensitivity.
match=all
Determines when to quit looking for kmer matches:
  • all: Attempt to match all kmers in each read
  • first: Quit after the first matching kmer
  • unique: Quit after the first uniquely matching kmer
ambiguous=random
(ambig) Behavior for ambiguously-mapped reads (equal kmer matches to multiple sequences):
  • first: Use the first best-matching sequence
  • toss: Consider unmapped
  • random: Select one best-matching sequence randomly
  • all: Use all best-matching sequences
genesets=f
Assign ambiguously-mapping reads to newly created gene sets for stats/rpkm output. May be slow.
clearzone=0
(cz) Threshold for ambiguity. If best match shares X kmers, reads are considered ambiguously mapped to sequences sharing at least [X minus clearzone] kmers.
czf=0.0
(clearzonefraction) If positive, actual clearzone = max(cz, czf*N) where N is total kmers.
ecco=f
For overlapping paired reads only. Performs error-correction with BBMerge prior to kmer operations.

Containment parameters

processcontainedref=f
Require a reference sequence to be fully contained by an input sequence.
storerefbases=f
Store reference bases so containments can be validated. If false with processcontainedref=true, only requires same number of shared bases as present in reference.

Taxonomy parameters (only use when doing taxonomy)

tax=<file>
Output destination for taxonomy information.
taxtree=<file>
(tree) A serialized TaxTree (tree.taxtree.gz) for taxonomic classification.
gi=<file>
A serialized GiTable (gitable.int1d.gz). Only needed if reference sequence names start with 'gi|'.
mincount=1
Only display taxa with at least this many hits.
maxnodes=-1
If positive, display at most this many top hits.
minlevel=subspecies
Do not display nodes below this taxonomic level. Valid levels: subspecies, species, genus, family, order, class, phylum, kingdom, domain, life.
maxlevel=life
Do not display nodes above this taxonomic level.

Speed and Memory parameters

threads=auto
(t) Number of threads to use; default is number of logical processors.
prealloc=f
Preallocate memory in kmer table. Allows faster table loading and more efficient memory usage for large references, but uses more memory upfront.
monitor=f
Kill process if CPU usage drops to zero for extended time. Format: monitor=600,0.01 kills after 600 seconds under 1% usage.
rskip=1
Skip reference kmers to reduce memory usage. 1=use all, 2=every other kmer, etc.
qskip=1
Skip query kmers to increase speed. 1=use all kmers.
speed=0
Ignore this fraction of kmer space (0-15 out of 16) in both reads and reference. Increases speed and reduces memory. Do not use with qskip or rskip.

Trimming/Masking parameters

qtrim=f
Trim read ends to remove bases with quality below trimq. Performed AFTER kmer matching. Values: t (both ends), f (neither), r (right only), l (left only).
trimq=6
Regions with average quality BELOW this threshold will be trimmed.
minlength=1
(ml) Reads shorter than this after trimming will be discarded. Pairs discarded only if both reads are shorter.
maxlength=
Reads longer than this after trimming will be discarded. Pairs discarded only if both reads are longer.
minavgquality=0
(maq) Reads with average quality (after trimming) below this will be discarded.
maqb=0
If positive, calculate average quality from this many initial bases only.
maxns=-1
If non-negative, reads with more Ns than this (after trimming) will be discarded.
forcetrimleft=0
(ftl) If positive, trim bases to the left of this position (exclusive, 0-based).
forcetrimright=0
(ftr) If positive, trim bases to the right of this position (exclusive, 0-based).
forcetrimright2=0
(ftr2) If positive, trim this many bases from the right end.
forcetrimmod=0
(ftm) If positive, right-trim length to be equal to zero modulo this number.
restrictleft=0
If positive, only look for kmer matches in the leftmost X bases of reads.
restrictright=0
If positive, only look for kmer matches in the rightmost X bases of reads.

Java Parameters

-Xmx
Set Java's memory usage, overriding autodetection. -Xmx20g specifies 20GB RAM, -Xmx200m specifies 200MB. Maximum is typically 85% of physical memory.
-eoom
Exit if out-of-memory exception occurs. Requires Java 8u92+.
-da
Disable assertions for slight performance improvement.

Algorithm Details

Multi-Reference Kmer Architecture

SEAL's core advantage over BBDuk lies in its multi-reference kmer mapping capability. While BBDuk associates each kmer with a single reference (first encountered), SEAL stores unlimited reference associations per kmer using IntList structures. This enables analysis of related sequences that share kmers, such as splice variants sharing exons or closely related species sharing genomic regions.

Kmer Table Implementation

  • ARRAYHF (default): Hybrid array structure optimized for single-threaded access with efficient memory allocation
  • ARRAYH: Thread-safe hybrid array for concurrent processing workloads
  • ARRAY2D/1D: Linear array structures for predictable memory patterns
  • FOREST2D: Tree-based organization for very large reference sets

Memory Usage Characteristics

SEAL requires approximately 20 bytes per unique kmer, similar to BBDuk. However, non-unique kmers (those present in multiple references) incur additional storage costs: approximately 32 bytes overhead plus 4 bytes per additional reference copy. This memory increase is the tradeoff for multi-reference capability.

Memory vs Sensitivity Tradeoffs

  • Standard mode: ~20 bytes/base for comprehensive kmer coverage
  • rskip parameter: Reduce memory by skipping reference kmers (rskip=2 uses every other kmer)
  • speed parameter: Probabilistic kmer space sampling for memory reduction
  • Lower-memory mode: Arbitrarily low memory usage possible with sensitivity reduction

RNA-seq Quantification Algorithm

SEAL's kmer-based approach naturally handles RNA-seq complexities that challenge traditional aligners:

Alternative Splicing Handling

By matching exonic kmers directly, SEAL avoids alignment gaps at splice junctions. A read spanning two exons will have kmers matching both exonic sequences, enabling quantification regardless of splice variant. This approach is particularly effective for isoform discrimination when combined with appropriate clearzone settings.

RPKM Calculation

Normalized expression calculation: RPKM = (reads × 1,000,000,000) ÷ (total_mapped_reads × reference_length). The algorithm accounts for both library size normalization and gene length normalization, essential for comparing expression across genes and samples.

Performance Optimization Strategies

Speed vs Accuracy Modes

  • match=all: Counts all kmers for maximum accuracy (slowest)
  • match=first: Stops after first kmer match (fastest, less accurate)
  • match=unique: Stops after first unique kmer (intermediate performance)

Parallel Processing Architecture

SEAL implements fine-grained parallelization with separate thread pools for I/O operations, kmer table access, and statistics accumulation. Thread-safe AtomicLongArray structures enable concurrent kmer counting without locks.

SEAL vs BBSplit Comparison

Both tools perform read binning and quantification, but with different approaches:

Feature SEAL BBSplit
Speed Generally faster Slower, especially for long reads
Memory Usage ~20 bytes/base ~6 bytes/base
Max Read Length Unlimited 6000bp maximum (600bp default)
Sensitivity/Specificity Good, kmer-based Higher, alignment-based
Best Use Case RNA-seq, long sequences, assemblies Short read contamination screening

Usage Guidelines

RNA-seq Best Practices

  • Use k=31 for optimal specificity/sensitivity balance
  • Enable reverse complement matching (rcomp=t, default)
  • Generate RPKM output for expression quantification
  • Consider clearzone parameter for splice junction handling
  • Use ambig=random for unbiased isoform assignment

Multi-Genome Analysis

  • Use fuse.sh to concatenate contigs before SEAL analysis
  • Set refnames=t for per-genome rather than per-contig statistics
  • Consider clearzone settings for closely related organisms
  • Use ambig=all to see all genome associations

Memory Optimization

  • Enable prealloc=t for large reference sets
  • Use rskip parameter to reduce memory usage
  • Consider speed parameter for memory-constrained systems
  • Monitor memory usage with large multi-reference datasets

Quality Control

  • Use built-in reference keywords (adapters, phix, etc.) for contamination screening
  • Generate statistics outputs to assess reference coverage
  • Use summarizeseal.sh for cross-contamination analysis
  • Enable nzo=t to focus on detected references

Related Tools

Support

For questions and support: