Seal
SEAL (Sequence Expression AnaLyzer) performs high-speed alignment-free sequence quantification by counting long kmers that match between reads and reference sequences. Unlike BBDuk, which associates one kmer with one reference, SEAL can associate kmers with multiple references, making it ideal for RNA-seq with alternative splicing, related organisms, and isoform quantification.
Key Features
Multi-Reference Kmer Mapping
Unlike BBDuk's one-kmer-to-one-reference limitation, SEAL allows kmers to be associated with unlimited references, enabling analysis of related organisms, gene isoforms, and sequences sharing common regions.
RNA-seq Specialization
Designed specifically for RNA-seq applications with alternative splicing support. Handles splice junctions naturally through kmer-based matching without alignment gaps.
High-Speed Processing
Generally faster than BBSplit for read binning and quantification, though uses approximately 3x memory (20 bytes/base vs 6 bytes/base) with unlimited read length support.
Flexible Output Modes
Supports expression quantification, read binning, statistics generation, and taxonomic classification with multiple ambiguity handling strategies.
Basic Usage
seal.sh in=<file> ref=<references> pattern=out_%.fq outu=unmapped.fq stats=stats.txt
RNA-seq Expression Quantification
# Basic transcript quantification with RPKM output
seal.sh in=reads.fq ref=transcripts.fa stats=sealstats.txt rpkm=sealrpkm.txt ambig=random
# Multiple reference files with per-file statistics
seal.sh in=reads.fq ref=ref1.fa,ref2.fa,ref3.fa refstats=refstats.txt
Read Binning by Best Match
# Split reads into files by organism match
seal.sh in=reads.fq ref=bacterial_genomes.fa pattern=out_%.fq outu=unmapped.fq ambig=all
# With compressed output
seal.sh in=reads.fq *.fasta.gz pattern=out_%.fq.gz outu=unmapped.fq.gz
Taxonomic Classification
# Display taxonomic information at phylum level
seal.sh in=reads.fq ref=organisms.fasta minlevel=phylum maxlevel=phylum \
tax=tax_results.txt tree=tree.taxtree.gz gi=gitable.int1d.gz
Statistics Summarization
# Summarize multiple SEAL runs for cross-contamination analysis
summarizeseal.sh sealstats*.txt out=summary.txt
Understanding SEAL's Core Concepts
Ambiguity Modes
SEAL handles reads that match multiple references equally well through four strategies:
- random (default): Randomly selects one best-matching reference. Every read assigned to exactly one reference.
- first: Uses the first best-matching reference sequence encountered.
- toss: Considers ambiguously-mapped reads as unmapped.
- all: Assigns reads to all best-matching references (total assignments > input reads if ambiguous matches exist).
Example: If a read shares 2 kmers with reference A, 2 with B, and 1 with C, it will choose randomly between A and B (both better than C) with ambig=random.
Clearzone Parameter
Controls ambiguity sensitivity by setting the maximum kmer match difference between best and worst references considered equivalent:
- clearzone=0 (default): Only exact ties are ambiguous
- clearzone=2: References within 2 kmer hits of the best are considered tied
- clearzone=large number: All references sharing any kmers are considered tied
Example: Read shares 10 kmers with ref A, 8 with B, 3 with C. At clearzone=2, A and B are ambiguous. At clearzone=7, A, B, and C are all ambiguous.
Match Modes
Determines kmer counting strategy with speed/accuracy tradeoffs:
- all (default): Count all kmers in each read (slowest, most accurate)
- first: Stop after first matching kmer (fastest, less accurate)
- unique: Stop after first uniquely matching kmer (intermediate speed/accuracy)
Reference Name Handling
SEAL tracks statistics per sequence by default. For multi-contig references:
- Default behavior: One statistics line per reference sequence (can be many lines per genome)
- refnames=t: Statistics reported per reference file rather than per sequence
- fuse.sh preprocessing: Concatenate sequences into single sequence per file (recommended for genomes)
Parameters
SEAL parameters are organized into functional groups matching the shell script organization. All parameters from the original implementation are preserved to maintain compatibility.
Input parameters
- in=<file>
- Main input file. Use in=stdin.fq to pipe from stdin. Accepts fasta or fastq, compressed or uncompressed.
- in2=<file>
- Input for 2nd read of pairs in a different file.
- ref=<file,file>
- Comma-delimited list of reference files or directories. Filenames may also be used without ref=, e.g. *.fa. Supports keywords: adapters, artifacts, phix, lambda, pjet, mtst, kapa.
- literal=<seq,seq>
- Comma-delimited list of literal reference sequences for direct sequence matching.
- touppercase=f
- (tuc) Change all bases to upper-case during processing.
- interleaved=auto
- (int) Override interleaved autodetection with t/f. Must be set manually when streaming fastq input.
- qin=auto
- Input quality offset: 33 (Sanger), 64 (Illumina 1.3+), or auto for automatic detection.
- reads=-1
- If positive, quit after processing X reads or pairs. Useful for testing on subsets.
- copyundefined=f
- (cu) Process non-AGCT IUPAC reference bases by making all possible unambiguous copies. Intended for short motifs or adapter barcodes, as time/memory use is exponential.
Output parameters
- out=<file>
- (outmatch) Write reads containing kmers matching the reference. Use 'out=stdout.fq' to pipe to standard out.
- out2=<file>
- (outmatch2) Write 2nd read of pairs to a different file when using separate output files.
- outu=<file>
- (outunmatched) Write reads that do not contain kmers matching the database.
- outu2=<file>
- (outunmatched2) Write 2nd read of unmatched pairs to a different file.
- pattern=<file>
- Write reads to one stream per reference sequence match, replacing % with sequence name. Example: pattern=%.fq creates dog.fq and cat.fq for references named dog and cat.
- stats=<file>
- Write statistics about which references were detected and their hit counts.
- refstats=<file>
- Write statistics on a per-reference-file basis rather than per-sequence.
- rpkm=<file>
- Write RPKM (Reads Per Kilobase Million) for each reference sequence, essential for RNA-seq quantification.
- dump=<file>
- Dump kmer tables to a file in fasta format for debugging or analysis.
- nzo=t
- Only write statistics about reference sequences with nonzero hits to reduce output size.
- overwrite=t
- (ow) Grant permission to overwrite existing output files.
- showspeed=t
- (ss) Set to 'f' to suppress display of processing speed statistics.
- ziplevel=2
- (zl) Compression level for output; 1 (minimum/fastest) through 9 (maximum/slowest).
- fastawrap=80
- Length of lines in fasta output files.
- qout=auto
- Output quality offset: 33 (Sanger), 64 (Illumina 1.3+), or auto to match input.
- statscolumns=5
- (cols) Number of columns for stats output: 3 (basic) or 5 (includes base counts).
- rename=f
- Append matched reference sequence names to read headers for tracking.
- addcount=t
- If renaming, include the reference hit counts in the header.
- tophitonly=f
- If renaming, only add the top hit rather than all hits.
- refnames=f
- Use names of reference files rather than scaffold IDs. More efficient for multiple reference files than tracking statistics on a per-sequence basis.
- trd=f
- Truncate read and reference names at the first whitespace character.
- ordered=f
- Set to true to output reads in the same order as input (slower but preserves order).
- kpt=t
- (keepPairsTogether) Paired reads will always be assigned to the same reference sequence.
Processing parameters
- k=31
- Kmer length used for finding matches. References shorter than k will not be found. Must be at least 1, maximum 31. Longer kmers are more specific but less sensitive.
- rcomp=t
- Look for reverse-complements of kmers in addition to forward kmers, essential for most applications.
- maskmiddle=t
- (mm) Treat the middle base(s) of a kmer as wildcards to increase sensitivity with sequencing errors. Can be set to a number (e.g., mm=3) to mask that many bp. Default mm=t corresponds to mm=1 for odd k, mm=2 for even k.
- mm
- Alias for maskmiddle. Treat the middle base(s) of a kmer as wildcards to increase sensitivity with sequencing errors.
- minkmerhits=1
- (mkh) Minimum number of kmer hits required for a read to be considered a match.
- minkmerfraction=0.0
- (mkf) Minimum fraction of read kmers that must hit a reference for the read to be considered a match.
- hammingdistance=0
- (hdist) Maximum Hamming distance for reference kmers (substitutions only). Memory use is proportional to (3*K)^hdist.
- qhdist=0
- Hamming distance for query kmers; impacts speed but not memory usage.
- editdistance=0
- (edist) Maximum edit distance from reference kmers (substitutions and indels). Memory use is proportional to (8*K)^edist.
- forbidn=f
- (fn) Forbid matching of read kmers containing N. By default, N-containing kmers will match reference 'A' if hdist>0 or edist>0 to increase sensitivity.
- match=all
- Determines when to quit looking for kmer matches:
- all: Attempt to match all kmers in each read
- first: Quit after the first matching kmer
- unique: Quit after the first uniquely matching kmer
- ambiguous=random
- (ambig) Behavior for ambiguously-mapped reads (equal kmer matches to multiple sequences):
- first: Use the first best-matching sequence
- toss: Consider unmapped
- random: Select one best-matching sequence randomly
- all: Use all best-matching sequences
- genesets=f
- Assign ambiguously-mapping reads to newly created gene sets for stats/rpkm output. May be slow.
- clearzone=0
- (cz) Threshold for ambiguity. If best match shares X kmers, reads are considered ambiguously mapped to sequences sharing at least [X minus clearzone] kmers.
- czf=0.0
- (clearzonefraction) If positive, actual clearzone = max(cz, czf*N) where N is total kmers.
- ecco=f
- For overlapping paired reads only. Performs error-correction with BBMerge prior to kmer operations.
Containment parameters
- processcontainedref=f
- Require a reference sequence to be fully contained by an input sequence.
- storerefbases=f
- Store reference bases so containments can be validated. If false with processcontainedref=true, only requires same number of shared bases as present in reference.
Taxonomy parameters (only use when doing taxonomy)
- tax=<file>
- Output destination for taxonomy information.
- taxtree=<file>
- (tree) A serialized TaxTree (tree.taxtree.gz) for taxonomic classification.
- gi=<file>
- A serialized GiTable (gitable.int1d.gz). Only needed if reference sequence names start with 'gi|'.
- mincount=1
- Only display taxa with at least this many hits.
- maxnodes=-1
- If positive, display at most this many top hits.
- minlevel=subspecies
- Do not display nodes below this taxonomic level. Valid levels: subspecies, species, genus, family, order, class, phylum, kingdom, domain, life.
- maxlevel=life
- Do not display nodes above this taxonomic level.
Speed and Memory parameters
- threads=auto
- (t) Number of threads to use; default is number of logical processors.
- prealloc=f
- Preallocate memory in kmer table. Allows faster table loading and more efficient memory usage for large references, but uses more memory upfront.
- monitor=f
- Kill process if CPU usage drops to zero for extended time. Format: monitor=600,0.01 kills after 600 seconds under 1% usage.
- rskip=1
- Skip reference kmers to reduce memory usage. 1=use all, 2=every other kmer, etc.
- qskip=1
- Skip query kmers to increase speed. 1=use all kmers.
- speed=0
- Ignore this fraction of kmer space (0-15 out of 16) in both reads and reference. Increases speed and reduces memory. Do not use with qskip or rskip.
Trimming/Masking parameters
- qtrim=f
- Trim read ends to remove bases with quality below trimq. Performed AFTER kmer matching. Values: t (both ends), f (neither), r (right only), l (left only).
- trimq=6
- Regions with average quality BELOW this threshold will be trimmed.
- minlength=1
- (ml) Reads shorter than this after trimming will be discarded. Pairs discarded only if both reads are shorter.
- maxlength=
- Reads longer than this after trimming will be discarded. Pairs discarded only if both reads are longer.
- minavgquality=0
- (maq) Reads with average quality (after trimming) below this will be discarded.
- maqb=0
- If positive, calculate average quality from this many initial bases only.
- maxns=-1
- If non-negative, reads with more Ns than this (after trimming) will be discarded.
- forcetrimleft=0
- (ftl) If positive, trim bases to the left of this position (exclusive, 0-based).
- forcetrimright=0
- (ftr) If positive, trim bases to the right of this position (exclusive, 0-based).
- forcetrimright2=0
- (ftr2) If positive, trim this many bases from the right end.
- forcetrimmod=0
- (ftm) If positive, right-trim length to be equal to zero modulo this number.
- restrictleft=0
- If positive, only look for kmer matches in the leftmost X bases of reads.
- restrictright=0
- If positive, only look for kmer matches in the rightmost X bases of reads.
Java Parameters
- -Xmx
- Set Java's memory usage, overriding autodetection. -Xmx20g specifies 20GB RAM, -Xmx200m specifies 200MB. Maximum is typically 85% of physical memory.
- -eoom
- Exit if out-of-memory exception occurs. Requires Java 8u92+.
- -da
- Disable assertions for slight performance improvement.
Algorithm Details
Multi-Reference Kmer Architecture
SEAL's core advantage over BBDuk lies in its multi-reference kmer mapping capability. While BBDuk associates each kmer with a single reference (first encountered), SEAL stores unlimited reference associations per kmer using IntList structures. This enables analysis of related sequences that share kmers, such as splice variants sharing exons or closely related species sharing genomic regions.
Kmer Table Implementation
- ARRAYHF (default): Hybrid array structure optimized for single-threaded access with efficient memory allocation
- ARRAYH: Thread-safe hybrid array for concurrent processing workloads
- ARRAY2D/1D: Linear array structures for predictable memory patterns
- FOREST2D: Tree-based organization for very large reference sets
Memory Usage Characteristics
SEAL requires approximately 20 bytes per unique kmer, similar to BBDuk. However, non-unique kmers (those present in multiple references) incur additional storage costs: approximately 32 bytes overhead plus 4 bytes per additional reference copy. This memory increase is the tradeoff for multi-reference capability.
Memory vs Sensitivity Tradeoffs
- Standard mode: ~20 bytes/base for comprehensive kmer coverage
- rskip parameter: Reduce memory by skipping reference kmers (rskip=2 uses every other kmer)
- speed parameter: Probabilistic kmer space sampling for memory reduction
- Lower-memory mode: Arbitrarily low memory usage possible with sensitivity reduction
RNA-seq Quantification Algorithm
SEAL's kmer-based approach naturally handles RNA-seq complexities that challenge traditional aligners:
Alternative Splicing Handling
By matching exonic kmers directly, SEAL avoids alignment gaps at splice junctions. A read spanning two exons will have kmers matching both exonic sequences, enabling quantification regardless of splice variant. This approach is particularly effective for isoform discrimination when combined with appropriate clearzone settings.
RPKM Calculation
Normalized expression calculation: RPKM = (reads × 1,000,000,000) ÷ (total_mapped_reads × reference_length). The algorithm accounts for both library size normalization and gene length normalization, essential for comparing expression across genes and samples.
Performance Optimization Strategies
Speed vs Accuracy Modes
- match=all: Counts all kmers for maximum accuracy (slowest)
- match=first: Stops after first kmer match (fastest, less accurate)
- match=unique: Stops after first unique kmer (intermediate performance)
Parallel Processing Architecture
SEAL implements fine-grained parallelization with separate thread pools for I/O operations, kmer table access, and statistics accumulation. Thread-safe AtomicLongArray structures enable concurrent kmer counting without locks.
SEAL vs BBSplit Comparison
Both tools perform read binning and quantification, but with different approaches:
Feature | SEAL | BBSplit |
---|---|---|
Speed | Generally faster | Slower, especially for long reads |
Memory Usage | ~20 bytes/base | ~6 bytes/base |
Max Read Length | Unlimited | 6000bp maximum (600bp default) |
Sensitivity/Specificity | Good, kmer-based | Higher, alignment-based |
Best Use Case | RNA-seq, long sequences, assemblies | Short read contamination screening |
Usage Guidelines
RNA-seq Best Practices
- Use k=31 for optimal specificity/sensitivity balance
- Enable reverse complement matching (rcomp=t, default)
- Generate RPKM output for expression quantification
- Consider clearzone parameter for splice junction handling
- Use ambig=random for unbiased isoform assignment
Multi-Genome Analysis
- Use fuse.sh to concatenate contigs before SEAL analysis
- Set refnames=t for per-genome rather than per-contig statistics
- Consider clearzone settings for closely related organisms
- Use ambig=all to see all genome associations
Memory Optimization
- Enable prealloc=t for large reference sets
- Use rskip parameter to reduce memory usage
- Consider speed parameter for memory-constrained systems
- Monitor memory usage with large multi-reference datasets
Quality Control
- Use built-in reference keywords (adapters, phix, etc.) for contamination screening
- Generate statistics outputs to assess reference coverage
- Use summarizeseal.sh for cross-contamination analysis
- Enable nzo=t to focus on detected references
Related Tools
- summarizeseal.sh: Summarizes multiple SEAL statistics files for cross-contamination analysis
- fuse.sh: Concatenates reference sequences to optimize SEAL memory usage
- BBDuk: Single-reference kmer matching for contamination removal and trimming
- BBSplit: Alignment-based read binning with higher sensitivity/specificity
- BBMerge: Error correction for overlapping pairs (used with ecco=t parameter)
Support
For questions and support:
- Email: bbushnell@lbl.gov
- Documentation: bbmap.org
- Detailed Guide: bbtools/docs/guides/SealGuide.txt