BBEst
Calculates EST (expressed sequence tags) capture by an assembly from a sam file. Designed to use BBMap output generated with these flags: k=13 maxindel=100000 customtag ordered
Basic Usage
bbest.sh in=<sam file> out=<stats file>
This tool processes SAM alignment files containing mapped ESTs (Expressed Sequence Tags) to generate mapping statistics and capture metrics. ESTs are often broken into smaller pieces during sequencing and mapping; this tool reassembles them and evaluates assembly capture efficiency.
Parameters
Parameters control input/output files, reference sequences, and mapping quality thresholds for EST capture analysis.
Input/Output Parameters
- in=<file>
- Specify a sam file (or stdin) containing mapped ESTs. The SAM file should be generated with BBMap using specific flags: k=13 maxindel=100000 customtag ordered. The 'ordered' flag is particularly important as the tool expects ESTs to appear in input order to properly reassemble multi-part sequences.
- out=<file>
- Specify the output stats file (default is stdout). The output contains detailed statistics about EST mapping including counts and percentages for different mapping categories, intron analysis, and capture efficiency metrics.
- ref=<file>
- Specify the reference file (optional). When provided, allows additional validation and analysis of the assembly used as mapping target.
- est=<file>
- Specify the EST fasta file (optional). When provided, enables additional analysis comparing original EST sequences with mapping results.
Analysis Parameters
- fraction=0.98
- Minimum fraction of bases mapped to reference to be considered 'all mapped'. ESTs with mapping coverage at or above this threshold are categorized as having all bases successfully captured by the assembly. Default value of 0.98 means 98% of EST bases must map to qualify as fully captured.
Examples
Basic EST Analysis
bbest.sh in=mapped_ests.sam out=est_stats.txt
Analyze EST mapping from a SAM file and output capture statistics to a text file.
Complete EST Analysis with Reference
bbest.sh in=mapped_ests.sam out=est_stats.txt ref=assembly.fasta est=original_ests.fasta
Analysis including reference assembly and original EST sequences for validation.
Custom Mapping Threshold
bbest.sh in=mapped_ests.sam out=est_stats.txt fraction=0.95
Use a lower threshold (95%) for considering ESTs as fully captured, which may be appropriate for more fragmented assemblies.
Processing from Standard Input
samtools view alignment.bam | bbest.sh in=stdin out=est_capture.txt
Process SAM data directly from a pipeline, useful for integrating with other tools.
Algorithm Details
EST Reassembly Strategy
The tool implements pattern-based EST reassembly using string parsing and HashMap tracking that handles the common scenario where long EST sequences are broken into smaller pieces for mapping. The algorithm:
- Part Detection: Identifies EST parts using the naming convention "_part[number]" suffix, extracting the base EST name
- Sequential Assembly: Requires input SAM file to be ordered (using BBMap 'ordered' flag) to properly group parts belonging to the same EST
- Length Calculation: Accumulates total length from all parts to reconstruct original EST size
- Mapping Assessment: Evaluates mapping quality across all parts to determine overall EST capture success
Mapping Quality Categories
ESTs are classified into four mapping quality categories based on the fraction of bases successfully mapped:
- All Bases Mapped: ESTs with ≥fraction (default 98%) of bases mapped to the reference
- Most Bases Mapped: ESTs with ≥50% but <fraction of bases mapped
- Some Bases Mapped: ESTs with >0% but <50% of bases mapped
- Zero Bases Mapped: ESTs with no successful mappings
Multi-Scaffold Analysis
The tool tracks ESTs that map to multiple scaffolds, which can indicate:
- Assembly fragmentation where single genes are split across scaffolds
- Chimeric EST sequences
- Repetitive sequences present in multiple locations
Intron Detection and Analysis
Splice junction analysis is performed by parsing CIGAR strings for deletion (D) and skipped region (N) operations:
- Intron Size Distribution: Calculates minimum, maximum, median, and average intron sizes
- Splice Site Validation: Large deletions/skips likely represent genuine introns in eukaryotic sequences
- Threshold Filtering: Applies minimum intron size filter (default 10bp) to reduce noise from small indels
Output Statistics Format
The tool generates output statistics including:
- Count Metrics: Number of reference scaffolds, ESTs processed, and base counts
- Percentage Metrics: Capture efficiency percentages for each mapping category
- Intron Statistics: Complete intron size distribution analysis
- Multi-scaffold Metrics: Count and percentage of ESTs mapping to multiple scaffolds
Memory Efficiency
The algorithm uses specific data structures to manage memory usage:
- Streaming Processing: Processes SAM file line-by-line without loading entire file into memory
- Hash-based Tracking: Uses HashMap<String, EST> for EST lookup and grouping by name
- Minimal Memory Footprint: Default memory allocation of only 120MB for most analyses
Performance Characteristics
The tool has the following processing characteristics:
- Linear Time Complexity: O(n) processing time where n is the number of SAM records
- Scalable Design: Processes records sequentially with constant memory overhead per EST
- I/O Implementation: Uses TextFile and TextStreamWriter classes for SAM input and statistics output
Output Format
The output statistics file contains the following sections:
File Information
- ref_file: Path to reference file (if provided)
- est_file: Path to EST file (if provided)
- sam_file: Path to input SAM file
Count Statistics
- n_ref_scaffolds: Number of reference scaffolds
- n_ref_bases: Total reference bases
- n_est: Total number of ESTs processed
- n_est_bases: Total EST bases
Mapping Categories
Each category shows: type, n_est, pct_est, n_bases, pct_bases
- all: ESTs with ≥fraction bases mapped
- most: ESTs with ≥50% but <fraction bases mapped
- some: ESTs with >0% but <50% bases mapped
- zero: ESTs with no mapped bases
- multi: ESTs mapping to multiple scaffolds
Intron Analysis
Final line contains: count, min, max, median, average intron sizes
Preprocessing Requirements
For optimal results, SAM files should be generated with specific BBMap parameters:
Required BBMap Flags
- k=13: Short kmer size appropriate for EST mapping
- maxindel=100000: Allow large indels to capture introns and structural variations
- customtag: Add custom tags for detailed alignment information
- ordered: Maintain input order crucial for EST part reassembly
EST Naming Convention
For proper part reassembly, EST parts should follow the naming pattern:
[EST_NAME]_part[NUMBER]
Example: EST12345_part1, EST12345_part2, EST12345_part3
Support
For questions and support:
- Email: bbushnell@lbl.gov
- Documentation: bbmap.org