BBEst

Script: bbest.sh Package: jgi Class: SamToEst.java

Calculates EST (expressed sequence tag) capture by an assembly from a SAM file. Designed for BBMap output generated with these flags: k=13 maxindel=100000 customtag ordered

Basic Usage

bbest.sh in=<sam file> out=<stats file>

This tool processes SAM alignment files containing mapped ESTs (expressed sequence tags) and reports mapping statistics and capture metrics. Long ESTs are often split into smaller pieces before mapping; this tool rejoins the pieces and evaluates how completely the assembly captures each EST.

Parameters

Parameters control input/output files, reference sequences, and mapping quality thresholds for EST capture analysis.

Input/Output Parameters

in=<file>
Specify a SAM file (or stdin) containing mapped ESTs. The SAM file should be generated with BBMap using these flags: k=13 maxindel=100000 customtag ordered. The 'ordered' flag is particularly important because the tool expects ESTs to appear in input order so that it can reassemble multi-part sequences.
out=<file>
Specify the output stats file (default is stdout). The output contains detailed statistics about EST mapping including counts and percentages for different mapping categories, intron analysis, and capture efficiency metrics.
ref=<file>
Specify the reference file (optional). When provided, allows additional validation and analysis of the assembly used as mapping target.
est=<file>
Specify the EST fasta file (optional). When provided, enables additional analysis comparing original EST sequences with mapping results.

Analysis Parameters

fraction=0.98
Minimum fraction of bases mapped to reference to be considered 'all mapped'. ESTs with mapping coverage at or above this threshold are categorized as having all bases successfully captured by the assembly. Default value of 0.98 means 98% of EST bases must map to qualify as fully captured.

Examples

Basic EST Analysis

bbest.sh in=mapped_ests.sam out=est_stats.txt

Analyze EST mapping from a SAM file and output capture statistics to a text file.

Complete EST Analysis with Reference

bbest.sh in=mapped_ests.sam out=est_stats.txt ref=assembly.fasta est=original_ests.fasta

Analysis including reference assembly and original EST sequences for validation.

Custom Mapping Threshold

bbest.sh in=mapped_ests.sam out=est_stats.txt fraction=0.95

Use a lower threshold (95%) for considering ESTs as fully captured, which may be appropriate for more fragmented assemblies.

Processing from Standard Input

samtools view alignment.bam | bbest.sh in=stdin out=est_capture.txt

Process SAM data directly from a pipeline, useful for integrating with other tools.

Algorithm Details

EST Reassembly Strategy

The tool reassembles ESTs with pattern-based string parsing of read names and HashMap tracking of parts, which handles the common scenario where long EST sequences are broken into smaller pieces for mapping.
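
The idea of tracking parts in a hash map can be illustrated with a minimal Python sketch. It is not the SamToEst implementation: the function names, the record layout (read name, mapped bases, part length), and the suffix handling are assumptions made for illustration only.

```python
def est_key(read_name):
    # Hypothetical helper: drop a trailing "_part<N>" suffix if one exists,
    # otherwise return the name unchanged.
    return read_name.rpartition("_part")[0] or read_name


def regroup(records):
    """Sum mapped bases and lengths per EST.

    records: iterable of (read_name, mapped_bases, part_length) tuples
    (an assumed layout, not a format defined by the tool).
    """
    totals = {}  # plays the role of the HashMap: EST name -> (mapped, length)
    for read_name, mapped, length in records:
        est = est_key(read_name)
        prev_mapped, prev_length = totals.get(est, (0, 0))
        totals[est] = (prev_mapped + mapped, prev_length + length)
    return totals


records = [
    ("EST12345_part1", 500, 500),
    ("EST12345_part2", 480, 500),
    ("EST777", 0, 300),
]
print(regroup(records))
```

The dictionary keeps one running total per EST, so parts of the same EST contribute to a single capture measurement.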

Mapping Quality Categories

ESTs are classified into four mapping quality categories based on the fraction of bases successfully mapped.
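
A threshold-based classification of this kind might look like the sketch below. Only the 'all mapped' rule (fraction at or above the fraction parameter, default 0.98) comes from this page. The other three labels and the 0.5 cutoff are placeholders chosen for illustration and should not be read as the tool's actual definitions.

```python
def classify(mapped_bases, total_bases, fraction=0.98, most_cutoff=0.5):
    """Return a category label for one EST.

    'fraction' mirrors the documented parameter. 'most_cutoff' and the
    labels 'most', 'some', and 'none' are hypothetical.
    """
    if total_bases <= 0:
        raise ValueError("total_bases must be positive")
    ratio = mapped_bases / total_bases
    if ratio >= fraction:
        return "all"   # documented rule: at or above the threshold
    if ratio >= most_cutoff:
        return "most"  # assumed intermediate category
    if ratio > 0:
        return "some"  # assumed intermediate category
    return "none"      # assumed: no bases mapped


print(classify(980, 1000))
print(classify(960, 1000, fraction=0.95))
```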

Multi-Scaffold Analysis

The tool tracks ESTs that map to multiple scaffolds.
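
One way to picture multi-scaffold tracking is to collect the set of reference names seen for each EST. The sketch below assumes the alignments have already been reduced to (EST name, scaffold name) pairs; that layout and the function name are illustrative assumptions. In SAM, an RNAME of `*` marks an unmapped record, so those pairs are skipped.

```python
from collections import defaultdict


def multi_scaffold_ests(alignments):
    """Return {est: sorted scaffold names} for ESTs seen on more than one scaffold.

    alignments: iterable of (est_name, scaffold_name) pairs (assumed layout).
    """
    scaffolds = defaultdict(set)
    for est, scaffold in alignments:
        if scaffold != "*":  # '*' means unmapped in SAM
            scaffolds[est].add(scaffold)
    return {est: sorted(names) for est, names in scaffolds.items() if len(names) > 1}


pairs = [
    ("EST1", "scaffold_3"),
    ("EST1", "scaffold_9"),
    ("EST2", "scaffold_3"),
    ("EST2", "scaffold_3"),
    ("EST3", "*"),
]
print(multi_scaffold_ests(pairs))
```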

Intron Detection and Analysis

Splice junction analysis is performed by parsing CIGAR strings for deletion (D) and skipped region (N) operations.
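
As a rough illustration of CIGAR-based gap detection, the following Python sketch extracts the lengths of D and N operations from a CIGAR string. The function name and the `min_len` filter are hypothetical. This page does not state whether the tool applies a minimum length before counting a deletion as an intron.

```python
import re

# One CIGAR operation: a length followed by an operation code from the SAM spec.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def gap_lengths(cigar, min_len=1):
    """Return lengths of D and N operations that are at least min_len long.

    min_len is a hypothetical filter, not a documented tool parameter.
    """
    if cigar == "*":  # SAM uses '*' when no CIGAR is available
        return []
    return [
        int(length)
        for length, op in CIGAR_OP.findall(cigar)
        if op in "DN" and int(length) >= min_len
    ]


print(gap_lengths("50M1200N30M2D20M"))
print(gap_lengths("50M1200N30M2D20M", min_len=10))
```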

Output Statistics Format

The tool generates output statistics including:

Memory Efficiency

The algorithm uses specific data structures to manage memory usage:

Performance Characteristics

The tool has the following processing characteristics:

Output Format

The output statistics file contains the following sections:

File Information

Count Statistics

Mapping Categories

Each category shows: type, n_est, pct_est, n_bases, pct_bases
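
The relationship between the five columns can be sketched in Python as follows. The tab delimiter, the two-decimal rounding, and the category labels in the sample data are assumptions. Only the column order (type, n_est, pct_est, n_bases, pct_bases) is taken from this page.

```python
def category_rows(counts):
    """Build one text row per category.

    counts: dict mapping category label -> (n_est, n_bases) (assumed layout).
    Percentages are relative to the totals across all categories.
    """
    total_est = sum(n_est for n_est, _ in counts.values())
    total_bases = sum(n_bases for _, n_bases in counts.values())
    rows = []
    for kind, (n_est, n_bases) in counts.items():
        pct_est = 100.0 * n_est / total_est if total_est else 0.0
        pct_bases = 100.0 * n_bases / total_bases if total_bases else 0.0
        rows.append(f"{kind}\t{n_est}\t{pct_est:.2f}\t{n_bases}\t{pct_bases:.2f}")
    return rows


sample = {"all": (3, 3000), "none": (1, 1000)}  # made-up numbers for illustration
for row in category_rows(sample):
    print(row)
```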

Intron Analysis

Final line contains: count, min, max, median, average intron sizes
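
The five summary values can be computed from a list of intron sizes as in this short Python sketch. The function name and the all-zero result for an empty list are assumptions. How the tool itself reports the case of no introns is not stated here.

```python
from statistics import mean, median


def intron_summary(sizes):
    """Return (count, min, max, median, average) for a list of intron sizes."""
    if not sizes:
        # Assumed convention for "no introns found".
        return (0, 0, 0, 0, 0.0)
    return (len(sizes), min(sizes), max(sizes), median(sizes), mean(sizes))


print(intron_summary([100, 300, 200]))
```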

Preprocessing Requirements

For optimal results, SAM files should be generated with specific BBMap parameters:

Required BBMap Flags

k=13 maxindel=100000 customtag ordered

The 'ordered' flag keeps ESTs in input order, which the tool relies on to reassemble multi-part sequences.

EST Naming Convention

For proper part reassembly, EST parts should follow the naming pattern:

[EST_NAME]_part[NUMBER]

Example: EST12345_part1, EST12345_part2, EST12345_part3
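
To check whether read names follow this pattern before mapping, a small Python sketch such as the one below could be used. The regular expression is an interpretation of the `[EST_NAME]_part[NUMBER]` convention above, and the function name is hypothetical. The tool's own parsing rules may differ in edge cases.

```python
import re

# Interpretation of "[EST_NAME]_part[NUMBER]": any name, then "_part", then digits.
PART_NAME = re.compile(r"^(?P<est>.+)_part(?P<num>\d+)$")


def split_part_name(name):
    """Return (est_name, part_number), or (name, None) if there is no part suffix."""
    match = PART_NAME.match(name)
    if match is None:
        return (name, None)
    return (match.group("est"), int(match.group("num")))


for name in ["EST12345_part1", "EST12345_part2", "EST777"]:
    print(split_part_name(name))
```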
