GradeSam

Script: gradesam.sh Package: align2 Class: GradeSamFile.java

Grades mapping correctness of a sam file of synthetic reads with headers generated by RandomReads3.java. This tool evaluates how well mappers perform by comparing their alignment results to the known true positions of synthetic reads, providing both strict and loose correctness metrics.

Basic Usage

gradesam.sh in=<sam file> reads=<number of reads>

The tool requires a SAM file produced by mapping synthetic reads (generated by RandomReads3) and the number of reads that were in the original input to calculate accurate percentages.

Parameters

Parameters control how the grading process evaluates mapping correctness and handles different mapper output formats.

Input/Output Parameters

in=<file>: Specify the input sam file, or stdin. Must be a SAM file containing alignments of synthetic reads generated by RandomReads3.
reads=<int>: Number of reads in mapper's input (i.e., the fastq file). Required for calculating accurate percentages in the statistics output.
outloose=<file>: Output file for reads that fail loose correctness criteria (false positives in loose evaluation).
outstrict=<file>: Output file for reads that fail strict correctness criteria (false positives in strict evaluation).
outl=<file>: Alias for outloose parameter.
outs=<file>: Alias for outstrict parameter.

Correctness Evaluation Parameters

thresh=20: Max deviation from correct location to be considered 'loosely correct'. Only applies to loose correctness evaluation where one end needs to be approximately correct within this threshold.
quality=3: Reads with a mapping quality of this or below will be considered ambiguously mapped. These reads are counted separately from true/false positives.
minq=3: Alias for quality parameter.
q=3: Alias for quality parameter.

Mapper-Specific Parameters

blasr=f: Set to 't' for BLASR output; fixes extra information added to read names. BLASR appends additional information to read names that needs to be stripped for correct evaluation.
ssaha2=f: Set to 't' for SSAHA2 or SMALT output; fixes incorrect soft-clipped read locations. These mappers report positions differently for soft-clipped alignments.

Processing Parameters

bitset=t: Track read ID's to detect secondary alignments. Necessary for mappers that incorrectly output multiple primary alignments per read. Uses a BitSet data structure to efficiently track which reads have been seen.
parsecustom=t: Parse custom headers from RandomReads3 to extract true mapping positions. Must be enabled for correctness evaluation to work properly.
printerr=f: Set to true to print statistics to stderr in addition to stdout.

Examples

Basic Evaluation

gradesam.sh in=mapped_synthetic.sam reads=1000000

Evaluate the mapping correctness of a SAM file containing 1 million synthetic read alignments.

BLASR Output Evaluation

gradesam.sh in=blasr_output.sam reads=500000 blasr=t

Evaluate BLASR output, enabling BLASR-specific header parsing to handle read name modifications.

Custom Quality Threshold

gradesam.sh in=mapped.sam reads=1000000 quality=10 thresh=50

Use a higher mapping quality threshold (10) and looser position tolerance (50bp) for evaluation.

Save False Positives

gradesam.sh in=mapped.sam reads=1000000 outstrict=strict_fp.sam outloose=loose_fp.sam

Save false positive alignments to separate files for detailed analysis of mapping errors.

Algorithm Details

GradeSam evaluates mapping accuracy using a dual correctness model that distinguishes between strict and loose correctness criteria:

Correctness Evaluation Strategy

Strict Correctness: Both alignment start and stop positions must match exactly with the true positions from the synthetic read headers
Loose Correctness: Either the start OR stop position must be within the threshold distance (default 20bp) of the true position
Strand and Reference: Both correctness modes require exact strand and reference sequence matching

Memory Management

The tool uses an intelligent BitSet allocation strategy for tracking seen read IDs:

Attempts to allocate a BitSet sized to the expected number of reads
Falls back to a default size of 400,000 if read count is unknown or too large
Gracefully handles memory allocation failures by disabling duplicate detection
BitSet enables O(1) lookup time for detecting secondary alignments

Read Processing Pipeline

Header Parsing: Extract true alignment position from RandomReads3 custom headers using CustomHeader class
Primary Filter: Process only primary alignments unless BitSet tracking is disabled
Quality Assessment: Categorize reads as mapped, unmapped, discarded, or ambiguous based on mapping quality
Correctness Evaluation: Compare predicted vs actual positions using both strict and loose criteria
Statistics Accumulation: Track true positives, false positives, and false negatives for both correctness modes

Performance Characteristics

Memory Usage: Configurable but typically uses ~200MB for SAM parsing plus BitSet allocation
Processing Speed: Single-pass linear scan through SAM file with O(1) duplicate detection
Scalability: Handles large SAM files efficiently with streaming I/O
Precision: Exact position matching for synthetic read evaluation

Output Statistics

The tool provides comprehensive mapping statistics including:

Primary and secondary alignment counts
Mapping, retention, and discard percentages
True positive and false positive rates for both strict and loose criteria
False negative rate based on unmapped reads
Ambiguous mapping detection based on quality thresholds

Support

For questions and support:

Email: bbushnell@lbl.gov
Documentation: bbmap.org