GradeSam

Script: gradesam.sh
Package: align2
Class: GradeSamFile.java

Grades the mapping correctness of a SAM file of synthetic reads whose headers were generated by RandomReads3.java. The tool evaluates mapper performance by comparing each alignment to the read's known true position, and reports both strict and loose correctness metrics.

Basic Usage

gradesam.sh in=<sam file> reads=<number of reads>

The tool requires a SAM file produced by mapping synthetic reads (generated by RandomReads3) and the number of reads that were in the original input to calculate accurate percentages.
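The role of reads= can be illustrated with a little arithmetic. The sketch below is not GradeSam's code; it assumes a mapper whose SAM output omits some input reads, and the counts and the helper name percent_of_input are hypothetical.

```python
def percent_of_input(count: int, reads: int) -> float:
    """Return count as a percentage of a given denominator."""
    if reads <= 0:
        raise ValueError("reads must be a positive integer")
    return 100.0 * count / reads


# Hypothetical counts: 1,000,000 reads in the fastq file, but only 950,000
# records in the SAM file, of which 940,000 were graded correct.
reads_in_fastq = 1_000_000
records_in_sam = 950_000
correct = 940_000

rate_vs_input = percent_of_input(correct, reads_in_fastq)  # 94.0
rate_vs_sam = percent_of_input(correct, records_in_sam)    # about 98.95
```

Dividing by the number of SAM records would overstate the rate here, which is why the original read count is supplied explicitly.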

Parameters

Parameters control how the grading process evaluates mapping correctness and handles different mapper output formats.

Input/Output Parameters

in=<file>
Input SAM file, or stdin. The file must contain alignments of synthetic reads generated by RandomReads3.
reads=<int>
Number of reads in the mapper's input (i.e., the fastq file). Required for calculating accurate percentages in the statistics output.
outloose=<file>
Output file for reads that fail loose correctness criteria (false positives in loose evaluation).
outstrict=<file>
Output file for reads that fail strict correctness criteria (false positives in strict evaluation).
outl=<file>
Alias for outloose parameter.
outs=<file>
Alias for outstrict parameter.

Correctness Evaluation Parameters

thresh=20
Maximum deviation (in bp) from the correct location for an alignment to be considered 'loosely correct'. Applies only to loose correctness evaluation, in which one end must be approximately correct, i.e. within this threshold.
quality=3
Reads with a mapping quality of this or below will be considered ambiguously mapped. These reads are counted separately from true/false positives.
minq=3
Alias for quality parameter.
q=3
Alias for quality parameter.
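
As a rough illustration of how thresh and quality interact, here is a minimal sketch of the two tests described above. It is not taken from GradeSamFile.java. The Placement type and both function names are hypothetical, and the sketch assumes that chromosome and strand must also match and that a deviation exactly equal to thresh still counts as loosely correct.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Placement:
    """A read placement: either the known truth or the mapper's answer."""
    chrom: str
    strand: str  # "+" or "-"
    start: int
    stop: int


def is_loosely_correct(truth: Placement, mapped: Placement, thresh: int = 20) -> bool:
    """One end within thresh of the truth (assumed: same chrom and strand)."""
    if truth.chrom != mapped.chrom or truth.strand != mapped.strand:
        return False
    return (abs(truth.start - mapped.start) <= thresh
            or abs(truth.stop - mapped.stop) <= thresh)


def is_ambiguous(mapq: int, quality: int = 3) -> bool:
    """Mapping quality of `quality` or below counts as ambiguously mapped."""
    return mapq <= quality


truth = Placement("chr1", "+", 1000, 1099)
near = Placement("chr1", "+", 1015, 1120)  # start is off by 15, stop by 21
far = Placement("chr1", "+", 1500, 1599)   # both ends are off by 500
```

With the defaults, is_loosely_correct(truth, near) is true because the start lies within 20 bp, while is_loosely_correct(truth, far) is false; an alignment with MAPQ 3 is ambiguous and one with MAPQ 4 is not.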

Mapper-Specific Parameters

blasr=f
Set to 't' for BLASR output. BLASR appends extra information to read names, which must be stripped for correct evaluation.
ssaha2=f
Set to 't' for SSAHA2 or SMALT output; fixes incorrect soft-clipped read locations. These mappers report positions differently for soft-clipped alignments.
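
How GradeSam corrects SSAHA2 and SMALT positions is not described here. As background only, the sketch below shows the standard SAM relationship between the reported POS (the first aligned base) and the unclipped start, which is the quantity a soft-clip correction would have to work with. The function names are hypothetical.

```python
import re

# Leading/trailing soft clips, allowing an optional hard clip outside them.
_LEADING_CLIP = re.compile(r"^(?:\d+H)?(\d+)S")
_TRAILING_CLIP = re.compile(r"(\d+)S(?:\d+H)?$")


def soft_clips(cigar: str) -> tuple:
    """Return (leading, trailing) soft-clip lengths of a CIGAR string."""
    if cigar == "*":
        return (0, 0)
    lead = _LEADING_CLIP.search(cigar)
    trail = _TRAILING_CLIP.search(cigar)
    return (int(lead.group(1)) if lead else 0,
            int(trail.group(1)) if trail else 0)


def unclipped_start(pos: int, cigar: str) -> int:
    """Position the read would start at if its leading soft clip were aligned."""
    return pos - soft_clips(cigar)[0]
```

For example, a record with POS 1000 and CIGAR 5S95M has an unclipped start of 995. A mapper that reports one of these values where the other is expected will appear to be off by the clip length.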

Processing Parameters

bitset=t
Track read IDs to detect secondary alignments. Necessary for mappers that incorrectly output multiple primary alignments per read. Uses a BitSet data structure to efficiently track which reads have been seen.
parsecustom=t
Parse custom headers from RandomReads3 to extract true mapping positions. Must be enabled for correctness evaluation to work properly.
printerr=f
Set to true to print statistics to stderr in addition to stdout.
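
The idea behind bitset=t can be sketched as follows. This is a Python stand-in for the technique, not the tool's Java implementation. It assumes read IDs are non-negative integers, and the class name SeenReads and its growth policy are hypothetical.

```python
class SeenReads:
    """One bit per numeric read ID, grown on demand."""

    def __init__(self, capacity: int = 0):
        # Eight read IDs fit in each byte.
        self._bits = bytearray((capacity + 7) // 8)

    def mark(self, read_id: int) -> bool:
        """Record read_id as seen; return True if it had been seen before."""
        if read_id < 0:
            raise ValueError("read_id must be non-negative")
        byte, mask = read_id >> 3, 1 << (read_id & 7)
        if byte >= len(self._bits):
            # Extend with zero bytes so the new ID fits.
            self._bits.extend(bytes(byte + 1 - len(self._bits)))
        already = bool(self._bits[byte] & mask)
        self._bits[byte] |= mask
        return already


seen = SeenReads(capacity=16)
first = seen.mark(5)    # False: first alignment for read 5
second = seen.mark(5)   # True: a second "primary" for the same read
```

A grader can skip any record for which mark returns True, so that a read reported as primary more than once is only counted once.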

Examples

Basic Evaluation

gradesam.sh in=mapped_synthetic.sam reads=1000000

Evaluate the mapping correctness of a SAM file containing 1 million synthetic read alignments.

BLASR Output Evaluation

gradesam.sh in=blasr_output.sam reads=500000 blasr=t

Evaluate BLASR output, enabling BLASR-specific header parsing to handle read name modifications.

Custom Quality Threshold

gradesam.sh in=mapped.sam reads=1000000 quality=10 thresh=50

Use a higher mapping quality threshold (10) and looser position tolerance (50bp) for evaluation.

Save False Positives

gradesam.sh in=mapped.sam reads=1000000 outstrict=strict_fp.sam outloose=loose_fp.sam

Save false positive alignments to separate files for detailed analysis of mapping errors.

Algorithm Details

GradeSam evaluates mapping accuracy using a dual correctness model that distinguishes between strict and loose correctness criteria.

Correctness Evaluation Strategy

Memory Management

The tool uses an intelligent BitSet allocation strategy for tracking seen read IDs:

Read Processing Pipeline

  1. Header Parsing: Extract true alignment position from RandomReads3 custom headers using CustomHeader class
  2. Primary Filter: Process only primary alignments unless BitSet tracking is disabled
  3. Quality Assessment: Categorize reads as mapped, unmapped, discarded, or ambiguous based on mapping quality
  4. Correctness Evaluation: Compare predicted vs actual positions using both strict and loose criteria
  5. Statistics Accumulation: Track true positives, false positives, and false negatives for both correctness modes
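
The steps above can be sketched end to end as follows. This is an illustrative outline rather than the tool's code, and it rests on several assumptions. The header layout id_strand_chrom_start_stop is hypothetical and is not the actual RandomReads3 format. Truth coordinates are assumed to be 1-based like SAM POS. Strict correctness is assumed to mean that both ends match exactly. A plain set stands in for the BitSet. The 'discarded' category is not modeled.

```python
import re
from collections import Counter

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
REF_CONSUMING = set("MDN=X")  # CIGAR operations that advance along the reference


def parse_truth(qname):
    """Step 1, using a HYPOTHETICAL header layout: id_strand_chrom_start_stop."""
    read_id, strand, chrom, start, stop = qname.split("_")[:5]
    return int(read_id), strand, chrom, int(start), int(stop)


def reference_span(cigar):
    """Number of reference bases covered by an alignment."""
    return sum(int(n) for n, op in CIGAR_OP.findall(cigar) if op in REF_CONSUMING)


def grade(sam_lines, reads, thresh=20, quality=3):
    stats, seen = Counter(), set()
    for line in sam_lines:
        if line.startswith("@"):
            continue  # SAM header line
        f = line.rstrip("\n").split("\t")
        qname, flag, rname = f[0], int(f[1]), f[2]
        pos, mapq, cigar = int(f[3]), int(f[4]), f[5]
        read_id, strand, chrom, start, stop = parse_truth(qname)
        # Step 2: skip secondary (0x100) and supplementary (0x800) records,
        # and any read that was already graded.
        if flag & 0x900 or read_id in seen:
            continue
        seen.add(read_id)
        # Step 3: unmapped (0x4) and low-MAPQ reads are counted separately.
        if flag & 0x4:
            stats["unmapped"] += 1
            continue
        if mapq <= quality:
            stats["ambiguous"] += 1
            continue
        stats["mapped"] += 1
        # Step 4: compare the reported placement with the truth.
        mapped_strand = "-" if flag & 0x10 else "+"
        mapped_stop = pos + reference_span(cigar) - 1
        same = rname == chrom and mapped_strand == strand
        strict = same and pos == start and mapped_stop == stop
        loose = same and (abs(pos - start) <= thresh
                          or abs(mapped_stop - stop) <= thresh)
        # Step 5: accumulate counts for both correctness modes.
        stats["strict_tp" if strict else "strict_fp"] += 1
        stats["loose_tp" if loose else "loose_fp"] += 1
    stats["missing"] = reads - len(seen)  # input reads absent from the SAM file
    return stats


sam = [
    "@HD\tVN:1.6",
    "0_+_chr1_100_199\t0\tchr1\t100\t40\t100M\t*\t0\t0\t*\t*",
    "1_-_chr1_500_599\t16\tchr1\t510\t40\t100M\t*\t0\t0\t*\t*",
    "2_+_chr2_50_149\t0\tchr1\t900\t40\t100M\t*\t0\t0\t*\t*",
    "3_+_chr1_10_109\t0\tchr1\t10\t1\t100M\t*\t0\t0\t*\t*",
    "4_+_chr1_10_109\t4\t*\t0\t0\t*\t*\t0\t0\t*\t*",
]
result = grade(sam, reads=6)
```

In this toy input, read 0 is correct under both modes, read 1 is 10 bp off and so only loosely correct, read 2 is on the wrong chromosome, read 3 is ambiguous, read 4 is unmapped, and one of the six input reads never appears in the file.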

Performance Characteristics

Output Statistics

The tool provides comprehensive mapping statistics including:

Support

For questions and support: