GradeSam
Grades mapping correctness of a sam file of synthetic reads with headers generated by RandomReads3.java. This tool evaluates how well mappers perform by comparing their alignment results to the known true positions of synthetic reads, providing both strict and loose correctness metrics.
Basic Usage
gradesam.sh in=<sam file> reads=<number of reads>
The tool requires a SAM file produced by mapping synthetic reads (generated by RandomReads3) and the number of reads that were in the original input to calculate accurate percentages.
Parameters
Parameters control how the grading process evaluates mapping correctness and handles different mapper output formats.
Input/Output Parameters
- in=<file>
- Specify the input sam file, or stdin. Must be a SAM file containing alignments of synthetic reads generated by RandomReads3.
- reads=<int>
- Number of reads in mapper's input (i.e., the fastq file). Required for calculating accurate percentages in the statistics output.
- outloose=<file>
- Output file for reads that fail loose correctness criteria (false positives in loose evaluation).
- outstrict=<file>
- Output file for reads that fail strict correctness criteria (false positives in strict evaluation).
- outl=<file>
- Alias for outloose parameter.
- outs=<file>
- Alias for outstrict parameter.
Correctness Evaluation Parameters
- thresh=20
- Max deviation from correct location to be considered 'loosely correct'. Only applies to loose correctness evaluation where one end needs to be approximately correct within this threshold.
- quality=3
- Reads with a mapping quality of this or below will be considered ambiguously mapped. These reads are counted separately from true/false positives.
- minq=3
- Alias for quality parameter.
- q=3
- Alias for quality parameter.
Mapper-Specific Parameters
- blasr=f
- Set to 't' for BLASR output; fixes extra information added to read names. BLASR appends additional information to read names that needs to be stripped for correct evaluation.
- ssaha2=f
- Set to 't' for SSAHA2 or SMALT output; fixes incorrect soft-clipped read locations. These mappers report positions differently for soft-clipped alignments.
Processing Parameters
- bitset=t
- Track read ID's to detect secondary alignments. Necessary for mappers that incorrectly output multiple primary alignments per read. Uses a BitSet data structure to efficiently track which reads have been seen.
- parsecustom=t
- Parse custom headers from RandomReads3 to extract true mapping positions. Must be enabled for correctness evaluation to work properly.
- printerr=f
- Set to true to print statistics to stderr in addition to stdout.
Examples
Basic Evaluation
gradesam.sh in=mapped_synthetic.sam reads=1000000
Evaluate the mapping correctness of a SAM file containing 1 million synthetic read alignments.
BLASR Output Evaluation
gradesam.sh in=blasr_output.sam reads=500000 blasr=t
Evaluate BLASR output, enabling BLASR-specific header parsing to handle read name modifications.
Custom Quality Threshold
gradesam.sh in=mapped.sam reads=1000000 quality=10 thresh=50
Use a higher mapping quality threshold (10) and looser position tolerance (50bp) for evaluation.
Save False Positives
gradesam.sh in=mapped.sam reads=1000000 outstrict=strict_fp.sam outloose=loose_fp.sam
Save false positive alignments to separate files for detailed analysis of mapping errors.
Algorithm Details
GradeSam evaluates mapping accuracy using a dual correctness model that distinguishes between strict and loose correctness criteria:
Correctness Evaluation Strategy
- Strict Correctness: Both alignment start and stop positions must match exactly with the true positions from the synthetic read headers
- Loose Correctness: Either the start OR stop position must be within the threshold distance (default 20bp) of the true position
- Strand and Reference: Both correctness modes require exact strand and reference sequence matching
Memory Management
The tool uses an intelligent BitSet allocation strategy for tracking seen read IDs:
- Attempts to allocate a BitSet sized to the expected number of reads
- Falls back to a default size of 400,000 if read count is unknown or too large
- Gracefully handles memory allocation failures by disabling duplicate detection
- BitSet enables O(1) lookup time for detecting secondary alignments
Read Processing Pipeline
- Header Parsing: Extract true alignment position from RandomReads3 custom headers using CustomHeader class
- Primary Filter: Process only primary alignments unless BitSet tracking is disabled
- Quality Assessment: Categorize reads as mapped, unmapped, discarded, or ambiguous based on mapping quality
- Correctness Evaluation: Compare predicted vs actual positions using both strict and loose criteria
- Statistics Accumulation: Track true positives, false positives, and false negatives for both correctness modes
Performance Characteristics
- Memory Usage: Configurable but typically uses ~200MB for SAM parsing plus BitSet allocation
- Processing Speed: Single-pass linear scan through SAM file with O(1) duplicate detection
- Scalability: Handles large SAM files efficiently with streaming I/O
- Precision: Exact position matching for synthetic read evaluation
Output Statistics
The tool provides comprehensive mapping statistics including:
- Primary and secondary alignment counts
- Mapping, retention, and discard percentages
- True positive and false positive rates for both strict and loose criteria
- False negative rate based on unmapped reads
- Ambiguous mapping detection based on quality thresholds
Support
For questions and support:
- Email: bbushnell@lbl.gov
- Documentation: bbmap.org