SamToRoc

Basic Usage

samtoroc.sh in=<sam file> reads=<number of reads in input fastq>

SAMTOROC analyzes SAM files containing synthetic reads to generate ROC (Receiver Operating Characteristic) curves for evaluating mapping accuracy. The tool compares the actual mapping locations with the true locations embedded in the read headers by RandomReads3.java.

Parameters

Parameters control input/output specifications, accuracy thresholds, and format compatibility for different mapping tools.

Input/Output Parameters

in=<file>: Specify the input SAM file containing mapped synthetic reads, or stdin. The SAM file should contain reads with headers generated by RandomReads3.java that encode the true mapping locations.
reads=<number>: Total number of reads in the original input FASTQ file. This parameter is used to calculate false negative rates and percentage statistics in the ROC output.

Accuracy Parameters

thresh=20: Maximum deviation (in base pairs) from the correct location to be considered 'loosely correct'. Alignments within this threshold are classified as true positives in the loose accuracy category.

Mapper Compatibility Parameters

blasr=f: Set to 't' for BLASR output compatibility. Fixes extra information added to read names by BLASR that interferes with parsing the true location from RandomReads3 headers.
ssaha2=f: Set to 't' for SSAHA2 or SMALT output compatibility. Fixes incorrect soft-clipped read locations reported by these mappers to ensure accurate position comparison.
bitset=t: Track read IDs using a BitSet to detect secondary alignments. Necessary for mappers that incorrectly output multiple primary alignments per read, ensuring only one alignment per read is counted in statistics.

Advanced Parameters

parsecustom=t: Parse custom headers generated by RandomReads3.java to extract true mapping locations. This is essential for ROC curve generation and should remain enabled.
allowspaceslash=t: Allow spaces and slashes in contig names when comparing reference sequences. Helps with compatibility across different reference formats.

Java Parameters

-Xmx: Set Java's memory usage, overriding autodetection. -Xmx20g specifies 20 gigabytes of RAM, -Xmx200m specifies 200 megabytes. The maximum is typically 85% of physical memory. Default is 200m for this lightweight tool.
-eoom: Exit if an out-of-memory exception occurs. Requires Java 8u92+. Useful for automated pipelines to detect memory issues.
-da: Disable Java assertions. May provide a small performance improvement in production environments.

Examples

Basic ROC Analysis

samtoroc.sh in=mapped_reads.sam reads=100000

Generate a ROC curve from a SAM file containing 100,000 mapped synthetic reads. This will output statistics for different mapping quality thresholds.

BLASR Output Analysis

samtoroc.sh in=blasr_output.sam reads=50000 blasr=t thresh=30

Analyze BLASR output with BLASR-specific header parsing enabled and a looser accuracy threshold of 30bp.

SSAHA2/SMALT Analysis

samtoroc.sh in=ssaha2_output.sam reads=75000 ssaha2=t

Process SSAHA2 or SMALT output with position correction for soft-clipped reads.

Pipeline with Standard Input

samtools view alignment.bam | samtoroc.sh in=stdin reads=200000

Process SAM data from standard input, useful for pipeline integration without intermediate files.

Algorithm Details

ROC Curve Generation

SAMTOROC implements ROC analysis using 1000-element arrays for quality score binning and CustomHeader parsing to extract true mapping coordinates from RandomReads3.java headers:

Dual Accuracy Classification

Strict Classification: Requires exact match of chromosome, strand, start, and stop positions between predicted and true locations
Loose Classification: Allows deviation up to the threshold parameter (default 20bp) from true start or stop positions while maintaining chromosome and strand accuracy

Quality Score Stratification

The algorithm maintains 1000 bins (0-999) for mapping quality scores, tracking statistics at each quality threshold:

True positives (strict and loose)
False positives (strict and loose)
Mapped reads retained
Unmapped reads
Discarded and ambiguous alignments

BitSet Duplicate Detection

Uses a BitSet data structure (default 400,000 bits, ~50KB memory) to track processed read IDs and detect multiple primary alignments from the same read. The BitSet size scales with read count when specified, preventing double-counting in ROC statistics.

CustomHeader Parsing

Parses RandomReads3.java header format to extract true mapping coordinates:

Reference sequence name
Genomic coordinates (start/stop)
Strand orientation
Pair number for paired-end reads

Mapper-Specific Compatibility

Includes specialized handling for different mapping tools:

BLASR: Strips additional path information from read names that interferes with coordinate parsing
SSAHA2/SMALT: Corrects position reporting for soft-clipped alignments
General mappers: Handles various SAM format variations and edge cases

Output Format

Generates tab-delimited ROC data with columns: minScore, mapped, retained, truePositiveStrict, falsePositiveStrict, truePositiveLoose, falsePositiveLoose, falseNegative, discarded, ambiguous. Values are reported as percentages of total reads for easy comparison across datasets.

Output Interpretation

ROC Columns

minScore: Minimum mapping quality threshold for this row
mapped: Percentage of reads that received any mapping
retained: Percentage of reads retained after quality filtering
truePositiveStrict: Percentage of reads mapped to exact correct location
falsePositiveStrict: Percentage of reads mapped to incorrect location (strict)
truePositiveLoose: Percentage of reads mapped within threshold of correct location
falsePositiveLoose: Percentage of reads mapped beyond threshold (loose)
falseNegative: Percentage of reads that should have been mapped but weren't
discarded: Percentage of reads discarded by the mapper
ambiguous: Percentage of reads with ambiguous mapping

Performance Metrics

Sensitivity: truePositive / (truePositive + falseNegative)
Precision: truePositive / (truePositive + falsePositive)
Specificity: trueNegative / (trueNegative + falsePositive)

Performance Considerations

Memory Usage

Default: 200MB Java heap, suitable for most SAM files
BitSet: ~50KB per 400,000 reads for duplicate detection
Arrays: Fixed 1000-element arrays for quality score bins (~40KB)
Scaling: Memory usage scales with number of unique read IDs when bitset is enabled

Processing Speed

Single-pass SAM file reading with streaming processing
O(1) lookup time for quality score binning
O(1) duplicate detection with BitSet
Linear time complexity O(n) where n is number of alignments

Input Requirements

SAM format input (not BAM - use samtools view for conversion)
Synthetic reads with RandomReads3.java headers containing true coordinates
Known total read count for accurate percentage calculations

Support

For questions and support:

Email: bbushnell@lbl.gov
Documentation: bbmap.org