SamToRoc

Script: samtoroc.sh Package: align2 Class: MakeRocCurve.java

Creates a ROC curve from a sam file of synthetic reads with headers generated by RandomReads3.java

Basic Usage

samtoroc.sh in=<sam file> reads=<number of reads in input fastq>

SAMTOROC analyzes SAM files containing synthetic reads to generate ROC (Receiver Operating Characteristic) curves for evaluating mapping accuracy. The tool compares the actual mapping locations with the true locations embedded in the read headers by RandomReads3.java.

Parameters

Parameters control input/output specifications, accuracy thresholds, and format compatibility for different mapping tools.

Input/Output Parameters

in=<file>
Specify the input SAM file containing mapped synthetic reads, or stdin. The SAM file should contain reads with headers generated by RandomReads3.java that encode the true mapping locations.
reads=<number>
Total number of reads in the original input FASTQ file. This parameter is used to calculate false negative rates and percentage statistics in the ROC output.

Accuracy Parameters

thresh=20
Maximum deviation (in base pairs) from the correct location to be considered 'loosely correct'. Alignments within this threshold are classified as true positives in the loose accuracy category.

Mapper Compatibility Parameters

blasr=f
Set to 't' for BLASR output compatibility. Fixes extra information added to read names by BLASR that interferes with parsing the true location from RandomReads3 headers.
ssaha2=f
Set to 't' for SSAHA2 or SMALT output compatibility. Fixes incorrect soft-clipped read locations reported by these mappers to ensure accurate position comparison.
bitset=t
Track read IDs using a BitSet to detect secondary alignments. Necessary for mappers that incorrectly output multiple primary alignments per read, ensuring only one alignment per read is counted in statistics.

Advanced Parameters

parsecustom=t
Parse custom headers generated by RandomReads3.java to extract true mapping locations. This is essential for ROC curve generation and should remain enabled.
allowspaceslash=t
Allow spaces and slashes in contig names when comparing reference sequences. Helps with compatibility across different reference formats.

Java Parameters

-Xmx
Set Java's memory usage, overriding autodetection. -Xmx20g specifies 20 gigabytes of RAM, -Xmx200m specifies 200 megabytes. The maximum is typically 85% of physical memory. Default is 200m for this lightweight tool.
-eoom
Exit if an out-of-memory exception occurs. Requires Java 8u92+. Useful for automated pipelines to detect memory issues.
-da
Disable Java assertions. May provide a small performance improvement in production environments.

Examples

Basic ROC Analysis

samtoroc.sh in=mapped_reads.sam reads=100000

Generate a ROC curve from a SAM file containing 100,000 mapped synthetic reads. This will output statistics for different mapping quality thresholds.

BLASR Output Analysis

samtoroc.sh in=blasr_output.sam reads=50000 blasr=t thresh=30

Analyze BLASR output with BLASR-specific header parsing enabled and a looser accuracy threshold of 30bp.

SSAHA2/SMALT Analysis

samtoroc.sh in=ssaha2_output.sam reads=75000 ssaha2=t

Process SSAHA2 or SMALT output with position correction for soft-clipped reads.

Pipeline with Standard Input

samtools view alignment.bam | samtoroc.sh in=stdin reads=200000

Process SAM data from standard input, useful for pipeline integration without intermediate files.

Algorithm Details

ROC Curve Generation

SAMTOROC implements ROC analysis using 1000-element arrays for quality score binning and CustomHeader parsing to extract true mapping coordinates from RandomReads3.java headers:

Dual Accuracy Classification

Quality Score Stratification

The algorithm maintains 1000 bins (0-999) for mapping quality scores, tracking statistics at each quality threshold:

BitSet Duplicate Detection

Uses a BitSet data structure (default 400,000 bits, ~50KB memory) to track processed read IDs and detect multiple primary alignments from the same read. The BitSet size scales with read count when specified, preventing double-counting in ROC statistics.

CustomHeader Parsing

Parses RandomReads3.java header format to extract true mapping coordinates:

Mapper-Specific Compatibility

Includes specialized handling for different mapping tools:

Output Format

Generates tab-delimited ROC data with columns: minScore, mapped, retained, truePositiveStrict, falsePositiveStrict, truePositiveLoose, falsePositiveLoose, falseNegative, discarded, ambiguous. Values are reported as percentages of total reads for easy comparison across datasets.

Output Interpretation

ROC Columns

minScore
Minimum mapping quality threshold for this row
mapped
Percentage of reads that received any mapping
retained
Percentage of reads retained after quality filtering
truePositiveStrict
Percentage of reads mapped to exact correct location
falsePositiveStrict
Percentage of reads mapped to incorrect location (strict)
truePositiveLoose
Percentage of reads mapped within threshold of correct location
falsePositiveLoose
Percentage of reads mapped beyond threshold (loose)
falseNegative
Percentage of reads that should have been mapped but weren't
discarded
Percentage of reads discarded by the mapper
ambiguous
Percentage of reads with ambiguous mapping

Performance Metrics

Performance Considerations

Memory Usage

Processing Speed

Input Requirements

Support

For questions and support: