SamToRoc
Creates a ROC curve from a sam file of synthetic reads with headers generated by RandomReads3.java
Basic Usage
samtoroc.sh in=<sam file> reads=<number of reads in input fastq>
SAMTOROC analyzes SAM files containing synthetic reads to generate ROC (Receiver Operating Characteristic) curves for evaluating mapping accuracy. The tool compares the actual mapping locations with the true locations embedded in the read headers by RandomReads3.java.
Parameters
Parameters control input/output specifications, accuracy thresholds, and format compatibility for different mapping tools.
Input/Output Parameters
- in=<file>
- Specify the input SAM file containing mapped synthetic reads, or stdin. The SAM file should contain reads with headers generated by RandomReads3.java that encode the true mapping locations.
- reads=<number>
- Total number of reads in the original input FASTQ file. This parameter is used to calculate false negative rates and percentage statistics in the ROC output.
Accuracy Parameters
- thresh=20
- Maximum deviation (in base pairs) from the correct location to be considered 'loosely correct'. Alignments within this threshold are classified as true positives in the loose accuracy category.
Mapper Compatibility Parameters
- blasr=f
- Set to 't' for BLASR output compatibility. Fixes extra information added to read names by BLASR that interferes with parsing the true location from RandomReads3 headers.
- ssaha2=f
- Set to 't' for SSAHA2 or SMALT output compatibility. Fixes incorrect soft-clipped read locations reported by these mappers to ensure accurate position comparison.
- bitset=t
- Track read IDs using a BitSet to detect secondary alignments. Necessary for mappers that incorrectly output multiple primary alignments per read, ensuring only one alignment per read is counted in statistics.
Advanced Parameters
- parsecustom=t
- Parse custom headers generated by RandomReads3.java to extract true mapping locations. This is essential for ROC curve generation and should remain enabled.
- allowspaceslash=t
- Allow spaces and slashes in contig names when comparing reference sequences. Helps with compatibility across different reference formats.
Java Parameters
- -Xmx
- Set Java's memory usage, overriding autodetection. -Xmx20g specifies 20 gigabytes of RAM, -Xmx200m specifies 200 megabytes. The maximum is typically 85% of physical memory. Default is 200m for this lightweight tool.
- -eoom
- Exit if an out-of-memory exception occurs. Requires Java 8u92+. Useful for automated pipelines to detect memory issues.
- -da
- Disable Java assertions. May provide a small performance improvement in production environments.
Examples
Basic ROC Analysis
samtoroc.sh in=mapped_reads.sam reads=100000
Generate a ROC curve from a SAM file containing 100,000 mapped synthetic reads. This will output statistics for different mapping quality thresholds.
BLASR Output Analysis
samtoroc.sh in=blasr_output.sam reads=50000 blasr=t thresh=30
Analyze BLASR output with BLASR-specific header parsing enabled and a looser accuracy threshold of 30bp.
SSAHA2/SMALT Analysis
samtoroc.sh in=ssaha2_output.sam reads=75000 ssaha2=t
Process SSAHA2 or SMALT output with position correction for soft-clipped reads.
Pipeline with Standard Input
samtools view alignment.bam | samtoroc.sh in=stdin reads=200000
Process SAM data from standard input, useful for pipeline integration without intermediate files.
Algorithm Details
ROC Curve Generation
SAMTOROC implements ROC analysis using 1000-element arrays for quality score binning and CustomHeader parsing to extract true mapping coordinates from RandomReads3.java headers:
Dual Accuracy Classification
- Strict Classification: Requires exact match of chromosome, strand, start, and stop positions between predicted and true locations
- Loose Classification: Allows deviation up to the threshold parameter (default 20bp) from true start or stop positions while maintaining chromosome and strand accuracy
Quality Score Stratification
The algorithm maintains 1000 bins (0-999) for mapping quality scores, tracking statistics at each quality threshold:
- True positives (strict and loose)
- False positives (strict and loose)
- Mapped reads retained
- Unmapped reads
- Discarded and ambiguous alignments
BitSet Duplicate Detection
Uses a BitSet data structure (default 400,000 bits, ~50KB memory) to track processed read IDs and detect multiple primary alignments from the same read. The BitSet size scales with read count when specified, preventing double-counting in ROC statistics.
CustomHeader Parsing
Parses RandomReads3.java header format to extract true mapping coordinates:
- Reference sequence name
- Genomic coordinates (start/stop)
- Strand orientation
- Pair number for paired-end reads
Mapper-Specific Compatibility
Includes specialized handling for different mapping tools:
- BLASR: Strips additional path information from read names that interferes with coordinate parsing
- SSAHA2/SMALT: Corrects position reporting for soft-clipped alignments
- General mappers: Handles various SAM format variations and edge cases
Output Format
Generates tab-delimited ROC data with columns: minScore, mapped, retained, truePositiveStrict, falsePositiveStrict, truePositiveLoose, falsePositiveLoose, falseNegative, discarded, ambiguous. Values are reported as percentages of total reads for easy comparison across datasets.
Output Interpretation
ROC Columns
- minScore
- Minimum mapping quality threshold for this row
- mapped
- Percentage of reads that received any mapping
- retained
- Percentage of reads retained after quality filtering
- truePositiveStrict
- Percentage of reads mapped to exact correct location
- falsePositiveStrict
- Percentage of reads mapped to incorrect location (strict)
- truePositiveLoose
- Percentage of reads mapped within threshold of correct location
- falsePositiveLoose
- Percentage of reads mapped beyond threshold (loose)
- falseNegative
- Percentage of reads that should have been mapped but weren't
- discarded
- Percentage of reads discarded by the mapper
- ambiguous
- Percentage of reads with ambiguous mapping
Performance Metrics
- Sensitivity: truePositive / (truePositive + falseNegative)
- Precision: truePositive / (truePositive + falsePositive)
- Specificity: trueNegative / (trueNegative + falsePositive)
Performance Considerations
Memory Usage
- Default: 200MB Java heap, suitable for most SAM files
- BitSet: ~50KB per 400,000 reads for duplicate detection
- Arrays: Fixed 1000-element arrays for quality score bins (~40KB)
- Scaling: Memory usage scales with number of unique read IDs when bitset is enabled
Processing Speed
- Single-pass SAM file reading with streaming processing
- O(1) lookup time for quality score binning
- O(1) duplicate detection with BitSet
- Linear time complexity O(n) where n is number of alignments
Input Requirements
- SAM format input (not BAM - use samtools view for conversion)
- Synthetic reads with RandomReads3.java headers containing true coordinates
- Known total read count for accurate percentage calculations
Support
For questions and support:
- Email: bbushnell@lbl.gov
- Documentation: bbmap.org