GradeMerge
Grades correctness of merging synthetic reads with headers generated by RandomReads and re-headered by RenameReads.
Basic Usage
grademerge.sh in=<file>
GradeMerge evaluates the quality of read merging by comparing merged synthetic reads against their known insert sizes. This tool is specifically designed for synthetic reads created with RandomReads that contain insert size information in their headers.
Parameters
GradeMerge requires one input file of merged reads and optionally accepts the original raw read pairs, which enables a two-stage analysis.
Input/Output Parameters
- in=<file>
- Specify the input file containing merged reads, or 'stdin'. The reads must have synthetic headers containing insert size information (e.g., "insert=250") for proper grading.
- raw=<file>
- Specify the original raw read pairs before merging. This allows GradeMerge to calculate what percentage of read pairs were theoretically mergeable. Use the # symbol for paired files (raw=reads#.fq expands to reads1.fq and reads2.fq).
- raw1=<file>
- First file of raw paired reads. Use with raw2 to specify paired files explicitly.
- raw2=<file>
- Second file of raw paired reads. Used in conjunction with raw1.
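The # expansion described for raw= can be illustrated with a short Python sketch. The function name expand_paired is hypothetical and this is not the tool's own code; it only demonstrates the substitution stated above, assuming a single # in the path.

```python
from typing import Optional, Tuple


def expand_paired(path: str) -> Tuple[str, Optional[str]]:
    """Expand a '#' placeholder into the two paired file names.

    Hypothetical helper for illustration only. A path without '#'
    is returned unchanged with no second file.
    """
    if "#" not in path:
        return path, None
    # Replace the first '#' with 1 for the first file and 2 for the second.
    return path.replace("#", "1", 1), path.replace("#", "2", 1)


if __name__ == "__main__":
    print(expand_paired("reads#.fq"))
```

When the two files do not follow this naming pattern, raw1= and raw2= name them explicitly.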
Processing Parameters
- verbose=f
- Print additional processing information during execution. Set to true for detailed output about file processing stages.
Examples
Basic Merge Grading
grademerge.sh in=merged_reads.fq
Grades the correctness of merged reads by comparing actual merged length against the insert size embedded in synthetic read headers.
Dual-Stage Analysis with Raw Reads
grademerge.sh in=merged_reads.fq raw=raw_reads#.fq
Analyzes merged reads and also reports what percentage of the original raw read pairs were theoretically mergeable based on their insert sizes.
Verbose Processing
grademerge.sh in=merged_reads.fq raw1=reads_1.fq raw2=reads_2.fq verbose=t
Performs dual-stage analysis with detailed processing information printed to stderr.
Algorithm Details
Merge Quality Assessment
GradeMerge evaluates merging accuracy by computing, for each merged read, the difference (delta) between the known insert size recorded in its synthetic header and the actual merged length.
Header Parsing Strategy
- SYN Headers: Uses CustomHeader constructor to parse synthetic read headers created by RandomReads
- Insert Headers: parseInsert() method extracts insert size from "insert=N" format headers using substring parsing
- Character-by-Character Parsing: Digits are read one at a time and validated with Tools.isDigit(), so the insert size can still be recovered when other tools have modified or appended to the header
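The "insert=N" extraction can be sketched in Python as follows. This is an illustration rather than the tool's Java code: the name parse_insert, the -1 return value for a missing field, and the sample header are all assumptions made for the example. The sketch locates "insert=" and consumes digits one at a time, so text appended after the number is ignored.

```python
def parse_insert(header: str) -> int:
    """Return the integer following 'insert=' in a header, or -1 if absent.

    Hypothetical helper: it mirrors the digit-by-digit scan described
    above, not the actual GradeMerge source.
    """
    key = "insert="
    start = header.find(key)
    if start < 0:
        return -1
    pos = start + len(key)
    value = 0
    found_digit = False
    # Stop at the first non-digit so trailing annotations are ignored.
    while pos < len(header) and header[pos] in "0123456789":
        value = value * 10 + int(header[pos])
        found_digit = True
        pos += 1
    return value if found_digit else -1


if __name__ == "__main__":
    # Sample header invented for illustration.
    print(parse_insert("read_7 insert=250 /1"))
```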
Quality Classification
Each merged read is classified based on the comparison between its actual length and the expected insert size:
- Correct: Merged length exactly matches the insert size (delta = 0, where delta = insert size - merged length)
- Too Long: Merged length exceeds the insert size (delta < 0)
- Too Short: Merged length is less than the insert size (delta > 0)
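The three categories can be sketched in Python as shown here. The function names and the sample (insert size, merged length) pairs are hypothetical; the sketch compares the two lengths directly and keeps one counter per category.

```python
from collections import Counter
from typing import Iterable, Tuple


def classify_merge(insert_size: int, merged_length: int) -> str:
    """Label one merged read by comparing its length to the known insert size."""
    if merged_length == insert_size:
        return "correct"
    if merged_length > insert_size:
        return "too_long"
    return "too_short"


def tally(reads: Iterable[Tuple[int, int]]) -> Counter:
    """Count reads per category from (insert_size, merged_length) pairs."""
    return Counter(classify_merge(ins, length) for ins, length in reads)


if __name__ == "__main__":
    # Invented example values, not real output.
    sample = [(250, 250), (250, 260), (250, 240), (300, 300)]
    print(tally(sample))
```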
Statistical Analysis
The tool computes statistics using counters for each classification category:
- Accuracy Percentages: Fraction of reads in each category (correct/too short/too long) using Tools.format("%.5f") precision
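- Signal-to-Noise Ratio (SNR): 10 * log10((correct + incorrect + 0.0001) / (incorrect + 0.0001)), computed with Math.log10(), where incorrect counts the too-short and too-long reads; the 0.0001 terms keep the ratio finite when no reads are incorrect
- Mergeability Analysis: When raw reads are provided, reports the percentage of pairs that were theoretically mergeable, using the condition insert < pairLength()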
- Processing Throughput: Tools.timeReadsBasesProcessed() reports reads and bases processed per unit time
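The SNR formula listed above can be reproduced with a few lines of Python. The function name merge_snr is hypothetical; the expression itself, including the 0.0001 terms, is taken from the list above.

```python
import math


def merge_snr(correct: int, incorrect: int) -> float:
    """Compute 10 * log10((correct + incorrect + 0.0001) / (incorrect + 0.0001))."""
    return 10 * math.log10((correct + incorrect + 0.0001) / (incorrect + 0.0001))


if __name__ == "__main__":
    # Invented counts: 990 correct and 10 incorrect merges give roughly 20.
    print(round(merge_snr(990, 10), 3))
    # With zero incorrect merges the small constant keeps the result finite.
    print(round(merge_snr(1000, 0), 3))
```

Each tenfold drop in the fraction of incorrect merges raises the value by about 10, which makes small differences between very accurate mergers easier to compare than raw percentages.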
Dual Processing Strategy
When raw reads are provided, GradeMerge performs two-stage analysis using ConcurrentReadInputStream:
- Raw Analysis: Counts a read pair as theoretically mergeable when its insert size is less than r1.pairLength()
- Merged Analysis: Evaluates the quality of the reads that were actually merged using delta = insert - initialLength1, where initialLength1 is the length of the merged read
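The raw-read stage can be sketched in Python as follows. It assumes that pairLength() is the combined length of the two mates, which the text above does not state explicitly; the function names and sample values are hypothetical.

```python
from typing import Iterable, Tuple


def is_mergeable(insert_size: int, len1: int, len2: int) -> bool:
    """A pair can overlap only if the insert is shorter than both reads combined.

    Assumption: len1 + len2 stands in for pairLength().
    """
    return insert_size < len1 + len2


def mergeable_percent(pairs: Iterable[Tuple[int, int, int]]) -> float:
    """Percentage of (insert_size, len1, len2) pairs that could be merged."""
    total = 0
    mergeable = 0
    for insert_size, len1, len2 in pairs:
        total += 1
        if is_mergeable(insert_size, len1, len2):
            mergeable += 1
    return 100.0 * mergeable / total if total else 0.0


if __name__ == "__main__":
    # Invented pairs of 150 bp reads with inserts of 250 and 400.
    print(mergeable_percent([(250, 150, 150), (400, 150, 150)]))
```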
Performance Characteristics
- Memory Configuration: Fixed 200MB heap allocation (-Xmx200m) for predictable memory usage
- Streaming Processing: Uses ListNum<Read> batching with ConcurrentReadInputStream for constant memory usage
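The batching idea behind streaming processing can be illustrated with a generic Python generator. This is a sketch of the general technique, not a port of ListNum<Read> or ConcurrentReadInputStream; the batch size of 200 is an arbitrary assumption.

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def batches(items: Iterable[T], size: int = 200) -> Iterator[List[T]]:
    """Yield successive lists of at most `size` items.

    Only one batch is held in memory at a time, so memory use does not
    grow with the size of the input.
    """
    it = iter(items)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


if __name__ == "__main__":
    for batch in batches(range(5), size=2):
        print(batch)
```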
- Format Support: FileFormat.testInput() handles FASTQ and FASTA formats, compressed or uncompressed
Output Metrics
GradeMerge provides detailed statistics including:
- Total input pairs and the percentage that overlap (when raw reads are provided)
- Correct merge percentage and absolute counts
- Error breakdown (too short vs too long)
- Signal-to-noise ratio for merge quality assessment
- Processing speed metrics (time, reads/second, bases/second)
Support
For questions and support:
- Email: bbushnell@lbl.gov
- Documentation: bbmap.org