GradeMerge

Basic Usage

grademerge.sh in=<file>

GradeMerge evaluates the quality of read merging by comparing the length of each merged synthetic read against its known insert size. It is designed for synthetic reads created with RandomReads, whose headers carry the insert size.

Parameters

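GradeMerge takes a file of merged reads and, optionally, the raw read pairs they were merged from. Supplying the raw reads enables a second analysis stage that reports how many pairs were theoretically mergeable.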

Input/Output Parameters

in=<file>: Specify the input file containing merged reads, or 'stdin'. The reads must have synthetic headers containing insert size information (e.g., "insert=250") for proper grading.
raw=<file>: Specify the original raw read pairs before merging. This allows calculation of the percentage of read pairs that were theoretically mergeable. Use the # symbol as a placeholder for paired files (raw=reads#.fq expands to reads1.fq and reads2.fq).
raw1=<file>: First file of raw paired reads. Use with raw2 to specify paired files explicitly.
raw2=<file>: Second file of raw paired reads. Used in conjunction with raw1.
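
The # placeholder described for raw= can be illustrated with a short Python sketch. This is not GradeMerge's own code; the function name expand_paired is hypothetical, and the sketch assumes only the first # is substituted.

```python
def expand_paired(path):
    """Expand a '#' placeholder into a (read1, read2) pair of file names.

    Hypothetical helper for illustration; returns (path, None) when
    the path contains no placeholder.
    """
    if "#" not in path:
        return (path, None)
    # Assumption: only the first '#' is treated as the placeholder.
    return (path.replace("#", "1", 1), path.replace("#", "2", 1))


print(expand_paired("reads#.fq"))
```

Passing raw1= and raw2= explicitly avoids the placeholder entirely and is the safer choice when the two file names do not differ by a single digit.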

Processing Parameters

verbose=f: Print additional processing information during execution. Set to true for detailed output about file processing stages.

Examples

Basic Merge Grading

grademerge.sh in=merged_reads.fq

Grades the correctness of merged reads by comparing actual merged length against the insert size embedded in synthetic read headers.

Dual-Stage Analysis with Raw Reads

grademerge.sh in=merged_reads.fq raw=raw_reads#.fq

Analyzes merged reads and also reports what percentage of the original raw read pairs were theoretically mergeable based on their insert sizes.

Verbose Processing

grademerge.sh in=merged_reads.fq raw1=reads_1.fq raw2=reads_2.fq verbose=t

Performs dual-stage analysis with detailed processing information printed to stderr.

Algorithm Details

Merge Quality Assessment

GradeMerge grades each merged read by the difference (delta) between the insert size recorded in its synthetic header and the read's actual merged length:

Header Parsing Strategy

SYN Headers: Synthetic read headers created by RandomReads are parsed with the CustomHeader constructor.
Insert Headers: Headers in "insert=N" format are handled by the parseInsert() method, which extracts the insert size by substring parsing.
Character-by-Character Parsing: Digits are validated one at a time with Tools.isDigit(), so the insert size can still be recovered after other tools have modified the header.
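
The "insert=N" parsing described above can be sketched in Python. This is an illustration of the idea, not the tool's Java implementation; the function name parse_insert, the -1 return value for a missing field, and the example header are all assumptions.

```python
def parse_insert(header):
    """Return the integer that follows 'insert=' in a header, or -1 if absent."""
    key = "insert="
    start = header.find(key)
    if start < 0:
        return -1
    pos = start + len(key)
    end = pos
    # Consume digits one at a time so that any text appended after the
    # number by other tools is ignored.
    while end < len(header) and header[end] in "0123456789":
        end += 1
    return int(header[pos:end]) if end > pos else -1


# Made-up header with trailing text added by a downstream tool.
print(parse_insert("read_17 insert=250 /1 trimmed"))
```

Scanning digit by digit, rather than splitting on whitespace, means the parse does not depend on what separator follows the number.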

Quality Classification

Each merged read is classified by comparing its actual length with the expected insert size, where delta = insert size - merged length:

Correct: Merged length exactly matches the insert size (delta = 0)
Too Long: Merged length exceeds the insert size (delta < 0)
Too Short: Merged length is less than the insert size (delta > 0)
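
A minimal Python sketch of the three-way classification follows. It compares the two lengths directly; the function name, the category labels, and the sample values are made up for illustration.

```python
from collections import Counter


def classify(merged_length, insert):
    """Label a merged read by comparing its length with the known insert size."""
    if merged_length == insert:
        return "correct"
    return "too_long" if merged_length > insert else "too_short"


# Made-up (merged_length, insert) pairs.
pairs = [(250, 250), (260, 250), (240, 250)]
counts = Counter(classify(length, insert) for length, insert in pairs)
print(counts)
```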

Statistical Analysis

The tool computes statistics using counters for each classification category:

Accuracy Percentages: Fraction of reads in each category (correct/too short/too long), formatted to five decimal places with Tools.format("%.5f")
Signal-to-Noise Ratio (SNR): Computed with Math.log10() as 10 * log10((correct + incorrect + 0.0001) / (incorrect + 0.0001)); the 0.0001 terms keep the ratio finite when there are no incorrect merges
Mergeability Analysis: When raw reads are provided, a pair counts as theoretically mergeable if insert < pairLength(), and the tool reports the percentage of such pairs
Processing Throughput: Tools.timeReadsBasesProcessed() reports reads and bases processed per unit time
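
The SNR formula above can be reproduced in a few lines of Python. The counts are invented for the example, and treating "incorrect" as the sum of too-short and too-long reads is an assumption of this sketch.

```python
import math


def snr(correct, incorrect):
    """10 * log10 of (all merges / incorrect merges), with a small offset."""
    return 10 * math.log10((correct + incorrect + 0.0001) / (incorrect + 0.0001))


# Invented counts for illustration.
correct, too_short, too_long = 9800, 150, 50
incorrect = too_short + too_long  # assumption: both error types count as noise
total = correct + incorrect

print("correct   %.5f%%" % (100 * correct / total))
print("too short %.5f%%" % (100 * too_short / total))
print("too long  %.5f%%" % (100 * too_long / total))
print("SNR       %.4f" % snr(correct, incorrect))
```

Because of the 0.0001 offset, a run with zero incorrect merges yields a large finite value rather than a division-by-zero error.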

Dual Processing Strategy

When raw reads are provided, GradeMerge performs two-stage analysis using ConcurrentReadInputStream:

Raw Analysis: Determines which read pairs could theoretically be merged, using the condition insert size < r1.pairLength()
Merged Analysis: Evaluates the quality of the reads that were actually merged, using delta = insert - initialLength1, where initialLength1 is the length of the merged read
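
The raw-stage test can be sketched as follows. The sketch assumes that the pair length is the sum of the two read lengths, and the sample pairs are invented; it is not the tool's streaming implementation.

```python
def is_mergeable(insert, len1, len2):
    """A pair can overlap only if the insert is shorter than both reads combined.

    Assumption: pair length = length of read 1 + length of read 2.
    """
    return insert < len1 + len2


# Invented (insert, read1_length, read2_length) triples.
raw_pairs = [(250, 150, 150), (400, 150, 150), (299, 150, 150)]
mergeable = sum(is_mergeable(*pair) for pair in raw_pairs)
percent = 100 * mergeable / len(raw_pairs)
print("%d of %d pairs mergeable (%.2f%%)" % (mergeable, len(raw_pairs), percent))
```

This percentage gives the ceiling against which the merged-stage results can be judged: a merger cannot correctly merge pairs whose reads do not overlap.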

Performance Characteristics

Memory Configuration: The Java heap is capped at 200 MB (-Xmx200m), which keeps memory usage predictable
Streaming Processing: Reads are streamed in ListNum<Read> batches through ConcurrentReadInputStream, so memory use does not grow with input size
Format Support: FileFormat.testInput() handles FASTQ and FASTA formats, compressed or uncompressed

Output Metrics

GradeMerge provides detailed statistics including:

Total input pairs and the percentage of overlapping pairs (when raw reads are provided)
Correct merge percentage and absolute counts
Error breakdown (too short vs too long)
Signal-to-noise ratio for merge quality assessment
Processing speed metrics (time, reads/second, bases/second)

Support

For questions and support:

Email: bbushnell@lbl.gov
Documentation: bbmap.org