SummarizeQuast

Basic Usage

summarizequast.sh */quast/report.tsv

The tool accepts multiple Quast report files as input arguments. It reads the TSV format output from Quast and processes the metrics across all provided reports.

Parameters

Parameters control how the Quast reports are processed and summarized. The tool supports filtering, normalization, and output formatting options.

out=stdout: Destination for summary output. Can be set to a filename to write results to a file instead of standard output. The output contains metric names followed by assembly statistics.
required=: A required substring in assembly names for filtering. Only assemblies whose names contain this substring will be included in the analysis. Leave empty to include all assemblies.
normalize=t: Normalize each metric to the average per report. When enabled, each value is divided by the average of that metric across all assemblies in the same report, so a normalized value of 1.0 corresponds to the report average. This makes relative performance comparable across reports with different scales.
box=t: Print only 5 points per metric for box plots. When enabled, outputs the 10th, 25th, 50th (median), 75th, and 90th percentiles for each metric instead of all individual values. This format is ideal for creating box plot visualizations.

Examples

Basic Summarization

summarizequast.sh project1/quast/report.tsv project2/quast/report.tsv project3/quast/report.tsv

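Summarizes three Quast reports with the default settings (normalize=t, box=t), so each metric is normalized and reported as five percentile points per assembly. Add box=f to output all values for each metric instead.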

Box Plot Format

summarizequast.sh box=t out=summary.txt */quast/report.tsv

Processes all Quast reports in subdirectories, outputting 5-number summary statistics suitable for box plots to a file.

Filtered Analysis

summarizequast.sh required=spades normalize=f */quast/report.tsv

Analyzes only assemblies containing "spades" in their names, without normalization.

Normalized Comparison

summarizequast.sh normalize=t out=normalized_summary.txt */quast/report.tsv

Normalizes all metrics to their respective averages for relative comparison across assemblies of different scales.

Algorithm Details

Multi-Report Processing Architecture: SummarizeQuast implements a three-stage pipeline using nested LinkedHashMap data structures for QUAST report consolidation:

Stage 1: Report Parsing (QuastSummary.process)

Each QUAST TSV file is parsed line by line with TextFile.nextLine(), using the header to interpret the columns. The first line is split into a String[] header array that holds the assembly column names. Each subsequent line yields Entry objects whose values are parsed with Double.parseDouble(); a NumberFormatException marks a non-numeric value, which is skipped. Values that parse but are NaN or infinite are discarded using Double.isNaN() and Double.isInfinite() checks.
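The parsing rules above can be sketched in a few lines of plain Java. This is an illustration only, not the tool's source: the class QuastParseSketch, its Entry type, and the sample assembly names are hypothetical, and standard-library collections stand in for TextFile.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for the parsing stage; not the actual QuastSummary code.
class QuastParseSketch {

    // One parsed cell: the metric, the assembly column it came from, and its value.
    static class Entry {
        final String metric;
        final String assembly;
        final double value;

        Entry(String metric, String assembly, double value) {
            this.metric = metric;
            this.assembly = assembly;
            this.value = value;
        }
    }

    static List<Entry> parse(List<String> lines) {
        List<Entry> entries = new ArrayList<>();
        if (lines.isEmpty()) {
            return entries;
        }
        // The first line is the header; columns after the first name the assemblies.
        String[] header = lines.get(0).split("\t");
        for (String line : lines.subList(1, lines.size())) {
            String[] cols = line.split("\t");
            for (int i = 1; i < cols.length && i < header.length; i++) {
                try {
                    double v = Double.parseDouble(cols[i]);
                    // Drop values that parse but are not usable numbers.
                    if (Double.isNaN(v) || Double.isInfinite(v)) {
                        continue;
                    }
                    entries.add(new Entry(cols[0], header[i], v));
                } catch (NumberFormatException e) {
                    // Non-numeric cell (for example "-"): skip it.
                }
            }
        }
        return entries;
    }

    public static void main(String[] args) {
        // Made-up sample rows in the report.tsv layout.
        List<String> lines = Arrays.asList(
            "Assembly\tasmA\tasmB",
            "N50\t1000\t2500",
            "# contigs\t12\t-");
        for (Entry e : parse(lines)) {
            System.out.println(e.metric + "\t" + e.assembly + "\t" + e.value);
        }
    }
}
```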

Stage 2: Data Consolidation (summarize method)

The tool builds a nested LinkedHashMap<String, LinkedHashMap<String, ArrayList<Double>>> named metricMap, in which the outer keys are metric names and the inner keys are assembly names. LinkedHashMap preserves insertion order, so output ordering is consistent from run to run. For each entry in QuastSummary.metrics, the values for matching assemblies across reports are collected into an ArrayList<Double> using put() and add().
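A minimal sketch of this consolidation step follows. The class and method names are hypothetical, and computeIfAbsent() is used here as a compact substitute for the explicit put() and add() calls described above; the resulting structure is the same.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration of the nested metric -> assembly -> values map.
class ConsolidateSketch {

    // Appends one value, creating the inner map and list on first use.
    static void add(Map<String, Map<String, List<Double>>> metricMap,
                    String metric, String assembly, double value) {
        metricMap
            .computeIfAbsent(metric, k -> new LinkedHashMap<>())
            .computeIfAbsent(assembly, k -> new ArrayList<>())
            .add(value);
    }

    public static void main(String[] args) {
        Map<String, Map<String, List<Double>>> metricMap = new LinkedHashMap<>();
        // Made-up values, as if read from two different reports.
        add(metricMap, "N50", "asmA", 1000.0);
        add(metricMap, "N50", "asmB", 2500.0);
        add(metricMap, "N50", "asmA", 1200.0);
        // Iteration follows insertion order because of LinkedHashMap.
        System.out.println(metricMap);
    }
}
```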

Stage 3: Statistical Processing (print method)

In box plot mode, each ArrayList<Double> is copied by index into a double[] array allocated from size(). The array is then sorted with Arrays.sort() and sampled at five percentile positions:

Percentile Calculation: Index-based lookup using the formula array[(int)Math.round(percentile * (len-1))], where percentile is a fraction between 0 and 1
Five-Number Summary: Extracts the 10th, 25th, 50th, 75th, and 90th percentiles (fractions 0.1, 0.25, 0.5, 0.75, and 0.9 in the formula above)
Math.round() Implementation: Rounds each fractional position to the nearest integer index
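The lookup formula can be tried out with the short sketch below. It assumes len is the array length, as the formula above suggests; the class and method names are hypothetical, and the empty-input check is an addition for safety rather than something the source describes.

```java
import java.util.Arrays;

// Hypothetical illustration of the five-point percentile lookup.
class BoxPointsSketch {

    static final double[] FRACTIONS = {0.10, 0.25, 0.50, 0.75, 0.90};

    static double[] boxPoints(double[] values) {
        if (values.length == 0) {
            throw new IllegalArgumentException("no values to summarize");
        }
        // Sort a copy so the caller's array is left untouched.
        double[] sorted = values.clone();
        Arrays.sort(sorted);
        double[] points = new double[FRACTIONS.length];
        for (int i = 0; i < FRACTIONS.length; i++) {
            // Nearest-index lookup: no interpolation between neighbors.
            int idx = (int) Math.round(FRACTIONS[i] * (sorted.length - 1));
            points[i] = sorted[idx];
        }
        return points;
    }

    public static void main(String[] args) {
        double[] values = {7, 3, 11, 1, 9, 5, 2, 10, 4, 8, 6};
        System.out.println(Arrays.toString(boxPoints(values)));
    }
}
```

Because the index is rounded rather than interpolated, every reported point is an actual observed value, and with very few values several percentiles can map to the same element.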

Normalization Algorithm (normalize method)

Per-metric standardization operates on ArrayList<Entry> collections through QuastSummary.normalize():

Sum Calculation: Iterative accumulation using for-each loop over Entry.value fields
Average Computation: Arithmetic mean via sum/list.size() division
Multiplicative Scaling: Each Entry.value *= (avg==0 ? 1 : 1/avg) transformation
Zero-Division Protection: Conditional operator prevents division by zero with fallback multiplier of 1
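The four steps above amount to the following sketch. It operates on a plain double[] instead of the ArrayList<Entry> described in the text, which is a simplification for illustration; the class name is hypothetical.

```java
import java.util.Arrays;

// Hypothetical illustration of per-metric normalization to the average.
class NormalizeSketch {

    // Scales values in place so that their mean becomes 1 (unless the mean is 0).
    static void normalize(double[] values) {
        if (values.length == 0) {
            return;
        }
        double sum = 0;
        for (double v : values) {
            sum += v;
        }
        double avg = sum / values.length;
        // A zero average would divide by zero, so fall back to a multiplier of 1.
        double mult = (avg == 0 ? 1 : 1 / avg);
        for (int i = 0; i < values.length; i++) {
            values[i] *= mult;
        }
    }

    public static void main(String[] args) {
        double[] n50 = {2, 4, 6};
        normalize(n50);
        System.out.println(Arrays.toString(n50));
    }
}
```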

Filtering Mechanism (required parameter)

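Assembly names are filtered while QuastSummary.process() parses the report: each header[i] assembly column name is tested with String.contains(requiredString), and only matching columns are kept. Because String.contains() is case-sensitive, the required substring must match the case used in the assembly names. This substring matching enables selective analysis by assembly type or naming convention.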
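As an illustration of this substring filter, the sketch below returns the indices of the header columns that would be kept. The class, method, and assembly names are hypothetical, and treating a null or empty filter as "keep everything" follows the parameter description rather than the tool's source.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of filtering assembly columns by a required substring.
class RequiredFilterSketch {

    static List<Integer> keptColumns(String[] header, String required) {
        List<Integer> kept = new ArrayList<>();
        // Column 0 holds the metric name, so assembly columns start at 1.
        for (int i = 1; i < header.length; i++) {
            boolean noFilter = (required == null || required.isEmpty());
            if (noFilter || header[i].contains(required)) {
                kept.add(i);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        String[] header = {"Assembly", "spades_k55", "megahit", "spades_k77"};
        System.out.println(keptColumns(header, "spades"));
    }
}
```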

Output Implementation (TextStreamWriter)

Results are written as tab-delimited text through a TextStreamWriter. Each metric name is printed on its own line via tsw.println(), followed by one line per assembly whose values are written with tsw.print() and separated by tabs. Box plot mode outputs five percentile values per assembly, while standard mode iterates over the ArrayList<Double> and outputs every value.
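The standard-mode layout can be approximated with the sketch below, which builds the text in a StringBuilder rather than using TextStreamWriter. The class name is hypothetical, and details such as number formatting and blank lines between metrics are assumptions that may differ from the tool's actual output.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration of the tab-delimited layout in standard mode.
class OutputSketch {

    static String format(Map<String, Map<String, List<Double>>> metricMap) {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, Map<String, List<Double>>> metric : metricMap.entrySet()) {
            // Metric name on its own line.
            sb.append(metric.getKey()).append('\n');
            for (Map.Entry<String, List<Double>> asm : metric.getValue().entrySet()) {
                // Assembly name, then each value preceded by a tab.
                sb.append(asm.getKey());
                for (double v : asm.getValue()) {
                    sb.append('\t').append(v);
                }
                sb.append('\n');
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        Map<String, Map<String, List<Double>>> metricMap = new LinkedHashMap<>();
        Map<String, List<Double>> n50 = new LinkedHashMap<>();
        n50.put("asmA", new ArrayList<>(Arrays.asList(1000.0, 1200.0)));
        metricMap.put("N50", n50);
        System.out.print(format(metricMap));
    }
}
```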

Memory Management

Input paths are expanded through Tools.getFileOrFiles() and processed sequentially, with one QuastSummary instantiated per file. This maintains O(assemblies * metrics) memory complexity rather than O(total_data_volume). Because LinkedHashMap preserves insertion order, output is deterministic without any additional sorting. TextStreamWriter.poisonAndWait() ensures proper resource cleanup and thread synchronization before the program exits.

Input Format

SummarizeQuast expects Quast report files in TSV format with the following structure:

Header Row: First column is "Assembly", followed by assembly names as column headers
Metric Rows: Each row starts with a metric name, followed by values for each assembly
Numeric Values: All metric values should be numeric; non-numeric values are ignored
File Extensions: Tool accepts any file path; commonly used with report.tsv files from Quast

Output Format

The tool outputs organized metric summaries in the following format:

Standard Mode (box=f)


MetricName1
Assembly1    value1    value2    value3
Assembly2    value4    value5    value6

MetricName2
Assembly1    value7    value8    value9
Assembly2    value10   value11   value12

Box Plot Mode (box=t)


MetricName1
Assembly1    10th_percentile    25th_percentile    median    75th_percentile    90th_percentile
Assembly2    10th_percentile    25th_percentile    median    75th_percentile    90th_percentile

Support

For questions and support:

Email: bbushnell@lbl.gov
Documentation: bbmap.org