SummarizeQuast
Summarizes the output of multiple Quast reports for making box plots. This tool processes multiple Quast report TSV files and consolidates their metrics into a format suitable for statistical analysis and visualization.
Basic Usage
summarizequast.sh */quast/report.tsv
The tool accepts multiple Quast report files as input arguments. It reads the TSV format output from Quast and processes the metrics across all provided reports.
Parameters
Parameters control how the Quast reports are processed and summarized. The tool supports filtering, normalization, and output formatting options.
- out=stdout
  Destination for summary output. Can be set to a filename to write results to a file instead of standard output. The output contains metric names followed by assembly statistics.
- required=
  A required substring in assembly names for filtering. Only assemblies whose names contain this substring will be included in the analysis. Leave empty to include all assemblies.
- normalize=t
  Normalize each metric to the average per report. When enabled, values for each metric are normalized by dividing by the average value across all assemblies for that metric. This helps compare relative performance across different scales.
- box=t
  Print only 5 points per metric for box plots. When enabled, outputs the 10th, 25th, 50th (median), 75th, and 90th percentiles for each metric instead of all individual values. This format is ideal for creating box plot visualizations.
Examples
Basic Summarization
summarizequast.sh project1/quast/report.tsv project2/quast/report.tsv project3/quast/report.tsv
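Summarizes three Quast reports with the default settings. Because normalize=t and box=t are the defaults, the output contains five normalized percentile values per metric for each assembly; add box=f to print every individual value instead.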
Box Plot Format
summarizequast.sh box=t out=summary.txt */quast/report.tsv
Processes all Quast reports in subdirectories and writes five percentile values per metric (10th, 25th, 50th, 75th, and 90th), suitable for box plots, to summary.txt.
Filtered Analysis
summarizequast.sh required=spades normalize=f */quast/report.tsv
Analyzes only assemblies containing "spades" in their names, without normalization.
Normalized Comparison
summarizequast.sh normalize=t out=normalized_summary.txt */quast/report.tsv
Normalizes all metrics to their respective averages for relative comparison across assemblies of different scales.
Algorithm Details
Multi-Report Processing Architecture: SummarizeQuast implements a three-stage pipeline using nested LinkedHashMap data structures for QUAST report consolidation:
Stage 1: Report Parsing (QuastSummary.process)
Each QUAST TSV file is parsed line by line with TextFile.nextLine(). The first line supplies the String[] header array, which defines the assembly column names. Each subsequent line yields Entry objects whose values are parsed with Double.parseDouble(); a NumberFormatException identifies a non-numeric value, which is ignored. NaN and infinite values are discarded using Double.isNaN() and Double.isInfinite() checks.
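The parsing stage described above can be sketched in Python as follows. This is only an illustration of the described behavior, not the tool's Java code: the function name parse_quast_report and the tuple layout are hypothetical, and the sketch assumes the first column of every row holds the metric name.

```python
import math


def parse_quast_report(text):
    """Parse the text of one Quast report.tsv into (metric, assembly, value) tuples.

    Hypothetical helper that mirrors the described behavior; it is not the
    actual QuastSummary.process() implementation.
    """
    lines = [ln for ln in text.splitlines() if ln.strip()]
    if not lines:
        return []
    # First line is the header: "Assembly" followed by the assembly names.
    header = lines[0].split("\t")
    entries = []
    for line in lines[1:]:
        cols = line.split("\t")
        metric = cols[0]
        for i in range(1, min(len(cols), len(header))):
            try:
                value = float(cols[i])
            except ValueError:
                continue  # non-numeric cell: ignored
            if math.isnan(value) or math.isinf(value):
                continue  # NaN and infinite values: discarded
            entries.append((metric, header[i], value))
    return entries
```

Cells such as "-" or free text are skipped by the ValueError branch, which plays the role the NumberFormatException handling has in the description above.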
Stage 2: Data Consolidation (summarize method)
The tool builds a nested map, LinkedHashMap<String, LinkedHashMap<String, ArrayList<Double>>> metricMap, in which the outer keys are metric names and the inner keys are assembly names. LinkedHashMap preserves insertion order, which keeps the output order consistent. For each QuastSummary.metrics entry, values from matching assemblies across reports are aggregated into ArrayList<Double> collections using put() and add() operations.
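A minimal Python sketch of this consolidation step, assuming each report has already been reduced to (metric, assembly, value) tuples. The function name consolidate and that tuple layout are hypothetical; a plain dict stands in for LinkedHashMap because Python dicts preserve insertion order (3.7+).

```python
def consolidate(reports):
    """Merge per-report entries into metric -> assembly -> list of values.

    `reports` is a list of reports, each a list of (metric, assembly, value)
    tuples. Hypothetical helper, not the tool's summarize method.
    """
    metric_map = {}
    for entries in reports:
        for metric, assembly, value in entries:
            # Create the inner map and the value list on first sight.
            by_assembly = metric_map.setdefault(metric, {})
            by_assembly.setdefault(assembly, []).append(value)
    return metric_map
```

Metrics and assemblies appear in the result in the order they were first encountered, which is the property the insertion-ordered map provides in the description above.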
Stage 3: Statistical Processing (print method)
Box plot mode converts ArrayList<Double> to double[] arrays using size() allocation and index-based copying. Statistical analysis applies Arrays.sort() for ordering followed by percentile extraction:
- Percentile Calculation: Index-based lookup using the formula array[(int)Math.round(percentile * (len-1))]
- Five-Number Summary: Extracts the 10th, 25th, 50th, 75th, and 90th percentiles, i.e. percentile values of 0.1, 0.25, 0.5, 0.75, and 0.9 in the formula above
- Math.round() Implementation: Provides nearest-integer index calculation for percentile positions
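The percentile lookup can be illustrated with a short Python sketch. The names java_round and five_points are hypothetical. Java's Math.round rounds halves upward (floor(x + 0.5)), whereas Python's built-in round() rounds halves to even, so the sketch reproduces the Java behavior explicitly. Returning an empty list for empty input is an assumption of this sketch.

```python
import math

PERCENTILES = (0.1, 0.25, 0.5, 0.75, 0.9)


def java_round(x):
    """Round half up, like Java's Math.round (Python's round() differs)."""
    return math.floor(x + 0.5)


def five_points(values):
    """Return the 10th, 25th, 50th, 75th and 90th percentile values.

    Uses the index formula sorted[round(p * (len - 1))] from the text.
    """
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        return []  # assumption: nothing to report for an empty list
    return [ordered[java_round(p * (n - 1))] for p in PERCENTILES]
```

For the eleven values 1 through 11 the selected indices are 1, 3, 5, 8, and 9, so five_points returns [2, 4, 6, 9, 10]. With a single value, all five points equal that value.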
Normalization Algorithm (normalize method)
Per-metric standardization operates on ArrayList<Entry> collections through QuastSummary.normalize():
- Sum Calculation: Iterative accumulation using for-each loop over Entry.value fields
- Average Computation: Arithmetic mean via sum/list.size() division
- Multiplicative Scaling: Each Entry.value *= (avg==0 ? 1 : 1/avg) transformation
- Zero-Division Protection: Conditional operator prevents division by zero with fallback multiplier of 1
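The four steps above amount to the following Python sketch, which operates on the values of one metric within one report. The function name normalize_values is hypothetical, and it returns a new list rather than modifying Entry objects in place.

```python
def normalize_values(values):
    """Scale values so that their mean becomes 1.

    If the mean is 0, the multiplier falls back to 1 and the values are
    returned unchanged, mirroring the zero-division guard described above.
    """
    if not values:
        return []
    avg = sum(values) / len(values)
    multiplier = 1.0 if avg == 0 else 1.0 / avg
    return [v * multiplier for v in values]
```

For example, the values 2, 4, and 6 have a mean of 4 and become 0.5, 1.0, and 1.5.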
Filtering Mechanism (required parameter)
Assembly name filtering operates during QuastSummary.process() parsing phase using String.contains(requiredString) boolean evaluation on header[i] assembly column names. This substring matching enables selective analysis by assembly type or naming convention.
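As a rough Python illustration of the substring filter, the hypothetical helper select_columns below returns the header column indices that pass the check. Like String.contains, the match is case-sensitive. Treating an empty or missing required string as "keep everything" follows the parameter description.

```python
def select_columns(header, required=None):
    """Return indices of assembly columns whose name contains `required`.

    header[0] is the "Assembly" label, so only indices >= 1 are considered.
    Hypothetical helper, not part of the tool.
    """
    return [
        i
        for i in range(1, len(header))
        if not required or required in header[i]
    ]
```

With the header Assembly, spades_k55, megahit, spades_k77 and required=spades, columns 1 and 3 are kept.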
Output Implementation (TextStreamWriter)
Results are written as tab-delimited text through a TextStreamWriter. Each metric name is printed on its own line via tsw.println(), followed by one line per assembly whose tab-separated values are written with tsw.print(). Box plot mode outputs five percentile values per assembly, while standard mode outputs all ArrayList<Double> values using for-each iteration.
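The output layout can be sketched in Python as below. The function name write_summary is hypothetical, and the number formatting (Python's default str() for floats) is an assumption; the tool's exact formatting is not specified here.

```python
import io


def write_summary(metric_map, out):
    """Write metric -> assembly -> values as tab-delimited text.

    Each metric name goes on its own line, followed by one line per
    assembly: the assembly name, then its values separated by tabs.
    """
    for metric, by_assembly in metric_map.items():
        out.write(metric + "\n")
        for assembly, values in by_assembly.items():
            out.write(assembly)
            for value in values:
                out.write("\t" + str(value))
            out.write("\n")


# Usage: collect the text in memory instead of a file.
buffer = io.StringIO()
write_summary({"N50": {"asmA": [1.0, 2.5]}}, buffer)
```

In box plot mode the value lists would hold the five percentile points; in standard mode they would hold every collected value.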
Memory Management
Sequential file processing through Tools.getFileOrFiles() with individual QuastSummary instantiation maintains O(assemblies * metrics) memory complexity rather than O(total_data_volume). LinkedHashMap insertion-order preservation provides deterministic output without additional sorting overhead. TextStreamWriter.poisonAndWait() ensures proper resource cleanup and thread synchronization.
Input Format
SummarizeQuast expects Quast report files in TSV format with the following structure:
- Header Row: First column is "Assembly", followed by assembly names as column headers
- Metric Rows: Each row starts with a metric name, followed by values for each assembly
- Numeric Values: All metric values should be numeric; non-numeric, NaN, and infinite values are ignored
- File Extensions: Tool accepts any file path; commonly used with report.tsv files from Quast
Output Format
The tool writes tab-delimited metric summaries in the following format:
Standard Mode (box=f)
MetricName1
Assembly1 value1 value2 value3
Assembly2 value4 value5 value6
MetricName2
Assembly1 value7 value8 value9
Assembly2 value10 value11 value12
Box Plot Mode (box=t)
MetricName1
Assembly1 10th_percentile 25th_percentile median 75th_percentile 90th_percentile
Assembly2 10th_percentile 25th_percentile median 75th_percentile 90th_percentile
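For downstream plotting, the box plot output can be read back with a few lines of Python. This is a sketch under assumptions: fields are tab-delimited, a line without tabs is a metric name, and every assembly line carries at least one value. The function name read_box_summary is hypothetical.

```python
def read_box_summary(text):
    """Parse summary text into metric -> assembly -> list of floats."""
    summary = {}
    current = None
    for line in text.splitlines():
        if not line.strip():
            continue
        fields = line.split("\t")
        if len(fields) == 1:
            # A line with a single field starts a new metric block.
            current = summary.setdefault(fields[0], {})
        elif current is not None:
            # Assembly line: name followed by its values.
            current[fields[0]] = [float(x) for x in fields[1:]]
    return summary
```

Each five-element list can then be passed to a plotting library as the whisker, box, and median positions for one assembly.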
Support
For questions and support:
- Email: bbushnell@lbl.gov
- Documentation: bbmap.org