SummarizeCrossblock

Basic Usage

summarizecrossblock.sh in=<input file> out=<output file>

This tool processes one or more CrossBlock result files and generates a summary table with one row per input file. Each row reports the contig count, the base count, and the number of contigs and bases that were discarded during CrossBlock processing.

Parameters

Parameters are organized by their function in the summarization process. All parameters from the shell script are documented below.

Standard parameters

in=<file>: A text file of files, or a comma-delimited list of files. Each is a path to results.txt output from CrossBlock. A file of filenames should contain one path per line.
out=<file>: Output file for the summary. Contains tab-delimited data with the columns fname, copies, contigs, contigsDiscarded, bases, basesDiscarded. If not specified, output goes to stdout.
overwrite=f: (ow) When false (the default), the program aborts rather than overwrite an existing file. Set to true to allow overwriting.

Java Parameters

-Xmx: This will set Java's memory usage, overriding autodetection. -Xmx20g will specify 20 gigs of RAM, and -Xmx200m will specify 200 megs. The max is typically 85% of physical memory. Default is 200m for this lightweight tool.
-eoom: This flag will cause the process to exit if an out-of-memory exception occurs. Requires Java 8u92+. Useful for automated pipelines where clean failure handling is important.
-da: Disable assertions. Can provide minor performance improvement in production environments where assertion checking is not needed.

Examples

Basic Summary of Single CrossBlock Result

summarizecrossblock.sh in=crossblock_results.txt out=summary.txt

Processes a single CrossBlock result file and writes the summary to summary.txt.

Summary of Multiple Result Files

summarizecrossblock.sh in=results1.txt,results2.txt,results3.txt out=combined_summary.txt

Processes multiple CrossBlock result files specified as a comma-delimited list and creates a combined summary.

Batch Processing with File List

printf '%s\n' sample1_results.txt sample2_results.txt sample3_results.txt > filelist.txt
summarizecrossblock.sh in=filelist.txt out=batch_summary.txt

Creates a file listing multiple CrossBlock result files, then processes them all to create a batch summary.
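
When there are many result files, the list can be generated rather than typed. The sketch below is one possible approach, assuming a hypothetical layout in which each sample has its own directory containing a results.txt file. The function name make_filelist and the directory names are illustrative and not part of the tool.

```shell
# Hypothetical helper: collect every results.txt under a root directory
# into a file of filenames, one path per line.
make_filelist() {
    # $1 = root directory to search, $2 = file list to write
    find "$1" -type f -name 'results.txt' | sort > "$2"
}

# Demo on a throwaway directory tree so the sketch runs anywhere.
demo_root=$(mktemp -d)
mkdir -p "$demo_root/sampleA" "$demo_root/sampleB"
: > "$demo_root/sampleA/results.txt"
: > "$demo_root/sampleB/results.txt"

make_filelist "$demo_root" "$demo_root/filelist.txt"
cat "$demo_root/filelist.txt"
```

The resulting list could then be passed to the tool with in="$demo_root/filelist.txt", in the same way as filelist.txt in the example above.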

Output to Console

summarizecrossblock.sh in=crossblock_results.txt

Processes the result file and prints the summary to standard output for immediate viewing or piping to other tools.

Output Format

The output is a tab-delimited table with the following columns:

fname - The filename of the CrossBlock result file
copies - Sequential copy number (1, 2, 3, etc.)
contigs - Total number of contigs in the result file
contigsDiscarded - Number of contigs marked as discarded/removed
bases - Total number of bases in all contigs
basesDiscarded - Number of bases in discarded/removed contigs

If an error occurs processing a specific file, the output will show "ERROR" instead of numeric values for that file.
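
Because the summary is plain tab-delimited text, it can be post-processed with standard tools. The following sketch computes the percentage of bases discarded per file from a summary table read on standard input. It assumes the column order listed above. The function name percent_discarded, the handling of a possible header line, and the exact layout of an "ERROR" row are assumptions made for illustration.

```shell
# Hypothetical post-processing step: read a summary table on stdin and
# print "fname <TAB> percent of bases discarded" for each row.
percent_discarded() {
    awk -F '\t' '
        /^#/ { next }                                   # skip a commented header, if present
        {                                               # rows flagged with ERROR are passed through
            for (i = 2; i <= NF; i++)
                if ($i == "ERROR") { print $1 "\tERROR"; next }
        }
        $5 !~ /^[0-9]+$/ || $6 !~ /^[0-9]+$/ { next }   # skip any other non-numeric row
        $5 == 0 { printf "%s\t0.00\n", $1; next }       # avoid division by zero
        { printf "%s\t%.2f\n", $1, 100 * $6 / $5 }      # basesDiscarded / bases
    '
}

# Demo with made-up rows: one normal row and one error row.
printf 'a.txt\t1\t10\t2\t1000\t250\nb.txt\t2\tERROR\tERROR\tERROR\tERROR\n' | percent_discarded
```

In practice the input would come from the tool itself, for example by piping its standard output into percent_discarded.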

Algorithm Details

SummarizeCrossblock implements a straightforward aggregation algorithm for CrossBlock validation and testing:

Processing Strategy

File Input Handling - Supports both direct comma-separated file lists and file-of-filenames input modes
Result Parsing - Uses ParseCrossblockResults to process each individual CrossBlock result file
Data Extraction - Reads tab-delimited lines from CrossBlock output, parsing contig length and removal status
Statistics Calculation - Accumulates counts and base totals for both kept and discarded contigs
Error Handling - Gracefully handles corrupted or unreadable result files by marking them as "ERROR"

Input File Format Requirements

CrossBlock result files must contain tab-delimited data where:

Column 2 (index 1): Removal flag (0=kept, 1=discarded)
Column 3 (index 2): Contig length in bases
Lines starting with '#' are ignored as comments
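
The parsing and accumulation described above can be approximated in a few lines of awk. This is a minimal sketch under the column layout stated in this section, not the tool's Java implementation. The function name summarize_results is hypothetical, the contents of column 1 (shown as a contig name) are an assumption, and the copies column is omitted.

```shell
# Hypothetical re-implementation of the per-file summary.
# Prints: fname, contigs, contigsDiscarded, bases, basesDiscarded (tab-delimited).
summarize_results() {
    # $1 = path to one CrossBlock results file
    if [ ! -r "$1" ]; then
        # Unreadable file: emit an ERROR row so the caller can continue with the next file
        printf '%s\tERROR\tERROR\tERROR\tERROR\n' "$1"
        return 0
    fi
    awk -F '\t' -v name="$1" '
        /^#/   { next }                 # comment lines are ignored
        NF < 3 { next }                 # skip blank or truncated lines
        {
            contigs++                   # every data line is one contig
            bases += $3                 # column 3: contig length
            if ($2 == 1) {              # column 2: removal flag (1 = discarded)
                discarded++
                bases_discarded += $3
            }
        }
        END { printf "%s\t%d\t%d\t%d\t%d\n", name, contigs, discarded, bases, bases_discarded }
    ' "$1"
}

# Demo with made-up data: three contigs, one of them discarded.
demo=$(mktemp)
printf '#name\tremoved\tlength\ncontig1\t0\t1500\ncontig2\t1\t400\ncontig3\t0\t2100\n' > "$demo"
summarize_results "$demo"
```

Calling the function once per path in a loop gives the sequential, one-row-per-file behavior described under Processing Strategy.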

Memory Usage

This tool has minimal memory requirements since it processes result files sequentially without storing large data structures. The default 200MB memory allocation is sufficient for most use cases, even when processing hundreds of result files.

Performance Characteristics

Processing Speed - Primarily I/O bound; throughput is limited mainly by how quickly the result files can be read
Memory Efficiency - Memory usage stays roughly constant regardless of input file size
Scalability - Files are processed one at a time, so the number of result files is not limited by memory
Error Tolerance - Continues processing remaining files even if some fail

Related Tools

This tool is part of the CrossBlock validation and testing workflow:

crossblock.sh - Generates the result files that this tool summarizes
decontaminate.sh - The main decontamination tool that CrossBlock helps optimize

Support

For questions and support:

Email: bbushnell@lbl.gov
Documentation: bbmap.org