SummarizeCrossblock
Summarizes CrossBlock results by aggregating statistics from one or more result files into a single summary table. Used for testing and validating CrossBlock.
Basic Usage
summarizecrossblock.sh in=<input file> out=<output file>
This tool processes one or more CrossBlock result files and generates a summary table with one row per input file. Each row reports the contig and base counts, along with the number of contigs and bases that were discarded during CrossBlock processing.
Parameters
Parameters are grouped by function. All parameters from the shell script are documented below.
Standard parameters
- in=<file>
- A text file listing result files (one path per line), or a comma-delimited list of result files. Each path points to a results.txt output from CrossBlock.
- out=<file>
- Output file for the summary. Contains tab-delimited data with the columns fname, copies, contigs, contigsDiscarded, bases, basesDiscarded (see Output Format). If not specified, output goes to stdout.
- overwrite=f
- (ow) Controls whether an existing output file may be overwritten. The default, false, makes the program abort rather than overwrite an existing file. Set to true to allow overwriting.
Java Parameters
- -Xmx
- This will set Java's memory usage, overriding autodetection. -Xmx20g will specify 20 gigs of RAM, and -Xmx200m will specify 200 megs. The max is typically 85% of physical memory. Default is 200m for this lightweight tool.
- -eoom
- This flag will cause the process to exit if an out-of-memory exception occurs. Requires Java 8u92+. Useful for automated pipelines where clean failure handling is important.
- -da
- Disable assertions. Can provide minor performance improvement in production environments where assertion checking is not needed.
Examples
Basic Summary of Single CrossBlock Result
summarizecrossblock.sh in=crossblock_results.txt out=summary.txt
Processes a single CrossBlock result file and writes the summary to summary.txt.
Summary of Multiple Result Files
summarizecrossblock.sh in=results1.txt,results2.txt,results3.txt out=combined_summary.txt
Processes multiple CrossBlock result files specified as a comma-delimited list and creates a combined summary.
Batch Processing with File List
echo -e "sample1_results.txt\nsample2_results.txt\nsample3_results.txt" > filelist.txt
summarizecrossblock.sh in=filelist.txt out=batch_summary.txt
Creates a file listing multiple CrossBlock result files, then processes them all to create a batch summary.
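For scripted runs, the file list and the command line can also be assembled programmatically. The following Python sketch is illustrative only: the helper names file_list_text and build_command are hypothetical, and it uses only the in= and out= parameters documented above.

```python
from typing import Iterable, List, Optional


def file_list_text(result_paths: Iterable[str]) -> str:
    """Return file-of-filenames content: one result path per line."""
    return "".join(f"{path}\n" for path in result_paths)


def build_command(in_value: str, out_path: Optional[str] = None) -> List[str]:
    """Assemble the argument list for summarizecrossblock.sh.

    in_value is either a path to a file of filenames or a
    comma-delimited list of result files. When out_path is None the
    out= parameter is omitted, so the summary goes to stdout.
    """
    cmd = ["summarizecrossblock.sh", f"in={in_value}"]
    if out_path is not None:
        cmd.append(f"out={out_path}")
    return cmd


# Example use (requires summarizecrossblock.sh on PATH, so it is left
# commented out here):
#   import subprocess
#   from pathlib import Path
#   Path("filelist.txt").write_text(
#       file_list_text(["sample1_results.txt", "sample2_results.txt"]))
#   subprocess.run(build_command("filelist.txt", "batch_summary.txt"),
#                  check=True)
```

Passing the command as a list to subprocess.run avoids shell quoting problems when result paths contain spaces.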
Output to Console
summarizecrossblock.sh in=crossblock_results.txt
Processes the result file and prints the summary to standard output for immediate viewing or piping to other tools.
Output Format
The output is a tab-delimited table with the following columns:
- fname - The filename of the CrossBlock result file
- copies - Sequential copy number (1, 2, 3, etc.)
- contigs - Total number of contigs in the result file
- contigsDiscarded - Number of contigs marked as discarded/removed
- bases - Total number of bases in all contigs
- basesDiscarded - Number of bases in discarded/removed contigs
If an error occurs while processing a file, the output shows "ERROR" in place of the numeric values for that file.
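A downstream script might read the summary table back in as follows. This is a minimal Python sketch, not part of the tool. It assumes that any header line either starts with '#' or has "fname" as its first field, and it maps "ERROR" (or any missing or non-numeric field) to None. The names SummaryRow and parse_summary are hypothetical.

```python
from typing import Iterable, List, NamedTuple, Optional


class SummaryRow(NamedTuple):
    fname: str
    copies: Optional[int]
    contigs: Optional[int]
    contigsDiscarded: Optional[int]
    bases: Optional[int]
    basesDiscarded: Optional[int]


def _to_int(field: str) -> Optional[int]:
    """Convert a numeric column; "ERROR" or other text becomes None."""
    try:
        return int(field)
    except ValueError:
        return None


def parse_summary(lines: Iterable[str]) -> List[SummaryRow]:
    """Parse tab-delimited summary lines into SummaryRow records."""
    rows = []
    for line in lines:
        line = line.rstrip("\r\n")
        # Assumption: header or comment lines start with '#'.
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        # Assumption: an unprefixed header repeats the column name.
        if fields[0] == "fname":
            continue
        values = [_to_int(f) for f in fields[1:6]]
        values += [None] * (5 - len(values))  # pad short ERROR rows
        rows.append(SummaryRow(fields[0], *values))
    return rows
```

Rows whose numeric fields are None can then be filtered out or reported separately before computing totals.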
Algorithm Details
SummarizeCrossblock implements a straightforward aggregation algorithm for CrossBlock validation and testing:
Processing Strategy
- File Input Handling - Supports both direct comma-separated file lists and file-of-filenames input modes
- Result Parsing - Uses ParseCrossblockResults to process each individual CrossBlock result file
- Data Extraction - Reads tab-delimited lines from CrossBlock output, parsing contig length and removal status
- Statistics Calculation - Accumulates counts and base totals for both kept and discarded contigs
- Error Handling - Gracefully handles corrupted or unreadable result files by marking them as "ERROR"
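The overall flow described above can be sketched in Python as shown below. This is an illustration under stated assumptions, not the tool's Java implementation: the per-file parser is passed in as a function, the copies column is taken to be the 1-based position of the file in the input list, and both the header line and the exact shape of an error row are guesses.

```python
from typing import Callable, Iterable, List, Tuple

# (contigs, contigsDiscarded, bases, basesDiscarded) for one result file
Stats = Tuple[int, int, int, int]

# Assumption: the real header may be formatted differently.
HEADER = "#fname\tcopies\tcontigs\tcontigsDiscarded\tbases\tbasesDiscarded"


def summarize(paths: Iterable[str],
              parse_file: Callable[[str], Stats]) -> List[str]:
    """Build summary lines, one per result file.

    A file that cannot be read or parsed yields an ERROR row, and
    processing continues with the remaining files.
    """
    lines = [HEADER]
    for copies, path in enumerate(paths, start=1):
        try:
            contigs, discarded, bases, bases_discarded = parse_file(path)
        except (OSError, ValueError, IndexError):
            # Assumption: an error row is the filename followed by ERROR.
            lines.append(f"{path}\tERROR")
            continue
        lines.append(
            f"{path}\t{copies}\t{contigs}\t{discarded}"
            f"\t{bases}\t{bases_discarded}"
        )
    return lines
```

Catching the exception inside the loop is what lets one unreadable file be reported without stopping the rest of the batch.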
Input File Format Requirements
CrossBlock result files must contain tab-delimited data where:
- Column 2 (index 1): Removal flag (0=kept, 1=discarded)
- Column 3 (index 2): Contig length in bases
- Lines starting with '#' are ignored as comments
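Given that layout, the per-file tally can be sketched in Python as follows. This is an illustrative reading of the format described above, not the code of ParseCrossblockResults. The function name tally_result_lines is hypothetical, and the content of the first column (presumably a contig name) is an assumption that the sketch does not depend on.

```python
from typing import Iterable, Tuple


def tally_result_lines(lines: Iterable[str]) -> Tuple[int, int, int, int]:
    """Return (contigs, contigsDiscarded, bases, basesDiscarded).

    Expects tab-delimited lines with the removal flag in column 2
    (index 1) and the contig length in column 3 (index 2).
    """
    contigs = discarded = bases = bases_discarded = 0
    for line in lines:
        line = line.rstrip("\r\n")
        if not line or line.startswith("#"):
            continue  # blank lines and '#' comments carry no data
        fields = line.split("\t")
        removed = int(fields[1]) == 1   # 0 = kept, 1 = discarded
        length = int(fields[2])         # contig length in bases
        contigs += 1
        bases += length
        if removed:
            discarded += 1
            bases_discarded += length
    return contigs, discarded, bases, bases_discarded
```

Because the function takes any iterable of lines, it can be fed an open file handle directly, which keeps memory use independent of file size.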
Memory Usage
This tool has minimal memory requirements because it processes result files sequentially without storing large data structures. The default allocation of 200 MB (-Xmx200m) is sufficient for most use cases, even when processing hundreds of result files.
Performance Characteristics
- Processing Speed - Primarily I/O bound; throughput is limited mainly by disk read speed
- Memory Efficiency - Constant memory usage regardless of input file size
- Scalability - Files are processed one at a time, so memory does not limit the number of result files
- Error Tolerance - Continues processing remaining files even if some fail
Related Tools
This tool is part of the CrossBlock validation and testing workflow:
- crossblock.sh - Generates the result files that this tool summarizes
- decontaminate.sh - The main decontamination tool that CrossBlock helps optimize
Support
For questions and support:
- Email: bbushnell@lbl.gov
- Documentation: bbmap.org