SummarizeContam
Summarizes monthly contam files into a single file. This is for internal JGI use.
Basic Usage
summarizecontam.sh <input files> out=<output file>
This tool aggregates multiple contamination summary files from JGI into a single consolidated report. Input files can be specified using wildcards or comma-delimited lists, and the tool automatically processes taxonomic information when available.
Parameters
Parameters are organized by their function in the contamination summarization process. The tool processes multiple input files and consolidates contamination data with optional taxonomic classification.
Standard Parameters
- in=<file,file>
- Input contam summary files, comma-delimited. Alternately, file arguments (from a * expansion) will be considered input files. Multiple files are merged into a single output.
- out=<file>
- Output file for the consolidated contamination summary. Contains tab-delimited data with taxonomy information.
- tree=auto
- Taxtree file location (optional). Used for taxonomic classification and to add size/sequence information to the output. When set to "auto", uses the default taxonomy tree location.
- overwrite=t
- (ow) Set to false to force the program to abort rather than overwrite an existing file. Default is true.
Filter Parameters (a record must pass all filters to be retained)
- minreads=0
- Ignore records with fewer reads than this threshold. Only records meeting both minreads and minsequnits criteria will be included in the output.
- minsequnits=0
- Ignore records with fewer seq units than this threshold. Combined with minreads filter to remove low-abundance contamination records.
Java Parameters
- -Xmx
- This will set Java's memory usage, overriding autodetection. -Xmx20g will specify 20 gigs of RAM, and -Xmx200m will specify 200 megs. The max is typically 85% of physical memory.
- -eoom
- This flag will cause the process to exit if an out-of-memory exception occurs. Requires Java 8u92+.
- -da
- Disable assertions. Can improve performance in production environments.
Examples
Basic File Consolidation
summarizecontam.sh contam_jan.txt contam_feb.txt contam_mar.txt out=quarterly_summary.txt
Merges three monthly contamination files into a single quarterly summary report.
Wildcard Input with Filtering
summarizecontam.sh contam_*.txt out=annual_summary.txt minreads=100 minsequnits=50
Processes all contamination files matching the pattern, filtering out records with fewer than 100 reads or 50 sequence units.
Comma-delimited Input List
summarizecontam.sh in=file1.txt,file2.txt,file3.txt out=combined.txt tree=/path/to/taxtree.txt
Uses explicit input file specification with custom taxonomy tree for enhanced taxonomic classification.
Algorithm Details
SummarizeContamReport aggregates contamination records in a HashMap and, when a taxonomy tree is available, annotates each record through TaxTree lookups. Input summary files are processed sequentially and merged into a single consolidated data set.
Data Processing Implementation
The core aggregation uses a HashMap<String, StringLongLong> collection where contamination records are keyed by taxonomic name. The processOneFile() method reads contamination files line by line, parsing pipe-delimited entries using String.split("\\|"). Records with matching taxonomic names are accumulated by summing their sequence unit (sll.a) and read count (sll.b) values through the StringLongLong data structure.
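As a rough illustration of this accumulation step, the sketch below keys records by name and sums two counters per name. The Record class, the assumed column order (name | seq units | reads), and the parsing details are illustrative stand-ins, not the actual StringLongLong class or contam-file layout.

import java.io.BufferedReader;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.HashMap;

// Minimal sketch of the accumulation step: one record per taxonomic name,
// with two counters summed across all input files. Record and the assumed
// column order are stand-ins, not the actual StringLongLong class.
public class ContamAccumulatorSketch {

    static class Record {
        long seqUnits; // analogous to sll.a in the description above
        long reads;    // analogous to sll.b
    }

    static void processOneFile(String path, HashMap<String, Record> map) throws IOException {
        try (BufferedReader br = Files.newBufferedReader(Paths.get(path))) {
            for (String line; (line = br.readLine()) != null; ) {
                if (line.isEmpty() || line.startsWith("#")) continue;
                String[] parts = line.split("\\|");              // pipe-delimited entry
                String name = parts[0].trim();                   // assumed: name | seq units | reads
                long seqUnits = Long.parseLong(parts[1].trim());
                long reads = Long.parseLong(parts[2].trim());
                Record r = map.computeIfAbsent(name, k -> new Record());
                r.seqUnits += seqUnits;                          // records with the same name merge
                r.reads += reads;
            }
        }
    }
}

Because the same map is reused for every input file, records for the same name from different monthly files collapse into a single consolidated entry.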
Taxonomic Integration via TaxTree
When a taxonomy tree is loaded through TaxTree.loadTaxTree(), the tool enhances each contamination record using several TaxTree methods, as shown in the sketch after this list:
- TaxID Resolution - tree.parseNameToTaxid(sll.s) converts taxonomic names to NCBI identifiers
- Clade Classification - tree.getNodeAtLevelExtended(tid, TaxTree.SUPERKINGDOM_E) retrieves the superkingdom-level classification
- Size Metrics - tree.toSize(tn) and tree.toSizeC(tn) provide direct and cumulative genome sizes
- Sequence Counts - tree.toSeqs(tn) and tree.toSeqsC(tn) extract direct and cumulative sequence tallies
- Node Counts - tree.toNodes(tn) provides cumulative taxonomic node counts for hierarchical analysis
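To show how these calls fit together for one record, here is a minimal sketch. The TaxTreeLike and TaxNodeLike types are hypothetical stand-ins that only mirror the method names cited above; their signatures, the getNode() lookup, and the SUPERKINGDOM_E placeholder value are assumptions, not the real BBTools TaxTree API.

// Hypothetical stand-ins that only mirror the method names cited above;
// signatures, return types, and the getNode() lookup are assumptions,
// not the real BBTools TaxTree API.
class TaxNodeLike {
    String name;
}

interface TaxTreeLike {
    int parseNameToTaxid(String name);                           // name -> NCBI TaxID
    TaxNodeLike getNode(int taxID);                              // assumed node lookup
    TaxNodeLike getNodeAtLevelExtended(int taxID, int levelExtended);
    long toSize(TaxNodeLike tn);
    long toSizeC(TaxNodeLike tn);
    long toSeqs(TaxNodeLike tn);
    long toSeqsC(TaxNodeLike tn);
    long toNodes(TaxNodeLike tn);
}

class TaxAnnotationSketch {
    static final int SUPERKINGDOM_E = 0;                         // placeholder for TaxTree.SUPERKINGDOM_E

    // Per-record enrichment: resolve the TaxID, fetch the superkingdom-level
    // ancestor, then collect size/sequence/node tallies for the output columns.
    // Error handling for unresolved names is omitted.
    static String annotate(TaxTreeLike tree, String name) {
        int tid = tree.parseNameToTaxid(name);
        TaxNodeLike tn = tree.getNode(tid);
        TaxNodeLike ancestor = tree.getNodeAtLevelExtended(tid, SUPERKINGDOM_E);
        String clade = (ancestor == null ? "null" : ancestor.name);
        return tid + "\t" + clade
                + "\t" + tree.toSize(tn) + "\t" + tree.toSizeC(tn)
                + "\t" + tree.toSeqs(tn) + "\t" + tree.toSeqsC(tn)
                + "\t" + tree.toNodes(tn);
    }
}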
Filtering and Sorting Implementation
The tool applies a dual filter using the boolean expression (sll.a>=minSeqUnits && sll.b>=minReads), so both the minimum sequence unit and minimum read count thresholds must be satisfied. Results are sorted with ComparatorA, which orders records in descending order by sequence units (x.a<y.a ? 1 : -1), then by read counts (x.b<y.b ? 1 : -1), then lexicographically by taxonomic name (x.s.compareTo(y.s)).
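The filter-then-sort logic can be sketched with standard Java. ContamRecord below is an illustrative stand-in whose fields approximate sll.s, sll.a, and sll.b, and FilterAndSortSketch is not the tool's ComparatorA.

import java.util.Comparator;
import java.util.List;

// Illustrative stand-in record; fields approximate sll.s, sll.a, and sll.b.
class ContamRecord {
    String name;
    long seqUnits;
    long reads;
}

class FilterAndSortSketch {

    // Dual filter: a record is kept only if it meets both thresholds.
    static boolean passes(ContamRecord r, long minSeqUnits, long minReads) {
        return r.seqUnits >= minSeqUnits && r.reads >= minReads;
    }

    // Ordering as described above: descending by sequence units, then
    // descending by reads, then lexicographically by name.
    static final Comparator<ContamRecord> ORDER = (x, y) -> {
        if (x.seqUnits != y.seqUnits) return (x.seqUnits < y.seqUnits) ? 1 : -1;
        if (x.reads != y.reads) return (x.reads < y.reads) ? 1 : -1;
        return x.name.compareTo(y.name);
    };

    static void filterAndSort(List<ContamRecord> records, long minSeqUnits, long minReads) {
        records.removeIf(r -> !passes(r, minSeqUnits, minReads));
        records.sort(ORDER);
    }
}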
Output Format and TextStreamWriter
The output file is generated using TextStreamWriter in a tab-delimited format. The header line is hardcoded as:
#Name SeqUnits Reads TaxID Clade size cSize seqs cSeqs cNodes
Each data line is constructed through string concatenation: sll.s+"\t"+sll.a+"\t"+sll.b+"\t"+tid+"\t"+(ancestor==null ? "null" : ancestor.name)+"\t"+size+"\t"+cumulative_size+"\t"+seqs+"\t"+cumulative_seqs+"\t"+cumulative_nodes, providing complete contamination metrics with taxonomic context for quantitative assessment.
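For completeness, a minimal sketch of the output step using a standard BufferedWriter in place of the tool's TextStreamWriter; each row is assumed to already hold its column values in the same order as the header above.

import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;

// Sketch of the output step using a standard BufferedWriter in place of the
// tool's TextStreamWriter; each row is assumed to already hold its column
// values in the same order as the header above.
class OutputSketch {
    static void write(String path, List<String[]> rows) throws IOException {
        try (BufferedWriter bw = Files.newBufferedWriter(Paths.get(path))) {
            bw.write("#Name\tSeqUnits\tReads\tTaxID\tClade\tsize\tcSize\tseqs\tcSeqs\tcNodes\n");
            for (String[] row : rows) {
                bw.write(String.join("\t", row));                // one tab-delimited record per line
                bw.write('\n');
            }
        }
    }
}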
Support
For questions and support:
- Email: bbushnell@lbl.gov
- Documentation: bbmap.org