SummarizeContam
Summarizes monthly contam files into a single file. This is for internal JGI use.
Basic Usage
summarizecontam.sh <input files> out=<output file>
This tool aggregates multiple contamination summary files from JGI into a single consolidated report. Input files can be specified using wildcards or comma-delimited lists, and the tool automatically processes taxonomic information when available.
Parameters
Parameters are organized by their function in the contamination summarization process. The tool processes multiple input files and consolidates contamination data with optional taxonomic classification.
Standard Parameters
- in=<file,file>
- Input contam summary files, comma-delimited. Alternately, file arguments (from a * expansion) will be considered input files. Multiple files are merged into a single output.
- out=<file>
- Output file for the consolidated contamination summary. Contains tab-delimited data with taxonomy information.
- tree=auto
- Taxtree file location (optional). Used for taxonomic classification and to add size/sequence information to the output. When set to "auto", uses the default taxonomy tree location.
- overwrite=t
- (ow) Set to false to force the program to abort rather than overwrite an existing file. Default is true.
Filter Parameters (a record must pass all filters to be retained)
- minreads=0
- Ignore records with fewer reads than this threshold. Only records meeting both minreads and minsequnits criteria will be included in the output.
- minsequnits=0
- Ignore records with fewer seq units than this threshold. Combined with minreads filter to remove low-abundance contamination records.
Java Parameters
- -Xmx
- This will set Java's memory usage, overriding autodetection. -Xmx20g will specify 20 gigs of RAM, and -Xmx200m will specify 200 megs. The max is typically 85% of physical memory.
- -eoom
- This flag will cause the process to exit if an out-of-memory exception occurs. Requires Java 8u92+.
- -da
- Disable assertions. Can improve performance in production environments.
Examples
Basic File Consolidation
summarizecontam.sh contam_jan.txt contam_feb.txt contam_mar.txt out=quarterly_summary.txt
Merges three monthly contamination files into a single quarterly summary report.
Wildcard Input with Filtering
summarizecontam.sh contam_*.txt out=annual_summary.txt minreads=100 minsequnits=50
Processes all contamination files matching the pattern, filtering out records with fewer than 100 reads or 50 sequence units.
Comma-delimited Input List
summarizecontam.sh in=file1.txt,file2.txt,file3.txt out=combined.txt tree=/path/to/taxtree.txt
Uses explicit input file specification with custom taxonomy tree for enhanced taxonomic classification.
Algorithm Details
SummarizeContamReport aggregates contamination records in a HashMap and, when a taxonomy tree is available, annotates each record through TaxTree lookups. Input summary files are processed sequentially and merged into a single consolidated data set.
Data Processing Implementation
The core aggregation uses a HashMap<String, StringLongLong> collection where contamination records are keyed by taxonomic name. The processOneFile() method reads contamination files line by line, parsing pipe-delimited entries using String.split("\\|"). Records with matching taxonomic names are accumulated by summing their sequence unit (sll.a) and read count (sll.b) values through the StringLongLong data structure.
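As a rough illustration of this accumulation step, the sketch below keys records by name and sums two counters per name. The Record class, the assumed column order (name | seq units | reads), and the parsing details are illustrative stand-ins, not the actual StringLongLong class or contam-file layout.

import java.io.BufferedReader;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.HashMap;

// Minimal sketch of the accumulation step: one record per taxonomic name,
// with two counters summed across all input files. Record and the assumed
// column order are stand-ins, not the actual StringLongLong class.
public class ContamAccumulatorSketch {

    static class Record {
        long seqUnits; // analogous to sll.a in the description above
        long reads;    // analogous to sll.b
    }

    static void processOneFile(String path, HashMap<String, Record> map) throws IOException {
        try (BufferedReader br = Files.newBufferedReader(Paths.get(path))) {
            for (String line; (line = br.readLine()) != null; ) {
                if (line.isEmpty() || line.startsWith("#")) continue;
                String[] parts = line.split("\\|");              // pipe-delimited entry
                String name = parts[0].trim();                   // assumed: name | seq units | reads
                long seqUnits = Long.parseLong(parts[1].trim());
                long reads = Long.parseLong(parts[2].trim());
                Record r = map.computeIfAbsent(name, k -> new Record());
                r.seqUnits += seqUnits;                          // records with the same name merge
                r.reads += reads;
            }
        }
    }
}

Because the same map is reused for every input file, records for the same name from different monthly files collapse into a single consolidated entry.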
Taxonomic Integration via TaxTree
When a taxonomy tree is loaded through TaxTree.loadTaxTree(), the tool enhances each contamination record using several TaxTree methods, as shown in the sketch after this list:
- TaxID Resolution - tree.parseNameToTaxid(sll.s) converts taxonomic names to NCBI identifiers
- Clade Classification - tree.getNodeAtLevelExtended(tid, TaxTree.SUPERKINGDOM_E) retrieves the superkingdom-level classification
- Size Metrics - tree.toSize(tn) and tree.toSizeC(tn) provide direct and cumulative genome sizes
- Sequence Counts - tree.toSeqs(tn) and tree.toSeqsC(tn) extract direct and cumulative sequence tallies
- Node Counts - tree.toNodes(tn) provides cumulative taxonomic node counts for hierarchical analysis
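To show how these calls fit together for one record, here is a minimal sketch. The TaxTreeLike and TaxNodeLike types are hypothetical stand-ins that only mirror the method names cited above; their signatures, the getNode() lookup, and the SUPERKINGDOM_E placeholder value are assumptions, not the real BBTools TaxTree API.

// Hypothetical stand-ins that only mirror the method names cited above;
// signatures, return types, and the getNode() lookup are assumptions,
// not the real BBTools TaxTree API.
class TaxNodeLike {
    String name;
}

interface TaxTreeLike {
    int parseNameToTaxid(String name);                           // name -> NCBI TaxID
    TaxNodeLike getNode(int taxID);                              // assumed node lookup
    TaxNodeLike getNodeAtLevelExtended(int taxID, int levelExtended);
    long toSize(TaxNodeLike tn);
    long toSizeC(TaxNodeLike tn);
    long toSeqs(TaxNodeLike tn);
    long toSeqsC(TaxNodeLike tn);
    long toNodes(TaxNodeLike tn);
}

class TaxAnnotationSketch {
    static final int SUPERKINGDOM_E = 0;                         // placeholder for TaxTree.SUPERKINGDOM_E

    // Per-record enrichment: resolve the TaxID, fetch the superkingdom-level
    // ancestor, then collect size/sequence/node tallies for the output columns.
    // Error handling for unresolved names is omitted.
    static String annotate(TaxTreeLike tree, String name) {
        int tid = tree.parseNameToTaxid(name);
        TaxNodeLike tn = tree.getNode(tid);
        TaxNodeLike ancestor = tree.getNodeAtLevelExtended(tid, SUPERKINGDOM_E);
        String clade = (ancestor == null ? "null" : ancestor.name);
        return tid + "\t" + clade
                + "\t" + tree.toSize(tn) + "\t" + tree.toSizeC(tn)
                + "\t" + tree.toSeqs(tn) + "\t" + tree.toSeqsC(tn)
                + "\t" + tree.toNodes(tn);
    }
}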
Filtering and Sorting Implementation
The tool applies a dual filter using the boolean expression (sll.a>=minSeqUnits && sll.b>=minReads), so both the minimum sequence unit and minimum read count thresholds must be satisfied. Results are sorted with ComparatorA, which orders records in descending order by sequence units (x.a<y.a ? 1 : -1), then by read counts (x.b<y.b ? 1 : -1), then lexicographically by taxonomic name (x.s.compareTo(y.s)).
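The filter-then-sort logic can be sketched with standard Java. ContamRecord below is an illustrative stand-in whose fields approximate sll.s, sll.a, and sll.b, and FilterAndSortSketch is not the tool's ComparatorA.

import java.util.Comparator;
import java.util.List;

// Illustrative stand-in record; fields approximate sll.s, sll.a, and sll.b.
class ContamRecord {
    String name;
    long seqUnits;
    long reads;
}

class FilterAndSortSketch {

    // Dual filter: a record is kept only if it meets both thresholds.
    static boolean passes(ContamRecord r, long minSeqUnits, long minReads) {
        return r.seqUnits >= minSeqUnits && r.reads >= minReads;
    }

    // Ordering as described above: descending by sequence units, then
    // descending by reads, then lexicographically by name.
    static final Comparator<ContamRecord> ORDER = (x, y) -> {
        if (x.seqUnits != y.seqUnits) return (x.seqUnits < y.seqUnits) ? 1 : -1;
        if (x.reads != y.reads) return (x.reads < y.reads) ? 1 : -1;
        return x.name.compareTo(y.name);
    };

    static void filterAndSort(List<ContamRecord> records, long minSeqUnits, long minReads) {
        records.removeIf(r -> !passes(r, minSeqUnits, minReads));
        records.sort(ORDER);
    }
}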
Output Format and TextStreamWriter
The output file is generated using TextStreamWriter in a tab-delimited format. The header line is hardcoded as:
#Name SeqUnits Reads TaxID Clade size cSize seqs cSeqs cNodes
Each data line is constructed through string concatenation: sll.s+"\t"+sll.a+"\t"+sll.b+"\t"+tid+"\t"+(ancestor==null ? "null" : ancestor.name)+"\t"+size+"\t"+cumulative_size+"\t"+seqs+"\t"+cumulative_seqs+"\t"+cumulative_nodes, providing complete contamination metrics with taxonomic context for quantitative assessment.
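For completeness, a minimal sketch of the output step using a standard BufferedWriter in place of the tool's TextStreamWriter; each row is assumed to already hold its column values in the same order as the header above.

import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;

// Sketch of the output step using a standard BufferedWriter in place of the
// tool's TextStreamWriter; each row is assumed to already hold its column
// values in the same order as the header above.
class OutputSketch {
    static void write(String path, List<String[]> rows) throws IOException {
        try (BufferedWriter bw = Files.newBufferedWriter(Paths.get(path))) {
            bw.write("#Name\tSeqUnits\tReads\tTaxID\tClade\tsize\tcSize\tseqs\tcSeqs\tcNodes\n");
            for (String[] row : rows) {
                bw.write(String.join("\t", row));                // one tab-delimited record per line
                bw.write('\n');
            }
        }
    }
}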
Support
For questions and support:
- Email: bbushnell@lbl.gov
- Documentation: bbmap.org