KapaStats

Basic Usage

kapastats.sh in=<input file> out=<output file>

The input is a TSV file of plate IDs, one per line. For each plate, the tool queries web APIs for Kapa contamination data and generates statistical summaries.

Parameters

Parameters control input/output files, processing options, and output format for Kapa contamination analysis.

Input/Output Parameters

in=<file>: TSV file of plate IDs, one ID per line. Each plate ID will be used to query contamination data from web APIs. Lines starting with '#' are treated as comments and ignored.
out=<file>: Primary output file for contamination statistics. Can be set to stdout to print results to console. Output format depends on the 'raw' parameter setting.
overwrite=f: (ow) Set to false to force the program to abort rather than overwrite an existing file. Default is false for safety.
raw=f: Output raw observations rather than statistical summaries. When true, outputs detailed per-well contamination data including plate names, well positions, tag sequences, and calculated contamination rates. When false (default), outputs statistical summaries with quartiles, averages, and standard deviations.
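
The input rules described for in= can be illustrated with a minimal Java sketch. This is not the tool's actual parser: the class and method names (PlateIdReader, parsePlateIds) and the sample plate IDs are hypothetical. It also assumes that blank lines are skipped and that surrounding whitespace is trimmed.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of KapaStats: illustrates the documented input rules.
class PlateIdReader {

    // Returns the plate IDs found in the given lines.
    // Lines starting with '#' are comments, as documented for in=.
    // Assumption: blank lines are skipped and surrounding whitespace is trimmed.
    static List<String> parsePlateIds(List<String> lines) {
        List<String> ids = new ArrayList<>();
        for (String line : lines) {
            if (line.startsWith("#")) {
                continue; // comment line
            }
            String id = line.trim();
            if (!id.isEmpty()) {
                ids.add(id);
            }
        }
        return ids;
    }

    public static void main(String[] args) {
        // Made-up plate IDs, for illustration only.
        List<String> lines = Arrays.asList("#plate IDs", "PLATE-0001", "", "PLATE-0002");
        System.out.println(parsePlateIds(lines));
    }
}
```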

Java Parameters

-Xmx: This will set Java's memory usage, overriding autodetection. -Xmx20g will specify 20 gigs of RAM, and -Xmx200m will specify 200 megs. The max is typically 85% of physical memory. Default is 200m for this lightweight tool.
-eoom: This flag will cause the process to exit if an out-of-memory exception occurs. Requires Java 8u92+. Useful for automated pipelines where graceful failure handling is important.
-da: Disable assertions. Can improve performance slightly in production environments where debugging information is not needed.

Examples

Basic Contamination Analysis

kapastats.sh in=plates.tsv out=contamination_stats.txt

Analyzes Kapa contamination for plates listed in plates.tsv and outputs statistical summaries to contamination_stats.txt. The output will include quartiles, averages, and standard deviations for contamination rates.

Raw Data Output

kapastats.sh in=plates.tsv out=raw_contamination.txt raw=t

Outputs detailed raw contamination data for each well, including plate names, well positions, tag sequences, read counts, and calculated parts-per-million contamination rates.

Console Output with Custom Memory

kapastats.sh in=plates.tsv out=stdout -Xmx1g

Prints contamination statistics to console with 1GB of allocated memory. Useful for interactive analysis or integration into larger pipelines.

Algorithm Details

Contamination Detection Strategy

KapaStats detects cross-contamination between sequencing libraries. It retrieves per-plate data with ServerTools.readPage() and parses it with JsonParser:

Web API Integration: Uses ServerTools.readPage() to query JGI's internal database APIs (https://rqc.jgi.doe.gov/api/plate_ui/page/[plateID]/kapa spikein) to retrieve Kapa adapter sequencing data for each plate, including well positions, tag sequences, and read counts.
JSON Data Processing: Parses returned JSON data using JsonParser to extract kapa statistics including hit counts, offhit counts, and tag names for each well.
Cross-Contamination Analysis: For each well, identifies the "correct" Kapa tag (correctKapaTag field) and measures reads from other tags that should not be present, indicating cross-contamination between wells.
Parts-Per-Million Calculation: Calculates contamination rates as PPM using the formula: ppmk = ke.reads * (1000000.0/kapaReads) where ke.reads are contaminating reads and kapaReads are total Kapa reads.
Genomic Contamination Inference: Estimates genomic DNA contamination using: greads = contamReads * (source.reads / source.correctKapaReads) and gppm = 1000000 * greads / sink.reads.
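
The two formulas above can be worked through with a short Java sketch. The class and method names (ContaminationMath, kapaPpm, genomicReads, genomicPpm) and the read counts in main are hypothetical, chosen only to make the arithmetic easy to follow. This is not the KapaStats source, and it does not guard against zero read counts.

```java
// Hypothetical sketch of the documented formulas; not the KapaStats source.
class ContaminationMath {

    // ppmk = ke.reads * (1000000.0 / kapaReads)
    // contamReads: reads from a Kapa tag that should not be in this well.
    // kapaReads: total Kapa reads in the well.
    static double kapaPpm(long contamReads, long kapaReads) {
        return contamReads * (1000000.0 / kapaReads);
    }

    // greads = contamReads * (source.reads / source.correctKapaReads)
    // Scales contaminating Kapa reads by the source well's ratio of
    // total reads to correct Kapa reads.
    static double genomicReads(long contamReads, long sourceReads, long sourceCorrectKapaReads) {
        return contamReads * (sourceReads / (double) sourceCorrectKapaReads);
    }

    // gppm = 1000000 * greads / sink.reads
    static double genomicPpm(double greads, long sinkReads) {
        return 1000000 * greads / sinkReads;
    }

    public static void main(String[] args) {
        // Made-up numbers: 5 contaminating reads among 10,000 Kapa reads in the sink well.
        double ppmk = kapaPpm(5, 10000);                    // 500.0
        // Source well: 2,000,000 reads, 20,000 of them correct Kapa reads.
        double greads = genomicReads(5, 2000000, 20000);    // 500.0
        // Sink well: 4,000,000 reads in total.
        double gppm = genomicPpm(greads, 4000000);          // 125.0
        System.out.println(ppmk + "\t" + greads + "\t" + gppm);
    }
}
```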

Statistical Analysis Methods

The tool summarizes contamination patterns by sorting the observations with Arrays.sort() and indexing into the sorted array:

Quartile Calculations: Reads percentiles directly from the sorted array at the nearest index, without interpolation: p25 = ppmk[(int)Math.round((len-1)*0.25)], p50 = ppmk[(int)Math.round((len-1)*0.50)], p75 = ppmk[(int)Math.round((len-1)*0.75)]
Summary Statistics: Computes minimum (ppmk[0]), maximum (ppmk[len-1]), average using shared.Vector.sum(ppmk)/len, and standard deviation using Tools.standardDeviation(ppmk)
Observation Frequency: Tracks contamination event counts and calculates fraction as count/(double)len for each tag pair
Dual Output Methods: printResults() for statistical summaries, printRawResults() for detailed per-observation data
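
The percentile indexing and summary statistics listed above can be sketched in plain Java. PpmSummary and its methods are hypothetical names. The mean and standard deviation here stand in for shared.Vector.sum() and Tools.standardDeviation(), and the use of the population (divide-by-n) standard deviation is an assumption, since the exact definition used by the tool is not stated here.

```java
import java.util.Arrays;

// Hypothetical sketch of the summary statistics; not the KapaStats source.
class PpmSummary {

    // Nearest-index percentile on an already sorted array, as in
    // ppmk[(int)Math.round((len-1)*fraction)]. No interpolation is done.
    static double percentile(double[] sorted, double fraction) {
        int len = sorted.length;
        return sorted[(int) Math.round((len - 1) * fraction)];
    }

    // Arithmetic mean: sum / len.
    static double mean(double[] values) {
        double sum = 0;
        for (double v : values) {
            sum += v;
        }
        return sum / values.length;
    }

    // Assumption: population standard deviation (divides by n, not n-1).
    static double stdev(double[] values) {
        double mean = mean(values);
        double sumSq = 0;
        for (double v : values) {
            sumSq += (v - mean) * (v - mean);
        }
        return Math.sqrt(sumSq / values.length);
    }

    public static void main(String[] args) {
        double[] ppmk = {40, 10, 30, 20, 50}; // made-up PPM observations
        Arrays.sort(ppmk);
        System.out.println("min=" + ppmk[0] + " p25=" + percentile(ppmk, 0.25)
                + " p50=" + percentile(ppmk, 0.50) + " p75=" + percentile(ppmk, 0.75)
                + " max=" + ppmk[ppmk.length - 1] + " avg=" + mean(ppmk));
    }
}
```

Because the index is rounded rather than interpolated, an even-length array does not average its two middle values: for the sorted values 1, 2, 3, 4 the p50 index is Math.round(1.5) = 2, so the reported median is 3.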

Data Structures and Memory Management

The tool stores its data in standard Java collections, created with an initial capacity of 203 elements:

LinkedHashMap Storage: tagMap (LinkedHashMap<String, TagData>) and plateMap (LinkedHashMap<String, Plate>) maintain insertion order with expected O(1) lookup
Nested Class Hierarchy: The Plate class contains ArrayList<Well> wells, and the Well class contains LinkedHashMap<String, KapaEntry> kapaMap, mirroring the physical plate layout
TagData Class Management: Contains LinkedHashMap<String, ArrayList<Double>> ppmMap and plateNameMap for tracking contamination observations across multiple plates
Memory Configuration: Default 200MB allocation (-Xmx200m, -Xms200m) optimized for typical contamination analysis workloads
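
The containers described above can be sketched as follows. The collection types and the names tagMap, wells, kapaMap, ppmMap, and reads come from the description. Everything else is hypothetical, including the enclosing KapaModelSketch class, the constructors, the tagName field, the record() helper, and the choice to key ppmMap by the other tag's name. This is an illustration of the shape of the data, not the KapaStats source.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;

// Hypothetical sketch of the nested data model; not the KapaStats source.
class KapaModelSketch {

    // One Kapa tag observed in a well, with its read count.
    static class KapaEntry {
        final String tagName; // hypothetical field name
        final long reads;
        KapaEntry(String tagName, long reads) {
            this.tagName = tagName;
            this.reads = reads;
        }
    }

    // A well maps tag name to the entry observed for that tag.
    static class Well {
        final String name;
        final LinkedHashMap<String, KapaEntry> kapaMap = new LinkedHashMap<>();
        Well(String name) {
            this.name = name;
        }
    }

    // A plate is an ordered list of wells.
    static class Plate {
        final String name;
        final ArrayList<Well> wells = new ArrayList<>();
        Plate(String name) {
            this.name = name;
        }
    }

    // Per-tag contamination observations.
    // Assumption: ppmMap is keyed by the name of the other (contaminating) tag.
    static class TagData {
        final LinkedHashMap<String, ArrayList<Double>> ppmMap = new LinkedHashMap<>();
    }

    // Hypothetical helper: records one PPM observation for a (tag, other tag) pair.
    static void record(LinkedHashMap<String, TagData> tagMap, String tag, String other, double ppm) {
        TagData td = tagMap.computeIfAbsent(tag, k -> new TagData());
        td.ppmMap.computeIfAbsent(other, k -> new ArrayList<>()).add(ppm);
    }

    public static void main(String[] args) {
        // Initial capacity of 203, as described above.
        LinkedHashMap<String, TagData> tagMap = new LinkedHashMap<>(203);
        record(tagMap, "tagB", "tagA", 12.5); // made-up observations
        record(tagMap, "tagA", "tagB", 3.0);
        record(tagMap, "tagB", "tagA", 7.5);
        // Iteration follows insertion order: tagB, then tagA.
        for (String tag : tagMap.keySet()) {
            System.out.println(tag + " -> " + tagMap.get(tag).ppmMap);
        }
    }
}
```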

Output Format Implementation

The tool generates one of two output formats, selected by the raw parameter, and builds each line with ByteBuilder:

Statistical Summary (raw=f): Tab-delimited columns, assembled with ByteBuilder.append() calls: Tag, Other, Min, 25%, 50%, 75%, Max, Avg, Stdev, Observed, Total, Fraction
Raw Observations (raw=t): Detailed format including: Plate, SinkWell, SinkCorrectTag, SinkReads, SinkCorrectKapaReads, SinkTotalKapaReads, SourceWell, MeasuredTag, SourceReads, SourceCorrectKapaReads, SourceKapaReadsInSink, KPPM, GReads, GPPM
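
As a rough illustration of the summary layout, the sketch below assembles a header and one row as tab-delimited text. It uses java.lang.StringBuilder as a stand-in for ByteBuilder, and the class name SummaryRowSketch, the method signatures, the number formatting, and the computation of Fraction as Observed divided by Total are all assumptions made for this example. The tool's actual formatting may differ.

```java
import java.util.Locale;

// Hypothetical sketch of the raw=f column layout; not the KapaStats source.
class SummaryRowSketch {

    // Column names as listed for the statistical summary.
    static final String[] COLUMNS = {
        "Tag", "Other", "Min", "25%", "50%", "75%", "Max", "Avg", "Stdev",
        "Observed", "Total", "Fraction"
    };

    static String header() {
        return String.join("\t", COLUMNS);
    }

    // stats holds, in order: min, p25, p50, p75, max, avg, stdev.
    // Assumption: Fraction is observed / total, printed with four decimals.
    static String row(String tag, String other, double[] stats, int observed, int total) {
        StringBuilder sb = new StringBuilder(); // stand-in for ByteBuilder
        sb.append(tag).append('\t').append(other);
        for (double s : stats) {
            sb.append('\t').append(String.format(Locale.ROOT, "%.3f", s));
        }
        sb.append('\t').append(observed);
        sb.append('\t').append(total);
        sb.append('\t').append(String.format(Locale.ROOT, "%.4f", observed / (double) total));
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(header());
        // Made-up values for illustration.
        System.out.println(row("tagA", "tagB", new double[] {1, 2, 3, 4, 5, 3, 1.5}, 5, 20));
    }
}
```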

Support

For questions and support:

Email: bbushnell@lbl.gov
Documentation: bbmap.org