KapaStats

Script: kapastats.sh
Package: jgi
Class: GatherKapaStats.java

Gathers statistics on Kapa spike-in rates for contamination analysis in sequencing libraries. This specialized tool analyzes cross-contamination between wells in sequencing plates by examining Kapa adapter sequences.

Basic Usage

kapastats.sh in=<input file> out=<output file>

The input file should be a TSV file listing plate IDs, one per line. For each plate, the tool queries web APIs for Kapa contamination data and generates a statistical summary.

Parameters

Parameters control input/output files, processing options, and output format for Kapa contamination analysis.

Input/Output Parameters

in=<file>
TSV file of plate IDs, one ID per line. Each plate ID will be used to query contamination data from web APIs. Lines starting with '#' are treated as comments and ignored.
out=<file>
Primary output file for contamination statistics. Can be set to stdout to print results to console. Output format depends on the 'raw' parameter setting.
overwrite=f
(ow) When false, the program aborts rather than overwriting an existing file. Default is false for safety.
raw=f
Output raw observations rather than statistical summaries. When true, outputs detailed per-well contamination data including plate names, well positions, tag sequences, and calculated contamination rates. When false (default), outputs statistical summaries with quartiles, averages, and standard deviations.
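
As an illustration of the input format described for in=, the following is a minimal sketch of reading plate IDs while skipping '#' comment lines. It is not the tool's own code: the class name PlateIdReader, the sample plate IDs, the skipping of blank lines, and the choice to take the first tab-separated field are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.List;

final class PlateIdReader {

    /** Returns plate IDs from input lines, skipping blank lines and '#' comments. */
    static List<String> parse(List<String> lines) {
        List<String> ids = new ArrayList<>();
        for (String line : lines) {
            String trimmed = line.trim();
            if (trimmed.isEmpty() || trimmed.startsWith("#")) {
                continue; // comment line (per the in= description) or blank line (assumed)
            }
            // Assumption: the plate ID is the first tab-separated field on the line.
            ids.add(trimmed.split("\t")[0]);
        }
        return ids;
    }

    public static void main(String[] args) {
        // Hypothetical plate IDs, for illustration only.
        List<String> lines = List.of("#plate IDs", "PLATE-0001", "", "PLATE-0002\tnote");
        System.out.println(parse(lines)); // prints [PLATE-0001, PLATE-0002]
    }
}
```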

Java Parameters

-Xmx
Sets Java's maximum memory usage, overriding autodetection. For example, -Xmx20g specifies 20 GB of RAM and -Xmx200m specifies 200 MB. The maximum is typically 85% of physical memory. The default is 200m, which suits this lightweight tool.
-eoom
This flag will cause the process to exit if an out-of-memory exception occurs. Requires Java 8u92+. Useful for automated pipelines where graceful failure handling is important.
-da
Disable assertions. Can improve performance slightly in production environments where debugging information is not needed.

Examples

Basic Contamination Analysis

kapastats.sh in=plates.tsv out=contamination_stats.txt

Analyzes Kapa contamination for plates listed in plates.tsv and outputs statistical summaries to contamination_stats.txt. The output will include quartiles, averages, and standard deviations for contamination rates.

Raw Data Output

kapastats.sh in=plates.tsv out=raw_contamination.txt raw=t

Outputs detailed raw contamination data for each well, including plate names, well positions, tag sequences, read counts, and calculated parts-per-million contamination rates.
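
A parts-per-million rate is, in general, a count scaled to one million units. The sketch below shows that arithmetic for read counts. The class name KapaPpm and the choice of denominator (total reads in the well) are assumptions; the document does not state which counts the tool divides.

```java
final class KapaPpm {

    /** Contaminant reads per million total reads (denominator is an assumption). */
    static double ppm(long contaminantReads, long totalReads) {
        if (totalReads <= 0) {
            return 0.0; // avoid division by zero for an empty well
        }
        return contaminantReads * 1_000_000.0 / totalReads;
    }

    public static void main(String[] args) {
        // 25 contaminant reads out of 500,000 total reads
        System.out.println(ppm(25, 500_000)); // prints 50.0
    }
}
```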

Console Output with Custom Memory

kapastats.sh in=plates.tsv out=stdout -Xmx1g

Prints contamination statistics to the console with Java's memory set to 1 GB. Useful for interactive analysis or integration into larger pipelines.

Algorithm Details

Contamination Detection Strategy

KapaStats detects cross-contamination in sequencing libraries by retrieving plate data through web API calls (ServerTools.readPage()) and processing the results with JsonParser.
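
The per-plate query pattern might be sketched as follows. This does not use ServerTools or JsonParser, whose signatures are not given here. Instead, the page fetch is passed in as a plain function so the example runs without a network. The class name PlateQuery, the endpoint URL, and the canned response are hypothetical.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

final class PlateQuery {

    // Hypothetical endpoint; the real service address is not given in this document.
    static final String BASE_URL = "https://example.org/kapa?plate=";

    /** Fetches one response body per plate ID, in input order, using the supplied fetcher. */
    static Map<String, String> fetchAll(List<String> plateIds, Function<String, String> fetcher) {
        Map<String, String> responses = new LinkedHashMap<>();
        for (String id : plateIds) {
            String url = BASE_URL + URLEncoder.encode(id, StandardCharsets.UTF_8);
            responses.put(id, fetcher.apply(url));
        }
        return responses;
    }

    public static void main(String[] args) {
        // Stand-in fetcher that returns canned text instead of contacting a server.
        Function<String, String> fake = url -> "{\"requested\":\"" + url + "\"}";
        System.out.println(fetchAll(List.of("PLATE-0001"), fake));
    }
}
```

Passing the fetcher as a parameter keeps the loop testable offline; a real run would substitute a function that performs the HTTP request.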

Statistical Analysis Methods

The tool summarizes contamination patterns by sorting observations with Arrays.sort() and calculating percentiles.
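
A minimal sketch of a sort-then-index summary of this kind is shown below. The class name RateSummary, the percentile rule (index = floor of p times n-1, with no interpolation), and the use of the population standard deviation are assumptions; the tool's exact formulas are not stated here.

```java
import java.util.Arrays;

final class RateSummary {

    /** Value at fraction p (0..1) of an ascending-sorted array, without interpolation. */
    static double percentile(double[] sorted, double p) {
        int index = (int) (p * (sorted.length - 1));
        return sorted[index];
    }

    static double mean(double[] values) {
        double sum = 0;
        for (double v : values) {
            sum += v;
        }
        return sum / values.length;
    }

    /** Population standard deviation (divides by n, an assumption for this sketch). */
    static double stdev(double[] values) {
        double m = mean(values);
        double sumSquares = 0;
        for (double v : values) {
            sumSquares += (v - m) * (v - m);
        }
        return Math.sqrt(sumSquares / values.length);
    }

    /** Returns {first quartile, median, third quartile, mean, standard deviation}. */
    static double[] summarize(double[] rates) {
        if (rates.length == 0) {
            throw new IllegalArgumentException("no observations");
        }
        double[] sorted = rates.clone(); // leave the caller's array untouched
        Arrays.sort(sorted);
        return new double[] {
            percentile(sorted, 0.25), percentile(sorted, 0.50), percentile(sorted, 0.75),
            mean(sorted), stdev(sorted)
        };
    }

    public static void main(String[] args) {
        double[] ppm = {40, 10, 30, 20, 50}; // made-up rates for illustration
        System.out.println(Arrays.toString(summarize(ppm)));
    }
}
```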

Data Structures and Memory Management

The tool uses Java collections created with an initial capacity of 203 elements.

Output Format Implementation

The tool generates two distinct output formats, using ByteBuilder for efficient string construction.
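
To make the two formats concrete, here is a sketch of one raw row and one summary row built as tab-delimited text. It uses the standard StringBuilder as a stand-in for ByteBuilder, and the class name OutputRows, the column order, the label column, and the two-decimal formatting are all assumptions rather than the tool's actual layout.

```java
import java.util.Locale;

final class OutputRows {

    /** Raw row: plate, well, tag sequence, read count, ppm (column order assumed). */
    static String rawRow(String plate, String well, String tag, long reads, double ppm) {
        StringBuilder sb = new StringBuilder(); // stand-in for ByteBuilder
        sb.append(plate).append('\t')
          .append(well).append('\t')
          .append(tag).append('\t')
          .append(reads).append('\t')
          .append(String.format(Locale.ROOT, "%.2f", ppm));
        return sb.toString();
    }

    /** Summary row: label, quartiles, mean, standard deviation (layout assumed). */
    static String summaryRow(String label, double q1, double median, double q3,
                             double mean, double stdev) {
        return String.format(Locale.ROOT, "%s\t%.2f\t%.2f\t%.2f\t%.2f\t%.2f",
                label, q1, median, q3, mean, stdev);
    }

    public static void main(String[] args) {
        // Hypothetical values for illustration.
        System.out.println(rawRow("PLATE-0001", "A1", "ACGTACGT", 100, 50.0));
        System.out.println(summaryRow("PLATE-0001", 20, 30, 40, 30, 14.1421));
    }
}
```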

Support

For questions and support: