KapaStats
Gathers statistics on Kapa spike-in rates for contamination analysis in sequencing libraries. This specialized tool analyzes cross-contamination between wells in sequencing plates by examining Kapa adapter sequences.
Basic Usage
kapastats.sh in=<input file> out=<output file>
The input file is a TSV file of plate IDs, one plate ID per line. For each plate, the tool queries a web API for Kapa contamination data, then writes either statistical summaries or raw per-well observations.
Parameters
Parameters control input/output files, processing options, and output format for Kapa contamination analysis.
Input/Output Parameters
- in=<file>
- TSV file of plate IDs, one ID per line. Each plate ID will be used to query contamination data from web APIs. Lines starting with '#' are treated as comments and ignored.
- out=<file>
- Primary output file for contamination statistics. Can be set to stdout to print results to console. Output format depends on the 'raw' parameter setting.
- overwrite=f
- (ow) Set to false to force the program to abort rather than overwrite an existing file. Default is false for safety.
- raw=f
- Output raw observations rather than statistical summaries. When true, outputs detailed per-well contamination data including plate names, well positions, tag sequences, and calculated contamination rates. When false (default), outputs statistical summaries with quartiles, averages, and standard deviations.
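The input format described for in= can be sketched as a small parser. This is an illustration only, not the tool's own reader: the class and method names are hypothetical, and the handling of blank lines and extra tab-separated columns is an assumption, since the text above only documents one ID per line and '#' comments.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class PlateIdReader {

    /** Extracts plate IDs from input lines, one ID per line, skipping '#' comments. */
    static List<String> parsePlateIds(List<String> lines) {
        List<String> ids = new ArrayList<>();
        for (String line : lines) {
            String trimmed = line.trim();
            // Assumption: blank lines are skipped; only '#' comments are documented.
            if (trimmed.isEmpty() || trimmed.startsWith("#")) {
                continue;
            }
            // Assumption: if a line has several tab-separated columns, the ID is the first.
            ids.add(trimmed.split("\t")[0]);
        }
        return ids;
    }

    public static void main(String[] args) {
        // Hypothetical plate IDs, used only for illustration.
        List<String> lines = Arrays.asList("# plates for run 1", "PLATE_A", "", "PLATE_B\textra");
        System.out.println(parsePlateIds(lines)); // prints [PLATE_A, PLATE_B]
    }
}
```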
Java Parameters
- -Xmx
- This will set Java's memory usage, overriding autodetection. -Xmx20g will specify 20 gigs of RAM, and -Xmx200m will specify 200 megs. The max is typically 85% of physical memory. Default is 200m for this lightweight tool.
- -eoom
- This flag will cause the process to exit if an out-of-memory exception occurs. Requires Java 8u92+. Useful for automated pipelines where graceful failure handling is important.
- -da
- Disable assertions. Can improve performance slightly in production environments where debugging information is not needed.
Examples
Basic Contamination Analysis
kapastats.sh in=plates.tsv out=contamination_stats.txt
Analyzes Kapa contamination for plates listed in plates.tsv and outputs statistical summaries to contamination_stats.txt. The output will include quartiles, averages, and standard deviations for contamination rates.
Raw Data Output
kapastats.sh in=plates.tsv out=raw_contamination.txt raw=t
Outputs detailed raw contamination data for each well, including plate names, well positions, tag sequences, read counts, and calculated parts-per-million contamination rates.
Console Output with Custom Memory
kapastats.sh in=plates.tsv out=stdout -Xmx1g
Prints contamination statistics to console with 1GB of allocated memory. Useful for interactive analysis or integration into larger pipelines.
Algorithm Details
Contamination Detection Strategy
KapaStats detects cross-contamination in sequencing libraries by fetching per-plate data with ServerTools.readPage() and parsing it with JsonParser:
- Web API Integration: Uses ServerTools.readPage() to query JGI's internal database APIs (https://rqc.jgi.doe.gov/api/plate_ui/page/[plateID]/kapa spikein) to retrieve Kapa adapter sequencing data for each plate, including well positions, tag sequences, and read counts.
- JSON Data Processing: Parses returned JSON data using JsonParser to extract kapa statistics including hit counts, offhit counts, and tag names for each well.
- Cross-Contamination Analysis: For each well, identifies the "correct" Kapa tag (correctKapaTag field) and measures reads from other tags that should not be present, indicating cross-contamination between wells.
- Parts-Per-Million Calculation: Calculates contamination rates as PPM using the formula: ppmk = ke.reads * (1000000.0/kapaReads) where ke.reads are contaminating reads and kapaReads are total Kapa reads.
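- Genomic Contamination Inference: Estimates genomic DNA contamination by scaling the contaminating Kapa reads by the source well's ratio of total reads to correct-tag Kapa reads: greads = contamReads * (source.reads / source.correctKapaReads). The result is then expressed relative to the sink well's reads: gppm = 1000000 * greads / sink.reads. Here the source is the well that the contaminating tag belongs to, and the sink is the well in which it was observed.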
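The two formulas above can be worked through with a small sketch. The class name, method names, and read counts are hypothetical. The cast to double is an assumption made here so that the source-well ratio is not truncated by integer division; the field types used inside KapaStats are not stated in this document.

```java
class ContaminationMath {

    /** Kapa-level contamination in PPM: contaminating tag reads relative to total Kapa reads. */
    static double kapaPpm(long contamKapaReads, long totalKapaReads) {
        return contamKapaReads * (1000000.0 / totalKapaReads);
    }

    /** Estimated genomic reads carried over, scaled by the source well's reads per correct Kapa read. */
    static double genomicReads(long contamKapaReads, long sourceReads, long sourceCorrectKapaReads) {
        // Assumption: floating-point division, so the ratio keeps its fractional part.
        return contamKapaReads * (sourceReads / (double) sourceCorrectKapaReads);
    }

    /** Genomic contamination in PPM relative to the sink well's total reads. */
    static double genomicPpm(double genomicReads, long sinkReads) {
        return 1000000 * genomicReads / sinkReads;
    }

    public static void main(String[] args) {
        // Hypothetical counts: 5 off-target Kapa reads among 10,000 Kapa reads in the sink well.
        System.out.println(kapaPpm(5, 10000)); // prints 500.0
        // Source well: 2,000,000 total reads and 10,000 correct-tag Kapa reads (ratio 200).
        double greads = genomicReads(5, 2000000, 10000);
        System.out.println(greads); // prints 1000.0
        // Sink well: 4,000,000 total reads.
        System.out.println(genomicPpm(greads, 4000000)); // prints 250.0
    }
}
```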
Statistical Analysis Methods
For each tag pair, the observed PPM values are sorted with Arrays.sort() and then summarized:
- Quartile Calculations: Indexes directly into the sorted array, rounding to the nearest position without interpolation: p25 = ppmk[(int)Math.round((len-1)*0.25)], p50 = ppmk[(int)Math.round((len-1)*0.50)], p75 = ppmk[(int)Math.round((len-1)*0.75)]
- Summary Statistics: Computes minimum (ppmk[0]), maximum (ppmk[len-1]), average using shared.Vector.sum(ppmk)/len, and standard deviation using Tools.standardDeviation(ppmk)
- Observation Frequency: Tracks contamination event counts and calculates fraction as count/(double)len for each tag pair
- Dual Output Methods: printResults() for statistical summaries, printRawResults() for detailed per-observation data
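The summary statistics listed above can be sketched as follows. The percentile indexing follows the expression given in the text. The mean and standard deviation helpers are stand-ins written for this example: shared.Vector.sum() and Tools.standardDeviation() are not reproduced here, and the population form of the standard deviation is an assumption.

```java
import java.util.Arrays;

class PpmSummary {

    /** Nearest-position percentile on an already sorted array, as in the indexing expression above. */
    static double percentile(double[] sorted, double fraction) {
        return sorted[(int) Math.round((sorted.length - 1) * fraction)];
    }

    static double mean(double[] values) {
        double sum = 0;
        for (double v : values) {
            sum += v;
        }
        return sum / values.length;
    }

    /** Assumption: population standard deviation (divide by n), not the sample form. */
    static double stdev(double[] values) {
        double mean = mean(values);
        double sumSq = 0;
        for (double v : values) {
            sumSq += (v - mean) * (v - mean);
        }
        return Math.sqrt(sumSq / values.length);
    }

    public static void main(String[] args) {
        // Hypothetical PPM observations for one tag pair.
        double[] ppmk = {40, 10, 30, 20, 50};
        Arrays.sort(ppmk);
        System.out.println("min=" + ppmk[0] + " max=" + ppmk[ppmk.length - 1]);
        System.out.println("p25=" + percentile(ppmk, 0.25)
                + " p50=" + percentile(ppmk, 0.50)
                + " p75=" + percentile(ppmk, 0.75));
        System.out.println("avg=" + mean(ppmk) + " stdev=" + stdev(ppmk));
    }
}
```

Because the index is rounded rather than interpolated, each reported quartile is always one of the observed values.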
Data Structures and Memory Management
The tool stores its data in standard Java collections, created with an initial capacity of 203:
- LinkedHashMap Storage: tagMap (LinkedHashMap<String, TagData>) and plateMap (LinkedHashMap<String, Plate>) preserve insertion order while giving expected O(1) lookups
- Nested Class Hierarchy: Plate class contains ArrayList<Well> wells, Well class contains LinkedHashMap<String, KapaEntry> kapaMap modeling physical sequencing setup
- TagData Class Management: Contains LinkedHashMap<String, ArrayList<Double>> ppmMap and plateNameMap for tracking contamination observations across multiple plates
- Memory Configuration: Default 200MB allocation (-Xmx200m, -Xms200m) optimized for typical contamination analysis workloads
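The nesting described above can be sketched with simplified classes. Only the names Plate, Well, KapaEntry, wells, kapaMap, reads, correctKapaTag, and plateMap come from the text. Constructors, the other fields, and the offTargetReads() helper are hypothetical, and the real classes may differ substantially.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;

class PlateModel {

    /** Read count for one Kapa tag observed in a well. */
    static class KapaEntry {
        final String tag;
        final long reads;
        KapaEntry(String tag, long reads) { this.tag = tag; this.reads = reads; }
    }

    /** One well: its expected tag plus every tag actually observed in it. */
    static class Well {
        final String name;
        final String correctKapaTag;
        final LinkedHashMap<String, KapaEntry> kapaMap = new LinkedHashMap<>();

        Well(String name, String correctKapaTag) {
            this.name = name;
            this.correctKapaTag = correctKapaTag;
        }

        void addEntry(String tag, long reads) {
            kapaMap.put(tag, new KapaEntry(tag, reads));
        }

        /** Hypothetical helper: total reads from tags other than the expected one. */
        long offTargetReads() {
            long sum = 0;
            for (KapaEntry ke : kapaMap.values()) {
                if (!ke.tag.equals(correctKapaTag)) {
                    sum += ke.reads;
                }
            }
            return sum;
        }
    }

    static class Plate {
        final String name;
        final ArrayList<Well> wells = new ArrayList<>();
        Plate(String name) { this.name = name; }
    }

    /** Builds a one-well plate with made-up read counts. */
    static Plate buildExample() {
        Plate plate = new Plate("PLATE_A");
        Well a1 = new Well("A1", "tag1");
        a1.addEntry("tag1", 9000);
        a1.addEntry("tag2", 5);
        a1.addEntry("tag3", 2);
        plate.wells.add(a1);
        return plate;
    }

    public static void main(String[] args) {
        // Initial capacity of 203, matching the figure quoted in the text.
        LinkedHashMap<String, Plate> plateMap = new LinkedHashMap<>(203);
        Plate p = buildExample();
        plateMap.put(p.name, p);
        Well w = plateMap.get("PLATE_A").wells.get(0);
        System.out.println(w.name + " off-target reads: " + w.offTargetReads()); // A1 off-target reads: 7
    }
}
```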
Output Format Implementation
The tool writes one of two output formats, building each line with ByteBuilder:
- Statistical Summary (raw=f): Tab-delimited columns appended with ByteBuilder.append(): Tag, Other, Min, 25%, 50%, 75%, Max, Avg, Stdev, Observed, Total, Fraction
- Raw Observations (raw=t): Detailed format including: Plate, SinkWell, SinkCorrectTag, SinkReads, SinkCorrectKapaReads, SinkTotalKapaReads, SourceWell, MeasuredTag, SourceReads, SourceCorrectKapaReads, SourceKapaReadsInSink, KPPM, GReads, GPPM
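A row of the statistical summary could be assembled roughly as shown below. This sketch uses java.lang.StringBuilder as a stand-in for ByteBuilder. The number formatting, and the reading of Fraction as Observed divided by Total, are assumptions rather than documented behavior, and all names and values are hypothetical.

```java
import java.util.Locale;

class SummaryRow {

    /** Column names as listed for the raw=f output. */
    static final String HEADER = String.join("\t",
            "Tag", "Other", "Min", "25%", "50%", "75%", "Max", "Avg", "Stdev",
            "Observed", "Total", "Fraction");

    /**
     * Formats one tab-delimited row.
     * stats holds Min, 25%, 50%, 75%, Max, Avg, Stdev in that order.
     */
    static String format(String tag, String other, double[] stats, int observed, int total) {
        StringBuilder sb = new StringBuilder();
        sb.append(tag).append('\t').append(other);
        for (double s : stats) {
            // Assumption: two decimal places; the tool's actual precision is not documented here.
            sb.append('\t').append(String.format(Locale.ROOT, "%.2f", s));
        }
        sb.append('\t').append(observed).append('\t').append(total);
        // Assumption: Fraction = Observed / Total.
        sb.append('\t').append(String.format(Locale.ROOT, "%.4f", observed / (double) total));
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(HEADER);
        // Hypothetical statistics for one tag pair.
        double[] stats = {10, 20, 30, 40, 50, 30, 14.14};
        System.out.println(format("tag1", "tag2", stats, 3, 12));
    }
}
```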
Support
For questions and support:
- Email: bbushnell@lbl.gov
- Documentation: bbmap.org