bbcountunique

Script: bbcountunique.sh
Package: jgi
Class: CalcUniqueness.java

Creates rarefaction curves by plotting the fraction of unique reads as a function of reads sequenced. Uses probabilistic kmer tracking to assess library complexity and determine whether additional sequencing might be beneficial. Essential for quality control and sequencing depth optimization.

Overview

bbcountunique generates kmer uniqueness histograms similar to rarefaction curves, helping determine library complexity by tracking how the fraction of unique reads changes as more reads are processed. The tool uses a probabilistic approach: if a kmer from a fixed location has been seen before, it assumes the entire read has been seen before.

Key Applications

Basic Usage

bbcountunique.sh in=<input> out=<output>

Outputs a tab-delimited histogram with 3 columns for single reads or 6 columns for paired reads, showing uniqueness percentages at regular intervals.

Parameters

Input parameters

in2=null
Second input file for paired reads. When specified, calculates additional paired-end specific metrics including composite pair kmers.
interleaved=auto
Override autodetection of paired interleaved format. Set true/false to force interpretation of input format.
samplerate=1
Sample fraction of input reads (0-1). Values below 1 enable quick analysis of large datasets. For example, 0.1 processes 10% of reads.
reads=-1
Process only this number of reads before stopping (-1 processes all reads). Useful for testing or standardized comparisons.

Output parameters

out=<file>
Output file for uniqueness statistics. Tab-delimited format with column headers. Outputs to stdout if not specified.

Processing parameters

k=25
Kmer length (range 1-31). Default 25 provides good balance between specificity and sensitivity. Note: k=20 may be too short for some applications.
interval=25000
Report statistics every N reads. Controls output resolution: smaller values provide finer detail but produce larger output files.
cumulative=f
Show cumulative statistics rather than per-interval numbers. When true, displays running totals from file start instead of per-bin statistics.
percent=t
Display uniqueness as percentages (0-100). When false, shows proportional values (0-1).
count=f
Include raw counts of unique kmers in addition to percentages, adding extra columns to the output.
printlastbin=f
(plb) Include statistics for final undersized bin. When true, outputs data for remaining reads even if fewer than interval size.
minprob=0
Ignore kmers with correctness probability below this threshold (0-1, based on quality scores). Filters sequencing errors but may affect low-quality data analysis.

Java Parameters

-Xmx
Set Java's memory usage, overriding autodetection (e.g., -Xmx4g). The tool grabs all available memory by default, even when it is not needed; genome-scale datasets typically require 1-4 GB.
-eoom
Exit on out-of-memory exception. Requires Java 8u92+. Useful for batch processing to prevent hanging.
-da
Disable Java assertions for minor performance improvement in production runs.

Examples

Basic Library Complexity Assessment

bbcountunique.sh in=reads.fq out=complexity.txt

Standard analysis with default parameters. Good starting point for most applications.

Paired-End Analysis with Quality Filtering

bbcountunique.sh in=R1.fq in2=R2.fq out=uniqueness.txt minprob=0.9

Analyzes paired reads while filtering kmers with >10% error probability. Provides separate statistics for R1, R2, and combined pair metrics.

High-Resolution Monitoring

bbcountunique.sh in=reads.fq out=detailed.txt interval=5000 count=t percent=t

Finer resolution output (every 5K reads) with both counts and percentages for detailed analysis.

Quick Assessment of Large Dataset

bbcountunique.sh in=huge_dataset.fq out=sample.txt samplerate=0.1 reads=1000000

Analyzes 10% sample of first million reads for rapid complexity assessment.

Cumulative Analysis

bbcountunique.sh in=reads.fq out=cumulative.txt cumulative=t

Shows running totals instead of per-interval statistics, useful for plotting saturation curves.
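The relationship between per-interval and cumulative statistics can be sketched as follows. This is an illustrative Python snippet, not the tool's code; it assumes equal-sized intervals, in which case the cumulative percentage is just the running average of the per-interval percentages:

```python
def to_cumulative(per_interval_pcts):
    """Convert equal-sized per-interval uniqueness percentages
    to cumulative (running-average) percentages."""
    out, total = [], 0.0
    for i, pct in enumerate(per_interval_pcts, start=1):
        total += pct
        out.append(total / i)
    return out

print(to_cumulative([100.0, 80.0, 60.0]))  # [100.0, 90.0, 80.0]
```

Cumulative curves decline more smoothly than per-interval ones, which makes them easier to plot but slower to reveal sudden quality problems.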

Algorithm Details

Probabilistic Uniqueness Detection

The tool uses a probabilistic approach to determine read uniqueness by tracking kmers from fixed positions within reads. If a kmer from a specific location (e.g., first kmer) has been observed before, the algorithm assumes the entire read has been seen previously. This approach provides efficient uniqueness estimation without storing complete read sequences.
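A minimal Python sketch of this idea, assuming reads long enough to hold a kmer at the fixed position (the real CalcUniqueness.java tracks several kmer positions and uses hash tables rather than a Python set):

```python
# Sketch: probabilistic read-uniqueness via a fixed-position kmer.
# A read counts as "seen before" if its first kmer was seen before.
def uniqueness_curve(reads, k=25, interval=4):
    seen = set()
    curve = []           # fraction of unique reads per interval
    unique = total = 0
    for read in reads:
        kmer = read[:k]              # fixed location: first kmer
        if kmer not in seen:
            seen.add(kmer)
            unique += 1
        total += 1
        if total % interval == 0:
            curve.append(unique / interval)
            unique = 0               # per-interval, not cumulative
    return curve

reads = ["ACGTACGTACGTACGTACGTACGTACGT",
         "ACGTACGTACGTACGTACGTACGTACGT",   # duplicate of read 1
         "TTTTACGTACGTACGTACGTACGTACGT",
         "GGGGACGTACGTACGTACGTACGTACGT"]
print(uniqueness_curve(reads, k=25, interval=4))  # [0.75]
```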

Multi-Table Hash Distribution

CalcUniqueness employs 31 parallel hash tables (WAYS=31) for kmer storage, distributing kmers based on hash values (kmer % 31). This parallel architecture reduces memory contention and enables efficient concurrent processing. The system uses HashArray1D data structures with automatic load balancing through ScheduleMaker optimization.
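A toy illustration of the routing rule, assuming 2-bit kmer encoding (the actual HashArray1D structures and ScheduleMaker load balancing are more involved):

```python
WAYS = 31  # number of parallel tables, as in CalcUniqueness

def encode(kmer):
    """Pack a kmer into an integer using 2 bits per base."""
    code = {'A': 0, 'C': 1, 'G': 2, 'T': 3}
    x = 0
    for base in kmer:
        x = (x << 2) | code[base]
    return x

tables = [set() for _ in range(WAYS)]

def insert(kmer):
    x = encode(kmer)
    tables[x % WAYS].add(x)   # route by kmer % WAYS

for kmer in ("ACGTA", "CCCCC", "GATTA"):
    insert(kmer)

# Each kmer lands in exactly one of the 31 tables:
assert sum(len(t) for t in tables) == 3
```

Because 31 is prime and the modulus is taken over the numeric kmer value, kmers spread fairly evenly across tables, and threads working on different tables rarely contend for the same lock.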

Kmer Extraction Strategies

Counter System Architecture

Maintains separate counters with distinct bitmasks for different kmer types.

Quality-Based Filtering

The minprob parameter enables quality-aware analysis by calculating per-kmer correctness probability from quality scores. Kmers with probability below the threshold are excluded from analysis, preventing sequencing errors from inflating uniqueness estimates. This is crucial for accurate analysis of lower-quality data.
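The probability itself follows from standard Phred scoring: each base's error probability is 10^(-Q/10), and a kmer is correct only if every base in it is correct. A sketch of that calculation (illustrative, not the tool's code):

```python
def kmer_correctness_prob(quals):
    """Probability that all bases in a kmer are correct,
    given their Phred quality scores (P_err = 10^(-Q/10))."""
    p = 1.0
    for q in quals:
        p *= 1.0 - 10 ** (-q / 10.0)
    return p

# A 25-base kmer at Q30 (0.1% error per base):
p = kmer_correctness_prob([30] * 25)
print(round(p, 4))  # ~0.9753: passes minprob=0.9, fails minprob=0.99
```

This shows why aggressive minprob values exclude most kmers from lower-quality data: even uniformly decent quality scores compound across the kmer length.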

Memory Management

The tool automatically allocates available system memory despite typically not requiring it all (approximately 50 bytes per unique read). Initial hash table capacity is set to 512,000 entries with dynamic expansion. Memory usage scales with dataset complexity rather than size.
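A back-of-envelope estimate using the ~50 bytes per unique read figure above (the exact per-entry cost depends on hash table load factors):

```python
def estimate_memory_bytes(unique_reads, bytes_per_read=50):
    """Rough memory footprint from the ~50 bytes/unique-read figure."""
    return unique_reads * bytes_per_read

# 100 million unique reads -> about 5 GB
gb = estimate_memory_bytes(100_000_000) / 1e9
print(f"{gb:.1f} GB")  # 5.0 GB
```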

Output Format and Interpretation

Column Structure

Single-end reads (3 columns):

Paired-end reads (6 columns):

Data Quality Considerations

Important: This tool requires exact kmer matches, making it sensitive to data quality:

  • Low-quality data will give artificially high uniqueness estimates
  • Cannot be used on raw PacBio data due to high error rates
  • Quality issues appear as spikes in the uniqueness plots, indicating flowcell problems
  • Use minprob parameter to filter error-prone kmers

Legacy Design Notes

bbcountunique was designed to replace an existing pipeline while maintaining compatible output. Some features reflect this legacy requirement rather than optimal design.

Applications and Best Practices

Library Complexity Assessment

Sequencing Depth Optimization

Plot uniqueness vs. read count to identify the saturation point where additional sequencing yields diminishing returns. The slope of uniqueness decline indicates optimal stopping points for cost-effective sequencing.
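One simple way to locate that point from the per-interval histogram is to flag the first interval whose uniqueness falls below a chosen floor. The 10% floor here is a hypothetical example, not a recommendation from the tool:

```python
def saturation_point(read_counts, uniq_pcts, floor=10.0):
    """First read count at which per-interval uniqueness falls
    below `floor` percent; None if the library is not yet saturated."""
    for n, pct in zip(read_counts, uniq_pcts):
        if pct < floor:
            return n
    return None

counts = [25000, 50000, 75000, 100000]
pcts   = [95.0, 60.0, 20.0, 8.0]
print(saturation_point(counts, pcts))  # 100000
```

In practice the floor should reflect the cost/benefit of additional sequencing for the specific project rather than a fixed constant.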

Quality Control Applications

Comparative Analysis Guidelines

Performance and Limitations

Performance Characteristics

Data Requirements and Limitations