bbcountunique

Script: bbcountunique.sh
Package: jgi
Class: CalcUniqueness.java

Creates rarefaction curves by plotting the fraction of unique reads as a function of reads sequenced. Uses probabilistic kmer tracking to assess library complexity and determine whether additional sequencing might be beneficial. Essential for quality control and sequencing depth optimization.

Overview

bbcountunique generates kmer uniqueness histograms similar to rarefaction curves, helping determine library complexity by tracking how the fraction of unique reads changes as more reads are processed. The tool uses a probabilistic approach: if a kmer from a fixed location has been seen before, it assumes the entire read has been seen before.

Key Applications

Basic Usage

bbcountunique.sh in=<input> out=<output>

Outputs a tab-delimited histogram with 3 columns for single reads or 6 columns for paired reads, showing uniqueness percentages at regular intervals.

Parameters

Input parameters

in2=null
Second input file for paired reads. When specified, calculates additional paired-end specific metrics including composite pair kmers.
interleaved=auto
Override autodetection of paired interleaved format. Set true/false to force interpretation of input format.
samplerate=1
Sample fraction of input reads (0-1). Values below 1 enable quick analysis of large datasets. For example, 0.1 processes 10% of reads.
reads=-1
Process only this number of reads before stopping (-1 processes all reads). Useful for testing or standardized comparisons.

Output parameters

out=<file>
Output file for uniqueness statistics. Tab-delimited format with column headers. Outputs to stdout if not specified.

Processing parameters

k=25
Kmer length (range 1-31). Default 25 provides good balance between specificity and sensitivity. Note: k=20 may be too short for some applications.
interval=25000
Report statistics every N reads. Controls output resolution: smaller values provide finer detail but produce larger output files.
cumulative=f
Show cumulative statistics rather than per-interval numbers. When true, displays running totals from file start instead of per-bin statistics.
percent=t
Display uniqueness as percentages (0-100). When false, shows proportional values (0-1).
count=f
Include raw counts of unique kmers in addition to percentages, adding extra columns to the output.
printlastbin=f
(plb) Include statistics for final undersized bin. When true, outputs data for remaining reads even if fewer than interval size.
minprob=0
Ignore kmers with correctness probability below this threshold (0-1, based on quality scores). Filters sequencing errors but may affect low-quality data analysis.

Java Parameters

-Xmx
Set Java's memory usage, overriding autodetection (e.g., -Xmx4g). The tool grabs all available memory by default, even when it is not needed; genome-scale datasets typically require 1-4 GB.
-eoom
Exit on out-of-memory exception. Requires Java 8u92+. Useful for batch processing to prevent hanging.
-da
Disable Java assertions for minor performance improvement in production runs.

Examples

Basic Library Complexity Assessment

bbcountunique.sh in=reads.fq out=complexity.txt

Standard analysis with default parameters. Good starting point for most applications.

Paired-End Analysis with Quality Filtering

bbcountunique.sh in=R1.fq in2=R2.fq out=uniqueness.txt minprob=0.9

Analyzes paired reads while filtering kmers with >10% error probability. Provides separate statistics for R1, R2, and combined pair metrics.

High-Resolution Monitoring

bbcountunique.sh in=reads.fq out=detailed.txt interval=5000 count=t percent=t

Finer resolution output (every 5K reads) with both counts and percentages for detailed analysis.

Quick Assessment of Large Dataset

bbcountunique.sh in=huge_dataset.fq out=sample.txt samplerate=0.1 reads=1000000

Analyzes 10% sample of first million reads for rapid complexity assessment.

Cumulative Analysis

bbcountunique.sh in=reads.fq out=cumulative.txt cumulative=t

Shows running totals instead of per-interval statistics, useful for plotting saturation curves.
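The relationship between per-interval and cumulative statistics can be sketched as follows. This is an illustrative Python snippet, not the tool's code; it assumes equal-sized intervals, in which case the cumulative percentage is just the running average of the per-interval percentages:

```python
def to_cumulative(per_interval_pcts):
    """Convert equal-sized per-interval uniqueness percentages
    to cumulative (running-average) percentages."""
    out, total = [], 0.0
    for i, pct in enumerate(per_interval_pcts, start=1):
        total += pct
        out.append(total / i)
    return out

print(to_cumulative([100.0, 80.0, 60.0]))  # [100.0, 90.0, 80.0]
```

Cumulative curves decline more smoothly than per-interval ones, which makes them easier to plot but slower to reveal sudden quality problems.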

Algorithm Details

Probabilistic Uniqueness Detection

The tool uses a probabilistic approach to determine read uniqueness by tracking kmers from fixed positions within reads. If a kmer from a specific location (e.g., first kmer) has been observed before, the algorithm assumes the entire read has been seen previously. This approach provides efficient uniqueness estimation without storing complete read sequences.
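A minimal Python sketch of this idea, assuming reads long enough to hold a kmer at the fixed position (the real CalcUniqueness.java tracks several kmer positions and uses hash tables rather than a Python set):

```python
# Sketch: probabilistic read-uniqueness via a fixed-position kmer.
# A read counts as "seen before" if its first kmer was seen before.
def uniqueness_curve(reads, k=25, interval=4):
    seen = set()
    curve = []           # fraction of unique reads per interval
    unique = total = 0
    for read in reads:
        kmer = read[:k]              # fixed location: first kmer
        if kmer not in seen:
            seen.add(kmer)
            unique += 1
        total += 1
        if total % interval == 0:
            curve.append(unique / interval)
            unique = 0               # per-interval, not cumulative
    return curve

reads = ["ACGTACGTACGTACGTACGTACGTACGT",
         "ACGTACGTACGTACGTACGTACGTACGT",   # duplicate of read 1
         "TTTTACGTACGTACGTACGTACGTACGT",
         "GGGGACGTACGTACGTACGTACGTACGT"]
print(uniqueness_curve(reads, k=25, interval=4))  # [0.75]
```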

Multi-Table Hash Distribution

CalcUniqueness employs 31 parallel hash tables (WAYS=31) for kmer storage, distributing kmers based on hash values (kmer % 31). This parallel architecture reduces memory contention and enables efficient concurrent processing. The system uses HashArray1D data structures with automatic load balancing through ScheduleMaker optimization.
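A toy illustration of the routing rule, assuming 2-bit kmer encoding (the actual HashArray1D structures and ScheduleMaker load balancing are more involved):

```python
WAYS = 31  # number of parallel tables, as in CalcUniqueness

def encode(kmer):
    """Pack a kmer into an integer using 2 bits per base."""
    code = {'A': 0, 'C': 1, 'G': 2, 'T': 3}
    x = 0
    for base in kmer:
        x = (x << 2) | code[base]
    return x

tables = [set() for _ in range(WAYS)]

def insert(kmer):
    x = encode(kmer)
    tables[x % WAYS].add(x)   # route by kmer % WAYS

for kmer in ("ACGTA", "CCCCC", "GATTA"):
    insert(kmer)

# Each kmer lands in exactly one of the 31 tables:
assert sum(len(t) for t in tables) == 3
```

Because 31 is prime and the modulus is taken over the numeric kmer value, kmers spread fairly evenly across tables, and threads working on different tables rarely contend for the same lock.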

Kmer Extraction Strategies

Counter System Architecture

Maintains separate counters with distinct bitmasks for different kmer types.

Quality-Based Filtering

The minprob parameter enables quality-aware analysis by calculating per-kmer correctness probability from quality scores. Kmers with probability below the threshold are excluded from analysis, preventing sequencing errors from inflating uniqueness estimates. This is crucial for accurate analysis of lower-quality data.
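The probability itself follows from standard Phred scoring: each base's error probability is 10^(-Q/10), and a kmer is correct only if every base in it is correct. A sketch of that calculation (illustrative, not the tool's code):

```python
def kmer_correctness_prob(quals):
    """Probability that all bases in a kmer are correct,
    given their Phred quality scores (P_err = 10^(-Q/10))."""
    p = 1.0
    for q in quals:
        p *= 1.0 - 10 ** (-q / 10.0)
    return p

# A 25-base kmer at Q30 (0.1% error per base):
p = kmer_correctness_prob([30] * 25)
print(round(p, 4))  # ~0.9753: passes minprob=0.9, fails minprob=0.99
```

This shows why aggressive minprob values exclude most kmers from lower-quality data: even uniformly decent quality scores compound across the kmer length.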

Memory Management

The tool automatically allocates available system memory despite typically not requiring it all (approximately 50 bytes per unique read). Initial hash table capacity is set to 512,000 entries with dynamic expansion. Memory usage scales with dataset complexity rather than size.
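A back-of-envelope estimate using the ~50 bytes per unique read figure above (the exact per-entry cost depends on hash table load factors):

```python
def estimate_memory_bytes(unique_reads, bytes_per_read=50):
    """Rough memory footprint from the ~50 bytes/unique-read figure."""
    return unique_reads * bytes_per_read

# 100 million unique reads -> about 5 GB
gb = estimate_memory_bytes(100_000_000) / 1e9
print(f"{gb:.1f} GB")  # 5.0 GB
```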

Output Format and Interpretation

Column Structure

Single-end reads (3 columns):

Paired-end reads (6 columns):

Data Quality Considerations

Important: This tool requires exact kmer matches, making it sensitive to data quality:

  • Low-quality data will give artificially high uniqueness estimates
  • Cannot be used on raw PacBio data due to high error rates
  • Quality issues appear as spikes in the uniqueness plots, indicating flowcell problems
  • Use minprob parameter to filter error-prone kmers

Legacy Design Notes

bbcountunique was designed to replace an existing pipeline while maintaining compatible output. Some features reflect this legacy requirement rather than optimal design.

Applications and Best Practices

Library Complexity Assessment

Sequencing Depth Optimization

Plot uniqueness vs. read count to identify the saturation point where additional sequencing yields diminishing returns. The slope of uniqueness decline indicates optimal stopping points for cost-effective sequencing.
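One simple way to locate that point from the per-interval histogram is to flag the first interval whose uniqueness falls below a chosen floor. The 10% floor here is a hypothetical example, not a recommendation from the tool:

```python
def saturation_point(read_counts, uniq_pcts, floor=10.0):
    """First read count at which per-interval uniqueness falls
    below `floor` percent; None if the library is not yet saturated."""
    for n, pct in zip(read_counts, uniq_pcts):
        if pct < floor:
            return n
    return None

counts = [25000, 50000, 75000, 100000]
pcts   = [95.0, 60.0, 20.0, 8.0]
print(saturation_point(counts, pcts))  # 100000
```

In practice the floor should reflect the cost/benefit of additional sequencing for the specific project rather than a fixed constant.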

Quality Control Applications

Comparative Analysis Guidelines

Performance and Limitations

Performance Characteristics

Data Requirements and Limitations