bbcountunique
Generates rarefaction-curve data by reporting the fraction of unique reads as a function of reads sequenced. Uses probabilistic kmer tracking to assess library complexity and to determine whether additional sequencing would be beneficial, making it useful for quality control and sequencing-depth optimization.
Overview
bbcountunique generates kmer uniqueness histograms similar to rarefaction curves, helping determine library complexity by tracking how the fraction of unique reads changes as more reads are processed. The tool uses a probabilistic approach: if a kmer from a fixed location has been seen before, it assumes the entire read has been seen before.
Key Applications
- Library Complexity Assessment: Determine if libraries have sufficient diversity
- PCR Duplication Detection: Identify over-amplified libraries before analysis
- Sequencing Depth Optimization: Find the point of diminishing returns for additional sequencing
- Quality Control: Detect systematic biases and quality issues across reads
Basic Usage
bbcountunique.sh in=<input> out=<output>
Outputs a tab-delimited histogram with 3 columns for single reads or 6 columns for paired reads, showing uniqueness percentages at regular intervals.
Parameters
Input parameters
- in2=null
- Second input file for paired reads. When specified, calculates additional paired-end specific metrics including composite pair kmers.
- interleaved=auto
- Override autodetection of paired interleaved format. Set true/false to force interpretation of input format.
- samplerate=1
- Sample fraction of input reads (0-1). Values below 1 enable quick analysis of large datasets. For example, 0.1 processes 10% of reads.
- reads=-1
- Process only this number of reads before stopping (-1 processes all reads). Useful for testing or standardized comparisons.
Output parameters
- out=<file>
- Output file for uniqueness statistics. Tab-delimited format with column headers. Outputs to stdout if not specified.
Processing parameters
- k=25
- Kmer length (range 1-31). The default of 25 provides a good balance between specificity and sensitivity; shorter values such as k=20 may be too short for some applications.
- interval=25000
- Report statistics every N reads. Controls output resolution: smaller values provide finer detail but produce larger output files.
- cumulative=f
- Show cumulative statistics rather than per-interval numbers. When true, displays running totals from file start instead of per-bin statistics.
- percent=t
- Display uniqueness as percentages (0-100). When false, shows proportional values (0-1).
- count=f
- Include raw counts of unique kmers in addition to percentages. When true, extra columns are added to the output.
- printlastbin=f
- (plb) Include statistics for final undersized bin. When true, outputs data for remaining reads even if fewer than interval size.
- minprob=0
- Ignore kmers with correctness probability below this threshold (0-1, based on quality scores). Filters sequencing errors but may affect low-quality data analysis.
Java Parameters
- -Xmx
- Set Java's maximum heap size, e.g. -Xmx4g. The launcher automatically grabs available memory even when it is not needed; 1-4GB typically suffices for genome-scale datasets.
- -eoom
- Exit on out-of-memory exception. Requires Java 8u92+. Useful for batch processing to prevent hanging.
- -da
- Disable Java assertions for minor performance improvement in production runs.
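For example, to cap the heap at 2 GB and fail fast on memory exhaustion (hypothetical file names):
bbcountunique.sh -Xmx2g -eoom in=reads.fq out=stats.txt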
Examples
Basic Library Complexity Assessment
bbcountunique.sh in=reads.fq out=complexity.txt
Standard analysis with default parameters. Good starting point for most applications.
Paired-End Analysis with Quality Filtering
bbcountunique.sh in=R1.fq in2=R2.fq out=uniqueness.txt minprob=0.9
Analyzes paired reads while filtering kmers with >10% error probability. Provides separate statistics for R1, R2, and combined pair metrics.
High-Resolution Monitoring
bbcountunique.sh in=reads.fq out=detailed.txt interval=5000 count=t percent=t
Finer resolution output (every 5K reads) with both counts and percentages for detailed analysis.
Quick Assessment of Large Dataset
bbcountunique.sh in=huge_dataset.fq out=sample.txt samplerate=0.1 reads=1000000
Analyzes a 10% sample of the first million reads for rapid complexity assessment.
Cumulative Analysis
bbcountunique.sh in=reads.fq out=cumulative.txt cumulative=t
Shows running totals instead of per-interval statistics, useful for plotting saturation curves.
Algorithm Details
Probabilistic Uniqueness Detection
The tool uses a probabilistic approach to determine read uniqueness by tracking kmers from fixed positions within reads. If a kmer from a specific location (e.g., first kmer) has been observed before, the algorithm assumes the entire read has been seen previously. This approach provides efficient uniqueness estimation without storing complete read sequences.
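A minimal sketch of this logic in Python (illustrative only; the actual Java implementation uses packed kmers and custom hash tables rather than string sets):
K = 25
seen = set()

def read_is_unique(bases):
    """A read counts as novel if its first kmer has not been seen before."""
    kmer = bases[:K]
    if len(kmer) < K or kmer in seen:
        return False
    seen.add(kmer)
    return True

def histogram(reads, interval=25000):
    """Yield (reads_processed, percent_unique) after each interval."""
    unique = 0
    for n, read in enumerate(reads, 1):
        unique += read_is_unique(read)
        if n % interval == 0:
            yield n, 100.0 * unique / interval
            unique = 0  # reset for per-interval (non-cumulative) output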
Multi-Table Hash Distribution
CalcUniqueness employs 31 parallel hash tables (WAYS=31) for kmer storage, distributing kmers based on hash values (kmer % 31). This parallel architecture reduces memory contention and enables efficient concurrent processing. The system uses HashArray1D data structures with automatic load balancing through ScheduleMaker optimization.
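In sketch form, table selection reduces to a modulo on the kmer's numeric value (a simplification of the HashArray1D scheme described above):
WAYS = 31  # number of parallel tables, as described above
tables = [set() for _ in range(WAYS)]

def add_kmer(kmer_value):
    """Route a numeric kmer to one of the 31 tables; return True if novel."""
    table = tables[kmer_value % WAYS]
    if kmer_value in table:
        return False
    table.add(kmer_value)
    return True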
Kmer Extraction Strategies
- Fixed Position: Extracts kmers from consistent locations (default: first kmer) for reproducible sampling
- Random Position: Samples kmers from random locations within reads for statistical diversity
- Paired Composite: Creates concatenated kmers from corresponding positions in R1 and R2 (offset=10) for paired-end analysis; all three strategies are sketched below
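A hedged Python sketch of the three strategies (the offset-10 placement for composite pair kmers follows the description above; exact boundary handling in the real tool may differ):
import random

K = 25
PAIR_OFFSET = 10

def fixed_kmer(read):
    """Fixed position: always the first kmer of the read."""
    return read[:K] if len(read) >= K else None

def random_kmer(read):
    """Random position: one kmer sampled uniformly from the read."""
    if len(read) < K:
        return None
    start = random.randrange(len(read) - K + 1)
    return read[start:start + K]

def pair_kmer(read1, read2):
    """Paired composite: concatenate kmers taken at an offset in R1 and R2."""
    if min(len(read1), len(read2)) < PAIR_OFFSET + K:
        return None
    return (read1[PAIR_OFFSET:PAIR_OFFSET + K] +
            read2[PAIR_OFFSET:PAIR_OFFSET + K])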
Counter System Architecture
The tool maintains separate counters, each tagged with a distinct bitmask, for the different kmer types (a small illustration follows the list):
- bothCounterFirst/bothCounterRand (masks 32/64): Combined statistics across all reads
- r1CounterFirst/r1CounterRand (masks 1/2): Read 1 specific metrics
- r2CounterFirst/r2CounterRand (masks 4/8): Read 2 specific metrics
- pairCounter (mask 16): Paired composite kmer tracking
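The masks make it cheap to update several statistics from a single event; an illustrative sketch using the mask values listed above:
R1_FIRST, R1_RAND = 1, 2
R2_FIRST, R2_RAND = 4, 8
PAIR = 16
BOTH_FIRST, BOTH_RAND = 32, 64

counters = {m: {"reads": 0, "unique": 0}
            for m in (R1_FIRST, R1_RAND, R2_FIRST, R2_RAND,
                      PAIR, BOTH_FIRST, BOTH_RAND)}

def record(event_mask, was_unique):
    """Update every counter whose bit is set in the event mask."""
    for mask, c in counters.items():
        if event_mask & mask:
            c["reads"] += 1
            c["unique"] += was_unique

record(R1_FIRST | BOTH_FIRST, True)  # a novel first kmer from read 1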
Quality-Based Filtering
The minprob parameter enables quality-aware analysis by calculating per-kmer correctness probability from quality scores. Kmers with probability below the threshold are excluded from analysis, preventing sequencing errors from inflating uniqueness estimates. This is crucial for accurate analysis of lower-quality data.
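A sketch of the probability calculation, assuming the usual Phred convention (per-base error probability 10^(-Q/10)):
def kmer_correct_prob(quals):
    """Probability that every base in the kmer is correct, from Phred scores."""
    p = 1.0
    for q in quals:
        p *= 1.0 - 10.0 ** (-q / 10.0)
    return p

# With minprob=0.9, a 25-mer of uniform Q30 bases passes
# (0.999**25 is about 0.975), while uniform Q20 bases fail
# (0.99**25 is about 0.778).
print(kmer_correct_prob([30] * 25) >= 0.9)  # True
print(kmer_correct_prob([20] * 25) >= 0.9)  # False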
Memory Management
The tool automatically allocates available system memory despite typically not requiring it all (approximately 50 bytes per unique read). Initial hash table capacity is set to 512,000 entries with dynamic expansion. Memory usage scales with dataset complexity rather than size.
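A back-of-envelope capacity check based on the ~50 bytes per unique read figure:
def estimated_gb(unique_reads, bytes_per_read=50):
    """Rough heap estimate from the ~50 bytes/unique-read figure above."""
    return unique_reads * bytes_per_read / 1e9

print(estimated_gb(40_000_000))  # about 2.0 GB for 40M unique reads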
Output Format and Interpretation
Column Structure
Single-end reads (3 columns):
- count: Number of reads processed
- first: Percent unique first kmers
- rand: Percent unique random kmers
Paired-end reads (6 columns):
- count: Number of pairs processed
- r1_first: Percent unique first kmers from read 1
- r1_rand: Percent unique random kmers from read 1
- r2_first: Percent unique first kmers from read 2
- r2_rand: Percent unique random kmers from read 2
- pair: Percent unique concatenated pair kmers
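Either layout can be loaded generically; a hedged sketch that assumes a tab-delimited file whose header row names the columns above (possibly prefixed with '#'):
import csv

def read_histogram(path):
    """Parse the histogram into dicts keyed by column name."""
    with open(path) as f:
        reader = csv.DictReader(f, delimiter="\t")
        # Tolerate a leading '#' on the header line, if present.
        reader.fieldnames = [n.lstrip("#") for n in reader.fieldnames]
        return [{k: float(v) for k, v in row.items()} for row in reader]

rows = read_histogram("complexity.txt")
print(rows[-1])  # uniqueness at the final reported interval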
Data Quality Considerations
Important: This tool requires exact kmer matches, making it sensitive to data quality:
- Low-quality data will give artificially high uniqueness estimates
- Cannot be used on raw PacBio data due to high error rates
- Quality issues appear as spikes in the uniqueness plots, indicating flowcell problems
- Use minprob parameter to filter error-prone kmers
Legacy Design Notes
bbcountunique was designed to replace an existing pipeline while maintaining compatible output. Some features reflect this legacy requirement rather than optimal design:
- Short kmer settings such as k=20 may be too short for some modern applications (the current default is k=25)
- Random kmer columns are of questionable utility and can often be ignored
- Output format matches historical expectations rather than current best practices
Applications and Best Practices
Library Complexity Assessment
- High uniqueness (>80%): Good library complexity with minimal PCR duplication
- Medium uniqueness (50-80%): Moderate complexity, acceptable for most applications
- Low uniqueness (<50%): Over-amplification or insufficient library complexity
Sequencing Depth Optimization
Plot uniqueness vs. read count to identify the saturation point where additional sequencing yields diminishing returns. The slope of uniqueness decline indicates optimal stopping points for cost-effective sequencing.
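For example (matplotlib assumed; column names as documented above, with a possible '#' prefix on the header line):
import matplotlib.pyplot as plt

x, y = [], []
with open("complexity.txt") as f:
    header = f.readline().lstrip("#").strip().split("\t")
    col = header.index("first")  # use "r1_first" or "pair" for paired data
    for line in f:
        fields = line.strip().split("\t")
        x.append(float(fields[0]))
        y.append(float(fields[col]))

plt.plot(x, y)
plt.xlabel("Reads processed")
plt.ylabel("Percent unique first kmers")
plt.savefig("saturation.png")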
Quality Control Applications
- Systematic bias detection: Sudden drops in uniqueness may indicate adapter contamination
- Flowcell quality assessment: Spikes in uniqueness indicate regional quality problems
- PCR duplication monitoring: Low uniqueness early in sequencing suggests over-amplification
Comparative Analysis Guidelines
- Use consistent parameters (k, interval, minprob) across samples
- Consider data quality effects when comparing different sequencing platforms
- Focus on first kmer columns for most analyses (random columns less informative)
- Use paired metrics for paired-end data rather than individual read statistics
Performance and Limitations
Performance Characteristics
- Time complexity: O(n) where n is number of input bases
- Memory usage: Approximately 50 bytes per unique read
- Typical memory: 1-4GB for whole genome datasets
- Parallel efficiency: Good scaling with multiple CPU cores
Data Requirements and Limitations
- Exact matching required: Single base differences treated as unique
- Quality sensitivity: Low-quality data produces inflated estimates
- Kmer length constraints: Limited to k=1-31 by implementation
- Platform limitations: Not suitable for high-error platforms without preprocessing