LogLog
Estimates the number of unique kmers (cardinality) in sequence data using the LogLog algorithm. It uses the Tools.hash64shift() function to map kmers into buckets, tracks the maximum number of leading zeros per bucket for cardinality estimation, and supports multi-threaded processing.
Basic Usage
loglog.sh in=<file> k=<31>
The LogLog algorithm estimates unique-kmer cardinality in O(log log n) space, where n is the cardinality. BBTools creates tracker instances via CardinalityTracker.makeTracker() and relies on Tools.hash64shift() for uniform hash distribution across buckets.
Parameters
Parameters control the LogLog algorithm's precision, input handling, and output options.
Input Parameters
- in=<file>
- Primary input file, or comma-delimited list of files. Also accepts 'in1' for the primary input file.
- in2=<file>
- Optional second file for paired reads. Used when processing paired-end sequencing data.
Algorithm Parameters
- k=31
- Kmer length for counting. Longer kmers provide higher specificity but may reduce sensitivity for short sequences.
- buckets=2048
- Number of buckets for counting. Higher values decrease variance for large datasets but increase memory usage. Must be a power of 2.
- seed=-1
- Seed for hash functions. A negative value selects a random seed, so results vary between runs; a fixed positive value makes results reproducible.
- seed2=-1
- Secondary seed for synthetic data generation. Used when generating synthetic sequences for testing algorithm performance.
- minprob=0
- Exclude kmers with probability of being correct below this value (0-1). Helps filter out low-quality kmers that may represent sequencing errors.
Processing Parameters
- trials=1
- Number of independent trials to run. Multiple trials provide statistical measures including mean, median, and standard deviation of estimates.
- synth=false
- Generate synthetic data instead of reading input files. Used for algorithm testing and validation with known ground truth.
- verbose=false
- Enable verbose output including detailed processing information and intermediate statistics.
- trackcounts=false
- Track and report detailed count information during processing. Also accepts 'loglogcounts', 'loglogcount', 'count', 'counts'.
Output Parameters
- out=<file>
- Output file for cardinality estimates. If not specified, results are written to stderr.
- overwrite=true
- Allow overwriting existing output files.
- append=false
- Append results to existing output file instead of overwriting.
Java Parameters
- -Xmx
- Set Java's memory usage, overriding autodetection. Example: -Xmx20g specifies 20 GB of RAM, -Xmx200m specifies 200 MB. Maximum is typically 85% of physical memory (see the combined example after this list).
- -eoom
- Exit if an out-of-memory exception occurs. Requires Java 8u92 or later.
- -da
- Disable assertions for potentially faster execution.
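For example, to cap the heap at 2 GB and exit cleanly if it is exhausted:
loglog.sh in=reads.fq k=31 -Xmx2g -eoom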
Examples
Basic Cardinality Estimation
loglog.sh in=reads.fastq k=31
Estimate the number of unique 31-mers in a FASTQ file using default parameters.
High-Precision Estimation
loglog.sh in=assembly.fasta k=21 buckets=8192
Use more buckets for higher precision when analyzing a large assembly, with shorter kmers for better coverage.
Paired-End Data Analysis
loglog.sh in=reads_R1.fq in2=reads_R2.fq k=31 buckets=4096
Process paired-end reads together to get cardinality across both files.
Multiple Trials for Statistics
loglog.sh in=data.fasta k=31 trials=10 buckets=2048
Run 10 independent trials to obtain statistical measures, including the harmonic mean, median, and standard deviation of the estimates.
Quality-Filtered Estimation
loglog.sh in=reads.fq k=31 minprob=0.8 out=cardinality.txt
Only count kmers with at least 80% probability of being correct, saving results to a file.
Reproducible Analysis
loglog.sh in=data.fq k=31 seed=12345 trials=5
Use a fixed seed to ensure reproducible results across multiple runs.
Algorithm Details
LogLog Algorithm Implementation
The LogLog algorithm uses O(log log n) space for cardinality estimation, where n is the cardinality. BBTools implements LogLogWrapper with LogLogThread instances for multi-threaded processing (a generic sketch of the core idea follows this list):
- Hash Function: Uses Tools.hash64shift() to map kmers into buckets, distributing elements uniformly
- Leading Zero Tracking: Each bucket tracks the maximum number of leading zeros in hashed values
- Configurable Precision: More buckets reduce variance but increase memory usage - buckets must be powers of 2
- Thread Implementation: LogLogThread instances process reads in parallel, each managing CardinalityTracker objects
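The core mechanism is compact enough to sketch directly. The following is a minimal, generic LogLog in Java, not the actual BBTools classes; the mix() function is Thomas Wang's 64-bit hash, used here as a stand-in for Tools.hash64shift():

```java
// Minimal generic LogLog sketch (not the actual BBTools classes): hash each
// kmer, pick a bucket from the top bits, and keep the maximum "rank" (1-based
// position of the first set bit in the remaining bits) seen per bucket.
public class MiniLogLog {
    private final int[] maxRank;  // one running maximum per bucket
    private final int bucketBits; // buckets = 2^bucketBits

    public MiniLogLog(int buckets) {
        assert Integer.bitCount(buckets) == 1 : "buckets must be a power of 2";
        maxRank = new int[buckets];
        bucketBits = Integer.numberOfTrailingZeros(buckets);
    }

    // Thomas Wang's 64-bit mix, in the spirit of Tools.hash64shift().
    private static long mix(long key) {
        key = (~key) + (key << 21);
        key ^= key >>> 24;
        key = (key + (key << 3)) + (key << 8);
        key ^= key >>> 14;
        key = (key + (key << 2)) + (key << 4);
        key ^= key >>> 28;
        return key + (key << 31);
    }

    public void add(long kmer) {
        long h = mix(kmer);
        int bucket = (int)(h >>> (64 - bucketBits));               // top bits choose the bucket
        int rank = Long.numberOfLeadingZeros(h << bucketBits) + 1; // zeros in the rest, 1-based
        if (rank > maxRank[bucket]) maxRank[bucket] = rank;        // (saturates at 65 if the rest is all zero; fine for a sketch)
    }

    // Classic LogLog estimate: alpha * m * 2^(mean of per-bucket max ranks),
    // with alpha ~ 0.39701 (asymptotic bias-correction constant).
    public double cardinality() {
        double sum = 0;
        for (int r : maxRank) sum += r;
        int m = maxRank.length;
        return 0.39701 * m * Math.pow(2, sum / m);
    }
}
```

Note that the per-bucket state is a single small integer, which is where the O(log log n) space bound comes from.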
Cardinality Estimation Process
The estimation process begins with CardinalityTracker.makeTracker() initialization and proceeds as follows (a combined usage sketch follows this list):
- Tracker Creation: CardinalityTracker.makeTracker(buckets, k, seed, minProb) creates the tracker instance
- Kmer Hashing: the log.hash(r) method processes each read, hashing its kmers with Tools.hash64shift()
- Leading Zero Counting: each bucket maintains the maximum leading-zero count observed among its hashed kmer values
- Cardinality Calculation: the log.cardinality() method applies a statistical formula to the per-bucket maxima
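Put together, the flow looks roughly like the sketch below. It assumes the BBTools jar is on the classpath; the class and method names come from the list above, but the exact packages, signatures, and return types are assumptions:

```java
import java.util.List;

// Sketch only: CardinalityTracker, Read, and the signatures used here are
// taken from the description above and may not match the real API exactly.
public class CardinalitySketch {
    static long estimate(List<Read> reads, int buckets, int k, long seed, float minProb) {
        // 1) Tracker creation
        CardinalityTracker log = CardinalityTracker.makeTracker(buckets, k, seed, minProb);
        // 2) Kmer hashing: each read's kmers are hashed via Tools.hash64shift()
        for (Read r : reads) {
            log.hash(r);
        }
        // 3) Cardinality calculation from the per-bucket leading-zero maxima
        return log.cardinality();
    }
}
```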
Statistical Output
When running multiple trials, LogLogWrapper calculates statistics using LongList methods (a plain-Java illustration of the same formulas follows this list):
- Harmonic Mean: results.harmonicMean() - primary cardinality estimate, most robust against outliers
- Median: results.median() - middle value of estimates, provides robust central tendency
- Range (95th - 5th percentile): results.percentile(0.95) - results.percentile(0.05) measures estimate spread
- Standard Deviation: results.stdev() measures estimate precision across trials
- Average Difference: results.avgDif(mean) - mean absolute deviation from the harmonic mean
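These helpers correspond to standard formulas; here is a self-contained plain-Java illustration (with hypothetical sample values, not the BBTools LongList class itself):

```java
import java.util.Arrays;

// Illustrates the statistics reported across trials using standard formulas;
// BBTools computes the same quantities via its LongList helper methods.
public class TrialStats {
    public static void main(String[] args) {
        long[] estimates = {995_000, 1_004_000, 998_500, 1_001_200, 996_800};

        // Harmonic mean: n / sum(1/x). It damps high outliers, which is why
        // it serves as the primary combined estimate.
        double invSum = 0;
        for (long e : estimates) invSum += 1.0 / e;
        double harmonicMean = estimates.length / invSum;

        // Median: middle element of the sorted estimates (simplified here).
        long[] sorted = estimates.clone();
        Arrays.sort(sorted);
        double median = sorted[sorted.length / 2];

        // Standard deviation around the arithmetic mean.
        double mean = Arrays.stream(estimates).average().orElse(0);
        double ss = 0;
        for (long e : estimates) ss += (e - mean) * (e - mean);
        double stdev = Math.sqrt(ss / estimates.length);

        System.out.printf("harmonic mean=%.0f  median=%.0f  stdev=%.1f%n",
                harmonicMean, median, stdev);
    }
}
```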
Performance Characteristics
- Memory Usage: Default -Xmx200m (z="-Xmx200m" in shell script), scales with bucket count and thread usage
- Processing Speed: Linear in input size, parallelized using LogLogThread array with Shared.threads() threads
- Accuracy: Standard error decreases with more buckets, approximately 1.3/√m where m is the bucket count; with the default 2048 buckets that is about 2.9%, and with 8192 buckets about 1.4%
- Space Complexity: O(log log n) where n is the actual cardinality
Quality Filtering
The minprob parameter enables quality-based filtering in CardinalityTracker (see the sketch after this list):
- Kmers below the probability threshold are excluded from counting
- Helps remove sequencing errors that would inflate cardinality estimates
- Particularly useful for raw sequencing data with quality scores
- Value of 0 includes all kmers, while higher values increase stringency
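The probability that a kmer is correct follows from its bases' Phred quality scores. Below is a minimal sketch of that standard conversion (the real BBTools computation may differ in details):

```java
// Standard Phred arithmetic, shown for illustration:
//   P(base correct) = 1 - 10^(-Q/10);  P(kmer correct) = product over k bases.
public class KmerProb {
    static double kmerCorrectProb(byte[] quals, int start, int k) {
        double p = 1.0;
        for (int i = start; i < start + k; i++) {
            p *= 1.0 - Math.pow(10, -quals[i] / 10.0);
        }
        return p; // the kmer is counted only if p >= minprob
    }

    public static void main(String[] args) {
        byte[] q = new byte[31];
        java.util.Arrays.fill(q, (byte)30);            // Q30: 0.1% error per base
        System.out.println(kmerCorrectProb(q, 0, 31)); // ~0.970 for a 31-mer
    }
}
```

By this arithmetic, a single Q7 base (roughly a 20% error rate) by itself drops a kmer's probability to about 0.8, so minprob=0.8 mostly excludes kmers spanning low-quality bases.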
Synthetic Data Generation
When synth=true, the makeRead() method generates synthetic sequences (sketched after this list):
- Uses AminoAcid.numberToBase[] for base conversion
- Generates 150-base reads with prime number modulus (32452843) for randomization
- Enables algorithm testing with known ground truth values
- Uses Shared.threadLocalRandom() for thread-safe random number generation
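One plausible design consistent with these points, as an illustration only (the real makeRead() internals may differ): drawing each read's seed modulo the prime bounds the number of distinct reads, which is what makes the ground truth predictable.

```java
import java.util.concurrent.ThreadLocalRandom;

// Illustration only, not the real makeRead(): a seed drawn modulo the prime
// 32452843 caps the number of distinct reads (and hence unique kmers), so the
// true cardinality of a synthetic dataset is known in advance.
public class SyntheticReads {
    static final long PRIME = 32452843L;
    static final byte[] NUMBER_TO_BASE = {'A', 'C', 'G', 'T'}; // cf. AminoAcid.numberToBase

    static byte[] makeRead(int length) { // 150 bases per read in loglog.sh
        // Thread-safe RNG, analogous to Shared.threadLocalRandom()
        long x = Math.floorMod(ThreadLocalRandom.current().nextLong(), PRIME);
        byte[] bases = new byte[length];
        for (int i = 0; i < length; i++) {
            // Expand the seed deterministically, so each of the <= 32452843
            // possible seeds always produces the same read.
            x = x * 6364136223846793005L + 1442695040888963407L; // LCG step
            bases[i] = NUMBER_TO_BASE[(int)(x >>> 62)];          // top 2 bits -> A/C/G/T
        }
        return bases;
    }
}
```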
Supported Formats
Input Formats
- FASTQ: Standard format with quality scores (supports quality filtering)
- FASTA: Sequence-only format
- SCARF: Illumina sequence format
- SAM/BAM: Aligned sequence formats (sequences extracted for analysis)
- Compressed: Gzip (.gz) and bzip2 (.bz2) compression supported
- Standard Input: Use 'in=stdin' with format specification (e.g., 'in=stdin.fq.gz')
File Shortcuts
The # symbol can substitute for paired-end file numbering:
loglog.sh in=read#.fq
expands to in1=read1.fq in2=read2.fq
- Useful for processing paired-end data with standard naming conventions
Support
For questions and support:
- Email: bbushnell@lbl.gov
- Documentation: bbmap.org