LogLog

Script: loglog.sh Package: cardinality Class: LogLogWrapper.java

Estimates cardinality of unique kmers in sequence data using the LogLog algorithm. Uses Tools.hash64shift() function to map kmers into buckets, tracking maximum leading zeros for cardinality estimation and supports multi-threaded processing.

Basic Usage

loglog.sh in=<file> k=<31>

The LogLog algorithm estimates unique kmer cardinality using O(log log n) space complexity where n is the cardinality. Uses CardinalityTracker.makeTracker() to create tracker instances with Tools.hash64shift() for uniform hash distribution across buckets.

Parameters

Parameters control the LogLog algorithm's precision, input handling, and output options.

Input Parameters

in=<file>
Primary input file, or comma-delimited list of files. Also accepts 'in1' for the primary input file.
in2=<file>
Optional second file for paired reads. Used when processing paired-end sequencing data.

Algorithm Parameters

k=31
Kmer length for counting. Longer kmers provide higher specificity but may reduce sensitivity for short sequences.
buckets=2048
Number of buckets for counting. Higher values decrease variance for large datasets but increase memory usage. Must be a power of 2.
seed=-1
Seed for hash functions. A negative number forces a random seed, ensuring different results on each run. Positive values enable reproducible results.
seed2=-1
Secondary seed for synthetic data generation. Used when generating synthetic sequences for testing algorithm performance.
minprob=0
Exclude kmers with probability of being correct below this value (0-1). Helps filter out low-quality kmers that may represent sequencing errors.

Processing Parameters

trials=1
Number of independent trials to run. Multiple trials provide statistical measures including mean, median, and standard deviation of estimates.
synth=false
Generate synthetic data instead of reading input files. Used for algorithm testing and validation with known ground truth.
verbose=false
Enable verbose output including detailed processing information and intermediate statistics.
trackcounts=false
Track and report detailed count information during processing. Also accepts 'loglogcounts', 'loglogcount', 'count', 'counts'.

Output Parameters

out=<file>
Output file for cardinality estimates. If not specified, results are written to stderr.
overwrite=true
Allow overwriting existing output files.
append=false
Append results to existing output file instead of overwriting.

Java Parameters

-Xmx
Set Java's memory usage, overriding autodetection. Example: -Xmx20g specifies 20 GB of RAM, -Xmx200m specifies 200 MB. Maximum is typically 85% of physical memory.
-eoom
Exit if an out-of-memory exception occurs. Requires Java 8u92 or later.
-da
Disable assertions for potentially faster execution.

Examples

Basic Cardinality Estimation

loglog.sh in=reads.fastq k=31

Estimate the number of unique 31-mers in a FASTQ file using default parameters.

High-Precision Estimation

loglog.sh in=assembly.fasta k=21 buckets=8192

Use more buckets for higher precision when analyzing a large assembly, with shorter kmers for better coverage.

Paired-End Data Analysis

loglog.sh in=reads_R1.fq in2=reads_R2.fq k=31 buckets=4096

Process paired-end reads together to get cardinality across both files.

Multiple Trials for Statistics

loglog.sh in=data.fasta k=31 trials=10 buckets=2048

Run 10 independent trials to obtain statistical measures including mean, median, and confidence intervals.

Quality-Filtered Estimation

loglog.sh in=reads.fq k=31 minprob=0.8 out=cardinality.txt

Only count kmers with at least 80% probability of being correct, saving results to a file.

Reproducible Analysis

loglog.sh in=data.fq k=31 seed=12345 trials=5

Use a fixed seed to ensure reproducible results across multiple runs.

Algorithm Details

LogLog Algorithm Implementation

The LogLog algorithm uses O(log log n) space complexity for cardinality estimation where n is the cardinality. BBTools implements LogLogWrapper with LogLogThread instances for multi-threaded processing:

Cardinality Estimation Process

The estimation process follows CardinalityTracker.makeTracker() initialization:

  1. Tracker Creation: CardinalityTracker.makeTracker(buckets, k, seed, minProb) creates tracker instance
  2. Kmer Hashing: log.hash(r) method processes each read using Tools.hash64shift() function
  3. Leading Zero Counting: Each bucket maintains maximum leading zeros from hashed kmer values
  4. Cardinality Calculation: log.cardinality() method applies statistical formula to bucket counts

Statistical Output

When running multiple trials, LogLogWrapper calculates statistics using LongList methods:

Performance Characteristics

Quality Filtering

The minprob parameter enables quality-based filtering in CardinalityTracker:

Synthetic Data Generation

When synth=true, makeRead() method generates synthetic sequences:

Supported Formats

Input Formats

File Shortcuts

The # symbol can substitute for paired-end file numbering:

Support

For questions and support: