LogLog
Estimates the number of unique kmers (cardinality) in sequence data using the LogLog algorithm. It uses the Tools.hash64shift() function to map kmers into buckets, tracks the maximum number of leading zeros per bucket for cardinality estimation, and supports multi-threaded processing.
Basic Usage
loglog.sh in=<file> k=<31>
The LogLog algorithm estimates unique-kmer cardinality in O(log log n) space, where n is the cardinality. BBTools creates tracker instances via CardinalityTracker.makeTracker() and relies on Tools.hash64shift() for uniform hash distribution across buckets.
Parameters
Parameters control the LogLog algorithm's precision, input handling, and output options.
Input Parameters
- in=<file>
- Primary input file, or comma-delimited list of files. Also accepts 'in1' for the primary input file.
- in2=<file>
- Optional second file for paired reads. Used when processing paired-end sequencing data.
Algorithm Parameters
- k=31
- Kmer length for counting. Longer kmers provide higher specificity but may reduce sensitivity for short sequences.
- buckets=2048
- Number of buckets for counting. Higher values decrease variance for large datasets but increase memory usage. Must be a power of 2.
- seed=-1
- Seed for hash functions. A negative value selects a random seed, so results vary between runs; a fixed positive value makes results reproducible.
- seed2=-1
- Secondary seed for synthetic data generation. Used when generating synthetic sequences for testing algorithm performance.
- minprob=0
- Exclude kmers with probability of being correct below this value (0-1). Helps filter out low-quality kmers that may represent sequencing errors.
Processing Parameters
- trials=1
- Number of independent trials to run. Multiple trials provide statistical measures including mean, median, and standard deviation of estimates.
- synth=false
- Generate synthetic data instead of reading input files. Used for algorithm testing and validation with known ground truth.
- verbose=false
- Enable verbose output including detailed processing information and intermediate statistics.
- trackcounts=false
- Track and report detailed count information during processing. Also accepts 'loglogcounts', 'loglogcount', 'count', 'counts'.
Output Parameters
- out=<file>
- Output file for cardinality estimates. If not specified, results are written to stderr.
- overwrite=true
- Allow overwriting existing output files.
- append=false
- Append results to existing output file instead of overwriting.
Java Parameters
- -Xmx
- Set Java's memory usage, overriding autodetection. Example: -Xmx20g specifies 20 GB of RAM, -Xmx200m specifies 200 MB. Maximum is typically 85% of physical memory (see the combined example after this list).
- -eoom
- Exit if an out-of-memory exception occurs. Requires Java 8u92 or later.
- -da
- Disable assertions for potentially faster execution.
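For example, to cap the heap at 2 GB and exit cleanly if it is exhausted:
loglog.sh in=reads.fq k=31 -Xmx2g -eoom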
Examples
Basic Cardinality Estimation
loglog.sh in=reads.fastq k=31
Estimate the number of unique 31-mers in a FASTQ file using default parameters.
High-Precision Estimation
loglog.sh in=assembly.fasta k=21 buckets=8192
Use more buckets for higher precision when analyzing a large assembly, with shorter kmers for better coverage.
Paired-End Data Analysis
loglog.sh in=reads_R1.fq in2=reads_R2.fq k=31 buckets=4096
Process paired-end reads together to get cardinality across both files.
Multiple Trials for Statistics
loglog.sh in=data.fasta k=31 trials=10 buckets=2048
Run 10 independent trials to obtain statistical measures, including the harmonic mean, median, and standard deviation of the estimates.
Quality-Filtered Estimation
loglog.sh in=reads.fq k=31 minprob=0.8 out=cardinality.txt
Only count kmers with at least 80% probability of being correct, saving results to a file.
Reproducible Analysis
loglog.sh in=data.fq k=31 seed=12345 trials=5
Use a fixed seed to ensure reproducible results across multiple runs.
Algorithm Details
LogLog Algorithm Implementation
The LogLog algorithm uses O(log log n) space for cardinality estimation, where n is the cardinality. BBTools implements LogLogWrapper with LogLogThread instances for multi-threaded processing (a generic sketch of the core idea follows this list):
- Hash Function: Uses Tools.hash64shift() to map kmers into buckets, distributing elements uniformly
- Leading Zero Tracking: Each bucket tracks the maximum number of leading zeros in hashed values
- Configurable Precision: More buckets reduce variance but increase memory usage - buckets must be powers of 2
- Thread Implementation: LogLogThread instances process reads in parallel, each managing CardinalityTracker objects
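The core mechanism is compact enough to sketch directly. The following is a minimal, generic LogLog in Java, not the actual BBTools classes; the mix() function is Thomas Wang's 64-bit hash, used here as a stand-in for Tools.hash64shift():

```java
// Minimal generic LogLog sketch (not the actual BBTools classes): hash each
// kmer, pick a bucket from the top bits, and keep the maximum "rank" (1-based
// position of the first set bit in the remaining bits) seen per bucket.
public class MiniLogLog {
    private final int[] maxRank;  // one running maximum per bucket
    private final int bucketBits; // buckets = 2^bucketBits

    public MiniLogLog(int buckets) {
        assert Integer.bitCount(buckets) == 1 : "buckets must be a power of 2";
        maxRank = new int[buckets];
        bucketBits = Integer.numberOfTrailingZeros(buckets);
    }

    // Thomas Wang's 64-bit mix, in the spirit of Tools.hash64shift().
    private static long mix(long key) {
        key = (~key) + (key << 21);
        key ^= key >>> 24;
        key = (key + (key << 3)) + (key << 8);
        key ^= key >>> 14;
        key = (key + (key << 2)) + (key << 4);
        key ^= key >>> 28;
        return key + (key << 31);
    }

    public void add(long kmer) {
        long h = mix(kmer);
        int bucket = (int)(h >>> (64 - bucketBits));               // top bits choose the bucket
        int rank = Long.numberOfLeadingZeros(h << bucketBits) + 1; // zeros in the rest, 1-based
        if (rank > maxRank[bucket]) maxRank[bucket] = rank;        // (saturates at 65 if the rest is all zero; fine for a sketch)
    }

    // Classic LogLog estimate: alpha * m * 2^(mean of per-bucket max ranks),
    // with alpha ~ 0.39701 (asymptotic bias-correction constant).
    public double cardinality() {
        double sum = 0;
        for (int r : maxRank) sum += r;
        int m = maxRank.length;
        return 0.39701 * m * Math.pow(2, sum / m);
    }
}
```

Note that the per-bucket state is a single small integer, which is where the O(log log n) space bound comes from.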
Cardinality Estimation Process
The estimation process begins with CardinalityTracker.makeTracker() initialization and proceeds as follows (a combined usage sketch follows this list):
- Tracker Creation: CardinalityTracker.makeTracker(buckets, k, seed, minProb) creates the tracker instance
- Kmer Hashing: the log.hash(r) method processes each read, hashing its kmers with Tools.hash64shift()
- Leading Zero Counting: each bucket maintains the maximum leading-zero count observed among its hashed kmer values
- Cardinality Calculation: the log.cardinality() method applies a statistical formula to the per-bucket maxima
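Put together, the flow looks roughly like the sketch below. It assumes the BBTools jar is on the classpath; the class and method names come from the list above, but the exact packages, signatures, and return types are assumptions:

```java
import java.util.List;

// Sketch only: CardinalityTracker, Read, and the signatures used here are
// taken from the description above and may not match the real API exactly.
public class CardinalitySketch {
    static long estimate(List<Read> reads, int buckets, int k, long seed, float minProb) {
        // 1) Tracker creation
        CardinalityTracker log = CardinalityTracker.makeTracker(buckets, k, seed, minProb);
        // 2) Kmer hashing: each read's kmers are hashed via Tools.hash64shift()
        for (Read r : reads) {
            log.hash(r);
        }
        // 3) Cardinality calculation from the per-bucket leading-zero maxima
        return log.cardinality();
    }
}
```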
Statistical Output
When running multiple trials, LogLogWrapper calculates statistics using LongList methods (a plain-Java illustration of the same formulas follows this list):
- Harmonic Mean: results.harmonicMean() - primary cardinality estimate, most robust against outliers
- Median: results.median() - middle value of estimates, provides robust central tendency
- Range (95th - 5th percentile): results.percentile(0.95) - results.percentile(0.05) measures estimate spread
- Standard Deviation: results.stdev() measures estimate precision across trials
- Average Difference: results.avgDif(mean) - mean absolute deviation from the harmonic mean
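These helpers correspond to standard formulas; here is a self-contained plain-Java illustration (with hypothetical sample values, not the BBTools LongList class itself):

```java
import java.util.Arrays;

// Illustrates the statistics reported across trials using standard formulas;
// BBTools computes the same quantities via its LongList helper methods.
public class TrialStats {
    public static void main(String[] args) {
        long[] estimates = {995_000, 1_004_000, 998_500, 1_001_200, 996_800};

        // Harmonic mean: n / sum(1/x). It damps high outliers, which is why
        // it serves as the primary combined estimate.
        double invSum = 0;
        for (long e : estimates) invSum += 1.0 / e;
        double harmonicMean = estimates.length / invSum;

        // Median: middle element of the sorted estimates (simplified here).
        long[] sorted = estimates.clone();
        Arrays.sort(sorted);
        double median = sorted[sorted.length / 2];

        // Standard deviation around the arithmetic mean.
        double mean = Arrays.stream(estimates).average().orElse(0);
        double ss = 0;
        for (long e : estimates) ss += (e - mean) * (e - mean);
        double stdev = Math.sqrt(ss / estimates.length);

        System.out.printf("harmonic mean=%.0f  median=%.0f  stdev=%.1f%n",
                harmonicMean, median, stdev);
    }
}
```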
Performance Characteristics
- Memory Usage: Default -Xmx200m (z="-Xmx200m" in shell script), scales with bucket count and thread usage
- Processing Speed: Linear in input size, parallelized using LogLogThread array with Shared.threads() threads
- Accuracy: Standard error decreases with more buckets, approximately 1.3/√m where m is the bucket count; with the default 2048 buckets that is about 2.9%, and with 8192 buckets about 1.4%
- Space Complexity: O(log log n) where n is the actual cardinality
Quality Filtering
The minprob parameter enables quality-based filtering in CardinalityTracker (see the sketch after this list):
- Kmers below the probability threshold are excluded from counting
- Helps remove sequencing errors that would inflate cardinality estimates
- Particularly useful for raw sequencing data with quality scores
- Value of 0 includes all kmers, while higher values increase stringency
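The probability that a kmer is correct follows from its bases' Phred quality scores. Below is a minimal sketch of that standard conversion (the real BBTools computation may differ in details):

```java
// Standard Phred arithmetic, shown for illustration:
//   P(base correct) = 1 - 10^(-Q/10);  P(kmer correct) = product over k bases.
public class KmerProb {
    static double kmerCorrectProb(byte[] quals, int start, int k) {
        double p = 1.0;
        for (int i = start; i < start + k; i++) {
            p *= 1.0 - Math.pow(10, -quals[i] / 10.0);
        }
        return p; // the kmer is counted only if p >= minprob
    }

    public static void main(String[] args) {
        byte[] q = new byte[31];
        java.util.Arrays.fill(q, (byte)30);            // Q30: 0.1% error per base
        System.out.println(kmerCorrectProb(q, 0, 31)); // ~0.970 for a 31-mer
    }
}
```

By this arithmetic, a single Q7 base (roughly a 20% error rate) by itself drops a kmer's probability to about 0.8, so minprob=0.8 mostly excludes kmers spanning low-quality bases.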
Synthetic Data Generation
When synth=true, the makeRead() method generates synthetic sequences (sketched after this list):
- Uses AminoAcid.numberToBase[] for base conversion
- Generates 150-base reads with prime number modulus (32452843) for randomization
- Enables algorithm testing with known ground truth values
- Uses Shared.threadLocalRandom() for thread-safe random number generation
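One plausible design consistent with these points, as an illustration only (the real makeRead() internals may differ): drawing each read's seed modulo the prime bounds the number of distinct reads, which is what makes the ground truth predictable.

```java
import java.util.concurrent.ThreadLocalRandom;

// Illustration only, not the real makeRead(): a seed drawn modulo the prime
// 32452843 caps the number of distinct reads (and hence unique kmers), so the
// true cardinality of a synthetic dataset is known in advance.
public class SyntheticReads {
    static final long PRIME = 32452843L;
    static final byte[] NUMBER_TO_BASE = {'A', 'C', 'G', 'T'}; // cf. AminoAcid.numberToBase

    static byte[] makeRead(int length) { // 150 bases per read in loglog.sh
        // Thread-safe RNG, analogous to Shared.threadLocalRandom()
        long x = Math.floorMod(ThreadLocalRandom.current().nextLong(), PRIME);
        byte[] bases = new byte[length];
        for (int i = 0; i < length; i++) {
            // Expand the seed deterministically, so each of the <= 32452843
            // possible seeds always produces the same read.
            x = x * 6364136223846793005L + 1442695040888963407L; // LCG step
            bases[i] = NUMBER_TO_BASE[(int)(x >>> 62)];          // top 2 bits -> A/C/G/T
        }
        return bases;
    }
}
```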
Supported Formats
Input Formats
- FASTQ: Standard format with quality scores (supports quality filtering)
- FASTA: Sequence-only format
- SCARF: Illumina sequence format
- SAM/BAM: Aligned sequence formats (sequences extracted for analysis)
- Compressed: Gzip (.gz) and bzip2 (.bz2) compression supported
- Standard Input: Use 'in=stdin' with format specification (e.g., 'in=stdin.fq.gz')
File Shortcuts
The # symbol can substitute for paired-end file numbering:
loglog.sh in=read#.fq
expands to in1=read1.fq in2=read2.fq
- Useful for processing paired-end data with standard naming conventions
Support
For questions and support:
- Email: bbushnell@lbl.gov
- Documentation: bbmap.org