Stats

Design Purpose

Stats was created to address limitations of existing assembly analysis tools when processing large metagenomes. Prior tools could not handle the scale of modern metagenomic assemblies, which can contain millions of scaffolds and sequences of extreme length. Stats solves this scaling problem through a specialized architecture that processes assemblies of practically unbounded size in a small, fixed amount of memory.

Basic Usage

stats.sh in=<file>

Processes a single FASTA or FASTQ file (gzipped is fine) and outputs assembly statistics to stdout.

Parameters

Parameters are organized by function as defined in the shell script.

Basic Parameters

in=file: Specify the input fasta or fastq file, or use "stdin" for standard input. Supports gzipped files automatically.
out=stdout: Destination for primary statistical output. Default is stdout. Can be set to a filename for file output.
gc=file: Output file for per-scaffold base composition (ACGTN content).
gchist=file: Output file for GC content histogram showing the distribution of GC percentages across scaffolds.
shist=file: Output file for cumulative scaffold length histogram.
gcbins=200: Number of bins to use for the GC content histogram. Default is 200 bins.
n=10: Number of contiguous N characters that define a break between contigs within scaffolds. Default is 10.
k=13: K-mer length for estimating BBMap memory usage when mapping reads to this assembly. Default 13.
minscaf=0: Minimum scaffold length threshold. Scaffolds shorter than this value are excluded from statistics.
phs=f: (printheaderstats) Print the total size of FASTA/FASTQ headers in bytes.
n90=t: (printn90) Include N90/L90 metrics in output alongside N50/L50 statistics.
extended=f: Print extended metrics including L90, logsum, powersum, and assembly score. Automatically enables n90=t.
pdl=f: (printduplicatelines) Include lines in the scaffold size table where counts remain unchanged between size thresholds.
n_=t: Prefix column headers with 'n_' (e.g., 'n_scaffolds', 'n_contigs') in tabular output formats 3-6.
addname=f: Add a filename column to tabular output formats (3-6).

Logsum and Powsum Parameters

logoffset=1000: Minimum scaffold length for inclusion in logarithmic sum calculations. Shorter scaffolds are excluded.
logbase=2: Base for logarithmic calculations in logsum metric. Common values are 2 (binary) or e (natural log).
logpower=1: Exponential power applied to log values to weight longer scaffolds more heavily in logsum calculations.
powsum=0.25: Power applied to scaffold lengths in powersum metric. Values less than 1 favor longer scaffolds.

Assembly Score Metric Parameters

score=f: Calculate and display assembly score metric based on length distribution and aligned read fraction.
aligned=0.0: Fraction of reads aligned to the assembly (0.0-1.0). Adds aligned*50 bonus to assembly score.
assemblyscoreminlen=2000: Minimum scaffold length for inclusion in assembly score calculation. Shorter scaffolds are ignored.
assemblyscoremaxlen=50000: Maximum scaffold length for receiving bonus points in assembly score. Longer scaffolds don't get additional weight.

Output Format Parameters

format=<0-8>

Output format selection (default 1):

0: No assembly statistics printed
1: Human-readable format with units (KB, MB) for tool compatibility
2: Machine-parseable format with raw base counts, no commas
3: Tab-delimited table with header and data rows
4: Tab-delimited scaffold data only
5: Tab-delimited contig data only
6: Tab-delimited with comment-prefixed header (#)
7: Human-readable contig information only
8: JSON format output (also triggered by json=t)

gcformat=<0-5>

GC content output format (default 1):

0: No base composition information
1: Full format: name, length, A, C, G, T, N, IUPAC, Other, GC
2: Minimal format: name, GC percentage only
4: Compact format: name, length, GC percentage
5: Extended format: name, length, GC, logsum, powersum

Note: In format 1, A+C+G+T fractions sum to 1.0 even when N > 0.

Examples

Basic Assembly Statistics

# Generate basic statistics for an assembly
stats.sh in=contigs.fa

Outputs statistics including N50/L50, scaffold counts, and gap percentages to stdout.

Per-Sequence GC Content

# Print GC and length information per sequence
stats.sh in=contigs.fa gc=gc.txt gcformat=4

Outputs per-scaffold GC content and length data to a separate file.

Multiple Assembly Comparison

# Compare multiple assemblies using wrapper
statswrapper.sh in=a.fa,b.fa,c.fa format=6

Compares statistics from multiple assemblies in tab-delimited format with commented headers.

Extended Statistics with BBMap Memory Estimation

# Include extended metrics and BBMap memory estimate
stats.sh in=contigs.fa extended=t k=31

Calculates logsum, powersum, and estimates BBMap memory requirements for k=31.

Assembly Score with Alignment Data

# Calculate assembly score with known alignment rate
stats.sh in=contigs.fa score=t aligned=0.85

Computes assembly score incorporating the fraction of reads that align to the assembly.

Algorithm Details

Architectural Design

Stats uses a deliberately single-threaded architecture, unlike other BBTools. This design choice eliminates garbage collection overhead and independent I/O thread complexity, contributing to its fixed memory usage and predictable performance characteristics. The single-threaded approach processes assemblies sequentially but maintains consistent memory usage regardless of input size.

Dual Data Structure Strategy

AssemblyStats2 employs a dual data structure approach based on length-dependent storage selection:

LongList Arrays (clist, slist, sclist1, sclist2): For scaffolds/contigs shorter than cutoff threshold (default 32KB), uses LongList.increment() method with direct array indexing where array[length] stores count of sequences with that length.
ArrayList Storage (llist, tlist): For sequences ≥cutoff length, stores individual values in ArrayList<Long> (llist) and ArrayList<Triple> (tlist) to avoid sparse array memory waste.

This hybrid approach provides constant-time access for common short sequences while maintaining memory efficiency for large assemblies with diverse scaffold sizes.

N50/L50 Convention Clarification

Important: BBTools uses a different N50/L50 convention than many other tools:

BBTools Convention: N50 = number of scaffolds, L50 = length threshold
Standard Convention: N50 = length threshold, L50 = number of scaffolds

This reversed definition was chosen by Brian Bushnell to be more intuitive: "N" represents a Number of sequences, while "L" represents a Length threshold. Be aware of this difference when comparing results with other assembly analysis tools.

Statistical Metrics

Logsum Calculation

The logsum metric uses calcLogSumScaffolds() method with logarithmic scaling:

Only scaffolds ≥ logSumOffset (default 1000bp) are included using length filter
Formula: (1/Math.log(logSumBase)) × Math.log(length) × length, summed across qualifying scaffolds
Optional power scaling: Math.pow(log × 0.12, logPower) when logPower ≠ 1
Alternative square mode: log × log × 0.05 when squareLog flag is enabled

Powersum Calculation

The powersum metric uses calcPowerSumScaffolds() method with power-law scaling:

Formula: (1/Math.pow(1000, powSumPower)) × Math.pow(length, powSumPower) × length
Default powSumPower of 0.25 weights longer scaffolds moderately
Lower powers increase preference for longer sequences in scoring
Normalized by logSumOffset (default 1000bp) in powersum() helper method

Assembly Score Algorithm

The assembly score uses calcAssemblyScore() method combining length distribution with read alignment quality:

Base score calculation: count × (length × Tools.min(assemblyScoreMaxLen, length)) × mult, where mult = 1.0/(1000 × lengthSum)
Length weighting: scaffolds between assemblyScoreMinLen and assemblyScoreMaxLen receive proportional scores
Alignment bonus: alignedFraction × 50 added to base score using supplied aligned parameter
Processes both LongList arrays and ArrayList<Triple> for complete coverage

Sequence Processing

Contig Detection Algorithm

Contigs within scaffolds are identified by N-character runs using byte-level character analysis:

Tracks consecutive N characters using ns counter variable and charToNum[c] mapping
When ns reaches maxNs threshold (default 10), currentContigs.set(contigs, contiglen) records contig boundary
Maintains separate LongList data structures for scaffolds (slist) vs. constituent contigs (clist)
Uses identical parsing logic for both countFasta() and countFastq() methods

GC Content Analysis

Base composition analysis uses histogram binning with Tools.standardDeviationHistogram():

Calculates GC percentage: (counts[1] + counts[2]) / (counts[0] + counts[1] + counts[2] + counts[3]) where counts[] maps A,C,G,T
Bins scaffolds using index = Tools.min((gc × gcbins2)/acgt, gcbins2-1) for gchistArray[index]
Tracks scaffold counts in gchistArray[] and total bases in gchist_by_base[] per GC bin
Generates gc_std using Tools.standardDeviationHistogram(gchistArray)/gcbins2 normalization

Memory Management and Scalability

The algorithm implements several memory management strategies that enable processing of assemblies with practically unbounded size:

Fixed Buffer Processing: Uses byte[] buf = new byte[32768] for consistent 32KB read chunks via is.read(buf)
Pre-allocated Data Structures: LongList constructors use Tools.min(1<<15, cutoff+1) for size optimization
Character-to-Number Mapping: Uses charToNum[] byte array for direct nucleotide encoding without string operations
KillSwitch Memory Management: Uses KillSwitch.allocLong1D(8) for counts[] array allocation
Constant Memory Usage: Fixed 120MB RAM regardless of assembly size - no memory growth with input size

Performance Characteristics

Memory Usage: Fixed 120MB RAM regardless of assembly size
Threading Model: Deliberately single-threaded with no garbage collection or independent I/O threads
Scalability: Handles assemblies of practically unbounded size with sequences of practically unbounded length
Processing Model: Single-pass sequential reading with 32KB buffer chunks
Data Structure Selection: Length-dependent storage (arrays vs. lists) based on cutoff threshold

Support

For questions and support:

Email: bbushnell@lbl.gov
Documentation: bbmap.org
Guide: Read bbtools/docs/guides/StatsGuide.txt for additional information