Stats
Generates assembly statistics including scaffold count, N50/L50, GC content, and gap percentage for single files. Designed to replace prior tools that could not scale to large metagenomes, handling assemblies and sequences of practically unbounded size. For analyzing multiple files simultaneously, use statswrapper.sh.
Design Purpose
Stats was created to address limitations of existing assembly analysis tools when processing large metagenomes. Prior tools could not handle the scale of modern metagenomic assemblies, which can contain millions of scaffolds and sequences of extreme length. Stats solves this scaling problem through a specialized architecture that processes assemblies of practically unbounded size in a small, fixed amount of memory.
Basic Usage
stats.sh in=<file>
Processes a single FASTA or FASTQ file (gzipped is fine) and outputs assembly statistics to stdout.
Parameters
Parameters are organized by function as defined in the shell script.
Basic Parameters
- in=file
- Specify the input fasta or fastq file, or use "stdin" for standard input. Supports gzipped files automatically.
- out=stdout
- Destination for primary statistical output. Default is stdout. Can be set to a filename for file output.
- gc=file
- Output file for per-scaffold base composition (ACGTN content).
- gchist=file
- Output file for GC content histogram showing the distribution of GC percentages across scaffolds.
- shist=file
- Output file for cumulative scaffold length histogram.
- gcbins=200
- Number of bins to use for the GC content histogram. Default is 200 bins.
- n=10
- Number of contiguous N characters that define a break between contigs within scaffolds. Default is 10.
- k=13
- K-mer length for estimating BBMap memory usage when mapping reads to this assembly. Default 13.
- minscaf=0
- Minimum scaffold length threshold. Scaffolds shorter than this value are excluded from statistics.
- phs=f
- (printheaderstats) Print the total size of FASTA/FASTQ headers in bytes.
- n90=t
- (printn90) Include N90/L90 metrics in output alongside N50/L50 statistics.
- extended=f
- Print extended metrics including L90, logsum, powersum, and assembly score. Automatically enables n90=t.
- pdl=f
- (printduplicatelines) Include lines in the scaffold size table where counts remain unchanged between size thresholds.
- n_=t
- Prefix column headers with 'n_' (e.g., 'n_scaffolds', 'n_contigs') in tabular output formats 3-6.
- addname=f
- Add a filename column to tabular output formats (3-6).
Logsum and Powsum Parameters
- logoffset=1000
- Minimum scaffold length for inclusion in logarithmic sum calculations. Shorter scaffolds are excluded.
- logbase=2
- Base for logarithmic calculations in logsum metric. Common values are 2 (binary) or e (natural log).
- logpower=1
- Exponential power applied to log values to weight longer scaffolds more heavily in logsum calculations.
- powsum=0.25
- Power applied to scaffold lengths in powersum metric. Values less than 1 favor longer scaffolds.
Assembly Score Metric Parameters
- score=f
- Calculate and display assembly score metric based on length distribution and aligned read fraction.
- aligned=0.0
- Fraction of reads aligned to the assembly (0.0-1.0). Adds aligned*50 bonus to assembly score.
- assemblyscoreminlen=2000
- Minimum scaffold length for inclusion in assembly score calculation. Shorter scaffolds are ignored.
- assemblyscoremaxlen=50000
- Maximum scaffold length for receiving bonus points in assembly score. Longer scaffolds don't get additional weight.
Output Format Parameters
- format=<0-8>
- Output format selection (default 1):
- 0: No assembly statistics printed
- 1: Human-readable format with units (KB, MB) for tool compatibility
- 2: Machine-parseable format with raw base counts, no commas
- 3: Tab-delimited table with header and data rows
- 4: Tab-delimited scaffold data only
- 5: Tab-delimited contig data only
- 6: Tab-delimited with comment-prefixed header (#)
- 7: Human-readable contig information only
- 8: JSON format output (also triggered by json=t)
- gcformat=<0-5>
- GC content output format (default 1):
- 0: No base composition information
- 1: Full format: name, length, A, C, G, T, N, IUPAC, Other, GC
- 2: Minimal format: name, GC percentage only
- 4: Compact format: name, length, GC percentage
- 5: Extended format: name, length, GC, logsum, powersum
Examples
Basic Assembly Statistics
# Generate basic statistics for an assembly
stats.sh in=contigs.fa
Outputs statistics including N50/L50, scaffold counts, and gap percentages to stdout.
Per-Sequence GC Content
# Print GC and length information per sequence
stats.sh in=contigs.fa gc=gc.txt gcformat=4
Outputs per-scaffold GC content and length data to a separate file.
Multiple Assembly Comparison
# Compare multiple assemblies using wrapper
statswrapper.sh in=a.fa,b.fa,c.fa format=6
Compares statistics from multiple assemblies in tab-delimited format with commented headers.
Extended Statistics with BBMap Memory Estimation
# Include extended metrics and BBMap memory estimate
stats.sh in=contigs.fa extended=t k=31
Calculates logsum, powersum, and estimates BBMap memory requirements for k=31.
Assembly Score with Alignment Data
# Calculate assembly score with known alignment rate
stats.sh in=contigs.fa score=t aligned=0.85
Computes assembly score incorporating the fraction of reads that align to the assembly.
Algorithm Details
Architectural Design
Stats uses a deliberately single-threaded architecture, unlike other BBTools. This design choice eliminates garbage collection overhead and independent I/O thread complexity, contributing to its fixed memory usage and predictable performance characteristics. The single-threaded approach processes assemblies sequentially but maintains consistent memory usage regardless of input size.
Dual Data Structure Strategy
AssemblyStats2 employs a dual data structure approach based on length-dependent storage selection:
- LongList Arrays (clist, slist, sclist1, sclist2): For scaffolds/contigs shorter than cutoff threshold (default 32KB), uses LongList.increment() method with direct array indexing where array[length] stores count of sequences with that length.
- ArrayList Storage (llist, tlist): For sequences ≥cutoff length, stores individual values in ArrayList<Long> (llist) and ArrayList<Triple> (tlist) to avoid sparse array memory waste.
This hybrid approach provides constant-time access for common short sequences while maintaining memory efficiency for large assemblies with diverse scaffold sizes.
N50/L50 Convention Clarification
Important: BBTools uses a different N50/L50 convention than many other tools:
- BBTools Convention: N50 = number of scaffolds, L50 = length threshold
- Standard Convention: N50 = length threshold, L50 = number of scaffolds
This reversed definition was chosen by Brian Bushnell to be more intuitive: "N" represents a Number of sequences, while "L" represents a Length threshold. Be aware of this difference when comparing results with other assembly analysis tools.
Statistical Metrics
Logsum Calculation
The logsum metric uses calcLogSumScaffolds() method with logarithmic scaling:
- Only scaffolds ≥ logSumOffset (default 1000bp) are included using length filter
- Formula: (1/Math.log(logSumBase)) × Math.log(length) × length, summed across qualifying scaffolds
- Optional power scaling: Math.pow(log × 0.12, logPower) when logPower ≠ 1
- Alternative square mode: log × log × 0.05 when squareLog flag is enabled
Powersum Calculation
The powersum metric uses calcPowerSumScaffolds() method with power-law scaling:
- Formula: (1/Math.pow(1000, powSumPower)) × Math.pow(length, powSumPower) × length
- Default powSumPower of 0.25 weights longer scaffolds moderately
- Lower powers increase preference for longer sequences in scoring
- Normalized by logSumOffset (default 1000bp) in powersum() helper method
Assembly Score Algorithm
The assembly score uses calcAssemblyScore() method combining length distribution with read alignment quality:
- Base score calculation: count × (length × Tools.min(assemblyScoreMaxLen, length)) × mult, where mult = 1.0/(1000 × lengthSum)
- Length weighting: scaffolds between assemblyScoreMinLen and assemblyScoreMaxLen receive proportional scores
- Alignment bonus: alignedFraction × 50 added to base score using supplied aligned parameter
- Processes both LongList arrays and ArrayList<Triple> for complete coverage
Sequence Processing
Contig Detection Algorithm
Contigs within scaffolds are identified by N-character runs using byte-level character analysis:
- Tracks consecutive N characters using ns counter variable and charToNum[c] mapping
- When ns reaches maxNs threshold (default 10), currentContigs.set(contigs, contiglen) records contig boundary
- Maintains separate LongList data structures for scaffolds (slist) vs. constituent contigs (clist)
- Uses identical parsing logic for both countFasta() and countFastq() methods
GC Content Analysis
Base composition analysis uses histogram binning with Tools.standardDeviationHistogram():
- Calculates GC percentage: (counts[1] + counts[2]) / (counts[0] + counts[1] + counts[2] + counts[3]) where counts[] maps A,C,G,T
- Bins scaffolds using index = Tools.min((gc × gcbins2)/acgt, gcbins2-1) for gchistArray[index]
- Tracks scaffold counts in gchistArray[] and total bases in gchist_by_base[] per GC bin
- Generates gc_std using Tools.standardDeviationHistogram(gchistArray)/gcbins2 normalization
Memory Management and Scalability
The algorithm implements several memory management strategies that enable processing of assemblies with practically unbounded size:
- Fixed Buffer Processing: Uses byte[] buf = new byte[32768] for consistent 32KB read chunks via is.read(buf)
- Pre-allocated Data Structures: LongList constructors use Tools.min(1<<15, cutoff+1) for size optimization
- Character-to-Number Mapping: Uses charToNum[] byte array for direct nucleotide encoding without string operations
- KillSwitch Memory Management: Uses KillSwitch.allocLong1D(8) for counts[] array allocation
- Constant Memory Usage: Fixed 120MB RAM regardless of assembly size - no memory growth with input size
Performance Characteristics
- Memory Usage: Fixed 120MB RAM regardless of assembly size
- Threading Model: Deliberately single-threaded with no garbage collection or independent I/O threads
- Scalability: Handles assemblies of practically unbounded size with sequences of practically unbounded length
- Processing Model: Single-pass sequential reading with 32KB buffer chunks
- Data Structure Selection: Length-dependent storage (arrays vs. lists) based on cutoff threshold
Support
For questions and support:
- Email: bbushnell@lbl.gov
- Documentation: bbmap.org
- Guide: Read bbtools/docs/guides/StatsGuide.txt for additional information