Stats3

Basic Usage

stats3.sh in=file
stats3.sh in=file,file
stats3.sh file file file

STATS3 is an assembly statistics tool that processes multiple FASTA files and outputs statistics in a tabular format. Unlike the STATS tool, STATS3 focuses on basic metrics and supports batch processing of multiple assemblies.

Parameters

STATS3 accepts minimal parameters focused on input/output specification. Multiple files can be processed in a single run.

Input/Output Parameters

in=file: Specify the input FASTA file(s), or stdin. Multiple files can be comma-separated or listed without the 'in=' flag. Each file will be processed and reported separately in the output table.
out=stdout: Destination of primary output; may be directed to a file. Output is in tab-delimited format suitable for spreadsheet analysis or further processing.

Output Format

STATS3 produces a tab-delimited table with the following columns:

fname   size    contigs gc      maxContig       5kplus  10kplus 25kplus 50kplus

fname: Input filename
size: Total assembly size in bases
contigs: Number of contigs/scaffolds
gc: GC content as decimal fraction (e.g., 0.432 = 43.2%)
maxContig: Length of longest contig in bases
5kplus: Total length of contigs ≥5,000 bp
10kplus: Total length of contigs ≥10,000 bp
25kplus: Total length of contigs ≥25,000 bp
50kplus: Total length of contigs ≥50,000 bp
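
The column layout above can be read back with a standard tab-delimited parser. The following is a minimal sketch, assuming the header line appears exactly as shown above; the function name parse_stats3 and the toy.fasta row are hypothetical and not part of STATS3.

```python
import csv
import io

# Columns holding integer values; "gc" is the only fractional column.
INT_COLUMNS = ("size", "contigs", "maxContig", "5kplus", "10kplus", "25kplus", "50kplus")

def parse_stats3(text):
    """Parse tab-delimited STATS3-style output into a list of typed dicts."""
    rows = []
    for record in csv.DictReader(io.StringIO(text), delimiter="\t"):
        row = {"fname": record["fname"], "gc": float(record["gc"])}
        for name in INT_COLUMNS:
            row[name] = int(record[name])
        rows.append(row)
    return rows

# Made-up output for one tiny assembly (illustrative values only).
example = (
    "fname\tsize\tcontigs\tgc\tmaxContig\t5kplus\t10kplus\t25kplus\t50kplus\n"
    "toy.fasta\t30000\t3\t0.500\t20000\t28000\t20000\t0\t0\n"
)
rows = parse_stats3(example)
print(rows[0]["fname"], rows[0]["size"], rows[0]["gc"])
```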

Examples

Single File Analysis

stats3.sh in=assembly.fasta

Processes a single assembly file and outputs statistics to stdout.

Multiple Files with Comma Separation

stats3.sh in=assembly1.fasta,assembly2.fasta,assembly3.fasta

Processes three assembly files in sequence, outputting one row per file.

Multiple Files as Arguments

stats3.sh assembly1.fasta assembly2.fasta assembly3.fasta

Alternative syntax for processing multiple files without the 'in=' flag.

Output to File

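stats3.sh *.fasta out=assembly_stats.tsv

Processes all FASTA files in the current directory and saves results to a TSV file. The shell expands *.fasta into a list of file arguments, so the wildcard is given without the 'in=' prefix.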

Sample Output

fname            size     contigs  gc     maxContig  5kplus   10kplus  25kplus  50kplus
assembly1.fasta  4832156  2341     0.423  84632      3921043  3456789  2987654  2134567
assembly2.fasta  3654289  1876     0.456  126784     3201456  2876543  2345678  1876543

Example output showing statistics for two assemblies.

Algorithm Details

Assembly Class Implementation

STATS3 uses the Assembly class, which parses each FASTA file in a single pass and analyzes sequences line by line.

File Processing Strategy

Single-pass parsing: Reads FASTA files once, processing headers and sequences line-by-line
IntList data structure: Uses IntList class for contig lengths, avoiding storage of full sequences
Descending sort: Contig lengths are sorted in descending order using IntList.sort() and IntList.reverse() methods
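
The strategy above can be sketched in a few lines. This is an illustrative Python approximation, not the Assembly class itself: a plain list stands in for IntList, and the function name contig_lengths is hypothetical.

```python
def contig_lengths(lines):
    """Return contig lengths, largest first, from an iterable of FASTA lines."""
    lengths = []
    current = None  # None until the first header line is seen
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            # A new header closes the previous contig, if there was one.
            if current is not None:
                lengths.append(current)
            current = 0
        elif current is not None:
            # Only the length is accumulated; the sequence itself is discarded.
            current += len(line)
    if current is not None:
        lengths.append(current)
    # Ascending sort followed by a reversal, mirroring the sort/reverse pair above.
    lengths.sort()
    lengths.reverse()
    return lengths

fasta = [">c1", "ACGT", "ACG", ">c2", "ACGTACGTAC", ">c3", "AC"]
print(contig_lengths(fasta))  # [10, 7, 2]
```

Because only integers are retained, memory grows with the number of contigs rather than with total assembly size.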

Base Composition Analysis

baseToACGTNIO array: Uses 128-element byte array to map bases to categories A, C, G, T, U, N, IUPAC ambiguity codes, and other characters
Case-insensitive: Both uppercase and lowercase bases are recognized
GC calculation: Uses gc() method that divides (G+C) by (A+T+U+G+C), excluding N and IUPAC ambiguity codes
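
A rough sketch of the lookup-table approach and the GC formula described above. The category numbering, the list of IUPAC ambiguity letters, and the names build_table and gc_fraction are assumptions made for illustration; the actual baseToACGTNIO layout may differ.

```python
# Assumed category indices (illustrative only).
A, C, G, T, U, N, IUPAC, OTHER = range(8)

def build_table():
    """Build a 128-entry table mapping ASCII codes to base categories."""
    table = [OTHER] * 128
    for base, category in (("A", A), ("C", C), ("G", G), ("T", T), ("U", U), ("N", N)):
        table[ord(base)] = category
        table[ord(base.lower())] = category  # case-insensitive
    for base in "RYSWKMBDHV":  # IUPAC ambiguity codes other than N
        table[ord(base)] = IUPAC
        table[ord(base.lower())] = IUPAC
    return table

TABLE = build_table()

def gc_fraction(sequence):
    """Return (G+C) / (A+T+U+G+C), ignoring N, ambiguity codes, and other symbols."""
    counts = [0] * 8
    for char in sequence:
        code = ord(char)
        counts[TABLE[code] if code < 128 else OTHER] += 1
    defined = counts[A] + counts[C] + counts[G] + counts[T] + counts[U]
    return (counts[G] + counts[C]) / defined if defined else 0.0

print(gc_fraction("ACGTNNacgtRY"))  # 0.5
```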

Length Threshold Calculations

Sorted processing: Since contigs are sorted by length (largest first), thresholding stops at first contig below threshold
Cumulative sums: Length thresholds (5k+, 10k+, etc.) represent total sequence length in contigs meeting criteria
O(n) complexity: After the one-time sort, each threshold calculation scans contigs only until the first one below the cutoff, so it visits at most n contigs
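
The early-exit summation described above might look like the following sketch. It assumes the lengths are already sorted largest first; the function name length_at_or_above and the example lengths are hypothetical.

```python
def length_at_or_above(sorted_lengths, threshold):
    """Sum contig lengths >= threshold, given lengths sorted largest first."""
    total = 0
    for length in sorted_lengths:
        if length < threshold:
            break  # every remaining contig is shorter, so stop early
        total += length
    return total

lengths = [60000, 30000, 12000, 7000, 900]
for cutoff in (5000, 10000, 25000, 50000):
    print(cutoff, length_at_or_above(lengths, cutoff))
# 5000 109000
# 10000 102000
# 25000 90000
# 50000 60000
```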

Performance Characteristics

Memory usage: 120MB default heap size; only contig lengths are kept (in an IntList), not sequences
File I/O: Single-pass file reading with ByteFile.nextLine() method
Batch processing: Processes multiple assemblies using for-loop iteration with processInner() method calls
Output format: Tab-delimited format using ByteStreamWriter with tab() and nl() methods

Differences from STATS

Simplified metrics: Focuses on essential statistics without N50/L50 calculations
Tabular output: Always produces machine-readable tabular format
Multiple file orientation: Designed specifically for comparative analysis of multiple assemblies
Development status: Currently marked as "in progress" - additional features may be added

Use Cases

Assembly Comparison

Compare multiple assemblies from different assemblers or parameter sets to identify the best performing assembly based on key metrics.

Quality Assessment Pipeline

Integrate into automated pipelines for assembly quality assessment across large datasets. Each run emits a header line (makeHeader()) followed by one row per file (processInner()).

Batch Processing

Process many assemblies in a single run within the default 120MB heap, producing one tab-delimited table.

Downstream Analysis

Generate input data for plotting tools, statistical analysis, or assembly selection criteria.
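
As one possible downstream step, parsed rows can be ranked by a derived metric. The sketch below uses the two rows from the Sample Output section and ranks assemblies by the fraction of bases in contigs of at least 50,000 bp. This metric and the function name fraction_in_long_contigs are illustrative choices, not something STATS3 computes or recommends.

```python
# Values copied from the Sample Output section; only the needed columns are kept.
rows = [
    {"fname": "assembly1.fasta", "size": 4832156, "50kplus": 2134567},
    {"fname": "assembly2.fasta", "size": 3654289, "50kplus": 1876543},
]

def fraction_in_long_contigs(row, column="50kplus"):
    """Fraction of total assembly size found in contigs counted by `column`."""
    return row[column] / row["size"] if row["size"] else 0.0

# Highest fraction first.
ranked = sorted(rows, key=fraction_in_long_contigs, reverse=True)
for row in ranked:
    print(f'{row["fname"]}\t{fraction_in_long_contigs(row):.3f}')
```

With these sample values, assembly2.fasta ranks first even though it is the smaller assembly, which shows why a ratio can be more informative than the raw 50kplus column.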

Technical Notes

Input Requirements

FASTA format files (compressed files supported)
Files must be accessible at specified paths
No special formatting requirements - standard FASTA headers and sequences

Output Behavior

Header line printed once at beginning of output
One data line per input file
Tab-delimited format for easy parsing
GC content is reported as a decimal fraction with three decimal places

Error Handling

Skips files that cannot be read
Reports processing status to stderr
Continues processing remaining files if one file fails
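
The output and error-handling behavior described above (header once, one row per file, unreadable files skipped with a message on stderr) can be sketched as follows. This is a hedged illustration, not the STATS3 implementation: run_batch, fake_summarize, and the file names are hypothetical stand-ins.

```python
import io
import sys

HEADER = ("fname", "size", "contigs", "gc", "maxContig",
          "5kplus", "10kplus", "25kplus", "50kplus")

def run_batch(paths, summarize, out=sys.stdout, err=sys.stderr):
    """Write one header, then one row per readable file; return skipped paths."""
    out.write("\t".join(HEADER) + "\n")
    skipped = []
    for path in paths:
        try:
            row = summarize(path)  # hypothetical per-file statistics routine
        except OSError as exc:
            # Report the problem and move on to the next file.
            err.write(f"Skipping {path}: {exc}\n")
            skipped.append(path)
            continue
        out.write("\t".join(str(value) for value in row) + "\n")
    return skipped

def fake_summarize(path):
    """Stand-in for real per-file statistics; fails for one made-up path."""
    if path == "missing.fasta":
        raise FileNotFoundError(path)
    return (path, 30000, 3, "0.500", 20000, 28000, 20000, 0, 0)

out, err = io.StringIO(), io.StringIO()
skipped = run_batch(["a.fasta", "missing.fasta", "b.fasta"], fake_summarize, out, err)
print(out.getvalue(), end="")
```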

Support

For questions and support:

Email: bbushnell@lbl.gov
Documentation: bbmap.org