Stats3

Script: stats3.sh Package: jgi Class: AssemblyStats3.java

In progress. Generates basic assembly statistics for multiple files, using the Assembly class to produce tabular output.

Basic Usage

stats3.sh in=file
stats3.sh in=file,file
stats3.sh file file file

STATS3 is an assembly statistics tool that processes multiple FASTA files and outputs statistics in a tabular format. Unlike the STATS tool, STATS3 focuses on basic metrics and supports batch processing of multiple assemblies.

Parameters

STATS3 accepts minimal parameters focused on input/output specification. Multiple files can be processed in a single run.

Input/Output Parameters

in=file
Specify the input FASTA file(s), or stdin. Multiple files can be comma-separated or listed without the 'in=' flag. Each file will be processed and reported separately in the output table.
out=stdout
Destination of primary output; may be directed to a file. Output is in tab-delimited format suitable for spreadsheet analysis or further processing.

Output Format

STATS3 produces a tab-delimited table with the following columns:

fname   size    contigs gc      maxContig       5kplus  10kplus 25kplus 50kplus
fname
Input filename
size
Total assembly size in bases
contigs
Number of contigs/scaffolds
gc
GC content as decimal fraction (e.g., 0.432 = 43.2%)
maxContig
Length of longest contig in bases
5kplus
Total length of contigs ≥5,000 bp
10kplus
Total length of contigs ≥10,000 bp
25kplus
Total length of contigs ≥25,000 bp
50kplus
Total length of contigs ≥50,000 bp
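
The column definitions above can be illustrated with a short awk-based sketch. It is not the Assembly class's implementation, only a minimal approximation under stated assumptions: fasta_row is a hypothetical helper, GC is taken as the fraction of G/C bases over all bases (the tool may treat ambiguous bases differently), and each "plus" column sums the lengths of contigs at or above its threshold.

```shell
# Hypothetical helper (not part of BBTools): print one stats row for a FASTA file.
# Columns: fname size contigs gc maxContig 5kplus 10kplus 25kplus 50kplus
fasta_row() {
  awk -v fname="$1" '
    # Fold the sequence just finished into the running totals.
    function finish_seq() {
      if (!started) return
      contigs++
      size += len
      if (len > max) max = len
      if (len >= 5000)  k5  += len
      if (len >= 10000) k10 += len
      if (len >= 25000) k25 += len
      if (len >= 50000) k50 += len
      len = 0
    }
    /^>/    { finish_seq(); started = 1; next }   # a header starts a new sequence
    started {
      sub(/\r$/, "")                 # tolerate Windows line endings
      len += length($0)
      gc  += gsub(/[GCgc]/, "")      # gsub returns the number of replacements
    }
    END {
      finish_seq()
      # Assumption: GC is computed over all bases, including any N.
      printf "%s\t%d\t%d\t%.3f\t%d\t%d\t%d\t%d\t%d\n",
             fname, size, contigs, (size ? gc / size : 0), max, k5, k10, k25, k50
    }
  ' "$1"
}
```

Calling fasta_row assembly.fasta prints a single tab-delimited row in the column order listed above; it reads the file once, line by line, and keeps only running totals in memory.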

Examples

Single File Analysis

stats3.sh in=assembly.fasta

Processes a single assembly file and outputs statistics to stdout.

Multiple Files with Comma Separation

stats3.sh in=assembly1.fasta,assembly2.fasta,assembly3.fasta

Processes three assembly files in sequence, outputting one row per file.

Multiple Files as Arguments

stats3.sh assembly1.fasta assembly2.fasta assembly3.fasta

Alternative syntax for processing multiple files without 'in=' flag.

Output to File

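stats3.sh *.fasta out=assembly_stats.tsv

Processes all FASTA files in the current directory and saves the results to a TSV file. The files are passed as bare arguments so that the shell expands the wildcard; in a form such as in=*.fasta, the shell generally leaves the pattern unexpanded.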
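
If a wildcard is not expanded the way you expect, one option is to build the comma-separated list for in= explicitly. The following is a minimal sketch; join_by_comma is a hypothetical helper, not something shipped with the tool.

```shell
# Hypothetical helper (not part of BBTools): join all arguments with commas.
join_by_comma() {
  local IFS=,            # "$*" joins arguments with the first character of IFS
  printf '%s\n' "$*"
}

# Usage sketch (assumes stats3.sh is on PATH and *.fasta matches existing files):
#   stats3.sh in="$(join_by_comma *.fasta)" out=assembly_stats.tsv
```

Here the shell expands *.fasta into separate arguments before the helper runs, so the tool receives an ordinary comma-separated in= value.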

Sample Output

fname            size     contigs  gc     maxContig  5kplus   10kplus  25kplus  50kplus
assembly1.fasta  4832156  2341     0.423  84632      3921043  3456789  2987654  2134567
assembly2.fasta  3654289  1876     0.456  126784     3201456  2876543  2345678  1876543

Example output showing statistics for two assemblies.
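
As a sketch of how such a table might be post-processed, the helper below derives GC as a percentage and the share of each assembly contained in contigs of at least 10,000 bp. It assumes the table was saved to a file, that the first line is the header, and that the columns appear in the order documented above; summarize_stats and its output column names are hypothetical.

```shell
# Hypothetical helper: derive percent columns from a saved stats3-style table.
# Assumes tab-delimited input with the header on line 1 and columns in the
# documented order (size = column 2, gc = column 4, 10kplus = column 7).
summarize_stats() {
  awk -F '\t' -v OFS='\t' '
    NR == 1 { print "fname", "gc_pct", "pct_in_10kplus"; next }  # replace the header
    $2 > 0  { printf "%s\t%.1f\t%.1f\n", $1, $4 * 100, 100 * $7 / $2 }
  ' "$1"
}

# Usage sketch:
#   summarize_stats assembly_stats.tsv
```

Rows with a size of zero are skipped so that the percentage is never computed against an empty assembly.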

Algorithm Details

Assembly Class Implementation

STATS3 uses the Assembly class, which processes FASTA files using single-pass parsing with line-by-line sequence analysis.

File Processing Strategy

Base Composition Analysis

Length Threshold Calculations

Performance Characteristics

Differences from STATS

Use Cases

Assembly Comparison

Compare multiple assemblies from different assemblers or parameter sets to identify the best-performing assembly based on key metrics.
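
One simple way to compare assemblies is to rank the rows of a saved table by a single metric. The sketch below assumes a tab-delimited file with the header on the first line; rank_by_column is a hypothetical helper, and which metric counts as "best" remains a judgment call for the user.

```shell
# Hypothetical helper: print a stats3-style table sorted by one numeric column,
# largest value first, keeping the header row on top.
rank_by_column() {
  local file=$1 col=$2
  head -n 1 "$file"
  tail -n +2 "$file" | sort -t "$(printf '\t')" -k "${col},${col}nr"
}

# Usage sketch: rank by maxContig (column 5)
#   rank_by_column assembly_stats.tsv 5
```

Sorting on a single column keeps the comparison transparent; combining several metrics into one score would need weights that this document does not define.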

Quality Assessment Pipeline

Integrate into automated pipelines, using the makeHeader() and processInner() methods, to assess assembly quality across large datasets.

Batch Processing

Process multiple assemblies in a single run using a 120MB heap, with tab-delimited output.

Downstream Analysis

Generate input data for plotting tools, statistical analysis, or assembly selection criteria.

Technical Notes

Input Requirements

Output Behavior

Error Handling

Support

For questions and support: