Stats3
In progress. Generates basic assembly statistics for one or more files, using the Assembly class, and reports them in tabular format.
Basic Usage
stats3.sh in=file
stats3.sh in=file,file
stats3.sh file file file
STATS3 is an assembly statistics tool that processes multiple FASTA files and outputs statistics in a tabular format. Unlike the STATS tool, STATS3 focuses on basic metrics and supports batch processing of multiple assemblies.
Parameters
STATS3 accepts minimal parameters focused on input/output specification. Multiple files can be processed in a single run.
Input/Output Parameters
- in=file
- Specify the input FASTA file(s), or stdin. Multiple files can be comma-separated or listed without the 'in=' flag. Each file will be processed and reported separately in the output table.
- out=stdout
- Destination of primary output; may be directed to a file. Output is in tab-delimited format suitable for spreadsheet analysis or further processing.
Output Format
STATS3 produces a tab-delimited table with the following columns:
fname size contigs gc maxContig 5kplus 10kplus 25kplus 50kplus
- fname
- Input filename
- size
- Total assembly size in bases
- contigs
- Number of contigs/scaffolds
- gc
- GC content as decimal fraction (e.g., 0.432 = 43.2%)
- maxContig
- Length of longest contig in bases
- 5kplus
- Total length of contigs ≥5,000 bp
- 10kplus
- Total length of contigs ≥10,000 bp
- 25kplus
- Total length of contigs ≥25,000 bp
- 50kplus
- Total length of contigs ≥50,000 bp
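The table can be read back for downstream processing with a few lines of Python. This is a minimal sketch, assuming the header line matches the column list above exactly and that fields are tab-separated; the helper name `read_stats_table` and the in-memory sample are illustrative and not part of STATS3.

```python
import csv
import io

# Columns holding integer counts or lengths; gc is parsed as a float.
INT_COLUMNS = ("size", "contigs", "maxContig", "5kplus", "10kplus", "25kplus", "50kplus")

def read_stats_table(handle):
    """Parse tab-delimited STATS3-style output into a list of dicts (hypothetical helper)."""
    rows = []
    for record in csv.DictReader(handle, delimiter="\t"):
        row = {"fname": record["fname"], "gc": float(record["gc"])}
        for name in INT_COLUMNS:
            row[name] = int(record[name])
        rows.append(row)
    return rows

# Illustrative input only; real input would be a file written with out=...
sample = io.StringIO(
    "fname\tsize\tcontigs\tgc\tmaxContig\t5kplus\t10kplus\t25kplus\t50kplus\n"
    "assembly1.fasta\t4832156\t2341\t0.423\t84632\t3921043\t3456789\t2987654\t2134567\n"
)
rows = read_stats_table(sample)
print(rows[0]["fname"], rows[0]["size"], rows[0]["gc"])
```

Converting the columns to numbers up front keeps later comparisons and sorting from silently operating on strings.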
Examples
Single File Analysis
stats3.sh in=assembly.fasta
Processes a single assembly file and outputs statistics to stdout.
Multiple Files with Comma Separation
stats3.sh in=assembly1.fasta,assembly2.fasta,assembly3.fasta
Processes three assembly files in sequence, outputting one row per file.
Multiple Files as Arguments
stats3.sh assembly1.fasta assembly2.fasta assembly3.fasta
Alternative syntax for processing multiple files without 'in=' flag.
Output to File
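stats3.sh *.fasta out=assembly_stats.tsv
Processes all FASTA files in the current directory and saves results to a TSV file. The pattern is given as bare arguments because a shell matches an unquoted in=*.fasta against names beginning with 'in=', so it would not expand to the intended files.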
Sample Output
fname size contigs gc maxContig 5kplus 10kplus 25kplus 50kplus
assembly1.fasta 4832156 2341 0.423 84632 3921043 3456789 2987654 2134567
assembly2.fasta 3654289 1876 0.456 126784 3201456 2876543 2345678 1876543
Example output showing statistics for two assemblies.
Algorithm Details
Assembly Class Implementation
STATS3 uses the Assembly class, which parses each FASTA file in a single pass and analyzes sequence line by line:
File Processing Strategy
- Single-pass parsing: Reads FASTA files once, processing headers and sequences line-by-line
- IntList data structure: Uses IntList class for contig lengths, avoiding storage of full sequences
- Descending sort: Contig lengths are sorted in descending order using IntList.sort() and IntList.reverse() methods
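The strategy above can be illustrated in Python. This is a sketch of the idea only, not the Assembly class itself: a plain list stands in for IntList, and the function name `contig_lengths` is hypothetical.

```python
import io

def contig_lengths(handle):
    """Return contig lengths, longest first, reading the FASTA stream once."""
    lengths = []
    current = None  # None until the first header line is seen
    for line in handle:
        line = line.rstrip("\r\n")
        if line.startswith(">"):
            # A new header closes the previous contig, if any.
            if current is not None:
                lengths.append(current)
            current = 0
        elif current is not None:
            # Only the length is kept; the sequence itself is discarded.
            current += len(line)
    if current is not None:
        lengths.append(current)
    # Ascending sort followed by reverse, mirroring the sort()/reverse() pairing.
    lengths.sort()
    lengths.reverse()
    return lengths

fasta = io.StringIO(">a\nACGT\nAC\n>b\nACGTACGTAC\n")
print(contig_lengths(fasta))  # [10, 6]
```

Because only one integer per contig is retained, memory grows with the number of contigs rather than with assembly size.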
Base Composition Analysis
- baseToACGTNIO array: Uses 128-element byte array to map bases to categories A, C, G, T, U, N, IUPAC ambiguity codes, and other characters
- Case-insensitive: Both uppercase and lowercase bases are recognized
- GC calculation: Uses gc() method that divides (G+C) by (A+T+U+G+C), excluding N and IUPAC ambiguity codes
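A table-driven GC calculation along these lines might look as follows in Python. The category indices, the table name `BASE_TABLE`, and the choice to return 0.0 when no defined bases are present are all assumptions made for illustration; the actual layout of baseToACGTNIO may differ.

```python
def build_base_table():
    """128-entry lookup from ASCII code to category index (illustrative layout)."""
    # Assumed indices: 0=A, 1=C, 2=G, 3=T, 4=U, 5=N, 6=IUPAC ambiguity, 7=other
    table = [7] * 128
    for index, letters in enumerate(("Aa", "Cc", "Gg", "Tt", "Uu", "Nn")):
        for ch in letters:
            table[ord(ch)] = index
    for ch in "RYSWKMBDHVryswkmbdhv":
        table[ord(ch)] = 6
    return table

BASE_TABLE = build_base_table()

def gc_fraction(sequence):
    """Return (G+C) / (A+T+U+G+C), ignoring N and ambiguity codes."""
    counts = [0] * 8
    for ch in sequence:
        code = ord(ch)
        counts[BASE_TABLE[code] if code < 128 else 7] += 1
    defined = counts[0] + counts[1] + counts[2] + counts[3] + counts[4]
    # Assumption: report 0.0 rather than fail when no defined bases exist.
    return (counts[1] + counts[2]) / defined if defined else 0.0

print(gc_fraction("GGCCAATTUU"))    # 0.4
print(gc_fraction("ACGTNNacgtRY"))  # 0.5
```

Mapping both cases to the same index in the table is what makes the count case-insensitive without any per-character conversion.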
Length Threshold Calculations
- Sorted processing: Since contigs are sorted by length (largest first), thresholding stops at first contig below threshold
- Cumulative sums: Length thresholds (5k+, 10k+, etc.) represent total sequence length in contigs meeting criteria
- At most O(n) per threshold: After the one-time sort, each threshold calculation scans contigs only until the first one below the cutoff
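The early-exit summation described in this list can be sketched in Python as follows. The function name `length_at_least` and the example lengths are made up for illustration.

```python
def length_at_least(sorted_lengths, threshold):
    """Sum lengths >= threshold; sorted_lengths must be in descending order."""
    total = 0
    for length in sorted_lengths:
        if length < threshold:
            break  # every later contig is shorter still
        total += length
    return total

lengths = [60000, 26000, 12000, 7000, 900]  # already sorted, longest first
for cutoff in (5000, 10000, 25000, 50000):
    print(cutoff, length_at_least(lengths, cutoff))
# 5000 105000
# 10000 98000
# 25000 86000
# 50000 60000
```

A contig exactly at the cutoff is included, matching the ≥ in the column definitions.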
Performance Characteristics
- Memory usage: 120 MB default heap size; only contig lengths are kept (in an IntList), not sequences
- File I/O: Single-pass file reading with ByteFile.nextLine() method
- Batch processing: Processes multiple assemblies using for-loop iteration with processInner() method calls
- Output format: Tab-delimited format using ByteStreamWriter with tab() and nl() methods
Differences from STATS
- Simplified metrics: Focuses on essential statistics without N50/L50 calculations
- Tabular output: Always produces machine-readable tabular format
- Multiple file orientation: Designed specifically for comparative analysis of multiple assemblies
- Development status: Currently marked as "in progress" - additional features may be added
Use Cases
Assembly Comparison
Compare multiple assemblies from different assemblers or parameter sets to identify the best performing assembly based on key metrics.
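One possible comparison, sketched in Python, ranks assemblies by the fraction of total size held in contigs of at least 10 kb. That criterion is an assumption chosen for illustration, not something STATS3 prescribes, and the rows reuse the illustrative figures from the Sample Output section.

```python
# Rows as they might look after parsing the table (illustrative values).
rows = [
    {"fname": "assembly1.fasta", "size": 4832156, "10kplus": 3456789},
    {"fname": "assembly2.fasta", "size": 3654289, "10kplus": 2876543},
]

def fraction_in_long_contigs(row, column="10kplus"):
    """Share of the assembly contained in contigs counted by the given column."""
    return row[column] / row["size"] if row["size"] else 0.0

# Highest fraction first.
ranked = sorted(rows, key=fraction_in_long_contigs, reverse=True)
for row in ranked:
    print(row["fname"], round(fraction_in_long_contigs(row), 3))
```

Normalizing by size avoids favoring an assembly simply because it is larger; other columns such as maxContig or contigs can be substituted as the key.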
Quality Assessment Pipeline
Integrate into automated pipelines (via the makeHeader() and processInner() methods) for assembly quality assessment across large datasets.
Batch Processing
Process many assemblies in a single run within the default 120 MB heap, producing one tab-delimited table.
Downstream Analysis
Generate input data for plotting tools, statistical analysis, or assembly selection criteria.
Technical Notes
Input Requirements
- FASTA format files (compressed files supported)
- Files must be accessible at specified paths
- No special formatting requirements - standard FASTA headers and sequences
Output Behavior
- Header line printed once at beginning of output
- One data line per input file
- Tab-delimited format for easy parsing
- GC content is reported as a decimal fraction with three decimal places
Error Handling
- Skips files that cannot be read
- Reports processing status to stderr
- Continues processing remaining files if one file fails
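The skip-and-continue behavior listed above follows a common batch pattern, sketched here in Python. This is not STATS3's code: `summarize_all` and `fake_summarize` are hypothetical names, and the stand-in summarizer fails for one file name on purpose.

```python
import io
import sys

def summarize_all(paths, summarize, out=sys.stdout, err=sys.stderr):
    """Write one tab-delimited row per readable file; report and skip failures."""
    failed = []
    for path in paths:
        try:
            row = summarize(path)
        except OSError as exc:
            # Status goes to the error stream so the table stays clean.
            print(f"Skipping {path}: {exc}", file=err)
            failed.append(path)
            continue
        out.write("\t".join(str(value) for value in row) + "\n")
    return failed

def fake_summarize(path):
    # Stand-in for real per-file statistics.
    if path == "missing.fasta":
        raise FileNotFoundError(path)
    return (path, 100, 2)

out_buf, err_buf = io.StringIO(), io.StringIO()
failed = summarize_all(["a.fasta", "missing.fasta", "b.fasta"], fake_summarize, out_buf, err_buf)
print(out_buf.getvalue(), end="")
print("failed:", failed)
```

Keeping status messages on a separate stream means the primary output can be redirected to a file and parsed without filtering.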
Support
For questions and support:
- Email: bbushnell@lbl.gov
- Documentation: bbmap.org