StatsWrapper
Runs stats.sh on multiple assemblies to produce one output line per file. This wrapper simplifies batch processing of multiple FASTA files by automatically calling the underlying stats functionality for each input file and consolidating the output into a tabular format with headers.
Basic Usage
statswrapper.sh in=<input file>
For multiple files, specify comma-separated filenames or use shell wildcards. The tool automatically handles file detection and processes each assembly separately while maintaining consistent output formatting.
Parameters
Parameters control the assembly statistics calculation and output formatting. Most parameters are passed through to the underlying stats.sh functionality, with wrapper-specific handling for multiple file processing.
Input/Output Parameters
- in=<file>
- Specify the input fasta file, or stdin. For multiple files a, b, and c: 'statswrapper.sh in=a,b,c'. 'in=' may be omitted if this is the first arg, and asterisks may be used; e.g. statswrapper.sh *.fa. The wrapper automatically detects individual files versus comma-separated lists and processes each file separately.
- gc=<file>
- Writes ACGTN content per scaffold to a file. When processing multiple input files, GC content from all files is written to the same output file, with scaffold names preserving the source file context.
- gchist=<file>
- Filename to output scaffold GC content histogram. The histogram combines data from all processed files to provide an overall GC distribution across all input assemblies.
Statistical Parameters
- gcbins=<200>
- Number of bins for GC histogram. Controls the resolution of GC content distribution analysis across all processed files. Default: 200
- n=<10>
- Number of contiguous Ns to signify a break between contigs. This parameter affects contig vs scaffold distinction in the statistics calculation. Default: 10
- k=<13>
- Estimate memory usage of BBMap with this kmer length. Used for providing memory usage estimates in the output statistics. Default: 13
- minscaf=<0>
- Ignore scaffolds shorter than this length in the statistics calculation. Useful for filtering out very short scaffolds that may not be biologically meaningful. Default: 0 (include all scaffolds)
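To illustrate how the n= threshold separates contigs within a scaffold, here is a minimal Java sketch. It is not taken from the BBTools source: the class and method names are hypothetical, and it only counts contigs, assuming a run of at least n consecutive Ns ends the current contig.

```java
/** Illustrative sketch only; not the BBTools implementation. */
final class ContigBreakSketch {

    /** Counts contigs in one scaffold, breaking wherever at least minRun consecutive Ns occur. */
    static int countContigs(String scaffold, int minRun) {
        int contigs = 0;
        int nRun = 0;             // length of the current run of Ns
        boolean inContig = false; // true while inside a contig
        for (int i = 0; i < scaffold.length(); i++) {
            char c = Character.toUpperCase(scaffold.charAt(i));
            if (c == 'N') {
                nRun++;
                if (nRun >= minRun) {
                    inContig = false; // the run is long enough to end the contig
                }
            } else {
                nRun = 0;
                if (!inContig) {
                    contigs++;        // first base after a break starts a new contig
                    inContig = true;
                }
            }
        }
        return contigs;
    }
}
```

With n=10, a scaffold containing a run of 10 Ns between two blocks of bases counts as two contigs, while a run of 9 Ns leaves it as one.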
Output Formatting Parameters
- n_=<t>
- This flag will prefix the terms 'contigs' and 'scaffolds' with 'n_' in formats 3-6. Useful for distinguishing count metrics from length metrics in tabular output. Default: true
- addname=<t>
- Adds a column for input file name, for formats 3-6. Essential for multi-file processing as it identifies which file each statistics row corresponds to. Default: true (automatically enabled by wrapper)
- format=<1 through 6>
- Format of the stats information. Default is format=3.
- format=1: uses variable units like MB and KB, and is designed for compatibility with existing tools
- format=2: uses only whole numbers of bases, with no commas in numbers, and is designed for machine parsing
- format=3: outputs stats in 2 rows of tab-delimited columns: a header row and a data row
- format=4: is like 3 but with scaffold data only
- format=5: is like 3 but with contig data only
- format=6: is like 3 but the header starts with a #
- gcformat=<1 or 2>
- Select GC output format. Controls the format of per-scaffold GC content output when gc= parameter is specified.
- gcformat=1: name start stop A C G T N GC
- gcformat=2: name GC
Examples
Process Multiple FASTA Files
statswrapper.sh assembly1.fasta,assembly2.fasta,assembly3.fasta
Processes three assembly files and outputs statistics for each in tabular format with headers and file name identification.
Use Wildcard Pattern
statswrapper.sh *.fa format=3 addname=t
Processes all .fa files in the current directory using tabular output format with file names included.
Generate GC Content Analysis
statswrapper.sh assemblies/*.fasta gc=gc_content.tsv gchist=gc_histogram.tsv gcbins=100
Processes all FASTA files in the assemblies directory, outputting per-scaffold GC content and a combined GC histogram with 100 bins.
Filter Short Scaffolds
statswrapper.sh *.fasta minscaf=1000 format=4
Processes all FASTA files, ignoring scaffolds shorter than 1000bp and outputting only scaffold statistics in tabular format.
Algorithm Details
STATSWRAPPER is a multi-file wrapper around the BBTools assembly statistics engine (AssemblyStats2). The subsections below describe how it parses arguments, manages headers and memory, detects input files, and combines output.
Multi-File Processing Architecture
The wrapper works in two phases. First, it parses the command-line arguments into two ArrayLists: one for parameters (alist) and one for input files (ilist). Second, it calls AssemblyStats2.process() once per input file. The parameters are carried in the args2 array, and only args2[0] is replaced with the current input file on each iteration.
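The two phases can be sketched in Java roughly as follows. This is an illustration under stated assumptions, not the wrapper's source: the class and method names are hypothetical, the header slot is omitted for brevity, and the real call to AssemblyStats2 is replaced by returning the argument arrays that would be passed to it.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative sketch only; not the BBTools source. */
final class ArgSplitSketch {

    /** Holds the two lists the text calls alist (parameters) and ilist (inputs). */
    static final class Split {
        final List<String> alist = new ArrayList<>();
        final List<String> ilist = new ArrayList<>();
    }

    /** Phase 1: route "in=" arguments to ilist and everything else to alist. */
    static Split split(String[] args) {
        Split s = new Split();
        for (String arg : args) {
            if (arg.startsWith("in=")) {
                s.ilist.add(arg);
            } else {
                s.alist.add(arg);
            }
        }
        return s;
    }

    /** Phase 2: build one argument array per input; slot 0 holds the input file. */
    static List<String[]> perFileArgs(Split s) {
        List<String[]> out = new ArrayList<>();
        for (String in : s.ilist) {
            String[] args2 = new String[s.alist.size() + 1];
            args2[0] = in; // the only slot that differs between files in this sketch
            for (int i = 0; i < s.alist.size(); i++) {
                args2[i + 1] = s.alist.get(i);
            }
            out.add(args2);
        }
        return out;
    }
}
```

For example, the arguments in=a.fa format=3 in=b.fa yield two arrays, {in=a.fa, format=3} and {in=b.fa, format=3}.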
Automatic Header Management
For tabular output formats (3-6), the wrapper sets args2[1]="header=t" for the first file and args2[1]="header=f" for subsequent files (i>0). This prevents header duplication by explicitly controlling the header parameter passed to each AssemblyStats2 instance.
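The effect of this header flag can be illustrated with a small Java sketch. The names here are hypothetical and the code only simulates the combined output; it assumes each file contributes exactly one data row, as the tabular formats do.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative sketch only; not the BBTools source. */
final class HeaderFlagSketch {

    /** Returns the header argument for the file at the given zero-based index. */
    static String headerArg(int fileIndex) {
        return fileIndex == 0 ? "header=t" : "header=f";
    }

    /** Simulates the combined output: every file adds its row, only the first also adds the header. */
    static List<String> combine(String headerLine, List<String> rowPerFile) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i < rowPerFile.size(); i++) {
            if (headerArg(i).equals("header=t")) {
                out.add(headerLine);
            }
            out.add(rowPerFile.get(i));
        }
        return out;
    }
}
```

Three input files therefore produce four lines: one header followed by three data rows.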
Memory Optimization Strategy
Before each file after the first (i>0), the wrapper calls System.gc() to request garbage collection, performs a synchronized wait(100) on AssemblyStatsWrapper.class, and calls Thread.yield(). System.gc() is only a request that the JVM may ignore, so this reduces the chance of memory accumulating between AssemblyStats2 instances rather than guaranteeing cleanup.
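A minimal Java sketch of such a pause between files is shown below. The class and method names are hypothetical, and the statistics step itself is left as a comment; only System.gc(), Object.wait(long), and Thread.yield() are standard Java calls.

```java
/** Illustrative sketch only; not the BBTools source. */
final class PauseSketch {

    /** Requests a collection and briefly yields before the next file is processed. */
    static void pauseBetweenFiles() {
        System.gc(); // a request only; the JVM may ignore it
        synchronized (PauseSketch.class) {
            try {
                // Nothing calls notify here, so this normally returns after about 100 ms.
                PauseSketch.class.wait(100);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve the interrupt status
            }
        }
        Thread.yield(); // hint that other threads may run
    }

    /** Runs the pause before every file except the first; returns how many pauses ran. */
    static int processAll(int fileCount) {
        int pauses = 0;
        for (int i = 0; i < fileCount; i++) {
            if (i > 0) {
                pauseBetweenFiles();
                pauses++;
            }
            // A real wrapper would compute the statistics for file i here.
        }
        return pauses;
    }
}
```

The wait must be called while holding the monitor of the object being waited on, which is why it sits inside the synchronized block.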
File Detection Algorithm
The wrapper treats an argument as a bare filename when it does not contain "=" (!arg.contains("=")) and Tools.canRead(arg) succeeds. For an in= argument, it first tests whether the whole value exists as a single file (new File(split[1]).exists()), so a filename that itself contains a comma is still accepted. Only if that path does not exist does it split the value on commas (split[1].split(",")) and add each piece as a separate "in=" entry.
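The detection rules can be sketched in Java as below. This is an approximation, not the wrapper's code: the class and method names are hypothetical, java.io.File.canRead() stands in for Tools.canRead, and arguments that are not inputs simply yield an empty list.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;

/** Illustrative sketch only; not the BBTools source. */
final class FileDetectSketch {

    /** Expands one argument into "in=" entries; returns an empty list for ordinary parameters. */
    static List<String> toInputs(String arg) {
        List<String> inputs = new ArrayList<>();
        if (!arg.contains("=")) {
            // Bare argument: treat it as an input only if it is a readable file.
            if (new File(arg).canRead()) { // stand-in for Tools.canRead (assumption)
                inputs.add("in=" + arg);
            }
            return inputs;
        }
        String[] split = arg.split("=", 2);
        if (!split[0].equalsIgnoreCase("in")) {
            return inputs; // some other key=value parameter
        }
        if (new File(split[1]).exists()) {
            inputs.add(arg); // the whole value is one existing file
        } else {
            for (String name : split[1].split(",")) {
                inputs.add("in=" + name); // one entry per comma-separated name
            }
        }
        return inputs;
    }
}
```

Checking the combined path before splitting means a comma-separated list is only split when no file with that exact name exists.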
Statistical Consistency
The wrapper creates a new AssemblyStats2(args2) instance for each file, so every file is processed with the same parameters. Across iterations, only args2[0] (the input file) and args2[1] (the header flag) change; the rest of args2 is left untouched.
Output Format Integration
The wrapper sets AssemblyStats2.overwrite=false and AssemblyStats2.append=true for subsequent files (i>0), enabling output file appending. It automatically adds "addname=t" to the default parameter array (alist) to enable file name columns for tabular output formats when processing multiple files.
Output Formats
When processing multiple files, STATSWRAPPER generates different output structures depending on the selected format:
Tabular Formats (3-6)
These formats produce a header row followed by one data row per input file. The header is printed once at the beginning, and each subsequent row contains statistics for one assembly file. When addname=t, the first column identifies the source file.
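Because the tabular output is one header row plus one row per file, it is straightforward to load downstream. The Java sketch below is one possible reader under stated assumptions: the class name is hypothetical, no column names are assumed, and a leading # on the header (as in format=6) is stripped.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Illustrative sketch only; a possible reader for tab-delimited output. */
final class TabularStatsSketch {

    /** Maps each data row to column-name/value pairs using the first line as the header. */
    static List<Map<String, String>> parse(List<String> lines) {
        List<Map<String, String>> rows = new ArrayList<>();
        if (lines.isEmpty()) {
            return rows;
        }
        String header = lines.get(0);
        if (header.startsWith("#")) {
            header = header.substring(1); // format=6 style header
        }
        String[] names = header.split("\t", -1);
        for (int i = 1; i < lines.size(); i++) {
            String line = lines.get(i);
            if (line.isEmpty()) {
                continue; // skip blank lines
            }
            String[] values = line.split("\t", -1);
            Map<String, String> row = new LinkedHashMap<>();
            for (int c = 0; c < names.length && c < values.length; c++) {
                row.put(names[c], values[c]);
            }
            rows.add(row);
        }
        return rows;
    }
}
```

Each returned map corresponds to one input assembly, keyed by whatever column names the header row contains.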
Non-Tabular Formats (1-2, 7)
These formats produce a separate statistical summary for each input file rather than one row per file. Format 1 uses variable units such as MB and KB and suits manual review; format 2 uses whole numbers of bases without commas and suits machine parsing.
GC Content Output
When gc= is specified, per-scaffold GC content from all input files is written to a single output file, with scaffold names preserving source file context. The gcformat parameter controls the level of detail in this output.
Support
For questions and support:
- Email: bbushnell@lbl.gov
- Documentation: bbmap.org