Scalars

Script: scalars.sh Package: scalar Class: Scalars.java

Calculates compositional scalar metrics from nucleotide sequence data. Computes GC-independent metrics (HH, CAGA, strandedness, etc.) either globally or using a sliding window to characterize within-genome variance. Outputs the mean of each metric and, in windowed mode, its standard deviation across windows.

Basic Usage

scalars.sh in=<input file> out=<output file>

Scalars analyzes nucleotide sequence composition and prints the average of each metric for each input file. In windowed mode (window>0), it also prints the standard deviation of each metric across windows for each file.

Parameters

Parameters control input/output locations and analysis mode (global vs windowed).

Standard Parameters

in=<file>
Primary input; FASTA or FASTQ format. This can also be a directory or a comma-delimited list of files. Filenames can also be given without the in= prefix.
out=stdout
Set to a file to redirect output. Default: stdout (prints to console).

Processing Parameters

header=f
Print a column header line. Default: false.
rowheader=f
Print a row header for each output row. Default: false.
window=0
If nonzero, calculate and average over sliding windows of this size. Default: 0 (global analysis mode). When set, computes mean and standard deviation across all windows.
break=f
Set to true to break data at contig boundaries in windowed mode. Prevents windows from spanning multiple contigs. Default: false.
raw=f
Output raw dinucleotide frequencies instead of computed metrics. Default: false.

Java Parameters

-Xmx
Set Java's memory usage, overriding autodetection. -Xmx20g will specify 20 gigs of RAM, and -Xmx200m will specify 200 megs. The max is typically 85% of physical memory. Default: 800m (fixed allocation).
-eoom
This flag will cause the process to exit if an out-of-memory exception occurs. Requires Java 8u92+.
-da
Disable assertions.

Output Format

Compositional Metrics

Scalars computes the following compositional metrics based on dinucleotide frequencies:

Global Mode (window=0)

GC      STR     HH      PP      AAAT    CCCG    HMH     HHPP    ACTG    ACAG    CAGA    CCMCG   ATMTA   AT
0.512   0.003   0.256   0.244   0.183   0.187   0.251   0.500   0.501   0.252   0.248   0.245   0.255   0.488

One row per input file with mean values for each metric.

Windowed Mode (window>0)

        GC      STR     HH      PP      AAAT    CCCG    HMH     HHPP    ACTG    ACAG    CAGA    CCMCG   ATMTA   AT
Mean    0.512   0.003   0.256   0.244   0.183   0.187   0.251   0.500   0.501   0.252   0.248   0.245   0.255   0.488
STDev   0.045   0.012   0.028   0.026   0.032   0.034   0.029   0.041   0.038   0.031   0.029   0.033   0.030   0.045

Two rows per input file: mean and standard deviation across all windows.
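
For downstream scripting, a table like the ones above can be read into a dictionary. The following is a minimal sketch, not part of Scalars: the function name parse_scalars_table is hypothetical, and it assumes the file was written with header=t, that columns are separated by whitespace, and that any row label (such as Mean or STDev) contains no spaces.

```python
def parse_scalars_table(text):
    """Parse a header line plus data rows into {row_label: {metric: value}}.

    Assumption: a row with one more field than the header starts with a label;
    otherwise the row is unlabeled and gets a generated name.
    """
    lines = [ln.split() for ln in text.splitlines() if ln.strip()]
    header, rows = lines[0], lines[1:]
    table = {}
    for i, fields in enumerate(rows):
        if len(fields) == len(header) + 1:
            label, values = fields[0], fields[1:]
        else:
            label, values = f"row{i}", fields
        table[label] = dict(zip(header, map(float, values)))
    return table


# Shortened sample (three of the fourteen columns) in the windowed layout.
sample = """\
GC      STR     HH
Mean    0.512   0.003   0.256
STDev   0.045   0.012   0.028
"""

table = parse_scalars_table(sample)
print(table["Mean"]["GC"], table["STDev"]["HH"])
```

If rowheader=t produces labels that contain spaces, splitting on a fixed delimiter would be needed instead of generic whitespace.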

Examples

Global Analysis

scalars.sh in=genome.fasta

Calculate compositional metrics for entire genome, output to console.

Windowed Analysis

scalars.sh in=genome.fasta out=metrics.txt window=10000 header=t

Compute metrics using 10kb sliding windows, output mean and standard deviation with column headers.

Multiple Files

scalars.sh in=genome1.fasta,genome2.fasta,genome3.fasta out=comparison.txt header=t rowheader=t

Process multiple genomes and create comparison table with headers.

Directory Processing

scalars.sh in=genomes_directory/ out=all_metrics.txt window=5000 header=t

Process all FASTA/FASTQ files in directory with 5kb windowed analysis.

Contig-Aware Windowing

scalars.sh in=assembly.fasta window=1000 break=t header=t out=metrics.txt

Calculate metrics with 1kb windows that don't span contig boundaries.
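
To see why break=t matters, it helps to count how many full windows each setting yields. The sketch below is illustrative only: count_windows is a hypothetical helper, and it assumes the window advances one base at a time and that, with break=t, a contig shorter than the window contributes no window at all.

```python
def count_windows(contig_lengths, window, break_at_contigs):
    """Number of full windows when sliding one base at a time."""
    if break_at_contigs:
        # Each contig is windowed on its own; short contigs yield no window.
        return sum(max(0, n - window + 1) for n in contig_lengths)
    # Otherwise the contigs behave like one concatenated sequence.
    return max(0, sum(contig_lengths) - window + 1)


lengths = [2500, 800, 1200]  # made-up contig lengths for illustration
spanning = count_windows(lengths, 1000, break_at_contigs=False)
contig_aware = count_windows(lengths, 1000, break_at_contigs=True)
print(spanning, contig_aware)
```

Under these assumptions the contig-aware count is smaller, because windows that would mix sequence from two contigs are never formed.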

Algorithm Details

Processing Pipeline

Global Mode (window=0):

  1. Dimer Counting: Accumulate all dinucleotide frequencies across entire input
  2. Metric Calculation: Compute each compositional metric from dimer counts
  3. Output: Print single row of mean values for all metrics
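
The global-mode steps can be sketched as follows. This is a simplified illustration rather than the code in Scalars.java: the formulas for HH, CAGA and the other metrics are not given here, so the sketch stops at raw dimer frequencies and a GC estimate. The function names are hypothetical, and skipping non-ACGT bases and not pairing bases across sequence boundaries are assumptions.

```python
from collections import Counter


def dimer_counts(sequences):
    """Count overlapping dinucleotides across all sequences (ACGT only)."""
    counts = Counter()
    for seq in sequences:
        seq = seq.upper()
        for a, b in zip(seq, seq[1:]):
            if a in "ACGT" and b in "ACGT":
                counts[a + b] += 1
    return counts


def dimer_frequencies(counts):
    """Convert raw dimer counts to fractions of all counted dimers."""
    total = sum(counts.values())
    return {d: n / total for d, n in counts.items()} if total else {}


def gc_from_dimers(counts):
    """Approximate GC fraction, using the first base of every counted dimer."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return sum(n for d, n in counts.items() if d[0] in "GC") / total


counts = dimer_counts(["ACGTACGT", "GGCC"])
print(dimer_frequencies(counts))
print(gc_from_dimers(counts))
```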

Windowed Mode (window>0):

  1. Sliding Window: For each base in input:
    • Update window with new base (removing oldest if window full)
    • When window is valid (full window size), calculate all 14 metrics
    • Add each metric value to corresponding histogram (1025 bins, 0.0-1.0 range)
  2. Statistical Analysis: Calculate mean and standard deviation from histograms
  3. Output: Print two rows - mean values and standard deviations
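
The windowed steps can be sketched for a single metric. The example below uses GC fraction because its definition is unambiguous; the other metrics would be binned the same way. It is a sketch under stated assumptions, not the actual implementation: the mapping of a value to one of the 1025 bins (rounding value × 1024) and the use of the population standard deviation are both assumptions.

```python
from collections import deque
import math

BINS = 1025  # bin count stated above; the value-to-bin mapping is an assumption


def windowed_gc_histogram(seq, window):
    """Slide a window one base at a time and histogram the GC fraction."""
    hist = [0] * BINS
    buf = deque()
    gc = 0
    for base in seq.upper():
        buf.append(base)
        gc += base in "GC"               # add the newest base
        if len(buf) > window:
            gc -= buf.popleft() in "GC"  # drop the oldest base
        if len(buf) == window:           # window is full, so it is valid
            value = gc / window
            hist[round(value * (BINS - 1))] += 1
    return hist


def mean_and_stdev(hist):
    """Mean and population standard deviation recovered from bin counts."""
    n = sum(hist)
    if n == 0:
        return 0.0, 0.0
    centers = [i / (BINS - 1) for i in range(BINS)]
    mean = sum(c * k for c, k in zip(centers, hist)) / n
    var = sum(k * (c - mean) ** 2 for c, k in zip(centers, hist)) / n
    return mean, math.sqrt(var)


hist = windowed_gc_histogram("GGGGAAAA", window=4)
mean, stdev = mean_and_stdev(hist)
print(sum(hist), mean, stdev)
```

Storing counts per bin rather than every window value keeps memory fixed regardless of genome length, at the cost of quantizing each value to the nearest bin.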

Computational Details

Memory Requirements

Memory usage is minimal in global mode, which stores only dimer counts. Windowed mode additionally requires approximately 115 KB for histograms (14 metrics × 1025 bins × 8 bytes = 114,800 bytes). Neither structure grows with input length, so the default 800MB allocation also suffices for large genomes.
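
The histogram footprint can be checked by multiplying out the stated dimensions. The 14 metrics, 1025 bins, and 8 bytes per bin are taken from the text above; the constant names are only for illustration.

```python
METRICS = 14
BINS = 1025
BYTES_PER_BIN = 8  # as stated above, e.g. one 64-bit counter per bin

histogram_bytes = METRICS * BINS * BYTES_PER_BIN
print(histogram_bytes)         # total bytes for all histograms
print(histogram_bytes / 1024)  # the same figure in KiB
```

This works out to 114,800 bytes, a little over 112 KiB.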

Support

Please contact Brian Bushnell at bbushnell@lbl.gov if you encounter any problems.