Scalars

Script: scalars.sh Package: scalar Class: Scalars.java

Calculates compositional scalar metrics from nucleotide sequence data. Computes GC-independent metrics (HH, CAGA, strandedness, etc.) either globally or using a sliding window to characterize within-genome variance. Outputs mean and standard deviation for each metric.

Basic Usage

scalars.sh in=<input file> out=<output file>

Scalars analyzes nucleotide sequence composition and prints the average of each metric for each input file. In windowed mode, it also prints the standard deviation of each metric for each file.

Parameters

Parameters control input/output locations and analysis mode (global vs windowed).

Standard Parameters

in=<file>: Primary input; FASTA or FASTQ format. This can also be a directory or a comma-delimited list. Filenames can also be given without the in= prefix.
out=stdout: Set to a file to redirect output. Default: stdout (prints to console).

Processing Parameters

header=f: Print a column header line. Default: false.
rowheader=f: Print a row header for each output row. Default: false.
window=0: If nonzero, calculate and average over sliding windows of this size. Default: 0 (global analysis mode). When set, computes mean and standard deviation across all windows.
break=f: Set to true to break data at contig boundaries in windowed mode. Prevents windows from spanning multiple contigs. Default: false.
raw=f: Output raw dinucleotide frequencies instead of computed metrics. Default: false.

Java Parameters

-Xmx: Set Java's memory usage, overriding autodetection. -Xmx20g will specify 20 gigs of RAM, and -Xmx200m will specify 200 megs. The max is typically 85% of physical memory. Default: 800m (fixed allocation).
-eoom: This flag will cause the process to exit if an out-of-memory exception occurs. Requires Java 8u92+.
-da: Disable assertions.

Output Format

Compositional Metrics

Scalars computes the following compositional metrics based on dinucleotide frequencies:

GC: GC content
STR: Strandedness
HH: HH dinucleotide-based metric
PP: PP dinucleotide-based metric
AAAT: AAAT dinucleotide-based metric
CCCG: CCCG dinucleotide-based metric
HMH: HMH dinucleotide-based metric
HHPP: HHPP dinucleotide-based metric
ACTG: ACTG dinucleotide-based metric
ACAG: ACAG dinucleotide-based metric
CAGA: CAGA dinucleotide-based metric
CCMCG: CCMCG dinucleotide-based metric
ATMTA: ATMTA dinucleotide-based metric
AT: AT content

Global Mode (window=0)

GC      STR     HH      PP      AAAT    CCCG    HMH     HHPP    ACTG    ACAG    CAGA    CCMCG   ATMTA   AT
0.512   0.003   0.256   0.244   0.183   0.187   0.251   0.500   0.501   0.252   0.248   0.245   0.255   0.488

One row per input file with mean values for each metric.

Windowed Mode (window>0)

GC      STR     HH      PP      AAAT    CCCG    HMH     HHPP    ACTG    ACAG    CAGA    CCMCG   ATMTA   AT
Mean    0.512   0.003   0.256   0.244   0.183   0.187   0.251   0.500   0.501   0.252   0.248   0.245   0.255   0.488
STDev   0.045   0.012   0.028   0.026   0.032   0.034   0.029   0.041   0.038   0.031   0.029   0.033   0.030   0.045

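Two rows per input file: mean and standard deviation across all windows. In the sample above, each data row begins with a label (Mean or STDev), so the data rows have one more column than the header line.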
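For downstream analysis, windowed output like the sample above can be loaded into a dictionary. This is a minimal sketch, assuming whitespace-delimited columns, a header line (header=t), and labeled Mean/STDev rows as in the sample; parse_windowed is a hypothetical helper, not part of the tool.

```python
def parse_windowed(text):
    """Parse header + Mean + STDev rows into {metric: (mean, stdev)}."""
    rows = [ln.split() for ln in text.strip().splitlines() if ln.strip()]
    header, mean_row, sd_row = rows[0], rows[1], rows[2]
    # Assumption: data rows start with a label ("Mean"/"STDev") that the header lacks.
    means = [float(x) for x in mean_row[1:]]
    sds = [float(x) for x in sd_row[1:]]
    return {name: (m, s) for name, m, s in zip(header, means, sds)}


# Abbreviated three-column sample in the same layout as the output above.
sample = """GC STR HH
Mean 0.512 0.003 0.256
STDev 0.045 0.012 0.028"""
stats = parse_windowed(sample)
print(stats["GC"])
```

If the real output is tab-delimited, split() without arguments still handles it, since it splits on any run of whitespace.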

Examples

Global Analysis

scalars.sh in=genome.fasta

Calculate compositional metrics for entire genome, output to console.

Windowed Analysis

scalars.sh in=genome.fasta out=metrics.txt window=10000 header=t

Compute metrics using 10kb sliding windows, output mean and standard deviation with column headers.

Multiple Files

scalars.sh in=genome1.fasta,genome2.fasta,genome3.fasta out=comparison.txt header=t rowheader=t

Process multiple genomes and create comparison table with headers.

Directory Processing

scalars.sh in=genomes_directory/ out=all_metrics.txt window=5000 header=t

Process all FASTA/FASTQ files in directory with 5kb windowed analysis.

Contig-Aware Windowing

scalars.sh in=assembly.fasta window=1000 break=t header=t out=metrics.txt

Calculate metrics with 1kb windows that don't span contig boundaries.
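The effect of break=t on the number of windows can be estimated with a short calculation. This sketch assumes the window advances one base at a time and that, with break=t, a contig shorter than the window yields no windows; count_windows is a hypothetical helper used only for illustration.

```python
def count_windows(contig_lengths, window, break_at_contigs):
    """Number of full windows, with or without resetting at contig boundaries."""
    if break_at_contigs:
        # Each contig is windowed independently; short contigs contribute nothing.
        return sum(max(0, n - window + 1) for n in contig_lengths)
    # Without breaking, the input is treated as one continuous sequence.
    return max(0, sum(contig_lengths) - window + 1)


lengths = [2500, 800, 1200]  # made-up contig lengths
with_break = count_windows(lengths, 1000, True)
without_break = count_windows(lengths, 1000, False)
print(with_break, without_break)
```

Under these assumptions, breaking at contig boundaries reduces the window count, because windows that would straddle two contigs are never formed.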

Algorithm Details

Processing Pipeline

Global Mode (window=0):

Dimer Counting: Accumulate all dinucleotide frequencies across entire input
Metric Calculation: Compute each compositional metric from dimer counts
Output: Print single row of mean values for all metrics
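The global steps above can be sketched as follows. This is an illustration rather than the code in Scalars.java: the formulas for HH, CAGA and the other metrics are not given here, so only GC is derived, and skipping dimers that contain non-ACGT symbols is an assumption. The function names are hypothetical.

```python
from collections import Counter


def count_dimers(seq):
    """Count overlapping dinucleotides, skipping any that contain non-ACGT symbols."""
    seq = seq.upper()
    counts = Counter()
    for i in range(len(seq) - 1):
        dimer = seq[i:i + 2]
        if all(b in "ACGT" for b in dimer):
            counts[dimer] += 1
    return counts


def gc_from_dimers(counts):
    """Estimate GC content as the fraction of dimers whose first base is G or C."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    gc = sum(n for dimer, n in counts.items() if dimer[0] in "GC")
    return gc / total


counts = count_dimers("ACGTACGTNNGGCC")
print(counts["AC"], gc_from_dimers(counts))
```

Because every metric is computed from the same 16 counts, the sequence only needs to be scanned once regardless of how many metrics are reported.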

Windowed Mode (window>0):

Sliding Window: For each base in input:
  - Update window with new base (removing the oldest base if the window is full)
  - Once the window is full (holds exactly window bases), calculate all 14 metrics
  - Add each metric value to its corresponding histogram (1025 bins, 0.0-1.0 range)
Statistical Analysis: Calculate mean and standard deviation from histograms
Output: Print two rows, mean values and standard deviations
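The windowed steps above can be sketched for a single metric. This is a minimal illustration, not the tool's implementation: it tracks only GC fraction, assumes the window advances one base at a time, assumes values map linearly onto the 1025 bins, and uses the population standard deviation. windowed_gc_histogram and mean_stdev are hypothetical names.

```python
import math
from collections import deque

BINS = 1025  # bin count quoted above; the value-to-bin mapping is an assumption


def windowed_gc_histogram(seq, window):
    """Slide a fixed-size window over seq and histogram the GC fraction of each full window."""
    hist = [0] * BINS
    buf = deque()
    gc = 0
    for base in seq.upper():
        buf.append(base)
        gc += 1 if base in "GC" else 0
        if len(buf) > window:  # window overfull: drop the oldest base
            gc -= 1 if buf.popleft() in "GC" else 0
        if len(buf) == window:  # only full windows contribute
            hist[round(gc / window * (BINS - 1))] += 1
    return hist


def mean_stdev(hist):
    """Recover mean and population standard deviation from bin counts."""
    n = sum(hist)
    if n == 0:
        return 0.0, 0.0
    values = [i / (BINS - 1) for i in range(BINS)]
    mean = sum(v * c for v, c in zip(values, hist)) / n
    var = sum(c * (v - mean) ** 2 for v, c in zip(values, hist)) / n
    return mean, math.sqrt(var)


hist = windowed_gc_histogram("GGGGAAAA", 4)
mean, sd = mean_stdev(hist)
print(sum(hist), mean, sd)
```

Storing bin counts instead of per-window values keeps memory constant no matter how many windows the input produces, at the cost of rounding each value to the nearest bin.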

Computational Details

Dimer-Based Metrics: Efficiently counts dinucleotide frequencies across sequence
Histogram Resolution: 1025 bins over the 0.0-1.0 range give a resolution of about 0.001 per metric value for variance analysis in windowed mode
Window Management: Sliding window updates incrementally without recalculating entire window
Strand Handling: Counts dinucleotides from both forward and reverse strands (AA/TT combined, etc.)
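Combining each dinucleotide with its reverse complement reduces the 16 possible dimers to 10 strand-independent classes. The sketch below shows one way to do this, assuming the lexicographically smaller of the pair is used as the class name; the tool may label or order the combined dimers differently, and canonical is a hypothetical helper.

```python
from itertools import product

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical(dimer):
    """Return the lexicographically smaller of a dimer and its reverse complement."""
    rc = dimer.translate(COMPLEMENT)[::-1]
    return min(dimer, rc)


# Enumerate all 16 dimers and collapse them into strand-independent classes.
classes = sorted({canonical(a + b) for a, b in product("ACGT", repeat=2)})
print(len(classes), classes)
```

Four dimers (AT, TA, CG, GC) are their own reverse complements, and the remaining twelve pair up into six classes, which is where the total of ten comes from.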

Memory Requirements

Memory usage is minimal for global mode (only dimer counts). Windowed mode requires approximately 115KB for histograms (14 metrics × 1025 bins × 8 bytes = 114,800 bytes). The default 800MB allocation handles large genomes comfortably.
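The histogram footprint can be checked directly from the product quoted above. A quick sketch, assuming one 8-byte counter per bin as the formula implies; the constant names are illustrative only.

```python
METRICS = 14
BINS = 1025
BYTES_PER_BIN = 8  # assumes one 64-bit counter per bin

total_bytes = METRICS * BINS * BYTES_PER_BIN
print(total_bytes, "bytes =", round(total_bytes / 1024, 1), "KiB")
```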

Use Cases

Genome Comparison: Compare compositional signatures across species
Contamination Detection: Identify regions with anomalous composition
Quality Assessment: Detect compositional biases in sequencing data
GC-Independent Analysis: Characterize genomes beyond simple GC content
Within-Genome Variance: Use windowed mode to identify compositionally distinct regions

Support

Please contact Brian Bushnell at bbushnell@lbl.gov if you encounter any problems.