Scalars
Calculates compositional scalar metrics from nucleotide sequence data. Computes GC-independent metrics (HH, CAGA, strandedness, etc.) either globally or using a sliding window to characterize within-genome variance. Outputs mean and standard deviation for each metric.
Basic Usage
scalars.sh in=<input file> out=<output file>
Scalars analyzes nucleotide sequence composition and prints the average of each metric for each input file. In windowed mode, it also prints the standard deviation of each metric for each file.
Parameters
Parameters control input/output locations and analysis mode (global vs windowed).
Standard Parameters
- in=<file>
- Primary input; FASTA or FASTQ format. This can also be a directory or a comma-delimited list of files. Filenames can also be given without the in= prefix.
- out=stdout
- Set to a file to redirect output. Default: stdout (prints to console).
Processing Parameters
- header=f
- Print a column header line. Default: false.
- rowheader=f
- Print a row header for each output row. Default: false.
- window=0
- If nonzero, calculate and average over sliding windows of this size. Default: 0 (global analysis mode). When set, computes mean and standard deviation across all windows.
- break=f
- Set to true to break data at contig boundaries in windowed mode. Prevents windows from spanning multiple contigs. Default: false.
- raw=f
- Output raw dinucleotide frequencies instead of computed metrics. Default: false.
Java Parameters
- -Xmx
- Set Java's memory usage, overriding autodetection. -Xmx20g specifies 20 GB of RAM, and -Xmx200m specifies 200 MB. The maximum is typically 85% of physical memory. Default: 800m (fixed allocation).
- -eoom
- This flag will cause the process to exit if an out-of-memory exception occurs. Requires Java 8u92+.
- -da
- Disable assertions.
Output Format
Compositional Metrics
Scalars computes the following compositional metrics based on dinucleotide frequencies:
- GC: GC content
- STR: Strandedness
- HH: HH dinucleotide-based metric
- PP: PP dinucleotide-based metric
- AAAT: AAAT dinucleotide-based metric
- CCCG: CCCG dinucleotide-based metric
- HMH: HMH dinucleotide-based metric
- HHPP: HHPP dinucleotide-based metric
- ACTG: ACTG dinucleotide-based metric
- ACAG: ACAG dinucleotide-based metric
- CAGA: CAGA dinucleotide-based metric
- CCMCG: CCMCG dinucleotide-based metric
- ATMTA: ATMTA dinucleotide-based metric
- AT: AT content
Global Mode (window=0)
GC STR HH PP AAAT CCCG HMH HHPP ACTG ACAG CAGA CCMCG ATMTA AT
0.512 0.003 0.256 0.244 0.183 0.187 0.251 0.500 0.501 0.252 0.248 0.245 0.255 0.488
One row per input file with mean values for each metric.
Windowed Mode (window>0)
GC STR HH PP AAAT CCCG HMH HHPP ACTG ACAG CAGA CCMCG ATMTA AT
Mean 0.512 0.003 0.256 0.244 0.183 0.187 0.251 0.500 0.501 0.252 0.248 0.245 0.255 0.488
STDev 0.045 0.012 0.028 0.026 0.032 0.034 0.029 0.041 0.038 0.031 0.029 0.033 0.030 0.045
Two rows per input file: mean and standard deviation across all windows.
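For downstream analysis, the tables above can be loaded into a script. The following is a minimal Python sketch, not part of Scalars itself. It assumes whitespace-delimited columns in the order shown, an optional leading row label (such as Mean or STDev), and that any line starting with # can be ignored; parse_scalars_output is a hypothetical helper name.

```python
METRICS = ["GC", "STR", "HH", "PP", "AAAT", "CCCG", "HMH", "HHPP",
           "ACTG", "ACAG", "CAGA", "CCMCG", "ATMTA", "AT"]


def parse_scalars_output(text):
    """Parse Scalars-style rows into dicts keyed by metric name.

    Assumption: each data row holds 14 numeric fields, optionally
    preceded by one row label. Column-header lines are skipped.
    """
    n = len(METRICS)
    rows = []
    for line in text.splitlines():
        fields = line.split()
        # Skip blank lines, comment lines, and the column header.
        if not fields or fields[0].startswith("#") or fields[-n:] == METRICS:
            continue
        if len(fields) == n:
            label, values = None, fields
        elif len(fields) == n + 1:
            label, values = fields[0], fields[1:]
        else:
            raise ValueError("unexpected column count: %r" % line)
        row = {"label": label}
        row.update(zip(METRICS, (float(v) for v in values)))
        rows.append(row)
    return rows


sample = """\
GC STR HH PP AAAT CCCG HMH HHPP ACTG ACAG CAGA CCMCG ATMTA AT
Mean 0.512 0.003 0.256 0.244 0.183 0.187 0.251 0.500 0.501 0.252 0.248 0.245 0.255 0.488
STDev 0.045 0.012 0.028 0.026 0.032 0.034 0.029 0.041 0.038 0.031 0.029 0.033 0.030 0.045
"""
rows = parse_scalars_output(sample)
print(rows[0]["label"], rows[0]["GC"], rows[1]["GC"])  # Mean 0.512 0.045
```

Rows without a label (as in the global-mode example) are returned with label set to None.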
Examples
Global Analysis
scalars.sh in=genome.fasta
Calculate compositional metrics for entire genome, output to console.
Windowed Analysis
scalars.sh in=genome.fasta out=metrics.txt window=10000 header=t
Compute metrics using 10kb sliding windows, output mean and standard deviation with column headers.
Multiple Files
scalars.sh in=genome1.fasta,genome2.fasta,genome3.fasta out=comparison.txt header=t rowheader=t
Process multiple genomes and create comparison table with headers.
Directory Processing
scalars.sh in=genomes_directory/ out=all_metrics.txt window=5000 header=t
Process all FASTA/FASTQ files in directory with 5kb windowed analysis.
Contig-Aware Windowing
scalars.sh in=assembly.fasta window=1000 break=t header=t out=metrics.txt
Calculate metrics with 1kb windows that don't span contig boundaries.
Algorithm Details
Processing Pipeline
Global Mode (window=0):
- Dimer Counting: Accumulate all dinucleotide frequencies across entire input
- Metric Calculation: Compute each compositional metric from dimer counts
- Output: Print single row of mean values for all metrics
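The global-mode steps can be sketched in a few lines of Python. This is an illustration of the idea, not the tool's implementation: count_dimers and gc_content are hypothetical names, and skipping dimers that contain non-ACGT symbols and not counting dimers across sequence boundaries are assumptions. GC content is used as the example metric because its definition is unambiguous; the other metrics are not reproduced here.

```python
from collections import Counter


def count_dimers(sequences):
    """Count overlapping dinucleotides across an iterable of sequences.

    Assumptions: dimers containing anything other than A, C, G, T are
    skipped, and a dimer never spans two sequences.
    """
    counts = Counter()
    for seq in sequences:
        seq = seq.upper()
        for i in range(len(seq) - 1):
            dimer = seq[i:i + 2]
            if dimer[0] in "ACGT" and dimer[1] in "ACGT":
                counts[dimer] += 1
    return counts


def gc_content(counts):
    """Estimate the GC fraction from dimer counts alone.

    Each dimer contributes two bases, so the G/C tally is divided by
    twice the number of dimers.
    """
    total = sum(counts.values())
    if total == 0:
        return 0.0
    gc_bases = sum(n * sum(base in "GC" for base in dimer)
                   for dimer, n in counts.items())
    return gc_bases / (2 * total)


counts = count_dimers(["GGCCAATT"])
print(dict(counts))
print(gc_content(counts))
```

Because interior bases appear in two dimers and terminal bases in only one, a value derived this way matches the per-base GC fraction only approximately for short sequences; the difference becomes negligible for genome-scale input.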
Windowed Mode (window>0):
- Sliding Window: For each base in input:
  - Update window with new base (removing oldest if window full)
  - When window is valid (full window size), calculate all 14 metrics
  - Add each metric value to corresponding histogram (1025 bins, 0.0-1.0 range)
- Statistical Analysis: Calculate mean and standard deviation from histograms
- Output: Print two rows: mean values and standard deviations
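The windowed pipeline can be illustrated with a short Python sketch that tracks a single metric (GC fraction) instead of all 14. It is a sketch under stated assumptions, not the actual implementation: the function names and the break_contigs argument are hypothetical stand-ins (the latter for break=t), the binning rule (nearest of 1025 evenly spaced values) is a guess, and the standard deviation is computed as a population value.

```python
import math
from collections import deque

BINS = 1025          # bin count quoted above, covering 0.0-1.0
GC = frozenset("GC")


def windowed_gc_histogram(contigs, window, break_contigs=False):
    """Histogram the GC fraction of every full window.

    The running GC count is updated incrementally: the entering base
    is added and the leaving base subtracted, so the window is never
    rescanned. With break_contigs=True the window is reset at each
    contig, so no window spans two contigs.
    """
    hist = [0] * BINS
    buf = deque()
    gc = 0
    for contig in contigs:
        if break_contigs:
            buf.clear()
            gc = 0
        for base in contig.upper():
            buf.append(base)
            gc += base in GC
            if len(buf) > window:
                gc -= buf.popleft() in GC
            if len(buf) == window:
                # Assumed binning: nearest of 1025 evenly spaced values.
                hist[round(gc / window * (BINS - 1))] += 1
    return hist


def histogram_mean_stdev(hist):
    """Return (mean, population standard deviation) of binned values."""
    n = sum(hist)
    if n == 0:
        return 0.0, 0.0
    scale = 1.0 / (len(hist) - 1)
    mean = sum(i * c for i, c in enumerate(hist)) * scale / n
    var = sum(c * (i * scale - mean) ** 2 for i, c in enumerate(hist)) / n
    return mean, math.sqrt(var)


contigs = ["GGGG", "AAAA"]
spanning = windowed_gc_histogram(contigs, window=4)
broken = windowed_gc_histogram(contigs, window=4, break_contigs=True)
print(sum(spanning), histogram_mean_stdev(spanning))
print(sum(broken), histogram_mean_stdev(broken))
```

In this toy input, allowing windows to span the contig boundary yields five windows with intermediate GC values, while breaking at the boundary yields only the two pure windows, which raises the reported standard deviation.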
Computational Details
- Dimer-Based Metrics: Efficiently counts dinucleotide frequencies across sequence
- Histogram Resolution: 1025 bins spanning 0.0-1.0 give a resolution of roughly 0.001 per metric for variance analysis in windowed mode
- Window Management: Sliding window updates incrementally without recalculating entire window
- Strand Handling: Counts dinucleotides from both forward and reverse strands (AA/TT combined, etc.)
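Combining strands means that a dimer and its reverse complement are counted as one class (AA with TT, AC with GT, and so on), which reduces the 16 dinucleotides to 10 classes. A minimal Python sketch of that merge follows; the function names are hypothetical, and choosing the lexicographically smaller form as the representative is an assumption made for illustration, not a statement about how Scalars labels the classes internally.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical_dimer(dimer):
    """Return the smaller of a dimer and its reverse complement."""
    revcomp = dimer.translate(COMPLEMENT)[::-1]
    return min(dimer, revcomp)


def merge_strands(counts):
    """Sum counts of each dimer with those of its reverse complement."""
    merged = {}
    for dimer, n in counts.items():
        key = canonical_dimer(dimer)
        merged[key] = merged.get(key, 0) + n
    return merged


print(merge_strands({"AA": 3, "TT": 2, "GT": 1}))  # {'AA': 5, 'AC': 1}
classes = {canonical_dimer(a + b) for a in "ACGT" for b in "ACGT"}
print(len(classes))  # 10
```

The four palindromic dimers (AT, TA, CG, GC) are their own reverse complements, which is why the count is 10 and not 8.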
Memory Requirements
Memory usage is minimal for global mode (only dimer counts). Windowed mode adds approximately 115 KB for histograms (14 metrics × 1025 bins × 8 bytes = 114,800 bytes). The default 800 MB allocation is ample for large genomes.
Use Cases
- Genome Comparison: Compare compositional signatures across species
- Contamination Detection: Identify regions with anomalous composition
- Quality Assessment: Detect compositional biases in sequencing data
- GC-Independent Analysis: Characterize genomes beyond simple GC content
- Within-Genome Variance: Use windowed mode to identify compositionally distinct regions
Support
Please contact Brian Bushnell at bbushnell@lbl.gov if you encounter any problems.