ScalarIntervals

Script: scalarintervals.sh Package: scalar Class: ScalarIntervals.java

Calculates compositional scalar metrics from nucleotide sequence data and writes them at regular intervals to a TSV file. Computes GC-independent metrics (HH, CAGA, strandedness, etc.) either globally or using a sliding window to characterize within-genome variance. Outputs the mean and standard deviation of each metric to help identify compositional patterns and heterogeneity in genomic sequences.

Basic Usage

scalarintervals.sh in=<input file> out=<output file>

ScalarIntervals processes nucleotide sequences (FASTA or FASTQ) to calculate compositional metrics that are independent of GC content. The tool can analyze sequences globally or using sliding windows to capture within-genome variance.

Parameters

Parameters control input/output locations, windowing behavior, and taxonomic assignment.

Standard Parameters

in=<file>
Primary input; FASTA or FASTQ format. This can also be a directory or a comma-delimited list. Filenames can also be given without the in= prefix.
out=stdout
Set to a file to redirect TSV output. The mean and standard deviation will be printed to stderr.

Processing Parameters

header=f
Print a header line in the output TSV.
window=50000
If nonzero, calculate metrics over sliding windows. Otherwise calculate per contig. Larger windows have lower variance.
interval=10000
Generate a data point every this many bp. Controls the frequency of output records.
shred=-1
If positive, set window and interval to the same size. Convenient for non-overlapping window analysis.
break=t
Reset metrics at contig boundaries. When true, each contig is analyzed independently.
minlen=500
Minimum interval length to generate a data point. Shorter intervals are skipped.
maxreads=-1
Maximum number of reads/contigs to process. Default: unlimited (-1).
printname=f
Print contig names in output. Useful for tracking which contig each metric corresponds to.
printpos=f
Print contig position in output. Shows the genomic coordinate for each data point.
printtime=t
Print timing information to screen.
parsetid=f
Parse TaxIDs from file and sequence headers. Enables taxonomic labeling of sequences.
sketch=f
Use BBSketch (SendSketch) to assign taxonomy per contig. Sends sketches to a remote server for taxonomic classification.
clade=f
Use QuickClade to assign taxonomy per contig. Sends clades to a remote server for taxonomic classification.

Java Parameters

-Xmx
Set Java's memory usage, overriding autodetection. -Xmx20g will specify 20 gigs of RAM, and -Xmx200m will specify 200 megs. The max is typically 85% of physical memory. Default: 800m (fixed allocation).
-eoom
This flag will cause the process to exit if an out-of-memory exception occurs. Requires Java 8u92+.
-da
Disable assertions.

Examples

Basic Analysis with Sliding Windows

scalarintervals.sh in=ecoli.fasta out=data.tsv shred=5k

Analyze E. coli genome using 5kb non-overlapping windows (shred sets both window and interval to 5000).

Multiple Files with Custom Windows

scalarintervals.sh *.fa.gz out=data.tsv shred=5k

Process all compressed FASTA files in current directory with 5kb windows.

Global Analysis Per Contig

scalarintervals.sh in=genome.fasta out=metrics.tsv window=0

Calculate global metrics for each contig (no windowing).

Overlapping Windows with Position Tracking

scalarintervals.sh in=genome.fasta out=data.tsv window=50000 interval=10000 printpos=t printname=t

Use 50kb sliding windows with 10kb steps, printing contig names and positions for each data point.

Taxonomic Classification with BBSketch

scalarintervals.sh in=metagenome.fasta out=classified.tsv shred=10k sketch=t

Analyze metagenome with 10kb windows and assign taxonomy using BBSketch server.

QuickClade Analysis with Header

scalarintervals.sh in=viral_genomes.fasta out=analysis.tsv clade=t header=t

Classify viral genomes using QuickClade with column headers in output.

Algorithm Details

Compositional Metrics

ScalarIntervals calculates GC-independent compositional metrics using k-mer (specifically dimer) counting. These metrics, such as HH, CAGA, and strandedness, capture sequence characteristics beyond simple GC content.
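
As a rough illustration of dimer counting, the sketch below tallies overlapping dinucleotides in a sequence and normalizes them to frequencies. The function name dimer_frequencies and the choice to skip dimers containing non-ACGT symbols are assumptions made for this example. It does not reproduce the tool's KmerTracker or the exact definitions of HH, CAGA, or strandedness.

```python
from collections import Counter


def dimer_frequencies(seq):
    """Return the frequency of each overlapping dinucleotide in seq.

    Hypothetical helper for illustration only. Dimers containing symbols
    other than A, C, G, T (for example N) are skipped, which is an
    assumption rather than documented tool behavior.
    """
    seq = seq.upper()
    counts = Counter()
    for i in range(len(seq) - 1):
        dimer = seq[i:i + 2]
        if dimer[0] in "ACGT" and dimer[1] in "ACGT":
            counts[dimer] += 1
    total = sum(counts.values())
    if total == 0:
        return {}
    # Normalize raw counts so the frequencies sum to 1.
    return {dimer: n / total for dimer, n in counts.items()}


freqs = dimer_frequencies("ACGTACGTNNAC")
for dimer in sorted(freqs):
    print(dimer, freqs[dimer])
```

Any scalar metric derived from such a table would be a function of these 16 frequencies; which combinations the tool actually uses is not specified here.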

Processing Pipeline

  1. Read Input: Load sequences from FASTA/FASTQ files (supports compression)
  2. Dimer Tracking: Count dinucleotide frequencies using a KmerTracker with k=2
  3. Window Analysis:
    • If window > 0: Calculate metrics over sliding windows of specified size
    • If window = 0: Calculate metrics globally per contig
  4. Interval Output: Generate data points at specified intervals (every N bases)
  5. Taxonomic Assignment (optional):
    • sketch=t: Create MinHash sketches and send to BBSketch server
    • clade=t: Generate clade signatures and send to QuickClade server
    • parsetid=t: Extract taxonomy from sequence headers
  6. Statistics: Calculate mean and standard deviation across all intervals

Windowing Behavior

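The window and interval parameters together determine whether consecutive data points overlap. When interval is smaller than window (for example, window=50000 interval=10000), each data point covers a window that overlaps the previous one. When the two are equal, as set by shred, the windows are non-overlapping. When window=0, metrics are calculated once per contig.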
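
To make the window and interval relationship concrete, the sketch below lists the span each data point would cover under a simple model: one data point every interval bases, each summarizing at most the preceding window bases. The function window_spans is hypothetical, and the model ignores minlen and any trailing partial interval, so it should be read as an approximation of the documented parameters rather than the tool's exact behavior.

```python
def window_spans(contig_len, window, interval):
    """Return (start, end) spans, 0-based half-open, for one contig.

    Assumed model: a data point is produced every `interval` bases and
    covers up to the last `window` bases. window=0 means one span for
    the whole contig. minlen and trailing bases are not modeled.
    """
    if window <= 0:
        return [(0, contig_len)]
    return [(max(0, end - window), end)
            for end in range(interval, contig_len + 1, interval)]


# Overlapping windows: 50 kb window, 10 kb step.
overlapping = window_spans(100000, 50000, 10000)

# Non-overlapping windows, as with shred=5k (window == interval == 5000).
shredded = window_spans(20000, 5000, 5000)

# Per-contig analysis, as with window=0.
whole = window_spans(20000, 0, 10000)

print(overlapping[5], shredded, whole)
```

With interval smaller than window, consecutive spans share most of their bases, which is why neighboring data points tend to be correlated.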

Memory Requirements

Default allocation is 800MB, which is sufficient for most analyses. Memory usage scales with:

Output Format

TSV output contains one row per interval with calculated scalar metrics. When header=t, column names are printed. Statistics (mean and standard deviation) for all metrics are printed to stderr for quick assessment of genome-wide characteristics.
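
For downstream use, the summary statistics can be approximated from the TSV itself. The sketch below assumes the file was written with header=t and that metric columns are numeric; the column names in the sample data (name, metricA, metricB) are invented for illustration and are not the tool's real headers. It uses the population standard deviation, which is also an assumption, since the formula used for the stderr summary is not stated.

```python
import csv
import io
import statistics


def summarize_tsv(handle):
    """Return {column: (mean, population_stdev)} for numeric columns.

    Hypothetical helper: expects a header row, and ignores fields that
    cannot be parsed as numbers (for example contig names).
    """
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    columns = {name: [] for name in header}
    for row in reader:
        for name, value in zip(header, row):
            try:
                columns[name].append(float(value))
            except ValueError:
                pass  # non-numeric field, such as a contig name
    return {name: (statistics.fmean(values), statistics.pstdev(values))
            for name, values in columns.items() if values}


# Made-up data with hypothetical column names.
sample = "name\tmetricA\tmetricB\nc1\t0.10\t0.50\nc1\t0.30\t0.70\n"
summary = summarize_tsv(io.StringIO(sample))
for column, (mean, stdev) in summary.items():
    print(column, mean, stdev)
```

To run it on real output, pass an open file handle, for example open("data.tsv", newline=""), in place of the in-memory sample.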

Support

Please contact Brian Bushnell at bbushnell@lbl.gov if you encounter any problems.