ScalarIntervals

Script: scalarintervals.sh
Package: scalar
Class: ScalarIntervals.java

Calculates compositional scalar metrics from nucleotide sequence data and writes them per interval as a TSV file. Computes GC-independent metrics (HH, CAGA, strandedness, etc.) either globally or using a sliding window to characterize within-genome variance. Outputs mean and standard deviation for each metric to help identify compositional patterns and heterogeneity in genomic sequences.

Basic Usage

scalarintervals.sh in=<input file> out=<output file>

ScalarIntervals processes nucleotide sequences (FASTA or FASTQ) to calculate compositional metrics that are independent of GC content. The tool can analyze sequences globally or using sliding windows to capture within-genome variance.

Parameters

Parameters control input/output locations, windowing behavior, and taxonomic assignment.

Standard Parameters

in=<file>: Primary input in FASTA or FASTQ format. This can also be a directory or a comma-delimited list of files. Filenames can also be given without the in= prefix.
out=stdout: TSV output goes to stdout by default; set this to a file to redirect it. The mean and standard deviation are printed to stderr.

Depth/Coverage Parameters

cov=<file> / coverage=<file> / covfile=<file>: Coverage file from pileup.sh or covmaker.sh. Provides the depth information used for the depth metric. Standard pileup output with contig ID and average fold coverage is supported.
depth=<file> / depthfile=<file>: SAM or BAM file for depth calculation. Depth is calculated from the aligned bases in the file and included in the output metrics.
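
To make the coverage input concrete, here is a minimal Python sketch of reading average depth per contig from a coverage table. It assumes a tab-delimited file whose first column is the contig ID and whose second column is the average fold coverage, with header lines starting with '#'. The function name read_avg_depth and the sample rows are hypothetical, and this is not the parser ScalarIntervals itself uses.

```python
import io

def read_avg_depth(handle):
    """Map contig ID -> average fold coverage from a tab-delimited table.

    Assumption: column 1 is the contig ID, column 2 is the average fold
    coverage, and header lines start with '#'.
    """
    depth = {}
    for line in handle:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and header/comment lines
        fields = line.split("\t")
        if len(fields) < 2:
            continue  # ignore malformed rows
        depth[fields[0]] = float(fields[1])
    return depth

# Made-up rows for illustration only.
example = io.StringIO(
    "#ID\tAvg_fold\tLength\n"
    "contig_1\t12.5\t40000\n"
    "contig_2\t3.0\t8000\n"
)
print(read_avg_depth(example))
```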

Processing Parameters

header=f: Print a header line in the output TSV.
window=50000: If nonzero, calculate metrics over sliding windows of this many bp. Otherwise calculate per contig. Larger windows give lower variance between data points.
interval=10000: Generate one data point every interval bp (by default, every 10000 bp). Controls the frequency of output records.
shred=-1: If positive, set both window and interval to this value. Convenient for non-overlapping window analysis.
break=t: Reset metrics at contig boundaries. When true, each contig is analyzed independently.
minlen=500: Minimum interval length to generate a data point. Shorter intervals are skipped.
maxreads=-1: Maximum number of reads/contigs to process. Default: unlimited (-1).
printname=f: Print contig names in output. Useful for tracking which contig each metric corresponds to.
printpos=f: Print contig position in output. Shows the genomic coordinate for each data point.
printtime=t: Print timing information to screen.
parsetid=f: Parse TaxIDs from file and sequence headers. Enables taxonomic labeling of sequences.
sketch=f: Use BBSketch (SendSketch) to assign taxonomy per contig. Sends sketches to a remote server for taxonomic classification.
clade=f: Use QuickClade to assign taxonomy per contig. Sends clades to a remote server for taxonomic classification.

Java Parameters

-Xmx: Set Java's memory usage, overriding the default. -Xmx20g specifies 20 GB of RAM, and -Xmx200m specifies 200 MB. The max is typically 85% of physical memory. Default: 800m (fixed allocation).
-eoom: This flag will cause the process to exit if an out-of-memory exception occurs. Requires Java 8u92+.
-da: Disable assertions.

Examples

Basic Analysis with Sliding Windows

scalarintervals.sh in=ecoli.fasta out=data.tsv shred=5k

Analyze E. coli genome using 5kb non-overlapping windows (shred sets both window and interval to 5000).

Multiple Files with Custom Windows

scalarintervals.sh *.fa.gz out=data.tsv shred=5k

Process all compressed FASTA files in the current directory with 5kb non-overlapping windows.

Global Analysis Per Contig

scalarintervals.sh in=genome.fasta out=metrics.tsv window=0

Calculate global metrics for each contig (no windowing).

Overlapping Windows with Position Tracking

scalarintervals.sh in=genome.fasta out=data.tsv window=50000 interval=10000 printpos=t printname=t

Use 50kb sliding windows with 10kb steps, printing contig names and positions for each data point.

Taxonomic Classification with BBSketch

scalarintervals.sh in=metagenome.fasta out=classified.tsv shred=10k sketch=t

Analyze a metagenome with 10kb windows and assign taxonomy using the BBSketch server.

QuickClade Analysis with Header

scalarintervals.sh in=viral_genomes.fasta out=analysis.tsv clade=t header=t

Classify viral genomes using QuickClade with column headers in output.
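
For batch runs, command lines like the ones above can be assembled programmatically. The following Python sketch is one possible way to do that; build_command is a hypothetical helper, not part of BBTools, and it only formats key=value arguments using the parameter names documented in this guide.

```python
import shlex

def build_command(inputs, out, **params):
    """Assemble a scalarintervals.sh argument list from key=value options.

    Multiple inputs are joined into a comma-delimited list, and booleans
    are written as t/f, matching the style used in this guide.
    """
    cmd = ["scalarintervals.sh", "in=" + ",".join(inputs), "out=" + out]
    for key, value in params.items():
        if isinstance(value, bool):
            value = "t" if value else "f"
        cmd.append(f"{key}={value}")
    return cmd

cmd = build_command(["genome.fasta"], "data.tsv", window=50000,
                    interval=10000, printpos=True, printname=True)
print(shlex.join(cmd))  # shell-quoted form of the argument list
```

The returned list can be passed to subprocess.run(cmd, check=True) on a system where scalarintervals.sh is on the PATH.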

Algorithm Details

Compositional Metrics

ScalarIntervals calculates GC-independent compositional metrics using k-mer (specifically dimer) counting. These metrics capture sequence characteristics beyond simple GC content:

HH metric: Homopolymer/heteropolymer ratios
CAGA metric: Specific dinucleotide patterns
Strandedness: Strand bias indicators
Additional scalars: Derived compositional features
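
As an illustration of the dimer counting that underlies these metrics, the Python sketch below tallies overlapping dinucleotides and normalizes them to frequencies. It does not reproduce the HH or CAGA formulas, which are not defined in this document, and skipping dimers that contain ambiguous bases is an assumption rather than confirmed tool behavior. The function name dimer_frequencies is hypothetical.

```python
from collections import Counter

def dimer_frequencies(seq):
    """Return normalized frequencies of overlapping dinucleotides (k=2).

    Assumption: dimers containing anything other than A, C, G, or T
    are skipped.
    """
    seq = seq.upper()
    valid = set("ACGT")
    counts = Counter()
    for i in range(len(seq) - 1):
        dimer = seq[i:i + 2]
        if set(dimer) <= valid:
            counts[dimer] += 1
    total = sum(counts.values())
    if total == 0:
        return {}
    return {dimer: n / total for dimer, n in counts.items()}

freqs = dimer_frequencies("ACGTACGTNNAC")
for dimer in sorted(freqs):
    print(dimer, round(freqs[dimer], 3))
```

Scalar metrics of this kind are then derived by combining selected dimer frequencies into single numbers per window or contig.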

Processing Pipeline

Read Input: Load sequences from FASTA/FASTQ files (supports compression)
Dimer Tracking: Count dinucleotide frequencies using a KmerTracker with k=2
Window Analysis:
- If window > 0: Calculate metrics over sliding windows of specified size
- If window = 0: Calculate metrics globally per contig
Interval Output: Generate data points at specified intervals (every N bases)
Taxonomic Assignment (optional):
- sketch=t: Create MinHash sketches and send to BBSketch server
- clade=t: Generate clade signatures and send to QuickClade server
- parsetid=t: Extract taxonomy from sequence headers
Statistics: Calculate mean and standard deviation across all intervals

Windowing Behavior

The relationship between window and interval parameters:

window > interval: Overlapping windows (e.g., with window=50000 and interval=10000, each base can contribute to up to 5 windows)
window = interval: Non-overlapping windows (use shred parameter for convenience)
window = 0: No windowing; analyze entire contigs
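
The three cases above can be sketched in Python as follows. The layout is an assumption made for illustration, not taken from the tool's source: a data point is emitted every interval bases and summarizes up to window bases ending at that position, spans shorter than minlen are dropped, and any trailing remainder of the contig is ignored. The names window_spans and max_overlap are hypothetical.

```python
def window_spans(contig_len, window=50000, interval=10000, minlen=500):
    """List half-open (start, end) spans, one per data point.

    Assumed layout: one data point every `interval` bases, each covering
    up to `window` bases ending there. window=0 yields a single span for
    the whole contig.
    """
    if window <= 0:
        return [(0, contig_len)] if contig_len >= minlen else []
    spans = []
    for end in range(interval, contig_len + 1, interval):
        start = max(0, end - window)
        if end - start >= minlen:
            spans.append((start, end))
    return spans

def max_overlap(window, interval):
    """Largest number of windows that can cover a single base."""
    return -(-window // interval)  # ceiling division

# Overlapping windows, then non-overlapping windows (the shred case).
print(window_spans(30000, window=20000, interval=10000))
print(window_spans(15000, window=5000, interval=5000))
print(max_overlap(50000, 10000))
```

Under these assumptions, setting window equal to interval tiles the contig without overlap, which is what shred does in a single parameter.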

Memory Requirements

Default allocation is 800MB, which is sufficient for most analyses. Memory usage scales with:

Number of concurrent threads (multithreaded processing)
Window size (larger windows require more buffering)
Taxonomic features (sketch/clade generation adds overhead)

Output Format

TSV output contains one row per interval with calculated scalar metrics. When header=t, column names are printed. Statistics (mean and standard deviation) for all metrics are printed to stderr for quick assessment of genome-wide characteristics.
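
For downstream analysis, the per-interval TSV can be summarized independently of the stderr report. The Python sketch below assumes the file was written with header=t and computes a mean and standard deviation for every numeric column. The column names in the sample data are made up, summarize_tsv is a hypothetical helper, and the choice of population standard deviation is an assumption, since this document does not state which form the tool reports.

```python
import csv
import io
import statistics

def summarize_tsv(handle):
    """Return {column: (mean, stdev)} for the numeric columns of a TSV.

    Assumes the first row is a header (header=t). Columns containing any
    non-numeric value, such as contig names, are left out.
    """
    reader = csv.DictReader(handle, delimiter="\t")
    columns = {}
    for row in reader:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    summary = {}
    for name, values in columns.items():
        try:
            nums = [float(v) for v in values]
        except (TypeError, ValueError):
            continue  # non-numeric or incomplete column
        # Population standard deviation (an assumption, see lead-in).
        summary[name] = (statistics.fmean(nums), statistics.pstdev(nums))
    return summary

# Made-up data for illustration only.
sample = io.StringIO("name\tmetric_a\tmetric_b\nc1\t1.0\t0.2\nc2\t3.0\t0.4\n")
for column, (mean, sd) in summarize_tsv(sample).items():
    print(column, round(mean, 3), round(sd, 3))
```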

Support

Please contact Brian Bushnell at bbushnell@lbl.gov if you encounter any problems.