ScalarIntervals
Calculates compositional scalar metrics from nucleotide sequence data and writes them at regular intervals to a TSV file. Computes GC-independent metrics (HH, CAGA, strandedness, etc.) either globally or over a sliding window to characterize within-genome variance. Reports the mean and standard deviation of each metric to help identify compositional patterns and heterogeneity in genomic sequences.
Basic Usage
scalarintervals.sh in=<input file> out=<output file>
ScalarIntervals processes nucleotide sequences (FASTA or FASTQ) to calculate compositional metrics that are independent of GC content. The tool can analyze sequences globally or using sliding windows to capture within-genome variance.
Parameters
Parameters control input/output locations, windowing behavior, and taxonomic assignment.
Standard Parameters
- in=<file>
- Primary input; FASTA or FASTQ format. This can also be a directory or a comma-delimited list. Filenames can also be given without the in= prefix.
- out=stdout
- Destination for the TSV output; set to a file to redirect it from stdout. The mean and standard deviation are printed to stderr.
Processing Parameters
- header=f
- Print a header line in the output TSV.
- window=50000
- If nonzero, calculate metrics over sliding windows of this many bp. Otherwise calculate per contig. Larger windows give lower variance between data points.
- interval=10000
- Number of bp between consecutive data points. Controls how often output records are written.
- shred=-1
- If positive, set both window and interval to this value. Convenient for non-overlapping window analysis.
- break=t
- Reset metrics at contig boundaries. When true, each contig is analyzed independently.
- minlen=500
- Minimum interval length to generate a data point. Shorter intervals are skipped.
- maxreads=-1
- Maximum number of reads/contigs to process. Default: unlimited (-1).
- printname=f
- Print contig names in output. Useful for tracking which contig each metric corresponds to.
- printpos=f
- Print contig position in output. Shows the genomic coordinate for each data point.
- printtime=t
- Print timing information to screen.
- parsetid=f
- Parse TaxIDs from file and sequence headers. Enables taxonomic labeling of sequences.
- sketch=f
- Use BBSketch (SendSketch) to assign taxonomy per contig. Sends sketches to a remote server for taxonomic classification.
- clade=f
- Use QuickClade to assign taxonomy per contig. Sends clades to a remote server for taxonomic classification.
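The interaction between shred, window, and interval described above can be sketched in a few lines. This is only an illustration of the documented behavior, not the tool's code; the function name `resolve_windowing` is hypothetical, and the defaults are copied from the parameter list.

```python
def resolve_windowing(window=50000, interval=10000, shred=-1):
    """Return the effective (window, interval) pair.

    Mirrors the documented rule: a positive shred overrides both
    window and interval with the same value.
    """
    if shred > 0:
        return shred, shred
    return window, interval

# shred=5k from the examples corresponds to 5000 bp
effective = resolve_windowing(shred=5000)
```

With no arguments the function returns the documented defaults, a 50000 bp window sampled every 10000 bp.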
Java Parameters
- -Xmx
- Set Java's memory usage, overriding autodetection. -Xmx20g will specify 20 gigs of RAM, and -Xmx200m will specify 200 megs. The max is typically 85% of physical memory. Default: 800m (fixed allocation).
- -eoom
- This flag will cause the process to exit if an out-of-memory exception occurs. Requires Java 8u92+.
- -da
- Disable assertions.
Examples
Basic Analysis with Non-Overlapping Windows
scalarintervals.sh in=ecoli.fasta out=data.tsv shred=5k
Analyze E. coli genome using 5kb non-overlapping windows (shred sets both window and interval to 5000).
Multiple Files with Custom Windows
scalarintervals.sh *.fa.gz out=data.tsv shred=5k
Process all compressed FASTA files in the current directory with 5kb windows.
Global Analysis Per Contig
scalarintervals.sh in=genome.fasta out=metrics.tsv window=0
Calculate global metrics for each contig (no windowing).
Overlapping Windows with Position Tracking
scalarintervals.sh in=genome.fasta out=data.tsv window=50000 interval=10000 printpos=t printname=t
Use 50kb sliding windows with 10kb steps, printing contig names and positions for each data point.
Taxonomic Classification with BBSketch
scalarintervals.sh in=metagenome.fasta out=classified.tsv shred=10k sketch=t
Analyze a metagenome with 10kb windows and assign taxonomy using the BBSketch server.
QuickClade Analysis with Header
scalarintervals.sh in=viral_genomes.fasta out=analysis.tsv clade=t header=t
Classify viral genomes using QuickClade with column headers in output.
Algorithm Details
Compositional Metrics
ScalarIntervals calculates GC-independent compositional metrics using k-mer (specifically dimer) counting. These metrics capture sequence characteristics beyond simple GC content:
- HH metric: Homopolymer/heteropolymer ratios
- CAGA metric: Specific dinucleotide patterns
- Strandedness: Strand bias indicators
- Additional scalars: Derived compositional features
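As a rough illustration of the dimer counting these metrics build on, the sketch below tallies overlapping dinucleotides and normalizes them to frequencies. It is not the tool's implementation: the function name `dimer_frequencies` is hypothetical, skipping ambiguous bases is an assumption, and the formulas for HH, CAGA, and strandedness are not given in this document, so none of them are computed.

```python
from collections import Counter

def dimer_frequencies(seq):
    """Return the frequency of each overlapping dinucleotide in seq.

    Assumption: dimers containing anything other than A, C, G, or T
    are skipped; the tool's handling of ambiguous bases may differ.
    """
    seq = seq.upper()
    counts = Counter()
    for i in range(len(seq) - 1):
        dimer = seq[i:i + 2]
        if all(base in "ACGT" for base in dimer):
            counts[dimer] += 1
    total = sum(counts.values())
    if total == 0:
        return {}
    return {dimer: n / total for dimer, n in counts.items()}

# Toy sequence; the trailing N bases contribute no dimers
freqs = dimer_frequencies("ACGTACGTNN")
```

Any scalar derived from such a table is a function of at most 16 frequencies, which is why dimer-based metrics are cheap to recompute for every window.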
Processing Pipeline
- Read Input: Load sequences from FASTA/FASTQ files (supports compression)
- Dimer Tracking: Count dinucleotide frequencies using a KmerTracker with k=2
- Window Analysis:
  - If window > 0: Calculate metrics over sliding windows of specified size
  - If window = 0: Calculate metrics globally per contig
- Interval Output: Generate data points at specified intervals (every N bases)
- Taxonomic Assignment (optional):
  - sketch=t: Create MinHash sketches and send to BBSketch server
  - clade=t: Generate clade signatures and send to QuickClade server
  - parsetid=t: Extract taxonomy from sequence headers
- Statistics: Calculate mean and standard deviation across all intervals
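The final statistics step can be pictured as a running accumulator that is updated once per interval, so the per-interval values never need to be stored. The sketch below uses Welford's online algorithm; this is a swapped-in illustration, since the document does not say how the tool accumulates its statistics, and the class name `RunningStats` and the choice of population (rather than sample) standard deviation are assumptions.

```python
import math

class RunningStats:
    """Accumulate mean and standard deviation one value at a time (Welford)."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations from the mean

    def add(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def stdev(self):
        # Population standard deviation (assumption; see lead-in)
        return math.sqrt(self.m2 / self.n) if self.n else 0.0

# Hypothetical per-interval values of a single metric
stats = RunningStats()
for value in (0.40, 0.50, 0.60):
    stats.add(value)
```

One accumulator per metric is enough, which keeps the cost of the summary independent of the number of intervals.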
Windowing Behavior
The relationship between window and interval parameters:
- window > interval: Overlapping windows (e.g., window=50000, interval=10000 means each base falls in roughly 5 windows)
- window = interval: Non-overlapping windows (use shred parameter for convenience)
- window = 0: No windowing; analyze entire contigs
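To make the three cases concrete, the sketch below enumerates the spans that would be measured on a single contig. It is a simplified model under stated assumptions, not the tool's logic: the function name `window_spans` is hypothetical, windows are assumed to start at multiples of interval, the last window is assumed to be truncated at the contig end, and minlen is assumed to apply to every span including the whole-contig case.

```python
def window_spans(contig_len, window, interval, minlen=500):
    """Yield half-open (start, end) spans for one contig.

    Assumptions: starts fall on multiples of interval, the final span
    is clipped to the contig end, and spans shorter than minlen are skipped.
    """
    if window <= 0:
        # No windowing: the whole contig is a single span
        if contig_len >= minlen:
            yield (0, contig_len)
        return
    if interval <= 0:
        raise ValueError("interval must be positive when window > 0")
    start = 0
    while start < contig_len:
        end = min(start + window, contig_len)
        if end - start >= minlen:
            yield (start, end)
        if end == contig_len:
            break
        start += interval

overlapping = list(window_spans(100000, 50000, 10000))  # window > interval
shredded = list(window_spans(10300, 5000, 5000))        # window = interval
whole = list(window_spans(3000, 0, 10000))              # window = 0
```

In the overlapping case the ratio window / interval (here 5) is the number of windows that cover a typical interior base.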
Memory Requirements
Default allocation is 800MB, which is sufficient for most analyses. Memory usage scales with:
- Number of concurrent threads (multithreaded processing)
- Window size (larger windows require more buffering)
- Taxonomic features (sketch/clade generation adds overhead)
Output Format
TSV output contains one row per interval with calculated scalar metrics. When header=t, column names are printed. Statistics (mean and standard deviation) for all metrics are printed to stderr for quick assessment of genome-wide characteristics.
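For downstream analysis, the TSV can be read with any tab-aware parser. The sketch below computes per-column means from a file written with header=t. The column names `HH` and `CAGA` and the sample values are made up for illustration, and the helper name `column_means` is hypothetical; check the header line of a real output file for the actual column names and layout.

```python
import csv
import io
import statistics

# Made-up sample standing in for output written with header=t.
# Column names and values are assumptions, not real tool output.
sample = "HH\tCAGA\n0.40\t0.10\n0.50\t0.20\n0.60\t0.30\n"

def column_means(handle):
    """Return {column name: mean} for every numeric column in a TSV."""
    reader = csv.DictReader(handle, delimiter="\t")
    columns = {}
    for row in reader:
        for name, value in row.items():
            try:
                number = float(value)
            except (TypeError, ValueError):
                # Skip non-numeric fields such as contig names (printname=t)
                continue
            columns.setdefault(name, []).append(number)
    return {name: statistics.mean(vals) for name, vals in columns.items()}

means = column_means(io.StringIO(sample))
```

To use it on a real file, pass an open file handle (for example `open("data.tsv", newline="")`) in place of the `io.StringIO` sample.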
Support
Please contact Brian Bushnell at bbushnell@lbl.gov if you encounter any problems.