CloudPlot

Script: cloudplot.sh
Package: scalar
Class: CloudPlot.java

Visualizes 3D compositional metrics (GC, HH, CAGA) as 2D scatter plots. Supports both TSV interval data and FASTA input (via ScalarIntervals). Generates PNG images with configurable scaling and point sizes. Useful for analyzing genomic composition patterns, detecting contamination, and visualizing sequence heterogeneity across contigs or genomic windows.

Basic Usage

cloudplot.sh in=<input file> out=<output file>

Example Commands

# TSV input with pre-computed metrics
cloudplot.sh in=data.tsv out=plot.png

# FASTA input with automatic metric calculation
cloudplot.sh in=ecoli.fasta out=plot.png shred=5k

CloudPlot creates scatter plots in which two compositional metrics set each point's x and y position and the third sets its rotation and color. The tool accepts either TSV files with pre-computed GC/HH/CAGA values or FASTA/FASTQ sequences, for which metrics are calculated automatically over sliding windows or whole contigs.

Parameters

Parameters control input/output, rendering appearance, taxonomic coloring, decorrelation adjustments, and sequence processing.

Standard Parameters

in=<file>: Primary input; TSV (GC/HH/CAGA columns) or FASTA/FASTQ. TSV files should contain tab-delimited compositional metrics. FASTA/FASTQ input is processed through ScalarIntervals to calculate the metrics.
out=<file>: Output PNG image file.

Rendering Parameters

order=caga,hh,gc: Plotting order of dimensions as x,y,z. Determines which metric is mapped to horizontal axis (x), vertical axis (y), and rotation/color (z). Can specify as dimension names (gc, hh, caga) or indices (0, 1, 2).
scale=1: Image scale multiplier (1=1024x768). Increases both dimensions and point sizes proportionally. Use larger values for high-resolution output.
pointsize=3.5: Width of plotted points in pixels. Points are rendered as elongated ellipses with rotation based on z-axis value.
autoscale=t: Autoscale, based on the data, any dimension whose min or max is set to a negative value. If false, those dimensions are scaled to 0-1 instead. Autoscaling uses percentile-based bounds to exclude outliers.
xmin=-1: X-axis minimum. Negative values trigger autoscaling based on data distribution.
xmax=-1: X-axis maximum. Negative values trigger autoscaling based on data distribution.
ymin=-1: Y-axis minimum. Negative values trigger autoscaling based on data distribution.
ymax=-1: Y-axis maximum. Negative values trigger autoscaling based on data distribution.
zmin=-1: Z-axis (rotation/color) minimum. Negative values trigger autoscaling based on data distribution.
zmax=-1: Z-axis (rotation/color) maximum. Negative values trigger autoscaling based on data distribution.
xpct=0.998: Percentile of x-axis values to use for autoscaling. Excludes extreme outliers from axis range calculation.
ypct=0.998: Percentile of y-axis values to use for autoscaling. Excludes extreme outliers from axis range calculation.
zpct=0.99: Percentile of z-axis values to use for autoscaling. Excludes extreme outliers from axis range calculation.
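
The interaction between autoscale, the min/max bounds, and the percentile settings can be illustrated with a minimal Python sketch. It is not CloudPlot's code: the function names are hypothetical, and it assumes that trimming is symmetric (the same fraction is dropped from each tail), which the parameter descriptions do not state.

```python
def percentile_bounds(values, pct=0.998):
    """Return (low, high) covering the central mass of `values`.

    Assumption: trimming is symmetric, so the fraction (1 - pct) of points
    is ignored at each tail. CloudPlot's actual rule may differ.
    """
    if not values:
        raise ValueError("need at least one value")
    ordered = sorted(values)
    last = len(ordered) - 1
    hi_idx = min(last, int(round(pct * last)))
    lo_idx = last - hi_idx
    return ordered[lo_idx], ordered[hi_idx]


def axis_range(values, vmin=-1.0, vmax=-1.0, pct=0.998, autoscale=True):
    """Resolve one axis: explicit non-negative bounds win, negatives fall back."""
    if autoscale:
        auto_lo, auto_hi = percentile_bounds(values, pct)
    else:
        auto_lo, auto_hi = 0.0, 1.0  # documented fallback when autoscale=f
    low = vmin if vmin >= 0 else auto_lo
    high = vmax if vmax >= 0 else auto_hi
    return low, high
```

For example, axis_range(gc_values, vmin=0.2) would pin the lower bound at 0.2 and still derive the upper bound from the data.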

Taxonomy/Coloring Parameters

colorbytax=f: Color by taxonomy. Default coloring is by z-axis value (compositional metric). When enabled, points are colored based on taxonomic assignment instead of metric values.
colorbyname=f: Color by contig name, so all points from the same contig share one randomly chosen color. Useful for visualizing within-contig vs between-contig variation.
level=: Raise taxonomy to this level before assigning color, e.g. 'level=genus'. Requires a taxonomic tree; see https://sourceforge.net/projects/bbmap/files/Resources/ for taxonomic resources.
parsetid=f: Parse TaxIDs from file and sequence headers. Extracts NCBI taxonomy identifiers embedded in FASTA headers.
sketch=f: Use BBSketch (SendSketch) to assign taxonomy per contig. Performs k-mer based taxonomic classification.
clade=f: Use QuickClade to assign taxonomy per contig. QuickClade classifies by comparing k-mer frequencies against reference clades rather than by alignment.

Decorrelation Parameters

decorrelate=t: Modify plotted data to reduce inter-dimension correlation. Improves visual separation of data points by reducing natural correlation between compositional metrics.
GChh=-0.5: Correlation between GC and HH. Defines expected correlation direction for decorrelation adjustment.
GChhs=0.2: (GChhStrength) Modify HH by -GChhs*GC*GChh. Controls strength of GC-to-HH decorrelation adjustment.
hhGCs=1.4: (hhGCStrength) Modify GC by -hhGCs*hh*GChh. Controls strength of HH-to-GC decorrelation adjustment.
GCcaga=0.1: Correlation between GC and CAGA. Defines expected correlation direction for decorrelation adjustment.
GCcagas=0.5: (GCcagaStrength) Modify CAGA by -GCcagas*GC*GCcaga. Controls strength of GC-to-CAGA decorrelation adjustment.
cagaGCs=0.0: (cagaGCStrength) Modify GC by -cagaGCs*caga*GCcaga. Controls strength of CAGA-to-GC decorrelation adjustment.
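
The four adjustments listed above can be written out as a short Python sketch using the documented defaults. The function and argument names are hypothetical, and the sketch assumes every adjustment reads the original, unmodified values; the order in which CloudPlot actually applies them is not stated here.

```python
def decorrelate(gc, hh, caga,
                gc_hh=-0.5, gc_hh_s=0.2, hh_gc_s=1.4,
                gc_caga=0.1, gc_caga_s=0.5, caga_gc_s=0.0):
    """Apply the documented adjustments to one (GC, HH, CAGA) point.

    Assumption: each adjustment uses the original inputs, not values
    already modified by another adjustment.
    """
    new_hh = hh - gc_hh_s * gc * gc_hh          # Modify HH by -GChhs*GC*GChh
    new_caga = caga - gc_caga_s * gc * gc_caga  # Modify CAGA by -GCcagas*GC*GCcaga
    new_gc = (gc
              - hh_gc_s * hh * gc_hh            # Modify GC by -hhGCs*hh*GChh
              - caga_gc_s * caga * gc_caga)     # Modify GC by -cagaGCs*caga*GCcaga
    return new_gc, new_hh, new_caga
```

With the defaults, a point at GC=0.5, HH=0.2, CAGA=0.4 moves to roughly GC=0.64, HH=0.25, CAGA=0.375. Because GChh is negative, the HH and GC adjustments are additive.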

Sequence Processing Parameters (FASTA input only)

window=50000: If nonzero, calculate metrics over sliding windows. Otherwise calculate per contig. Sliding windows reveal intra-contig compositional variation.
interval=10000: Generate one data point every 'interval' bp; this is the sliding-window step size. Smaller intervals produce more points but increase processing time.
shred=-1: If positive, set window and interval to the same size. Creates non-overlapping tiles of specified size.
break=t: Reset metrics at contig boundaries. Prevents windows from spanning multiple contigs.
minlen=500: Minimum interval length to generate a point. Filters out short windows that may have unreliable metrics.
maxreads=-1: Maximum number of reads/contigs to process. Useful for testing or sampling large datasets.
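
To make the relationship between window, interval, shred, and minlen concrete, here is a small Python sketch that lists the spans a single contig would be cut into. The helper names are hypothetical, and the exact span boundaries are assumptions: a span starts every interval bp, is clipped at the contig end (as with break=t), and is dropped when shorter than minlen. ScalarIntervals may place its windows differently.

```python
def resolve_window(window=50000, interval=10000, shred=-1):
    """A positive shred overrides both settings, giving non-overlapping tiles."""
    if shred > 0:
        return shred, shred
    return window, interval


def window_spans(contig_len, window, interval, minlen=500):
    """List (start, end) spans for one contig, never crossing its end.

    Assumptions: spans start every `interval` bp, are clipped at the contig
    end, and are skipped when shorter than `minlen`; window=0 means one
    span covering the whole contig.
    """
    if window <= 0:
        return [(0, contig_len)] if contig_len >= minlen else []
    spans = []
    for start in range(0, contig_len, interval):
        end = min(start + window, contig_len)
        if end - start >= minlen:
            spans.append((start, end))
    return spans
```

Under these assumptions, a 12,000 bp contig with shred=5000 yields three tiles, the last one 2,000 bp long, while a 10,200 bp contig yields only two because its 200 bp remainder falls below minlen.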

Java Parameters

-Xmx: This will set Java's memory usage, overriding autodetection. -Xmx20g will specify 20 gigs of RAM, and -Xmx200m will specify 200 megs. The max is typically 85% of physical memory. Default: 2g.
-eoom: This flag will cause the process to exit if an out-of-memory exception occurs. Requires Java 8u92+.
-da: Disable assertions.

Examples

Basic TSV Input

cloudplot.sh in=metrics.tsv out=basic_plot.png

Generate scatter plot from pre-computed compositional metrics in TSV format.

FASTA Input with Shredding

cloudplot.sh in=assembly.fasta out=composition.png shred=5k

Calculate metrics over 5kb non-overlapping windows across all contigs.

High-Resolution with Custom Order

cloudplot.sh in=genome.fasta out=hires.png scale=2 order=gc,caga,hh shred=10k

Generate 2048x1536 image with GC on x-axis, CAGA on y-axis, and HH controlling rotation/color.

Taxonomic Coloring

cloudplot.sh in=metagenome.fasta out=taxa_plot.png sketch=t colorbytax=t level=genus

Color points by genus-level taxonomy using BBSketch for classification.

Sliding Window Analysis

cloudplot.sh in=chromosome.fasta out=sliding.png window=100k interval=10k

Create overlapping 100kb windows with 10kb step size to show fine-scale compositional variation.

Contig-Level Coloring

cloudplot.sh in=contigs.fasta out=by_contig.png colorbyname=t shred=5k

Assign random colors per contig to visualize intra-contig clustering patterns.

Algorithm Details

Compositional Metrics

GC Content: Fraction of G+C bases (range 0-1)
HH (Homopolymer): Frequency of homopolymer runs, measuring sequence complexity
CAGA: Normalized dinucleotide frequency metric sensitive to codon usage and replication patterns

Processing Pipeline

1. Input Reading:
   - TSV: Direct loading of pre-computed metrics from tab-delimited columns
   - FASTA: Sequence processing through ScalarIntervals with sliding windows or whole-contig calculation
2. Decorrelation (if enabled): Apply linear transformations to reduce natural correlation between metrics, improving visual separation
3. Autoscaling: Calculate percentile-based bounds for each axis to exclude outliers while preserving data range
4. Rendering:
   - Map x,y,z values to pixel coordinates based on axis ranges
   - Draw elongated ellipses with rotation determined by z-value
   - Apply color based on z-value (default) or taxonomic assignment
5. Output: Write BufferedImage as PNG file
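
The mapping from axis values to pixel coordinates in the rendering step can be sketched in Python as follows. The function names are hypothetical, the canvas size follows the documented 1024x768 base multiplied by scale, and flipping the row so that larger y values appear nearer the top is an assumption about plot orientation.

```python
def to_pixel(value, low, high, size):
    """Map a value in [low, high] to a pixel index in [0, size - 1], clamped."""
    if high <= low:
        return 0
    frac = (value - low) / (high - low)
    frac = max(0.0, min(1.0, frac))  # outliers beyond the axis range are clamped
    return int(round(frac * (size - 1)))


def point_to_xy(x, y, x_range, y_range, scale=1):
    """Return (col, row) for one point on a (1024*scale) x (768*scale) canvas.

    Assumption: larger y values are drawn nearer the top, so the row is flipped.
    """
    width, height = 1024 * scale, 768 * scale
    col = to_pixel(x, x_range[0], x_range[1], width)
    row = (height - 1) - to_pixel(y, y_range[0], y_range[1], height)
    return col, row
```

Clamping is one simple way to handle points that fall outside percentile-based bounds; whether CloudPlot clamps or omits such points is not specified in this document.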

Color Encoding

Default color scheme (z-axis metric) maps the scaled z value (0.0-1.0) onto a spectral gradient:

0.0-0.2: Red → Purple
0.2-0.4: Purple → Blue
0.4-0.6: Blue → Cyan
0.6-0.8: Cyan → Green
0.8-1.0: Green → Yellow

Taxonomic coloring assigns random but consistent colors based on TaxID hash values.
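
The default gradient above amounts to piecewise-linear interpolation between six color stops. The Python sketch below illustrates that idea; the exact RGB values, in particular the shade of purple, are assumptions, and the names are hypothetical.

```python
# Color stops at z = 0.0, 0.2, 0.4, 0.6, 0.8, 1.0 (RGB values are assumed).
STOPS = [
    (255, 0, 0),    # red
    (160, 0, 255),  # purple (assumed shade)
    (0, 0, 255),    # blue
    (0, 255, 255),  # cyan
    (0, 255, 0),    # green
    (255, 255, 0),  # yellow
]


def gradient_color(z):
    """Interpolate an (r, g, b) color for a scaled z value in 0-1."""
    z = max(0.0, min(1.0, z))
    pos = z * (len(STOPS) - 1)           # position measured in segments
    idx = min(int(pos), len(STOPS) - 2)  # index of the segment's left stop
    frac = pos - idx
    left, right = STOPS[idx], STOPS[idx + 1]
    return tuple(int(round(a + (b - a) * frac)) for a, b in zip(left, right))
```

Values outside 0-1 are clamped here, so they take the end colors.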

Point Rendering

Points are rendered as elongated ellipses (aspect ratio ~4:1) rotated based on the z-axis value (0 to 2π radians). Point length increases slightly with y-axis position to create depth perception. This encoding allows visualization of three dimensions on a 2D plot.
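
As a rough illustration of this encoding, the following Python sketch computes the width, length, and rotation of one ellipse. The function name is hypothetical and depth_gain is an invented placeholder, since the text says only that length increases "slightly" with y; the 4:1 aspect ratio and the 0 to 2π rotation range come from the description above.

```python
import math


def ellipse_geometry(z_norm, y_norm, pointsize=3.5, scale=1,
                     aspect=4.0, depth_gain=0.25):
    """Return (width_px, length_px, angle_rad) for one plotted point.

    z_norm and y_norm are axis values already scaled to 0-1.
    depth_gain is a hypothetical factor controlling how much the
    ellipse lengthens with y.
    """
    width = pointsize * scale                            # pointsize is the ellipse width
    length = width * aspect * (1.0 + depth_gain * y_norm)
    angle = 2.0 * math.pi * z_norm                       # z maps linearly onto 0 to 2*pi
    return width, length, angle
```

With the defaults, a point at z_norm=0.5 and y_norm=0 would be 3.5 px wide, 14 px long, and rotated by π radians.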

Memory Requirements

Memory usage is proportional to the number of data points. For FASTA input with sliding windows, the point count is roughly the total sequence length divided by interval, so it grows as interval shrinks. The default 2 GB allocation is sufficient for most genomes with typical window settings. Increase -Xmx for very large metagenomes or for small intervals that generate millions of points.

Use Cases

Contamination Detection: Distinct compositional clusters may indicate contaminating sequences
Metagenomic Binning: Visual assessment of sequence heterogeneity for bin refinement
Horizontal Gene Transfer: Atypical composition patterns within genomes
Assembly Quality: Compositional consistency across contigs
Phylogenetic Diversity: Taxonomic coloring reveals compositional space occupied by different lineages

Support

Please contact Brian Bushnell at bbushnell@lbl.gov if you encounter any problems.