CloudPlot

Script: cloudplot.sh
Package: scalar
Class: CloudPlot.java

Visualizes 5D compositional metrics (GC, HH, CAGA + size and color dimensions) as 2D scatter plots. Supports both TSV interval data and FASTA input (via ScalarIntervals). Generates PNG images with configurable scaling and point sizes. Useful for analyzing genomic composition patterns, detecting contamination, and visualizing sequence heterogeneity across contigs or genomic windows.

Basic Usage

cloudplot.sh in=<input file> out=<output file>

Example Commands

# TSV input with pre-computed metrics
cloudplot.sh in=data.tsv out=plot.png

# FASTA input with automatic metric calculation
cloudplot.sh in=ecoli.fasta out=plot.png shred=5k

CloudPlot creates scatter plot visualizations where compositional metrics are mapped to 2D space with color encoding. The tool accepts either TSV files with pre-computed GC/HH/CAGA values or FASTA/FASTQ sequences for automatic metric calculation over sliding windows or whole contigs.

Parameters

Parameters control input/output, rendering appearance, taxonomic coloring, decorrelation adjustments, and sequence processing.

Standard Parameters

in=<file>
Primary input; TSV (GC/HH/CAGA columns) or FASTA/FASTQ. TSV files should contain tab-delimited compositional metrics. FASTA files will be processed through ScalarIntervals to calculate metrics.
out=<file>
Output PNG image file.

Coverage/Depth Parameters

cov=<file>
Coverage file from pileup.sh or covmaker.sh. Used to color or size points by sequencing depth.
coverage=<file>
Alias for cov parameter. Coverage file for depth-based coloring or sizing.
covfile=<file>
Alias for cov parameter. Coverage file for depth-based coloring or sizing.
depth=<file>
SAM/BAM file for depth calculation. Calculates coverage from alignment data for coloring or sizing points.
depthfile=<file>
Alias for depth parameter. SAM/BAM file for depth calculation.
sam=<file>
Alias for depth parameter. SAM file for depth-based point sizing or coloring.
bam=<file>
Alias for depth parameter. BAM file for depth-based point sizing or coloring.

Rendering Parameters

order=hh,caga,gc
Plotting order of dimensions as x,y,z. Determines which metric is mapped to horizontal axis (x), vertical axis (y), and rotation/color (z). Can specify as dimension names (gc, hh, caga) or indices (0, 1, 2). Default: HH on x-axis, CAGA on y-axis, GC controlling rotation/color.
scale=1
Image scale multiplier (1=1024x768). Increases both dimensions and point sizes proportionally. Use larger values for high-resolution output.
pointsize=3.5
Width of plotted points in pixels. Points are rendered as elongated ellipses with rotation based on z-axis value.

Size Parameters

smin=<float>
Minimum point size. Default: 0.8 * pointsize. Smallest points will not shrink below this size.
smax=<float>
Maximum point size. Default: 3.0 * pointsize. Largest points will not grow beyond this size.
spct=0.998
Size percentile for autoscaling (0-1). Excludes extreme size values from scaling range.
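
The size defaults above can be reproduced with a few lines of arithmetic. The following is a minimal Python sketch, not CloudPlot's own code: size_bounds and point_size are hypothetical helper names, and the linear interpolation between smin and smax is an assumption, since the exact mapping from a size value to a point size is not documented here.

```python
def size_bounds(pointsize=3.5, smin=None, smax=None):
    """Return (smin, smax), falling back to the documented defaults."""
    if smin is None:
        smin = 0.8 * pointsize   # documented default for smin
    if smax is None:
        smax = 3.0 * pointsize   # documented default for smax
    return smin, smax


def point_size(t, pointsize=3.5, smin=None, smax=None):
    """Map a normalized size value t (0-1) to a point size.

    Assumption: linear interpolation between smin and smax,
    with t clamped so sizes never leave the [smin, smax] range.
    """
    lo, hi = size_bounds(pointsize, smin, smax)
    t = min(1.0, max(0.0, t))
    return lo + t * (hi - lo)
```

With the default pointsize=3.5, the bounds work out to roughly 2.8 and 10.5 pixels.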

Autoscaling Parameters

autoscale=t
Autoscale any dimension whose min/max is set to a negative value, using bounds derived from the data. If false, those dimensions are scaled to 0-1 instead. Autoscaling uses percentile-based bounds to exclude outliers.
xmin=-1
X-axis minimum. Negative values trigger autoscaling based on data distribution.
xmax=-1
X-axis maximum. Negative values trigger autoscaling based on data distribution.
ymin=-1
Y-axis minimum. Negative values trigger autoscaling based on data distribution.
ymax=-1
Y-axis maximum. Negative values trigger autoscaling based on data distribution.
zmin=-1
Z-axis (rotation/color) minimum. Negative values trigger autoscaling based on data distribution.
zmax=-1
Z-axis (rotation/color) maximum. Negative values trigger autoscaling based on data distribution.
xpct=0.998
Percentile of x-axis values to use for autoscaling. Excludes extreme outliers from axis range calculation.
ypct=0.998
Percentile of y-axis values to use for autoscaling. Excludes extreme outliers from axis range calculation.
zpct=0.99
Percentile of z-axis values to use for autoscaling. Excludes extreme outliers from axis range calculation.
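
To illustrate how a percentile setting such as xpct can keep a few outliers from stretching an axis, here is a minimal Python sketch. It is an illustration under stated assumptions, not CloudPlot's implementation: percentile_bounds and normalize are hypothetical names, and splitting the excluded fraction evenly between the low and high tails is an assumption.

```python
def percentile_bounds(values, pct=0.998):
    """Return (lo, hi) spanning roughly the central `pct` fraction of values.

    Assumption: the excluded fraction (1 - pct) is split evenly
    between the low tail and the high tail.
    """
    data = sorted(values)
    n = len(data)
    if n == 0:
        raise ValueError("no values to scale")
    tail = (1.0 - pct) / 2.0
    lo_idx = int(round(tail * (n - 1)))
    hi_idx = (n - 1) - lo_idx
    return data[lo_idx], data[hi_idx]


def normalize(value, lo, hi):
    """Scale value into 0-1 using the bounds; out-of-range values are clamped."""
    if hi <= lo:
        return 0.5
    return min(1.0, max(0.0, (value - lo) / (hi - lo)))
```

Values outside the computed bounds are still plotted in this sketch; they are simply pinned to the edge of the range.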

Color Parameters

colorby=taxonomy
Metric used for coloring points. Options: gc, hh, caga, depth, length, taxonomy. Default: taxonomy.
cpct=0.98
Percentile for color autoscaling (0-1). Excludes extreme color values from scaling range.

Taxonomy/Coloring Parameters

colorbytax=f
Color by taxonomy. Default coloring is by z-axis value (compositional metric). When enabled, points are colored based on taxonomic assignment instead of metric values.
colorbyname=f
Color by contig name, so points on the same contig have the same, random color. Useful for visualizing within-contig vs between-contig variation.
level=
Raise taxonomy to this level before assigning color, e.g. 'level=genus'. Requires a taxonomic tree; see https://sourceforge.net/projects/bbmap/files/Resources/ for taxonomic resources.
parsetid=f
Parse TaxIDs from file and sequence headers. Extracts NCBI taxonomy identifiers embedded in FASTA headers.
sketch=f
Use BBSketch (SendSketch) to assign taxonomy per contig. Performs k-mer based taxonomic classification.
clade=f
Use QuickClade to assign taxonomy per contig. Performs alignment-based phylogenetic placement.

Decorrelation Parameters

decorrelate=t
Modify plotted data to reduce inter-dimension correlation. Improves visual separation of data points by reducing natural correlation between compositional metrics.
GChh=-0.5
Correlation between GC and HH. Defines expected correlation direction for decorrelation adjustment.
GChhs=0.2
(GChhStrength) Modify HH by -GChhs*GC*GChh. Controls strength of GC-to-HH decorrelation adjustment.
hhGCs=1.4
(hhGCStrength) Modify GC by -hhGCs*hh*GChh. Controls strength of HH-to-GC decorrelation adjustment.
GCcaga=0.1
Correlation between GC and CAGA. Defines expected correlation direction for decorrelation adjustment.
GCcagas=0.5
(GCcagaStrength) Modify CAGA by -GCcagas*GC*GCcaga. Controls strength of GC-to-CAGA decorrelation adjustment.
cagaGCs=0.0
(cagaGCStrength) Modify GC by -cagaGCs*caga*GCcaga. Controls strength of CAGA-to-GC decorrelation adjustment.
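
The four adjustment formulas above can be written out directly. This Python sketch applies them to a single (GC, HH, CAGA) point using the documented default values. The function name decorrelate is hypothetical, and two details are assumptions: that every adjustment reads the original, unadjusted inputs, and that the metrics are used as-is without centering.

```python
def decorrelate(gc, hh, caga,
                gchh=-0.5, gchhs=0.2, hhgcs=1.4,
                gccaga=0.1, gccagas=0.5, cagagcs=0.0):
    """Apply the documented adjustments to one (gc, hh, caga) point.

    Assumption: all four adjustments use the original input values,
    not values already modified by an earlier adjustment.
    """
    hh_adj = hh - gchhs * gc * gchh            # Modify HH by -GChhs*GC*GChh
    caga_adj = caga - gccagas * gc * gccaga    # Modify CAGA by -GCcagas*GC*GCcaga
    gc_adj = (gc
              - hhgcs * hh * gchh              # Modify GC by -hhGCs*hh*GChh
              - cagagcs * caga * gccaga)       # Modify GC by -cagaGCs*caga*GCcaga
    return gc_adj, hh_adj, caga_adj
```

For example, with the defaults a point at GC=0.5, HH=0.4, CAGA=0.2 moves to approximately GC=0.78, HH=0.45, CAGA=0.175 in this sketch.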

Log Transform Parameters

logoffset=0.25
Offset added before log-transforming depth/length values. Prevents log(0) errors for zero-valued data points.
logshift=2.0
Shift applied during log transformation. Controls the scale of log-transformed values.
logpower=2.0
Power exponent for log transformation. Applies power function to control compression strength of log scaling.

Sequence Processing Parameters (FASTA input only)

window=50000
If nonzero, calculate metrics over sliding windows. Otherwise calculate per contig. Sliding windows reveal intra-contig compositional variation.
interval=10000
Generate a data point every this many bp. Controls sliding window step size. Smaller intervals produce more points but increase processing time.
shred=-1
If positive, set window and interval to this same size, creating non-overlapping tiles of the specified size.
break=t
Reset metrics at contig boundaries. Prevents windows from spanning multiple contigs.
minlen=500
Minimum interval length to generate a point. Filters out short windows that may have unreliable metrics.
maxreads=-1
Maximum number of reads/contigs to process. Useful for testing or sampling large datasets.
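
The interaction between window, interval, shred, and minlen can be pictured by listing the spans that would be measured on one contig. The Python sketch below is only an approximation of that idea: window_spans is a hypothetical helper, and the placement of windows (starting at multiples of interval and clipped at the contig end) is an assumption rather than the documented behavior of ScalarIntervals.

```python
def window_spans(contig_len, window=50000, interval=10000, shred=-1, minlen=500):
    """List (start, end) half-open spans for one contig.

    Assumptions: windows start at multiples of `interval`, the final
    window is clipped at the contig end, and spans shorter than
    `minlen` are dropped.
    """
    if shred > 0:
        window = interval = shred          # shred sets both to the same size
    if window <= 0:
        # No windowing: one point per contig, if it is long enough.
        return [(0, contig_len)] if contig_len >= minlen else []
    spans = []
    start = 0
    while start < contig_len:
        end = min(start + window, contig_len)
        if end - start >= minlen:
            spans.append((start, end))
        if end == contig_len:
            break
        start += interval
    return spans
```

Under these assumptions, shred=5000 on a 12 kbp contig yields three tiles, while window=50000 interval=10000 on a 100 kbp contig yields six overlapping windows.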

Java Parameters

-Xmx
This will set Java's memory usage, overriding autodetection. -Xmx20g will specify 20 gigs of RAM, and -Xmx200m will specify 200 megs. The max is typically 85% of physical memory. Default: 2g.
-eoom
This flag will cause the process to exit if an out-of-memory exception occurs. Requires Java 8u92+.
-da
Disable assertions.

Examples

Basic TSV Input

cloudplot.sh in=metrics.tsv out=basic_plot.png

Generate scatter plot from pre-computed compositional metrics in TSV format.

FASTA Input with Shredding

cloudplot.sh in=assembly.fasta out=composition.png shred=5k

Calculate metrics over 5kb non-overlapping windows across all contigs.

High-Resolution with Custom Order

cloudplot.sh in=genome.fasta out=hires.png scale=2 order=gc,caga,hh shred=10k

Generate 2048x1536 image with GC on x-axis, CAGA on y-axis, and HH controlling rotation/color.

Taxonomic Coloring

cloudplot.sh in=metagenome.fasta out=taxa_plot.png sketch=t colorbytax=t level=genus

Color points by genus-level taxonomy using BBSketch for classification.

Sliding Window Analysis

cloudplot.sh in=chromosome.fasta out=sliding.png window=100k interval=10k

Create overlapping 100kb windows with 10kb step size to show fine-scale compositional variation.

Contig-Level Coloring

cloudplot.sh in=contigs.fasta out=by_contig.png colorbyname=t shred=5k

Assign random colors per contig to visualize intra-contig clustering patterns.

Algorithm Details

Compositional Metrics

Processing Pipeline

  1. Input Reading:
    • TSV: Direct loading of pre-computed metrics from tab-delimited columns
    • FASTA: Sequence processing through ScalarIntervals with sliding windows or whole-contig calculation
  2. Decorrelation (if enabled): Apply linear transformations to reduce natural correlation between metrics, improving visual separation
  3. Autoscaling: Calculate percentile-based bounds for each axis to exclude outliers while preserving data range
  4. Rendering:
    • Map x,y,z values to pixel coordinates based on axis ranges
    • Draw elongated ellipses with rotation determined by z-value
    • Apply color based on z-value (default) or taxonomic assignment
  5. Output: Write BufferedImage as PNG file
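
The coordinate-mapping part of the rendering step can be sketched as follows. This is a simplified Python illustration, not the actual renderer: to_pixel is a hypothetical name, the base canvas of 1024x768 comes from the scale parameter description, and the absence of margins plus the flipped y-axis (so larger values appear higher in the image) are assumptions.

```python
def to_pixel(x, y, xmin, xmax, ymin, ymax, scale=1):
    """Map data coordinates to pixel coordinates on a scaled canvas.

    Assumptions: no plot margins, and image row 0 is the top,
    so the y value is flipped to put larger values higher up.
    """
    width = int(round(1024 * scale))
    height = int(round(768 * scale))
    # Normalize into 0-1 and clamp anything outside the axis range.
    fx = min(1.0, max(0.0, (x - xmin) / (xmax - xmin)))
    fy = min(1.0, max(0.0, (y - ymin) / (ymax - ymin)))
    px = int(round(fx * (width - 1)))
    py = int(round((1.0 - fy) * (height - 1)))
    return px, py
```

With scale=2 the canvas in this sketch is 2048x1536, so the top-right data corner maps to pixel (2047, 0).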

Color Encoding

Default color scheme (z-axis metric) uses a spectral gradient.

Taxonomic coloring assigns random but consistent colors based on TaxID hash values.
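
The idea of "random but consistent" colors can be shown with a small hash-based mapping. The Python sketch below is an illustration only: color_for_taxid is a hypothetical name, and the use of SHA-256 is an assumption chosen for reproducibility, not the hash CloudPlot uses.

```python
import hashlib


def color_for_taxid(taxid):
    """Return an (r, g, b) tuple derived from a TaxID.

    The same TaxID always yields the same color, while different
    TaxIDs get effectively arbitrary colors.
    Assumption: SHA-256 stands in for whatever hash the tool uses.
    """
    digest = hashlib.sha256(str(taxid).encode("ascii")).digest()
    return digest[0], digest[1], digest[2]
```

Because the color depends only on the TaxID, every window from the same taxon shares one color across runs.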

Point Rendering

Points are rendered as elongated ellipses (aspect ratio ~4:1) rotated based on the z-axis value (0 to 2π radians). Point length increases slightly with y-axis position to create depth perception. This encoding allows visualization of three dimensions on a 2D plot.
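
The rotation encoding can be made concrete with a short calculation. This Python sketch assumes the z value has already been normalized to 0-1 and maps it linearly onto 0 to 2π; ellipse_params is a hypothetical name, the 4:1 aspect ratio is the approximate figure given above, and the slight length increase with y position is left out for simplicity.

```python
import math


def ellipse_params(z_norm, width=3.5, aspect=4.0):
    """Return (angle, length, (dx, dy)) for one plotted point.

    angle: rotation in radians, 0 to 2*pi, linear in z_norm (assumption).
    length: long-axis length, width times the approximate 4:1 aspect ratio.
    (dx, dy): offset from the point center to one end of the long axis.
    """
    z = min(1.0, max(0.0, z_norm))
    angle = 2.0 * math.pi * z
    length = width * aspect
    dx = 0.5 * length * math.cos(angle)
    dy = 0.5 * length * math.sin(angle)
    return angle, length, (dx, dy)
```

With the default pointsize of 3.5, the long axis in this sketch is 14 pixels, and a normalized z of 0.25 turns the ellipse a quarter of a full rotation.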

Memory Requirements

Memory usage is proportional to the number of data points. For FASTA input with sliding windows, the point count is driven by the interval (step size): one point is generated every interval bp, so halving the interval roughly doubles the number of points. The default 2GB allocation is sufficient for most genomes with typical window settings. Increase it for very large metagenomes or for small intervals that generate millions of points.

Use Cases

Support

Please contact Brian Bushnell at bbushnell@lbl.gov if you encounter any problems.