CloudPlot
Visualizes three compositional metrics (GC, HH, CAGA) as a 2D scatter plot, with the third metric encoded as point rotation and color. Supports both TSV interval data and FASTA input (via ScalarIntervals). Generates PNG images with configurable scaling and point sizes. Useful for analyzing genomic composition patterns, detecting contamination, and visualizing sequence heterogeneity across contigs or genomic windows.
Basic Usage
cloudplot.sh in=<input file> out=<output file>
Example Commands
# TSV input with pre-computed metrics
cloudplot.sh in=data.tsv out=plot.png
# FASTA input with automatic metric calculation
cloudplot.sh in=ecoli.fasta out=plot.png shred=5k
CloudPlot maps two of the metrics to the x and y axes and encodes the third as point rotation and color. The tool accepts either TSV files with pre-computed GC/HH/CAGA values or FASTA/FASTQ sequences, for which metrics are calculated automatically over sliding windows or whole contigs.
Parameters
Parameters control input/output, rendering appearance, taxonomic coloring, decorrelation adjustments, and sequence processing.
Standard Parameters
- in=<file>
- Primary input; TSV (GC/HH/CAGA columns) or FASTA/FASTQ. TSV files should contain tab-delimited compositional metrics. FASTA files will be processed through ScalarIntervals to calculate metrics.
- out=<file>
- Output PNG image file.
Rendering Parameters
- order=caga,hh,gc
- Plotting order of dimensions as x,y,z. Determines which metric is mapped to the horizontal axis (x), the vertical axis (y), and rotation/color (z). Dimensions can be given as names (gc, hh, caga) or indices (0, 1, 2).
- scale=1
- Image scale multiplier (1=1024x768). Increases both dimensions and point sizes proportionally. Use larger values for high-resolution output.
- pointsize=3.5
- Width of plotted points in pixels. Points are rendered as elongated ellipses with rotation based on z-axis value.
- autoscale=t
- Autoscale any dimension whose min or max is set to a negative value, using bounds derived from the data. If false, those dimensions are scaled to 0-1. Autoscaling uses percentile-based bounds to exclude outliers.
- xmin=-1
- X-axis minimum. Negative values trigger autoscaling based on data distribution.
- xmax=-1
- X-axis maximum. Negative values trigger autoscaling based on data distribution.
- ymin=-1
- Y-axis minimum. Negative values trigger autoscaling based on data distribution.
- ymax=-1
- Y-axis maximum. Negative values trigger autoscaling based on data distribution.
- zmin=-1
- Z-axis (rotation/color) minimum. Negative values trigger autoscaling based on data distribution.
- zmax=-1
- Z-axis (rotation/color) maximum. Negative values trigger autoscaling based on data distribution.
- xpct=0.998
- Percentile of x-axis values to use for autoscaling. Excludes extreme outliers from axis range calculation.
- ypct=0.998
- Percentile of y-axis values to use for autoscaling. Excludes extreme outliers from axis range calculation.
- zpct=0.99
- Percentile of z-axis values to use for autoscaling. Excludes extreme outliers from axis range calculation.
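The percentile-based autoscaling described by xpct, ypct, and zpct can be sketched as follows. This is a minimal illustration, not CloudPlot's actual code: the function name `autoscale_bounds` is hypothetical, and it assumes the excluded fraction (1 - pct) is split evenly between the low and high tails.

```python
def autoscale_bounds(values, pct=0.998):
    """Return (lo, hi) covering the central `pct` fraction of `values`.

    Assumption: the excluded fraction is trimmed symmetrically,
    half from each tail of the sorted data.
    """
    if not values:
        raise ValueError("no data points to scale")
    ordered = sorted(values)
    n = len(ordered)
    tail = (1.0 - pct) / 2.0          # fraction dropped from each end
    lo_idx = int(tail * (n - 1))      # index of the lower bound
    hi_idx = (n - 1) - lo_idx         # mirrored index of the upper bound
    return ordered[lo_idx], ordered[hi_idx]
```

With a single extreme value in the data, the returned range stays close to the bulk of the points, so one outlier does not compress the rest of the plot.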
Taxonomy/Coloring Parameters
- colorbytax=f
- Color by taxonomy. Default coloring is by z-axis value (compositional metric). When enabled, points are colored based on taxonomic assignment instead of metric values.
- colorbyname=f
- Color by contig name, so that all points from the same contig share one randomly chosen color. Useful for visualizing within-contig vs between-contig variation.
- level=
- Raise taxonomy to this level before assigning color, e.g. 'level=genus'. Requires a taxonomic tree; see https://sourceforge.net/projects/bbmap/files/Resources/ for taxonomic resources.
- parsetid=f
- Parse TaxIDs from file and sequence headers. Extracts NCBI taxonomy identifiers embedded in FASTA headers.
- sketch=f
- Use BBSketch (SendSketch) to assign taxonomy per contig. Performs k-mer based taxonomic classification.
- clade=f
- Use QuickClade to assign taxonomy per contig. Classifies by comparing k-mer frequency profiles against a reference database.
Decorrelation Parameters
- decorrelate=t
- Modify plotted data to reduce inter-dimension correlation. Improves visual separation of data points by reducing natural correlation between compositional metrics.
- GChh=-0.5
- Correlation between GC and HH. Defines expected correlation direction for decorrelation adjustment.
- GChhs=0.2
- (GChhStrength) Modify HH by -GChhs*GC*GChh. Controls strength of GC-to-HH decorrelation adjustment.
- hhGCs=1.4
- (hhGCStrength) Modify GC by -hhGCs*hh*GChh. Controls strength of HH-to-GC decorrelation adjustment.
- GCcaga=0.1
- Correlation between GC and CAGA. Defines expected correlation direction for decorrelation adjustment.
- GCcagas=0.5
- (GCcagaStrength) Modify CAGA by -GCcagas*GC*GCcaga. Controls strength of GC-to-CAGA decorrelation adjustment.
- cagaGCs=0.0
- (cagaGCStrength) Modify GC by -cagaGCs*caga*GCcaga. Controls strength of CAGA-to-GC decorrelation adjustment.
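The adjustment formulas listed above can be written out as a short sketch. The function name `decorrelate` is hypothetical, the defaults are copied from the parameter list, and it is an assumption here that every adjustment is computed from the original (unadjusted) values rather than applied sequentially.

```python
def decorrelate(gc, hh, caga,
                gchh=-0.5, gchhs=0.2, hhgcs=1.4,
                gccaga=0.1, gccagas=0.5, cagagcs=0.0):
    """Apply the documented linear adjustments to one data point.

    Assumption: all terms use the original gc/hh/caga values.
    """
    new_hh = hh - gchhs * gc * gchh            # Modify HH by -GChhs*GC*GChh
    new_caga = caga - gccagas * gc * gccaga    # Modify CAGA by -GCcagas*GC*GCcaga
    new_gc = (gc
              - hhgcs * hh * gchh              # Modify GC by -hhGCs*hh*GChh
              - cagagcs * caga * gccaga)       # Modify GC by -cagaGCs*caga*GCcaga
    return new_gc, new_hh, new_caga
```

For example, a point with gc=0.5, hh=0.2, caga=0.4 becomes approximately (0.64, 0.25, 0.375) under the default settings. With the default cagaGCs=0.0, the CAGA-to-GC term has no effect.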
Sequence Processing Parameters (FASTA input only)
- window=50000
- If nonzero, calculate metrics over sliding windows. Otherwise calculate per contig. Sliding windows reveal intra-contig compositional variation.
- interval=10000
- Number of bp between consecutive data points (the sliding-window step size). Smaller intervals produce more points but increase processing time.
- shred=-1
- If positive, set window and interval to the same size. Creates non-overlapping tiles of specified size.
- break=t
- Reset metrics at contig boundaries. Prevents windows from spanning multiple contigs.
- minlen=500
- Minimum interval length to generate a point. Filters out short windows that may have unreliable metrics.
- maxreads=-1
- Maximum number of reads/contigs to process. Useful for testing or sampling large datasets.
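To make the interaction of window, interval, shred, and minlen concrete, here is a rough sketch of how window coordinates could be laid out on one contig. It is not the ScalarIntervals implementation: the function name `window_coords` is hypothetical, and it assumes that windows start at position 0, advance by interval, are truncated at the contig end, and never cross contig boundaries (as with break=t).

```python
def window_coords(contig_len, window=50000, interval=10000, minlen=500):
    """Return half-open (start, end) spans for one contig.

    Assumptions: windows start at 0, step by `interval`, and the last
    window is truncated at the contig end. window <= 0 means one span
    covering the whole contig.
    """
    if window <= 0:
        spans = [(0, contig_len)]
    else:
        if interval <= 0:
            raise ValueError("interval must be positive")
        spans = []
        start = 0
        while start < contig_len:
            end = min(start + window, contig_len)
            spans.append((start, end))
            if end == contig_len:      # reached the contig end; stop
                break
            start += interval
    # Drop spans shorter than minlen, mirroring the minlen filter.
    return [(s, e) for s, e in spans if e - s >= minlen]
```

Under these assumptions, shred=10k corresponds to `window_coords(n, window=10000, interval=10000)`, which yields non-overlapping tiles, while window=100k interval=10k yields overlapping spans.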
Java Parameters
- -Xmx
- Sets Java's maximum heap size, overriding autodetection. -Xmx20g specifies 20 GB of RAM, and -Xmx200m specifies 200 MB. The maximum is typically 85% of physical memory. Default: 2g.
- -eoom
- This flag will cause the process to exit if an out-of-memory exception occurs. Requires Java 8u92+.
- -da
- Disable assertions.
Examples
Basic TSV Input
cloudplot.sh in=metrics.tsv out=basic_plot.png
Generate scatter plot from pre-computed compositional metrics in TSV format.
FASTA Input with Shredding
cloudplot.sh in=assembly.fasta out=composition.png shred=5k
Calculate metrics over 5kb non-overlapping windows across all contigs.
High-Resolution with Custom Order
cloudplot.sh in=genome.fasta out=hires.png scale=2 order=gc,caga,hh shred=10k
Generate 2048x1536 image with GC on x-axis, CAGA on y-axis, and HH controlling rotation/color.
Taxonomic Coloring
cloudplot.sh in=metagenome.fasta out=taxa_plot.png sketch=t colorbytax=t level=genus
Color points by genus-level taxonomy using BBSketch for classification.
Sliding Window Analysis
cloudplot.sh in=chromosome.fasta out=sliding.png window=100k interval=10k
Create overlapping 100kb windows with 10kb step size to show fine-scale compositional variation.
Contig-Level Coloring
cloudplot.sh in=contigs.fasta out=by_contig.png colorbyname=t shred=5k
Assign random colors per contig to visualize intra-contig clustering patterns.
Algorithm Details
Compositional Metrics
- GC Content: Fraction of G+C bases (range 0-1)
- HH (Homopolymer): Frequency of homopolymer runs, measuring sequence complexity
- CAGA: Normalized dinucleotide frequency metric sensitive to codon usage and replication patterns
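As an illustration of the simplest of these metrics, GC content for one window can be computed as below. This is a sketch only: the function name `gc_fraction` is hypothetical, and ignoring ambiguous bases such as N is an assumption rather than documented CloudPlot behavior. HH and CAGA are not shown because their exact formulas are not given here.

```python
from collections import Counter

def gc_fraction(seq):
    """Fraction of G+C among the A/C/G/T bases in `seq` (range 0-1).

    Assumption: bases other than A, C, G, T (e.g. N) are ignored.
    """
    counts = Counter(seq.upper())
    gc = counts["G"] + counts["C"]
    acgt = gc + counts["A"] + counts["T"]
    return gc / acgt if acgt else 0.0
```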
Processing Pipeline
- Input Reading:
  - TSV: Direct loading of pre-computed metrics from tab-delimited columns
  - FASTA: Sequence processing through ScalarIntervals with sliding windows or whole-contig calculation
- Decorrelation (if enabled): Apply linear transformations to reduce natural correlation between metrics, improving visual separation
- Autoscaling: Calculate percentile-based bounds for each axis to exclude outliers while preserving data range
- Rendering:
  - Map x,y,z values to pixel coordinates based on axis ranges
  - Draw elongated ellipses with rotation determined by z-value
  - Apply color based on z-value (default) or taxonomic assignment
- Output: Write BufferedImage as PNG file
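The coordinate-mapping step of the pipeline can be sketched as a linear rescale from axis range to pixel position. The function name `to_pixel` is hypothetical. The base size of 1024x768 comes from the scale parameter description, while clamping out-of-range points to the image edge and flipping the y axis (so larger values plot higher) are assumptions.

```python
def to_pixel(x, y, xmin, xmax, ymin, ymax, scale=1, base_w=1024, base_h=768):
    """Map a data point to integer pixel coordinates.

    Assumptions: out-of-range values are clamped to the image edge, and
    the y axis is flipped because image rows grow downward.
    """
    if xmax == xmin or ymax == ymin:
        raise ValueError("axis range must be non-empty")
    width, height = int(base_w * scale), int(base_h * scale)
    fx = min(max((x - xmin) / (xmax - xmin), 0.0), 1.0)
    fy = min(max((y - ymin) / (ymax - ymin), 0.0), 1.0)
    px = round(fx * (width - 1))
    py = round((1.0 - fy) * (height - 1))
    return px, py
```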
Color Encoding
Default color scheme (z-axis metric) uses a spectral gradient:
- 0.0-0.2: Red → Purple
- 0.2-0.4: Purple → Blue
- 0.4-0.6: Blue → Cyan
- 0.6-0.8: Cyan → Green
- 0.8-1.0: Green → Yellow
Taxonomic coloring assigns random but consistent colors based on TaxID hash values.
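Both coloring modes can be illustrated with a short sketch. The names `z_to_rgb` and `taxid_to_rgb` are hypothetical, the RGB anchor values are assumed stand-ins for the named colors, and seeding a random generator with the TaxID is only one possible way to obtain "random but consistent" colors; CloudPlot's actual hash is not shown here.

```python
import random

# Assumed RGB values for the six gradient stops listed above.
ANCHORS = [
    (255, 0, 0),     # red    at z = 0.0
    (128, 0, 128),   # purple at z = 0.2
    (0, 0, 255),     # blue   at z = 0.4
    (0, 255, 255),   # cyan   at z = 0.6
    (0, 255, 0),     # green  at z = 0.8
    (255, 255, 0),   # yellow at z = 1.0
]

def z_to_rgb(z):
    """Linearly interpolate between neighboring anchors for z in [0, 1]."""
    z = min(max(z, 0.0), 1.0)                  # clamp out-of-range values
    pos = z * (len(ANCHORS) - 1)               # position across 5 segments
    i = min(int(pos), len(ANCHORS) - 2)        # segment index
    t = pos - i                                # fraction within the segment
    a, b = ANCHORS[i], ANCHORS[i + 1]
    return tuple(round(ca + (cb - ca) * t) for ca, cb in zip(a, b))

def taxid_to_rgb(tax_id):
    """Derive a repeatable pseudo-random color from an integer TaxID."""
    rng = random.Random(tax_id)                # same seed gives same color
    return tuple(rng.randrange(64, 256) for _ in range(3))
```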
Point Rendering
Points are rendered as elongated ellipses (aspect ratio ~4:1) rotated based on the z-axis value (0 to 2π radians). Point length increases slightly with y-axis position to create depth perception. This encoding allows visualization of three dimensions on a 2D plot.
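The geometry of a single point can be sketched as follows. The function name `ellipse_geometry` is hypothetical. The ~4:1 aspect ratio and the 0 to 2π rotation range come from the description above, while the `depth_gain` factor controlling how much length grows with y is an assumed value chosen purely for illustration.

```python
import math

def ellipse_geometry(z, y, pointsize=3.5, aspect=4.0, depth_gain=0.25):
    """Return (angle, length, width) for one plotted point.

    z and y are assumed to be already scaled to the range 0-1.
    depth_gain is a hypothetical parameter, not a CloudPlot option.
    """
    angle = z * 2.0 * math.pi                            # rotation in radians
    width = pointsize                                    # short axis, in pixels
    length = pointsize * aspect * (1.0 + depth_gain * y) # long axis grows with y
    return angle, length, width
```

For z=0.5 and y=0 this gives a rotation of π radians and a 14 x 3.5 pixel ellipse at the default point size.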
Memory Requirements
Memory usage is proportional to the number of data points. For FASTA input with sliding windows, the point count is roughly the total sequence length divided by the interval, so smaller intervals (greater window overlap) require more memory. The default 2 GB allocation is sufficient for most genomes with typical window settings. Increase it for very large metagenomes or for small intervals that generate millions of points.
Use Cases
- Contamination Detection: Distinct compositional clusters may indicate contaminating sequences
- Metagenomic Binning: Visual assessment of sequence heterogeneity for bin refinement
- Horizontal Gene Transfer: Atypical composition patterns within genomes
- Assembly Quality: Compositional consistency across contigs
- Phylogenetic Diversity: Taxonomic coloring reveals compositional space occupied by different lineages
Support
Please contact Brian Bushnell at bbushnell@lbl.gov if you encounter any problems.