PlotGC

Basic Usage

plotgc.sh in=<input file> out=<output file>

PlotGC analyzes sequence GC content by dividing sequences into fixed-size intervals and calculating the GC percentage for each interval. The output is a tab-delimited file containing position and GC content information.

Parameters

Parameters control input/output files, interval settings, and position calculations.

Input/Output Parameters

in=<file>: Input file. Accepts FASTA or FASTQ format (compressed or uncompressed). Use in=stdin.fa to pipe from stdin.
out=<file>: Output file for tab-delimited GC content data. Use out=stdout to pipe to stdout. Default: stdout.txt, so results go to stdout when out is not set.

Analysis Parameters

interval=1000: Interval length in base pairs. The sequence is divided into intervals of this size for GC calculation.
offset=0: Position offset for coordinates. Use offset=1 for 1-based indexing instead of the default 0-based indexing.
psb=t: (printshortbins) Print GC content for the last bin of a contig even when it is shorter than the specified interval length. Use psb=f to skip incomplete final intervals.

Java Parameters

-Xmx: Set Java's maximum heap size, overriding automatic memory detection. Examples: -Xmx20g for 20 gigabytes, -Xmx200m for 200 megabytes. The maximum is typically 85% of physical memory.
-eoom: Exit the process if an out-of-memory exception occurs. Requires Java 8u92+.
-da: Disable assertions.

Output Format

The output is a tab-delimited file with the following columns:

name: Sequence name/ID
interval: Length of the interval (actual bases counted)
start: Start position within the sequence (0-based or 1-based depending on offset)
stop: Stop position within the sequence
runningStart: Cumulative start position across all sequences
runningStop: Cumulative stop position across all sequences
gc: GC content as a decimal fraction (0.0 to 1.0)
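
As a minimal sketch of how this output might be consumed downstream, the Python snippet below reads the seven columns in the order listed above. GCRow and parse_plotgc are hypothetical names introduced here, not part of PlotGC. The snippet assumes that blank lines, lines starting with '#', and a column-name header row (if any) carry no data; the sample rows are made up for illustration and are not real PlotGC output.

```python
from typing import Iterable, Iterator, NamedTuple


class GCRow(NamedTuple):
    """One output row; field order follows the documented column list."""
    name: str
    interval: int
    start: int
    stop: int
    running_start: int
    running_stop: int
    gc: float


def parse_plotgc(lines: Iterable[str]) -> Iterator[GCRow]:
    """Yield GCRow records from tab-delimited PlotGC output lines."""
    for line in lines:
        line = line.rstrip("\r\n")
        if not line or line.startswith("#"):
            continue  # assumption: blank and '#'-prefixed lines are not data
        fields = line.split("\t")
        if len(fields) != 7 or fields[6] == "gc":
            continue  # assumption: a column-name header row, if present, ends in "gc"
        name, interval, start, stop, rstart, rstop, gc = fields
        yield GCRow(name, int(interval), int(start), int(stop),
                    int(rstart), int(rstop), float(gc))


# Made-up rows for illustration only (not real PlotGC output).
sample = [
    "#name\tinterval\tstart\tstop\trunningStart\trunningStop\tgc",
    "contig_1\t1000\t0\t999\t0\t999\t0.423",
    "contig_1\t500\t1000\t1499\t1000\t1499\t0.510",
]
rows = list(parse_plotgc(sample))
for row in rows:
    print(row.name, row.start, row.stop, row.gc)
```

In practice the lines argument would be an open file handle on the out= file, or sys.stdin when PlotGC is run with out=stdout.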

Examples

Basic GC Content Analysis

plotgc.sh in=genome.fasta out=gc_content.txt

Calculate GC content in 1000bp intervals for a genome assembly.

Custom Interval Size

plotgc.sh in=sequences.fq out=gc_plot.txt interval=500

Use 500bp intervals instead of the default 1000bp for higher resolution analysis.

1-based Coordinates

plotgc.sh in=contigs.fa out=gc_analysis.txt offset=1 psb=f

Use 1-based coordinate system and skip incomplete final intervals.

Pipeline Integration

cat sequences.fasta | plotgc.sh in=stdin.fa out=stdout | head -20

Process sequences from stdin and display the first 20 lines of GC data.
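
The same kind of invocation can be driven from a script. The following Python sketch builds a command line using only the parameters documented above and runs it with the standard subprocess module. plotgc_command and run_plotgc are hypothetical helper names, and the sketch assumes plotgc.sh is on the PATH.

```python
import subprocess
from typing import List


def plotgc_command(in_file: str, out_file: str = "stdout",
                   interval: int = 1000, offset: int = 0,
                   psb: bool = True) -> List[str]:
    """Build a plotgc.sh argument list from the documented parameters."""
    return [
        "plotgc.sh",
        f"in={in_file}",
        f"out={out_file}",
        f"interval={interval}",
        f"offset={offset}",
        f"psb={'t' if psb else 'f'}",
    ]


def run_plotgc(in_file: str, interval: int = 1000, offset: int = 0,
               psb: bool = True) -> str:
    """Run plotgc.sh with out=stdout and return the captured text.

    Assumes plotgc.sh is on the PATH; raises CalledProcessError on failure.
    """
    cmd = plotgc_command(in_file, "stdout", interval, offset, psb)
    result = subprocess.run(cmd, check=True, capture_output=True, text=True)
    return result.stdout


# Building the command does not require PlotGC to be installed.
cmd = plotgc_command("contigs.fa", interval=500, offset=1, psb=False)
print(" ".join(cmd))
```

Passing the arguments as a list rather than a single shell string avoids quoting problems with unusual file names.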

Algorithm Details

PlotGC calculates GC content over fixed, non-overlapping windows in a single pass:

GC Calculation Method

Nucleotide Counting: Uses AminoAcid.baseToNumber array to convert bases to numeric indices (A=0, C=1, G=2, T=3)
AT vs GC Classification: AT content = A + T counts, GC content = G + C counts
Fraction Calculation: GC = GC_count / (AT_count + GC_count), reported as a fraction rather than a percentage; the denominator is floored at 1 to avoid division by zero
Ambiguous Bases: N and other ambiguous nucleotides are ignored in the calculation
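
The calculation described above can be sketched in a few lines of Python. This is an illustration of the formula, not PlotGC's Java code: BASE_TO_NUMBER stands in for the AminoAcid.baseToNumber array, gc_fraction is a hypothetical name, and treating lowercase bases like uppercase is an assumption.

```python
# Base-to-index mapping as described above (A=0, C=1, G=2, T=3).
# Anything not in this table (N, other ambiguity codes) is ignored.
BASE_TO_NUMBER = {"A": 0, "C": 1, "G": 2, "T": 3}


def gc_fraction(bases: str) -> float:
    """Return GC / (AT + GC) for one interval, with the denominator floored at 1."""
    counts = [0, 0, 0, 0]  # A, C, G, T
    for base in bases.upper():  # assumption: case-insensitive counting
        index = BASE_TO_NUMBER.get(base)
        if index is not None:
            counts[index] += 1
    at = counts[0] + counts[3]
    gc = counts[1] + counts[2]
    return gc / max(1, at + gc)


print(gc_fraction("ACGT"))   # two of four counted bases are G or C
print(gc_fraction("GGCCN"))  # N is ignored, so only G and C are counted
print(gc_fraction("NNNN"))   # no counted bases; the floor of 1 avoids dividing by zero
```

Because ambiguous bases are excluded from both numerator and denominator, an interval consisting mostly of N can still report a high or low GC fraction based on only a handful of counted bases.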

Interval Processing

Fixed Intervals: Sequences are divided into non-overlapping intervals of the specified size
Position Tracking: Maintains both sequence-relative and cumulative position coordinates
Incomplete Intervals: The final interval of each sequence may be shorter than the specified interval size
Optional Short Bins: When psb=t (printshortbins, the default), incomplete final intervals are included in the output; with psb=f they are omitted
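
To make the coordinate bookkeeping concrete, here is a hedged Python sketch that computes bin coordinates for a list of contig lengths. bin_coordinates is a hypothetical name, and several details are assumptions rather than documented behavior: stop is treated as inclusive, the interval column is taken to be the bin length, the offset is added to all four position columns, and the running counter keeps advancing past short bins that are not printed.

```python
from typing import Iterator, List, Tuple

# name, length, start, stop, runningStart, runningStop
Bin = Tuple[str, int, int, int, int, int]


def bin_coordinates(contigs: List[Tuple[str, int]], interval: int = 1000,
                    offset: int = 0, print_short_bins: bool = True) -> Iterator[Bin]:
    """Yield non-overlapping bins for (name, length) contigs."""
    running = 0  # bases covered by all earlier bins, across contigs
    for name, length in contigs:
        for begin in range(0, length, interval):
            size = min(interval, length - begin)
            if size == interval or print_short_bins:
                yield (name, size,
                       begin + offset,                # start within the contig
                       begin + size - 1 + offset,     # stop (assumed inclusive)
                       running + offset,              # cumulative start
                       running + size - 1 + offset)   # cumulative stop
            running += size  # assumption: skipped short bins still advance the counter


for row in bin_coordinates([("c1", 2500), ("c2", 1000)]):
    print(*row, sep="\t")
```

With the default settings, a 2500 bp contig yields two full bins and one 500 bp short bin in this sketch; with print_short_bins=False the short bin is dropped.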

Memory Efficiency

Streaming Processing: Processes sequences in chunks without loading entire files into memory
Minimal State: Only maintains nucleotide counts for the current interval
Array Reuse: ACGT count array is reset and reused for each interval to minimize memory allocation
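
The streaming pattern described above might look like the following Python sketch, which reads FASTA text line by line and keeps only a four-element count list between bins. It illustrates the idea and is not PlotGC's implementation: stream_gc is a hypothetical name, only FASTA input is handled, short final bins are always emitted (as with psb=t), and the full header line is used as the sequence name.

```python
import io
from typing import Iterator, Optional, TextIO, Tuple


def stream_gc(handle: TextIO, interval: int = 1000) -> Iterator[Tuple[Optional[str], int, float]]:
    """Yield (name, bases_in_bin, gc_fraction) while reading FASTA line by line."""
    counts = [0, 0, 0, 0]  # A, C, G, T; the same list is reused for every bin
    index = {"A": 0, "C": 1, "G": 2, "T": 3}
    name: Optional[str] = None
    seen = 0  # bases read in the current bin, including ignored ones such as N

    def flush() -> Tuple[Optional[str], int, float]:
        at = counts[0] + counts[3]
        gc = counts[1] + counts[2]
        result = (name, seen, gc / max(1, at + gc))
        counts[:] = [0, 0, 0, 0]  # reset in place instead of allocating a new list
        return result

    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            if name is not None and seen:
                yield flush()  # short final bin of the previous record
            name, seen = line[1:], 0
            continue
        for base in line.upper():
            i = index.get(base)
            if i is not None:
                counts[i] += 1
            seen += 1
            if seen == interval:
                yield flush()
                seen = 0
    if name is not None and seen:
        yield flush()  # short final bin of the last record


fasta = io.StringIO(">c1\nACGTNN\nGG\n>c2\nAT\n")
bins = list(stream_gc(fasta, interval=4))
print(bins)
```

Memory use in this sketch depends on the longest input line and the fixed-size count list, not on the size of the file.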

Output Precision

GC content is formatted to 3 decimal places (e.g., 0.423 for 42.3% GC content). Position coordinates are adjusted by the offset parameter, allowing for both 0-based (default) and 1-based coordinate systems.
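
As a small illustration of the formatting described above, the Python sketch below renders one tab-delimited row with the GC value fixed at three decimal places. format_row is a hypothetical name, and applying the offset to all four position columns is an assumption.

```python
def format_row(name: str, length: int, start: int, stop: int,
               running_start: int, running_stop: int, gc: float,
               offset: int = 0) -> str:
    """Format one row; positions are given 0-based and shifted by offset."""
    positions = (start + offset, stop + offset,
                 running_start + offset, running_stop + offset)
    # f"{gc:.3f}" always prints exactly three digits after the decimal point.
    return "\t".join([name, str(length), *map(str, positions), f"{gc:.3f}"])


# Illustrative values only.
print(format_row("contig_1", 1000, 0, 999, 0, 999, 0.42349))
print(format_row("contig_1", 1000, 0, 999, 0, 999, 0.42349, offset=1))
```

Three decimal places limit the reported resolution to 0.1 percentage points, so bins whose GC differs by less than that can appear identical in the output.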

Performance Considerations

Memory Usage: Very low memory footprint due to streaming processing approach
Processing Speed: Fast single-pass algorithm with O(n) time complexity where n is total sequence length
File Size Handling: Can process arbitrarily large sequence files due to streaming design
I/O Efficiency: Uses buffered readers and writers to reduce the number of disk operations

Common Use Cases

Genome Composition Analysis: Identify GC-rich and AT-rich regions in genomes
Sequence Quality Assessment: Detect compositional biases in sequencing data
Comparative Genomics: Compare GC content patterns between different genomes or regions
Data Visualization Preparation: Generate data for plotting GC content landscapes
Contamination Detection: Identify sequences with unusual GC content that may indicate contamination
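
For use cases such as contamination screening, per-interval values are often summarized per sequence first. The Python sketch below computes a length-weighted mean GC for each sequence from (name, interval, gc) triples and flags sequences outside a chosen range. mean_gc_per_sequence and flag_outliers are hypothetical names, the input rows are made up, and the thresholds are placeholders that would need to be chosen for the organism at hand.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def mean_gc_per_sequence(rows: Iterable[Tuple[str, int, float]]) -> Dict[str, float]:
    """Length-weighted mean GC per sequence from (name, interval, gc) triples."""
    weighted: Dict[str, float] = defaultdict(float)
    total: Dict[str, int] = defaultdict(int)
    for name, length, gc in rows:
        weighted[name] += gc * length  # weight each bin by its length
        total[name] += length
    return {name: weighted[name] / total[name] for name in total if total[name] > 0}


def flag_outliers(means: Dict[str, float], low: float, high: float) -> List[str]:
    """Return sorted names whose mean GC falls outside [low, high]."""
    return sorted(name for name, gc in means.items() if gc < low or gc > high)


# Made-up rows for illustration only.
rows = [("c1", 1000, 0.40), ("c1", 500, 0.46), ("c2", 1000, 0.70)]
means = mean_gc_per_sequence(rows)
suspects = flag_outliers(means, low=0.30, high=0.60)  # placeholder thresholds
print(means, suspects)
```

Weighting by interval length keeps a short final bin from counting as much as a full-size one. An unusual GC value is only a hint of contamination and would normally be checked against other evidence.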

Support

For questions and support:

Email: bbushnell@lbl.gov
Documentation: bbmap.org