PlotGC
Prints sequence GC content once per interval.
Basic Usage
plotgc.sh in=<input file> out=<output file>
PlotGC analyzes sequence GC content by dividing each sequence into fixed-size intervals and calculating the GC fraction of each interval. The output is a tab-delimited file containing position and GC content information.
Parameters
Parameters control input/output files, interval settings, and position calculations.
Input/Output Parameters
- in=<file>
- Input file. Accepts FASTA or FASTQ format (compressed or uncompressed). Use in=stdin.fa to pipe from stdin.
- out=<file>
- Output file for tab-delimited GC content data. Use out=stdout to pipe to stdout. Default: stdout.txt
Analysis Parameters
- interval=1000
- Interval length in base pairs. The sequence is divided into intervals of this size for GC calculation.
- offset=0
- Position offset for coordinates. Use offset=1 for 1-based indexing instead of the default 0-based indexing.
- psb=t
- (printshortbins) Print GC content for the last bin of a contig even when shorter than the specified interval length. Set to false to skip incomplete final intervals.
Java Parameters
- -Xmx
- Set Java's memory usage, overriding automatic memory detection. Examples: -Xmx20g for 20 gigabytes, -Xmx200m for 200 megabytes. The maximum is typically 85% of physical memory.
- -eoom
- This flag will cause the process to exit if an out-of-memory exception occurs. Requires Java 8u92+.
- -da
- Disable assertions.
Output Format
The output is a tab-delimited file with the following columns:
- name: Sequence name/ID
- interval: Length of the interval (actual bases counted)
- start: Start position within the sequence (0-based or 1-based depending on offset)
- stop: Stop position within the sequence
- runningStart: Cumulative start position across all sequences
- runningStop: Cumulative stop position across all sequences
- gc: GC content as a decimal fraction (0.0 to 1.0)
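The column layout above can be read back with a few lines of Python. This is a minimal sketch, not part of PlotGC: the names GCRow and parse_plotgc are hypothetical, the sample rows are made up for illustration, and skipping lines that start with '#' is an assumption about how a header line (if any) would be marked.

```python
import io
from typing import Iterator, NamedTuple, TextIO


class GCRow(NamedTuple):
    """One data line, using the seven columns listed above."""
    name: str
    interval: int
    start: int
    stop: int
    running_start: int
    running_stop: int
    gc: float


def parse_plotgc(handle: TextIO) -> Iterator[GCRow]:
    """Yield one GCRow per tab-delimited data line."""
    for line in handle:
        line = line.rstrip("\r\n")
        # Assumption: blank lines and lines starting with '#' carry no data.
        if not line or line.startswith("#"):
            continue
        name, interval, start, stop, r_start, r_stop, gc = line.split("\t")[:7]
        yield GCRow(name, int(interval), int(start), int(stop),
                    int(r_start), int(r_stop), float(gc))


# Made-up rows for illustration only; not real PlotGC output.
sample = (
    "#name\tinterval\tstart\tstop\trunningStart\trunningStop\tgc\n"
    "contig1\t1000\t0\t999\t0\t999\t0.423\n"
    "contig1\t500\t1000\t1499\t1000\t1499\t0.510\n"
)
rows = list(parse_plotgc(io.StringIO(sample)))
for row in rows:
    print(row.name, row.start, row.stop, row.gc)
```

For a real file, pass an open file handle instead of the io.StringIO wrapper.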
Examples
Basic GC Content Analysis
plotgc.sh in=genome.fasta out=gc_content.txt
Calculate GC content in 1000bp intervals for a genome assembly.
Custom Interval Size
plotgc.sh in=sequences.fq out=gc_plot.txt interval=500
Use 500bp intervals instead of the default 1000bp for higher-resolution analysis.
1-based Coordinates
plotgc.sh in=contigs.fa out=gc_analysis.txt offset=1 psb=f
Use a 1-based coordinate system and skip incomplete final intervals.
Pipeline Integration
cat sequences.fasta | plotgc.sh in=stdin.fa out=stdout | head -20
Process sequences from stdin and display the first 20 lines of GC data.
Algorithm Details
PlotGC implements a straightforward fixed-window approach for GC content calculation, using non-overlapping intervals rather than a sliding window:
GC Calculation Method
- Nucleotide Counting: Uses AminoAcid.baseToNumber array to convert bases to numeric indices (A=0, C=1, G=2, T=3)
- AT vs GC Classification: AT content = A + T counts, GC content = G + C counts
- Fraction Calculation: GC = GC_count / (AT_count + GC_count), with a minimum denominator of 1 to avoid division by zero, so an interval containing no A, C, G, or T reports 0
- Ambiguous Bases: N and other ambiguous nucleotides are ignored in the calculation
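The counting rules in this list can be sketched in Python as follows. This is an illustration rather than PlotGC's Java code: BASE_TO_NUMBER is a hypothetical stand-in for the AminoAcid.baseToNumber array, and treating lowercase bases like uppercase is an assumption.

```python
# Hypothetical stand-in for AminoAcid.baseToNumber (A=0, C=1, G=2, T=3).
BASE_TO_NUMBER = {"A": 0, "C": 1, "G": 2, "T": 3}


def gc_fraction(bases: str) -> float:
    """Return GC / (AT + GC) for one interval, ignoring ambiguous bases."""
    counts = [0, 0, 0, 0]  # A, C, G, T
    for base in bases.upper():  # assumption: case-insensitive counting
        index = BASE_TO_NUMBER.get(base, -1)
        if index >= 0:  # N and other ambiguity codes are skipped
            counts[index] += 1
    at = counts[0] + counts[3]
    gc = counts[1] + counts[2]
    # A minimum denominator of 1 avoids division by zero for all-N intervals.
    return gc / max(1, at + gc)


print(gc_fraction("ACGTNN"))
print(gc_fraction("NNNN"))
```

Because ambiguous bases are left out of the denominator, an interval such as GGCN is scored on its three called bases only.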
Interval Processing
- Fixed Intervals: Sequences are divided into non-overlapping intervals of the specified size
- Position Tracking: Maintains both sequence-relative and cumulative position coordinates
- Incomplete Intervals: The final interval of each sequence may be shorter than the specified interval size
- Optional Short Bins: When psb=t (printshortbins, the default), incomplete final intervals are included in the output
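The interval bookkeeping described above might look like the following Python sketch. The function name iter_intervals and its parameters are hypothetical. Three details are assumptions that this document does not settle: stop is treated as inclusive, the offset is applied to the running coordinates as well, and the caller advances the running position by the full contig length even when a short bin is skipped.

```python
from typing import Iterator, Tuple

Bin = Tuple[int, int, int, int, int]


def iter_intervals(length: int, interval: int = 1000, offset: int = 0,
                   print_short_bins: bool = True,
                   running: int = 0) -> Iterator[Bin]:
    """Yield (start, stop, running_start, running_stop, n_bases) per bin.

    `running` is the number of bases in all earlier sequences.
    """
    for begin in range(0, length, interval):
        end = min(begin + interval, length)  # exclusive, 0-based
        n_bases = end - begin
        if n_bases < interval and not print_short_bins:
            continue  # psb=f: drop the incomplete final bin
        # Assumption: stop is inclusive, and offset shifts every coordinate.
        yield (begin + offset, end - 1 + offset,
               running + begin + offset, running + end - 1 + offset,
               n_bases)


running = 0
for contig_length in (2500, 1200):
    for row in iter_intervals(contig_length, interval=1000, running=running):
        print(row)
    running += contig_length
```

With a 2500-base contig and interval=1000, the sketch yields two full bins and one 500-base bin; passing print_short_bins=False drops the last one.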
Memory Efficiency
- Streaming Processing: Processes sequences in chunks without loading entire files into memory
- Minimal State: Only maintains nucleotide counts for the current interval
- Array Reuse: ACGT count array is reset and reused for each interval to minimize memory allocation
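To illustrate the streaming idea, here is a small Python sketch that reads FASTA text line by line and keeps only one four-element count list, which is zeroed in place after each bin. It is not PlotGC's implementation: stream_gc and BASE_INDEX are hypothetical names, and counting N positions toward the bin length is an assumption.

```python
import io
from typing import Iterable, Iterator, Optional, Tuple

BASE_INDEX = {"A": 0, "C": 1, "G": 2, "T": 3}


def stream_gc(lines: Iterable[str],
              interval: int = 1000) -> Iterator[Tuple[str, int, float]]:
    """Yield (name, bin_length, gc_fraction) without storing whole sequences."""
    counts = [0, 0, 0, 0]  # the only per-bin state; reused for every bin
    name: Optional[str] = None
    filled = 0  # positions seen in the current bin (assumption: includes N)

    def close_bin() -> Tuple[str, int, float]:
        at = counts[0] + counts[3]
        gc = counts[1] + counts[2]
        row = (name, filled, gc / max(1, at + gc))
        for i in range(4):  # reset in place instead of allocating a new list
            counts[i] = 0
        return row

    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None and filled > 0:
                yield close_bin()  # short final bin of the previous record
                filled = 0
            name = line[1:]
            continue
        for base in line.upper():
            index = BASE_INDEX.get(base, -1)
            if index >= 0:
                counts[index] += 1
            filled += 1
            if filled == interval:
                yield close_bin()
                filled = 0
    if name is not None and filled > 0:
        yield close_bin()


fasta = ">a\nGGCC\nAATT\nGC\n>b\nNNAT\n"
for row in stream_gc(io.StringIO(fasta), interval=4):
    print(row)
```

Since the function is a generator over an iterable of lines, it works the same on an open file handle or on sys.stdin, and memory use does not grow with input size.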
Output Precision
GC content is formatted to 3 decimal places (e.g., 0.423 for 42.3% GC content). Position coordinates are adjusted by the offset parameter, allowing for both 0-based (default) and 1-based coordinate systems.
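As a rough illustration of the formatting described here, the Python sketch below shifts coordinates by the offset and prints the GC fraction with three decimal places. The function format_row is hypothetical, and applying the offset to both start and stop is an assumption.

```python
def format_row(name: str, start: int, stop: int, gc: float,
               offset: int = 0) -> str:
    """Build a tab-delimited fragment with offset-adjusted coordinates."""
    # Assumption: the same offset is added to start and stop.
    fields = [name, str(start + offset), str(stop + offset), f"{gc:.3f}"]
    return "\t".join(fields)


print(format_row("contig1", 0, 999, 0.4234))
print(format_row("contig1", 0, 999, 0.4234, offset=1))
```

Fixed-width formatting also pads shorter values, so a fraction of 0.5 is written as 0.500.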
Performance Considerations
- Memory Usage: Very low memory footprint due to streaming processing approach
- Processing Speed: Fast single-pass algorithm with O(n) time complexity where n is total sequence length
- File Size Handling: Can process arbitrarily large sequence files due to streaming design
- I/O Efficiency: Uses buffered readers and writers for efficient disk access
Common Use Cases
- Genome Composition Analysis: Identify GC-rich and AT-rich regions in genomes
- Sequence Quality Assessment: Detect compositional biases in sequencing data
- Comparative Genomics: Compare GC content patterns between different genomes or regions
- Data Visualization Preparation: Generate data for plotting GC content landscapes
- Contamination Detection: Identify sequences with unusual GC content that may indicate contamination
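For the contamination and composition use cases, one possible downstream step is to flag intervals whose GC fraction sits far from the rest. The sketch below is only an illustration of that idea and is not a PlotGC feature: flag_outliers, the z-score cutoff of 2.0, and the sample values are all hypothetical.

```python
from statistics import mean, pstdev
from typing import List, Sequence, Tuple


def flag_outliers(rows: Sequence[Tuple[str, float]],
                  z_cutoff: float = 2.0) -> List[Tuple[str, float]]:
    """Return (label, gc) pairs more than z_cutoff deviations from the mean."""
    values = [gc for _, gc in rows]
    if not values:
        return []
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return []  # every interval has the same GC fraction
    return [(label, gc) for label, gc in rows
            if abs(gc - mu) / sigma > z_cutoff]


# Made-up GC fractions for illustration only.
bins = [("c1", 0.40), ("c2", 0.41), ("c3", 0.39), ("c4", 0.40),
        ("c5", 0.42), ("c6", 0.38), ("c7", 0.70)]
print(flag_outliers(bins))
```

A flagged interval is only a candidate for closer inspection; unusual GC content can also reflect genuine biology such as horizontally transferred regions.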
Support
For questions and support:
- Email: bbushnell@lbl.gov
- Documentation: bbmap.org