CountGC

Script: countgc.sh Package: jgi Class: CountGC.java

Counts the GC content of reads or scaffolds in FASTA or FASTQ files, using 32 KB buffered I/O and offering several output formats.

Basic Usage

countgc.sh in=<input> out=<output> format=<format>

Input may be stdin or a FASTA or FASTQ file, compressed or uncompressed. Output is optional and tab-delimited.

Parameters

CountGC has a minimal set of parameters focused on input/output specification and format control.

Input/Output Parameters

in=<file>
Input FASTA or FASTQ file, compressed or uncompressed. May also be "stdin" or "standardin" to read from standard input.
out=<file>
Output file for results. May be "stdout" or "standardout" to print to standard output. Special values: "summaryonly" or "none" (no per-sequence output), "benchmark" (performance testing mode).

Format Parameters

format=1
Output format: name, length, A, C, G, T, N, GC. Note that A+C+G+T=1 even when N is nonzero, as fractions are calculated from defined bases only.
format=2
Output format: name, GC (minimal output showing only sequence name and GC content).
format=4
Output format: name, length, GC (intermediate detail with sequence name, total length, and GC content).
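As an illustration of the three layouts, here is a minimal Python sketch (not CountGC's Java code) that computes the fractions over defined bases only, so that A+C+G+T=1 even when N is present. The function names, the four-decimal precision, and the choice to report the N fraction relative to total length are assumptions made for this example.

```python
def base_fractions(seq):
    """Return (length, A, C, G, T, N, GC) for one sequence."""
    s = seq.upper()
    a, c, g, t = (s.count(b) for b in "ACGT")
    length = len(s)
    defined = a + c + g + t
    n = length - defined  # anything that is not A/C/G/T is treated as N here
    inv = 1.0 / defined if defined else 0.0
    # Assumption: the N fraction is taken over the total length.
    n_frac = n / length if length else 0.0
    return length, a * inv, c * inv, g * inv, t * inv, n_frac, (g + c) * inv


def format_row(name, seq, fmt=1):
    """Build one tab-delimited row in the layout of format 1, 2, or 4."""
    length, a, c, g, t, n, gc = base_fractions(seq)
    if fmt == 1:
        fields = [name, str(length)] + [f"{x:.4f}" for x in (a, c, g, t, n, gc)]
    elif fmt == 2:
        fields = [name, f"{gc:.4f}"]
    elif fmt == 4:
        fields = [name, str(length), f"{gc:.4f}"]
    else:
        raise ValueError("format must be 1, 2, or 4")
    return "\t".join(fields)
```

For example, `format_row("r1", "GGCCAT", 4)` yields the name, the length 6, and a GC fraction of 4/6 in three tab-separated fields.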

Performance Parameters

benchmark=<t/f>
Enable benchmark mode for performance testing. When true, disables output generation and reports processing speed in MBytes/s. Default: false.

Examples

Basic GC Content Analysis

countgc.sh in=sequences.fasta out=gc_content.txt format=1

Analyzes the GC content of sequences in a FASTA file, outputting detailed nucleotide composition for each sequence.

Simple GC Content Report

countgc.sh in=reads.fastq out=gc_summary.txt format=2

Generates a minimal report with just sequence names and GC content from FASTQ input.

Compressed Input with Standard Output

countgc.sh in=assembly.fasta.gz out=stdout format=4

Processes a compressed FASTA file and prints results to standard output in intermediate detail format.

Pipeline Processing

cat sequences.fasta | countgc.sh in=stdin out=stdout format=2

Processes sequences from standard input in a pipeline, useful for integration with other tools.

Performance Benchmarking

countgc.sh in=large_genome.fasta benchmark=true

Tests processing speed on large files without generating output, useful for performance evaluation.

Algorithm Details

32KB Buffered I/O Architecture

CountGC implements streaming file processing using fixed-size 32 KB buffers defined in the source code.
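The general idea of fixed-size buffered streaming can be sketched in Python as follows. This is an illustration under stated assumptions, not the tool's Java implementation: the 32 KB size comes from this section's heading, while the function name and the per-byte tally are hypothetical.

```python
import io

BUFFER_SIZE = 32 * 1024  # 32 KB, the buffer size named in this section


def count_bytes_streaming(stream, buffer_size=BUFFER_SIZE):
    """Read a binary stream in fixed-size chunks and tally each byte value."""
    counts = [0] * 256
    buf = bytearray(buffer_size)  # one reusable buffer, no per-chunk allocation
    view = memoryview(buf)
    total = 0
    while True:
        n = stream.readinto(view)  # number of bytes placed in the buffer
        if not n:                  # 0 (or None) means end of input
            break
        total += n
        for b in view[:n]:         # only the filled part of the buffer
            counts[b] += 1
    return total, counts
```

Because the buffer is reused, memory use stays constant regardless of input size; the same function works on an open file, `sys.stdin.buffer`, or an `io.BytesIO` object.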

Nucleotide Counting Implementation

The tool counts nucleotides through a byte-to-index mapping implemented in makeCharToNum().
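A byte-to-index mapping of this kind is commonly built as a 256-entry table. The Python sketch below is an analogue written for illustration; the specific assignments (A, C, G, T to indices 0 to 3 in either case, every other byte to index 4) are assumptions and may differ from what makeCharToNum() does.

```python
def make_char_to_num():
    """Build a 256-entry table mapping byte values to counter indices."""
    table = bytearray([4]) * 256  # assumption: unknown bytes share index 4 (N)
    for index, base in enumerate("ACGT"):
        table[ord(base)] = index          # upper case
        table[ord(base.lower())] = index  # lower case (soft-masked sequence)
    return bytes(table)


CHAR_TO_NUM = make_char_to_num()


def count_bases(seq_bytes):
    """Return counts as [A, C, G, T, other] for a bytes-like sequence."""
    counts = [0] * 5
    for b in seq_bytes:
        counts[CHAR_TO_NUM[b]] += 1  # one table lookup per byte, no branching
    return counts
```

The table replaces a chain of comparisons with a single indexed read per byte, which is why this pattern is popular in sequence-processing code.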

Format-specific Processing Methods

CountGC contains separate parsing routines for FASTA and FASTQ input.
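To show why the two formats need different handling, here is a small Python sketch of one way to parse each. It is not taken from CountGC; it assumes the format can be recognized from the first character of the input, that FASTA sequences may span several lines, and that FASTQ records occupy exactly four lines each.

```python
def parse_records(lines):
    """Yield (name, sequence) pairs from FASTA or FASTQ text lines."""
    it = iter(line.rstrip("\r\n") for line in lines)
    first = next(it, None)
    while first == "":  # skip leading blank lines
        first = next(it, None)
    if first is None:
        return
    if first.startswith(">"):
        # FASTA: a header line followed by one or more sequence lines.
        name, chunks = first[1:], []
        for line in it:
            if line.startswith(">"):
                yield name, "".join(chunks)
                name, chunks = line[1:], []
            else:
                chunks.append(line)
        yield name, "".join(chunks)
    elif first.startswith("@"):
        # FASTQ (assumed four lines per record): header, bases, '+', quality.
        header = first
        while header:
            seq = next(it, "")
            next(it, None)  # '+' separator line
            next(it, None)  # quality line, not needed for GC counting
            yield header[1:], seq
            header = next(it, None)
    else:
        raise ValueError("input does not look like FASTA or FASTQ")
```

The FASTA branch must accumulate lines until the next header, whereas the FASTQ branch can rely on a fixed record shape and skip the quality line entirely.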

Memory Management and Performance

CountGC uses specific data structures for memory control.

Output Format Specifications

Format 1: Complete Nucleotide Analysis

sequence_name	length	A_fraction	C_fraction	G_fraction	T_fraction	N_fraction	GC_fraction

Provides comprehensive nucleotide composition where ACGT fractions sum to 1.0 even when N bases are present.

Format 2: Minimal GC Report

sequence_name	GC_fraction

Streamlined output showing only sequence identifiers and GC fractions.

Format 4: Intermediate Detail

sequence_name	length	GC_fraction

Balanced output including sequence length information along with GC content.
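For downstream use, the three layouts above can be read back with a few lines of Python. This sketch assumes each data row has exactly the columns listed for its format; the function name is hypothetical, and rows that do not fit (blank lines, or a header or summary line if the tool prints one) are skipped rather than treated as errors.

```python
def read_gc_table(lines, fmt):
    """Parse tab-delimited rows written in format 1, 2, or 4."""
    columns = {
        1: ["name", "length", "A", "C", "G", "T", "N", "GC"],
        2: ["name", "GC"],
        4: ["name", "length", "GC"],
    }[fmt]
    rows = []
    for line in lines:
        fields = line.rstrip("\r\n").split("\t")
        if len(fields) != len(columns):
            continue  # blank line or anything that is not a data row
        row = dict(zip(columns, fields))
        try:
            for key in columns[1:]:
                row[key] = int(row[key]) if key == "length" else float(row[key])
        except ValueError:
            continue  # non-numeric fields, e.g. a header line
        rows.append(row)
    return rows
```

With format 4 the parsed rows carry both length and GC, which is enough to compute a length-weighted mean GC across all sequences.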

Performance Reporting Implementation

CountGC tracks processing metrics with a Timer class and reports throughput in MBytes/s.
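A throughput figure of this kind is bytes processed divided by elapsed time. The Python sketch below shows the arithmetic only; it assumes 1 MByte = 1,000,000 bytes and uses `time.perf_counter` in place of the tool's Timer class, and both helper names are hypothetical.

```python
import time


def megabytes_per_second(num_bytes, elapsed_seconds):
    """Convert a byte count and a duration into MBytes/s."""
    if elapsed_seconds <= 0:
        return float("inf")  # guard against a zero-length measurement
    # Assumption: 1 MByte is taken as 10**6 bytes.
    return num_bytes / 1_000_000 / elapsed_seconds


def timed(func, *args):
    """Run func(*args) and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = func(*args)
    return result, time.perf_counter() - start
```

For instance, 64,000,000 bytes processed in 2 seconds corresponds to 32 MBytes/s under this definition.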

Technical Notes

File Format Implementation

Error Handling Implementation

Memory Requirements Implementation

CountGC uses specific memory allocation patterns defined in the source code.

Support

For questions and support: