CountGC

Basic Usage

countgc.sh in=<input> out=<output> format=<format>

Input may be stdin or a FASTA or FASTQ file, compressed or uncompressed. Output is optional and tab-delimited.

Parameters

CountGC has a small set of parameters covering input/output specification, output format, and benchmarking.

Input/Output Parameters

in=<file>: Input FASTA or FASTQ file, compressed or uncompressed. May also be "stdin" or "standardin" to read from standard input.
out=<file>: Output file for results. May be "stdout" or "standardout" to print to standard output. Special values: "summaryonly" or "none" (no per-sequence output), "benchmark" (performance testing mode).

Format Parameters

format=1: Output format: name, length, A, C, G, T, N, GC. The A, C, G, and T fractions sum to 1 even when N is nonzero, because they are calculated from defined bases only.
format=2: Output format: name, GC (minimal output showing only sequence name and GC content).
format=4: Output format: name, length, GC (intermediate detail with sequence name, total length, and GC content).

Performance Parameters

benchmark=<t/f>: Enable benchmark mode for performance testing. When true, disables output generation and reports processing speed in MBytes/s. Default: false.

Examples

Basic GC Content Analysis

countgc.sh in=sequences.fasta out=gc_content.txt format=1

Analyzes the sequences in a FASTA file and outputs the detailed nucleotide composition and GC content of each one.

Simple GC Content Report

countgc.sh in=reads.fastq out=gc_summary.txt format=2

Generates a minimal report with just sequence names and GC fractions from FASTQ input.

Compressed Input with Standard Output

countgc.sh in=assembly.fasta.gz out=stdout format=4

Processes a compressed FASTA file and prints the name, length, and GC content of each sequence to standard output.

Pipeline Processing

cat sequences.fasta | countgc.sh in=stdin out=stdout format=2

Processes sequences from standard input in a pipeline, useful for integration with other tools.

Performance Benchmarking

countgc.sh in=large_genome.fasta benchmark=true

Tests processing speed on large files without generating output, useful for performance evaluation.

Algorithm Details

32KB Buffered I/O Architecture

CountGC implements streaming file processing using fixed-size buffers defined in the source code:

Buffer Implementation: Uses 32768-byte arrays (final byte[] buf=new byte[32768]) for both FASTA and FASTQ processing
Single-pass Streaming: Processes files sequentially using while(lim>0) loops that read chunks via is.read(buf)
Format Detection: Uses FileFormat.testInput() method to automatically determine FASTA vs FASTQ format
State Machine Parsing: FASTQ processing uses an integer mode variable (mode=0,1,2,3) to track the header, sequence, plus, and quality lines of each record
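
The read loop described above can be sketched as follows. This is a minimal illustration, not the CountGC source: the class name BufferedReadSketch and the method countNonNewlineBytes are hypothetical, and the loop only counts bytes instead of parsing records. It is meant to show how a fixed 32768-byte buffer keeps memory constant while the stream is consumed in one pass.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch of a fixed-buffer streaming loop; not the CountGC source.
class BufferedReadSketch {

    // Reads the stream in chunks of up to 32768 bytes and returns the
    // number of bytes that are not line terminators.
    static long countNonNewlineBytes(InputStream is) {
        final byte[] buf = new byte[32768];
        long total = 0;
        try {
            int lim = is.read(buf); // valid bytes in buf, or -1 at end of stream
            while (lim > 0) {
                for (int i = 0; i < lim; i++) {
                    if (buf[i] != '\n' && buf[i] != '\r') { total++; }
                }
                lim = is.read(buf); // refill the same buffer
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return total;
    }

    public static void main(String[] args) {
        byte[] data = "ACGT\nGGCC\n".getBytes(StandardCharsets.US_ASCII);
        System.out.println(countNonNewlineBytes(new ByteArrayInputStream(data))); // prints 8
    }
}
```

Because the same buffer is reused for every chunk, the memory cost does not depend on the size of the input.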

Nucleotide Counting Implementation

The tool uses a byte-to-index mapping system implemented in makeCharToNum():

Lookup Array: charToNum byte array maps ASCII values to nucleotide indices: A/a=0, C/c=1, G/g=2, T/t=3, N=4
Case Handling: Explicit mapping r['a']=r['A']=0; r['c']=r['C']=1; etc. for both upper and lowercase
Default Classification: Arrays.fill(r, (byte)4) sets unknown characters to index 4 (N category)
GC Calculation: Uses (counts[1]+counts[2])*inv1 formula where inv1=1f/max(1, sum1) and sum1 excludes N bases
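
A minimal sketch of the lookup table and GC formula described above, under stated assumptions: the class GcCountSketch and the methods count and gc are hypothetical names, the input is a String rather than a byte buffer, and slot 5 of the counter array ("other") is left unused. Only the mapping values and the (counts[1]+counts[2])*inv1 formula come from the description above.

```java
import java.util.Arrays;

// Hypothetical sketch of table-driven nucleotide counting; not the CountGC source.
class GcCountSketch {

    static final byte[] CHAR_TO_NUM = makeCharToNum();

    // Builds a 256-entry table: A/a=0, C/c=1, G/g=2, T/t=3, everything else=4 (N).
    static byte[] makeCharToNum() {
        byte[] r = new byte[256];
        Arrays.fill(r, (byte) 4);
        r['a'] = r['A'] = 0;
        r['c'] = r['C'] = 1;
        r['g'] = r['G'] = 2;
        r['t'] = r['T'] = 3;
        return r;
    }

    // Counts bases into a 6-slot array; slot 5 ("other") is unused in this sketch.
    static int[] count(String seq) {
        int[] counts = new int[6];
        for (int i = 0; i < seq.length(); i++) {
            counts[CHAR_TO_NUM[seq.charAt(i) & 0xFF]]++;
        }
        return counts;
    }

    // GC fraction over defined bases only; N is excluded from the denominator.
    static float gc(int[] counts) {
        long sum1 = (long) counts[0] + counts[1] + counts[2] + counts[3];
        float inv1 = 1f / Math.max(1, sum1);
        return (counts[1] + counts[2]) * inv1;
    }

    public static void main(String[] args) {
        int[] counts = count("ACGTNN");
        System.out.println(gc(counts)); // prints 0.5
    }
}
```

The max(1, sum1) guard means a sequence with no defined bases reports a GC fraction of 0 instead of dividing by zero.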

Format-specific Processing Methods

CountGC contains separate parsing algorithms for different file formats:

FASTA Processing: countFasta() method uses hdmode boolean flag and carrot='>' byte constant for header detection
FASTQ Processing: countFastq() method implements 4-state machine with at='@' byte constant and mode transitions
Output Generation: toString2() methods generate tab-delimited output using Tools.format() with 5-decimal precision
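
The 4-state FASTQ parsing idea can be illustrated with the sketch below. It is an assumption-laden simplification, not the countFastq() implementation: the class FastqStateSketch and the method sequenceLengths are hypothetical, the whole input is held in one array, and only sequence lengths are collected. The mode values follow the header/sequence/plus/quality order given earlier.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of a 4-state FASTQ parser; not the CountGC source.
class FastqStateSketch {

    // mode 0 = header, 1 = sequence, 2 = plus line, 3 = quality.
    // Returns the sequence length of each record in order.
    static List<Integer> sequenceLengths(byte[] data) {
        List<Integer> lengths = new ArrayList<>();
        int mode = 0;
        int len = 0;
        for (byte b : data) {
            if (b == '\r') { continue; }          // ignore carriage returns
            if (b == '\n') {
                if (mode == 3) {                  // end of quality line closes the record
                    lengths.add(len);
                    len = 0;
                }
                mode = (mode + 1) & 3;            // advance 0 -> 1 -> 2 -> 3 -> 0
            } else if (mode == 1) {
                len++;                            // only sequence-line bytes are counted
            }
        }
        if (mode == 3) { lengths.add(len); }      // input ended before the final newline
        return lengths;
    }

    public static void main(String[] args) {
        String fq = "@r1\nACGT\n+\nIIII\n@r2\nGG\n+\nII\n";
        System.out.println(sequenceLengths(fq.getBytes(StandardCharsets.US_ASCII))); // prints [4, 2]
    }
}
```

Tracking the line position with a mode counter, rather than looking for '@' on every line, avoids misreading a quality line that happens to begin with '@' as a new header.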

Memory Management and Performance

CountGC uses specific data structures for memory control:

Counter Arrays: KillSwitch.allocInt1D(6) creates integer arrays for A,C,G,T,N,other counts
Memory Footprint: Uses int[6] arrays for per-sequence counts and long[6] for overall totals
Streaming Design: Constant memory regardless of file size - only stores current buffer and counter arrays
TextStreamWriter: Uses TextStreamWriter class for output buffering when writing to files

Output Format Specifications

Format 1: Complete Nucleotide Analysis

sequence_name	length	A_fraction	C_fraction	G_fraction	T_fraction	N_fraction	GC_fraction

Provides comprehensive nucleotide composition where ACGT fractions sum to 1.0 even when N bases are present.

Format 2: Minimal GC Report

sequence_name	GC_fraction

Streamlined output showing only sequence identifiers and GC content as a fraction between 0 and 1.

Format 4: Intermediate Detail

sequence_name	length	GC_fraction

Balanced output including sequence length information along with GC content.
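
The three layouts can be reproduced with a short sketch like the one below. It is illustrative only: the class OutputFormatSketch and the method formatLine are hypothetical, and two details are assumptions not stated in this document, namely that length counts A, C, G, T, and N together and that the N fraction is taken over that same total. The 5-decimal precision and the column order follow the specifications above.

```java
import java.util.Locale;

// Hypothetical sketch of the tab-delimited output layouts; not the CountGC source.
class OutputFormatSketch {

    // counts = {A, C, G, T, N}. Returns one output line for format 1, 2, or 4.
    static String formatLine(String name, long[] counts, int format) {
        long acgt = counts[0] + counts[1] + counts[2] + counts[3];
        long all = acgt + counts[4];              // assumption: length includes N
        double inv1 = 1.0 / Math.max(1, acgt);    // denominator for A, C, G, T, GC
        double inv2 = 1.0 / Math.max(1, all);     // assumption: denominator for N
        double gc = (counts[1] + counts[2]) * inv1;
        switch (format) {
            case 1:
                return String.format(Locale.ROOT,
                        "%s\t%d\t%.5f\t%.5f\t%.5f\t%.5f\t%.5f\t%.5f",
                        name, all,
                        counts[0] * inv1, counts[1] * inv1,
                        counts[2] * inv1, counts[3] * inv1,
                        counts[4] * inv2, gc);
            case 2:
                return String.format(Locale.ROOT, "%s\t%.5f", name, gc);
            case 4:
                return String.format(Locale.ROOT, "%s\t%d\t%.5f", name, all, gc);
            default:
                throw new IllegalArgumentException("format must be 1, 2, or 4");
        }
    }

    public static void main(String[] args) {
        long[] counts = {1, 1, 1, 1, 0};
        System.out.println(formatLine("seq1", counts, 4)); // prints seq1, 4, 0.50000 separated by tabs
    }
}
```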

Performance Reporting Implementation

CountGC tracks processing metrics using Timer class and specific calculation methods:

Timing: Uses shared.Timer class to track elapsed time via the t.elapsed counter; for the speed formulas below to yield MBytes/s, this value is in nanoseconds
File Speed: Calculates raw speed as bytes*1000d/t.elapsed, where bytes is the on-disk size of the input (compressed size for compressed files) from File.length()
Base Speed: Computes uncompressed speed as Vector.sum(counts)*1000d/t.elapsed for total nucleotides processed
Overall Stats: Uses toString2() method with overall long[6] array to report aggregate nucleotide composition
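
The speed formula can be checked with a small sketch. The unit of the elapsed value is treated as an assumption here: with elapsed time in nanoseconds, bytes*1000d/elapsed comes out in millions of bytes per second, which matches the MBytes/s reported in benchmark mode. The class SpeedSketch and the method megabytesPerSecond are hypothetical names, and System.nanoTime() stands in for the Timer class.

```java
// Hypothetical sketch of the throughput calculation; not the CountGC source.
class SpeedSketch {

    // Assumes elapsedNanos is in nanoseconds:
    // bytes / (elapsedNanos / 1e9) / 1e6 simplifies to bytes * 1000 / elapsedNanos.
    static double megabytesPerSecond(long bytes, long elapsedNanos) {
        return bytes * 1000d / Math.max(1, elapsedNanos);
    }

    public static void main(String[] args) {
        long start = System.nanoTime();
        long checksum = 0;
        byte[] data = new byte[1 << 20];          // 1 MiB of dummy input
        for (byte b : data) { checksum += b; }    // stand-in for real processing
        long elapsed = System.nanoTime() - start;
        System.out.println(String.format("%.2f MBytes/s (checksum %d)",
                megabytesPerSecond(data.length, elapsed), checksum));
    }
}
```

For example, 1,000,000 bytes processed in one second (1,000,000,000 ns) gives 1,000,000*1000/1,000,000,000 = 1.0 MBytes/s.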

Technical Notes

File Format Implementation

FASTA Processing: Detects headers by comparing bytes against the carrot='>' (ASCII 62) constant and tracks header state with the hdmode boolean
FASTQ Processing: Uses at='@' (ASCII 64) byte constant with 4-state machine transitions (mode 0-3)
Compression Support: Uses ReadWrite.getInputStream() method with gzip detection for transparent decompression
Stream I/O: Handles System.in/System.out through InputStream interface with "stdin"/"stdout" string matching

Error Handling Implementation

File Validation: Checks f.exists() and f.isDirectory() before processing with descriptive RuntimeException messages
Format Validation: Throws RuntimeException for invalid FORMAT values (must be 1, 2, or 4)
Input Validation: Validates non-null input parameter with "No input file." RuntimeException
Stream Handling: Uses try-catch blocks around is.read() and is.close() operations with IOException handling
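
The validation checks listed above might look like the following sketch. It is not the CountGC source: the class ValidationSketch, the method validate, and the wording of every message except "No input file." are hypothetical, and the order of the checks is an assumption.

```java
import java.io.File;

// Hypothetical sketch of up-front argument validation; not the CountGC source.
class ValidationSketch {

    // Throws RuntimeException when the input or format argument is unusable.
    static void validate(String in, int format) {
        if (in == null) {
            throw new RuntimeException("No input file.");
        }
        if (format != 1 && format != 2 && format != 4) {
            // Message wording is illustrative.
            throw new RuntimeException("Unknown format: " + format + " (must be 1, 2, or 4)");
        }
        boolean stdin = in.equalsIgnoreCase("stdin") || in.equalsIgnoreCase("standardin");
        if (!stdin) {
            File f = new File(in);
            if (!f.exists() || f.isDirectory()) {
                // Message wording is illustrative.
                throw new RuntimeException("Input file is missing or is a directory: " + in);
            }
        }
    }

    public static void main(String[] args) {
        validate("stdin", 2);                     // passes: stdin needs no file checks
        try {
            validate(null, 1);
        } catch (RuntimeException e) {
            System.out.println(e.getMessage());   // prints No input file.
        }
    }
}
```

Failing before any bytes are read keeps error messages specific to the bad argument rather than surfacing later as an I/O failure.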

Memory Requirements Implementation

CountGC uses specific memory allocation patterns defined in the source code:

JVM Heap: Default z="-Xmx120m" sets maximum heap size to 120MB via shell script configuration
I/O Buffer: byte[32768] arrays for file reading, totaling 32KB per processing thread
Counter Storage: int[6] arrays via KillSwitch.allocInt1D(6) for per-sequence counts (24 bytes)
Overall Totals: long[6] arrays for aggregate statistics (48 bytes)
Character Mapping: Static byte[256] charToNum lookup table (256 bytes)

Support

For questions and support:

Email: bbushnell@lbl.gov
Documentation: bbmap.org