CountGC
Counts the GC content of reads or scaffolds in FASTA or FASTQ files, using 32KB buffered I/O and offering three output formats.
Basic Usage
countgc.sh in=<input> out=<output> format=<format>
Input may be stdin or a FASTA or FASTQ file, compressed or uncompressed. Output is optional and tab-delimited.
Parameters
CountGC has a minimal set of parameters focused on input/output specification and format control.
Input/Output Parameters
- in=<file>
- Input FASTA or FASTQ file, compressed or uncompressed. May also be "stdin" or "standardin" to read from standard input.
- out=<file>
- Output file for results. May be "stdout" or "standardout" to print to standard output. Special values: "summaryonly" or "none" (no per-sequence output), "benchmark" (performance testing mode).
Format Parameters
- format=1
- Output format: name, length, A, C, G, T, N, GC. Note that A+C+G+T=1 even when N is nonzero, as fractions are calculated from defined bases only.
- format=2
- Output format: name, GC (minimal output showing only sequence name and GC content).
- format=4
- Output format: name, length, GC (intermediate detail with sequence name, total length, and GC content).
Performance Parameters
- benchmark=<t/f>
- Enable benchmark mode for performance testing. When true, disables output generation and reports processing speed in MBytes/s. Default: false.
Examples
Basic GC Content Analysis
countgc.sh in=sequences.fasta out=gc_content.txt format=1
Analyzes the sequences in a FASTA file and writes the detailed nucleotide composition of each one.
Simple GC Content Report
countgc.sh in=reads.fastq out=gc_summary.txt format=2
Generates a minimal report with just sequence names and GC fractions from FASTQ input.
Compressed Input with Standard Output
countgc.sh in=assembly.fasta.gz out=stdout format=4
Processes a compressed FASTA file and prints results to standard output in the intermediate-detail format (name, length, GC).
Pipeline Processing
cat sequences.fasta | countgc.sh in=stdin out=stdout format=2
Processes sequences from standard input in a pipeline, useful for integration with other tools.
Performance Benchmarking
countgc.sh in=large_genome.fasta benchmark=true
Tests processing speed on large files without generating output, useful for performance evaluation.
Algorithm Details
32KB Buffered I/O Architecture
CountGC implements streaming file processing using fixed-size buffers defined in the source code:
- Buffer Implementation: Uses 32768-byte arrays (final byte[] buf=new byte[32768]) for both FASTA and FASTQ processing
- Single-pass Streaming: Processes files sequentially using while(lim>0) loops that read chunks via is.read(buf)
- Format Detection: Uses FileFormat.testInput() method to automatically determine FASTA vs FASTQ format
- State Machine Parsing: FASTQ processing uses integer mode variable (mode=0,1,2,3) for header/sequence/plus/quality state tracking
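The chunked read loop described above can be sketched roughly as follows. This is a minimal illustration rather than the CountGC source: the class BufferedReadSketch and the method countByte are hypothetical names, and a ByteArrayInputStream stands in for the real file stream.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch of a 32 KB chunked read loop; not the CountGC source.
final class BufferedReadSketch {

    /** Counts occurrences of one byte value while streaming in 32768-byte chunks. */
    static long countByte(InputStream is, byte target) {
        final byte[] buf = new byte[32768];
        long hits = 0;
        try {
            int lim = is.read(buf); // bytes placed in buf, or -1 at end of stream
            while (lim > 0) {
                for (int i = 0; i < lim; i++) {
                    if (buf[i] == target) { hits++; }
                }
                lim = is.read(buf); // refill the same buffer; memory use stays constant
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return hits;
    }

    public static void main(String[] args) {
        byte[] data = ">seq1\nACGT\n>seq2\nGGCC\n".getBytes(StandardCharsets.US_ASCII);
        // Counting '>' bytes gives the number of FASTA headers in this toy input.
        System.out.println(countByte(new ByteArrayInputStream(data), (byte) '>')); // prints 2
    }
}
```

Because the same buffer is refilled on every pass, the loop never holds more than one 32KB chunk of the input at a time.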
Nucleotide Counting Implementation
The tool uses a byte-to-index mapping system implemented in makeCharToNum():
- Lookup Array: charToNum byte array maps ASCII values to nucleotide indices: A/a=0, C/c=1, G/g=2, T/t=3, N=4
- Case Handling: Explicit mapping r['a']=r['A']=0; r['c']=r['C']=1; etc. for both upper and lowercase
- Default Classification: Arrays.fill(r, (byte)4) sets unknown characters to index 4 (N category)
- GC Calculation: Uses (counts[1]+counts[2])*inv1 formula where inv1=1f/max(1, sum1) and sum1 excludes N bases
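The lookup table and GC formula listed above can be put together in a small sketch. The class GcFractionSketch and the helpers count and gc are hypothetical names introduced for illustration; only the mapping and the inv1 formula come from the description above.

```java
import java.util.Arrays;

// Hypothetical sketch of the byte-to-index lookup and GC formula; not the CountGC source.
final class GcFractionSketch {

    static final byte[] CHAR_TO_NUM = makeCharToNum();

    /** Builds the 256-entry table: A/a=0, C/c=1, G/g=2, T/t=3, everything else=4. */
    static byte[] makeCharToNum() {
        byte[] r = new byte[256];
        Arrays.fill(r, (byte) 4);
        r['a'] = r['A'] = 0;
        r['c'] = r['C'] = 1;
        r['g'] = r['G'] = 2;
        r['t'] = r['T'] = 3;
        return r;
    }

    /** Tallies bases by index; slot 5 is left unused in this sketch. */
    static int[] count(String seq) {
        int[] counts = new int[6];
        for (int i = 0; i < seq.length(); i++) {
            counts[CHAR_TO_NUM[seq.charAt(i) & 0xFF]]++;
        }
        return counts;
    }

    /** GC fraction over defined bases only; N and unknown symbols are excluded from the denominator. */
    static float gc(int[] counts) {
        long sum1 = (long) counts[0] + counts[1] + counts[2] + counts[3];
        float inv1 = 1f / Math.max(1, sum1); // max(1, ...) avoids division by zero for all-N input
        return (counts[1] + counts[2]) * inv1;
    }

    public static void main(String[] args) {
        System.out.println(gc(count("ACGTNN"))); // prints 0.5: two of the four defined bases are G or C
    }
}
```

For "ACGTNN" the two N bases do not dilute the result, which is why the A, C, G, and T fractions still sum to 1.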
Format-specific Processing Methods
CountGC contains separate parsing algorithms for different file formats:
- FASTA Processing: countFasta() method uses hdmode boolean flag and carrot='>' byte constant for header detection
- FASTQ Processing: countFastq() method implements 4-state machine with at='@' byte constant and mode transitions
- Output Generation: toString2() methods generate tab-delimited output using Tools.format() with 5-decimal precision
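The two parsing strategies can be illustrated with a simplified sketch. It assumes strict four-line FASTQ records and advances the state on each newline; the class FormatParsingSketch and both method names are hypothetical, and the actual countFasta() and countFastq() methods may differ in detail.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical sketch of header-flag and 4-state parsing; not the CountGC source.
final class FormatParsingSketch {

    /** FASTA: a '>' outside a header switches header mode on; the next newline switches it off. */
    static long countFastaBases(byte[] data) {
        final byte carrot = '>';
        boolean hdmode = false;
        long bases = 0;
        for (byte b : data) {
            if (hdmode) {
                if (b == '\n') { hdmode = false; }
            } else if (b == carrot) {
                hdmode = true;
            } else if (b != '\n' && b != '\r') {
                bases++;
            }
        }
        return bases;
    }

    /** FASTQ: mode 0=header, 1=sequence, 2=plus line, 3=quality; only mode 1 bytes are counted. */
    static long countFastqBases(byte[] data) {
        int mode = 0;
        long bases = 0;
        for (byte b : data) {
            if (b == '\n') {
                mode = (mode + 1) & 3; // wrap from quality (3) back to header (0)
            } else if (mode == 1 && b != '\r') {
                bases++;
            }
        }
        return bases;
    }

    public static void main(String[] args) {
        byte[] fasta = ">s1 desc\nACGT\nAC\n>s2\nGG\n".getBytes(StandardCharsets.US_ASCII);
        byte[] fastq = "@r1\nACGT\n+\n@@II\n@r2\nGG\n+\nII\n".getBytes(StandardCharsets.US_ASCII);
        System.out.println(countFastaBases(fasta)); // prints 8
        System.out.println(countFastqBases(fastq)); // prints 6
    }
}
```

Tracking the line position rather than searching for '@' matters in the FASTQ case, because a quality string may itself begin with '@', as in the second record line of the toy input.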
Memory Management and Performance
CountGC uses specific data structures for memory control:
- Counter Arrays: KillSwitch.allocInt1D(6) creates integer arrays for A,C,G,T,N,other counts
- Memory Footprint: Uses int[6] arrays for per-sequence counts and long[6] for overall totals
- Streaming Design: Constant memory regardless of file size - only stores current buffer and counter arrays
- TextStreamWriter: Uses TextStreamWriter class for output buffering when writing to files
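One way to picture the split between int[6] per-sequence counters and long[6] overall totals is the sketch below. The class CounterSketch and the method flush are hypothetical, and plain array allocation stands in for KillSwitch.allocInt1D(6).

```java
// Hypothetical sketch of folding per-sequence counts into running totals; not the CountGC source.
final class CounterSketch {

    /** Adds per-sequence counts to the totals, then clears them for the next sequence. */
    static void flush(int[] counts, long[] overall) {
        for (int i = 0; i < counts.length; i++) {
            overall[i] += counts[i]; // long totals can exceed the int range on large inputs
            counts[i] = 0;
        }
    }

    public static void main(String[] args) {
        int[] counts = new int[6];   // A, C, G, T, N, other for the current sequence
        long[] overall = new long[6];

        counts[1] = 3; counts[2] = 5; // pretend the first sequence had 3 C and 5 G
        flush(counts, overall);
        counts[2] = 2;                // and the second had 2 G
        flush(counts, overall);

        System.out.println(overall[1] + " " + overall[2]); // prints "3 7"
    }
}
```

Reusing the same two small arrays for every sequence is what keeps memory use independent of file size.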
Output Format Specifications
Format 1: Complete Nucleotide Analysis
sequence_name length A_fraction C_fraction G_fraction T_fraction N_fraction GC_fraction
Provides comprehensive nucleotide composition where ACGT fractions sum to 1.0 even when N bases are present.
Format 2: Minimal GC Report
sequence_name GC_fraction
Streamlined output showing only sequence identifiers and GC fractions.
Format 4: Intermediate Detail
sequence_name length GC_fraction
Balanced output including sequence length information along with GC content.
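The three layouts can be reproduced with a short formatting sketch. It uses String.format rather than Tools.format(), and it assumes that length counts N bases and that the N fraction is taken over the full length; the class OutputFormatSketch and the method formatLine are hypothetical names.

```java
import java.util.Locale;

// Hypothetical sketch of the three tab-delimited layouts; not the CountGC source.
final class OutputFormatSketch {

    /** counts holds A, C, G, T, N at indices 0..4. */
    static String formatLine(String name, long[] counts, int format) {
        final long a = counts[0], c = counts[1], g = counts[2], t = counts[3], n = counts[4];
        final long sum1 = a + c + g + t;           // defined bases only
        final long sum2 = sum1 + n;                // assumed definition of "length"
        final float inv1 = 1f / Math.max(1, sum1);
        final float inv2 = 1f / Math.max(1, sum2); // assumed denominator for the N fraction
        final float gc = (c + g) * inv1;
        switch (format) {
            case 1:
                return String.format(Locale.ROOT, "%s\t%d\t%.5f\t%.5f\t%.5f\t%.5f\t%.5f\t%.5f",
                        name, sum2, a * inv1, c * inv1, g * inv1, t * inv1, n * inv2, gc);
            case 2:
                return String.format(Locale.ROOT, "%s\t%.5f", name, gc);
            case 4:
                return String.format(Locale.ROOT, "%s\t%d\t%.5f", name, sum2, gc);
            default:
                throw new RuntimeException("Unknown format: " + format);
        }
    }

    public static void main(String[] args) {
        long[] counts = {1, 1, 1, 1, 2}; // e.g. ACGTNN
        System.out.println(formatLine("seq1", counts, 1));
        System.out.println(formatLine("seq1", counts, 2));
        System.out.println(formatLine("seq1", counts, 4));
    }
}
```

For the toy counts, format 2 yields the name followed by 0.50000, and format 4 adds the length 6 between the two fields.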
Performance Reporting Implementation
CountGC tracks processing metrics using Timer class and specific calculation methods:
- Timing: Uses shared.Timer class to track elapsed time via the t.elapsed counter (in nanoseconds, the unit under which bytes*1000d/t.elapsed yields MBytes/s)
- File Speed: Calculates raw speed as bytes*1000d/t.elapsed using File.length() for compressed input size
- Base Speed: Computes uncompressed speed as Vector.sum(counts)*1000d/t.elapsed for total nucleotides processed
- Overall Stats: Uses toString2() method with overall long[6] array to report aggregate nucleotide composition
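The speed formula can be checked with a small worked example. The sketch assumes the elapsed counter is in nanoseconds, since that is the unit under which multiplying by 1000 gives MBytes/s; the class SpeedSketch and the method mbPerSecond are hypothetical.

```java
import java.util.Locale;

// Hypothetical sketch of the throughput arithmetic; not the CountGC source.
final class SpeedSketch {

    /** units*1000/nanoseconds: units per nanosecond is billions per second, times 1000 is millions per second. */
    static double mbPerSecond(long units, long elapsedNanos) {
        return units * 1000d / elapsedNanos;
    }

    public static void main(String[] args) {
        long bytes = 50_000_000L;          // e.g. size on disk of the input file
        long elapsed = 2_000_000_000L;     // 2 seconds, expressed in nanoseconds
        System.out.println(String.format(Locale.ROOT, "%.2f MBytes/s", mbPerSecond(bytes, elapsed)));
        // prints "25.00 MBytes/s"
    }
}
```

The same function applies to the base speed: passing the total nucleotide count instead of the file size gives millions of bases per second.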
Technical Notes
File Format Implementation
- FASTA Processing: Detects headers with the carrot='>' (ASCII 62) byte constant and tracks header state with the hdmode boolean flag
- FASTQ Processing: Uses at='@' (ASCII 64) byte constant with 4-state machine transitions (mode 0-3)
- Compression Support: Uses ReadWrite.getInputStream() method with gzip detection for transparent decompression
- Stream I/O: Matches the "stdin"/"stdout" strings to read input from System.in (as an InputStream) and write output to System.out
Error Handling Implementation
- File Validation: Checks f.exists() and f.isDirectory() before processing with descriptive RuntimeException messages
- Format Validation: Throws RuntimeException for invalid FORMAT values (must be 1, 2, or 4)
- Input Validation: Validates non-null input parameter with "No input file." RuntimeException
- Stream Handling: Uses try-catch blocks around is.read() and is.close() operations with IOException handling
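The validation checks above might look something like the sketch below. Only the "No input file." message is taken from the description; the other messages, the class ValidationSketch, and the method validate are hypothetical, as is the exact order of the checks.

```java
import java.io.File;

// Hypothetical sketch of argument validation; not the CountGC source.
final class ValidationSketch {

    static void validate(String in, int format) {
        if (in == null) {
            throw new RuntimeException("No input file.");
        }
        if (format != 1 && format != 2 && format != 4) {
            // Illustrative message; the real wording is not given in this document.
            throw new RuntimeException("Unknown format: " + format + "; valid values are 1, 2, and 4.");
        }
        boolean stdin = in.equals("stdin") || in.equals("standardin");
        if (!stdin) {
            File f = new File(in);
            if (!f.exists() || f.isDirectory()) {
                // Illustrative message; the real wording is not given in this document.
                throw new RuntimeException("Input is missing or is a directory: " + in);
            }
        }
    }

    public static void main(String[] args) {
        validate("stdin", 2); // passes: standard input with a valid format
        try {
            validate("stdin", 3);
        } catch (RuntimeException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

Failing fast with an unchecked exception keeps the read loop itself free of argument checks.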
Memory Requirements Implementation
CountGC uses specific memory allocation patterns defined in the source code:
- JVM Heap: Default z="-Xmx120m" sets maximum heap size to 120MB via shell script configuration
- I/O Buffer: byte[32768] arrays for file reading, totaling 32KB per processing thread
- Counter Storage: int[6] arrays via KillSwitch.allocInt1D(6) for per-sequence counts (24 bytes)
- Overall Totals: long[6] arrays for aggregate statistics (48 bytes)
- Character Mapping: Static byte[256] charToNum lookup table (256 bytes)
Support
For questions and support:
- Email: bbushnell@lbl.gov
- Documentation: bbmap.org