CG2Illumina

Script: cg2illumina.sh
Package: hiseq
Class: BGI2Illumina.java

Converts BGI/Complete Genomics reads to Illumina header format, and optionally appends barcodes/indexes. For example, @E200008112L1C001R00100063962/1 would become @E200008112:0:FC:1:6396:1:1 1:N:0:

Basic Usage

cg2illumina.sh in=<input file> out=<output file> barcode=<string>

Input may be fasta or fastq, compressed or uncompressed.

Parameters

Parameters are organized into standard input/output options and BGI-specific processing parameters.

Standard parameters

in=<file>: Primary input, or read 1 input.
in2=<file>: Read 2 input if reads are in two files.
out=<file>: Primary output, or read 1 output.
out2=<file>: Read 2 output if reads are in two files.
overwrite=f: (ow) When false (the default), the program aborts rather than overwriting an existing file; set to true to allow overwriting.
ziplevel=2: (zl) Set to 1 (lowest) through 9 (max) to change compression level; lower compression is faster.

Processing parameters

barcode=: (index) Optionally append a barcode to the header. If specified, the barcode string is added to the end of the converted Illumina header after the colon following the control bits.
parseextra=f: Set this to true if the read headers have comments delimited by whitespace. When enabled, comments following the header ID are preserved and appended to the converted header after a tab character.

Examples

Basic Conversion

cg2illumina.sh in=bgi_reads.fastq out=illumina_reads.fastq

Converts BGI format headers to Illumina format without adding barcodes.

Conversion with Barcode

cg2illumina.sh in=bgi_reads.fastq out=illumina_reads.fastq barcode=ACGTACGT

Converts headers and appends the specified barcode sequence to each header.

Paired-End Files

cg2illumina.sh in=bgi_R1.fastq in2=bgi_R2.fastq out=illumina_R1.fastq out2=illumina_R2.fastq

Processes paired-end reads from separate files, maintaining pair information in the converted headers.
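
One way to spot-check that pairing survived conversion is to compare the first records of the two output files. The following is a minimal Python sketch, assuming read 2 headers follow the same layout as the read 1 example at the top of this page, with 2 in the pair field. The helper names split_illumina, pair_code and is_mate_pair are hypothetical and not part of BBTools.

```python
def split_illumina(header):
    """Split an Illumina header into (read ID, comment) at the first space."""
    read_id, _, comment = header.partition(" ")
    return read_id, comment


def pair_code(header):
    """Return the read number stored at the start of the comment field."""
    _, comment = split_illumina(header)
    return int(comment.split(":", 1)[0])


def is_mate_pair(h1, h2):
    """True if both headers name the same cluster and are reads 1 and 2."""
    return (split_illumina(h1)[0] == split_illumina(h2)[0]
            and pair_code(h1) == 1
            and pair_code(h2) == 2)


r1 = "@E200008112:0:FC:1:6396:1:1 1:N:0:"
r2 = "@E200008112:0:FC:1:6396:1:1 2:N:0:"  # assumed read 2 counterpart
print(is_mate_pair(r1, r2))
```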

With Comment Parsing

cg2illumina.sh in=bgi_reads.fastq out=illumina_reads.fastq parseextra=t

Parses and preserves comments that appear after the header ID, separated by whitespace.

Algorithm Details

Header Conversion Strategy

CG2Illumina uses a reverse parsing strategy implemented in the BGIHeaderParser2 class to handle BGI/Complete Genomics header formats that have variable prefixes. The parser uses a LineParserS4Reverse with delimiter pattern "_LCR/" to extract components from BGI headers.

BGI Header Format

The tool recognizes several BGI header formats:

v300056266_run28L3C001R0010057888/1
20A_V100002704L1C001R012000000/1
E200008112L1C001R00100063962/1
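
How counting from the right copes with these variable prefixes can be sketched in Python. This is an illustration only, not the BGIHeaderParser2 or LineParserS4Reverse code. It assumes that each character of "_LCR/" acts as a single-character delimiter, and split_bgi_header and fields_from_end are hypothetical helper names.

```python
import re

# Assumption: every character in "_LCR/" is treated as a delimiter.
DELIMS = re.compile(r"[_LCR/]")


def split_bgi_header(header):
    """Split a BGI header (with or without the leading '@') into components."""
    return DELIMS.split(header.lstrip("@"))


def fields_from_end(header):
    """Read the fixed trailing fields by counting from the right,
    so that any number of prefix components can precede them."""
    parts = split_bgi_header(header)
    return {
        "prefix": parts[:-4],   # variable part: one or two components here
        "l_field": parts[-4],   # text after 'L'
        "c_field": parts[-3],   # text after 'C'
        "r_field": parts[-2],   # text after 'R'
        "pair": parts[-1],      # text after '/'
    }


for h in ("v300056266_run28L3C001R0010057888/1",
          "20A_V100002704L1C001R012000000/1",
          "E200008112L1C001R00100063962/1"):
    print(fields_from_end(h))
```

Indexing from the end means the lane, C, R and pair fields land in the same slots whether the prefix has one component or two.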

Illumina Header Mapping

The conversion maps BGI header components to standard Illumina format fields:

Machine Name: Taken from the header prefix (E200008112 in the example at the top of this page)
Run Number: Always set to 0
Flowcell ID: Extracted from the first parsed component, defaults to "FC" if unavailable
Lane Number: Extracted from the third component (L position)
Tile Number: Extracted from the fifth component, digits 3-10
X Position: Extracted from the fourth component (C position)
Y Position: Extracted from the fifth component, first 3 digits
Pair Code: Extracted from the sixth component (read number)
Chastity: Always set to 'N' (not filtered)
Control Bits: Always set to 0
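
The numeric part of this mapping can be sketched in Python as follows. This is a rough illustration under stated assumptions, not the tool's implementation. "Digits 3-10" is read as the zero-based slice [3:10], chosen because it reproduces the example at the top of this page. The machine name and flowcell are passed in by the caller rather than derived. The names bgi_fields and illumina_header are hypothetical.

```python
import re


def bgi_fields(header):
    """Extract lane, x, y, tile and pair code from a BGI header."""
    parts = re.split(r"[_LCR/]", header.lstrip("@"))
    lane, c_field, r_field, pair = parts[-4:]  # count from the right
    return {
        "lane": int(lane),
        "x": int(c_field),
        "y": int(r_field[:3]),       # first 3 digits after 'R'
        "tile": int(r_field[3:10]),  # assumed meaning of "digits 3-10"
        "pair": int(pair),
    }


def illumina_header(f, machine, flowcell="FC", barcode=""):
    """Assemble an Illumina-style header: run 0, chastity N, control bits 0."""
    return "@{}:0:{}:{}:{}:{}:{} {}:N:0:{}".format(
        machine, flowcell, f["lane"], f["tile"], f["x"], f["y"],
        f["pair"], barcode)


f = bgi_fields("@E200008112L1C001R00100063962/1")
print(illumina_header(f, machine="E200008112"))
print(illumina_header(f, machine="E200008112", barcode="ACGTACGT"))
```

With an empty barcode the header ends at the colon after the control bits. A barcode, when given, is placed directly after that colon.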

Performance Characteristics

The tool is designed for efficient processing of large FASTQ files:

Uses concurrent read streams for parallel I/O processing
Supports compressed input and output formats (gzip, bzip2)
Minimal memory footprint with default heap allocation of 300MB
Processes reads in batches to optimize throughput
Preserves original read quality scores and sequence data unchanged
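
The last point, that only headers change, can be illustrated with a short Python sketch. It assumes strict four-line FASTQ records (no wrapped sequence lines) and takes the header conversion as a caller-supplied function. The name rewrite_headers is hypothetical, and the sketch makes no attempt to mirror the tool's concurrent or batched I/O.

```python
import io


def rewrite_headers(src, dst, convert):
    """Copy 4-line FASTQ records from src to dst, rewriting only line 1
    of each record; sequence, '+' and quality lines pass through as-is."""
    for i, line in enumerate(src):
        if i % 4 == 0:
            dst.write(convert(line.rstrip("\n")) + "\n")
        else:
            dst.write(line)


# Tiny in-memory demonstration with a placeholder conversion function.
src = io.StringIO("@read/1\nACGT\n+\nIIII\n@read/2\nTTGA\n+\nFFFF\n")
dst = io.StringIO()
rewrite_headers(src, dst, lambda h: h.replace("/", " "))
print(dst.getvalue(), end="")
```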

Comment Handling

When parseextra=t is set, the tool handles BGI headers that carry a comment after the main header ID. The comment, separated from the ID by whitespace, is preserved and appended to the converted Illumina header after a tab character. This feature is disabled by default for performance reasons, since parsing comments requires additional string processing.
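
The comment handling described here can be sketched in Python as below. It is a minimal illustration, assuming the comment is everything after the first run of whitespace. The function name convert_with_comment, the placeholder conversion function and the sample comment text are all made up for the example.

```python
def convert_with_comment(header, convert_id):
    """Convert the ID part of a header and re-attach any comment after a tab."""
    parts = header.split(None, 1)  # split once, at the first run of whitespace
    new_id = convert_id(parts[0])
    if len(parts) == 2:
        return new_id + "\t" + parts[1]
    return new_id


# Placeholder conversion that just tags the ID so the effect is visible.
tag = lambda s: "<" + s + ">"
print(convert_with_comment("@E200008112L1C001R00100063962/1 sample=A lib=7", tag))
print(convert_with_comment("@E200008112L1C001R00100063962/1", tag))
```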

Support

For questions and support:

Email: bbushnell@lbl.gov
Documentation: bbmap.org