CG2Illumina
Converts BGI/Complete Genomics read headers to Illumina header format and optionally appends a barcode/index. For example, @E200008112L1C001R00100063962/1 would become @E200008112:0:FC:1:6396:1:1 1:N:0:
Basic Usage
cg2illumina.sh in=<input file> out=<output file> barcode=<string>
Input may be fasta or fastq, compressed or uncompressed.
Parameters
Parameters are organized into standard input/output options and BGI-specific processing parameters.
Standard parameters
- in=<file>
- Primary input, or read 1 input.
- in2=<file>
- Read 2 input if reads are in two files.
- out=<file>
- Primary output, or read 1 output.
- out2=<file>
- Read 2 output if reads are in two files.
- overwrite=f
- (ow) Set to false to force the program to abort rather than overwrite an existing file.
- ziplevel=2
- (zl) Set to 1 (lowest) through 9 (max) to change compression level; lower compression is faster.
Processing parameters
- barcode=
- (index) Optionally append a barcode to the header. If specified, the barcode string is added to the end of the converted Illumina header after the colon following the control bits.
- parseextra=f
- Set this to true if the read headers have comments delimited by whitespace. When enabled, comments following the header ID are preserved and appended to the converted header after a tab character.
Examples
Basic Conversion
cg2illumina.sh in=bgi_reads.fastq out=illumina_reads.fastq
Converts BGI format headers to Illumina format without adding barcodes.
Conversion with Barcode
cg2illumina.sh in=bgi_reads.fastq out=illumina_reads.fastq barcode=ACGTACGT
Converts headers and appends the specified barcode sequence to each header.
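To illustrate where the barcode lands, the following Python sketch appends a barcode to a header that has already been converted. The helper name append_barcode is hypothetical and not part of the tool. It assumes the converted header ends with the colon that follows the control bits, as in the example at the top of this page.

```python
def append_barcode(converted_header, barcode=""):
    """Append a barcode after the trailing colon of a converted header.

    Hypothetical helper for illustration only; not part of cg2illumina.sh.
    """
    if not converted_header.endswith(":"):
        raise ValueError("expected header to end with ':' after the control bits")
    return converted_header + barcode


print(append_barcode("@E200008112:0:FC:1:6396:1:1 1:N:0:", "ACGTACGT"))
# prints @E200008112:0:FC:1:6396:1:1 1:N:0:ACGTACGT
```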
Paired-End Files
cg2illumina.sh in=bgi_R1.fastq in2=bgi_R2.fastq out=illumina_R1.fastq out2=illumina_R2.fastq
Processes paired-end reads from separate files, maintaining pair information in the converted headers.
With Comment Parsing
cg2illumina.sh in=bgi_reads.fastq out=illumina_reads.fastq parseextra=t
Parses and preserves comments that appear after the header ID, separated by whitespace.
Algorithm Details
Header Conversion Strategy
CG2Illumina uses a reverse parsing strategy implemented in the BGIHeaderParser2 class to handle BGI/Complete Genomics header formats that have variable prefixes. The parser uses a LineParserS4Reverse with delimiter pattern "_LCR/" to extract components from BGI headers.
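The idea of reverse parsing can be sketched in Python as follows. This is not the BGIHeaderParser2 or LineParserS4Reverse code; the function name split_bgi_header_reverse is hypothetical, and the handling of a missing delimiter (a None placeholder) is an assumption made so that component positions stay stable. Scanning from the right means that a prefix containing "_", "L", "C" or "R" cannot confuse the split.

```python
def split_bgi_header_reverse(header, delimiters="_LCR/"):
    """Split a BGI header from the right, one delimiter at a time.

    Illustrative sketch only. The delimiters are consumed last to first:
    '/', then 'R', 'C', 'L', and finally '_'.
    """
    if header.startswith("@"):
        header = header[1:]
    parts = []
    rest = header
    for delim in reversed(delimiters):
        idx = rest.rfind(delim)
        if idx < 0:
            # Assumption: keep a placeholder so later indexes do not shift.
            parts.append(None)
            continue
        parts.append(rest[idx + 1:])
        rest = rest[:idx]
    parts.append(rest)  # whatever remains is the leading prefix
    parts.reverse()
    return parts


print(split_bgi_header_reverse("v300056266_run28L3C001R0010057888/1"))
# prints ['v300056266', 'run28', '3', '001', '0010057888', '1']
```

For a header without an underscore, such as E200008112L1C001R00100063962/1, the second slot in this sketch is None and the remaining components keep their positions.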
BGI Header Format
The tool recognizes several BGI header formats:
v300056266_run28L3C001R0010057888/1
20A_V100002704L1C001R012000000/1
E200008112L1C001R00100063962/1
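A quick way to check whether a header has this general shape is a regular expression. The pattern below is an assumption inferred from the three examples above (any prefix, then L, C and R fields of digits, then a slash and a read number); it is not taken from the tool's source, and the names BGI_HEADER and looks_like_bgi are hypothetical.

```python
import re

# Assumed shape: <prefix>L<lane>C<column>R<digits>/<read number>
BGI_HEADER = re.compile(
    r"@?(?P<prefix>.+)L(?P<lane>\d+)C(?P<col>\d+)R(?P<rest>\d+)/(?P<read>\d)"
)


def looks_like_bgi(header):
    """Return True if the whole header matches the assumed BGI shape."""
    return BGI_HEADER.fullmatch(header) is not None


for h in ("v300056266_run28L3C001R0010057888/1",
          "20A_V100002704L1C001R012000000/1",
          "E200008112L1C001R00100063962/1",
          "@E200008112:0:FC:1:6396:1:1 1:N:0:"):
    print(h, looks_like_bgi(h))
```

The first three headers match and the last one, which is already in Illumina format, does not.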
Illumina Header Mapping
The conversion maps BGI header components to standard Illumina format fields:
- Machine Name
- Always set to "CG" (Complete Genomics)
- Run Number
- Always set to 0
- Flowcell ID
- Extracted from the first parsed component, defaults to "FC" if unavailable
- Lane Number
- Extracted from the third component (L position)
- Tile Number
- Extracted from the fifth component, digits 3-10
- X Position
- Extracted from the fourth component (C position)
- Y Position
- Extracted from the fifth component, first 3 digits
- Pair Code
- Extracted from the sixth component (read number)
- Chastity
- Always set to 'N' (not filtered)
- Control Bits
- Always set to 0
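The layout of the resulting header can be sketched as below. The function format_illumina_header is a hypothetical helper that only assembles fields in the standard Illumina order (machine:run:flowcell:lane:tile:x:y, a space, then pair:chastity:control:barcode). It does not derive the fields from a BGI header, and the values in the usage line are copied from the converted example at the top of this page.

```python
def format_illumina_header(machine, run, flowcell, lane, tile, x, y,
                           pair, chastity="N", control=0, barcode=""):
    """Assemble an Illumina-style header from already-extracted fields.

    Illustrative sketch only; not the tool's implementation.
    """
    left = ":".join(str(v) for v in (machine, run, flowcell, lane, tile, x, y))
    right = ":".join(str(v) for v in (pair, chastity, control, barcode))
    return "@" + left + " " + right


print(format_illumina_header("E200008112", 0, "FC", 1, 6396, 1, 1, pair=1))
# prints @E200008112:0:FC:1:6396:1:1 1:N:0:
```

With an empty barcode the header ends in a colon, which is where a barcode string is appended when one is supplied.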
Performance Characteristics
The tool is designed for efficient processing of large FASTQ files:
- Uses concurrent read streams for parallel I/O processing
- Supports compressed input and output formats (gzip, bzip2)
- Small memory footprint: the default heap allocation is 300 MB
- Processes reads in batches to optimize throughput
- Preserves original read quality scores and sequence data unchanged
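The last point, that only headers change, can be illustrated with a small streaming sketch in Python. This is not how the tool is implemented; rewrite_headers and convert_file are hypothetical names, the convert argument stands in for any header conversion function, and the sketch assumes strict four-line FASTQ records.

```python
import gzip


def rewrite_headers(lines, convert):
    """Yield FASTQ lines, rewriting only the header line of each record.

    Assumes four lines per record; sequence, '+' and quality lines pass
    through unchanged.
    """
    for i, line in enumerate(lines):
        if i % 4 == 0:
            yield convert(line.rstrip("\n")) + "\n"
        else:
            yield line


def convert_file(in_path, out_path, convert, level=2):
    """Stream a gzipped FASTQ file through rewrite_headers (not called here)."""
    with gzip.open(in_path, "rt") as src, \
            gzip.open(out_path, "wt", compresslevel=level) as dst:
        dst.writelines(rewrite_headers(src, convert))


record = ["@E200008112L1C001R00100063962/1\n", "ACGT\n", "+\n", "IIII\n"]
# Stand-in conversion for the demo: tag the header instead of converting it.
for out_line in rewrite_headers(record, lambda h: h + " converted"):
    print(out_line, end="")
```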
Comment Handling
When parseextra=true is specified, the tool can handle BGI headers that contain additional comments after the main header ID. These comments are separated by whitespace and are preserved in the converted Illumina header, appended after a tab character. This feature is disabled by default for performance reasons, as parsing comments requires additional string processing.
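The behavior described above can be sketched in Python. The function convert_with_comment is hypothetical, and convert_id stands in for the actual header conversion; the sketch only shows the split on the first whitespace and the tab-separated re-attachment of the comment.

```python
def convert_with_comment(header, convert_id):
    """Convert the header ID and re-attach any trailing comment after a tab.

    Illustrative sketch only; convert_id is a stand-in for the real conversion.
    """
    parts = header.split(None, 1)  # split at the first run of whitespace
    if not parts:
        return header  # empty header: nothing to convert
    converted = convert_id(parts[0])
    if len(parts) == 1:
        return converted  # no comment present
    return converted + "\t" + parts[1]


# Stand-in conversion for the demo: wrap the ID so the effect is visible.
demo = convert_with_comment("@E200008112L1C001R00100063962/1 sample=A lib=2",
                            lambda read_id: "<" + read_id + ">")
print(repr(demo))
# prints '<@E200008112L1C001R00100063962/1>\tsample=A lib=2'
```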
Support
For questions and support:
- Email: bbushnell@lbl.gov
- Documentation: bbmap.org