Cbcl2Text

Script: cbcl2text.sh
Package: illumina
Class: Cbcl2Text.java

Converts Illumina CBCL (Compressed Base Call) files to text format. Extracts base calls, quality scores, and flowcell coordinates from binary CBCL files. Supports automatic read structure parsing from RunInfo.xml or manual read splitting. Coordinates are transformed to Illumina FASTQ format.

Basic Usage

cbcl2text.sh runfolder=<path> out=<file> lane=<int>

The tool reads the s.locs file for cluster positions, .filter files for pass-filter flags, and .cbcl files for base calls and quality scores.

Parameters

Parameters control input/output locations, read structure parsing, and formatting options.

Standard Parameters

runfolder=<dir>
Path to Illumina run folder containing Data/Intensities directory structure. This is the top-level directory created by the sequencer.
out=<file>
Output file for tab-delimited text. Supports standard output formats including plain text and FASTQ.
lane=<int>
Lane number to process (default: 1). Must match the lane directory structure in BaseCalls/L001, L002, etc.
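
As a rough illustration of how the lane number maps onto the run folder, the sketch below builds the lane directory path and orders cycle directory names numerically. The Data/Intensities/BaseCalls/L00N layout and the helper names lane_dir and sort_cycle_dirs are assumptions made for this example; they are not taken from the tool's source.

```python
import re
from pathlib import PurePosixPath


def lane_dir(runfolder: str, lane: int) -> PurePosixPath:
    """Return the assumed lane directory, e.g. <runfolder>/Data/Intensities/BaseCalls/L001."""
    # Assumption: standard Illumina layout with a zero-padded, three-digit lane number.
    return PurePosixPath(runfolder) / "Data" / "Intensities" / "BaseCalls" / f"L{lane:03d}"


def sort_cycle_dirs(names):
    """Order cycle directory names such as C1.1, C2.1, C10.1 by cycle number."""
    pattern = re.compile(r"^C(\d+)\.1$")
    cycles = []
    for name in names:
        match = pattern.match(name)
        if match:  # silently skip anything that is not a cycle directory
            cycles.append((int(match.group(1)), name))
    return [name for _, name in sorted(cycles)]
```

A plain alphabetical sort would place C10.1 before C2.1, which is why the cycle number is parsed out before sorting.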

Optional Parameters

tiles=<list>
Comma-separated tile numbers to process (e.g., tiles=1101,1102). Default: process all tiles found in lane directory.
length=<mode>
Read splitting mode. Options:
  • (none) - Concatenate all cycles (default)
  • auto - Parse RunInfo.xml for read structure
  • 151,19,10,151 - Manual read lengths as a comma-delimited list (R1,I1,I2,R2)

Java Parameters

-Xmx
Set Java's maximum memory usage. -Xmx20g specifies 20 GB of RAM, and -Xmx200m specifies 200 MB. The maximum is typically 85% of physical memory. Default: 8g (a fixed allocation rather than an autodetected one).
-eoom
This flag will cause the process to exit if an out-of-memory exception occurs. Requires Java 8u92+.
-da
Disable assertions.

Output Format

Default Format (Concatenated)

tile    X       Y       PF      bases(all_cycles)       quals(all_cycles)

All cycles concatenated into single sequence and quality strings.

Split Format (with length parameter)

tile    X       Y       PF      R1,I1,I2,R2             Q1,QI1,QI2,Q2

Reads split into separate comma-delimited fields: R1 (read 1), I1 (index 1), I2 (index 2), R2 (read 2).
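
For downstream scripts, a minimal parsing sketch for one output record might look like the following. It assumes six tab-separated fields in the order shown above and that X and Y are integers after transformation. The Cluster type and parse_record name are hypothetical. Because the encoding of the PF field is not described here, it is kept as text. Quality strings are sliced by the lengths of the base fields rather than split on commas, in case a quality character happens to be a comma.

```python
from typing import List, NamedTuple


class Cluster(NamedTuple):
    tile: str
    x: int
    y: int
    pf: str           # kept as text; its encoding is not specified in this document
    reads: List[str]  # one entry in default format, several in split format
    quals: List[str]


def parse_record(line: str) -> Cluster:
    """Parse one tab-delimited record in either default or split format."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 6:
        raise ValueError(f"expected 6 tab-separated fields, got {len(fields)}")
    tile, x, y, pf, bases, quals = fields
    reads = bases.split(",")
    # Each read contributes its own length, plus one separator between reads.
    expected = sum(len(r) for r in reads) + len(reads) - 1
    if len(quals) != expected:
        raise ValueError("quality field length does not match base field")
    qual_parts, pos = [], 0
    for read in reads:
        qual_parts.append(quals[pos:pos + len(read)])
        pos += len(read) + 1  # skip the separating comma
    return Cluster(tile, int(x), int(y), pf, reads, qual_parts)
```

The same function handles the default format, where the read and quality lists each contain a single element.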

Coordinate Transformation

X and Y coordinates are transformed to Illumina FASTQ format using the formula: round(10 × raw + 1000)
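
The formula can be sketched in a few lines. The function name to_fastq_coord is hypothetical, and the rounding mode is an assumption: this version rounds halves up, as Java's Math.round does, whereas Python's built-in round() rounds halves to the nearest even integer.

```python
import math


def to_fastq_coord(raw: float) -> int:
    """Apply round(10 * raw + 1000) to a raw cluster coordinate."""
    # Assumption: round-half-up, implemented as floor(value + 0.5).
    return math.floor(10.0 * raw + 1000.0 + 0.5)


if __name__ == "__main__":
    for raw in (0.0, 0.25, 100.5):
        print(raw, "->", to_fastq_coord(raw))
```

For example, a raw value of 100.5 gives 10 × 100.5 + 1000 = 2005, and a raw value of 0.25 gives 1002.5, which rounds up to 1003 under the assumed rounding mode.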

Quality Score Binning

Illumina bins quality scores to 2 bits (values 0-3) in CBCL files. The binning is applied on the sequencer, so the original quality values cannot be recovered.

Examples

Basic Conversion

cbcl2text.sh runfolder=./151T8B8B151T_cbcl out=output.txt lane=1

Process all tiles from lane 1 with default concatenated output.

Automatic Read Structure

cbcl2text.sh runfolder=./NovaSeq_run out=reads.txt lane=1 length=auto

Parse RunInfo.xml to automatically determine read structure (e.g., 151T8B8B151T).
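
To illustrate what such parsing involves, the sketch below extracts read lengths from a RunInfo.xml string and builds a structure label such as 151T8B8B151T. It assumes the usual Read elements with Number, NumCycles, and IsIndexedRead attributes. The function name read_structure and the SAMPLE document are made up for this example, and the tool's own parsing may differ.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily trimmed RunInfo.xml content for illustration only.
SAMPLE = """<RunInfo>
  <Run Id="example_run" Number="1">
    <Reads>
      <Read Number="1" NumCycles="151" IsIndexedRead="N"/>
      <Read Number="2" NumCycles="8" IsIndexedRead="Y"/>
      <Read Number="3" NumCycles="8" IsIndexedRead="Y"/>
      <Read Number="4" NumCycles="151" IsIndexedRead="N"/>
    </Reads>
  </Run>
</RunInfo>"""


def read_structure(runinfo_xml: str):
    """Return (lengths, label), e.g. ([151, 8, 8, 151], '151T8B8B151T')."""
    root = ET.fromstring(runinfo_xml)
    reads = []
    for read in root.iter("Read"):
        number = int(read.get("Number"))
        cycles = int(read.get("NumCycles"))
        is_index = read.get("IsIndexedRead", "N") == "Y"
        reads.append((number, cycles, is_index))
    reads.sort()  # order by read number
    lengths = [cycles for _, cycles, _ in reads]
    # T marks a template read, B marks a barcode (index) read.
    label = "".join(f"{cycles}{'B' if is_index else 'T'}" for _, cycles, is_index in reads)
    return lengths, label
```

The resulting list of lengths has the same form as a manual length= value.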

Manual Read Splitting

cbcl2text.sh runfolder=./run_folder out=split.txt lane=1 length=151,19,10,151

Manually specify read lengths: 151bp read 1, 19bp index 1, 10bp index 2, 151bp read 2.
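
The splitting itself amounts to slicing the concatenated cycle string at cumulative offsets. Here is a minimal sketch with hypothetical helper names. Raising an error when the lengths do not add up to the number of cycles is a choice made for this example, not documented tool behavior.

```python
def parse_length_arg(arg: str):
    """Turn a value such as '151,19,10,151' into a list of integers."""
    return [int(part) for part in arg.split(",")]


def split_by_lengths(seq: str, lengths):
    """Cut a concatenated sequence (or quality string) into consecutive reads."""
    if sum(lengths) != len(seq):
        raise ValueError(f"lengths sum to {sum(lengths)} but sequence has {len(seq)} cycles")
    parts, pos = [], 0
    for n in lengths:
        parts.append(seq[pos:pos + n])
        pos += n
    return parts


if __name__ == "__main__":
    # Toy example with short reads: R1=3, I1=2, I2=1, R2=2.
    print(",".join(split_by_lengths("AAACCGTT", [3, 2, 1, 2])))
```

The same slicing applies to the quality string, since it has one character per cycle.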

Specific Tiles

cbcl2text.sh runfolder=./run out=tiles.txt lane=1 tiles=1101,1102,1201

Process only specified tiles instead of all tiles in lane.

Algorithm Details

Processing Pipeline

  1. Position Loading: Read cluster X,Y positions from s.locs binary file (all clusters in flowcell)
  2. Tile Detection: Scan for available .filter files to determine which tiles to process
  3. Filter Reading: Load pass-filter flags for each cluster from .filter files
  4. Cycle Processing: For each cycle directory (C1.1, C2.1, ..., C318.1):
    • Determine which surface (1 or 2) contains the tile
    • Read CBCL file for appropriate surface
    • Decompress gzip blocks and unpack 2-bit encoded bases and qualities
    • Append to per-cluster sequence and quality strings
  5. Read Splitting: If length parameter specified, split concatenated sequences into R1/I1/I2/R2 based on cycle positions
  6. Output Writing: Write tab-delimited records for each cluster with transformed coordinates
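
The decompress-and-unpack part of step 4 can be sketched as follows. The bit layout is an assumption for illustration: two clusters per byte, the first cluster in the low nibble, with the base in the low 2 bits of each nibble (0=A, 1=C, 2=G, 3=T) and the quality bin in the next 2 bits. The function name unpack_block is hypothetical. The sketch does not handle the CBCL header, per-tile block offsets, or the mapping from quality bins to Phred scores.

```python
import gzip

BASES = "ACGT"  # assumed 2-bit base encoding: 0=A, 1=C, 2=G, 3=T


def unpack_block(compressed: bytes, num_clusters: int):
    """Decompress one gzip block and unpack 2-bit bases and 2-bit quality bins."""
    raw = gzip.decompress(compressed)
    if len(raw) * 2 < num_clusters:
        raise ValueError("block is too short for the requested cluster count")
    bases, qbins = [], []
    for i in range(num_clusters):
        byte = raw[i // 2]
        # Assumed layout: even-indexed clusters in the low nibble, odd in the high nibble.
        nibble = (byte >> 4) if i % 2 else (byte & 0x0F)
        bases.append(BASES[nibble & 0x03])
        qbins.append((nibble >> 2) & 0x03)
    return "".join(bases), qbins


if __name__ == "__main__":
    # Three clusters packed into two bytes: C/bin 3, T/bin 0, A/bin 2.
    block = gzip.compress(bytes([0x3D, 0x08]))
    print(unpack_block(block, 3))
```

Appending each cycle's unpacked base and quality bin to the corresponding cluster then yields the per-cluster strings described in the list above.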

Memory Requirements

Memory usage is approximately 1.5GB per tile. The tool uses an 8GB default allocation for typical flowcell configurations.

CBCL Format Characteristics

Support

Please contact Brian Bushnell at bbushnell@lbl.gov if you encounter any problems.