Cbcl2Text
Converts Illumina CBCL (Compressed Base Call) files to text. Extracts base calls and quality scores from binary CBCL files, together with cluster coordinates and pass-filter flags from the accompanying s.locs and .filter files. Read structure can be parsed automatically from RunInfo.xml or specified manually. Coordinates are transformed to match Illumina FASTQ format.
Basic Usage
cbcl2text.sh runfolder=<path> out=<file> lane=<int>
The tool reads the s.locs file for cluster positions, .filter files for pass-filter flags, and .cbcl files for base calls and quality scores.
Parameters
Parameters control input/output locations, read structure parsing, and formatting options.
Standard Parameters
- runfolder=<dir>
- Path to Illumina run folder containing Data/Intensities directory structure. This is the top-level directory created by the sequencer.
- out=<file>
- Output file for tab-delimited text. Supports standard output formats including plain text and FASTQ.
- lane=<int>
- Lane number to process (default: 1). Must correspond to an existing lane directory under BaseCalls (L001, L002, etc.).
Optional Parameters
- tiles=<list>
- Comma-separated tile numbers to process (e.g., tiles=1101,1102). Default: process all tiles found in lane directory.
- length=<mode>
- Read splitting mode. Options: omitted - concatenate all cycles (default); auto - parse RunInfo.xml for the read structure; a comma-delimited list such as 151,19,10,151 - manual read lengths in R1,I1,I2,R2 order.
Java Parameters
- -Xmx
- Set Java's memory usage, overriding the default. -Xmx20g specifies 20 GB of RAM, and -Xmx200m specifies 200 MB. The maximum is typically 85% of physical memory. Default: 8g (fixed allocation).
- -eoom
- This flag will cause the process to exit if an out-of-memory exception occurs. Requires Java 8u92+.
- -da
- Disable assertions.
Output Format
Default Format (Concatenated)
tile X Y PF bases(all_cycles) quals(all_cycles)
All cycles are concatenated into a single sequence string and a single quality string.
Split Format (with length parameter)
tile X Y PF R1,I1,I2,R2 Q1,QI1,QI2,Q2
Reads are split into comma-delimited segments within the sequence and quality fields: R1 (read 1), I1 (index 1), I2 (index 2), R2 (read 2).
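As a rough illustration of the split format, the Python sketch below cuts a concatenated string into consecutive segments and joins them with commas. The function name `split_by_lengths` and the toy lengths are hypothetical and are not taken from cbcl2text.sh; the sketch only assumes that segments follow the R1,I1,I2,R2 order described above.

```python
def split_by_lengths(seq, lengths):
    """Split a concatenated string into consecutive segments of the given lengths."""
    if sum(lengths) > len(seq):
        raise ValueError("read lengths exceed the number of cycles")
    parts = []
    start = 0
    for n in lengths:
        parts.append(seq[start:start + n])
        start += n
    return ",".join(parts)

# Toy record with 10 cycles; a real run might use lengths such as 151,19,10,151.
bases = "ACGTACGTAC"
quals = "3321003321"
lengths = [4, 2, 1, 3]

split_bases = split_by_lengths(bases, lengths)
split_quals = split_by_lengths(quals, lengths)
```

The same lengths are applied to both strings so that each base segment stays aligned with its quality segment.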
Coordinate Transformation
X and Y coordinates are transformed to Illumina FASTQ format using the formula round(10 × raw + 1000), where raw is the cluster position read from s.locs.
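The formula can be sketched in a few lines of Python. The function name `to_fastq_coord` is hypothetical, and rounding halves upward is an assumption; the documentation only says "round".

```python
import math

def to_fastq_coord(raw):
    """Apply round(10 * raw + 1000); halves are rounded up (an assumption)."""
    return int(math.floor(10.0 * raw + 1000.0 + 0.5))

x = to_fastq_coord(2.5)    # 10 * 2.5 + 1000 = 1025
y = to_fastq_coord(1.26)   # 10 * 1.26 + 1000 = 1012.6, rounds to 1013
```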
Quality Score Binning
Illumina bins quality scores to 2 bits (values 0-3) in CBCL files. The binning is applied by the sequencer and is lossy, so the original per-base quality scores cannot be recovered.
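To show what working with binned qualities can look like, the Python sketch below maps bin indices to Phred+33 characters through a lookup table. The table values are purely hypothetical placeholders, and the function name `bins_to_phred33` is not part of cbcl2text.sh; the real bin-to-score mapping depends on the instrument and is not given in this document.

```python
def bins_to_phred33(bins, bin_to_q):
    """Convert quality bin indices (0-3) to a Phred+33 string via a lookup table."""
    return "".join(chr(bin_to_q[b] + 33) for b in bins)

# Hypothetical mapping for illustration only; substitute the values for your run.
HYPOTHETICAL_BIN_TO_Q = {0: 2, 1: 12, 2: 23, 3: 37}

qual_string = bins_to_phred33([3, 3, 2, 1, 0], HYPOTHETICAL_BIN_TO_Q)
```

Because only four bins exist, any such string contains at most four distinct quality characters.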
Examples
Basic Conversion
cbcl2text.sh runfolder=./151T8B8B151T_cbcl out=output.txt lane=1
Process all tiles from lane 1 with default concatenated output.
Automatic Read Structure
cbcl2text.sh runfolder=./NovaSeq_run out=reads.txt lane=1 length=auto
Parse RunInfo.xml to automatically determine read structure (e.g., 151T8B8B151T).
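For readers curious how a read structure can be derived from RunInfo.xml, here is a minimal Python sketch. It assumes the usual `Read` elements with `Number`, `NumCycles`, and `IsIndexedRead` attributes; the embedded XML is a trimmed example, and the function name `read_structure` is hypothetical rather than the tool's own parser.

```python
import xml.etree.ElementTree as ET

# Trimmed example document; a real RunInfo.xml contains many more elements.
RUNINFO_XML = """<RunInfo>
  <Run>
    <Reads>
      <Read Number="1" NumCycles="151" IsIndexedRead="N"/>
      <Read Number="2" NumCycles="8" IsIndexedRead="Y"/>
      <Read Number="3" NumCycles="8" IsIndexedRead="Y"/>
      <Read Number="4" NumCycles="151" IsIndexedRead="N"/>
    </Reads>
  </Run>
</RunInfo>"""

def read_structure(xml_text):
    """Return (lengths, structure) where structure looks like 151T8B8B151T."""
    root = ET.fromstring(xml_text)
    # Order reads by their Number attribute rather than document order.
    reads = sorted(root.iter("Read"), key=lambda r: int(r.get("Number")))
    lengths = [int(r.get("NumCycles")) for r in reads]
    # T marks a template read, B marks an index (barcode) read.
    kinds = ["B" if r.get("IsIndexedRead") == "Y" else "T" for r in reads]
    structure = "".join(f"{n}{k}" for n, k in zip(lengths, kinds))
    return lengths, structure

lengths, structure = read_structure(RUNINFO_XML)
```

The resulting lengths list has the same shape as a manual `length=` value.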
Manual Read Splitting
cbcl2text.sh runfolder=./run_folder out=split.txt lane=1 length=151,19,10,151
Manually specify read lengths: 151bp read 1, 19bp index 1, 10bp index 2, 151bp read 2.
Specific Tiles
cbcl2text.sh runfolder=./run out=tiles.txt lane=1 tiles=1101,1102,1201
Process only specified tiles instead of all tiles in lane.
Algorithm Details
Processing Pipeline
- Position Loading: Read cluster X,Y positions from s.locs binary file (all clusters in flowcell)
- Tile Detection: Scan for available .filter files to determine which tiles to process
- Filter Reading: Load pass-filter flags for each cluster from .filter files
- Cycle Processing: For each cycle directory (C1.1, C2.1, ..., C318.1):
  - Determine which surface (1 or 2) contains the tile
  - Read CBCL file for appropriate surface
  - Decompress gzip blocks and unpack 2-bit encoded bases and qualities
  - Append to per-cluster sequence and quality strings
- Read Splitting: If length parameter specified, split concatenated sequences into R1/I1/I2/R2 based on cycle positions
- Output Writing: Write tab-delimited records for each cluster with transformed coordinates
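The tile detection step can be sketched in Python as a scan of the lane directory for .filter files. The file name pattern `s_<lane>_<tile>.filter` and the exact directory layout are assumptions based on common Illumina run folders, and `lane_dir` and `detect_tiles` are hypothetical helper names, not functions of the tool. The demo builds an empty throwaway run folder so the snippet runs anywhere.

```python
import re
import tempfile
from pathlib import Path

# Assumed naming convention: s_<lane>_<tile>.filter
FILTER_NAME = re.compile(r"^s_(\d+)_(\d+)\.filter$")

def lane_dir(runfolder, lane):
    """Build the assumed lane directory path, e.g. .../BaseCalls/L001."""
    return Path(runfolder) / "Data" / "Intensities" / "BaseCalls" / f"L{lane:03d}"

def detect_tiles(runfolder, lane):
    """Return sorted tile numbers that have a .filter file in the lane directory."""
    tiles = []
    for path in lane_dir(runfolder, lane).iterdir():
        match = FILTER_NAME.match(path.name)
        if match and int(match.group(1)) == lane:
            tiles.append(int(match.group(2)))
    return sorted(tiles)

# Demo on a temporary, empty run folder with placeholder files.
with tempfile.TemporaryDirectory() as tmp:
    directory = lane_dir(tmp, 1)
    directory.mkdir(parents=True)
    for name in ("s_1_1102.filter", "s_1_1101.filter", "notes.txt"):
        (directory / name).touch()
    found = detect_tiles(tmp, 1)
```

A `tiles=` list, when supplied, would simply replace the detected list.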
Memory Requirements
Memory usage is approximately 1.5 GB per tile. The default 8 GB allocation is sized for typical flowcell configurations; it can be changed with -Xmx.
CBCL Format Characteristics
- 2-bit encoding: Bases (A=0, C=1, G=2, T=3; N is stored as base code 0 with quality 0) and qualities (bins 0-3) are each packed using 2 bits per value
- Gzip compression: Each cycle stored in compressed blocks within .cbcl files
- Surface organization: Tiles distributed across two surfaces, each with separate .cbcl files per cycle
- Header metadata: The file header records where each tile's data is located within the compressed file
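The decompress-and-unpack step can be illustrated with a short Python sketch. The bit layout used here is an assumption, not something stated in this document: each byte is taken to hold two clusters, low nibble first, with the base code in the low two bits of a nibble and the quality bin in the high two bits. The function name `unpack_calls` and the toy block are hypothetical, and header parsing is omitted entirely.

```python
import gzip

BASES = "ACGT"

def unpack_calls(block):
    """Decompress one gzip block and unpack (base, quality_bin) pairs.

    Assumed layout: two clusters per byte, low nibble first; within a nibble
    the low 2 bits are the base code and the high 2 bits are the quality bin.
    """
    raw = gzip.decompress(block)
    calls = []
    for byte in raw:
        for nibble in (byte & 0x0F, byte >> 4):
            base_code = nibble & 0b11
            qual_bin = (nibble >> 2) & 0b11
            # Base code 0 with quality 0 is reported as N, as described above.
            is_no_call = base_code == 0 and qual_bin == 0
            calls.append(("N" if is_no_call else BASES[base_code], qual_bin))
    return calls

# Toy block: 0xD6 = 1101 0110 -> low nibble G with bin 1, high nibble C with bin 3.
toy_block = gzip.compress(bytes([0xD6, 0x00]))
calls = unpack_calls(toy_block)
```

Under this layout a true A with quality bin 0 cannot be distinguished from N, which matches the encoding note in the list above.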
Support
Please contact Brian Bushnell at bbushnell@lbl.gov if you encounter any problems.