PlotReadPosition

Basic Usage

plotreadposition.sh in= out= expected=

This tool processes Illumina FASTQ files to extract positional information and barcode data from read headers, then calculates the Hamming distance from each observed barcode to the closest expected barcode.

Parameters

Input/Output Parameters

File Parameters

in=<file>: Input FASTQ file containing Illumina reads with positional and barcode information in headers. Can be compressed (gzip).
out=<file>: Output TSV file with three columns: x position, y position, and Hamming distance to the closest expected barcode.
expected=<file>: File containing expected barcode sequences, one per line. Used to calculate Hamming distances from observed barcodes in read headers. Aliases: names=<file>, barcodes=<file>.

Processing Parameters

Read Processing

maxreads=-1: Maximum number of reads to process. Default (-1) processes all reads in the input file.

Examples

Basic Position and Barcode Analysis

plotreadposition.sh in=sample.fq out=positions.tsv expected=barcodes.txt

Analyzes read positions and calculates barcode Hamming distances for all reads in sample.fq.

Limited Read Processing

plotreadposition.sh in=large_sample.fq out=subset_positions.tsv expected=barcodes.txt maxreads=1000000

Processes only the first 1 million reads from a large FASTQ file.

Compressed Input

plotreadposition.sh in=sample.fq.gz out=positions.tsv expected=expected_barcodes.txt

Processes compressed FASTQ input, automatically detecting gzip format.

Output Format

The output TSV file contains three tab-separated columns with a header line:

x	y	hdist
1001	1523	0
1002	1523	2
1003	1523	1
...

x: X coordinate position from the Illumina read header
y: Y coordinate position from the Illumina read header
hdist: Hamming distance to the closest expected barcode (0 = perfect match)
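
As a quick check of the output, the TSV can be tabulated by Hamming distance. The following is a minimal sketch that assumes the layout shown above (a header line of x, y, hdist followed by tab-separated integers); the function name summarize_hdist is hypothetical and is not part of the tool.

```python
import csv
import io
from collections import Counter


def summarize_hdist(tsv_text):
    """Count reads per Hamming distance in PlotReadPosition-style TSV text."""
    # Assumes the first line is the header "x<TAB>y<TAB>hdist".
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    counts = Counter()
    for row in reader:
        counts[int(row["hdist"])] += 1
    # Sort by distance so the summary reads from best to worst match.
    return dict(sorted(counts.items()))


sample = "x\ty\thdist\n1001\t1523\t0\n1002\t1523\t2\n1003\t1523\t1\n"
summary = summarize_hdist(sample)
print(summary)
```

For a real file, the same function can be fed the result of open("positions.tsv").read().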

Algorithm Details

Header Parsing Strategy

PlotReadPosition uses IlluminaHeaderParser2 to extract structured information from Illumina FASTQ headers. The tool specifically targets:

Position Extraction: Parses x,y coordinate data from standard Illumina header formats
Barcode Identification: Extracts barcode sequences from read headers based on configured delimiters
Dual Barcode Support: Handles both single and dual-indexed barcode systems
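
To illustrate what is being extracted, here is a minimal sketch that parses a Casava 1.8+ style header, in which the first seven colon-separated fields end in x and y and the comment after the space ends in the index sequence. It is not the IlluminaHeaderParser2 implementation; the names HeaderInfo, parse_illumina_header, and split_barcode, and the "+" delimiter for dual indexes, are assumptions made for this example.

```python
from typing import NamedTuple, Optional


class HeaderInfo(NamedTuple):
    x: int
    y: int
    barcode: Optional[str]


def parse_illumina_header(header: str) -> HeaderInfo:
    """Pull x, y, and the barcode out of a Casava 1.8+ style header."""
    text = header[1:] if header.startswith("@") else header
    location, _, comment = text.partition(" ")
    # instrument:run:flowcell:lane:tile:x:y
    fields = location.split(":")
    if len(fields) < 7:
        raise ValueError(f"not a 7-field Illumina header: {header!r}")
    x, y = int(fields[5]), int(fields[6])
    # read:filtered:control:barcode, so the barcode is the last field
    barcode = comment.split(":")[-1] if comment else None
    return HeaderInfo(x, y, barcode or None)


def split_barcode(barcode: str, delimiter: str = "+") -> tuple:
    """Split a dual-index barcode into its parts (one part if single-index)."""
    return tuple(barcode.split(delimiter))


info = parse_illumina_header(
    "@A00178:38:H5NYYDSXX:2:1101:1001:1523 1:N:0:ACGTACGT+TTGCAGCA"
)
print(info.x, info.y, split_barcode(info.barcode))
```

Headers without a comment field yield a barcode of None in this sketch; how the tool itself treats such reads is not specified here.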

Barcode Distance Calculation

The tool employs PCRMatrixHDist for barcode matching:

Hamming Distance Computation: Counts the positions at which the observed and expected barcodes differ (substitutions only; unlike edit distance, insertions and deletions are not considered)
Closest Match Selection: Identifies the expected barcode with the minimum Hamming distance
Distance Thresholding: Uses a configurable maximum distance (99) for barcode matching
Fallback Handling: Reports the barcode length as the distance when no close match is found
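
The matching logic described above can be sketched with a plain linear scan. This is an illustration only, not PCRMatrixHDist: the function names are hypothetical, the max_dist default simply mirrors the value 99 quoted above, and skipping expected barcodes of a different length is an assumption of this sketch.

```python
def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs equal-length sequences")
    return sum(1 for p, q in zip(a, b) if p != q)


def closest_hdist(observed: str, expected, max_dist: int = 99) -> int:
    """Smallest Hamming distance from observed to any expected barcode."""
    # Fallback: the barcode length, which is also the largest possible distance.
    best = len(observed)
    for candidate in expected:
        if len(candidate) != len(observed):
            continue  # assumption: length mismatches are not comparable
        d = hamming(observed, candidate)
        if d <= max_dist and d < best:
            best = d
    return best


expected = ["ACGTACGT", "TTGCAGCA"]
print(closest_hdist("ACGTACGT", expected))  # exact match
print(closest_hdist("ACGTACGA", expected))  # one substitution
print(closest_hdist("GGGG", expected))      # nothing comparable, falls back
```

A linear scan costs one comparison per expected barcode per read; a matrix-based structure such as the one the tool uses would be expected to serve the same lookups faster on large barcode sets.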

Processing Architecture

The implementation uses concurrent stream processing:

Concurrent Input: ConcurrentReadInputStream enables parallel read processing
Batch Processing: Reads are processed in batches to optimize memory usage
Stream Output: ConcurrentReadOutputStream writes the TSV output
Memory Management: Default 300MB heap allocation suitable for most datasets
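
The batching idea, together with a maxreads-style cap, can be shown in a single-threaded sketch. It does not reproduce ConcurrentReadInputStream or its threading; the function names and the batch size are hypothetical, and the FASTQ reader assumes strict four-line records.

```python
import io
from itertools import islice


def fastq_headers(handle):
    """Yield the header line of each four-line FASTQ record."""
    for i, line in enumerate(handle):
        if i % 4 == 0:
            yield line.rstrip("\n")


def batches(items, batch_size, max_reads=-1):
    """Group items into lists of batch_size; stop after max_reads if >= 0."""
    it = iter(items)
    if max_reads >= 0:
        it = islice(it, max_reads)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch


# Five synthetic records stand in for a real FASTQ file.
fastq = "".join(f"@r{i}\nACGT\n+\nIIII\n" for i in range(5))
all_sizes = [len(b) for b in batches(fastq_headers(io.StringIO(fastq)), 2)]
capped = list(batches(fastq_headers(io.StringIO(fastq)), 2, max_reads=3))
print(all_sizes, capped)
```

Only one batch is held in memory at a time, which is the property that keeps memory use flat regardless of input size.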

Data Structure Usage

PlotReadPosition utilizes specialized data structures for performance:

PCRMatrixHDist: Matrix-based structure for fast barcode lookups and distance calculations
ByteBuilder: String construction for output formatting
Barcode Objects: Structured representation of expected barcodes with distance methods

Performance Characteristics

Memory Usage: Default 300MB heap allocation, configurable via standard JVM parameters such as -Xmx
Processing Speed: Concurrent processing architecture enables high throughput
Scalability: Handles large FASTQ files through batch processing and stream I/O
Format Support: Automatic detection and processing of compressed (gzip) inputs

Use Cases

Quality Control: Analyzing spatial distribution of reads across flow cell tiles
Barcode Validation: Assessing barcode quality and identifying systematic errors
Sequencing Diagnostics: Detecting positional biases or clustering issues
Data Visualization: Generating data for positional heat maps and barcode accuracy plots
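
For the heat-map use case, the output rows are typically binned into a coarse grid before plotting. Below is a minimal sketch of that step using only the standard library; the function name bin_mean_hdist and the bin size are assumptions for this example, and the plotting itself is left to whichever tool is preferred.

```python
from collections import defaultdict


def bin_mean_hdist(rows, bin_size=100):
    """Average hdist per (x_bin, y_bin) grid cell."""
    totals = defaultdict(lambda: [0, 0])  # cell -> [sum of hdist, read count]
    for x, y, hdist in rows:
        cell = (x // bin_size, y // bin_size)
        totals[cell][0] += hdist
        totals[cell][1] += 1
    return {cell: s / n for cell, (s, n) in totals.items()}


# (x, y, hdist) tuples as they would be read from the output TSV.
rows = [(1001, 1523, 0), (1002, 1523, 2), (1003, 1523, 1), (2500, 300, 4)]
grid = bin_mean_hdist(rows, bin_size=1000)
print(grid)
```

Cells with a high mean distance point to regions where barcodes were read poorly, which is the kind of positional bias the use cases above describe.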

Technical Notes

Header Format Dependency: Requires standard Illumina header formats with position and barcode data
Barcode File Format: Expected barcode file should contain one barcode sequence per line
Distance Metrics: Uses Hamming distance (position-by-position comparison of equal-length sequences)
Output Precision: All coordinate and distance values are reported as integers

Support

For questions and support:

Email: bbushnell@lbl.gov
Documentation: bbmap.org