PlotReadPosition
Plots Illumina read positions and barcode hamming distance by analyzing read headers and calculating distances to expected barcodes.
Basic Usage
plotreadposition.sh in=<file> out=<file> expected=<file>
This tool processes Illumina FASTQ files to extract positional information and barcode data from read headers, then calculates hamming distances to expected barcodes.
Parameters
Input/Output Parameters
File Parameters
- in=<file>
- Input FASTQ file containing Illumina reads with positional and barcode information in headers. Can be compressed (gzip).
- out=<file>
- Output TSV file with three columns: x position, y position, and hamming distance to closest expected barcode.
- expected=<file>
- names=<file>
- barcodes=<file>
- File of expected barcode sequences, one per line, supplied via expected=, names=, or barcodes=. Barcodes observed in read headers are compared against these sequences to calculate hamming distances.
Processing Parameters
Read Processing
- maxreads=-1
- Maximum number of reads to process. Default (-1) processes all reads in the input file.
Examples
Basic Position and Barcode Analysis
plotreadposition.sh in=sample.fq out=positions.tsv expected=barcodes.txt
Analyzes read positions and calculates barcode hamming distances for all reads in sample.fq.
Limited Read Processing
plotreadposition.sh in=large_sample.fq out=subset_positions.tsv expected=barcodes.txt maxreads=1000000
Processes only the first 1 million reads from a large FASTQ file.
Compressed Input
plotreadposition.sh in=sample.fq.gz out=positions.tsv expected=expected_barcodes.txt
Processes compressed FASTQ input, automatically detecting gzip format.
Output Format
The output TSV file contains three tab-separated columns with a header line:
x y hdist
1001 1523 0
1002 1523 2
1003 1523 1
...
- x: X coordinate position from the Illumina read header
- y: Y coordinate position from the Illumina read header
- hdist: Hamming distance to the closest expected barcode (0 = perfect match)
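As a rough illustration of consuming this output, the sketch below counts how many reads fall at each hamming distance. It assumes only what is described above: tab-separated values, one header line, and hdist in the third column. The function name `summarize_hdist` and the sample rows are made up for the example.

```python
import csv
from collections import Counter

def summarize_hdist(lines):
    """Count reads per hamming distance in PlotReadPosition-style TSV lines."""
    reader = csv.reader(lines, delimiter="\t")
    next(reader, None)  # skip the header line (assumed to be the first line)
    counts = Counter()
    for row in reader:
        if len(row) < 3:
            continue  # ignore blank or malformed lines
        counts[int(row[2])] += 1  # third column is hdist
    return counts

# Made-up sample in the three-column layout; a real run would pass an open file.
sample = "x\ty\thdist\n1001\t1523\t0\n1002\t1523\t2\n1003\t1523\t1\n"
counts = summarize_hdist(sample.splitlines())
for dist in sorted(counts):
    print(dist, counts[dist])
```

With a real output file, `summarize_hdist(open("positions.tsv"))` would work the same way, since a file object also yields lines.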
Algorithm Details
Header Parsing Strategy
PlotReadPosition uses IlluminaHeaderParser2 to extract structured information from Illumina FASTQ headers. The tool specifically targets:
- Position Extraction: Parses x,y coordinate data from standard Illumina header formats
- Barcode Identification: Extracts barcode sequences from read headers based on configured delimiters
- Dual Barcode Support: Handles both single and dual-indexed barcode systems
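To make the parsing targets concrete, here is a minimal sketch of pulling x, y, and the barcode out of a header. It is not IlluminaHeaderParser2's logic; it assumes the common Illumina layout `@instrument:run:flowcell:lane:tile:x:y read:filtered:control:index`, with dual indexes joined by `+`. The function name and the sample header are hypothetical.

```python
def parse_illumina_header(header):
    """Return (x, y, barcode) from a header in the assumed layout, or None."""
    if header.startswith("@"):
        header = header[1:]
    parts = header.split(maxsplit=1)  # location fields, then comment fields
    if len(parts) != 2:
        return None
    location = parts[0].split(":")
    comment = parts[1].split(":")
    if len(location) < 7 or len(comment) < 4:
        return None
    try:
        x, y = int(location[5]), int(location[6])  # 6th and 7th colon fields
    except ValueError:
        return None
    return x, y, comment[3]  # last comment field holds the index sequence(s)

# Hypothetical dual-indexed header used only for illustration.
parsed = parse_illumina_header("@SIM:1:FCX:1:15:6329:1045 1:N:0:ATCACG+TTAGGC")
print(parsed)
print(parsed[2].split("+"))  # the two halves of a dual barcode
```

Headers that do not fit the assumed layout return None rather than raising, so a caller can skip them.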
Barcode Distance Calculation
The tool employs PCRMatrixHDist for barcode matching:
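- Hamming Distance Computation: Counts the positions at which an observed barcode and an expected barcode of the same length differ (substitutions only, unlike edit distance, which also allows insertions and deletions)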
- Closest Match Selection: Identifies the expected barcode with minimum hamming distance
- Distance Thresholding: Uses configurable maximum distance (99) for barcode matching
- Fallback Handling: Reports barcode length as distance when no close match is found
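The matching steps above can be sketched in a few lines. This is a plain linear scan, not the matrix-based approach of PCRMatrixHDist; the function names are hypothetical, and skipping expected barcodes of a different length is an assumption made for the example.

```python
def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("hamming distance needs equal-length strings")
    return sum(1 for x, y in zip(a, b) if x != y)

def closest_hdist(observed, expected, max_dist=99):
    """Smallest hamming distance to any expected barcode.

    Falls back to the observed barcode's length when nothing matches,
    mirroring the fallback described above.
    """
    best = len(observed)  # fallback value
    for candidate in expected:
        if len(candidate) != len(observed):
            continue  # assumption: differing lengths are not comparable
        dist = hamming(observed, candidate)
        if dist <= max_dist and dist < best:
            best = dist
    return best

expected = ["ATCACG", "TTAGGC"]
print(closest_hdist("ATCACG", expected))  # exact match
print(closest_hdist("ATCACC", expected))  # one substitution from ATCACG
print(closest_hdist("ATCACG", []))        # no candidates: barcode length
```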
Processing Architecture
The implementation uses concurrent stream processing:
- Concurrent Input: ConcurrentReadInputStream enables parallel read processing
- Batch Processing: Reads are processed in batches to optimize memory usage
- Stream Output: ConcurrentReadOutputStream provides TSV writing
- Memory Management: Default 300MB heap allocation suitable for most datasets
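The idea behind batch processing can be shown with a single-threaded sketch: only one batch of reads is held in memory at a time, and a read limit like maxreads simply cuts the stream short. This does not reproduce ConcurrentReadInputStream or its threading; `fastq_headers`, `batches`, and the batch size are hypothetical names and values chosen for the example.

```python
import itertools

def fastq_headers(lines):
    """Yield the header line of each 4-line FASTQ record."""
    for i, line in enumerate(lines):
        if i % 4 == 0:
            yield line.rstrip("\n")

def batches(items, batch_size, max_reads=-1):
    """Yield lists of up to batch_size items; max_reads < 0 means no limit."""
    it = iter(items)
    if max_reads >= 0:
        it = itertools.islice(it, max_reads)
    while True:
        batch = list(itertools.islice(it, batch_size))
        if not batch:
            return
        yield batch

# Tiny made-up FASTQ with three records.
fastq = ["@r1", "ACGT", "+", "IIII",
         "@r2", "ACGT", "+", "IIII",
         "@r3", "ACGT", "+", "IIII"]
for batch in batches(fastq_headers(fastq), batch_size=2):
    print(batch)
```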
Data Structure Usage
PlotReadPosition utilizes specialized data structures for performance:
- PCRMatrixHDist: Matrix-based structure for fast barcode lookups and distance calculations
- ByteBuilder: String construction for output formatting
- Barcode Objects: Structured representation of expected barcodes with distance methods
Performance Characteristics
- Memory Usage: Default 300MB heap allocation, configurable via standard JVM parameters such as -Xmx
- Processing Speed: Concurrent processing architecture enables high throughput
- Scalability: Handles large FASTQ files through batch processing and stream I/O
- Format Support: Automatic detection and processing of compressed (gzip) inputs
Use Cases
- Quality Control: Analyzing spatial distribution of reads across flow cell tiles
- Barcode Validation: Assessing barcode quality and identifying systematic errors
- Sequencing Diagnostics: Detecting positional biases or clustering issues
- Data Visualization: Generating data for positional heat maps and barcode accuracy plots
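For the heat map use case, one possible preprocessing step is to bin reads by position and average the hamming distance per bin. The sketch below assumes rows of (x, y, hdist) integers as in the output file; the function name and the bin size of 1000 are arbitrary choices for illustration, and the resulting dictionary could be handed to any plotting library.

```python
from collections import defaultdict

def bin_mean_hdist(rows, bin_size=1000):
    """Map (x_bin, y_bin) to the mean hdist of the reads in that bin."""
    sums = defaultdict(lambda: [0, 0])  # bin -> [total hdist, read count]
    for x, y, hdist in rows:
        key = (x // bin_size, y // bin_size)
        sums[key][0] += hdist
        sums[key][1] += 1
    return {key: total / n for key, (total, n) in sums.items()}

# Made-up rows in the (x, y, hdist) layout of the output file.
rows = [(1001, 1523, 0), (1002, 1523, 2), (1003, 1523, 1), (2500, 1523, 3)]
grid = bin_mean_hdist(rows)
for key in sorted(grid):
    print(key, grid[key])
```

Bins with a high mean distance would point to regions where barcodes were read poorly.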
Technical Notes
- Header Format Dependency: Requires standard Illumina header formats with position and barcode data
- Barcode File Format: Expected barcode file should contain one barcode sequence per line
- Distance Metrics: Uses hamming distance, the number of positions at which two equal-length barcodes differ
- Output Precision: All coordinate and distance values are reported as integers
Support
For questions and support:
- Email: bbushnell@lbl.gov
- Documentation: bbmap.org