CountBarcodes2

Script: countbarcodes2.sh
Package: barcode
Class: CountBarcodes2.java

Counts and analyzes barcode frequencies in sequencing reads, using IlluminaHeaderParser2 to extract barcodes from read headers and BarcodeStats for HashMap-based counting. Supports PCRMatrix-based assignment with three algorithms: Hamming distance (PCRMatrixHDist), edit distance, and probabilistic matching (PCRMatrixProb), which uses expectation-maximization with position-specific error matrices.

Basic Usage

countbarcodes2.sh in=<file> counts=<file>

Input may be stdin or a FASTA or FASTQ file, raw or gzipped. Read names should end in a colon followed by the barcode sequence.

Parameters

Parameters are organized by function: input specification, output options, processing controls, and Java runtime settings. The tool can operate on raw reads or on pre-counted barcode files.

Input Parameters

in=<file>: Input reads, whose names end in a colon followed by the barcode. Can be FASTA or FASTQ, raw or gzipped. Use 'stdin' for standard input.
countsin=<file>: Optional file of pre-counted barcode frequencies, used instead of processing raw reads.
quantset=<file>: Only quantify barcodes in this file. Restricts analysis to specific barcodes of interest.
interleaved=auto: If true, fastq input will be considered interleaved. Default: auto-detect based on file structure.
expected=: Comma-delimited list of expected barcodes. Required for assignment and contamination analysis.

Output Parameters

maxrows=-1: Optionally limit the number of rows printed. Default: -1 (no limit).
printheader=t: Print a header line in output files. Default: true.
out=<file>: (alias: counts) Write barcodes and counts here. Use 'out=stdout' to pipe to standard out.
barcodesout=<file>: Barcode assignment counts. Summary of reads assigned to each expected barcode.
mapout=<file>: Map of observed to expected barcode assignments. Shows which observed barcodes map to which expected ones.
outcontam=<file>: Write contamination quantification here. Setting this enables analysis of cross-contamination between samples and requires labeled data.

Processing Parameters

countundefined=t: Count barcodes that contain non-ACGT symbols. Default: true. Set false to ignore ambiguous barcodes.
pcrmatrix=f: Use a PCRMatrix for barcode assignment. The matrix is created with PCRMatrix.create(bs.length1, bs.length2, bs.delimiter) and populated with the expected barcodes via populateExpected(); assignments are then produced by makeAssignmentMap() with configurable distance thresholds.
mode=hdist: PCRMatrix algorithm selection. hdist: Hamming distance (PCRMatrixHDist) with a configurable maxHDist0 threshold and multi-threaded stride processing. edist: edit distance with banded alignment. prob: probabilistic matching (PCRMatrixProb) using 3D probability matrices and expectation-maximization with position-specific error modeling.

Java Parameters

-Xmx: This will set Java's memory usage, overriding autodetection. -Xmx20g will specify 20 gigs of RAM, and -Xmx200m will specify 200 megs. The max is typically 85% of physical memory.
-eoom: This flag will cause the process to exit if an out-of-memory exception occurs. Requires Java 8u92+.
-da: Disable assertions. May improve performance slightly in production use.

Examples

Basic Barcode Counting

countbarcodes2.sh in=demux_reads.fq out=barcode_counts.txt

Count all barcodes found in read names and output a frequency table.

PCR Matrix Assignment

countbarcodes2.sh in=reads.fq expected=expected_barcodes.txt pcrmatrix=t mode=hdist mapout=assignments.txt barcodesout=summary.txt

Assign observed barcodes to expected ones using Hamming distance matching.

Contamination Analysis

countbarcodes2.sh in=labeled_reads.fq expected=barcodes.txt pcrmatrix=t outcontam=contamination.txt quantset=quantify.txt

Analyze cross-contamination between samples using labeled data.

Processing Pre-counted Data

countbarcodes2.sh countsin=existing_counts.txt expected=barcodes.txt pcrmatrix=t barcodesout=assignments.txt

Process existing barcode counts instead of raw sequencing data.

Algorithm Details

Barcode Extraction and Counting: The tool utilizes IlluminaHeaderParser2.parse() to extract barcode sequences from read headers following the colon delimiter convention. BarcodeStats.increment() maintains frequency counts using HashMap-based storage (BarcodeStats.codeMap) that distinguishes between left and right barcode components for dual-indexed libraries. Barcode length detection is automatic via ff1.barcodeLength() calls for both components.
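The extraction and counting step above can be illustrated with a minimal sketch. It is not the IlluminaHeaderParser2 or BarcodeStats code; the class and method names (BarcodeCountSketch, extractBarcode, count) are hypothetical, and the sketch assumes only that the barcode is whatever follows the last colon in the read header.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of colon-delimited barcode extraction and counting. */
class BarcodeCountSketch {

    /** Returns the text after the last colon in a header, or null if there is none. */
    static String extractBarcode(String header) {
        int idx = header.lastIndexOf(':');
        if (idx < 0 || idx == header.length() - 1) {
            return null; // no colon, or nothing after it
        }
        return header.substring(idx + 1);
    }

    /** Counts barcodes across headers; headers without a barcode are skipped. */
    static Map<String, Long> count(Iterable<String> headers) {
        Map<String, Long> counts = new HashMap<>();
        for (String h : headers) {
            String bc = extractBarcode(h);
            if (bc != null) {
                counts.merge(bc, 1L, Long::sum); // increment, starting at 1
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> headers = Arrays.asList(
            "@M01:23:FC:1:1101:1000:2000 1:N:0:ACGTACGT+TTGGCCAA",
            "@M01:23:FC:1:1101:1001:2001 1:N:0:ACGTACGT+TTGGCCAA",
            "@M01:23:FC:1:1101:1002:2002 1:N:0:GGGGCCCC+AAAATTTT");
        Map<String, Long> counts = count(headers);
        System.out.println("ACGTACGT+TTGGCCAA -> " + counts.get("ACGTACGT+TTGGCCAA"));
    }
}
```

In this sketch a dual-index barcode is kept as one string including its delimiter; splitting it into left and right components, as the tool does, is left out.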

Error-Tolerant Assignment: PCR matrix assignment offers three distinct algorithms. PCRMatrixHDist implements Hamming distance calculation with a configurable maxHDist0 threshold and multi-threaded stride-based processing. Edit distance mode uses banded alignment to handle insertions, deletions, and substitutions. PCRMatrixProb implements an expectation-maximization algorithm with 3D probability matrices [position][called_base][reference_base] for position-specific error modeling, refined over 4-5 iterative passes.
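As a rough illustration of Hamming-distance assignment, the sketch below maps an observed barcode to the closest expected barcode within a distance threshold. It is not the PCRMatrixHDist code: the class and method names are hypothetical, the maxHDist parameter stands in for the maxHDist0 threshold, and rejecting ties between equally close expected barcodes is an assumption made here for safety.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch of Hamming-distance barcode assignment. */
class HammingAssignSketch {

    /** Number of mismatching positions; unequal lengths are treated as incomparable. */
    static int hamming(String a, String b) {
        if (a.length() != b.length()) {
            return Integer.MAX_VALUE;
        }
        int d = 0;
        for (int i = 0; i < a.length(); i++) {
            if (a.charAt(i) != b.charAt(i)) {
                d++;
            }
        }
        return d;
    }

    /**
     * Returns the unique closest expected barcode within maxHDist,
     * or null if nothing is close enough or the best match is tied.
     */
    static String assign(String observed, List<String> expected, int maxHDist) {
        String best = null;
        int bestDist = Integer.MAX_VALUE;
        boolean tie = false;
        for (String e : expected) {
            int d = hamming(observed, e);
            if (d < bestDist) {
                best = e;
                bestDist = d;
                tie = false;
            } else if (d == bestDist) {
                tie = true; // assumption: ambiguous matches are not assigned
            }
        }
        return (best != null && bestDist <= maxHDist && !tie) ? best : null;
    }

    public static void main(String[] args) {
        List<String> expected = Arrays.asList("AAAA", "CCCC");
        System.out.println(assign("AAAT", expected, 1)); // one mismatch from AAAA
        System.out.println(assign("AACC", expected, 2)); // equally far from both
    }
}
```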

Contamination Quantification: Contamination analysis requires IlluminaHeaderParser2.PARSE_COMMENT=true so that the true barcode labels stored in read comments can be read. The cross-contamination calculation uses LongPair objects to track correct (lp.a) versus incorrect (lp.b) assignments per barcode. PPM rates are computed as lp.b*1000000.0/Tools.max(1.0, (lp.a+lp.b)), and geometric means across all barcodes are calculated as Math.exp(logsum/denominator).
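The two formulas above can be worked through in a short sketch. Only the PPM expression and the Math.exp(logsum/denominator) form come from the description; the class name, the long[]{correct, incorrect} pair layout standing in for LongPair, and the choice to skip barcodes with a zero rate when accumulating the log sum are assumptions.

```java
/** Hypothetical sketch of per-barcode contamination PPM and its geometric mean. */
class ContamSketch {

    /** PPM of incorrect assignments; the max(1.0, ...) guard avoids division by zero. */
    static double ppm(long correct, long incorrect) {
        return incorrect * 1000000.0 / Math.max(1.0, (double) (correct + incorrect));
    }

    /**
     * Geometric mean of per-barcode PPM rates.
     * Each row is {correct, incorrect}. Zero rates are skipped because
     * log(0) is undefined (an assumption, not confirmed tool behavior).
     */
    static double geometricMeanPpm(long[][] pairs) {
        double logsum = 0;
        int denominator = 0;
        for (long[] p : pairs) {
            double rate = ppm(p[0], p[1]);
            if (rate > 0) {
                logsum += Math.log(rate);
                denominator++;
            }
        }
        return denominator == 0 ? 0 : Math.exp(logsum / denominator);
    }

    public static void main(String[] args) {
        long[][] pairs = { {999, 1}, {9999, 1} };
        System.out.println("ppm of first barcode: " + ppm(999, 1));
        System.out.println("geometric mean ppm: " + geometricMeanPpm(pairs));
    }
}
```

For example, a barcode with 999 correct and 1 incorrect assignment has a rate of 1 in 1000 reads, or 1000 PPM.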

Tile-Based Analysis: When PCRMatrix.byTile=true, the system extracts tile numbers using Tools.trailingDigits() from barcode strings and associates them via ihp.tile() calls. Tile information enables spatial analysis using key construction: (addTile ? barcode+tile : barcode) for assignment mapping and contamination tracking across flow cell positions.
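The key construction described above can be sketched as follows. The trailingDigits helper here is a hypothetical stand-in written for this example, not the Tools.trailingDigits() implementation, and returning -1 when no digits are present is an assumption.

```java
/** Hypothetical sketch of tile-aware key construction. */
class TileKeySketch {

    /** Parses the run of ASCII digits at the end of s; returns -1 if there is none. */
    static int trailingDigits(String s) {
        int i = s.length();
        while (i > 0 && s.charAt(i - 1) >= '0' && s.charAt(i - 1) <= '9') {
            i--;
        }
        return i == s.length() ? -1 : Integer.parseInt(s.substring(i));
    }

    /** Mirrors (addTile ? barcode+tile : barcode): appends the tile only when requested. */
    static String key(String barcode, int tile, boolean addTile) {
        return addTile ? barcode + tile : barcode;
    }

    public static void main(String[] args) {
        String withTile = key("ACGT+TTGG", 1101, true);
        System.out.println(withTile);
        System.out.println(trailingDigits(withTile));
    }
}
```

Because the tile is appended to the barcode, the same barcode seen on two tiles produces two distinct keys, which is what allows per-tile tracking.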

Concurrent Processing: Read processing utilizes ConcurrentReadInputStream with configurable buffer management via Shared.capBuffers(4) for single-threaded operations. The tool employs ListNum<Read> batch processing with cris.nextList()/cris.returnList() cycle management. BarcodeStats operations are thread-safe through synchronized increment methods, while PCR matrix calculations support parallel processing via stride-based thread partitioning.

Output Formats

Standard Counts (out): Tab-delimited format with barcode sequence and count columns. Can optionally include tile information.

Assignment Map (mapout): Shows mapping from observed barcodes to expected barcodes, including confidence scores and distance metrics.

Barcode Summary (barcodesout): Aggregated counts for each expected barcode, showing total reads assigned.

Contamination Report (outcontam): Detailed contamination analysis with PPM rates, geometric averages, and per-barcode breakdown.

Performance Notes

Memory Management: Memory usage is determined mainly by the HashMap storage in BarcodeStats.codeMap, which tracks each unique barcode. The default allocation via calcmem.sh sets the maximum heap to -Xmx2g, though memory requirements scale with barcode diversity. In prob mode, PCR matrix operations maintain 3D probability arrays [position][called_base][reference_base], which require additional memory proportional to barcode length and alphabet size. ByteFile buffer management uses FORCE_MODE_BF2=true when Shared.threads()>2 for optimized I/O.

Concurrent Processing: ConcurrentReadInputStream enables parallel read processing with configurable thread counts via ReadWrite.setZipThreads(Shared.threads()). PCRMatrixHDist implements multi-threaded barcode assignment using stride-based partitioning (stride = counts.size() / Shared.threads()). ListNum<Read> batch processing minimizes thread synchronization overhead through bulk read handling and cris.nextList()/returnList() cycles.

I/O Optimization: Streaming processing via ByteStreamWriter with configurable buffer sizes prevents memory accumulation for large output files. Compression handling through ReadWrite.USE_PIGZ=true and ReadWrite.USE_UNPIGZ=true enables parallel gzip processing. File format detection via FileFormat.testInput() optimizes parsing strategy based on input type (FASTQ/FASTA) and compression status.

Support

For questions and support:

Email: bbushnell@lbl.gov
Documentation: bbmap.org