CompareLabels

Script: comparelabels.sh
Package: barcode
Class: CompareLabels.java

Compares delimited labels in read headers to count how many match. The 'unknown' label is a special case. The original goal was to measure the differences between demultiplexing methods. Labels can be added with the rename.sh suffix flag, the novademux.sh rename+nosplit flags, or seal.sh with rename, addcount=f, and tophitonly. The assumption is that a header will look like:

@VP2:12:H7:2:1101:8:2 1:N:0:CAAC (tab) CAAC (tab) CAAC

in which case the last two labels (CAAC and CAAC) would be compared and found equal.

Basic Usage

comparelabels.sh in=<input file> out=<output file>

Input may be fasta or fastq, compressed or uncompressed. The tool processes read headers to extract and compare labels separated by the specified delimiter.

Parameters

Parameters control input/output behavior and label comparison settings. The tool extracts the last two delimiter-separated terms from read headers for comparison.

Standard parameters

in=<file>: Primary input, or read 1 input. Can be fasta or fastq format, compressed or uncompressed.
out=stdout: Print the results to this destination. Default is stdout but a file may be specified. Output contains summary statistics of label comparisons.
labelstats=: Optional destination for per-label stats. When specified, creates a LinkedHashMap<String, Label> to track individual label performance. Output includes counts (total, count1, count2), component statistics (aa, au, ua, ab, ba), yield calculations, and contamination rates in PPM using the nested Label.appendTo() method.
quantset=<file>: If set, ignore reads with labels not contained in this file; one label per line. The tool creates a HashSet<String> from file contents with automatic inclusion of "UNKNOWN" via quantSet.add("UNKNOWN"). Reads with labels not in this set are marked as invalid using invalidCount++ and excluded from analysis.
swap=f: Swap the order of label 1 and label 2. When true, the parser extracts terms using lp.parseString(terms-(swap ? 1 : 2)) for s1 and lp.parseString(terms-(swap ? 2 : 1)) for s2, effectively reversing the comparison order. Default: false.
delimiter=tab: Compare the last two terms in the header, using this single-character delimiter. Most symbols can be expressed as literals (e.g. 'delimiter=_' for underscore) but you can also spell out some of the problematic ones: space, tab, pound, greaterthan, lessthan, equals, colon, semicolon, bang, and, quote, singlequote, backslash, hat, dollar, dot, pipe, questionmark, star, plus, openparen, closeparen, opensquare, opencurly
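
The delimiter parameter accepts either a literal character or one of the spelled-out names listed above. A minimal sketch of that resolution, assuming a small lookup table, might look like the following. The class DelimiterNames and its resolve() method are hypothetical names used for illustration; this is not the source of Parse.parseSymbolToCharacter(), and only a subset of the names is included.

```java
import java.util.Map;

// Hypothetical helper for illustration; not the BBTools implementation.
class DelimiterNames {

    // A subset of the spelled-out names from the parameter description.
    static final Map<String, Character> NAMES = Map.of(
        "space", ' ',
        "tab", '\t',
        "pound", '#',
        "colon", ':',
        "semicolon", ';',
        "pipe", '|',
        "dot", '.',
        "star", '*'
    );

    // Resolve a delimiter argument to a single character.
    static char resolve(String arg) {
        Character named = NAMES.get(arg);
        if (named != null) {
            return named;
        }
        if (arg.length() == 1) {
            return arg.charAt(0); // a literal such as "_"
        }
        // Assumption: anything else is rejected, since the delimiter is one character.
        throw new IllegalArgumentException("Not a single-character delimiter: " + arg);
    }

    public static void main(String[] args) {
        System.out.println((int) resolve("tab")); // character code of the tab character
        System.out.println(resolve("_"));
        System.out.println(resolve("pipe"));
    }
}
```

Spelled-out names are mainly useful for characters that a shell would otherwise interpret, such as pipe or star.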

Output Terminology

aa: Both labels were equal. Indicates successful matching between the two label positions.
uu: Both labels were unknown. Both positions contained the "unknown" label.
au: Label 1 was assigned, label 2 was unknown. The first label had a value while the second was "unknown".
ua: Label 1 was unknown, label 2 was assigned. The first label was "unknown" while the second had a value.
ab: Both labels were assigned, but not equal. For per-label stats, indicates label 1 was assigned to this, and label 2 was assigned to something else. Represents label mismatches.
ba: In per-label stats, indicates label 2 was assigned to this and label 1 was assigned to something else. The reverse of ab for per-label analysis.
yield: Fraction of reads assigned to the same label. E.g. if aa=10, au=1, ab=2, then yield2 = aa/(aa+au+ab) = 10/13 = 0.77. Measures how often the two labelings agree.
contam: Fraction of reads assigned to a different label, using the other as ground truth. For example, if aa=10, au=1, ab=2, then contam1=ab/(aa+au+ab) = 2/13 = 0.154 = 153,846 PPM. Measures cross-contamination between labels.
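
The yield and contam arithmetic above can be reproduced with a few lines of Java. This is only a worked example using the numbers from the definitions (aa=10, au=1, ab=2); the class name YieldContamExample and the helper fraction() are illustrative and not part of the tool.

```java
import java.util.Locale;

// Illustrative worked example; names are not taken from CompareLabels.java.
class YieldContamExample {

    // Plain ratio of two counts as a double.
    static double fraction(long numerator, long denominator) {
        return (double) numerator / denominator;
    }

    public static void main(String[] args) {
        long aa = 10, au = 1, ab = 2;

        // Denominator from the definitions above: aa + au + ab.
        long denominator = aa + au + ab;

        double yield2 = fraction(aa, denominator);   // 10/13
        double contam1 = fraction(ab, denominator);  // 2/13
        long contamPpm = Math.round(contam1 * 1_000_000);

        // Prints: yield2=0.77 contam1=0.154 ppm=153846
        System.out.println(String.format(Locale.ROOT,
            "yield2=%.2f contam1=%.3f ppm=%d", yield2, contam1, contamPpm));
    }
}
```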

Examples

Basic Demultiplexing Comparison

comparelabels.sh in=demux_reads.fastq out=comparison_stats.txt

Compares labels in the last two tab-delimited fields of read headers, outputting summary statistics.

Custom Delimiter with Per-Label Stats

comparelabels.sh in=reads.fq delimiter=underscore labelstats=per_label.txt

Uses underscore as delimiter instead of tab and generates detailed per-label statistics.

Filtered Analysis with Quantset

comparelabels.sh in=sample.fq quantset=valid_barcodes.txt swap=true

Only analyzes reads with labels present in the quantset file, with swapped label positions.

Algorithm Details

CompareLabels combines header parsing with simple counting statistics to evaluate how consistently different demultiplexing methods assign labels. The subsections below describe each stage.

Header Parsing Implementation

The tool uses the LineParserS1 class, configured with the delimiter character, to tokenize each header. It extracts the last two delimiter-separated terms with the calls lp.parseString(terms-(swap ? 1 : 2)) and lp.parseString(terms-(swap ? 2 : 1)). The Parse.parseSymbolToCharacter() method converts symbolic delimiter names (space, tab, pound, etc.) to their character representations.
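
The index arithmetic for selecting the last two terms can be illustrated without LineParserS1. The sketch below uses String.split as a stand-in; the class LastTwoTerms and its extract() method are hypothetical, and returning null for headers with fewer than two terms is an assumption rather than documented behavior.

```java
import java.util.regex.Pattern;

// Hypothetical stand-in for the LineParserS1-based extraction.
class LastTwoTerms {

    // Returns {s1, s2}, or null when the header has fewer than two terms.
    static String[] extract(String header, char delimiter, boolean swap) {
        // Quote the delimiter so characters like '|' or '.' are treated literally.
        String regex = Pattern.quote(String.valueOf(delimiter));
        String[] parts = header.split(regex, -1);
        int terms = parts.length;
        if (terms < 2) {
            return null; // assumption: nothing to compare
        }
        // Same index expressions as quoted in the text above.
        String s1 = parts[terms - (swap ? 1 : 2)];
        String s2 = parts[terms - (swap ? 2 : 1)];
        return new String[] {s1, s2};
    }

    public static void main(String[] args) {
        String header = "@VP2:12:H7:2:1101:8:2 1:N:0:CAAC\tCAAC\tCAAT";
        String[] normal = extract(header, '\t', false);
        String[] swapped = extract(header, '\t', true);
        System.out.println(normal[0] + " vs " + normal[1]);   // CAAC vs CAAT
        System.out.println(swapped[0] + " vs " + swapped[1]); // CAAT vs CAAC
    }
}
```

With swap=f, label 1 is the second-to-last term and label 2 is the last term; swap=t reverses the two.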

Label Classification Logic

Each read is assigned to exactly one of five mutually exclusive categories:

AA (Both Assigned, Equal): s1.equals(s2) returns true and neither is "UNKNOWN"
UU (Both Unknown): Both s1.equalsIgnoreCase("UNKNOWN") and s2.equalsIgnoreCase("UNKNOWN") return true
AU (Assigned-Unknown): First label passes !unknown1 and second passes unknown2
UA (Unknown-Assigned): First label passes unknown1 and second passes !unknown2
AB (Both Assigned, Different): Both labels assigned but !s1.equals(s2)
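
The five categories above can be expressed as a single decision function. This is a minimal sketch of the described logic, not the tool's source; the class LabelClassifier, the Category enum, and classify() are hypothetical names.

```java
// Hypothetical sketch of the five-way classification described above.
class LabelClassifier {

    enum Category { AA, UU, AU, UA, AB }

    static Category classify(String s1, String s2) {
        // "UNKNOWN" is matched case-insensitively, as in the description.
        boolean unknown1 = s1.equalsIgnoreCase("UNKNOWN");
        boolean unknown2 = s2.equalsIgnoreCase("UNKNOWN");

        if (unknown1 && unknown2) return Category.UU;
        if (unknown1) return Category.UA; // label 1 unknown, label 2 assigned
        if (unknown2) return Category.AU; // label 1 assigned, label 2 unknown

        // Both assigned: equal labels are AA, anything else is AB.
        return s1.equals(s2) ? Category.AA : Category.AB;
    }

    public static void main(String[] args) {
        System.out.println(classify("CAAC", "CAAC"));       // AA
        System.out.println(classify("unknown", "UNKNOWN")); // UU
        System.out.println(classify("CAAC", "unknown"));    // AU
        System.out.println(classify("unknown", "CAAC"));    // UA
        System.out.println(classify("CAAC", "GTTG"));       // AB
    }
}
```

Checking the unknown cases first guarantees that two "unknown" labels are never counted as AA, even though they are equal strings.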

Statistical Calculation Methods

Performance metrics are computed as follows:

Relative Yield Calculation: ryield1 = aaCount * (1f/Tools.max(count2, 1)) and ryield2 = aaCount * (1f/Tools.max(count1, 1))
Contamination Rate: contam1 = abCount * (1f/Tools.max(count1, 1)), multiplied by 1,000,000 to express the rate in PPM
Per-Label Statistics: Uses nested Label class with methods tp() returning aa, fp() returning ba, and fn() returning au

Concurrent Read Processing

The algorithm uses ConcurrentReadInputStream.getReadInputStream() for parallel I/O. Reads are processed in batches delivered as ListNum<Read> objects, with the processRead() method handling each individual read. Running counts are kept in primitive long fields (aaCount, auCount, etc.) rather than by storing read objects.

Label Tracking Data Structures

Per-label statistics use a LinkedHashMap<String, Label>, which preserves the order in which labels were first seen. The getLabel() method implements lazy initialization: if(l==null) { l=new Label(s); map.put(s, l); }. Quantset filtering creates a HashSet<String> that always includes "UNKNOWN", added via quantSet.add("UNKNOWN").
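
The lazy-initialization pattern and the quantset construction described above can be sketched as follows. The class LabelTable, the simplified Label class with a single counter, and makeQuantSet() are hypothetical; the real Label class tracks more fields, and the real tool reads the quantset from a file rather than from a list.

```java
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Set;

// Hypothetical sketch of the per-label map and quantset described above.
class LabelTable {

    // Simplified stand-in for the nested Label class.
    static class Label {
        final String name;
        long total;
        Label(String name) { this.name = name; }
    }

    // LinkedHashMap keeps labels in first-seen order.
    final LinkedHashMap<String, Label> map = new LinkedHashMap<>();

    // Create the Label on first use, then reuse it.
    Label getLabel(String s) {
        Label l = map.get(s);
        if (l == null) {
            l = new Label(s);
            map.put(s, l);
        }
        return l;
    }

    // Build the allowed-label set; "UNKNOWN" is always included.
    static Set<String> makeQuantSet(List<String> lines) {
        Set<String> quantSet = new HashSet<>(lines);
        quantSet.add("UNKNOWN");
        return quantSet;
    }

    public static void main(String[] args) {
        LabelTable table = new LabelTable();
        for (String s : List.of("GTTG", "CAAC", "GTTG")) {
            table.getLabel(s).total++;
        }
        System.out.println(table.map.keySet());            // [GTTG, CAAC]
        System.out.println(table.getLabel("GTTG").total);  // 2

        Set<String> quantSet = makeQuantSet(List.of("CAAC", "GTTG"));
        System.out.println(quantSet.contains("UNKNOWN"));  // true
        System.out.println(quantSet.contains("ACGT"));     // false
    }
}
```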

Memory Usage Characteristics

Basic statistics are kept in a fixed set of counters, so their memory use does not grow with the number of reads. Per-label tracking scales as O(n), where n is the number of unique labels. The LongList termCounts histogram tracks the frequency distribution of delimiter-separated terms. The default JVM memory allocation is 300MB, set through z="-Xmx300m" in the shell script.

Support

For questions and support:

Email: bbushnell@lbl.gov
Documentation: bbmap.org