CompareLabels
Compares delimited labels in read headers and counts how many match. The 'unknown' label is treated as a special case. The original goal was to measure the differences between demultiplexing methods. Labels can be added with the rename.sh suffix flag, with the novademux.sh rename and nosplit flags, or with seal.sh using rename, addcount=f, and tophitonly. The assumed header format is: @VP2:12:H7:2:1101:8:2 1:N:0:CAAC (tab) CAAC (tab) CAAC, in which case the last two labels (CAAC and CAAC) would be compared and found equal.
Basic Usage
comparelabels.sh in=<input file> out=<output file>
Input may be fasta or fastq, compressed or uncompressed. The tool processes read headers to extract and compare labels separated by the specified delimiter.
Parameters
Parameters control input/output behavior and label comparison settings. The tool extracts the last two delimiter-separated terms from read headers for comparison.
Standard parameters
- in=<file>
- Primary input, or read 1 input. Can be fasta or fastq format, compressed or uncompressed.
- out=stdout
- Print the results to this destination. Default is stdout but a file may be specified. Output contains summary statistics of label comparisons.
- labelstats=
- Optional destination for per-label stats. When specified, the tool creates a LinkedHashMap<String, Label> to track individual label performance. Output includes counts (total, count1, count2), component statistics (aa, au, ua, ab, ba), yield calculations, and contamination rates in PPM, written by the nested Label.appendTo() method.
- quantset=<file>
- If set, ignore reads with labels not contained in this file; one label per line. The tool creates a HashSet<String> from the file contents and automatically includes "UNKNOWN" via quantSet.add("UNKNOWN"). Reads with labels not in this set are counted as invalid (invalidCount++) and excluded from analysis.
- swap=f
- Swap the order of label 1 and label 2. The parser extracts terms using lp.parseString(terms-(swap ? 1 : 2)) for s1 and lp.parseString(terms-(swap ? 2 : 1)) for s2, so setting swap to true reverses the comparison order. Default: false.
- delimiter=tab
- Compare the last two terms in the header, using this single-character delimiter. Most symbols can be expressed as literals (e.g. 'delimiter=_' for underscore) but you can also spell out some of the problematic ones: space, tab, pound, greaterthan, lessthan, equals, colon, semicolon, bang, and, quote, singlequote, backslash, hat, dollar, dot, pipe, questionmark, star, plus, openparen, closeparen, opensquare, opencurly
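The spelled-out delimiter names can be pictured as a simple lookup table. The following is a minimal Python sketch, not the tool's Java code: the names are taken from the list above, but the characters they map to are assumptions based on their usual meaning, and SYMBOL_NAMES and resolve_delimiter are hypothetical names introduced for illustration.

```python
# Hypothetical lookup table. The names come from the delimiter help text;
# the characters they map to are assumed from their usual meaning.
SYMBOL_NAMES = {
    "space": " ", "tab": "\t", "pound": "#", "greaterthan": ">",
    "lessthan": "<", "equals": "=", "colon": ":", "semicolon": ";",
    "bang": "!", "and": "&", "quote": '"', "singlequote": "'",
    "backslash": "\\", "hat": "^", "dollar": "$", "dot": ".",
    "pipe": "|", "questionmark": "?", "star": "*", "plus": "+",
    "openparen": "(", "closeparen": ")", "opensquare": "[", "opencurly": "{",
}

def resolve_delimiter(value):
    """Return the single delimiter character for a delimiter=... value."""
    if value in SYMBOL_NAMES:
        return SYMBOL_NAMES[value]      # spelled-out name
    if len(value) == 1:
        return value                    # literal character such as "_"
    raise ValueError(f"not a single character or known name: {value!r}")

print(repr(resolve_delimiter("tab")))   # '\t'
print(repr(resolve_delimiter("_")))     # '_'
```

Spelled-out names mainly help with characters that a shell would otherwise interpret, such as the pipe or the dollar sign.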
Output Terminology
- aa
- Both labels were equal. Indicates successful matching between the two label positions.
- uu
- Both labels were unknown. Both positions contained the "unknown" label.
- au
- Label 1 was assigned, label 2 was unknown. The first label had a value while the second was "unknown".
- ua
- Label 1 was unknown, label 2 was assigned. The first label was "unknown" while the second had a value.
- ab
- Both labels were assigned, but not equal. For per-label stats, indicates label 1 was assigned to this, and label 2 was assigned to something else. Represents label mismatches.
- ba
- In per-label stats, indicates label 2 was assigned to this and label 1 was assigned to something else. The reverse of ab for per-label analysis.
- yield
- Fraction of reads assigned to the same label. E.g. if aa=10, au=1, ab=2, then yield2 = aa/(aa+au+ab) = 10/13 = 0.77. Measures successful labeling consistency.
- contam
- Fraction of reads assigned to a different label, using the other as ground truth. For example, if aa=10, au=1, ab=2, then contam1=ab/(aa+au+ab) = 2/13 = 0.154 = 153,846 PPM. Measures cross-contamination between labels.
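The worked numbers in the yield and contam entries can be checked with a few lines of arithmetic. This is a minimal Python sketch that assumes the denominator is the number of reads where label 1 was assigned (aa+au+ab), as in the examples above; yield_and_contam is a hypothetical helper, not part of the tool.

```python
def yield_and_contam(aa, au, ab):
    """Return (yield2, contam1 in PPM) for one set of counts."""
    count1 = aa + au + ab            # reads where label 1 was assigned
    denom = max(count1, 1)           # avoid dividing by zero on empty input
    yield2 = aa / denom
    contam1_ppm = ab / denom * 1_000_000
    return yield2, contam1_ppm

y, c = yield_and_contam(aa=10, au=1, ab=2)
print(f"yield2={y:.2f} contam1={c:.0f} PPM")   # yield2=0.77 contam1=153846 PPM
```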
Examples
Basic Demultiplexing Comparison
comparelabels.sh in=demux_reads.fastq out=comparison_stats.txt
Compares labels in the last two tab-delimited fields of read headers, outputting summary statistics.
Custom Delimiter with Per-Label Stats
comparelabels.sh in=reads.fq delimiter=_ labelstats=per_label.txt
Uses underscore as delimiter instead of tab and generates detailed per-label statistics.
Filtered Analysis with Quantset
comparelabels.sh in=sample.fq quantset=valid_barcodes.txt swap=true
Only analyzes reads with labels present in the quantset file, with swapped label positions.
Algorithm Details
CompareLabels implements string parsing and statistical comparison operations for evaluating label assignment accuracy across demultiplexing methods.
Header Parsing Implementation
The tool uses the LineParserS1 class with a configurable delimiter character for header tokenization. The parser extracts the last two delimiter-separated terms with lp.parseString(terms-(swap ? 1 : 2)) and lp.parseString(terms-(swap ? 2 : 1)). The Parse.parseSymbolToCharacter() method converts symbolic delimiter names (space, tab, pound, etc.) to their character representations.
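The index arithmetic in the two parseString calls can be illustrated outside the tool. The following minimal Python sketch mirrors the terms-(swap ? 1 : 2) and terms-(swap ? 2 : 1) expressions using a plain string split; last_two_terms is a hypothetical name, and raising an error for headers with fewer than two terms is an assumption, since the text does not say how the tool handles them.

```python
def last_two_terms(header, delimiter="\t", swap=False):
    """Return (s1, s2): the last two delimiter-separated terms of a header."""
    terms = header.split(delimiter)
    if len(terms) < 2:
        # Assumed behavior; the real tool may handle short headers differently.
        raise ValueError("header needs at least two delimited terms")
    s1 = terms[len(terms) - (1 if swap else 2)]
    s2 = terms[len(terms) - (2 if swap else 1)]
    return s1, s2

header = "@VP2:12:H7:2:1101:8:2 1:N:0:CAAC\tCAAC\tCAAT"
print(last_two_terms(header))              # ('CAAC', 'CAAT')
print(last_two_terms(header, swap=True))   # ('CAAT', 'CAAC')
```

With swap off, s1 is the second-to-last term and s2 is the last term; swap reverses the two.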
Label Classification Logic
The algorithm implements boolean comparison logic to categorize each read into five mutually exclusive categories:
- AA (Both Assigned, Equal): s1.equals(s2) returns true and neither is "UNKNOWN"
- UU (Both Unknown): Both s1.equalsIgnoreCase("UNKNOWN") and s2.equalsIgnoreCase("UNKNOWN") return true
- AU (Assigned-Unknown): The first label is assigned (!unknown1) and the second is unknown (unknown2)
- UA (Unknown-Assigned): The first label is unknown (unknown1) and the second is assigned (!unknown2)
- AB (Both Assigned, Different): Both labels are assigned but !s1.equals(s2)
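The five categories above can be expressed as one small decision function. This is a minimal Python sketch of the described logic rather than the tool's Java implementation; classify is a hypothetical name, and the unknown check is case-insensitive to match the equalsIgnoreCase calls quoted above.

```python
def classify(s1, s2):
    """Return one of 'aa', 'uu', 'au', 'ua', 'ab' for a pair of labels."""
    unknown1 = s1.upper() == "UNKNOWN"   # case-insensitive, like equalsIgnoreCase
    unknown2 = s2.upper() == "UNKNOWN"
    if unknown1 and unknown2:
        return "uu"
    if unknown1:
        return "ua"                      # label 1 unknown, label 2 assigned
    if unknown2:
        return "au"                      # label 1 assigned, label 2 unknown
    return "aa" if s1 == s2 else "ab"    # both assigned: equal or different

print(classify("CAAC", "CAAC"))      # aa
print(classify("CAAC", "unknown"))   # au
print(classify("CAAC", "GGTT"))      # ab
```

Checking the unknown flags first ensures that two unknown labels are reported as uu rather than as an equal pair.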
Statistical Calculation Methods
The tool implements specific mathematical formulations for performance metrics:
- Relative Yield Calculation: ryield1 = aaCount * (1f/Tools.max(count2, 1)) and ryield2 = aaCount * (1f/Tools.max(count1, 1))
- Contamination Rate: contam1 = abCount * (1f/Tools.max(count1, 1)), multiplied by 1,000,000 for PPM conversion
- Per-Label Statistics: Uses the nested Label class with methods tp() returning aa, fp() returning ba, and fn() returning au
Concurrent Read Processing
The algorithm uses ConcurrentReadInputStream.getReadInputStream() for parallel I/O operations. Read processing occurs in batches through ListNum<Read> objects, with the processRead() method handling individual read analysis. Running counts are maintained in primitive long fields (aaCount, auCount, etc.) rather than storing read objects.
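The batch-and-count pattern can be sketched in a few lines. The Python below is single-threaded and only illustrates the idea of consuming input in batches while keeping nothing but counters; it does not reproduce the parallel I/O of the Java classes named above. The names batches and count_categories, the batch size of 200, and the classify callback are all assumptions made for the example.

```python
from collections import Counter
from itertools import islice

def batches(items, size):
    """Yield successive lists of at most `size` items from any iterable."""
    it = iter(items)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch

def count_categories(label_pairs, classify, batch_size=200):
    """Tally categories batch by batch, keeping only integer counters."""
    counts = Counter()
    for batch in batches(label_pairs, batch_size):
        for s1, s2 in batch:
            counts[classify(s1, s2)] += 1   # the pair itself is not stored
    return counts

pairs = [("CAAC", "CAAC"), ("CAAC", "GGTT"), ("GGTT", "GGTT")]
exact = lambda a, b: "aa" if a == b else "ab"   # simplified classifier
print(dict(count_categories(pairs, exact, batch_size=2)))   # {'aa': 2, 'ab': 1}
```

Because each batch is discarded after it is tallied, memory use depends on the batch size and the number of categories, not on the total number of reads.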
Label Tracking Data Structures
Per-label statistics use LinkedHashMap<String, Label> for ordered key preservation. The getLabel() method implements lazy initialization: if(l==null) { l=new Label(s); map.put(s, l); }. Quantset filtering creates a HashSet<String> with automatic "UNKNOWN" inclusion via quantSet.add("UNKNOWN").
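Both structures have close Python analogues, which makes the pattern easy to try out. In this minimal sketch a regular dict (insertion-ordered in Python 3.7+) stands in for the LinkedHashMap and a set stands in for the HashSet. The Label class here is a hypothetical stand-in with only a few counters, and stripping whitespace and skipping blank lines in the quantset file are assumptions about input handling.

```python
import io

class Label:
    """Hypothetical per-label counters, named after the output terminology."""
    def __init__(self, name):
        self.name = name
        self.aa = self.au = self.ua = self.ab = self.ba = 0

def get_label(labels, name):
    """Return the Label for `name`, creating it on first use."""
    label = labels.get(name)
    if label is None:
        label = Label(name)
        labels[name] = label
    return label

def load_quantset(lines):
    """Build the allowed-label set from one label per line, plus UNKNOWN."""
    allowed = {line.strip() for line in lines if line.strip()}
    allowed.add("UNKNOWN")
    return allowed

labels = {}
get_label(labels, "CAAC").aa += 1
get_label(labels, "GGTT").ab += 1
get_label(labels, "CAAC").aa += 1
print(list(labels), labels["CAAC"].aa)           # ['CAAC', 'GGTT'] 2

allowed = load_quantset(io.StringIO("CAAC\nGGTT\n"))
print("ACGT" in allowed, "UNKNOWN" in allowed)   # False True
```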
Memory Usage Characteristics
The tool keeps only running counters for the basic statistics, so memory use for them is O(1) regardless of the number of reads. Per-label tracking scales as O(n), where n is the number of unique labels. The LongList termCounts histogram records the frequency distribution of delimiter-separated terms. The default JVM memory allocation is 300 MB, set through z="-Xmx300m" in the shell script.
Support
For questions and support:
- Email: bbushnell@lbl.gov
- Documentation: bbmap.org