CountBarcodes

Script: countbarcodes.sh
Package: barcode
Class: CountBarcodes.java

Counts the number of reads with each barcode, calculates Hamming and edit distances to expected barcodes, and flags barcodes that belong to a known valid set. Useful for demultiplexing quality control and barcode analysis in sequencing workflows.

Basic Usage

countbarcodes.sh in=<file> counts=<file>

Input may be stdin or a fasta or fastq file, raw or gzipped. Read names should end in a colon followed by the barcode sequence.

Parameters

Parameters are grouped into input, output, and Java runtime settings.

Input parameters

in=<file>: Input reads, whose names end in a colon then barcode. Accepts fasta or fastq format, raw or gzipped. Can also use 'stdin' for piped input.
counts=<file>: Output file for barcode counts and statistics. Contains tab-delimited columns: code, count, Hamming_dist, edit_dist, valid.
interleaved=auto: If true, forces fastq input to be paired and interleaved. Auto-detection analyzes file structure to determine pairing.
qin=auto: ASCII offset for input quality scores. Options: 33 (Sanger/Illumina 1.8+), 64 (Illumina 1.3-1.7), or auto for automatic detection.
unpigz=t: Use pigz (parallel gzip) for decompression of gzipped input files. Provides faster decompression on multi-core systems.
expected=: Comma-delimited list of expected barcode sequences. Used to calculate Hamming and edit distances. Expected barcodes are automatically added to the valid set.
valid=: Comma-delimited list of valid barcode sequences. Barcodes in this list are marked as 'valid' in the output, separate from distance calculations.
countundefined=t: Count barcodes that contain non-ACGT symbols (N, ambiguous bases). When false, only fully-defined DNA sequences are counted.
printheader=t: Print column header line in output file. Header format: #code count Hamming_dist edit_dist valid
maxrows=-1: Optionally limit the number of barcode rows printed to output. -1 means no limit. Results are sorted by count (highest first) before limiting.

Output parameters

out=<file>: Write processed reads with barcodes to specified file. Use 'out=stdout' to pipe to standard output. Optional parameter for read passthrough.

Java Parameters

-Xmx: Set Java heap memory usage, overriding autodetection. Format: -Xmx20g for 20 GB, -Xmx200m for 200 MB. Maximum typically 85% of physical memory. Default: 200m for this lightweight tool.
-eoom: Exit process if out-of-memory exception occurs. Prevents hanging on memory-limited systems. Requires Java 8u92 or later.
-da: Disable Java assertions for a slight performance improvement. Leave assertions enabled when debugging.

Examples

Basic Barcode Counting

countbarcodes.sh in=reads_with_barcodes.fq counts=barcode_counts.txt

Count all barcodes from reads where read IDs end with ":BARCODE". Output includes counts and distance metrics.

Validation Against Expected Barcodes

countbarcodes.sh in=reads.fq counts=counts.txt expected=ATCG,GCTA,TACG valid=ATCG,GCTA,TACG,NNNN

Count barcodes and calculate distances to expected sequences. Additional valid barcodes (like NNNN) can be specified separately.

Quality Control with Limits

countbarcodes.sh in=reads.fq counts=top_barcodes.txt maxrows=50 countundefined=f printheader=t

Show only top 50 most frequent barcodes, excluding those with undefined bases, with column headers for easy parsing.

Processing Gzipped Input

countbarcodes.sh in=reads.fq.gz counts=counts.txt unpigz=t

Process compressed input using parallel decompression for improved performance on multi-core systems.

Algorithm Details

CountBarcodes implements a hash-based counting algorithm with distance calculation capabilities:

Barcode Extraction and Counting

The tool extracts barcodes from read identifiers using the barcode(true) method, which expects barcodes to be appended after a colon separator. Each unique barcode is stored in a HashMap with its occurrence count, providing O(1) average-case lookup and update performance.
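The extraction and counting step can be sketched in a few lines of Python. This is an illustration only, not the tool's Java code: it assumes the barcode is whatever follows the last colon in the read name, and the names barcode_from_name and count_barcodes, as well as the sample read names, are hypothetical.

```python
from collections import Counter


def barcode_from_name(name):
    """Return the text after the last colon in a read name, or None if there is no colon."""
    _, sep, tail = name.rpartition(":")
    return tail if sep else None


def count_barcodes(read_names):
    """Tally how many read names carry each barcode (hash-based, O(1) average per update)."""
    counts = Counter()
    for name in read_names:
        code = barcode_from_name(name)
        if code:  # skip names with no colon or an empty barcode
            counts[code] += 1
    return counts


# Hypothetical read names in the "ends in a colon then barcode" layout.
names = [
    "read1 1:N:0:ATCGATCG",
    "read2 1:N:0:ATCGATCG",
    "read3 1:N:0:GCTAGCTA",
]
for code, count in count_barcodes(names).most_common():
    print(code, count)
```

Counter is a dictionary subclass, so it plays the same role here that the HashMap plays in the description above.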

Distance Metrics

Two distance metrics are calculated for quality control:

Hamming Distance: Counts substitutions between sequences of equal length. Calculated as the number of positions where corresponding bases differ.
Edit Distance: Uses a BandedAlignerConcrete with band width 21 to calculate optimal alignment distance, accounting for insertions, deletions, and substitutions. More computationally expensive but more accurate for variable-length barcodes.
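To make the two metrics concrete, here is a minimal Python sketch. It is not the tool's implementation: the edit distance below is a plain, unbanded Levenshtein dynamic program rather than a banded aligner, and counting the length difference as mismatches for unequal-length input is an assumption of this sketch. All function names are hypothetical.

```python
def hamming_distance(a, b):
    """Number of positions at which a and b differ.

    Assumption of this sketch: if lengths differ, each unpaired
    trailing position counts as one mismatch.
    """
    mismatches = sum(1 for x, y in zip(a, b) if x != y)
    return mismatches + abs(len(a) - len(b))


def edit_distance(a, b):
    """Levenshtein distance (substitutions, insertions, deletions), unbanded."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion from a
                            curr[j - 1] + 1,      # insertion into a
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def min_distance(code, expected, metric):
    """Smallest distance from code to any expected barcode under the given metric."""
    return min(metric(code, e) for e in expected)


expected = ["ACGT", "TTGG"]
print(min_distance("CGTA", expected, hamming_distance))
print(min_distance("CGTA", expected, edit_distance))
```

A barcode shifted by one base, such as CGTA against ACGT, mismatches at every position under Hamming distance but is only two edits away (one deletion, one insertion), which is why the edit distance is more informative when indels are possible.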

Output Processing

Results are sorted in descending order by count using Collections.reverse() after standard sorting. The output format includes:

code: The barcode sequence
count: Number of reads with this barcode
Hamming_dist: Minimum Hamming distance to expected barcodes
edit_dist: Minimum edit distance to expected barcodes (falls back to Hamming distance on alignment failure)
valid: "valid" if barcode is in the valid list, empty otherwise
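The sorting, row limit, and valid column can be sketched as follows. This is a hedged Python illustration, not the tool's code: format_table and its arguments are hypothetical, distances are passed in precomputed as (Hamming, edit) pairs, and breaking ties alphabetically by barcode is a choice made here for determinism.

```python
HEADER = "#code\tcount\tHamming_dist\tedit_dist\tvalid"


def format_table(counts, distances, expected=(), valid=(), maxrows=-1, printheader=True):
    """Return output lines sorted by count, highest first.

    counts:    dict mapping barcode -> read count
    distances: dict mapping barcode -> (hamming, edit), assumed precomputed
    """
    # Expected barcodes are treated as valid too, as described for expected=.
    valid_set = set(valid) | set(expected)
    # Descending by count; alphabetical tie-break is an assumption of this sketch.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    if maxrows >= 0:  # a negative value means no limit
        ranked = ranked[:maxrows]
    lines = [HEADER] if printheader else []
    for code, count in ranked:
        hamming, edit = distances[code]
        flag = "valid" if code in valid_set else ""
        lines.append(f"{code}\t{count}\t{hamming}\t{edit}\t{flag}")
    return lines


counts = {"ATCG": 5, "GCTA": 9, "TTTT": 1}
distances = {"ATCG": (0, 0), "GCTA": (0, 0), "TTTT": (3, 3)}
for line in format_table(counts, distances, expected=["ATCG", "GCTA"], maxrows=2):
    print(line)
```

Because the limit is applied after sorting, maxrows keeps the most frequent barcodes rather than the first ones encountered.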

Memory Efficiency

The tool stores each barcode string together with its count in a single StringNum object, which keeps per-barcode overhead low. The HashMap resizes automatically as new barcodes are added.

Quality Filtering

When countundefined=false, barcodes containing ambiguous nucleotides are filtered using AminoAcid.isFullyDefined(), so only sequences composed entirely of A, C, G, and T are counted. This excludes barcodes with uncalled bases (N), which usually indicate low-quality base calls.
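The filter amounts to a per-character membership test. The Python sketch below illustrates the idea under stated assumptions: is_fully_defined and keep_barcode are hypothetical names, comparison is case-insensitive, and any character outside A, C, G, T counts as undefined. How the real Java method treats lowercase letters or delimiter characters is not covered here.

```python
DEFINED_BASES = frozenset("ACGT")


def is_fully_defined(code):
    """True if the barcode is non-empty and every symbol is A, C, G, or T."""
    return len(code) > 0 and all(base in DEFINED_BASES for base in code.upper())


def keep_barcode(code, countundefined=True):
    """Mimic the countundefined switch: when False, drop barcodes with undefined symbols."""
    return countundefined or is_fully_defined(code)


codes = ["ATCGATCG", "NNNATCGN", "GCTAGCTA"]
print([c for c in codes if keep_barcode(c, countundefined=False)])
```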

Output Format

The output file contains tab-delimited columns with the following structure:

#code	count	Hamming_dist	edit_dist	valid
ATCGATCG	15234	0	0	valid
GCTAGCTA	8756	1	1	
TACGTACG	4521	0	0	valid
NNNATCGN	892	2	2

Results are sorted by count in descending order. Distance values represent the minimum distance to any expected barcode.
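For downstream analysis, the counts file can be read with a short script. The Python sketch below assumes the five-column, tab-delimited layout shown above, skips lines starting with '#', and tolerates rows where the empty valid column has been dropped. The function name read_counts and the dictionary keys are hypothetical.

```python
import io


def read_counts(handle):
    """Parse a counts file into a list of dicts, one per barcode row."""
    rows = []
    for line in handle:
        line = line.rstrip("\r\n")  # keep trailing tabs, drop only line endings
        if not line or line.startswith("#"):
            continue  # skip blank lines and the header
        fields = line.split("\t")
        fields += [""] * (5 - len(fields))  # pad a missing valid column
        code, count, hamming, edit, valid = fields[:5]
        rows.append({
            "code": code,
            "count": int(count),
            "hamming_dist": int(hamming),
            "edit_dist": int(edit),
            "valid": valid == "valid",
        })
    return rows


# The sample rows from this section, used here as in-memory test input.
sample = (
    "#code\tcount\tHamming_dist\tedit_dist\tvalid\n"
    "ATCGATCG\t15234\t0\t0\tvalid\n"
    "GCTAGCTA\t8756\t1\t1\t\n"
    "TACGTACG\t4521\t0\t0\tvalid\n"
    "NNNATCGN\t892\t2\t2\n"
)
rows = read_counts(io.StringIO(sample))
valid_reads = sum(r["count"] for r in rows if r["valid"])
total_reads = sum(r["count"] for r in rows)
print(f"{valid_reads} of {total_reads} reads carry a valid barcode")
```

With a real file, replace the StringIO object with open("barcode_counts.txt").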

Support

For questions and support:

Email: bbushnell@lbl.gov
Documentation: bbmap.org