CountBarcodes
Counts the number of reads with each barcode, calculates distance metrics to expected barcodes, and validates against known barcode sets. Essential for demultiplexing quality control and barcode analysis in sequencing workflows.
Basic Usage
countbarcodes.sh in=<file> counts=<file>
Input may be stdin or a fasta or fastq file, raw or gzipped. Read names should end in a colon followed by the barcode sequence.
Parameters
Parameters are grouped into input, output, and Java runtime settings.
Input parameters
- in=<file>
- Input reads whose names end in a colon followed by the barcode. Accepts fasta or fastq format, raw or gzipped. Use 'stdin' for piped input.
- counts=<file>
- Output file for barcode counts and statistics. Contains tab-delimited columns: code, count, Hamming_dist, edit_dist, valid.
- interleaved=auto
- If true, forces fastq input to be paired and interleaved. Auto-detection analyzes file structure to determine pairing.
- qin=auto
- ASCII offset for input quality scores. Options: 33 (Sanger/Illumina 1.8+), 64 (Illumina 1.3-1.7), or auto for automatic detection.
- unpigz=t
- Use pigz (parallel gzip) for decompression of gzipped input files. Provides faster decompression on multi-core systems.
- expected=
- Comma-delimited list of expected barcode sequences. Used to calculate Hamming and edit distances. Expected barcodes are automatically added to the valid set.
- valid=
- Comma-delimited list of valid barcode sequences. Barcodes in this list are marked as 'valid' in the output, separate from distance calculations.
- countundefined=t
- Count barcodes that contain non-ACGT symbols (N, ambiguous bases). When false, only fully-defined DNA sequences are counted.
- printheader=t
- Print column header line in output file. Header format: #code count Hamming_dist edit_dist valid
- maxrows=-1
- Optionally limit the number of barcode rows printed to output. -1 means no limit. Results are sorted by count (highest first) before limiting.
Output parameters
- out=<file>
- Write processed reads with barcodes to specified file. Use 'out=stdout' to pipe to standard output. Optional parameter for read passthrough.
Java Parameters
- -Xmx
- Set Java heap memory usage, overriding autodetection. Format: -Xmx20g for 20 GB, -Xmx200m for 200 MB. Maximum typically 85% of physical memory. Default: 200m for this lightweight tool.
- -eoom
- Exit process if out-of-memory exception occurs. Prevents hanging on memory-limited systems. Requires Java 8u92 or later.
- -da
- Disable Java assertions for a slight performance improvement. Not recommended when debugging.
Examples
Basic Barcode Counting
countbarcodes.sh in=reads_with_barcodes.fq counts=barcode_counts.txt
Count all barcodes from reads where read IDs end with ":BARCODE". Output includes counts and distance metrics.
Validation Against Expected Barcodes
countbarcodes.sh in=reads.fq counts=counts.txt expected=ATCG,GCTA,TACG valid=ATCG,GCTA,TACG,NNNN
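Count barcodes and calculate distances to the expected sequences. Because expected barcodes are added to the valid set automatically, valid= only needs to list additional barcodes (such as NNNN); repeating the expected ones, as here, is redundant.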
Quality Control with Limits
countbarcodes.sh in=reads.fq counts=top_barcodes.txt maxrows=50 countundefined=f printheader=t
Write only the 50 most frequent barcodes, excluding those with undefined bases. printheader=t is the default and is spelled out here only to make the header line explicit.
Processing Gzipped Input
countbarcodes.sh in=reads.fq.gz counts=counts.txt unpigz=t
Process compressed input using parallel decompression for improved performance on multi-core systems.
Algorithm Details
CountBarcodes implements a hash-based counting algorithm with distance calculation capabilities:
Barcode Extraction and Counting
The tool extracts barcodes from read identifiers using the barcode(true) method, which expects barcodes to be appended after a colon separator. Each unique barcode is stored in a HashMap with its occurrence count, providing O(1) average-case lookup and update performance.
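As a rough illustration of the extract-and-count step, the Python sketch below takes everything after the last colon of each read name and tallies it in a hash map. It is not the tool's Java code: the behavior of barcode(true) beyond the colon convention is not described here, so the last-colon rule, the function names extract_barcode and count_barcodes, and the sample read names are all assumptions.

```python
from collections import Counter


def extract_barcode(read_name):
    """Return the text after the last colon, or None if there is no colon
    or nothing follows it (assumed rule, not the tool's documented behavior)."""
    _, sep, barcode = read_name.rpartition(":")
    return barcode if sep and barcode else None


def count_barcodes(read_names):
    """Tally barcodes in a hash map (average O(1) lookup and update per read)."""
    counts = Counter()
    for name in read_names:
        barcode = extract_barcode(name)
        if barcode is not None:
            counts[barcode] += 1
    return counts


# Hypothetical read names in the "...:BARCODE" form
names = [
    "read1 1:N:0:ATCGATCG",
    "read2 1:N:0:ATCGATCG",
    "read3 1:N:0:GCTAGCTA",
]

for barcode, n in count_barcodes(names).most_common():
    print(barcode, n, sep="\t")
```

Memory in this sketch grows with the number of distinct barcodes rather than the number of reads, which is the main appeal of counting in a hash map.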
Distance Metrics
Two distance metrics are calculated for quality control:
- Hamming Distance: Counts substitutions between sequences of equal length. Calculated as the number of positions where corresponding bases differ.
- Edit Distance: Uses a BandedAlignerConcrete with band width 21 to calculate optimal alignment distance, accounting for insertions, deletions, and substitutions. More computationally expensive but more accurate for variable-length barcodes.
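The two metrics can be sketched in Python as follows. Note the substitution: the edit distance here is a plain, unbanded Levenshtein dynamic program, not the BandedAlignerConcrete used by the tool, so it illustrates the metric rather than the implementation. Treating a length difference as extra Hamming mismatches is also an assumption, since Hamming distance is only defined above for equal-length sequences. The helper min_distance is a hypothetical name.

```python
def hamming_distance(a, b):
    """Count positions where the bases differ.
    Any length difference is added as mismatches (assumption)."""
    mismatches = sum(1 for x, y in zip(a, b) if x != y)
    return mismatches + abs(len(a) - len(b))


def edit_distance(a, b):
    """Unbanded Levenshtein distance, keeping two rows of the DP table."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(
                prev[j] + 1,               # delete a base from a
                cur[j - 1] + 1,            # insert a base into a
                prev[j - 1] + (ca != cb),  # match or substitution
            ))
        prev = cur
    return prev[-1]


def min_distance(barcode, expected, dist):
    """Smallest distance to any expected barcode, or None if the list is empty."""
    return min((dist(barcode, e) for e in expected), default=None)


expected = ["ATCGATCG", "GCTAGCTA"]
observed = "TCGATCGA"  # ATCGATCG shifted left by one base

print("Hamming:", min_distance(observed, expected, hamming_distance))
print("Edit:   ", min_distance(observed, expected, edit_distance))
```

A one-base shift like the one in observed is where the two metrics diverge most: every position mismatches the original barcode, but a single deletion plus a single insertion restores it.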
Output Processing
Results are sorted in descending order by count using Collections.reverse() after standard sorting. The output format includes:
- code: The barcode sequence
- count: Number of reads with this barcode
- Hamming_dist: Minimum Hamming distance to expected barcodes
- edit_dist: Minimum edit distance to expected barcodes (falls back to Hamming distance on alignment failure)
- valid: "valid" if barcode is in the valid list, empty otherwise
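The ordering and validity logic can be sketched in Python. For brevity this sketch leaves out the two distance columns and emits only code, count, and valid. The function name build_rows and the tie-breaking rule (alphabetical among equal counts) are assumptions; the order of ties in the tool's output is not specified here.

```python
def build_rows(counts, expected=(), valid=(), maxrows=-1):
    """Return (code, count, valid_flag) tuples, highest count first."""
    # Expected barcodes are automatically part of the valid set.
    valid_set = set(valid) | set(expected)

    # Sort by descending count; break ties by barcode (assumption).
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))

    # maxrows=-1 means no limit; truncation happens after sorting.
    if maxrows >= 0:
        ordered = ordered[:maxrows]

    return [(code, n, "valid" if code in valid_set else "")
            for code, n in ordered]


counts = {"ATCG": 5, "GCTA": 9, "NNNN": 2, "TTTT": 1}
rows = build_rows(counts, expected=["ATCG"], valid=["NNNN"], maxrows=3)

print("#code", "count", "valid", sep="\t")
for row in rows:
    print(*row, sep="\t")
```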
Memory Efficiency
The tool uses StringNum objects to minimize memory overhead, storing each barcode string and its count in a single object. The count table therefore grows with the number of distinct barcodes rather than the number of reads, and the HashMap resizes automatically as new barcodes are added.
Quality Filtering
When countundefined=false, barcodes containing ambiguous nucleotides are filtered using AminoAcid.isFullyDefined(), ensuring only A, C, G, T sequences are counted. This excludes barcodes with uncalled or ambiguous bases, which typically come from low-quality base calls.
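A minimal Python sketch of this filter is shown below. It is a stand-in for AminoAcid.isFullyDefined(), not a port of it: how the Java method treats lowercase bases or empty strings is not described here, so this version accepts only non-empty, uppercase A/C/G/T strings as an assumption. It also filters an already-built count table, whereas the tool may skip such barcodes before counting. The names is_fully_defined and drop_undefined are hypothetical.

```python
def is_fully_defined(barcode):
    """True if the barcode is non-empty and uses only A, C, G, T (uppercase; assumption)."""
    return bool(barcode) and set(barcode) <= set("ACGT")


def drop_undefined(counts, countundefined=True):
    """Mimic countundefined=f by removing barcodes with non-ACGT symbols."""
    if countundefined:
        return dict(counts)
    return {code: n for code, n in counts.items() if is_fully_defined(code)}


counts = {"ATCGATCG": 15234, "NNNATCGN": 892, "GCTAGCTA": 8756}
print(drop_undefined(counts, countundefined=False))
```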
Output Format
The output file contains tab-delimited columns with the following structure:
#code count Hamming_dist edit_dist valid
ATCGATCG 15234 0 0 valid
GCTAGCTA 8756 1 1
TACGTACG 4521 0 0 valid
NNNATCGN 892 2 2
Results are sorted by count in descending order. Distance values represent the minimum distance to any expected barcode.
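For downstream QC, the counts file can be read back with a few lines of Python. The sketch below assumes the five tab-delimited columns described above, skips the '#' header line, and tolerates a missing or empty trailing valid column. Treating empty distance fields as None is an assumption about what the file contains when no expected barcodes are supplied. The function name parse_counts is hypothetical.

```python
import io


def parse_counts(handle):
    """Parse a CountBarcodes-style counts file into a list of dicts."""
    rows = []
    for line in handle:
        line = line.rstrip("\r\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and the header
        fields = line.split("\t")
        fields += [""] * (5 - len(fields))  # pad a missing 'valid' column
        code, count, hdist, edist, valid = fields[:5]
        rows.append({
            "code": code,
            "count": int(count),
            "hamming_dist": int(hdist) if hdist else None,
            "edit_dist": int(edist) if edist else None,
            "valid": valid == "valid",
        })
    return rows


# In practice, open the file written by counts=<file>; a string stands in here.
sample = io.StringIO(
    "#code\tcount\tHamming_dist\tedit_dist\tvalid\n"
    "ATCGATCG\t15234\t0\t0\tvalid\n"
    "GCTAGCTA\t8756\t1\t1\t\n"
)
rows = parse_counts(sample)

valid_reads = sum(r["count"] for r in rows if r["valid"])
total_reads = sum(r["count"] for r in rows)
print(f"{valid_reads / total_reads:.1%} of reads carry a valid barcode")
```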
Support
For questions and support:
- Email: bbushnell@lbl.gov
- Documentation: bbmap.org