MergeBarcodes

Script: mergebarcodes.sh
Package: jgi
Class: MergeBarcodes.java

Concatenates barcode bases and quality scores onto read names for demultiplexing workflows. Barcode reads are loaded into a HashMap keyed by read ID, so each read's barcode is found in expected O(1) time before being merged into the read's name.

Basic Usage

mergebarcodes.sh in=<file> out=<file> barcode=<file>

Input may be stdin or a fasta or fastq file, raw or gzipped. If you pipe via stdin/stdout, please include the file type; e.g. for gzipped fasta input, set in=stdin.fa.gz

Parameters

Parameters are organized by their function in the barcode merging process.

Input Parameters

in=<file>: Input reads. 'in=stdin.fq' will pipe from standard in.
bar=<file>: File containing barcodes. Also accepts 'barcode=<file>' or 'index=<file>'.
int=auto: (interleaved) If true, forces fastq input to be paired and interleaved.
qin=auto: ASCII offset for input quality. May be 33 (Sanger), 64 (Illumina), or auto.

Output Parameters

out=<file>: Write reads with barcode-merged names here. 'out=stdout.fa' will pipe to standard out.
overwrite=t: (ow) Set to false to force the program to abort rather than overwrite an existing file.
ziplevel=2: (zl) Set to 1 (lowest) through 9 (max) to change compression level; lower compression is faster.
qout=auto: ASCII offset for output quality. May be 33 (Sanger), 64 (Illumina), or auto (same as input).

Processing Parameters

addslash=f: Add /1 and /2 to read names for paired-end data identification.
addcolon=f: Add 1: and 2: to read names for paired-end data identification.
rcomp=f: (reversecomplement) Reverse complement all reads before processing.
rcompmate=f: (reversecomplementmate) Reverse complement only mate reads (R2) before processing.
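
As an illustration of the addslash naming convention, the sketch below appends /1 or /2 to a read name. The class and method names are hypothetical, and skipping names that already carry the suffix is an assumption made for the example, not documented behavior of mergebarcodes.sh.

```java
// Illustrative sketch only; PairSuffixSketch and addSlash are hypothetical names.
final class PairSuffixSketch {

    /** Appends "/1" or "/2" to a read name unless it already ends with that suffix. */
    static String addSlash(String name, int pairNum) {
        if (pairNum != 1 && pairNum != 2) {
            throw new IllegalArgumentException("pairNum must be 1 or 2");
        }
        String suffix = "/" + pairNum;
        // Assumption: an existing matching suffix is left alone.
        return name.endsWith(suffix) ? name : name + suffix;
    }

    public static void main(String[] args) {
        System.out.println(addSlash("read1", 1)); // read1/1
        System.out.println(addSlash("read1", 2)); // read1/2
    }
}
```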

Other Parameters

pigz=t: Use pigz to compress. If argument is a number, that will set the number of pigz threads.
unpigz=t: Use pigz to decompress.

Java Parameters

-Xmx: This will set Java's memory usage, overriding autodetection. -Xmx20g will specify 20 gigs of RAM, and -Xmx200m will specify 200 megs. The max is typically 85% of physical memory.
-eoom: This flag will cause the process to exit if an out-of-memory exception occurs. Requires Java 8u92+.
-da: Disable assertions.

Examples

Basic Barcode Merging

mergebarcodes.sh in=reads.fq bar=barcodes.fq out=merged.fq

Merge barcodes from barcodes.fq into the headers of reads.fq, creating merged.fq with barcode sequences and qualities prepended to read names.

Paired-end Data with Compression

mergebarcodes.sh in=reads_R1.fq.gz bar=index.fq.gz out=merged.fq.gz ziplevel=6

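Process gzipped reads together with a gzipped index file, writing gzipped output at compression level 6. Because only one input file is given, the reads are handled as pairs only if reads_R1.fq.gz is interleaved (see int=).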

Reverse Complement Processing

mergebarcodes.sh in=reads.fq bar=barcodes.fq out=merged.fq rcompmate=t

Merge barcodes while reverse complementing mate reads (R2), useful when the sequencing orientation requires adjustment.

Streaming with Quality Conversion

cat reads.fq | mergebarcodes.sh in=stdin.fq bar=barcodes.fq qin=64 qout=33 out=stdout.fq

Stream reads through mergebarcodes while converting quality scores from Illumina (64) to Sanger (33) format.
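
The offset conversion in the example above amounts to shifting each quality character by the difference between the two ASCII offsets. The following is a minimal sketch of that arithmetic, assuming a hypothetical helper named convertOffset (not a BBTools method) and leaving out any range clamping the real tool may apply.

```java
// Illustrative sketch only; QualityOffsetSketch and convertOffset are hypothetical names.
final class QualityOffsetSketch {

    /** Re-encodes a quality string from one ASCII offset (qin) to another (qout). */
    static String convertOffset(String quals, int qin, int qout) {
        StringBuilder sb = new StringBuilder(quals.length());
        for (int i = 0; i < quals.length(); i++) {
            int phred = quals.charAt(i) - qin;   // decode to a numeric Phred score
            if (phred < 0) {
                throw new IllegalArgumentException(
                        "Quality below offset " + qin + " at position " + i);
            }
            sb.append((char) (phred + qout));    // re-encode with the output offset
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // 'h' is ASCII 104, i.e. Phred 40 at offset 64; at offset 33 that is 'I' (ASCII 73).
        System.out.println(convertOffset("hhh", 64, 33)); // III
    }
}
```

A character below the input offset usually means the wrong qin was chosen, which is why the sketch rejects it instead of emitting a negative score.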

Algorithm Details

Implementation Architecture

MergeBarcodes implements a two-phase HashMap-based approach using ConcurrentReadInputStream and ConcurrentReadOutputStream classes from the stream package.

Phase 1 - Barcode Loading (loadBarcodes method)

HashMap Initialization: Creates HashMap<String, Read> with initial capacity 0x10000-1 (65535 entries)
Stream Processing: Uses ConcurrentReadInputStream.getReadInputStream() for parallel barcode file reading
Key Generation: Read IDs are processed using split(" ")[0] to handle space-containing identifiers
Storage Method: map.put(r1.id, r1) stores barcode Read objects with read ID as key
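
Phase 1 can be pictured as the loop below. This is a minimal sketch, not the BBTools source: the Barcode class, keyOf, and this loadBarcodes signature are hypothetical stand-ins for the Read class and the stream-based loader, and trimming the key at the first space is one reading of the key-generation step described above.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch only; these names are hypothetical, not BBTools classes.
final class BarcodeLoadSketch {

    /** Minimal stand-in for a barcode read: name, bases, numeric Phred scores. */
    static final class Barcode {
        final String id;
        final String bases;
        final byte[] quality;

        Barcode(String id, String bases, byte[] quality) {
            this.id = id;
            this.bases = bases;
            this.quality = quality;
        }
    }

    /** Keeps the part of the ID before the first space. */
    static String keyOf(String id) {
        int space = id.indexOf(' ');
        return space < 0 ? id : id.substring(0, space);
    }

    /** Loads every barcode into a map keyed by its trimmed ID. */
    static Map<String, Barcode> loadBarcodes(List<Barcode> barcodes) {
        // Same requested initial capacity as quoted in the text above.
        Map<String, Barcode> map = new HashMap<>(0x10000 - 1);
        for (Barcode b : barcodes) {
            map.put(keyOf(b.id), b);
        }
        return map;
    }

    public static void main(String[] args) {
        List<Barcode> input = Arrays.asList(
                new Barcode("read1 1:N:0", "ACGT", new byte[] {40, 40, 30, 30}),
                new Barcode("read2 1:N:0", "TTGA", new byte[] {35, 35, 35, 35}));
        Map<String, Barcode> map = loadBarcodes(input);
        System.out.println(map.size() + " barcodes loaded; read1 -> " + map.get("read1").bases);
    }
}
```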

Phase 2 - Read Merging (mergeWithMap method)

Lookup Operation: map.remove(key) retrieves the barcode in expected O(1) time and removes its entry from the map
String Concatenation: Uses StringBuilder with prefix.append() for barcode integration
Quality Conversion: Applies (char)(b+33) to turn each quality byte into its Phred+33 ASCII character
Format Structure: [barcode_bases]_[quality_ascii]_[original_read_name]
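
The Phase 2 steps above can be sketched as follows. The class, its nested Barcode type, and the merge and mergedName methods are hypothetical names for illustration. Leaving an unmatched read's name unchanged and keeping the full original name after the prefix are assumptions, not confirmed details of mergeWithMap.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch only; these names are hypothetical, not BBTools classes.
final class MergeNameSketch {

    /** Minimal stand-in for a stored barcode: bases plus numeric Phred scores. */
    static final class Barcode {
        final String bases;
        final byte[] quality;

        Barcode(String bases, byte[] quality) {
            this.bases = bases;
            this.quality = quality;
        }
    }

    /** Builds "[barcode_bases]_[quality_ascii]_[original_read_name]". */
    static String mergedName(String bases, byte[] quality, String readName) {
        StringBuilder prefix = new StringBuilder();
        prefix.append(bases).append('_');
        for (byte b : quality) {
            prefix.append((char) (b + 33)); // Phred score to Phred+33 ASCII
        }
        prefix.append('_').append(readName);
        return prefix.toString();
    }

    /** Looks up and removes the barcode for a read, returning the new name. */
    static String merge(Map<String, Barcode> map, String readName) {
        int space = readName.indexOf(' ');
        String key = space < 0 ? readName : readName.substring(0, space);
        Barcode bar = map.remove(key); // lookup and release in one call
        if (bar == null) {
            return readName;           // assumption: unmatched reads keep their name
        }
        return mergedName(bar.bases, bar.quality, readName);
    }

    public static void main(String[] args) {
        Map<String, Barcode> map = new HashMap<>();
        map.put("read1", new Barcode("ACGT", new byte[] {40, 40, 30, 30}));
        System.out.println(merge(map, "read1 1:N:0")); // ACGT_II??_read1 1:N:0
        System.out.println(map.size());                // 0
    }
}
```

Using remove rather than get means each barcode can be matched only once, which suits input where every read ID appears a single time.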

Memory Management

Buffer Configuration: Shared.capBuffers(4) limits concurrent read buffers
Stream Architecture: ConcurrentReadOutputStream with buffer=4 for write operations
Memory Release: HashMap.remove() drops each barcode entry once it has been matched, so the map shrinks as reads are processed
Compression Support: ReadWrite.USE_PIGZ=true enables pigz compression

Processing Statistics

The implementation tracks and reports specific metrics:

Match Counting: barcodesFound and barcodesNotFound counters
Throughput Calculation: Tools.format("%.2fk reads/sec", rpnano*1000000)
Percentage Reporting: barcodesFound*100.0/readsProcessed for match rates
Processing Time: Timer class tracks elapsed nanoseconds
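
The arithmetic behind these metrics is simple enough to show directly. The sketch below uses String.format as a stand-in for the Tools.format call quoted above; the class and method names are hypothetical, and the variable rpnano is taken to mean reads per nanosecond.

```java
import java.util.Locale;

// Illustrative sketch only; StatsSketch and its methods are hypothetical names.
final class StatsSketch {

    /** Converts a read count and elapsed nanoseconds to thousands of reads per second. */
    static double kReadsPerSecond(long reads, long elapsedNanos) {
        double rpnano = reads / (double) elapsedNanos; // reads per nanosecond
        return rpnano * 1000000;                       // reads per millisecond = k reads/sec
    }

    /** Percentage of processed reads for which a barcode was found. */
    static double percentFound(long barcodesFound, long readsProcessed) {
        return barcodesFound * 100.0 / readsProcessed;
    }

    public static void main(String[] args) {
        long readsProcessed = 2000000L;
        long barcodesFound = 1900000L;
        long elapsedNanos = 4000000000L; // 4 seconds

        System.out.println(String.format(Locale.ROOT, "%.2fk reads/sec",
                kReadsPerSecond(readsProcessed, elapsedNanos)));
        System.out.println(String.format(Locale.ROOT, "%.2f%% barcodes found",
                percentFound(barcodesFound, readsProcessed)));
    }
}
```

With 2,000,000 reads in 4 seconds the throughput works out to 500k reads/sec, and 1,900,000 matches give a 95% match rate.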

Read Processing Features

ID Handling: Automatic space trimming using key.split(" ")[0]
Reverse Complement: Optional r1.reverseComplement() and r2.reverseComplement() calls
Paired-end Support: Processes both r1 and its mate r2 (r1.mate), giving them the same barcode prefix
Format Detection: FileFormat.testInput() for automatic file type identification
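
For readers unfamiliar with the reverse-complement step listed above, here is a minimal sketch of the operation on a plain string. It is not the Read.reverseComplement() implementation: the names are hypothetical, only uppercase A, C, G, and T are complemented, and every other symbol is mapped to N as a simplifying assumption.

```java
// Illustrative sketch only; ReverseComplementSketch is a hypothetical name.
final class ReverseComplementSketch {

    /** Complements one uppercase base; anything else becomes 'N' (assumption). */
    static char complement(char base) {
        switch (base) {
            case 'A': return 'T';
            case 'C': return 'G';
            case 'G': return 'C';
            case 'T': return 'A';
            default:  return 'N';
        }
    }

    /** Reverses the sequence and complements each base. */
    static String reverseComplement(String bases) {
        StringBuilder sb = new StringBuilder(bases.length());
        for (int i = bases.length() - 1; i >= 0; i--) {
            sb.append(complement(bases.charAt(i)));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(reverseComplement("AACG")); // CGTT
    }
}
```

When a read with quality scores is reverse complemented, the quality string is reversed as well so that each score stays with its base.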

Stream Concurrency Model

Input Streams: ConcurrentReadInputStream for parallel reading
Output Streams: ConcurrentReadOutputStream for concurrent writing
List Processing: ListNum<Read> containers for batch read handling
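Buffer Reuse: StringBuilder.setLength(0) resets the name buffer between reads; StringBuilder is not synchronized, so this is buffer reuse rather than a thread-safety mechanism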
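
The batch handling and StringBuilder reuse described in this section can be sketched without the stream classes. In the example below a plain List stands in for a ListNum<Read> batch, SimpleRead is a hypothetical minimal read type, and the prefix for each read is supplied by a map from read name to prefix text. All of these are assumptions for illustration.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch only; these names are hypothetical, not BBTools classes.
final class BatchPrefixSketch {

    /** Minimal read: a mutable name and an optional mate. */
    static final class SimpleRead {
        String id;
        SimpleRead mate;

        SimpleRead(String id) {
            this.id = id;
        }
    }

    /** Prepends each read's prefix to the read and its mate, reusing one buffer. */
    static void prefixBatch(List<SimpleRead> batch, Map<String, String> prefixes) {
        StringBuilder sb = new StringBuilder();
        for (SimpleRead r1 : batch) {
            String prefix = prefixes.get(r1.id);
            if (prefix == null) {
                continue;                 // no barcode for this read
            }
            sb.setLength(0);              // reset the shared buffer
            sb.append(prefix).append('_');
            int prefixLen = sb.length();
            r1.id = sb.append(r1.id).toString();
            if (r1.mate != null) {
                sb.setLength(prefixLen);  // keep the prefix, drop r1's name
                r1.mate.id = sb.append(r1.mate.id).toString();
            }
        }
    }

    public static void main(String[] args) {
        SimpleRead r1 = new SimpleRead("read1");
        r1.mate = new SimpleRead("read1");
        Map<String, String> prefixes = new HashMap<>();
        prefixes.put("read1", "ACGT_II??");
        prefixBatch(Arrays.asList(r1), prefixes);
        System.out.println(r1.id + " | " + r1.mate.id);
    }
}
```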

Support

For questions and support:

Email: bbushnell@lbl.gov
Documentation: bbmap.org