MergeBarcodes

Script: mergebarcodes.sh Package: jgi Class: MergeBarcodes.java

Concatenates barcodes and quality onto read names for demultiplexing workflows. Barcodes are held in a HashMap, so each lookup takes expected O(1) time when barcode sequences and quality scores are merged into read identifiers.

Basic Usage

mergebarcodes.sh in=<file> out=<file> barcode=<file>

Input may be stdin or a fasta or fastq file, raw or gzipped. If you pipe via stdin/stdout, please include the file type; e.g. for gzipped fasta input, set in=stdin.fa.gz.

Parameters

Parameters are organized by their function in the barcode merging process.

Input Parameters

in=<file>
Input reads. 'in=stdin.fq' will pipe from standard in.
bar=<file>
File containing barcodes. Also accepts 'barcode=<file>' or 'index=<file>'.
int=auto
(interleaved) If true, forces fastq input to be paired and interleaved.
qin=auto
ASCII offset for input quality. May be 33 (Sanger), 64 (Illumina), or auto.

Output Parameters

out=<file>
Write muxed sequences here. 'out=stdout.fa' will pipe to standard out.
overwrite=t
(ow) Set to false to force the program to abort rather than overwrite an existing file.
ziplevel=2
(zl) Set to 1 (lowest) through 9 (max) to change compression level; lower compression is faster.
qout=auto
ASCII offset for output quality. May be 33 (Sanger), 64 (Illumina), or auto (same as input).

Processing Parameters

addslash=f
Add /1 and /2 to read names for paired-end data identification.
addcolon=f
Add 1: and 2: to read names for paired-end data identification.
rcomp=f
(reversecomplement) Reverse complement all reads before processing.
rcompmate=f
(reversecomplementmate) Reverse complement only mate reads (R2) before processing.
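
As a rough illustration of what addslash-style naming looks like on interleaved paired input, the following sketch appends /1 and /2 to alternating records with awk. The helper name add_slash is hypothetical, and the exact placement of the suffix (for example, whether a space precedes it) is an assumption rather than documented behavior of mergebarcodes.sh.

```shell
# Hypothetical helper: reads interleaved FASTQ on stdin, writes it to stdout
# with /1 appended to the first header of each pair and /2 to the second.
# Assumes strict 4-line FASTQ records with mates stored back to back.
add_slash() {
  awk '
    NR % 4 == 1 {
      # Records 0, 2, 4, ... are mate 1; records 1, 3, 5, ... are mate 2.
      mate = (int((NR - 1) / 4) % 2) + 1
      print $0 "/" mate
      next
    }
    { print }
  '
}

# Usage: one interleaved pair in, renamed pair out.
printf '@p1\nACGT\n+\nIIII\n@p1\nTTTT\n+\nIIII\n' | add_slash
```

An addcolon-style variant would presumably differ only in the suffix that is appended.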

Other Parameters

pigz=t
Use pigz to compress. If argument is a number, that will set the number of pigz threads.
unpigz=t
Use pigz to decompress.

Java Parameters

-Xmx
This will set Java's memory usage, overriding autodetection. -Xmx20g will specify 20 gigs of RAM, and -Xmx200m will specify 200 megs. The max is typically 85% of physical memory.
-eoom
This flag will cause the process to exit if an out-of-memory exception occurs. Requires Java 8u92+.
-da
Disable assertions.

Examples

Basic Barcode Merging

mergebarcodes.sh in=reads.fq bar=barcodes.fq out=merged.fq

Merge barcodes from barcodes.fq into the headers of reads.fq, creating merged.fq with barcode sequences and qualities prepended to read names.

Paired-end Data with Compression

mergebarcodes.sh in=reads_R1.fq.gz bar=index.fq.gz out=merged.fq.gz ziplevel=6

Process gzipped reads (here the R1 file of a paired-end run) with gzipped index sequences, writing gzipped output at ziplevel=6 rather than the default of 2.

Reverse Complement Processing

mergebarcodes.sh in=reads.fq bar=barcodes.fq out=merged.fq rcompmate=t

Merge barcodes while reverse complementing mate reads (R2), useful when the sequencing orientation requires adjustment.
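
For readers unsure what reverse complementing does to a read, here is a minimal sketch using standard tools instead of mergebarcodes.sh itself. The helper name revcomp is hypothetical, and the sketch assumes plain A/C/G/T/N bases and the availability of rev (from util-linux).

```shell
# Hypothetical helper: print the reverse complement of a base string.
# rev reverses the order; tr swaps each base for its complement.
# Characters outside ACGTacgt (such as N) pass through unchanged.
revcomp() {
  printf '%s\n' "$1" | rev | tr 'ACGTacgt' 'TGCAtgca'
}

revcomp GATTACA   # prints TGTAATC
```

When a FASTQ record is reverse complemented, its quality string must be reversed as well (but not complemented) so that each score stays aligned with its base.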

Streaming with Quality Conversion

cat reads.fq | mergebarcodes.sh in=stdin.fq bar=barcodes.fq qin=64 qout=33 out=stdout.fq

Stream reads through mergebarcodes while converting quality scores from ASCII offset 64 (Illumina) to ASCII offset 33 (Sanger).
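
The offset change requested by qin=64 qout=33 amounts to subtracting 31 from the ASCII code of every quality character. The sketch below shows that arithmetic on a single quality string using tr; it is not how mergebarcodes.sh implements the conversion, the helper name phred64_to_33 is hypothetical, and any clamping or validation the real tool may perform is not modeled.

```shell
# Hypothetical helper: shift one quality string from offset 64 to offset 33.
# ASCII 64..126 ('@' through '~') is mapped onto ASCII 33..95 ('!' through '_'),
# which lowers every character code by 31.
phred64_to_33() {
  printf '%s\n' "$1" | LC_ALL=C tr '@-~' '!-_'
}

phred64_to_33 'hhhh'   # prints IIII (quality 40 in both encodings)
```

In a full FASTQ file only the fourth line of each record should be shifted, since headers and bases must stay untouched.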

Algorithm Details

Implementation Architecture

MergeBarcodes implements a two-phase HashMap-based approach using ConcurrentReadInputStream and ConcurrentReadOutputStream classes from the stream package.
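
The two-phase idea can be sketched outside Java with awk, whose associative arrays stand in for the HashMap. This is an illustration under stated assumptions, not the MergeBarcodes code: the helper name merge_barcodes is hypothetical, reads and barcodes are assumed to share identical header lines, the merged name layout (barcode, quality, then the original name, joined by underscores) is assumed, and reads without a matching barcode are simply passed through unchanged.

```shell
# Hypothetical helper: merge_barcodes BARCODES.fq READS.fq
# Assumes strict 4-line FASTQ records and a non-empty barcode file.
merge_barcodes() {
  awk '
    # Phase 1: load the first file (barcodes) into hash maps keyed by header.
    NR == FNR && FNR % 4 == 1 { key = $0 }
    NR == FNR && FNR % 4 == 2 { bases[key] = $0 }
    NR == FNR && FNR % 4 == 0 { quals[key] = $0 }
    NR == FNR { next }

    # Phase 2: stream the second file (reads) and look up each header.
    FNR % 4 == 1 && ($0 in bases) {
      # Assumed layout: @BARCODE_QUALITY_originalname
      print "@" bases[$0] "_" quals[$0] "_" substr($0, 2)
      next
    }
    { print }
  ' "$1" "$2"
}

# Usage with two tiny example files: r1 has a barcode, r3 does not.
tmp=$(mktemp -d)
printf '@r1\nACGT\n+\nIIII\n@r2\nTTGA\n+\nFFFF\n' > "$tmp/barcodes.fq"
printf '@r1\nGATTACA\n+\nIIIIIII\n@r3\nCCCC\n+\n####\n' > "$tmp/reads.fq"
merged=$(merge_barcodes "$tmp/barcodes.fq" "$tmp/reads.fq")
printf '%s\n' "$merged"
rm -rf "$tmp"
```

Loading the barcodes first means the reads can be streamed once, with memory use driven by the size of the barcode file rather than the read file.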

Phase 1 - Barcode Loading (loadBarcodes method)

Phase 2 - Read Merging (mergeWithMap method)

Memory Management

Processing Statistics

The implementation tracks and reports specific metrics:

Read Processing Features

Stream Concurrency Model

Support

For questions and support: