MergeBarcodes
Concatenates barcode sequences and their quality scores onto read names for demultiplexing workflows. Barcodes are held in a HashMap keyed by read ID, so each read's barcode is retrieved in O(1) average time.
Basic Usage
mergebarcodes.sh in=<file> out=<file> barcode=<file>
Input may be stdin or a fasta or fastq file, raw or gzipped. If you pipe via stdin/stdout, include the file type in the name; e.g. for gzipped fasta input, set in=stdin.fa.gz.
Parameters
Parameters are organized by their function in the barcode merging process.
Input Parameters
- in=<file>
- Input reads. 'in=stdin.fq' will pipe from standard in.
- bar=<file>
- File containing barcodes. Also accepts 'barcode=<file>' or 'index=<file>'.
- int=auto
- (interleaved) If true, forces fastq input to be paired and interleaved.
- qin=auto
- ASCII offset for input quality. May be 33 (Sanger), 64 (Illumina), or auto.
Output Parameters
- out=<file>
- Write muxed sequences here. 'out=stdout.fa' will pipe to standard out.
- overwrite=t
- (ow) Set to false to force the program to abort rather than overwrite an existing file.
- ziplevel=2
- (zl) Set to 1 (lowest) through 9 (max) to change compression level; lower compression is faster.
- qout=auto
- ASCII offset for output quality. May be 33 (Sanger), 64 (Illumina), or auto (same as input).
Processing Parameters
- addslash=f
- Add /1 and /2 to read names for paired-end data identification.
- addcolon=f
- Add 1: and 2: to read names for paired-end data identification.
- rcomp=f
- (reversecomplement) Reverse complement all reads before processing.
- rcompmate=f
- (reversecomplementmate) Reverse complement only mate reads (R2) before processing.
Other Parameters
- pigz=t
- Use pigz to compress. If argument is a number, that will set the number of pigz threads.
- unpigz=t
- Use pigz to decompress.
Java Parameters
- -Xmx
- This will set Java's memory usage, overriding autodetection. -Xmx20g will specify 20 gigs of RAM, and -Xmx200m will specify 200 megs. The max is typically 85% of physical memory.
- -eoom
- This flag will cause the process to exit if an out-of-memory exception occurs. Requires Java 8u92+.
- -da
- Disable assertions.
Examples
Basic Barcode Merging
mergebarcodes.sh in=reads.fq bar=barcodes.fq out=merged.fq
Merge barcodes from barcodes.fq into the headers of reads.fq, creating merged.fq with barcode sequences and qualities prepended to read names.
Paired-end Data with Compression
mergebarcodes.sh in=reads_R1.fq.gz bar=index.fq.gz out=merged.fq.gz ziplevel=6
Process gzipped paired-end reads together with their gzipped index reads, writing gzipped output at compression level 6.
Reverse Complement Processing
mergebarcodes.sh in=reads.fq bar=barcodes.fq out=merged.fq rcompmate=t
Merge barcodes while reverse complementing mate reads (R2), useful when the sequencing orientation requires adjustment.
Streaming with Quality Conversion
cat reads.fq | mergebarcodes.sh in=stdin.fq bar=barcodes.fq qin=64 qout=33 out=stdout.fq
Stream reads through mergebarcodes while converting quality scores from Illumina (64) to Sanger (33) format.
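The offset conversion requested by qin=64 qout=33 can be illustrated with a minimal Java sketch. This is not the BBTools implementation; the class and method names are hypothetical, and the only assumption is the standard encoding in which each quality character equals the numeric score plus the ASCII offset.

```java
class QualityOffsetSketch {
    /** Re-encodes a quality string from one ASCII offset to another (e.g. 64 to 33). */
    static String convert(String quals, int offsetIn, int offsetOut) {
        StringBuilder sb = new StringBuilder(quals.length());
        for (int i = 0; i < quals.length(); i++) {
            int score = quals.charAt(i) - offsetIn;   // decode to a numeric score
            if (score < 0) {
                throw new IllegalArgumentException("Quality below offset at position " + i);
            }
            sb.append((char) (score + offsetOut));    // re-encode with the new offset
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // 'h' is ASCII 104, i.e. score 40 at offset 64; score 40 at offset 33 is 'I' (ASCII 73).
        System.out.println(convert("hhhh", 64, 33)); // prints IIII
    }
}
```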
Algorithm Details
Implementation Architecture
MergeBarcodes implements a two-phase HashMap-based approach using ConcurrentReadInputStream and ConcurrentReadOutputStream classes from the stream package.
Phase 1 - Barcode Loading (loadBarcodes method)
- HashMap Initialization: Creates HashMap<String, Read> with initial capacity 0x10000-1 (65535 entries)
- Stream Processing: Uses ConcurrentReadInputStream.getReadInputStream() for parallel barcode file reading
- Key Generation: Read IDs are truncated at the first space using split(" ")[0], so only the leading token of a space-containing identifier is used for matching
- Storage Method: map.put(r1.id, r1) stores barcode Read objects with read ID as key
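The loading phase described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the BBTools source: Barcode is a hypothetical stand-in for the Read class, the input is an in-memory list rather than a ConcurrentReadInputStream, and the map is keyed on the ID truncated at the first space.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class BarcodeLoadSketch {
    /** Hypothetical stand-in for a barcode read (not the BBTools Read class). */
    static final class Barcode {
        final String id;
        final String bases;
        final byte[] quality; // numeric scores, not ASCII characters

        Barcode(String id, String bases, byte[] quality) {
            this.id = id;
            this.bases = bases;
            this.quality = quality;
        }
    }

    /**
     * Returns the text before the first space. Gives the same result as
     * split(" ")[0] for any header containing a non-space character.
     */
    static String firstToken(String id) {
        int space = id.indexOf(' ');
        return space < 0 ? id : id.substring(0, space);
    }

    /** Builds the lookup table; a later duplicate ID replaces the earlier entry. */
    static Map<String, Barcode> load(List<Barcode> barcodes) {
        Map<String, Barcode> map = new HashMap<>(0x10000 - 1);
        for (Barcode b : barcodes) {
            map.put(firstToken(b.id), b);
        }
        return map;
    }

    public static void main(String[] args) {
        List<Barcode> barcodes = Arrays.asList(
            new Barcode("read1 1:N:0:1", "ACGTAC", new byte[] {40, 40, 40, 38, 38, 30}),
            new Barcode("read2 1:N:0:1", "TTGCAA", new byte[] {35, 35, 35, 35, 20, 20}));
        Map<String, Barcode> map = load(barcodes);
        System.out.println(map.size() + " barcodes, read1 -> " + map.get("read1").bases);
        // prints: 2 barcodes, read1 -> ACGTAC
    }
}
```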
Phase 2 - Read Merging (mergeWithMap method)
- Lookup Operation: map.remove(key) retrieves the barcode in O(1) average time and deletes its entry from the map
- String Concatenation: Uses StringBuilder with prefix.append() for barcode integration
- Quality Conversion: Implements (char)(b+33) to convert quality bytes to ASCII
- Format Structure: [barcode_bases]_[quality_ascii]_[original_read_name]
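A minimal sketch of the merging step, following the format structure listed above. The Barcode class and method names are hypothetical, and two details are assumptions rather than documented behavior: the full original name (including any text after the first space) is kept as the suffix, and a read with no matching barcode keeps its name unchanged.

```java
import java.util.HashMap;
import java.util.Map;

class BarcodeMergeSketch {
    /** Hypothetical stand-in for a barcode read: bases plus numeric quality scores. */
    static final class Barcode {
        final String bases;
        final byte[] quality;

        Barcode(String bases, byte[] quality) {
            this.bases = bases;
            this.quality = quality;
        }
    }

    /** Returns the text before the first space, used as the lookup key. */
    static String firstToken(String id) {
        int space = id.indexOf(' ');
        return space < 0 ? id : id.substring(0, space);
    }

    /** Builds [barcode_bases]_[quality_ascii]_[original_read_name] for one read. */
    static String mergedName(String readId, Map<String, Barcode> map, StringBuilder prefix) {
        Barcode bc = map.remove(firstToken(readId)); // lookup and delete in one call
        if (bc == null) {
            return readId;                           // assumption: unmatched reads are left as-is
        }
        prefix.setLength(0);                         // reuse the caller's builder
        prefix.append(bc.bases).append('_');
        for (byte q : bc.quality) {
            prefix.append((char) (q + 33));          // numeric score to ASCII at offset 33
        }
        prefix.append('_').append(readId);
        return prefix.toString();
    }

    public static void main(String[] args) {
        Map<String, Barcode> map = new HashMap<>();
        map.put("read1", new Barcode("ACGT", new byte[] {40, 40, 30, 2}));
        StringBuilder sb = new StringBuilder();
        System.out.println(mergedName("read1 1:N:0", map, sb)); // prints ACGT_II?#_read1 1:N:0
        System.out.println(map.size());                         // prints 0
    }
}
```

Because the entry is removed on lookup, a second read with the same key would find no barcode in this sketch.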
Memory Management
- Buffer Configuration: Shared.capBuffers(4) limits concurrent read buffers
- Stream Architecture: ConcurrentReadOutputStream with buffer=4 for write operations
- Memory Release: Calling HashMap.remove() as each read is matched drops that barcode from the map, so the map shrinks as processing proceeds
- Compression Support: ReadWrite.USE_PIGZ=true enables pigz compression
Processing Statistics
The implementation tracks and reports specific metrics:
- Match Counting: barcodesFound and barcodesNotFound counters
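- Throughput Calculation: Tools.format("%.2fk reads/sec", rpnano*1000000), where rpnano is reads per nanosecond; multiplying by 1,000,000 gives thousands of reads per second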
- Percentage Reporting: barcodesFound*100.0/readsProcessed for match rates
- Processing Time: Timer class tracks elapsed nanoseconds
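The arithmetic behind these metrics can be checked with a short sketch. It uses String.format from the Java standard library in place of the BBTools Tools.format helper, and the class and method names are hypothetical.

```java
import java.util.Locale;

class MergeStatsSketch {
    /** Formats throughput in thousands of reads per second from a nanosecond timer. */
    static String throughput(long readsProcessed, long elapsedNanos) {
        double rpnano = readsProcessed / (double) elapsedNanos; // reads per nanosecond
        // 1e9 ns per second divided by 1e3 reads per "k" leaves a factor of 1e6.
        return String.format(Locale.ROOT, "%.2fk reads/sec", rpnano * 1000000);
    }

    /** Formats the share of processed reads for which a barcode was found. */
    static String matchRate(long barcodesFound, long readsProcessed) {
        return String.format(Locale.ROOT, "%.2f%%", barcodesFound * 100.0 / readsProcessed);
    }

    public static void main(String[] args) {
        long reads = 2_000_000L;
        long nanos = 4_000_000_000L; // 4 seconds
        System.out.println(throughput(reads, nanos));     // prints 500.00k reads/sec
        System.out.println(matchRate(1_500_000L, reads)); // prints 75.00%
    }
}
```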
Read Processing Features
- ID Handling: Automatic space trimming using key.split(" ")[0]
- Reverse Complement: Optional r1.reverseComplement() and r2.reverseComplement() methods
- Paired-end Support: Processes both r1 and its mate r2 with the same barcode prefix
- Format Detection: FileFormat.testInput() for automatic file type identification
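For readers unfamiliar with the reverse-complement option listed above, here is a minimal sketch of the operation on a base string and its quality string. It is not the BBTools Read.reverseComplement() code; the names are hypothetical, and leaving non-ACGT symbols such as N unchanged is an assumption of this sketch.

```java
class ReverseComplementSketch {
    /** Complements one base; symbols other than A, C, G, T (either case) are returned unchanged. */
    static char complement(char base) {
        switch (base) {
            case 'A': return 'T';
            case 'T': return 'A';
            case 'C': return 'G';
            case 'G': return 'C';
            case 'a': return 't';
            case 't': return 'a';
            case 'c': return 'g';
            case 'g': return 'c';
            default:  return base;
        }
    }

    /** Reverses the sequence and complements every base. */
    static String reverseComplement(String bases) {
        StringBuilder sb = new StringBuilder(bases.length());
        for (int i = bases.length() - 1; i >= 0; i--) {
            sb.append(complement(bases.charAt(i)));
        }
        return sb.toString();
    }

    /** Qualities are only reversed, so each score stays with its base. */
    static String reverseQuality(String quals) {
        return new StringBuilder(quals).reverse().toString();
    }

    public static void main(String[] args) {
        System.out.println(reverseComplement("AACGTN")); // prints NACGTT
        System.out.println(reverseQuality("ABCD"));      // prints DCBA
    }
}
```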
Stream Concurrency Model
- Input Streams: ConcurrentReadInputStream for parallel reading
- Output Streams: ConcurrentReadOutputStream for concurrent writing
- List Processing: ListNum<Read> containers for batch read handling
- Buffer Reuse: StringBuilder.setLength(0) clears the buffer between reads so a single builder can be reused (StringBuilder itself is not synchronized)
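The batch handling and buffer reset listed above can be sketched in isolation. This example uses a plain List<String> of read names in place of ListNum<Read>, and a fixed prefix in place of a per-read barcode; both are simplifications, and the names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class BatchRenameSketch {
    /** Renames one batch of reads, reusing a single StringBuilder for every name. */
    static List<String> prefixBatch(List<String> names, String prefix, StringBuilder sb) {
        List<String> out = new ArrayList<>(names.size());
        for (String name : names) {
            sb.setLength(0); // drop the previous name but keep the allocated capacity
            sb.append(prefix).append('_').append(name);
            out.add(sb.toString());
        }
        return out;
    }

    public static void main(String[] args) {
        StringBuilder sb = new StringBuilder(); // one builder per worker, never shared
        List<List<String>> batches = Arrays.asList(
            Arrays.asList("r1", "r2"),
            Arrays.asList("r3"));
        for (List<String> batch : batches) {
            System.out.println(prefixBatch(batch, "ACGT", sb));
        }
        // prints [ACGT_r1, ACGT_r2] then [ACGT_r3]
    }
}
```

Keeping one builder per worker avoids allocating a new buffer for each read; sharing a builder across threads would need external synchronization.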
Support
For questions and support:
- Email: bbushnell@lbl.gov
- Documentation: bbmap.org