TagAndMerge

Script: tagandmerge.sh
Package: barcode
Class: TagAndMerge.java

Accepts multiple input files from a demultiplexed lane. Parses the barcode from each filename and appends (tab)BARCODE to the read headers of that file. Outputs all reads into a single file. Optionally trims reads and drops read 2 (R2). Intended for evaluating demultiplexing methods.

Basic Usage

tagandmerge.sh *.fastq.gz out=<output file>

Or with explicit input specification:

tagandmerge.sh in=<file,file,file> out=<output file>

Input may be fasta or fastq, compressed or uncompressed.

Parameters

TagAndMerge takes multiple demultiplexed files in a single run, extracts the barcode from each filename, and adds it to the read headers before consolidating all reads into a single output file. Parameters control input processing, output formatting, and barcode extraction behavior.

Standard parameters

in=<file,file>: A comma-delimited list of files. If wildcards are used, omit in= and the commas. Any number of files can be given in one run.
out=<file>: Print all reads to this destination. All processed reads from all input files will be consolidated into this single output file.
barcodes=<file>: Print barcodes from file names to this destination. Creates a separate file listing all unique barcodes extracted from the input filenames.
trim=-1: If positive, trim all reads to this length. The default of -1 disables trimming.
dropr2=f: Discard read 2 if the input is interleaved. Set to true to keep only read 1 from paired-end data, effectively converting to single-end output.
shrinkheader=f: (shrink) Illumina only; remove unnecessary header fields. When enabled, condenses Illumina headers to essential information for reduced file size.
remap=-+: Remap symbols in the barcode. By default, '+' replaces '-' in barcode strings. To eliminate this remapping behavior, set 'remap=null'.

Examples

Basic Demultiplexing Evaluation

tagandmerge.sh path/*0.*.fastq.gz dropr2 trim out=tagged.fq.gz barcodes=bc.txt

Processes all files matching the pattern, drops R2 reads, trims reads, outputs consolidated reads to tagged.fq.gz and barcodes to bc.txt.

Processing Specific Files

tagandmerge.sh in=sample1.fq,sample2.fq,sample3.fq out=merged.fq barcodes=extracted_barcodes.txt

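Processes three explicitly listed input files, extracts barcodes from their filenames, and consolidates all reads into merged.fq. The names shown are placeholders: real filenames need a dot-delimited barcode field for extraction to succeed.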

With Read Trimming

tagandmerge.sh *.fastq.gz out=consolidated.fq trim=100 shrinkheader=t

Processes all matching FASTQ files, trims reads to 100 bases, enables header shrinking for Illumina data, and outputs to consolidated.fq.

Disabling Barcode Remapping

tagandmerge.sh *.fq out=tagged.fq remap=null

Processes files without any barcode symbol remapping (preserves original barcode characters as found in filenames).
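The effect of the remap parameter can be illustrated with a minimal Python sketch. It assumes a two-character specification in which the first character is replaced by the second, and that the string null turns remapping off; build_remap and remap_barcode are hypothetical names for illustration, not part of TagAndMerge.

```python
def build_remap(spec):
    """Build a translation table from a two-character spec such as '-+'.

    Assumption: the first character is replaced by the second, and the
    string 'null' (or None) disables remapping.
    """
    if spec is None or spec == "null":
        return None
    if len(spec) != 2:
        raise ValueError("expected a two-character spec such as '-+'")
    return str.maketrans(spec[0], spec[1])


def remap_barcode(barcode, table):
    """Apply the table to a barcode, or return it unchanged if remapping is off."""
    return barcode if table is None else barcode.translate(table)


if __name__ == "__main__":
    print(remap_barcode("ACGTACGT-TTGCAAGG", build_remap("-+")))
    print(remap_barcode("ACGTACGT-TTGCAAGG", build_remap("null")))
```

With the default specification the dash in a dual-index barcode becomes a plus sign; with null the barcode is written exactly as it appears in the filename.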

Algorithm Details

Barcode Extraction and Processing

TagAndMerge extracts barcodes by splitting each demultiplexed filename on dots. The tool splits the name using Tools.dotPattern.split() and tests each segment with isBarcode(), which requires at least 6 bases (ACGTN) and at most 1 delimiter character (- or +).
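The parsing rule described above can be approximated in a few lines of Python. This is a sketch under stated assumptions rather than the tool's actual code: it assumes uppercase bases only, returns the first segment that passes the check, and reuses the names is_barcode and parse_barcode_from_fname purely for readability.

```python
import os


def is_barcode(segment, min_bases=6, max_delimiters=1):
    """Return True if the segment looks like a barcode.

    Assumption: only uppercase A, C, G, T, N count as bases, and the only
    other characters allowed are the delimiters '-' and '+'.
    """
    bases = sum(c in "ACGTN" for c in segment)
    delimiters = sum(c in "-+" for c in segment)
    if bases + delimiters != len(segment):
        return False  # contains a character that is neither base nor delimiter
    return bases >= min_bases and delimiters <= max_delimiters


def parse_barcode_from_fname(path):
    """Split the file name on dots and return the first barcode-like segment.

    Returns None when no segment qualifies. Picking the first match is an
    assumption; the real tool may choose differently.
    """
    name = os.path.basename(path)
    for segment in name.split("."):
        if is_barcode(segment):
            return segment
    return None


if __name__ == "__main__":
    print(parse_barcode_from_fname("path/lane1.ACGTACGT-TTGCAAGG.fastq.gz"))
    print(parse_barcode_from_fname("sample1.fq"))
```

A name such as sample1.fq yields no barcode under these rules, because neither sample1 nor fq consists of bases.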

Processing Workflow

The tool operates through several key phases:

File Discovery: Identifies all input files either through wildcard expansion or explicit file lists
Barcode Extraction: Parses each filename using Barcode.parseBarcodeFromFname() to identify embedded barcode sequences
Barcode Validation: Validates extracted strings using Barcode.isBarcode() to ensure they represent legitimate barcodes
Symbol Remapping: Applies character substitutions (default: '-' → '+') to normalize barcode representation
Header Modification: Appends extracted barcodes to read headers with tab separation
Optional Processing: Applies read trimming and R2 dropping if specified
Consolidation: Writes all processed reads to a single output file
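The phases above can be sketched end to end on in-memory records. The Python below is illustrative only: it assumes barcodes have already been parsed from the filenames, that records from an interleaved file alternate R1, R2, and that trimming happens before the header is tagged. The function tag_and_merge and its arguments are hypothetical.

```python
def tag_and_merge(files, trim=-1, drop_r2=False, remap=("-", "+")):
    """Merge reads from several files into one list, tagging each header.

    files: mapping of barcode -> list of (header, bases, quals) records.
    Assumption: records from an interleaved file alternate R1, R2, R1, R2.
    """
    merged = []
    for barcode, records in files.items():
        # Symbol remapping, e.g. '-' becomes '+'; pass remap=None to skip it.
        tag = barcode.replace(remap[0], remap[1]) if remap else barcode
        for i, (header, bases, quals) in enumerate(records):
            if drop_r2 and i % 2 == 1:
                continue  # odd positions hold read 2 in interleaved input
            if trim > 0:
                bases, quals = bases[:trim], quals[:trim]
            # Header modification: tab, then the barcode.
            merged.append((header + "\t" + tag, bases, quals))
    return merged


if __name__ == "__main__":
    demo = {
        "ACGTAC-TTGCAA": [
            ("read1/1", "ACGTACGT", "IIIIIIII"),
            ("read1/2", "TTTTCCCC", "IIIIIIII"),
        ]
    }
    for record in tag_and_merge(demo, trim=4, drop_r2=True):
        print(record)
```

Processing one file at a time and appending to a single output mirrors the sequential consolidation described in this section; a real implementation would stream records instead of holding them in a list.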

Memory Management

The tool runs with a default memory allocation of 300 MB (-Xmx300m). Input files are processed one at a time to keep the memory footprint small. Within each file, reads are streamed in batches through ConcurrentReadInputStream and ConcurrentReadOutputStream, each using 4 buffers.

Header Processing Modes

TagAndMerge supports two header processing modes:

Standard Mode: Appends the complete barcode to the existing header with tab separation
Shrink Mode: For Illumina data, creates condensed headers using LineParserS4 to extract only essential coordinate information while still appending the barcode
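The difference between the two modes can be pictured with a small Python sketch. Which fields shrink mode actually keeps is not stated here, so the choice below (lane, tile, x, y from a standard Illumina header) is a guess made for illustration, as is leaving non-Illumina headers untouched; tag_header is a hypothetical function and does not reproduce LineParserS4.

```python
def tag_header(header, barcode, shrink=False):
    """Append a tab and the barcode to a read header.

    Assumption for shrink mode: an Illumina header starts with seven
    colon-separated fields (instrument:run:flowcell:lane:tile:x:y) and only
    lane:tile:x:y is kept. The real tool may keep different fields.
    """
    if shrink:
        first_token = header.split(" ", 1)[0]
        fields = first_token.split(":")
        if len(fields) >= 7:
            header = ":".join(fields[3:7])
        # Headers that do not look like Illumina are left as they are.
    return header + "\t" + barcode


if __name__ == "__main__":
    h = "A00178:38:H5NYYDSXX:2:1101:3007:1000 1:N:0:CAACCTCA+TTGGAGTG"
    print(tag_header(h, "CAACCTCA+TTGGAGTG"))
    print(tag_header(h, "CAACCTCA+TTGGAGTG", shrink=True))
```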

Performance Characteristics

The tool processes demultiplexed files with specific memory and I/O patterns:

Sequential file processing with 300MB default memory allocation (-Xmx300m)
Concurrent I/O using ConcurrentReadInputStream and ConcurrentReadOutputStream with 4-buffer streams
Filename parsing by splitting on dots with Tools.dotPattern.split()
Optional read trimming using TrimRead.trimToPosition() with exact base position control

Quality Control

The tool includes validation and reporting mechanisms:

Validates that barcodes can be successfully extracted from all input filenames using assert(tag!=null) checks
Maintains a LinkedHashSet<String> of unique barcodes for duplicate detection and verification
Provides read/base statistics using Tools.timeReadsBasesProcessed() and Tools.readsBasesOut() reporting
Ensures input and output file names don't conflict using assert(!out1.equalsIgnoreCase(s)) validation
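Similar checks can be run before launching the tool, for example in a wrapper script. The Python sketch below is an assumption-laden illustration, not the tool's code: preflight is a hypothetical helper, the barcode parser is passed in as a callable, and an insertion-ordered dict stands in for the LinkedHashSet.

```python
def preflight(inputs, output, parse_barcode):
    """Check inputs before merging and return unique barcodes in first-seen order.

    parse_barcode: callable taking a file name and returning a barcode or None
    (hypothetical; supply whatever parser matches your naming scheme).
    Raises ValueError when a check fails.
    """
    barcodes = {}  # dicts preserve insertion order, like a LinkedHashSet
    for path in inputs:
        # Case-insensitive comparison, in the spirit of equalsIgnoreCase.
        if path.lower() == output.lower():
            raise ValueError("output file is also an input: " + path)
        tag = parse_barcode(path)
        if tag is None:
            raise ValueError("no barcode found in file name: " + path)
        barcodes.setdefault(tag, None)
    return list(barcodes)


if __name__ == "__main__":
    # Toy parser: treat the second dot-delimited field as the barcode.
    def second_field(path):
        parts = path.split(".")
        return parts[1] if len(parts) >= 3 else None

    print(preflight(["a.ACGTAC.fq", "b.ACGTAC.fq", "c.TTGCAA.fq"], "out.fq", second_field))
```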

Support

For questions and support:

Email: bbushnell@lbl.gov
Documentation: bbmap.org