TagAndMerge

Script: tagandmerge.sh
Package: barcode
Class: TagAndMerge.java

Accepts multiple input files from a demultiplexed lane, parses the barcode from each filename, and appends a tab followed by the barcode to every read header. All reads are written to a single output file. Optionally trims reads to a fixed length and drops read 2 (R2). Intended for evaluating demultiplexing methods.

Basic Usage

tagandmerge.sh *.fastq.gz out=<output file>

Or with explicit input specification:

tagandmerge.sh in=<file,file,file> out=<output file>

Input may be fasta or fastq, compressed or uncompressed.

Parameters

TagAndMerge takes multiple demultiplexed files in one run, extracts the barcode from each filename, and adds it to the read headers before consolidating all reads into a single output file. The parameters below control input selection, output destinations, read and header processing, and barcode remapping.

Standard parameters

in=<file,file>
A comma-delimited list of input files. If wildcards are used, omit in= and the commas. Any number of files can be given in a single run.
out=<file>
Write all reads to this destination. Reads from every input file are consolidated into this single output file.
barcodes=<file>
Print barcodes from file names to this destination. Creates a separate file listing all unique barcodes extracted from the input filenames.
trim=-1
If positive, trim all reads to this length; reads longer than this value are cut down to this many bases. Default is -1 (no trimming).
dropr2=f
Discard read 2 if the input is interleaved. Set to true to keep only read 1 from paired-end data, effectively converting to single-end output.
shrinkheader=f
(Alias: shrink) Illumina only; remove unnecessary header fields. When enabled, Illumina headers are condensed to their essential fields, which reduces file size.
remap=-+
Remap symbols in the barcode. By default, '-' in a barcode is replaced with '+'. To disable remapping, set remap=null.
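The default remapping can be pictured as a single character substitution. The Python sketch below is illustrative only and is not the tool's Java code: the function name remap_barcode is hypothetical, and it assumes the remap value is a two-character string whose first character is replaced by the second.

```python
from typing import Optional


def remap_barcode(barcode: str, remap: Optional[str] = "-+") -> str:
    """Replace one symbol in a barcode with another.

    Assumption: `remap` is two characters, (from, to), so "-+" turns '-' into '+'.
    Passing None or the string "null" leaves the barcode unchanged.
    """
    if remap is None or remap == "null":
        return barcode
    if len(remap) != 2:
        raise ValueError("this sketch only handles a two-character remap string")
    source, target = remap[0], remap[1]
    return barcode.replace(source, target)
```

With the default setting, remap_barcode("ACGTAC-TTGACA") gives "ACGTAC+TTGACA"; with remap="null" the barcode is returned as found in the filename.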

Examples

Basic Demultiplexing Evaluation

tagandmerge.sh path/*0.*.fastq.gz dropr2 trim out=tagged.fq.gz barcodes=bc.txt

Processes all files matching the pattern, drops R2 reads, trims reads, and writes the consolidated reads to tagged.fq.gz and the barcodes to bc.txt.
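The per-read effect of tagging, trimming, and dropping R2 can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the tool's implementation: the name tag_and_trim is hypothetical, reads are modeled as (header, bases, qualities) tuples, interleaved input is assumed to alternate R1, R2, R1, R2, and trimming is assumed to keep the leftmost bases.

```python
from typing import Iterable, Iterator, Tuple

Record = Tuple[str, str, str]  # (header, bases, qualities)


def tag_and_trim(records: Iterable[Record], barcode: str,
                 trim: int = -1, drop_r2: bool = False) -> Iterator[Record]:
    """Yield reads with a tab and the barcode appended to each header."""
    for index, (header, bases, quals) in enumerate(records):
        # Assumption: odd positions in an interleaved stream are read 2.
        if drop_r2 and index % 2 == 1:
            continue
        # Only a positive trim value shortens reads; shorter reads are untouched.
        if trim > 0:
            bases, quals = bases[:trim], quals[:trim]
        yield (header + "\t" + barcode, bases, quals)
```

For an interleaved pair, calling tag_and_trim(pair, "ACGTAC+TTGACA", trim=4, drop_r2=True) yields only the first read, shortened to 4 bases, with the barcode appended to its header after a tab.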

Processing Specific Files

tagandmerge.sh in=sample1.fq,sample2.fq,sample3.fq out=merged.fq barcodes=extracted_barcodes.txt

Processes three specific input files, extracts barcodes from their filenames, consolidates all reads into merged.fq, and writes the barcode list to extracted_barcodes.txt.

With Read Trimming

tagandmerge.sh *.fastq.gz out=consolidated.fq trim=100 shrinkheader=t

Processes all FASTQ files, trims all reads to at most 100 bases, enables header shrinking for Illumina data, and outputs to consolidated.fq.

Custom Barcode Remapping

tagandmerge.sh *.fq out=tagged.fq remap=null

Processes files without any barcode symbol remapping (preserves original barcode characters as found in filenames).

Algorithm Details

Barcode Extraction and Processing

TagAndMerge extracts barcodes from demultiplexed file names by dot-pattern parsing. It splits each filename on dots using Tools.dotPattern.split() and tests each segment with isBarcode(), which requires at least 6 bases (A, C, G, T, or N) and at most 1 delimiter character (- or +).
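The rule above can be approximated with the following Python sketch. It is not a port of the Java code: the function names are hypothetical, and it makes three assumptions the text does not state, namely that only uppercase bases count, that a segment may contain nothing other than bases and delimiters, and that the first qualifying segment is the one used.

```python
import os
from typing import Optional

BASES = set("ACGTN")
DELIMITERS = set("-+")


def is_barcode(segment: str, min_bases: int = 6, max_delimiters: int = 1) -> bool:
    """Return True if the segment looks like a barcode under the stated rule."""
    bases = sum(1 for c in segment if c in BASES)
    delimiters = sum(1 for c in segment if c in DELIMITERS)
    # Assumption: any other character disqualifies the segment.
    if bases + delimiters != len(segment):
        return False
    return bases >= min_bases and delimiters <= max_delimiters


def barcode_from_filename(path: str) -> Optional[str]:
    """Split the file name on dots and return the first barcode-like segment."""
    name = os.path.basename(path)
    for segment in name.split("."):
        if is_barcode(segment):
            return segment
    return None  # no segment qualified
```

Under these assumptions, a name such as lane1.ACGTACGT-TTGACAAT.fastq.gz yields ACGTACGT-TTGACAAT, while a name such as sample1.fq yields no barcode because no segment contains 6 bases.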

Processing Workflow

The tool operates through several key phases:

Memory Management

The tool runs with a default Java heap limit of 300 MB (-Xmx300m). Input files are processed sequentially to minimize the memory footprint. Each file's reads are streamed in batches through ConcurrentReadInputStream and ConcurrentReadOutputStream, with 4 buffers for the read and write streams.
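The memory benefit of handling one file at a time in batches can be illustrated with a small Python generator. This is a single-threaded stand-in, not the concurrent stream classes named above; the function name batched_records and the batch size of 200 are hypothetical, and 4-line FASTQ records are assumed.

```python
from itertools import islice
from typing import Iterable, Iterator, List


def batched_records(sources: Iterable[Iterable[str]],
                    batch_size: int = 200) -> Iterator[List[List[str]]]:
    """Yield batches of 4-line FASTQ records, one source at a time."""
    for source in sources:  # sources are consumed sequentially, never in parallel
        lines = iter(source)
        while True:
            # Hold at most batch_size records in memory at once.
            chunk = list(islice(lines, 4 * batch_size))
            if not chunk:
                break  # this source is exhausted; move on to the next one
            yield [chunk[i:i + 4] for i in range(0, len(chunk), 4)]
```

Each source can be any iterable of lines, for example a file handle returned by gzip.open(path, "rt"). Because a batch never spans two sources, every record in a batch shares the same filename and therefore the same barcode.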

Header Processing Modes

TagAndMerge supports two header processing modes:

Performance Characteristics

The tool processes demultiplexed files with specific memory and I/O patterns:

Quality Control

The tool includes validation and reporting mechanisms:
