TagAndMerge
Accepts multiple input files from a demultiplexed lane. Parses the barcode from the filename and adds (tab)BARCODE to read headers. Outputs all reads into a single file. Optionally, trims bases and drops R2. Intended for evaluating demultiplexing methods.
Basic Usage
tagandmerge.sh *.fastq.gz out=<output file>
Or with explicit input specification:
tagandmerge.sh in=<file,file,file> out=<output file>
Input may be fasta or fastq, compressed or uncompressed.
Parameters
TagAndMerge processes multiple demultiplexed files in a single run, extracting the barcode from each filename and adding it to that file's read headers before consolidating all reads into a single output file. Parameters control input processing, output formatting, and barcode extraction behavior.
Standard parameters
- in=<file,file>
- A comma-delimited list of files. If wildcards are used, omit in= and the commas. Any number of files can be given in one run.
- out=<file>
- Print all reads to this destination. All processed reads from all input files will be consolidated into this single output file.
- barcodes=<file>
- Print barcodes from file names to this destination. Creates a separate file listing all unique barcodes extracted from the input filenames.
- trim=-1
- If positive, trim all reads to this length. The default of -1 disables trimming. When set to a positive value, reads longer than this are cut to this many bases.
- dropr2=f
- Discard read 2 if the input is interleaved. Set to true to keep only read 1 from paired-end data, effectively converting to single-end output.
- shrinkheader=f
- (shrink) Illumina only; remove unnecessary header fields. When enabled, condenses Illumina headers to essential information for reduced file size.
- remap=-+
- Remap symbols in the barcode. By default, '+' replaces '-' in barcode strings. To eliminate this remapping behavior, set 'remap=null'.
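The remap setting can be read as a pair of characters: the symbol to replace, followed by its replacement. The Python sketch below illustrates that reading only. The function name remap_barcode is hypothetical, and treating longer values as a series of (from, to) pairs is an assumption, not something this page states about the tool.

```python
def remap_barcode(barcode, remap="-+"):
    """Replace symbols in a barcode string.

    remap is read as consecutive (from, to) character pairs; this pairing
    rule is an assumption made for illustration.
    """
    if remap is None or remap == "null":
        return barcode  # remapping disabled, barcode kept as found
    if len(remap) % 2 != 0:
        raise ValueError("remap must contain an even number of characters")
    table = {remap[i]: remap[i + 1] for i in range(0, len(remap), 2)}
    return "".join(table.get(c, c) for c in barcode)


if __name__ == "__main__":
    print(remap_barcode("ACGTACGT-TTGCAAGC"))          # default remap=-+
    print(remap_barcode("ACGTACGT-TTGCAAGC", "null"))  # remap=null
```

With the default, a dual-index barcode written with '-' in the filename ends up with '+' in the read header; with remap=null it is left untouched.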
Examples
Basic Demultiplexing Evaluation
tagandmerge.sh path/*0.*.fastq.gz dropr2 trim out=tagged.fq.gz barcodes=bc.txt
Processes all files matching the pattern, drops R2 reads, trims reads, outputs consolidated reads to tagged.fq.gz and barcodes to bc.txt.
Processing Specific Files
tagandmerge.sh in=sample1.fq,sample2.fq,sample3.fq out=merged.fq barcodes=extracted_barcodes.txt
Processes three specific input files, extracts barcodes from their filenames, and consolidates all reads into merged.fq.
With Read Trimming
tagandmerge.sh *.fastq.gz out=consolidated.fq trim=100 shrinkheader=t
Processes all FASTQ files, trims all reads to exactly 100 bases, enables header shrinking for Illumina data, and outputs to consolidated.fq.
Custom Barcode Remapping
tagandmerge.sh *.fq out=tagged.fq remap=null
Processes files without any barcode symbol remapping (preserves original barcode characters as found in filenames).
Algorithm Details
Barcode Extraction and Processing
TagAndMerge uses dot-pattern filename parsing to extract barcodes from demultiplexed file names. The tool splits filenames on dots using Tools.dotPattern.split() and tests each segment with isBarcode(), which requires at least 6 bases (ACGTN) and at most 1 delimiter character (- or +).
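The parsing rule above can be sketched in a few lines of Python. This is an illustration of the described behavior, not the tool's Java code: the function names mirror the ones mentioned in the text, while the choices to accept uppercase letters only, to reject segments containing any other character, and to return the first matching segment are assumptions.

```python
import os

BASES = set("ACGTN")
DELIMITERS = set("-+")


def is_barcode(segment, min_bases=6, max_delimiters=1):
    """True if segment has >= min_bases bases and <= max_delimiters delimiters."""
    bases = sum(1 for c in segment if c in BASES)
    delims = sum(1 for c in segment if c in DELIMITERS)
    if bases + delims != len(segment):
        return False  # assumption: any other character disqualifies the segment
    return bases >= min_bases and delims <= max_delimiters


def parse_barcode_from_fname(path):
    """Split the file name on dots and return the first barcode-like segment."""
    name = os.path.basename(path)
    for segment in name.split("."):
        if is_barcode(segment):
            return segment
    return None  # no barcode found in this file name


if __name__ == "__main__":
    print(parse_barcode_from_fname("run1/lane0.ACGTACGT-TTGCAAGC.fastq.gz"))
    print(parse_barcode_from_fname("sample1.fq"))
```

Under this sketch a name such as sample1.fq yields no barcode, because no dot-separated segment passes the check.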
Processing Workflow
The tool operates through several key phases:
- File Discovery: Identifies all input files either through wildcard expansion or explicit file lists
- Barcode Extraction: Parses each filename using Barcode.parseBarcodeFromFname() to identify embedded barcode sequences
- Barcode Validation: Validates extracted strings using Barcode.isBarcode() to ensure they represent legitimate barcodes
- Symbol Remapping: Applies character substitutions (default: '-' → '+') to normalize barcode representation
- Header Modification: Appends extracted barcodes to read headers with tab separation
- Optional Processing: Applies read trimming and R2 dropping if specified
- Consolidation: Writes all processed reads to a single output file
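The later phases of the workflow (header modification, optional processing, consolidation) can be sketched on in-memory reads as follows. This is a simplified Python illustration under stated assumptions: reads are represented as (header, bases, quals, is_read2) tuples, the barcode for each file is assumed to be already extracted and remapped, and the names tag_reads and tag_and_merge are hypothetical.

```python
def tag_reads(reads, barcode, trim=-1, drop_r2=False):
    """Yield tagged reads from one input file.

    reads: iterable of (header, bases, quals, is_read2) tuples (assumed layout).
    """
    for header, bases, quals, is_read2 in reads:
        if drop_r2 and is_read2:
            continue  # keep only read 1 of each pair
        if trim > 0:
            bases, quals = bases[:trim], quals[:trim]
        # Append the barcode to the header, separated by a tab
        yield (header + "\t" + barcode, bases, quals)


def tag_and_merge(files, trim=-1, drop_r2=False):
    """files: list of (barcode, reads) pairs, handled one file at a time."""
    merged = []
    for barcode, reads in files:
        merged.extend(tag_reads(reads, barcode, trim, drop_r2))
    return merged


if __name__ == "__main__":
    pair = [("r1/1", "ACGTACGT", "IIIIIIII", False),
            ("r1/2", "TTTTCCCC", "IIIIIIII", True)]
    for record in tag_and_merge([("AAAAAA+CCCCCC", pair)], trim=4, drop_r2=True):
        print(record)
```

Processing one file at a time keeps the barcode constant for every read in the inner loop, which is why the tag can be passed in once per file rather than looked up per read.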
Memory Management
The tool uses concurrent read processing with a default memory allocation of 300MB (-Xmx300m). Input files are processed sequentially to minimize memory footprint, with each file's reads processed in batches using ConcurrentReadInputStream and ConcurrentReadOutputStream with 4-buffer read/write streams for streaming I/O.
Header Processing Modes
TagAndMerge supports two header processing modes:
- Standard Mode: Appends the complete barcode to the existing header with tab separation
- Shrink Mode: For Illumina data, creates condensed headers using LineParserS4 to extract only essential coordinate information while still appending the barcode
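The difference between the two modes can be illustrated with a small Python sketch. Which fields the tool actually keeps in shrink mode is not stated on this page; the sketch assumes a colon-delimited Illumina header of the form instrument:run:flowcell:lane:tile:x:y followed by a space and a comment, and keeps lane, tile, x and y purely for illustration. The function name tag_header is hypothetical.

```python
def tag_header(header, barcode, shrink=False):
    """Append a barcode to a read header, optionally condensing the header first."""
    if shrink:
        first = header.split(" ", 1)[0]   # drop the comment after the first space
        fields = first.split(":")
        if len(fields) >= 7:              # instrument:run:flowcell:lane:tile:x:y
            header = ":".join(fields[3:7])  # assumed subset: lane:tile:x:y
    return header + "\t" + barcode


if __name__ == "__main__":
    h = "A00178:38:H5NYYDSXX:2:1101:3007:1000 1:N:0:CAACCTAG+TGGAGATC"
    print(tag_header(h, "CAACCTAG+TGGAGATC"))
    print(tag_header(h, "CAACCTAG+TGGAGATC", shrink=True))
```

Headers that do not match the assumed layout are passed through unchanged in this sketch, so the barcode is appended in either case.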
Performance Characteristics
The tool processes demultiplexed files with specific memory and I/O patterns:
- Sequential file processing with 300MB default memory allocation (-Xmx300m)
- Concurrent I/O using ConcurrentReadInputStream and ConcurrentReadOutputStream with 4-buffer streams
- Dot-split filename parsing using Tools.dotPattern.split()
- Optional read trimming using TrimRead.trimToPosition() with exact base position control
Quality Control
The tool includes validation and reporting mechanisms:
- Validates that barcodes can be successfully extracted from all input filenames using assert(tag!=null) checks
- Maintains a LinkedHashSet<String> of unique barcodes for duplicate detection and verification
- Provides read/base statistics using Tools.timeReadsBasesProcessed() and Tools.readsBasesOut() reporting
- Ensures input and output file names don't conflict using assert(!out1.equalsIgnoreCase(s)) validation
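Two of these checks, the insertion-ordered set of unique barcodes and the input/output name comparison, can be sketched in Python as below. This illustrates the idea only and is not the tool's code: the function names are hypothetical, a dict stands in for the LinkedHashSet, and exceptions stand in for the Java assertions.

```python
def collect_barcodes(tags):
    """Return unique barcodes in first-seen order; fail if any filename had none."""
    seen = {}  # dicts preserve insertion order, like an ordered set
    for tag in tags:
        if tag is None:
            raise ValueError("no barcode could be parsed from a filename")
        seen.setdefault(tag, None)
    return list(seen)


def check_no_conflict(inputs, output):
    """Fail if the output name matches any input name, ignoring case."""
    for name in inputs:
        if name.lower() == output.lower():
            raise ValueError("output file would overwrite input: " + name)


if __name__ == "__main__":
    print(collect_barcodes(["AAAAAA", "CCCCCC", "AAAAAA"]))
    check_no_conflict(["a.AAAAAA.fq", "b.CCCCCC.fq"], "merged.fq")
```

Comparing names case-insensitively errs on the side of caution on filesystems that do not distinguish case.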
Support
For questions and support:
- Email: bbushnell@lbl.gov
- Documentation: bbmap.org