DemuxByName

Script: demuxbyname.sh; Package: jgi; Class: DemuxByName2.java

Demultiplexes sequences into multiple files based on their names, substrings of their names, or prefixes or suffixes of their names. Allows unlimited output files while maintaining only a small number of open file handles.

Basic Usage

demuxbyname.sh in=<file> in2=<file> out=<file> out2=<file> names=<string,string,...>

Alternate Usage Patterns

Barcode demultiplexing:

demuxbyname.sh in=<file> out=<file> outu=<file> names=<file> barcode

Parse barcodes from Illumina reads with headers like: @A00178:73:HH7H3DSXX:4:1101:13666:1047 1:N:0:ACGTTGGT+TGACGCAT

Whitespace-delimited demultiplexing:

demuxbyname.sh in=<file> out=<file> delimiter=whitespace prefixmode=f

Demultiplex by the substring after the last whitespace.

Fixed-length prefix demultiplexing:

demuxbyname.sh in=<file> out=<file> length=8 prefixmode=t

Demultiplex by the first 8 characters of read names.

Colon-delimited suffix demultiplexing:

demuxbyname.sh in=<file> out=<file> delimiter=: prefixmode=f

Split on colons and use the last substring as the name (useful for Illumina barcode demultiplexing).

Parameters

Parameters are organized by their function in the demultiplexing process. If input is paired and there is only one output file, it will be written interleaved. The tool uses pattern substitution where % in filenames is replaced by the demultiplexing key and # is replaced by read number (1 or 2) for paired files.
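
The substitution rule above can be sketched in a few lines of Python. This is only an illustration of the described behavior, not the tool's actual code; the helper name expand_output_name is hypothetical.

```python
def expand_output_name(pattern, key, read_number=None):
    """Fill an output pattern: % becomes the key, # becomes the read number.

    Hypothetical helper for illustration; not part of BBTools.
    """
    if "%" not in pattern:
        raise ValueError("output pattern must contain the % symbol")
    name = pattern
    # Substitute # first so that a key containing '#' is left untouched.
    if read_number is not None:
        name = name.replace("#", str(read_number))
    return name.replace("%", key)


print(expand_output_name("out_%.fq", "XX"))
print(expand_output_name("out_%_#.fq", "YY", 2))
```

With the names XX and YY from the out= description, this yields the same filenames that description lists.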

File Parameters

in=<file>: Input file. Primary sequence file to demultiplex.
in2=<file>: If input reads are paired in twin files, use in2 for the second file. Optional for paired-end data.
out=<file>: Output files for reads with matched headers (must contain % symbol). For example, out=out_%.fq with names XX and YY would create out_XX.fq and out_YY.fq. If twin files for paired reads are desired, use the # symbol. For example, out=out_%_#.fq in this case would create out_XX_1.fq, out_XX_2.fq, out_YY_1.fq, and out_YY_2.fq.
outu=<file>: Output file for reads with unmatched headers. Sequences that do not match any specified pattern will be written here.
stats=<file>: Print statistics about how many reads went to each file. Outputs tab-delimited summary with read and base counts per output file.
names=: List of strings (or files containing strings) to parse from read names. Files should contain one name per line. This is optional. If a list of names is provided, files will only be created for those names. For example, 'prefixmode=t length=5' would create a file for every unique first 5 characters in read names, and every read would be written to one of those files. But if 'names=ABCDE,FGHIJ' were also given, then at most 2 files would be created, and anything not matching those names would go to outu.
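
As a rough sketch of how a names list restricts the output in the 'prefixmode=t length=5' example, the following Python groups read names by prefix and sends everything else to an unmatched list. The function name route_by_prefix and the sample read names are made up for illustration.

```python
def route_by_prefix(read_names, length, names=None):
    """Group read names by their first `length` characters.

    If `names` is given, only those keys get a group; every other read
    is collected in `unmatched` (the role outu plays in the real tool).
    Hypothetical helper for illustration only.
    """
    allowed = set(names) if names is not None else None
    groups = {}
    unmatched = []
    for read_name in read_names:
        key = read_name[:length]
        if allowed is not None and key not in allowed:
            unmatched.append(read_name)
        else:
            groups.setdefault(key, []).append(read_name)
    return groups, unmatched


reads = ["ABCDE_r1", "FGHIJ_r2", "KLMNO_r3", "ABCDE_r4"]  # made-up names

groups, unmatched = route_by_prefix(reads, 5)
print(sorted(groups), unmatched)

groups, unmatched = route_by_prefix(reads, 5, names=["ABCDE", "FGHIJ"])
print(sorted(groups), unmatched)
```

Without a names list, every unique prefix gets its own group; with one, at most two groups exist and the KLMNO read is unmatched.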

Processing Mode Parameters (determine how to convert a read into a name)

prefixmode=t: (pm) Match prefix of read header. If false, match suffix of read header. prefixmode=f is equivalent to suffixmode=t.
barcode=f: Parse barcodes from Illumina headers. Automatically extracts barcode sequences from standard Illumina header format.
tile=f: Parse tile numbers from Illumina headers. Uses IlluminaHeaderParser2 to extract tile information for tile-based demultiplexing.
chrom=f: For mapped sam files, make one file per chromosome (scaffold) using the rname. Creates separate output files for each reference sequence.
header=f: Use the entire sequence header as the demultiplexing key. Each unique header creates its own output file.
delimiter=: For prefix or suffix mode, specifying a delimiter will allow exact matches even if the length is variable. This allows demultiplexing based on names that are found without specifying a list of names. In suffix mode, everything after the last delimiter will be used. Normally the delimiter is interpreted as a Java regular expression, so plain strings such as ':' or 'HISEQ' match literally. Some special delimiter names are replaced by the symbol they name, because typing the symbol directly can cause problems. These are provided for convenience due to OS conflicts: space, tab, whitespace, pound, greaterthan, lessthan, equals, colon, semicolon, bang, and, quote, singlequote. These are provided because they interfere with Java regular expression syntax: backslash, hat, dollar, dot, pipe, questionmark, star, plus, openparen, closeparen, opensquare, opencurly. In other words, to match '.', you should set 'delimiter=dot'.
substring=f: Names can be substrings of read headers. Substring mode is slow if the list of names is large. Requires a list of names. Uses brute-force string matching.
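
To make the delimiter behavior concrete, here is a minimal Python sketch that maps the special delimiter names to the symbols they name and then takes the first substring (prefix mode) or the last substring (suffix mode). It splits on literal strings, whereas the real tool treats the delimiter as a Java regular expression. The table and the function delimiter_key are illustrative assumptions.

```python
# Special delimiter names from the list above, mapped to the symbol each names.
# "whitespace" is handled separately because it matches any run of blanks.
SPECIAL_DELIMITERS = {
    "space": " ", "tab": "\t", "pound": "#", "greaterthan": ">",
    "lessthan": "<", "equals": "=", "colon": ":", "semicolon": ";",
    "bang": "!", "and": "&", "quote": '"', "singlequote": "'",
    "backslash": "\\", "hat": "^", "dollar": "$", "dot": ".",
    "pipe": "|", "questionmark": "?", "star": "*", "plus": "+",
    "openparen": "(", "closeparen": ")", "opensquare": "[", "opencurly": "{",
}


def delimiter_key(header, delimiter, prefixmode=True):
    """Return the first (prefix mode) or last (suffix mode) delimited substring.

    Hypothetical helper; splits literally rather than by regular expression.
    """
    if delimiter == "whitespace":
        parts = header.split()
    else:
        parts = header.split(SPECIAL_DELIMITERS.get(delimiter, delimiter))
    if not parts:
        return ""
    return parts[0] if prefixmode else parts[-1]


header = "@A00178:73:HH7H3DSXX:4:1101:13666:1047 1:N:0:ACGTTGGT+TGACGCAT"
print(delimiter_key(header, "colon", prefixmode=False))
print(delimiter_key(header, "whitespace", prefixmode=False))
```

On the Illumina header from the usage section, suffix mode with a colon delimiter yields the barcode pair, and suffix mode with whitespace yields the whole comment field.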

Other Processing Parameters

column=-1: If positive, split the header on a delimiter and match that column (1-based; the first column is 1). For example, given the header 'NB501886:61:HL3GMAFXX:1:11101:10717:1140 1:N:0:ACTGAGC+ATTAGAC', you could demux by tile (11101) using 'delimiter=: column=5'. If column is omitted when a delimiter is present, prefixmode will use the first substring, and suffixmode will use the last substring.
length=0: If positive, use a suffix or prefix of this length from read name instead of or in addition to the list of names. For example, you could create files based on the first 8 characters of read names.
hdist=0: Allow a hamming distance for demultiplexing barcodes. This requires a list of names (barcodes). It is unrelated to probability mode's hdist3. Supports both unified and dual barcode hamming distance matching with mutant generation for approximate matching.
replace=: Replace some characters in the output filenames. For example, replace=+- would replace the + symbol in headers with the - symbol in output filenames. So you could match the barcode ACTGAGC+ATTAGAC, but write to file ACTGAGC-ATTAGAC.
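
The column and replace options can be illustrated with a short Python sketch. It assumes a literal delimiter and a two-character replace value such as '+-', which is the only form shown above. The helper names column_key and apply_replace are hypothetical.

```python
def column_key(header, delimiter, column):
    """Return the 1-based `column` of `header` after splitting on a literal delimiter."""
    parts = header.split(delimiter)
    if column < 1 or column > len(parts):
        raise ValueError(f"column {column} is outside 1..{len(parts)}")
    return parts[column - 1]


def apply_replace(key, replace):
    """Swap the first character of `replace` for the second, e.g. '+-'.

    Assumes a two-character value; other forms are not covered here.
    """
    old, new = replace[0], replace[1]
    return key.replace(old, new)


header = "NB501886:61:HL3GMAFXX:1:11101:10717:1140 1:N:0:ACTGAGC+ATTAGAC"
print(column_key(header, ":", 5))
print(apply_replace("ACTGAGC+ATTAGAC", "+-"))
```

The first call returns the tile number from the column example, and the second returns the filename-safe form of the barcode pair from the replace example.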

Buffering Parameters

streams=8: Allow at most this many active streams. The actual number of open files will be 1 greater than this if outu is set, and doubled if output is paired and written in twin files instead of interleaved. Setting this to at least the number of expected output files can make things go much faster.
minreads=0: Don't create a file for fewer than this many reads; instead, send them to unknown. This option will incur additional memory usage as reads must be buffered until processing is complete.
rpb=8000: Dump buffers to files when they fill with this many reads. Higher can be faster; lower uses less memory. Controls the read buffer size for the BufferedMultiCros system.
bpb=8000000: Dump buffers to files when they contain this many bytes. Higher can be faster; lower uses less memory. Controls the byte buffer size for the BufferedMultiCros system.
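
The memory cost of minreads follows from the fact that the final count for a key is only known once all reads have been seen. A minimal Python sketch of that idea, with a hypothetical function name and made-up keys:

```python
from collections import Counter


def apply_minreads(keys, minreads):
    """Reassign keys seen fewer than `minreads` times to "unknown".

    All keys must be held until the counts are complete, which is why
    this option needs extra memory. Illustrative only.
    """
    counts = Counter(keys)
    return [k if counts[k] >= minreads else "unknown" for k in keys]


print(apply_minreads(["A", "A", "B", "A", "C", "C"], 2))
```

Here key B occurs only once, so with minreads=2 its read is routed to unknown while A and C keep their own outputs.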

Common parameters

ow=t: (overwrite) Overwrites files that already exist.
zl=4: (ziplevel) Set compression level, 1 (low) to 9 (max). Automatically detects and uses pigz/bgzip when available for faster compression.
int=auto: (interleaved) Determines whether INPUT file is considered interleaved. Auto-detection based on file format and paired-end status.
qin=auto: ASCII offset for input quality. All modern platforms use 33. Auto-detected from quality score range.
qout=auto: ASCII offset for output quality. Typically matches input quality encoding.

Java Parameters

-Xmx: This will set Java's memory usage, overriding autodetection. -Xmx20g will specify 20 gigs of RAM, and -Xmx200m will specify 200 megs. The max is typically 85% of physical memory.
-eoom: This flag will cause the process to exit if an out-of-memory exception occurs. Requires Java 8u92+.
-da: Disable assertions for slightly improved performance.

Examples

Basic Name-Based Demultiplexing

demuxbyname.sh in=reads.fq out=sample_%.fq names=sample1,sample2,sample3 outu=unmatched.fq

Demultiplexes reads into files sample_sample1.fq, sample_sample2.fq, sample_sample3.fq based on exact string matches in read headers. Unmatched reads go to unmatched.fq.

Barcode Demultiplexing with Hamming Distance

demuxbyname.sh in=reads.fq out=bc_%.fq names=ATCG,GCTA,TGCA,CGAT hdist=1 outu=unknown.fq

Allows 1 mismatch when matching barcodes, useful for handling sequencing errors in barcode sequences.

Illumina Barcode Parsing

demuxbyname.sh in=reads.fq out=barcode_%.fq barcode=t outu=nobarcodes.fq

Automatically extracts barcodes from Illumina headers and creates separate files for each unique barcode.

Paired-End Demultiplexing

demuxbyname.sh in1=reads_R1.fq in2=reads_R2.fq out=sample_%_#.fq names=A,B,C

Creates paired output files: sample_A_1.fq, sample_A_2.fq, sample_B_1.fq, sample_B_2.fq, sample_C_1.fq, sample_C_2.fq.

Prefix-Based Demultiplexing with Fixed Length

demuxbyname.sh in=reads.fq out=prefix_%.fq length=6 prefixmode=t

Creates separate files for each unique 6-character prefix found in read names.

Delimiter-Based Column Extraction

demuxbyname.sh in=reads.fq out=tile_%.fq delimiter=: column=5

For Illumina headers like "NB501886:61:HL3GMAFXX:1:11101:10717:1140", extracts column 5 (tile number "11101") for demultiplexing.

Algorithm Details

Scalable File Handle Management

DemuxByName2 is specifically designed to handle very large numbers of output files with a fixed number of file handles, making it ideal for demultiplexing Illumina NovaSeq runs or other high-throughput applications. The tool uses a BufferedMultiCros system that maintains only a small number of active file streams while supporting unlimited output files.

Buffering Strategy

The algorithm employs a dual-buffering approach:

Read-based buffering (rpb): Buffers are flushed when they contain the specified number of reads (default 8000).
Byte-based buffering (bpb): Buffers are flushed when they contain the specified number of bytes (default 8000000, about 8 MB).
This approach balances memory usage with I/O performance, allowing efficient processing of datasets with thousands of output files.
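
The flush rule can be sketched as a small Python class. It is a simplified stand-in for the idea, not the BufferedMultiCros implementation: flushes are recorded in a list instead of being written to files, and record length in characters stands in for bytes. The class name and its attributes are hypothetical.

```python
class BufferSketch:
    """Per-key buffers that flush on a read-count or byte-count threshold."""

    def __init__(self, rpb=8000, bpb=8_000_000):
        self.rpb = rpb
        self.bpb = bpb
        self.buffers = {}   # key -> list of buffered records
        self.sizes = {}     # key -> total characters buffered
        self.flushed = []   # (key, number of reads) per flush; stands in for file writes

    def add(self, key, record):
        buf = self.buffers.setdefault(key, [])
        buf.append(record)
        self.sizes[key] = self.sizes.get(key, 0) + len(record)
        # Flush as soon as either threshold is reached.
        if len(buf) >= self.rpb or self.sizes[key] >= self.bpb:
            self.flush(key)

    def flush(self, key):
        buf = self.buffers.pop(key, [])
        self.sizes.pop(key, None)
        if buf:
            # A real writer would open the file for this key, append, and close it.
            self.flushed.append((key, len(buf)))

    def close(self):
        for key in list(self.buffers):
            self.flush(key)


sketch = BufferSketch(rpb=3, bpb=20)
for _ in range(3):
    sketch.add("A", "ACGT")          # third read reaches rpb
sketch.add("B", "ACGT" * 6)          # 24 characters reaches bpb at once
sketch.add("A", "AC")
sketch.close()                       # leftover reads are flushed at the end
print(sketch.flushed)
```

Because a key's data only touches its file during a flush, the number of simultaneously open files can stay small regardless of how many keys exist.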

Processing Modes

The tool supports seven distinct processing modes:

AFFIX_MODE (1): Uses prefixes or suffixes of specified length
DELIMITER_MODE (2): Splits headers on delimiters with optional column selection
BARCODE_MODE (3): Parses Illumina-format barcodes using IlluminaHeaderParser2
SUBSTRING_MODE (4): Brute-force substring matching (slower for large name lists)
HEADER_MODE (5): Uses entire header as key
CHROM_MODE (6): Uses chromosome/scaffold names from SAM files
TILE_MODE (7): Extracts tile numbers from Illumina headers

Hamming Distance Support

For barcode demultiplexing, the tool supports hamming distance matching with the following features:

Mutant generation: Precomputes all possible mutants within the specified hamming distance
Collision detection: Automatically removes ambiguous mutants that could match multiple barcodes
Dual barcode support: For paired barcodes, can handle either unified distance or separate distances for each barcode
Memory optimization: For large barcode sets, uses separate left/right tables to prevent memory explosion
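
Mutant generation with collision removal can be sketched in Python as follows. This is an illustration under stated assumptions, not the tool's implementation: the alphabet is limited to ACGT, only single (unified) barcodes are handled, and exact barcodes are always kept even if they are within the distance of another barcode. The function names are hypothetical.

```python
from itertools import combinations, product


def mutants(barcode, hdist, alphabet="ACGT"):
    """All strings within hamming distance `hdist` of `barcode`, itself included."""
    found = {barcode}
    for d in range(1, hdist + 1):
        for positions in combinations(range(len(barcode)), d):
            for letters in product(alphabet, repeat=d):
                chars = list(barcode)
                for pos, letter in zip(positions, letters):
                    chars[pos] = letter
                found.add("".join(chars))
    return found


def build_lookup(barcodes, hdist, alphabet="ACGT"):
    """Map each unambiguous mutant to its barcode; drop mutants shared by two barcodes."""
    table = {}
    ambiguous = set()
    for barcode in barcodes:
        for mutant in mutants(barcode, hdist, alphabet):
            if mutant in table and table[mutant] != barcode:
                ambiguous.add(mutant)
            else:
                table[mutant] = barcode
    for mutant in ambiguous:
        del table[mutant]
    # Assumption of this sketch: an exact barcode always maps to itself.
    for barcode in barcodes:
        table[barcode] = barcode
    return table


table = build_lookup(["AAAA", "AATT"], hdist=1)
print(table.get("AAAC"))   # one mismatch from AAAA only
print(table.get("AAAT"))   # one mismatch from both barcodes, so removed
```

Once the table is built, matching an observed barcode is a single dictionary lookup, which is the benefit of precomputing mutants rather than comparing each read against every barcode.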

Performance Optimizations

Precompiled patterns: Delimiter patterns are compiled once for faster matching
Character-based delimiter matching: Single-character delimiters use faster indexOf operations
Threaded I/O: Uses concurrent read/write streams for improved throughput
Compression optimization: Automatically detects and uses pigz/bgzip when available
Memory management: Adaptive buffer sizing based on available system memory

Error Handling and Statistics

The tool provides comprehensive error handling and statistics:

Column validation: Warns when requested columns exceed available data
Delimiter validation: Verifies delimiter presence in headers
Cardinality tracking: Optional tracking of unique sequences per output file
Comprehensive statistics: Reports read/base counts for all output files

Support

For questions and support:

Email: bbushnell@lbl.gov
Documentation: bbmap.org