DemuxByName

Script: demuxbyname.sh Package: jgi Class: DemuxByName2.java

Demultiplexes sequences into multiple files based on their names, substrings of their names, or prefixes or suffixes of their names. Allows unlimited output files while maintaining only a small number of open file handles.

Basic Usage

demuxbyname.sh in=<file> in2=<file> out=<file> out2=<file> names=<string,string,...>

Alternate Usage Patterns

Barcode demultiplexing:

demuxbyname.sh in=<file> out=<file> outu=<file> names=<file> barcode

Parse barcodes from Illumina reads with headers like: @A00178:73:HH7H3DSXX:4:1101:13666:1047 1:N:0:ACGTTGGT+TGACGCAT
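
As a rough illustration of what parsing a barcode from such a header involves, the Python sketch below takes the text after the last colon. The helper name barcode_from_header is hypothetical and this is not the tool's own parser; it only shows the idea for headers of the shape given above.

```python
def barcode_from_header(header: str) -> str:
    """Return the text after the last colon of an Illumina-style header.

    Hypothetical helper for illustration; not part of demuxbyname.sh.
    """
    # Drop a leading '@' (FASTQ) or '>' (FASTA) marker if present.
    name = header.lstrip("@>")
    # The barcode is the final colon-separated field of the header.
    return name.rsplit(":", 1)[-1]


header = "@A00178:73:HH7H3DSXX:4:1101:13666:1047 1:N:0:ACGTTGGT+TGACGCAT"
print(barcode_from_header(header))  # ACGTTGGT+TGACGCAT
```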

Whitespace-delimited demultiplexing:

demuxbyname.sh in=<file> out=<file> delimiter=whitespace prefixmode=f

Demultiplex by the substring after the last whitespace.

Fixed-length prefix demultiplexing:

demuxbyname.sh in=<file> out=<file> length=8 prefixmode=t

Demultiplex by the first 8 characters of read names.

Colon-delimited suffix demultiplexing:

demuxbyname.sh in=<file> out=<file> delimiter=: prefixmode=f

Split on colons and use the last substring as the name (useful for Illumina barcode demultiplexing).

Parameters

Parameters are organized by their function in the demultiplexing process. If input is paired and there is only one output file, it will be written interleaved. The tool uses pattern substitution where % in filenames is replaced by the demultiplexing key and # is replaced by read number (1 or 2) for paired files.
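
The substitution rule can be pictured with a short Python sketch. The function expand_output_pattern is a hypothetical name, and the code only mirrors the rule as described here (% becomes the key, # becomes 1 or 2); it is not the tool's implementation.

```python
from typing import List


def expand_output_pattern(pattern: str, key: str) -> List[str]:
    """Expand an out= style pattern for one demultiplexing key.

    Illustrative only: returns two filenames when the pattern contains '#'
    (twin files for paired reads), otherwise one filename.
    """
    if "%" not in pattern:
        raise ValueError("output pattern must contain the % symbol")
    if "#" in pattern:
        # Substitute the read number first so a key containing '#' is left alone.
        return [pattern.replace("#", n).replace("%", key) for n in ("1", "2")]
    return [pattern.replace("%", key)]


print(expand_output_pattern("out_%.fq", "XX"))    # ['out_XX.fq']
print(expand_output_pattern("out_%_#.fq", "YY"))  # ['out_YY_1.fq', 'out_YY_2.fq']
```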

File Parameters

in=<file>
Input file. Primary sequence file to demultiplex.
in2=<file>
If input reads are paired in twin files, use in2 for the second file. Optional for paired-end data.
out=<file>
Output files for reads with matched headers (must contain % symbol). For example, out=out_%.fq with names XX and YY would create out_XX.fq and out_YY.fq. If twin files for paired reads are desired, use the # symbol. For example, out=out_%_#.fq in this case would create out_XX_1.fq, out_XX_2.fq, out_YY_1.fq, and out_YY_2.fq.
outu=<file>
Output file for reads with unmatched headers. Sequences that do not match any specified pattern will be written here.
stats=<file>
Print statistics about how many reads went to each file. Outputs tab-delimited summary with read and base counts per output file.
names=
List of strings (or files containing strings) to parse from read names. Files should contain one name per line. This is optional. If a list of names is provided, files will only be created for those names. For example, 'prefixmode=t length=5' would create a file for every unique first 5 characters in read names, and every read would be written to one of those files. But if 'names=ABCDE,FGHIJ' were additionally given, at most 2 files would be created, and anything not matching those names would go to outu.
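
The effect of a names list on a fixed-length prefix can be sketched in Python as follows. The function route_by_prefix and the read names are made up for illustration; the sketch only restates the behavior described for names= and is not taken from the tool's source.

```python
from typing import Optional, Set


def route_by_prefix(read_name: str, length: int,
                    names: Optional[Set[str]] = None) -> Optional[str]:
    """Return the demultiplexing key, or None if the read would go to outu."""
    key = read_name[:length]
    # Without a names list, every unique prefix gets its own file.
    if names is not None and key not in names:
        return None
    return key


names = {"ABCDE", "FGHIJ"}
for read_name in ["ABCDE_read1", "FGHIJ_read2", "KLMNO_read3"]:
    print(read_name, "->", route_by_prefix(read_name, 5, names) or "outu")
```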

Processing Mode Parameters (determine how to convert a read into a name)

prefixmode=t
(pm) Match prefix of read header. If false, match suffix of read header. prefixmode=f is equivalent to suffixmode=t.
barcode=f
Parse barcodes from Illumina headers. Automatically extracts barcode sequences from standard Illumina header format.
tile=f
Parse tile numbers from Illumina headers. Uses IlluminaHeaderParser2 to extract tile information for tile-based demultiplexing.
chrom=f
For mapped sam files, make one file per chromosome (scaffold) using the rname. Creates separate output files for each reference sequence.
header=f
Use the entire sequence header as the demultiplexing key. Each unique header creates its own output file.
delimiter=
For prefix or suffix mode, specifying a delimiter allows exact matches even if the length is variable. This allows demultiplexing based on names that are found without specifying a list of names. In suffix mode, everything after the last delimiter will be used. Normally the delimiter is used as written and interpreted as a Java regular expression; for example, ':' or 'HISEQ'. Some special delimiter names are replaced by the symbol they name, because the symbols themselves can cause problems. These are provided for convenience due to OS conflicts: space, tab, whitespace, pound, greaterthan, lessthan, equals, colon, semicolon, bang, and, quote, singlequote. These are provided because the symbols interfere with Java regular expression syntax: backslash, hat, dollar, dot, pipe, questionmark, star, plus, openparen, closeparen, opensquare, opencurly. In other words, to match '.', you should set 'delimiter=dot'.
substring=f
Names can be substrings of read headers. Substring mode is slow if the list of names is large. Requires a list of names. Uses brute-force string matching.
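
To make the delimiter behavior more concrete, here is a small Python sketch of resolving a special delimiter name and taking the first or last substring. The mapping covers only a few of the names listed for delimiter=, the symbols assigned to them are assumptions based on what each name suggests, and Python's re module stands in for Java regular expressions, so details may differ from the tool.

```python
import re

# Assumed mapping from a few special delimiter names to regex patterns.
SPECIAL_DELIMITERS = {
    "space": " ",
    "tab": "\t",
    "whitespace": r"\s+",
    "colon": ":",
    "semicolon": ";",
    "pound": "#",
    "dot": r"\.",
    "pipe": r"\|",
    "plus": r"\+",
    "star": r"\*",
}


def key_from_delimiter(header: str, delimiter: str, prefixmode: bool = True) -> str:
    """Split on the delimiter; keep the first (prefix) or last (suffix) part."""
    # Anything not in the table is used as written, as a regular expression.
    pattern = SPECIAL_DELIMITERS.get(delimiter, delimiter)
    parts = re.split(pattern, header)
    return parts[0] if prefixmode else parts[-1]


print(key_from_delimiter("sample7.lane2.read9", "dot", prefixmode=False))  # read9
print(key_from_delimiter("read_1 extra\tACGT", "whitespace", prefixmode=False))  # ACGT
```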

Other Processing Parameters

column=-1
If positive, split the header on a delimiter and match that column (1-based; the first column is 1). For example, given the header 'NB501886:61:HL3GMAFXX:1:11101:10717:1140 1:N:0:ACTGAGC+ATTAGAC', you could demux by tile (11101) using 'delimiter=: column=5'. If column is omitted when a delimiter is present, prefixmode will use the first substring, and suffixmode will use the last substring.
length=0
If positive, use a suffix or prefix of this length from read name instead of or in addition to the list of names. For example, you could create files based on the first 8 characters of read names.
hdist=0
Allow a hamming distance for demultiplexing barcodes. This requires a list of names (barcodes). It is unrelated to probability mode's hdist3. Supports both unified and dual barcode hamming distance matching with mutant generation for approximate matching.
replace=
Replace some characters in the output filenames. For example, replace=+- would replace the + symbol in headers with the - symbol in output filenames. So you could match the barcode ACTGAGC+ATTAGAC, but write to file ACTGAGC-ATTAGAC.
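
The replace=+- example can be mimicked in a couple of lines of Python. This sketch assumes the value is exactly two characters (the one to replace, then its replacement); whether the tool accepts longer values is not stated here, and apply_replace is a hypothetical name.

```python
def apply_replace(key: str, replace: str) -> str:
    """Swap one character for another, as in replace=+- (sketch only)."""
    if len(replace) != 2:
        raise ValueError("expected exactly two characters, e.g. '+-'")
    old, new = replace
    return key.replace(old, new)


print(apply_replace("ACTGAGC+ATTAGAC", "+-"))  # ACTGAGC-ATTAGAC
```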

Buffering Parameters

streams=8
Allow at most this many active streams. The actual number of open files will be 1 greater than this if outu is set, and doubled if output is paired and written in twin files instead of interleaved. Setting this to at least the number of expected output files can make things go much faster.
minreads=0
Don't create a file for fewer than this many reads; instead, send them to unknown. This option will incur additional memory usage as reads must be buffered until processing is complete.
rpb=8000
Dump buffers to files when they fill with this many reads. Higher can be faster; lower uses less memory. Controls the read buffer size for the BufferedMultiCros system.
bpb=8000000
Dump buffers to files when they contain this many bytes. Higher can be faster; lower uses less memory. Controls the byte buffer size for the BufferedMultiCros system.
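
The two thresholds can be thought of as a buffer that is dumped as soon as either limit is reached. The Python class below is a toy model of that idea, with hypothetical names and no real file I/O; it is not how BufferedMultiCros is written, only a sketch of the rpb/bpb trade-off.

```python
class ReadBuffer:
    """Toy buffer that flushes once it holds rpb reads or bpb bytes."""

    def __init__(self, rpb: int = 8000, bpb: int = 8000000):
        self.rpb = rpb
        self.bpb = bpb
        self.reads = []
        self.nbytes = 0
        self.flushes = 0  # number of times the buffer was dumped

    def add(self, read: str) -> bool:
        """Add a read; return True if this triggered a flush."""
        self.reads.append(read)
        # Character count equals byte count for plain ASCII reads.
        self.nbytes += len(read)
        if len(self.reads) >= self.rpb or self.nbytes >= self.bpb:
            self.flush()
            return True
        return False

    def flush(self) -> None:
        # A real implementation would write self.reads to the output file here.
        self.reads = []
        self.nbytes = 0
        self.flushes += 1


buf = ReadBuffer(rpb=3, bpb=100)
results = [buf.add("ACGT" * 5) for _ in range(7)]
print(results, buf.flushes)
```

Smaller limits mean more frequent, smaller writes and less memory held per output file; larger limits mean the opposite.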

Common parameters

ow=t
(overwrite) Overwrites files that already exist.
zl=4
(ziplevel) Set compression level, 1 (low) to 9 (max). Automatically detects and uses pigz/bgzip when available for faster compression.
int=auto
(interleaved) Determines whether INPUT file is considered interleaved. Auto-detection based on file format and paired-end status.
qin=auto
ASCII offset for input quality. All modern platforms use 33. Auto-detected from quality score range.
qout=auto
ASCII offset for output quality. Typically matches input quality encoding.

Java Parameters

-Xmx
This will set Java's memory usage, overriding autodetection. -Xmx20g will specify 20 gigs of RAM, and -Xmx200m will specify 200 megs. The max is typically 85% of physical memory.
-eoom
This flag will cause the process to exit if an out-of-memory exception occurs. Requires Java 8u92+.
-da
Disable assertions for slightly improved performance.

Examples

Basic Name-Based Demultiplexing

demuxbyname.sh in=reads.fq out=sample_%.fq names=sample1,sample2,sample3 outu=unmatched.fq

Demultiplexes reads into files sample_sample1.fq, sample_sample2.fq, and sample_sample3.fq according to which name matches the prefix of each read header (prefixmode=t is the default). Unmatched reads go to unmatched.fq.

Barcode Demultiplexing with Hamming Distance

demuxbyname.sh in=reads.fq out=bc_%.fq names=ATCG,GCTA,TGCA,CGAT hdist=1 outu=unknown.fq

Allows 1 mismatch when matching barcodes, useful for handling sequencing errors in barcode sequences.
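
For readers unfamiliar with Hamming distance, the Python sketch below shows one simple way to assign an observed barcode to a name within a given distance. It compares the observed barcode against every name directly rather than generating mutants, and it treats a barcode that is within range of more than one name as unmatched; both choices are assumptions made for the sketch, not a description of the tool's behavior.

```python
from typing import Iterable, Optional


def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs equal-length strings")
    return sum(x != y for x, y in zip(a, b))


def match_barcode(observed: str, names: Iterable[str], hdist: int = 0) -> Optional[str]:
    """Return the single name within hdist of observed, else None."""
    hits = [n for n in names
            if len(n) == len(observed) and hamming(n, observed) <= hdist]
    # Assumption for this sketch: ambiguous or missing matches are unmatched.
    return hits[0] if len(hits) == 1 else None


names = ["ATCG", "GCTA", "TGCA", "CGAT"]
print(match_barcode("ATCC", names, hdist=1))  # ATCG
print(match_barcode("ATCC", names, hdist=0))  # None
```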

Illumina Barcode Parsing

demuxbyname.sh in=reads.fq out=barcode_%.fq barcode=t outu=nobarcodes.fq

Automatically extracts barcodes from Illumina headers and creates separate files for each unique barcode.

Paired-End Demultiplexing

demuxbyname.sh in=reads_R1.fq in2=reads_R2.fq out=sample_%_#.fq names=A,B,C

Creates paired output files: sample_A_1.fq, sample_A_2.fq, sample_B_1.fq, sample_B_2.fq, sample_C_1.fq, sample_C_2.fq.

Prefix-Based Demultiplexing with Fixed Length

demuxbyname.sh in=reads.fq out=prefix_%.fq length=6 prefixmode=t

Creates separate files for each unique 6-character prefix found in read names.

Delimiter-Based Column Extraction

demuxbyname.sh in=reads.fq out=tile_%.fq delimiter=: column=5

For Illumina headers like "NB501886:61:HL3GMAFXX:1:11101:10717:1140", extracts column 5 (tile number "11101") for demultiplexing.
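
The column lookup amounts to splitting the header and indexing with a 1-based position. A minimal Python sketch, using the hypothetical helper column_key and raising an error for out-of-range columns (what the tool itself does in that case is not covered here):

```python
def column_key(header: str, delimiter: str, column: int) -> str:
    """Return the 1-based column of header after splitting on delimiter."""
    if column < 1:
        raise ValueError("column is 1-based and must be positive")
    parts = header.split(delimiter)
    if column > len(parts):
        raise ValueError("header has only %d columns" % len(parts))
    return parts[column - 1]


header = "NB501886:61:HL3GMAFXX:1:11101:10717:1140"
print(column_key(header, ":", 5))  # 11101
```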

Algorithm Details

Scalable File Handle Management

DemuxByName2 is specifically designed to handle very large numbers of output files with a fixed number of file handles, making it ideal for demultiplexing Illumina NovaSeq runs or other high-throughput applications. The tool uses a BufferedMultiCros system that maintains only a small number of active file streams while supporting unlimited output files.

Buffering Strategy

The algorithm employs a dual-buffering approach:

Processing Modes

The tool supports seven distinct processing modes:

Hamming Distance Support

For barcode demultiplexing, the tool supports hamming distance matching with the following features:

Performance Optimizations

Error Handling and Statistics

The tool provides comprehensive error handling and statistics:

Support

For questions and support: