NovaDemux

Script: novademux.sh Package: barcode Class: NovaDemux.java

Demultiplexes sequencer reads into multiple files based on their barcodes. Uses statistical analysis to ensure optimal yield and minimal crosstalk in the presence of errors. Barcodes (indexes) must be embedded in read headers, and the expected barcodes must be provided as a text file with one barcode (or barcode pair) per line.

Basic Usage

novademux.sh in=reads.fq out=out_%.fq outu=unknown.fq expected=barcodes.txt

For Twin Files

novademux.sh in=in_#.fq out=out_%_#.fq.gz outu=unk_#.fq expected=barcodes.txt

Parameters

Parameters are organized by their function in the demultiplexing process. NovaDemux provides PCRMatrix-based probabilistic assignment using position-specific error probability matrices for accurate barcode assignment with minimal crosstalk.

File Parameters

in=<file>: Input file. The primary input file containing reads to be demultiplexed.
in2=<file>: If input reads are paired in twin files, use in2 for the second file. You can alternatively use the # symbol, e.g. 'in=read_#.fastq.gz', which is equivalent to 'in1=read_1.fastq.gz in2=read_2.fastq.gz'.
out=<file>: Output files for reads with matched headers (must contain % symbol). For example, out=out_%.fq with indexes XX and YY would create out_XX.fq and out_YY.fq. If twin files for paired reads are desired, use the # symbol. For example, out=out_%_#.fq in this case would create out_XX_1.fq, out_XX_2.fq, out_YY_1.fq, and out_YY_2.fq.
outu=<file>: Output file for reads with unmatched headers. Reads that cannot be confidently assigned to any barcode will be written here.
stats=<file>: Print statistics about how many reads went to each file. Provides detailed counts and yield information.
expected=: List of barcodes (or files containing barcodes) to parse from read headers. Files should contain one barcode per line. For example, 'expected=barcodes.txt' or 'expected=ACGTACGT,GGTTAACC,AACCGGTT'. This list must contain all pooled barcodes to ensure accuracy, including PhiX if present.
writeempty=t: Write empty files for expected but absent barcodes. Default: true. Set to false to avoid creating files for barcodes with zero reads.
subset=: Optional list of barcodes when only some output files are desired; only demultiplex these libraries. Comma-separated list of specific barcodes to process.
nosplit=f: When true, dump all reads to outu instead of individual files. Useful for labeling reads without creating separate files. Default: false.
rename=f: When true, append the assigned barcode (or 'unknown') to each read header, after a tab. Can be used in conjunction with nosplit to simply label reads. Default: false.
rc1=f: Reverse-complement index1 from expected and samplemap. Set to true if the first barcode needs reverse complementing. Default: false.
rc2=f: Reverse-complement index2 from expected and samplemap. Set to true if the second barcode needs reverse complementing. Default: false.
addpolyg=f: It is recommended to set this to true on a platform where no signal is read as G. This will add poly-G as a dummy expected barcode. If no signal yields a different base call, use the appropriate flag (addpolyc, etc). Default: false.
remap=: Change symbols for output filenames. For example, remap=+- would output barcode ACGT+TGCA to file ACGT-TGCA.fq.gz. Useful for filesystems that don't support certain characters.
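
The filename substitutions described for out= and remap= can be illustrated with a short Python sketch. This is not NovaDemux code: the helper name output_paths is hypothetical, and it assumes remap is a two-character string (character to replace, then its replacement), which is the only form shown above.

```python
def output_paths(pattern, barcode, remap=None):
    """Expand an out= pattern for one barcode (illustrative helper only)."""
    if "%" not in pattern:
        raise ValueError("out= pattern must contain the % symbol")
    label = barcode
    if remap:
        # Assumption: remap is exactly two characters, e.g. "+-" turns '+' into '-'.
        src, dst = remap
        label = barcode.replace(src, dst)
    if "#" in pattern:
        # Twin files: '#' becomes 1 and 2 for the two mates.
        variants = [pattern.replace("#", "1"), pattern.replace("#", "2")]
    else:
        variants = [pattern]
    return [v.replace("%", label) for v in variants]


print(output_paths("out_%_#.fq", "XX"))
print(output_paths("%.fq.gz", "ACGT+TGCA", remap="+-"))
```

With the pattern out_%_#.fq and barcodes XX and YY, this reproduces the four filenames listed under out=.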

Legacy Output Stats File Support Parameters

legacy=: Set this to a path like '.' to output legacy stats files. Creates multiple CSV files compatible with older demultiplexing workflows.
samplemap=: An input csv or tsv containing barcodes and sample names, for legacy stats. If present, 'expected' can be omitted. The file should have a barcode column and a sample name column.
lane=0: Set this to a number to print the lane in legacy files. Used for multi-lane flowcell processing. Default: 0.

Barcode Parsing Mode Parameters (choose one)

barcode: Parse the barcode automatically, assuming the standard Illumina header format. This is the default mode and works with most modern sequencing data.
header: Match the entire read header. Use this mode when the entire header serves as the identifier.
prefix: Match the prefix of the read header (length must be set). Extracts a fixed number of characters from the beginning of each header.
suffix: Match the suffix of the read header (length must be set). Extracts a fixed number of characters from the end of each header.
hdelimiter=: (headerdelimiter) Split the header using this delimiter, then select a term (column must be set). Normally the delimiter is interpreted as a Java regular expression, so plain strings such as ':' or 'HISEQ' match literally. Some symbols cause problems when typed directly, so they have special names that are replaced by the symbol they name. These names are provided for convenience due to OS conflicts: space, tab, whitespace, pound, greaterthan, lessthan, equals, colon, semicolon, bang, and, quote, singlequote. These names are provided because the symbols interfere with Java regular expression syntax: backslash, hat, dollar, dot, pipe, questionmark, star, plus, openparen, closeparen, opensquare, opencurly. In other words, to match '.', you should set 'hdelimiter=dot'.
length=0: For prefix or suffix mode, use this many characters from the read header. Must be positive in these modes. Default: 0.
column=0: Select the term when using a header delimiter. This is 1-based (first term is column 1) so it must be positive. Default: 0.
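
As a rough illustration of the header, prefix, suffix, and hdelimiter modes, the following Python sketch extracts a barcode string from a header. The function name and the mode strings are hypothetical, only a subset of the named delimiters is included, and Python's regular expression syntax stands in for Java's.

```python
import re

# Subset of the symbolic delimiter names, mapped to the characters they name.
NAMED_DELIMITERS = {
    "space": " ", "tab": "\t", "pound": "#", "colon": ":", "semicolon": ";",
    "bang": "!", "dot": ".", "pipe": "|", "plus": "+", "star": "*",
}


def barcode_from_header(header, mode, length=0, delimiter=None, column=0):
    """Pull the barcode string out of a read header (illustrative only)."""
    if mode == "header":
        return header
    if mode in ("prefix", "suffix"):
        if length <= 0:
            raise ValueError("length must be positive in prefix/suffix mode")
        return header[:length] if mode == "prefix" else header[-length:]
    if mode == "hdelimiter":
        if column <= 0:
            raise ValueError("column is 1-based and must be positive")
        if delimiter in NAMED_DELIMITERS:
            # Named symbols are matched literally.
            pattern = re.escape(NAMED_DELIMITERS[delimiter])
        else:
            # Anything else is passed through as a regular expression.
            pattern = delimiter
        return re.split(pattern, header)[column - 1]
    raise ValueError(f"unknown mode: {mode}")


h = "M01:23:FC:1:1101:1000:2000 1:N:0:ACGT+TGCA"
print(barcode_from_header(h, "hdelimiter", delimiter="colon", column=10))
print(barcode_from_header(h, "suffix", length=9))
```

Both calls select the same ACGT+TGCA term from the example header, once by column and once by fixed length.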

Barcode Assignment Mode Parameters (choose one)

mode=prob: Selects the assignment mode; one of prob, tile, or hdist. Default: prob.
  prob: Assigns reads to the bin where they most likely belong, from gathering statistics across the pool. Uses PCRMatrix position-specific error probability matrices with Bayesian inference for barcode assignment.
  tile: Similar to prob, but calculates statistics on a per-tile basis for higher precision. This mode is recommended as long as the tile numbers are in the read headers.
  hdist: Demultiplex reads to the bin with the fewest mismatches. This is the fastest and least accurate mode. Here, 'hdist' stands for 'Hamming distance'.
Note: prob and tile mode may require a license.

Server Parameters (for prob Mode only)

server=auto: Controls where barcode counts are processed. Default: auto.
  true: Barcode counts are sent to a remote server for processing, and barcode assignments are sent back.
  false: Barcode counts are processed locally.
  auto: Sets flag to false unless the local machine contains proprietary probabilistic processing code.

Sensitivity Cutoff Parameters for Prob/Tile Mode

maxhdist=6: Maximum Hamming distance (number of mismatches) allowed. Lower values will reduce yield with little benefit. Default: 6 for prob/tile mode.
pairhdist=f: When true, maxhdist will apply to the Hamming distance of both barcodes combined (if using dual indexes). When false, maxhdist will apply to each barcode individually. Default: false.
minratio=1m: Minimum ratio of probabilities allowed; k/m/b suffixes are allowed. minratio=1m will only assign a barcode to a bin if it is at least 1 million times more likely to belong in that bin than in all others combined. Lower values will increase yield but may increase crosstalk. Default: 1m.
minprob=-5.6: Discard reads with a lower probability than this of belonging to any bin. This is log10-scale, so -5 means 10^-5=0.00001. Lower values will increase yield but increase crosstalk. E.g., -6 would be lower than -5. Default: -5.6.
matrixthreads=1: Using more threads is faster but adds nondeterminism. Controls parallelism in probabilistic calculations. Default: 1.

Note: These cutoffs are optimized for dual 10bp indexes. For single 10bp indexes, 'minratio=5000 minprob=-3.2' is recommended.
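
To make the two cutoffs concrete, here is a minimal Python sketch of how minratio and minprob could be applied once per-bin probabilities are known. The function names are hypothetical, and the way NovaDemux actually computes and normalizes these probabilities is not specified here; the sketch only mirrors the parameter descriptions above.

```python
import math


def parse_kmb(text):
    """Parse numbers with optional k/m/b suffixes, e.g. '1m' -> 1000000.0."""
    multipliers = {"k": 1e3, "m": 1e6, "b": 1e9}
    text = text.strip().lower()
    if text and text[-1] in multipliers:
        return float(text[:-1]) * multipliers[text[-1]]
    return float(text)


def passes_cutoffs(probs, minratio="1m", minprob=-5.6):
    """probs maps each bin to the probability that the read belongs there.

    Returns the winning bin, or None if either cutoff rejects the read.
    """
    best = max(probs, key=probs.get)
    p_best = probs[best]
    others = sum(p for b, p in probs.items() if b != best)
    if p_best <= 0 or math.log10(p_best) < minprob:
        return None  # too unlikely to belong to any bin (log10 scale)
    if others > 0 and p_best / others < parse_kmb(minratio):
        return None  # not sufficiently more likely than all other bins combined
    return best


print(passes_cutoffs({"A": 0.9, "B": 1e-8}))
print(passes_cutoffs({"A": 0.9, "B": 1e-3}))
```

In the second call the best bin is only 900 times more likely than the alternative, so the default minratio of 1m rejects the read; a looser setting such as minratio=500 would accept it.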

Sensitivity Cutoff Parameters for HDist Mode

maxhdist=1: Maximum Hamming distance (number of mismatches) allowed. Lower values will reduce yield and decrease crosstalk. Setting maxhdist=0 will allow exact matches only. Default: 1 for hdist mode.
pairhdist=f: When true, maxhdist will apply to the Hamming distance of both barcodes combined (if using dual indexes). When false, maxhdist will apply to each barcode individually. Default: false.
clearzone=1: (cz) Minimum difference between the closest and second-closest Hamming distances. For example, AAAG is 1 mismatch from AAAA and 3 mismatches away from GGGG, for a margin of 3-1=2. This would be demultiplexed into AAAA as long as the clearzone is set to at most 2. Lower values increase both yield and crosstalk. Default: 1.
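
The interaction between maxhdist and clearzone can be sketched in Python as follows. This is a simplified single-index illustration with hypothetical function names; it does not model pairhdist or dual indexes.

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length barcodes."""
    if len(a) != len(b):
        raise ValueError("barcodes must have equal length")
    return sum(x != y for x, y in zip(a, b))


def assign_hdist(observed, expected, maxhdist=1, clearzone=1):
    """Return the closest expected barcode, or None if the read is ambiguous."""
    ranked = sorted((hamming(observed, e), e) for e in expected)
    best_dist, best = ranked[0]
    if best_dist > maxhdist:
        return None  # too many mismatches
    if len(ranked) > 1 and ranked[1][0] - best_dist < clearzone:
        return None  # margin to the runner-up is smaller than the clearzone
    return best


print(assign_hdist("AAAG", ["AAAA", "GGGG"], maxhdist=1, clearzone=2))
print(assign_hdist("AAAG", ["AAAA", "GGGG"], maxhdist=1, clearzone=3))
```

The first call assigns AAAG to AAAA because the margin is 3-1=2; the second returns None because a clearzone of 3 exceeds that margin.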

Buffering Parameters

streams=8: Allow at most this many active streams. The actual number of open files will be 1 greater than this if outu is set, and doubled if output is paired and written in twin files instead of interleaved. Setting this to at least the number of expected output files can make things go much faster. Default: 8.
minreads=0: Don't create a file for fewer than this many reads; instead, send them to unknown. This option will incur additional memory usage as reads must be retained until processing is complete. Default: 0.
rpb=8000: Dump buffers to files when they fill with this many reads. Higher can be faster; lower uses less memory. Default: 8000.
bpb=8000000: Dump buffers to files when they contain this many bytes. Higher can be faster; lower uses less memory. Default: 8000000 (8MB).
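
The effect of rpb and bpb can be pictured with a small Python sketch that keeps one in-memory buffer per output and dumps it when either threshold is reached. The class name and the flush callback are hypothetical, and applying the thresholds per buffer (rather than globally) is an assumption made for illustration.

```python
from collections import defaultdict


class ReadBuffers:
    """Per-output read buffers that are dumped when they fill (illustrative)."""

    def __init__(self, flush, rpb=8000, bpb=8_000_000):
        self.flush = flush          # callback: flush(key, list_of_records)
        self.rpb = rpb              # reads per buffer
        self.bpb = bpb              # bytes per buffer
        self.reads = defaultdict(list)
        self.nbytes = defaultdict(int)

    def add(self, key, record):
        self.reads[key].append(record)
        self.nbytes[key] += len(record.encode())
        if len(self.reads[key]) >= self.rpb or self.nbytes[key] >= self.bpb:
            self.dump(key)

    def dump(self, key):
        if self.reads[key]:
            self.flush(key, self.reads[key])
        self.reads[key] = []
        self.nbytes[key] = 0

    def close(self):
        # Write out whatever is still buffered.
        for key in list(self.reads):
            self.dump(key)


written = []
buffers = ReadBuffers(lambda k, recs: written.append((k, len(recs))), rpb=2)
for rec in ("r1\n", "r2\n", "r3\n"):
    buffers.add("XX", rec)
buffers.close()
print(written)
```

Larger thresholds mean fewer, bigger writes at the cost of holding more reads in memory, which matches the speed and memory trade-off described for rpb and bpb.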

Spike-in Processing Parameters (particularly for spike-ins with no barcodes)

spikelabel=: If and only if a spike-in label is set here, reads will be aligned to a reference, and matching reads will be sent to the file with this label. May be a barcode or other string. Useful for PhiX or other control sequences.
refpath=phix: Override this with a file path for a custom reference. Default path points to the built-in PhiX sequence. Default: phix.
kspike=27: Use this kmer length to map reads to the spike-in reference. Longer kmers are more specific but less sensitive. Default: 27.
minid=0.7: Identity cutoff for matching the reference. Reads must have at least this fraction identity to be considered spike-in matches. Default: 0.7 (70%).
mapall=f: Map all reads to the reference, instead of just unassigned reads. When false, only reads that don't match expected barcodes are tested against spike-in. Default: false.
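
The roles of kspike and minid can be illustrated with a toy seed-and-check sketch in Python. This is not the aligner NovaDemux uses: the function names are hypothetical, only the forward strand is considered, gaps are ignored, and the first k-mer hit decides the placement.

```python
def kmer_index(ref, k):
    """Map each k-mer in the reference to its first position."""
    index = {}
    for i in range(len(ref) - k + 1):
        index.setdefault(ref[i:i + k], i)
    return index


def matches_spike(read, ref, index, k=27, minid=0.7):
    """True if a k-mer seed places the read on ref with identity >= minid."""
    for j in range(len(read) - k + 1):
        pos = index.get(read[j:j + k])
        if pos is None:
            continue
        start = pos - j
        if start < 0 or start + len(read) > len(ref):
            continue  # read would hang off the end of the reference
        window = ref[start:start + len(read)]
        same = sum(a == b for a, b in zip(read, window))
        return same / len(read) >= minid
    return False


ref = "ACGTTGCAAGGCTTAACCGGTA"        # toy reference, not PhiX
idx = kmer_index(ref, 5)
read = ref[3:15][:-1] + "C"           # 12 bases with one mismatch at the end
print(matches_spike(read, ref, idx, k=5, minid=0.7))
```

A longer k makes an accidental seed hit less likely but also means a single error can destroy every seed in a short read, which is the specificity and sensitivity trade-off noted for kspike.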

Common Parameters

ow=t: (overwrite) Overwrites files that already exist. Set to false to prevent accidental overwriting. Default: true.
zl=4: (ziplevel) Set compression level, 1 (low) to 9 (max). Higher levels provide better compression but use more CPU time. Default: 4.
int=auto: (interleaved) Determines whether INPUT file is considered interleaved. Auto-detection works in most cases. Default: auto.

Java Parameters

-Xmx: This will set Java's memory usage, overriding autodetection. -Xmx20g will specify 20 gigs of RAM, and -Xmx200m will specify 200 megs. The max is typically 85% of physical memory.
-eoom: This flag will cause the process to exit if an out-of-memory exception occurs. Requires Java 8u92+.
-da: Disable assertions. May provide minor performance improvement in production environments.

Examples

Basic Demultiplexing

# Simple single-end demultiplexing
novademux.sh in=reads.fq out=demux_%.fq outu=unmatched.fq expected=barcodes.txt

# Paired-end demultiplexing with twin files
novademux.sh in=reads_#.fq out=demux_%_#.fq outu=unmatched_#.fq expected=barcodes.txt

Basic demultiplexing using standard Illumina barcode parsing with statistical assignment.

High Precision Mode

# Use tile-based statistics for higher precision
novademux.sh in=reads.fq out=demux_%.fq outu=unmatched.fq expected=barcodes.txt mode=tile

# Stricter cutoffs to minimize crosstalk
novademux.sh in=reads.fq out=demux_%.fq outu=unmatched.fq expected=barcodes.txt \
    mode=prob minratio=10m minprob=-6

High precision demultiplexing with per-tile statistics and strict probability cutoffs.

Fast Hamming Distance Mode

# Fast mode using Hamming distance
novademux.sh in=reads.fq out=demux_%.fq outu=unmatched.fq expected=barcodes.txt \
    mode=hdist maxhdist=1 clearzone=2

# Exact matches only
novademux.sh in=reads.fq out=demux_%.fq outu=unmatched.fq expected=barcodes.txt \
    mode=hdist maxhdist=0

Fast demultiplexing using simple Hamming distance calculations.

Custom Header Parsing

# Parse barcode from specific header field
novademux.sh in=reads.fq out=demux_%.fq outu=unmatched.fq expected=barcodes.txt \
    hdelimiter=: column=4

# Use prefix of header as barcode
novademux.sh in=reads.fq out=demux_%.fq outu=unmatched.fq expected=barcodes.txt \
    prefix=t length=8

Custom header parsing for non-standard barcode formats.

With PhiX Spike-in

# Demultiplex with PhiX control
novademux.sh in=reads.fq out=demux_%.fq outu=unmatched.fq expected=barcodes.txt \
    spikelabel=PhiX refpath=phix minid=0.8

Demultiplexing with PhiX spike-in detection; reads matching the reference are written to the file labeled PhiX.

Legacy Output Files

# Generate legacy CSV files
novademux.sh in=reads.fq out=demux_%.fq outu=unmatched.fq \
    samplemap=samples.csv legacy=. lane=1

Generate legacy-format output files for compatibility with older analysis pipelines.

Algorithm Details

Statistical Demultiplexing Engine

NovaDemux implements PCRMatrix-based probabilistic algorithms for accurate barcode assignment while minimizing crosstalk between samples. The tool operates through several key processing stages:

Barcode Counting and Statistics

The algorithm first performs a complete pass through the input data to count all observed barcodes. This produces a frequency distribution, stored in a HashMap, that reveals both expected barcodes and potential index-hopping events. The BarcodeCounter class implements the HashMap-based counting and reads input through ConcurrentReadInputStream for parallel processing of large datasets.
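
The counting pass can be pictured with a few lines of Python. This is a single-threaded sketch rather than the BarcodeCounter implementation, and it assumes standard Illumina headers in which the barcode is the final colon-separated field.

```python
from collections import Counter


def count_barcodes(fastq_lines):
    """Count barcodes from the header line of each 4-line FASTQ record."""
    counts = Counter()
    for i, line in enumerate(fastq_lines):
        if i % 4 == 0:  # header line
            counts[line.rstrip().rsplit(":", 1)[-1]] += 1
    return counts


lines = [
    "@r1 1:N:0:ACGT", "AAAA", "+", "IIII",
    "@r2 1:N:0:ACGT", "CCCC", "+", "IIII",
    "@r3 1:N:0:ACGA", "GGGG", "+", "IIII",
]
print(count_barcodes(lines).most_common())
```

In a real pool, the expected barcodes dominate such a table, while rare neighbors like ACGA here are candidates for sequencing errors or index hopping.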

PCR Matrix Analysis

For probabilistic modes (prob and tile), NovaDemux uses the PCRMatrix framework to model the expected distribution of barcodes based on:

Expected barcode frequencies: Based on the provided barcode list
Error modeling: Statistical modeling of sequencing and PCR errors
Crosstalk prediction: Analysis of potential index-hopping between samples
Tile-specific effects: Per-tile bias correction when tile mode is enabled

Assignment Algorithms

NovaDemux provides three distinct assignment algorithms:

Probabilistic Mode (prob): Uses Bayesian inference to assign each read to the most likely barcode based on the observed frequency distribution and error model. Calculates likelihood ratios and applies minimum probability thresholds to minimize false assignments.
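
The general idea of Bayesian assignment with position-specific error probabilities can be sketched as follows. This is a textbook-style illustration, not the PCRMatrix implementation: the function names are hypothetical, the error matrix uses one made-up uniform rate, and the priors are toy values.

```python
BASES = "ACGT"


def uniform_error_matrix(length, error_rate):
    """err[pos][true_base][called_base] = P(called | true) at that position.

    Every position shares one rate here; a fitted matrix would differ per position.
    """
    def row(true_base):
        return {c: (1 - error_rate if c == true_base else error_rate / 3)
                for c in BASES}
    return [{t: row(t) for t in BASES} for _ in range(length)]


def posterior(observed, expected_freq, err):
    """Posterior probability of each expected barcode given the observed one."""
    scores = {}
    for barcode, prior in expected_freq.items():
        p = prior
        for pos, (true_base, called) in enumerate(zip(barcode, observed)):
            p *= err[pos][true_base][called]
        scores[barcode] = p
    total = sum(scores.values())
    return {b: s / total for b, s in scores.items()}


post = posterior("AAAG", {"AAAA": 0.5, "GGGG": 0.5}, uniform_error_matrix(4, 0.01))
print(max(post, key=post.get), post)
```

Because the likelihood multiplies in the prior, an abundant library attracts ambiguous barcodes more strongly than a rare one, which is one way pool-wide frequencies can inform the assignment.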

Tile Mode (tile): Extends probabilistic mode with per-tile statistics. Illumina flowcells can exhibit tile-specific biases in barcode recovery, and this mode models these effects separately for each tile position, providing higher accuracy for datasets with tile information in headers.
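
Per-tile statistics require the tile number from each header. The Python sketch below groups barcode counts by tile, assuming the standard Illumina header layout (instrument:run:flowcell:lane:tile:x:y followed by a space and read:filter:control:index); the function names are hypothetical.

```python
from collections import Counter, defaultdict


def tile_of(header):
    """Tile number is the fifth colon-separated field of an Illumina header."""
    return int(header.split(":")[4])


def counts_by_tile(headers):
    """Build a separate barcode frequency table for every tile."""
    per_tile = defaultdict(Counter)
    for h in headers:
        barcode = h.rstrip().rsplit(":", 1)[-1]
        per_tile[tile_of(h)][barcode] += 1
    return per_tile


headers = [
    "@M01:23:FC:1:1101:1000:2000 1:N:0:ACGT",
    "@M01:23:FC:1:1101:1001:2001 1:N:0:ACGT",
    "@M01:23:FC:1:1102:1002:2002 1:N:0:ACGA",
]
for tile, table in sorted(counts_by_tile(headers).items()):
    print(tile, dict(table))
```

Headers that lack a tile field would make tile_of fail, which reflects why tile mode is only recommended when tile numbers are present.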

Hamming Distance Mode (hdist): Simple distance-based assignment that matches reads to the barcode with the fewest mismatches. Uses the clearzone parameter to require a minimum distance margin between the best and second-best matches, preventing ambiguous assignments.

Memory and Performance Optimization

The implementation uses several optimization strategies:

BufferedMultiCros: Multi-file output stream management with configurable read and byte buffer thresholds (rpb/bpb parameters)
Stream management: Dynamic file handle management to support large numbers of output files
Parallel processing: Multi-threaded barcode analysis with configurable thread counts
Memory-conscious design: Streaming processing with configurable buffer sizes to handle large datasets

Quality Control Features

NovaDemux includes integrated quality control mechanisms:

Spike-in detection: MicroAligner2 integration for PhiX and custom spike-in detection
Statistics reporting: Detailed yield and crosstalk analysis
Legacy compatibility: Support for historical analysis pipeline formats
Validation checks: Input parameter validation and error handling with errorState tracking

Performance Characteristics

Memory usage scales linearly with the number of unique barcodes observed. CPU usage depends on the selected mode: hdist is the fastest and least accurate, prob is more accurate, and tile offers the highest precision when tile numbers are present in the read headers. The tool can process millions of reads per minute on modern hardware while maintaining high accuracy.

Support

For questions and support:

Email: bbushnell@lbl.gov
Documentation: bbmap.org