RemoveBadBarcodes

Basic Usage

removebadbarcodes.sh in=<file> out=<file>

This tool filters reads by validating the characters in their barcode sequences, removing reads whose barcodes contain any character other than A, C, G, T, or '+'. It operates on standard Illumina format read headers, where the barcode sequence follows the final colon delimiter; reads with no barcode in that position are also removed.

Parameters

RemoveBadBarcodes uses a simple parameter set focused on input/output specification and compression options.

in=<file>: Input reads file; required parameter. Accepts FASTA or FASTQ format, gzipped or uncompressed. The reads must have standard Illumina format headers with barcodes after the final colon.
out=<file>: Destination for reads passing barcode validation; optional. If not specified, results will be written to stdout. Only reads whose barcodes consist entirely of A, C, G, T, and '+' characters are included.
ziplevel=2: (zl) Compression level for gzip output. Range 1-9 where 1 is fastest compression and 9 is maximum compression. Default is 2.
pigz=f: Spawn a pigz (parallel gzip) process for faster compression than Java's built-in gzip. Requires pigz to be installed on the system. Set to 't' or 'true' to enable.

Java Parameters

-Xmx: This will set Java's memory usage, overriding autodetection. -Xmx20g will specify 20 gigs of RAM, and -Xmx800m will specify 800 megs. The max is typically 85% of physical memory. Default memory allocation for this tool is 200MB.
-eoom: This flag will cause the process to exit if an out-of-memory exception occurs. Requires Java 8u92+. Useful for preventing system instability in batch processing environments.
-da: Disable assertions. Can provide a small performance boost in production environments where debugging information is not needed.

Examples

Basic Barcode Filtering

removebadbarcodes.sh in=raw_reads.fastq out=clean_reads.fastq

Filters raw Illumina reads, removing any with invalid barcode characters and writing clean reads to output.

Compressed Output

removebadbarcodes.sh in=reads.fq.gz out=filtered.fq.gz ziplevel=6

Process compressed input and create compressed output with compression level 6.

Using Parallel Compression

removebadbarcodes.sh in=large_dataset.fastq out=cleaned.fastq.gz pigz=t

Enable pigz for parallel gzip compression when processing large datasets. Requires pigz to be installed.

High Memory Processing

removebadbarcodes.sh -Xmx4g in=reads.fastq out=filtered.fastq

Allocate 4GB of RAM for processing very large read files, though this tool typically requires minimal memory.

Algorithm Details

Barcode Validation Strategy

RemoveBadBarcodes implements barcode character validation through the processReadPair() method from BBTool_ST, in two steps: locating the barcode in the read header, then validating each of its characters.

Read Header Processing

The algorithm examines each read's identifier string using String.lastIndexOf(':') to locate the barcode sequence, since Illumina format places the barcode after the final colon. If no colon is found (loc<0), or if the colon is the last character of the string (loc>=id.length()-1), the method increments the noBarcode counter and returns false, so the read is discarded.
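
The header check described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the tool's source: the class name BarcodeLocator, the method extractBarcode, the use of null to signal "no barcode", and the sample header are all hypothetical.

```java
// Illustrative sketch only; BarcodeLocator and extractBarcode are hypothetical names.
final class BarcodeLocator {

    /**
     * Returns the text after the final ':' in a read header,
     * or null when there is no colon or nothing follows it.
     */
    static String extractBarcode(String id) {
        int loc = id.lastIndexOf(':');
        // The two failure conditions from the text:
        // no colon at all, or the colon is the last character.
        if (loc < 0 || loc >= id.length() - 1) {
            return null;
        }
        return id.substring(loc + 1);
    }

    public static void main(String[] args) {
        // Made-up header in Illumina style, used only for demonstration.
        String header = "@M01234:55:000000000-ABCDE:1:1101:15589:1332 1:N:0:ACGTACGT+TTGCATGC";
        System.out.println(extractBarcode(header)); // prints ACGTACGT+TTGCATGC
    }
}
```

Returning null keeps the "no barcode" case distinct from an invalid barcode, which mirrors the separate counters the tool reports.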

Character Validation

For reads with identifiable barcodes, each character in the barcode sequence undergoes validation using two criteria:

Plus Character Exception: The '+' character is explicitly allowed (c=='+'), as it appears in some Illumina barcode formats
Nucleotide Validation: All other characters must pass the AminoAcid.isFullyDefined(c) test, which accepts standard DNA bases (A, C, G, T) and rejects ambiguous nucleotide codes
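
The two criteria above amount to a single pass over the barcode characters. The sketch below is a hedged approximation: BarcodeValidator and isValidBarcode are hypothetical names, and isDefinedBase is a simplified stand-in for AminoAcid.isFullyDefined(c) that accepts only uppercase A, C, G, and T. The real method may treat other characters (for example, lowercase bases) differently.

```java
// Illustrative sketch only; not the BBTools implementation.
final class BarcodeValidator {

    /** Simplified stand-in for AminoAcid.isFullyDefined(c) (assumption: uppercase ACGT only). */
    static boolean isDefinedBase(char c) {
        return c == 'A' || c == 'C' || c == 'G' || c == 'T';
    }

    /** Returns true when every character is '+' or a defined base. */
    static boolean isValidBarcode(String barcode) {
        for (int i = 0; i < barcode.length(); i++) {
            char c = barcode.charAt(i);
            // '+' is allowed explicitly; everything else must be a defined base.
            if (c != '+' && !isDefinedBase(c)) {
                return false;
            }
        }
        return true;
    }
}
```

Under these assumptions, a dual-index barcode such as ACGT+TTGC passes, while ACNT fails because N is an ambiguous code.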

Quality Control Metrics

The algorithm maintains three long counters throughout processing, which the showStatsSubclass() method reports:

Good Reads: Reads with valid barcodes containing only ACGT and '+' characters
Bad Reads: Reads with barcodes containing characters other than ACGT or '+' (ambiguous nucleotides, non-DNA characters)
No Barcode: Reads whose headers don't conform to expected Illumina format (no colon, or nothing after the final colon)
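
One way to picture how the three counters relate to the keep/discard decision is the sketch below. It is a self-contained approximation, not the tool's code: BarcodeTally and process are hypothetical names, and the inline ACGT check stands in for AminoAcid.isFullyDefined(c).

```java
// Illustrative sketch only; BarcodeTally is a hypothetical class.
final class BarcodeTally {
    long good = 0;
    long bad = 0;
    long noBarcode = 0;

    /** Classifies one read header and returns true if the read would be kept. */
    boolean process(String id) {
        int loc = id.lastIndexOf(':');
        if (loc < 0 || loc >= id.length() - 1) {
            noBarcode++; // no colon, or nothing after the final colon
            return false;
        }
        for (int i = loc + 1; i < id.length(); i++) {
            char c = id.charAt(i);
            // Assumption: uppercase ACGT stands in for the isFullyDefined test.
            boolean ok = c == '+' || c == 'A' || c == 'C' || c == 'G' || c == 'T';
            if (!ok) {
                bad++; // first invalid character decides the outcome
                return false;
            }
        }
        good++;
        return true;
    }
}
```

Each header increments exactly one counter, so the three values always sum to the number of headers processed.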

Performance Characteristics

The tool uses single-threaded string parsing with the following characteristics:

Memory Usage: Default 200MB allocation (configurable via -Xmx) with streaming read processing
Processing Speed: Linear time complexity, O(n) in the length of each header: lastIndexOf searches for the final colon, and validation then iterates over the barcode substring only
I/O Efficiency: Supports compressed input/output formats and parallel compression via external pigz process

Use Cases

RemoveBadBarcodes is applicable in these scenarios:

Quality Control: Pre-processing step before demultiplexing to remove reads with corrupted barcodes
Error Detection: Identifying problematic sequencing runs with high rates of barcode corruption
Pipeline Integration: Automated filtering in high-throughput processing workflows
Data Cleaning: Removing reads that would cause downstream tools to fail or produce incorrect results

Output Information

Upon completion, RemoveBadBarcodes reports summary statistics:

Good: Number of reads with valid barcodes that were retained
Bad: Number of reads with invalid barcode characters that were removed
No Barcode: Number of reads with headers that don't contain identifiable barcodes; these reads are also removed

These counters provide metrics for assessing barcode corruption rates and header format compliance.
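
As a rough illustration of turning the reported counts into rates, the sketch below computes each category's share of all reads. BarcodeStats and fraction are hypothetical names, the counts in the example are made up, and the tool's own report format is not reproduced here.

```java
// Illustrative sketch only; BarcodeStats is a hypothetical helper.
final class BarcodeStats {

    /** Returns count as a fraction of all reads, or 0.0 when no reads were seen. */
    static double fraction(long count, long good, long bad, long noBarcode) {
        long total = good + bad + noBarcode;
        return total == 0 ? 0.0 : (double) count / total;
    }

    public static void main(String[] args) {
        // Made-up counts for demonstration.
        long good = 90, bad = 5, noBarcode = 5;
        System.out.printf("bad fraction: %.3f%n", fraction(bad, good, bad, noBarcode));
        System.out.printf("no-barcode fraction: %.3f%n", fraction(noBarcode, good, bad, noBarcode));
    }
}
```

A high bad fraction points to barcode corruption in the run, while a high no-barcode fraction suggests the headers are not in the expected Illumina format.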

Support

For questions and support:

Email: bbushnell@lbl.gov
Documentation: bbmap.org