RemoveBadBarcodes

Script: removebadbarcodes.sh Package: jgi Class: RemoveBadBarcodes.java

Removes reads with barcodes containing non-ACGT bases. Read headers must be in standard Illumina format.

Basic Usage

removebadbarcodes.sh in=<file> out=<file>

This tool validates the characters in each read's barcode and removes reads whose barcodes contain anything other than A, C, G, T, or '+'. It expects standard Illumina read headers, in which the barcode sequence follows the final colon.

Parameters

RemoveBadBarcodes uses a simple parameter set focused on input/output specification and compression options.

Parameters

in=<file>
Input reads file; required parameter. Accepts FASTA or FASTQ format, gzipped or uncompressed. The reads must have standard Illumina format headers with barcodes after the final colon.
out=<file>
Destination for reads passing barcode validation; optional. If not specified, results will be written to stdout. Only reads whose barcodes consist entirely of A, C, G, T, and '+' characters are written.
ziplevel=2
(zl) Compression level for gzip output. Range 1-9 where 1 is fastest compression and 9 is maximum compression. Default is 2.
pigz=f
Spawn a pigz (parallel gzip) process for faster compression than Java's built-in gzip. Requires pigz to be installed on the system. Set to 't' or 'true' to enable.

Java Parameters

-Xmx
This will set Java's memory usage, overriding autodetection. -Xmx20g will specify 20 gigs of RAM, and -Xmx800m will specify 800 megs. The max is typically 85% of physical memory. Default memory allocation for this tool is 200MB.
-eoom
This flag will cause the process to exit if an out-of-memory exception occurs. Requires Java 8u92+. Useful for preventing system instability in batch processing environments.
-da
Disable assertions. Can provide a small performance boost in production environments where debugging information is not needed.

Examples

Basic Barcode Filtering

removebadbarcodes.sh in=raw_reads.fastq out=clean_reads.fastq

Filters raw Illumina reads, removing any with invalid barcode characters and writing clean reads to output.

Compressed Output

removebadbarcodes.sh in=reads.fq.gz out=filtered.fq.gz ziplevel=6

Process compressed input and create compressed output with compression level 6.

Using Parallel Compression

removebadbarcodes.sh in=large_dataset.fastq out=cleaned.fastq.gz pigz=t

Enable pigz for parallel gzip compression when processing large datasets. Requires pigz to be installed.

High Memory Processing

removebadbarcodes.sh -Xmx4g in=reads.fastq out=filtered.fastq

Allocate 4GB of RAM for processing very large read files, though this tool typically requires minimal memory.

Algorithm Details

Barcode Validation Strategy

RemoveBadBarcodes implements barcode character validation through the processReadPair() method from BBTool_ST:

Read Header Processing

The algorithm examines each read's identifier string, using String.lastIndexOf(':') to find the final colon; in Illumina format, the barcode follows this delimiter. If no colon is found (loc<0), or if the colon is the last character of the string (loc>=id.length()-1), there is no barcode to check, so the method increments the noBarcode counter and returns false.
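The header lookup described above can be sketched in a few lines of Java. This is a minimal illustration rather than the tool's actual source: the class name BarcodeLocator, the method extractBarcode, and the sample headers are hypothetical, and returning null stands in for the "no barcode" case.

```java
final class BarcodeLocator {

    /**
     * Returns the text after the final ':' in a read header,
     * or null when there is no colon or nothing follows it.
     * (Hypothetical helper; the real tool works inside processReadPair().)
     */
    static String extractBarcode(String id) {
        int loc = id.lastIndexOf(':');
        // The two rejection conditions from the text:
        // no colon at all, or the colon is the last character.
        if (loc < 0 || loc >= id.length() - 1) {
            return null;
        }
        return id.substring(loc + 1);
    }

    public static void main(String[] args) {
        // Header values are made up for illustration.
        String[] headers = {
            "M01234:55:000000000-ABCDE:1:1101:15589:1332 1:N:0:ACGTAC+TTGCAA",
            "read_without_colon",
            "header:ending:in:colon:"
        };
        for (String h : headers) {
            System.out.println(extractBarcode(h));
        }
    }
}
```

Only the first sample header yields a barcode here; the other two would fall into the no-barcode case.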

Character Validation

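For reads with identifiable barcodes, each character in the barcode sequence is checked against two criteria: it must be either a defined base (A, C, G, or T) or the '+' character that separates the two indexes of a dual-index barcode.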
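A per-character check of this kind might look like the sketch below. It assumes, based on the description of the out parameter, that the accepted characters are A, C, G, T, and '+'. Treating only uppercase letters as valid is an assumption of this sketch, and the names BarcodeCharCheck, isAllowed, and isValidBarcode are hypothetical.

```java
final class BarcodeCharCheck {

    /** True when c is an unambiguous base or the '+' dual-index separator. */
    static boolean isAllowed(char c) {
        // Assumption: uppercase only; the real tool may differ.
        return c == 'A' || c == 'C' || c == 'G' || c == 'T' || c == '+';
    }

    /**
     * True when every character of the barcode is allowed.
     * Callers are expected to pass a non-empty barcode.
     */
    static boolean isValidBarcode(String barcode) {
        for (int i = 0; i < barcode.length(); i++) {
            if (!isAllowed(barcode.charAt(i))) {
                return false; // stop at the first offending character
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(isValidBarcode("ACGTAC+TTGCAA")); // dual-index barcode
        System.out.println(isValidBarcode("ACNTAC"));        // contains an N
    }
}
```

Returning at the first disallowed character avoids scanning the rest of the barcode once the read is known to fail.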

Quality Control Metrics

The algorithm maintains three long counters throughout processing and reports them in the showStatsSubclass() method:
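To make the bookkeeping concrete, here is a small self-contained sketch that classifies headers and tallies three counters. Only noBarcode is named in the text above; the names good and bad, the class BarcodeTally, and the sample headers are assumptions made for illustration, not the tool's confirmed fields.

```java
final class BarcodeTally {
    long good = 0;      // hypothetical name: reads whose barcode passes
    long bad = 0;       // hypothetical name: barcode has a disallowed character
    long noBarcode = 0; // named in the text: no barcode found in the header

    /** Classifies one header, updates one counter, returns true to keep the read. */
    boolean keep(String id) {
        int loc = id.lastIndexOf(':');
        if (loc < 0 || loc >= id.length() - 1) {
            noBarcode++;
            return false;
        }
        for (int i = loc + 1; i < id.length(); i++) {
            char c = id.charAt(i);
            boolean ok = c == 'A' || c == 'C' || c == 'G' || c == 'T' || c == '+';
            if (!ok) {
                bad++;
                return false;
            }
        }
        good++;
        return true;
    }

    public static void main(String[] args) {
        BarcodeTally tally = new BarcodeTally();
        // Made-up headers: one clean, one with an N, one without a colon.
        String[] headers = {
            "r1 1:N:0:ACGT+TTGA",
            "r2 1:N:0:ACNT+TTGA",
            "r3_no_colon"
        };
        for (String h : headers) {
            tally.keep(h);
        }
        System.out.println("good=" + tally.good
                + " bad=" + tally.bad
                + " noBarcode=" + tally.noBarcode);
    }
}
```

Each header increments exactly one counter, so the three values always sum to the number of reads examined.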

Performance Characteristics

The tool uses single-threaded string parsing with the following measured characteristics:

Use Cases

RemoveBadBarcodes is applicable in these scenarios:

Output Information

Upon completion, RemoveBadBarcodes reports summary statistics:

These counters provide metrics for assessing barcode corruption rates and header format compliance.

Support

For questions and support: