RemoveBadBarcodes
Removes reads with barcodes containing non-ACGT bases. Read headers must be in standard Illumina format.
Basic Usage
removebadbarcodes.sh in=<file> out=<file>
This tool validates the characters of each read's barcode and removes reads whose barcode contains anything other than A, C, G, T, or '+'. It expects standard Illumina format read headers, in which the barcode sequence follows the final colon delimiter.
Parameters
RemoveBadBarcodes uses a simple parameter set focused on input/output specification and compression options.
Parameters
- in=<file>
- Input reads file; required parameter. Accepts FASTA or FASTQ format, gzipped or uncompressed. The reads must have standard Illumina format headers with barcodes after the final colon.
- out=<file>
- Destination for reads passing barcode validation; optional. If not specified, results will be written to stdout. Only reads whose barcodes consist entirely of A, C, G, T, and '+' characters are included.
- ziplevel=2
- (zl) Compression level for gzip output. Range 1-9 where 1 is fastest compression and 9 is maximum compression. Default is 2.
- pigz=f
- Spawn a pigz (parallel gzip) process for faster compression than Java's built-in gzip. Requires pigz to be installed on the system. Set to 't' or 'true' to enable.
Java Parameters
- -Xmx
- This will set Java's memory usage, overriding autodetection. -Xmx20g will specify 20 gigs of RAM, and -Xmx800m will specify 800 megs. The max is typically 85% of physical memory. Default memory allocation for this tool is 200MB.
- -eoom
- This flag will cause the process to exit if an out-of-memory exception occurs. Requires Java 8u92+. Useful for preventing system instability in batch processing environments.
- -da
- Disable assertions. Can provide a small performance boost in production environments where debugging information is not needed.
Examples
Basic Barcode Filtering
removebadbarcodes.sh in=raw_reads.fastq out=clean_reads.fastq
Filters raw Illumina reads, removing any with invalid barcode characters and writing clean reads to output.
Compressed Output
removebadbarcodes.sh in=reads.fq.gz out=filtered.fq.gz ziplevel=6
Process compressed input and create compressed output with compression level 6.
Using Parallel Compression
removebadbarcodes.sh in=large_dataset.fastq out=cleaned.fastq.gz pigz=t
Enable pigz for parallel gzip compression when processing large datasets. Requires pigz to be installed.
High Memory Processing
removebadbarcodes.sh -Xmx4g in=reads.fastq out=filtered.fastq
Allocate 4GB of RAM for processing very large read files, though this tool typically requires minimal memory.
Algorithm Details
Barcode Validation Strategy
RemoveBadBarcodes implements barcode character validation through the processReadPair() method from BBTool_ST:
Read Header Processing
The algorithm locates the barcode by calling String.lastIndexOf(':') on each read's identifier string; in Illumina format, the barcode follows this final colon. If no colon is found (loc<0), or if the colon is the last character of the string (loc>=id.length()-1), the method increments the noBarcode counter and returns false.
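The header check described above can be sketched in a few lines. This is a Python illustration of the logic, not the tool's Java source; the function name extract_barcode and the example headers are hypothetical.

```python
def extract_barcode(header):
    """Return the text after the final colon, or None if no barcode is present."""
    # str.rfind plays the role of Java's String.lastIndexOf(':').
    loc = header.rfind(":")
    # No colon at all, or the colon is the last character: no barcode.
    if loc < 0 or loc >= len(header) - 1:
        return None
    return header[loc + 1:]


# Hypothetical Illumina-style headers, for illustration only.
print(extract_barcode("M00123:45:000000000-ABCDE:1:1101:15589:1332 1:N:0:ACGTACGT+TTGACCAA"))
# prints ACGTACGT+TTGACCAA
print(extract_barcode("read_without_colon"))  # prints None
print(extract_barcode("M00123:45:000000000-ABCDE:1:1101:15589:1332 1:N:0:"))  # prints None
```

A header that returns None here corresponds to the case that increments the noBarcode counter.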
Character Validation
For reads with identifiable barcodes, each character in the barcode sequence must satisfy one of two criteria:
- Plus Character Exception: The '+' character is explicitly allowed (c=='+'), as it appears in some Illumina barcode formats
- Nucleotide Validation: All other characters must pass the AminoAcid.isFullyDefined(c) test, which accepts standard DNA bases (A, C, G, T) and rejects ambiguous nucleotide codes
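The two criteria above amount to a per-character test. The following Python sketch mirrors that test under one stated assumption: it treats only uppercase A, C, G, and T as fully defined, which may differ from what AminoAcid.isFullyDefined accepts in practice. The names DEFINED_BASES and barcode_is_valid are hypothetical.

```python
# Assumption: only uppercase A, C, G, T count as fully defined bases.
DEFINED_BASES = frozenset("ACGT")


def barcode_is_valid(barcode):
    """True if every character is '+' or a fully defined base."""
    return all(c == "+" or c in DEFINED_BASES for c in barcode)


print(barcode_is_valid("ACGTACGT+TTGACCAA"))  # prints True
print(barcode_is_valid("ACGTNCGT"))           # prints False (N is ambiguous)
print(barcode_is_valid("ACGT-TTGA"))          # prints False ('-' is not a base)
```

The sketch returns True for an empty string, so it relies on the earlier header check to have already excluded reads with no barcode.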
Quality Control Metrics
The algorithm maintains three long counters during processing and reports them in the showStatsSubclass() method:
- Good Reads: Reads with valid barcodes containing only ACGT and '+' characters
- Bad Reads: Reads with barcodes containing characters other than ACGT or '+' (ambiguous nucleotides, non-DNA characters)
- No Barcode: Reads whose headers don't conform to expected Illumina format
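To show how the three counters relate to one another, here is a self-contained Python sketch that sorts headers into the same three categories. It is an approximation of the behavior described in this section, not the tool's code; the function names, category keys, and sample headers are hypothetical, and only uppercase ACGT is treated as a defined base.

```python
from collections import Counter


def classify(header):
    """Return 'good', 'bad', or 'no_barcode' for one read header."""
    loc = header.rfind(":")
    if loc < 0 or loc >= len(header) - 1:
        return "no_barcode"
    barcode = header[loc + 1:]
    # Assumption: only uppercase A, C, G, T are fully defined bases.
    if all(c == "+" or c in "ACGT" for c in barcode):
        return "good"
    return "bad"


def tally(headers):
    """Count how many headers fall into each category."""
    counts = Counter(good=0, bad=0, no_barcode=0)
    for header in headers:
        counts[classify(header)] += 1
    return counts


# Hypothetical headers, for illustration only.
sample = [
    "run1:1:1101:1000:2000 1:N:0:ACGT+TTGA",
    "run1:1:1101:1001:2001 1:N:0:ACNT+TTGA",
    "read_without_barcode",
]
print(dict(tally(sample)))
```

Each read lands in exactly one category, so the three counts always sum to the number of reads processed.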
Performance Characteristics
The tool uses single-threaded string parsing with the following characteristics:
- Memory Usage: Default 200MB allocation (configurable via -Xmx) with streaming read processing
- Processing Speed: Linear time complexity, O(n) in the total number of header characters in the worst case; after the final colon is located, only the barcode substring is checked character by character
- I/O Efficiency: Supports compressed input/output formats and parallel compression via external pigz process
Use Cases
RemoveBadBarcodes is applicable in these scenarios:
- Quality Control: Pre-processing step before demultiplexing to remove reads with corrupted barcodes
- Error Detection: Identifying problematic sequencing runs with high rates of barcode corruption
- Pipeline Integration: Automated filtering in high-throughput processing workflows
- Data Cleaning: Removing reads that would cause downstream tools to fail or produce incorrect results
Output Information
Upon completion, RemoveBadBarcodes reports summary statistics:
- Good: Number of reads with valid barcodes that were retained
- Bad: Number of reads with invalid barcode characters that were removed
- No Barcode: Number of reads with headers that don't contain identifiable barcodes
These counters provide metrics for assessing barcode corruption rates and header format compliance.
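As a small worked example of turning the reported counts into rates, the Python sketch below divides each counter by the total number of reads. The function name barcode_rates and the counts passed to it are hypothetical and do not come from a real run.

```python
def barcode_rates(good, bad, no_barcode):
    """Return each category as a fraction of all reads processed."""
    total = good + bad + no_barcode
    if total == 0:
        # Avoid dividing by zero when no reads were processed.
        return {"good": 0.0, "bad": 0.0, "no_barcode": 0.0}
    return {
        "good": good / total,
        "bad": bad / total,
        "no_barcode": no_barcode / total,
    }


# Hypothetical counts, for illustration only.
rates = barcode_rates(good=9000, bad=800, no_barcode=200)
for name, value in rates.items():
    print(f"{name}: {value:.1%}")
```

A high bad fraction points to barcode corruption in the run, while a high no_barcode fraction suggests the headers are not in the expected Illumina format.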
Support
For questions and support:
- Email: bbushnell@lbl.gov
- Documentation: bbmap.org