FilterBarcodes

Script: filterbarcodes.sh Package: barcode Class: CorrelateBarcodes.java

Filters barcodes by quality, and generates quality histograms.

Basic Usage

filterbarcodes.sh in=<file> out=<file> maq=<integer>

This tool processes reads that have already been multiplexed with barcode qualities using mergebarcodes.sh. It filters reads based on barcode quality thresholds and generates quality histograms for analysis.

Parameters

Parameters are organized by their function in the barcode filtering process. All parameters from the shell script usage function are documented below.

Input parameters

in=<file>
Reads that have already been muxed with barcode qualities using mergebarcodes.sh. Required input parameter.
int=auto
(interleaved) If true, forces fastq input to be paired and interleaved. Default: auto
qin=auto
ASCII offset for input quality. May be 33 (Sanger), 64 (Illumina), or auto. Default: auto

Output parameters

out=<file>
Write filtered reads here. 'out=stdout.fq' will pipe to standard out.
cor=<file>
Correlation between read and index qualities. Outputs tab-delimited correlation data with read quality vs barcode quality statistics.
bqhist=<file>
Barcode quality histogram by position. Generates position-specific quality distribution data.
baqhist=<file>
Barcode average quality histogram. Shows distribution of average quality scores across all barcodes.
bmqhist=<file>
Barcode minimum quality histogram. Shows distribution of minimum quality scores found in barcodes.
overwrite=t
(ow) Set to false to force the program to abort rather than overwrite an existing file. Default: true
ziplevel=2
(zl) Set to 1 (lowest) through 9 (max) to change compression level; lower compression is faster. Default: 2
fastawrap=80
Length of lines in fasta output. Default: 80
qout=auto
ASCII offset for output quality. May be 33 (Sanger), 64 (Illumina), or auto (same as input). Default: auto
maq=0
Discard reads whose barcode average quality is below this value. Default: 0 (no filtering)
mmq=0
Discard reads whose barcode minimum quality is below this value. Default: 0 (no filtering)

Other parameters

pigz=t
Use pigz to compress. If argument is a number, that will set the number of pigz threads. Default: true
unpigz=t
Use pigz to decompress. Default: true

Java Parameters

-Xmx
This will set Java's memory usage, overriding autodetection. -Xmx20g will specify 20 gigs of RAM, and -Xmx200m will specify 200 megs. The max is typically 85% of physical memory. Default: 200m for this tool
-eoom
This flag will cause the process to exit if an out-of-memory exception occurs. Requires Java 8u92+.
-da
Disable assertions.

Examples

Basic Quality Filtering

filterbarcodes.sh in=multiplexed_reads.fq out=filtered_reads.fq maq=20

Filter reads to keep only those with barcode average quality ≥20. This removes low-quality barcoded reads that could introduce errors in downstream analysis.

Generate Quality Histograms

filterbarcodes.sh in=barcoded_reads.fq out=clean_reads.fq \
  baqhist=barcode_avg_qual.txt bmqhist=barcode_min_qual.txt bqhist=barcode_pos_qual.txt

Write all three barcode quality histograms: the average-quality distribution, the minimum-quality distribution, and the per-position quality distribution.

Quality Correlation Analysis

filterbarcodes.sh in=sample_reads.fq out=filtered_reads.fq \
  cor=quality_correlation.txt maq=15 mmq=10

Filter reads with dual quality thresholds (average ≥15, minimum ≥10) and generate correlation data between read qualities and barcode qualities for QC assessment.

Strict Quality Filtering with Compression

filterbarcodes.sh in=raw_barcoded.fq.gz out=high_quality.fq.gz \
  maq=25 mmq=15 ziplevel=6 pigz=4

Apply strict quality filtering (average ≥25, minimum ≥15) with higher compression level and 4 pigz threads for processing large datasets.

Algorithm Details

FilterBarcodes uses the CorrelateBarcodes class to perform quality assessment and filtering of barcoded sequencing reads.

Barcode Quality Extraction

The tool parses multiplexed read identifiers to extract barcode sequences and their corresponding quality scores. Barcode information is encoded in the read ID format created by mergebarcodes.sh, with the barcode sequence and quality string separated by underscores. The ID is split with r1.id.split("_"), and each quality character is converted to a Phred score by subtracting the ASCII offset of 33.
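The parsing step can be illustrated with a short Python sketch (not the tool's Java code). The ID layout `<barcode>_<quality string>_<original id>` and the function name parse_barcode_id are assumptions made for illustration; the actual field order written by mergebarcodes.sh may differ.

```python
def parse_barcode_id(read_id, offset=33):
    """Split a muxed read ID into barcode bases, Phred scores, and the remainder.

    Assumes the hypothetical layout '<barcode>_<quality string>_<original id>'.
    """
    # Split at most twice so underscores in the original ID are preserved.
    fields = read_id.split("_", 2)
    if len(fields) < 2:
        raise ValueError(f"no barcode fields in read ID: {read_id!r}")
    bases, qual_string = fields[0], fields[1]
    if len(bases) != len(qual_string):
        raise ValueError("barcode and quality string differ in length")
    # Convert each ASCII quality character to a Phred score.
    quals = [ord(c) - offset for c in qual_string]
    rest = fields[2] if len(fields) > 2 else ""
    return bases, quals, rest


bases, quals, rest = parse_barcode_id("ACGTAC_IIFF5#_read_1 1:N:0")
print(bases, quals, rest)
```

With an offset of 33, the characters I, F, 5, and # decode to Phred scores 40, 37, 20, and 2.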

Quality Metrics Calculation

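For each barcode, the tool calculates two key quality metrics: the average quality and the minimum quality of the barcode's bases. These are the values compared against the maq and mmq thresholds and tallied in the baqhist and bmqhist histograms.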
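As a rough illustration of the average and minimum barcode quality, the Python sketch below computes both for one barcode. This document does not state how the tool averages barcode qualities, so the sketch shows two common definitions, a plain arithmetic mean and a mean taken over error probabilities; which one the tool uses is not confirmed here, and all function names are hypothetical.

```python
import math


def arithmetic_mean_quality(quals):
    """Plain average of the Phred scores."""
    return sum(quals) / len(quals)


def probability_mean_quality(quals):
    """Average the error probabilities, then convert back to a Phred score."""
    probs = [10 ** (-q / 10) for q in quals]
    mean_p = sum(probs) / len(probs)
    return -10 * math.log10(mean_p)


def barcode_metrics(quals):
    """Return the average (both definitions) and minimum quality of a barcode."""
    if not quals:
        raise ValueError("empty barcode quality list")
    return {
        "avg": arithmetic_mean_quality(quals),
        "avg_by_probability": probability_mean_quality(quals),
        "min": min(quals),
    }


metrics = barcode_metrics([40, 40, 37, 37, 20, 2])
print(metrics)
```

A single very low score pulls the probability-based average far below the arithmetic one, which is why the choice of definition matters when picking a maq value.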

Correlation Analysis

The tool maintains two 50×50 correlation matrices (qualCor1 and qualCor2) to track relationships between read qualities and barcode qualities for paired-end data. For each read, r1.avgQualityByProbabilityInt(true, 0) calculates the read's average quality q1, and the matrix cell indexed by q1 and the barcode quality qbar is then incremented (qualCor1[q1][qbar]++). This enables assessment of whether barcode quality is predictive of read quality.
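The counting scheme can be sketched in Python as follows (not the tool's Java code). The 50×50 size comes from the description above; clamping out-of-range qualities to the last bin and the helper names are assumptions of this sketch.

```python
MAX_Q = 50  # matrix dimension, matching the 50x50 size described above


def new_matrix(size=MAX_Q):
    """Create a size x size matrix of zero counts."""
    return [[0] * size for _ in range(size)]


def record(matrix, read_q, barcode_q):
    """Increment the cell for one (read quality, barcode quality) pair."""
    size = len(matrix)
    # Clamping to the valid index range is an assumption of this sketch.
    r = max(0, min(int(read_q), size - 1))
    b = max(0, min(int(barcode_q), size - 1))
    matrix[r][b] += 1


def nonzero_rows(matrix):
    """List (read quality, barcode quality, count) for every non-empty cell."""
    return [(r, b, n)
            for r, row in enumerate(matrix)
            for b, n in enumerate(row) if n]


qual_cor1 = new_matrix()
for read_q, barcode_q in [(35, 30), (35, 30), (12, 8), (60, 41)]:
    record(qual_cor1, read_q, barcode_q)

# Print the non-empty cells as tab-delimited rows.
for read_q, barcode_q, count in nonzero_rows(qual_cor1):
    print(f"{read_q}\t{barcode_q}\t{count}")
```

Because the matrix has a fixed number of cells, its memory footprint does not grow with the number of reads counted.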

Quality Filtering Strategy

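Reads are filtered using dual thresholds combined with boolean logic: a read is discarded if its barcode average quality is below maq or its barcode minimum quality is below mmq. With both thresholds at their default of 0, no reads are removed.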
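A minimal Python sketch of the dual-threshold test follows, assuming a read is kept only when its barcode meets both the maq and mmq thresholds, as the parameter descriptions imply. The arithmetic mean and the function name passes_barcode_filter are assumptions for illustration, not the tool's code.

```python
def passes_barcode_filter(quals, maq=0, mmq=0):
    """Return True if a barcode meets both the average and minimum thresholds."""
    if not quals:
        raise ValueError("empty barcode quality list")
    avg = sum(quals) / len(quals)  # arithmetic mean is an assumption here
    return avg >= maq and min(quals) >= mmq


barcodes = {
    "good": [38, 37, 36, 35],
    "one_bad_base": [40, 40, 37, 37, 20, 2],
    "uniformly_low": [12, 12, 11, 12],
}

# Apply the thresholds from the correlation example (maq=15 mmq=10).
kept = [name for name, quals in barcodes.items()
        if passes_barcode_filter(quals, maq=15, mmq=10)]
print(kept)
```

The second barcode has a high average but fails the minimum threshold, and the third passes the minimum but fails the average, so using both thresholds catches cases that either one alone would miss.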

Histogram Generation

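The tool generates three types of quality histograms using fixed-size arrays: quality by barcode position (bqhist), average barcode quality (baqhist), and minimum barcode quality (bmqhist).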
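The three histograms (by position, average, and minimum, matching bqhist, baqhist, and bmqhist) can be sketched in Python with fixed-size count arrays. The bin count of 50, the rounding of the average to the nearest integer, and the function name build_histograms are assumptions of this sketch rather than details confirmed by the tool.

```python
def build_histograms(barcode_quals, barcode_len, max_q=50):
    """Tally per-position, average, and minimum quality counts."""
    pos_hist = [[0] * max_q for _ in range(barcode_len)]  # [position][quality]
    avg_hist = [0] * max_q
    min_hist = [0] * max_q

    def bin_of(q):
        # Clamp into the fixed bin range (an assumption of this sketch).
        return max(0, min(int(q), max_q - 1))

    for quals in barcode_quals:
        for pos, q in enumerate(quals[:barcode_len]):
            pos_hist[pos][bin_of(q)] += 1
        avg_hist[bin_of(round(sum(quals) / len(quals)))] += 1
        min_hist[bin_of(min(quals))] += 1
    return pos_hist, avg_hist, min_hist


pos_hist, avg_hist, min_hist = build_histograms(
    [[40, 40, 20], [40, 30, 20], [10, 30, 20]], barcode_len=3)

# Report the non-empty bins of the average-quality histogram.
for q, count in enumerate(avg_hist):
    if count:
        print(f"{q}\t{count}")
```

Each array has a fixed number of bins, so adding more reads only changes the counts, not the amount of memory used.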

Memory Management

The tool uses fixed-size data structures, such as the 50×50 correlation matrices and the histogram arrays described above, to maintain constant memory usage regardless of dataset size.

Statistical Output

Upon completion, the tool reports processing statistics:

Support

For questions and support: