FilterBarcodes
Filters barcodes by quality, and generates quality histograms.
Basic Usage
filterbarcodes.sh in=<file> out=<file> maq=<integer>
This tool processes reads that have already been multiplexed with barcode qualities using mergebarcodes.sh. It filters reads based on barcode quality thresholds and generates quality histograms for analysis.
Parameters
Parameters are organized by their function in the barcode filtering process. All parameters from the shell script usage function are documented below.
Input parameters
- in=<file>
- Reads that have already been muxed with barcode qualities using mergebarcodes.sh. Required input parameter.
- int=auto
- (interleaved) If true, forces fastq input to be paired and interleaved. Default: auto
- qin=auto
- ASCII offset for input quality. May be 33 (Sanger), 64 (Illumina), or auto. Default: auto
Output parameters
- out=<file>
- Write filtered reads here. 'out=stdout.fq' will pipe to standard out.
- cor=<file>
- Correlation between read and index qualities. Outputs tab-delimited correlation data with read quality vs barcode quality statistics.
- bqhist=<file>
- Barcode quality histogram by position. Generates position-specific quality distribution data.
- baqhist=<file>
- Barcode average quality histogram. Shows distribution of average quality scores across all barcodes.
- bmqhist=<file>
- Barcode minimum quality histogram. Shows distribution of minimum quality scores found in barcodes.
- overwrite=t
- (ow) Set to false to force the program to abort rather than overwrite an existing file. Default: true
- ziplevel=2
- (zl) Set to 1 (lowest) through 9 (max) to change compression level; lower compression is faster. Default: 2
- fastawrap=80
- Length of lines in fasta output. Default: 80
- qout=auto
- ASCII offset for output quality. May be 33 (Sanger), 64 (Illumina), or auto (same as input). Default: auto
- maq=0
- Discard reads whose barcode average quality is below this value. Default: 0 (no filtering)
- mmq=0
- Discard reads whose barcode minimum quality is below this value. Default: 0 (no filtering)
Other parameters
- pigz=t
- Use pigz to compress. If argument is a number, that will set the number of pigz threads. Default: true
- unpigz=t
- Use pigz to decompress. Default: true
Java Parameters
- -Xmx
- This will set Java's memory usage, overriding autodetection. -Xmx20g will specify 20 gigs of RAM, and -Xmx200m will specify 200 megs. The max is typically 85% of physical memory. Default: 200m for this tool
- -eoom
- This flag will cause the process to exit if an out-of-memory exception occurs. Requires Java 8u92+.
- -da
- Disable assertions.
Examples
Basic Quality Filtering
filterbarcodes.sh in=multiplexed_reads.fq out=filtered_reads.fq maq=20
Filter reads to keep only those with barcode average quality ≥20. This removes low-quality barcoded reads that could introduce errors in downstream analysis.
Generate Quality Histograms
filterbarcodes.sh in=barcoded_reads.fq out=clean_reads.fq \
baqhist=barcode_avg_qual.txt bmqhist=barcode_min_qual.txt bqhist=barcode_pos_qual.txt
Generate quality histograms: average quality distribution, minimum quality distribution, and position-specific quality histograms for barcode analysis.
Quality Correlation Analysis
filterbarcodes.sh in=sample_reads.fq out=filtered_reads.fq \
cor=quality_correlation.txt maq=15 mmq=10
Filter reads with dual quality thresholds (average ≥15, minimum ≥10) and generate correlation data between read qualities and barcode qualities for QC assessment.
Strict Quality Filtering with Compression
filterbarcodes.sh in=raw_barcoded.fq.gz out=high_quality.fq.gz \
maq=25 mmq=15 ziplevel=6 pigz=4
Apply strict quality filtering (average ≥25, minimum ≥15) with higher compression level and 4 pigz threads for processing large datasets.
Algorithm Details
FilterBarcodes uses the CorrelateBarcodes class to perform quality assessment and filtering of barcoded sequencing reads:
Barcode Quality Extraction
The tool parses multiplexed read identifiers to extract barcode sequences and their corresponding quality scores. Barcode information is encoded in the read ID format created by mergebarcodes.sh, with underscore-separated barcode sequence and quality strings. The parsing uses r1.id.split("_") to extract components, with quality values adjusted by subtracting 33 for ASCII offset conversion.
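The parsing step can be sketched roughly as follows. This is a minimal illustration, not the CorrelateBarcodes source: the class name, method names, and the assumption that the barcode bases and quality string are the last two underscore-separated fields of the read ID are all hypothetical. Only the split on "_" and the subtraction of 33 come from the description above.

```java
/** Illustrative parser; the header layout is an assumption, not the exact mergebarcodes.sh format. */
final class BarcodeHeaderSketch {

    /** Returns {bases, qualityString}, assumed to be the last two underscore-separated fields. */
    static String[] splitBarcodeFields(String readId) {
        String[] parts = readId.split("_");
        if (parts.length < 3) {
            throw new IllegalArgumentException("No barcode fields in read ID: " + readId);
        }
        return new String[] {parts[parts.length - 2], parts[parts.length - 1]};
    }

    /** Converts a Phred+33 quality string to numeric scores by subtracting 33 from each character. */
    static byte[] decodeQualities(String qualityString) {
        byte[] quals = new byte[qualityString.length()];
        for (int i = 0; i < quals.length; i++) {
            quals[i] = (byte) (qualityString.charAt(i) - 33);
        }
        return quals;
    }

    public static void main(String[] args) {
        // Hypothetical read ID: name, barcode bases, barcode qualities.
        String[] fields = splitBarcodeFields("read1_ACGT_II5#");
        byte[] quals = decodeQualities(fields[1]);
        // Prints: ACGT [40, 40, 20, 2]
        System.out.println(fields[0] + " " + java.util.Arrays.toString(quals));
    }
}
```

Note that splitting on "_" would misparse a quality string that itself contains an underscore (Q62 in Phred+33), which typical Illumina quality ranges do not reach.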
Quality Metrics Calculation
For each barcode, the tool calculates two key quality metrics:
- Average Quality: Probability-weighted average of all base qualities in the barcode, computed with Read.avgQualityByProbabilityInt(barbases, barquals, true, 0)
- Minimum Quality: Lowest individual base quality found in the barcode sequence, computed with Tools.min(barquals)
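To illustrate why a probability-weighted average differs from a plain mean of Phred scores, here is a small sketch. It assumes the usual Phred relation (error probability = 10^(-Q/10)) and rounds to the nearest integer; the actual rounding and the meaning of the extra arguments to Read.avgQualityByProbabilityInt are not covered here, and the class and method names are hypothetical.

```java
/** Sketch of the two barcode metrics; not the BBTools implementation. */
final class BarcodeQualitySketch {

    /** Averages error probabilities, then converts the mean back to a Phred score. */
    static int averageByProbability(byte[] quals) {
        if (quals.length == 0) {
            return 0;
        }
        double errorSum = 0;
        for (byte q : quals) {
            errorSum += Math.pow(10, -q / 10.0); // Phred score -> error probability
        }
        double meanError = errorSum / quals.length;
        return (int) Math.round(-10 * Math.log10(meanError)); // assumption: round to nearest
    }

    /** Lowest single-base quality in the barcode (0 for an empty barcode). */
    static int minimum(byte[] quals) {
        if (quals.length == 0) {
            return 0;
        }
        int min = quals[0];
        for (byte q : quals) {
            min = Math.min(min, q);
        }
        return min;
    }

    public static void main(String[] args) {
        byte[] quals = {10, 40};
        // The arithmetic mean of 10 and 40 is 25, but one poor base dominates the error rate.
        System.out.println(averageByProbability(quals)); // prints 13
        System.out.println(minimum(quals));              // prints 10
    }
}
```

Because low-quality bases dominate the mean error probability, a single poor base pulls the average down much further than an arithmetic mean would.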
Correlation Analysis
The tool maintains two 50×50 correlation matrices (qualCor1 and qualCor2) to track relationships between read qualities and barcode qualities for paired-end data. For each read, r1.avgQualityByProbabilityInt(true, 0) calculates read quality, and the tool increments the appropriate matrix cell with qualCor1[q1][qbar]++. This enables assessment of whether barcode quality is predictive of read quality.
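A count matrix of this kind can be sketched as below. The class, the clamping helper, and the per-row mean are illustrative assumptions; the tool's own summary columns (mean barcode quality, standard deviation, count) may be computed differently.

```java
/** Sketch of a read-quality vs. barcode-quality count matrix; names are hypothetical. */
final class QualityCorrelationSketch {
    static final int SIZE = 50;
    final long[][] qualCor = new long[SIZE][SIZE];

    /** Keeps a quality value inside the matrix bounds (assumed capping behavior). */
    static int clamp(int q) {
        return Math.max(0, Math.min(SIZE - 1, q));
    }

    /** Records one read: row = read average quality, column = barcode average quality. */
    void add(int readQuality, int barcodeQuality) {
        qualCor[clamp(readQuality)][clamp(barcodeQuality)]++;
    }

    /** Mean barcode quality among reads with the given average quality, or -1 if none were seen. */
    double meanBarcodeQuality(int readQuality) {
        long[] row = qualCor[clamp(readQuality)];
        long count = 0;
        long weighted = 0;
        for (int b = 0; b < SIZE; b++) {
            count += row[b];
            weighted += row[b] * b;
        }
        return count == 0 ? -1 : weighted / (double) count;
    }
}
```

If the mean barcode quality rises steadily with read quality across rows, barcode quality is a useful predictor of read quality; a flat profile suggests it is not.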
Quality Filtering Strategy
Reads are filtered using two thresholds combined with a logical OR, so failing either one discards the read:
- Reads with barcode average quality below minBarcodeAverageQuality (maq parameter) are discarded
- Reads with barcode minimum quality below minBarcodeMinQuality (mmq parameter) are discarded
- The filtering condition is: if(qbar<minBarcodeAverageQuality || minqbar<minBarcodeMinQuality)
- When filtered, reads are marked discarded with r1.setDiscarded(true)
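The decision rule above amounts to a single OR over two comparisons. The following sketch wraps it in a hypothetical class that also tracks the discard percentage, using the readsTossed*100.0/readsProcessed expression the tool reports; the class name and method names are assumptions.

```java
/** Sketch of the dual-threshold filter; field names follow the description, the class is hypothetical. */
final class BarcodeFilterSketch {
    private final int minBarcodeAverageQuality; // maq
    private final int minBarcodeMinQuality;     // mmq
    long readsProcessed = 0;
    long readsTossed = 0;

    BarcodeFilterSketch(int maq, int mmq) {
        this.minBarcodeAverageQuality = maq;
        this.minBarcodeMinQuality = mmq;
    }

    /** Returns true if the read should be discarded, and updates the counters. */
    boolean process(int qbar, int minqbar) {
        readsProcessed++;
        boolean discard = qbar < minBarcodeAverageQuality || minqbar < minBarcodeMinQuality;
        if (discard) {
            readsTossed++;
        }
        return discard;
    }

    /** Percentage of processed reads that were discarded (0 if nothing was processed). */
    double percentDiscarded() {
        return readsProcessed == 0 ? 0 : readsTossed * 100.0 / readsProcessed;
    }

    public static void main(String[] args) {
        BarcodeFilterSketch filter = new BarcodeFilterSketch(15, 10); // maq=15 mmq=10
        System.out.println(filter.process(20, 8));     // true: minimum 8 is below 10
        System.out.println(filter.process(15, 10));    // false: both thresholds met exactly
        System.out.println(filter.percentDiscarded()); // 50.0
    }
}
```

Both comparisons are strict, so a barcode whose quality equals the threshold is kept, and the default thresholds of 0 never discard a read.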
Histogram Generation
The tool generates three types of quality histograms using fixed-size arrays:
- Average Quality Histogram (aqhistArray): 100-element array tracking distribution of average qualities across all barcodes with aqhistArray[qbar]++
- Minimum Quality Histogram (mqhistArray): 100-element array tracking distribution of minimum qualities across all barcodes with mqhistArray[minqbar]++
- Position-specific Quality Histogram: Quality distribution at each position within barcodes via readstats.addToQualityHistogram(barquals, 0) when ReadStats is enabled
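As a rough sketch of how such fixed-size histograms can be accumulated and written out, consider the class below. The header string matches the one the tool writes for histogram output; the capping helper, the five-decimal fraction format, and the choice to skip empty bins are assumptions made for illustration.

```java
/** Sketch of fixed-size barcode quality histograms; not the BBTools implementation. */
final class BarcodeHistogramSketch {
    static final int BINS = 100;
    final long[] aqhist = new long[BINS]; // average barcode quality
    final long[] mqhist = new long[BINS]; // minimum barcode quality

    /** Caps a quality value to the valid bin range 0-99. */
    static int cap(int q) {
        return Math.max(0, Math.min(BINS - 1, q));
    }

    /** Records one barcode in both histograms. */
    void add(int qbar, int minqbar) {
        aqhist[cap(qbar)]++;
        mqhist[cap(minqbar)]++;
    }

    /** Renders a histogram as tab-delimited text with a fraction column. */
    static String render(long[] hist) {
        long total = 0;
        for (long c : hist) {
            total += c;
        }
        StringBuilder sb = new StringBuilder("#Quality\tcount\tfraction\n");
        for (int q = 0; q < hist.length; q++) {
            if (hist[q] == 0) {
                continue; // assumption: empty bins are skipped to keep the output short
            }
            double fraction = hist[q] / (double) total;
            sb.append(String.format(java.util.Locale.ROOT, "%d\t%d\t%.5f\n", q, hist[q], fraction));
        }
        return sb.toString();
    }
}
```

Memory use depends only on the number of bins, not on the number of reads, which is what keeps the footprint constant for large inputs.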
Memory Management
The tool uses fixed-size data structures to maintain constant memory usage regardless of dataset size:
- Quality histogram arrays: long[] aqhistArray=new long[100] and long[] mqhistArray=new long[100]
- Correlation matrices: long[][] qualCor1=new long[50][50] and long[][] qualCor2=new long[50][50]
- Quality scores are capped at practical ranges (0-99 for histograms, 0-49 for correlation) to fit these arrays
Statistical Output
Upon completion, the tool reports processing statistics:
- Total reads and bases processed using Tools.timeReadsBasesProcessed(t, readsProcessed, basesProcessed, 8)
- Number and percentage of reads discarded: readsTossed*100.0/readsProcessed
- Correlation data written with format: "#Read1_Q\tBar_Q\tstdev\tcount\tRead2_Q\tBar_Q\tstdev\tcount\n"
- Histogram data written with headers: "#Quality\tcount\tfraction\n"
Support
For questions and support:
- Email: bbushnell@lbl.gov
- Documentation: bbmap.org