CalcTrueQuality

Script: calctruequality.sh Package: jgi Class: CalcTrueQuality.java

Calculates observed quality scores from mapped sam/bam files. Generates matrices for use in recalibrating quality scores. By default, the matrices are written to /ref/qual/ in the current directory.

Description

CalcTrueQuality analyzes mapped reads to calculate the actual accuracy of reported quality scores, then generates recalibration matrices to improve quality score accuracy. The tool performs statistical analysis on alignment data to determine how often bases with specific quality scores are actually correct or incorrect under various contextual conditions.

For diploid organisms with high heterozygosity rates, the tool may produce inaccurate recalibration at high quality scores unless SNP locations are masked or variations are called. For example, recalibrating human reads mapped to an unmasked human reference would generate a maximum Q-score of roughly 30 due to the human 1/1000 SNP rate.

If you have multiple sam/bam files demultiplexed from a single sequencing run, it is recommended to use all of them as input for increased statistical power. Once matrices are generated, recalibration can be done on mapped or unmapped reads. Better results may be obtained by recalibrating the fastq files and remapping the calibrated reads.

Basic Usage

calctruequality.sh in=<sam,sam,...sam> path=<directory>

Two-Step Process

Step 1: Generate matrices using CalcTrueQuality

calctruequality.sh in=mapped.sam path=calibration_dir

Step 2: Recalibrate reads using BBDuk

bbduk.sh in=reads.fq out=calibrated.fq recalibrate

Parameters

Parameters are organized by their function in the quality calibration process. Each parameter group serves a specific role in matrix generation and quality score recalibration.

Input parameters

in=<file,file>: Sam/bam file or comma-delimited list of files. Alignments must use = and X cigar symbols, or have MD tags, or ref must be specified.
reads=-1: Stop after processing this many reads (if positive). Use -1 to process all reads.
samstreamer=t: (ss) Load reads multithreaded to increase speed. Can specify thread count with samstreamer=4.
unpigz=t: Use pigz to decompress compressed input files.

Output parameters

overwrite=t: (ow) Set to true to allow overwriting of existing files.
path=.: Directory to write quality matrices (within /ref subdir). Matrices will be written to path/ref/qual/.
write=t: Write matrices to disk. Set to false to skip matrix writing (useful for statistics only).
showstats=t: Print a summary of processed reads and statistics.
pigz=f: Use pigz to compress output matrix files.

Other parameters

t=auto: Number of worker threads for processing. Auto-detects optimal thread count.
passes=2: Recalibration passes, 1 or 2. Pass 2 is slower but gives more accurate quality scores by using pass 1 results to refine calculations.
recalqmax=42: Adjust max quality scores tracked. The actual highest quality score allowed is recalqmax-1 (default max Q41).
trackall=f: Track all available quality metrics and produce all matrices, including ones not selected for quality adjustment. Reduces speed but allows testing different recalibration matrices.
indels=t: Include indels in quality calculations. When false, only substitution errors are considered.
usetiles=f: Use per-tile quality statistics to generate matrices. If true, this flag must also be used during recalibration (e.g. in BBDuk). Useful for Illumina data with tile-specific quality issues.

Variation calling parameters

varfile=<file>: Use the variants in this var file, instead of calling variants. The format can be produced by CallVariants.
vcf=<file>: Use the variants in this VCF file, instead of calling variants.
callvars=f: Call SNPs, and do not count them as errors. Essential for diploid organisms with high heterozygosity.
ploidy=1: Set the organism's ploidy for variant calling. Use 2 for diploid organisms.
ref=: Reference sequence file, required for variation-calling. Must be indexed if using variant calling.

Matrix-selection parameters

loadq102=: For each recalibration matrix, enable or disable that matrix with t/f. You can specify pass1 or pass2 like this: loadq102_p1=f loadq102_p2=t. The default is loadqbp_p1=t loadqbp_p2=t loadqb123_p1=t.
clearmatrices=f: If true, clear all existing matrix selections. For example: 'clearmatrices loadqbp_p1' would ignore defaults and select only qbp for the first pass.

Available matrix type parameters

q102: Quality, leading quality, trailing quality. Considers quality scores of neighboring bases.
qap: Quality, average quality, position. Uses average read quality and position within read.
qbp: Quality, current base, position. Most commonly used matrix considering base identity and position.
q10: Quality, leading quality. Considers the quality score of the previous base.
q12: Quality, trailing quality. Considers the quality score of the following base.
qb12: Quality, leading base, current base. Considers base identity context.
qb012: Quality, two leading bases, current base. Extended base context including two preceding bases.
qb123: Quality, leading base, current base, trailing base. Three-base context window.
qb234: Quality, current base, two trailing bases. Extended context including two following bases.
q12b12: Quality, trailing quality, leading base, current base. Combines quality and base context.
qp: Quality, position. Simple position-dependent recalibration.
q: Current quality score only. Basic quality-dependent recalibration without context.

Java Parameters

-Xmx: This will set Java's memory usage, overriding autodetection. -Xmx20g will specify 20 gigs of RAM, and -Xmx200m will specify 200 megs. The max is typically 85% of physical memory.
-eoom: This flag will cause the process to exit if an out-of-memory exception occurs. Requires Java 8u92+.
-da: Disable assertions to slightly improve performance.

Examples

Basic Quality Calibration

calctruequality.sh in=aligned.bam path=calibration_data

Generates quality calibration matrices from a mapped BAM file, writing matrices to calibration_data/ref/qual/.

Multiple Input Files

calctruequality.sh in=sample1.bam,sample2.bam,sample3.bam path=merged_calibration

Uses multiple BAM files from the same sequencing run to increase statistical power for matrix generation.

Diploid Organism with Variant Calling

calctruequality.sh in=human.bam ref=human_genome.fa callvars=t ploidy=2 path=human_calibration

Calibrates human reads while calling variants to avoid counting true SNPs as sequencing errors.

Single Pass Calibration

calctruequality.sh in=reads.bam passes=1 trackall=t path=quick_calibration

Faster single-pass calibration while generating all matrix types for comparison testing.

Using Pre-existing Variants

calctruequality.sh in=sample.bam vcf=known_variants.vcf path=variant_aware_calibration

Uses known variants from a VCF file to avoid counting true variants as sequencing errors.

Algorithm Details

Quality Score Recalibration Strategy

CalcTrueQuality implements a multi-dimensional matrix-based approach using separate "good" and "bad" count matrices (GBMatrixSet class) to track correct and incorrect base calls. The modify() method combines observed error rates with expected error rates using QualityTools.PROB_ERROR lookup tables and observation cutoff weighting to ensure statistical reliability when observation counts are low.

Matrix-Based Recalibration

The algorithm maintains parallel good/bad matrices with specific dimensional structures for different contextual factors. Each matrix type uses multi-dimensional long arrays with predefined size limits (QMAX2=42 for quality, BMAX=6 for bases, LENMAX=500 for position, TMAX=2048 for tiles):

Base Context Matrices: qbpGoodMatrix[2][42][6][500] and qb123GoodMatrix[2][42][6][6][6] track accuracy using baseToNum[] lookup table conversion (A=0, C=1, G=2, T=3, U=3, E=4, other=5)
Quality Context Matrices: q102GoodMatrix[2][42][42][42] considers quality scores of neighboring bases using Tools.mid(QMAX, quals[position], 0) boundary clamping
Position Matrices: qapGoodMatrix[2][42][43][500] accounts for position-within-read effects using Tools.min(qpos, LENMAX-1) position capping
Tile Matrices: qptGoodMatrix[2][42][500][2048] uses parseTile() method to extract tile numbers from Illumina read names via colon-separated field parsing

Two-Pass Recalibration

The two-pass system uses initializeMatrices() to load CountMatrixSet objects for each pass, then applies recalibrate() with pass-specific boolean flags (pass0, pass1) to iteratively refine quality scores. Pass 1 matrices are used by CountMatrixSet.recalibrate() to generate improved quality scores, which are then used as input for pass 2 matrix generation through the same GBMatrixSet accumulation process.

Error Probability Estimation

For each base position, the CountMatrixSet class implements four distinct error probability estimation methods selectable via USE_* boolean flags:

Average Method (estimateErrorProbAvg): Sums error probabilities from all loaded matrix types and returns sum/(float)x where x is the number of contributing matrices
Maximum Method (estimateErrorProbMax): Uses Tools.max(max, f) to track the highest error probability from any loaded matrix, returning the maximum value found
Weighted Average (estimateErrorProbWeighted): Combines observation counts from all matrices using (bad+fakeBad)/(sum+fakeSum) where fakeBad and fakeSum provide regularization based on obs_cutoff and BAD_CUTOFF constants
SMR Method (estimateErrorProbSMR): Implements a square-mean-root calculation combining matrix probabilities with geometric averaging

Variant Handling

The algorithm integrates with CallVariants class to generate VarMap and ScafMap objects. The AnalyzeVars.fixVars() method processes reads to identify variant positions, converting 'S' (substitution) symbols to 'V' (variant) in the match array when positions are found in the loaded variant map. This prevents true biological variants from being counted as sequencing errors during matrix accumulation.

Statistical Confidence

The modify() method implements Bayesian regularization using observation cutoffs (OBSERVATION_CUTOFF[pass]) to prevent overfitting. For low-observation entries, it computes sum2=sum+cutoff and bad2=bad+expected*cutoff where expected comes from QualityTools.PROB_ERROR[phred] lookup table. The final error probability is calculated as bad2/sum2, blending observed and expected error rates based on observation confidence.

Memory and Performance

The algorithm uses Worker threads (extending Thread class) with either ConcurrentReadInputStream or SamReadStreamer for parallel processing. Memory usage is determined by matrix dimensions: largest matrices like qb123GoodMatrix require 2×42×6×6×6×8 bytes (≈250KB each), while 5-dimensional matrices like q12b12GoodMatrix need 2×42×42×6×6×8 bytes (≈2.5MB each). The process_MT() method manages thread pools with Tools.mid(1, threads, 16) worker thread limits.

Support

For questions and support:

Email: bbushnell@lbl.gov
Documentation: bbmap.org