CallVariants2

Important: CallVariants2 vs CallVariants

Tool Selection Guidelines:

Use CallVariants2 when you have multiple SAM/BAM files, each representing a different sample/individual
Use CallVariants when you have only one sample, even if that sample consists of multiple SAM/BAM files (e.g., multiple sequencing runs of the same individual)
Key Difference: CallVariants2 reports allele frequencies for ALL samples at locations where ANY sample has a variant, enabling population-level comparisons
Algorithm: Both tools use identical variant calling algorithms - CallVariants2 is not "better," just designed for multi-sample workflows

Basic Usage

callvariants2.sh in=sample1.bam,sample2.bam,sample3.bam ref=reference.fa vcf=variants.vcf ploidy=2

Essential parameters: Input files may be sorted or unsorted, compressed or uncompressed. The reference should be FASTA format. Ploidy must be set according to your organism - this is crucial for accurate allele frequency assessment.

Algorithm Details

Multi-Sample Variant Discovery

CallVariants2 implements a sophisticated two-pass algorithm specifically designed for population-level variant analysis:

Two-Pass Processing Strategy

Pass 1: Independent variant discovery across all samples identifies high-confidence variant positions
Pass 2: Re-analyzes all samples at the union of discovered variant sites for consistent population-level genotyping
Benefit: Ensures comprehensive variant detection where any sample contributing evidence results in evaluation across the entire cohort
Memory Management: Each sample maintains independent variant maps and statistics, preventing memory conflicts in large cohorts

Quality Assessment Framework

Integrates multiple statistical dimensions for robust variant filtering, calibrated specifically for Illumina sequencing platforms:

Base Quality Integration: Incorporates per-base quality scores with mapping quality for confidence assessment
Strand Bias Detection: Evaluates forward/reverse strand representation to identify systematic biases
Read Identity Analysis: Uses average read-to-reference identity to filter alignment artifacts
Pairing Information: Leverages proper pair rates and paired-end consistency
Positional Metrics: Distance from read ends and contig boundaries for artifact detection
Homopolymer Context: Special handling of variants in repetitive sequence regions

Memory-Efficient Architecture

Scalable design supports both small studies and population-scale analyses:

Memory Requirement: Base requirement is 5 bytes per reference base plus ~200 bytes per variant
Prefiltering System: Optional k-mer counting Bloom filter reduces memory usage by 50-80% for large cohorts
Atomic Operations: Automatically enabled for >8 threads to prevent memory contention
Thread-Local Processing: Each processing thread maintains independent variant maps until periodic consolidation

Platform-Specific Optimization

Currently calibrated for Illumina sequencing data, with specific optimizations:

Illumina Calibration: Quality score interpretation optimized for NextSeq and HiSeq 2500 platforms
PacBio/Nanopore Support: Requires manual adjustment of quality parameters (minquality=0, usequality=f)
Paired-End Optimization: Automatically adjusts pairing rate contributions based on overall dataset characteristics
CIGAR String Handling: Optimized for SAM 1.4+ format with explicit match/mismatch operators

Ploidy and Population Considerations

Ploidy Configuration

Critical: Accurate ploidy setting is essential for proper variant calling. Allele fractions below expected levels receive quality penalties.

Haploid (ploidy=1): Default setting, appropriate for bacteria, viruses, and organellar genomes
Diploid (ploidy=2): Standard for most eukaryotic nuclear genomes (mammals, plants, fungi)
Polyploid (ploidy>2): For wheat, strawberry, and other polyploid organisms
Impact: Variants with allele fractions below 1/ploidy threshold receive significant score penalties

Allele Frequency Analysis

Population-level allele frequency calculation considers sample independence:

Minimum Allele Fraction: Set minallelefraction based on expected population diversity
Rarity Parameter: Penalizes variants below expected frequency for the given ploidy
Multi-Sample Reporting: VCF output includes per-sample genotypes and population statistics

Parameters

CallVariants2 accepts all CallVariants parameters. Key parameters are organized by functional category:

I/O Parameters

in=<file>: Input SAM/BAM files; comma-delimited list where each file represents one sample. May be sorted or unsorted, compressed with gzip/bzip2.
list=<file>: Text file containing one input file path per line. Alternative to comma-delimited in= parameter.
out=<file>: Output variant file. Extension determines format: .vcf for VCF, .txt for native BBTools format.
vcf=<file>: Output VCF file with multi-sample genotype columns.
ref=<file>: Reference FASTA file. Required for accurate variant calling and determining contig boundaries for strand bias calculation.
vcfin=<file>: Input VCF file(s) with forced variant positions. These positions will be reported regardless of quality filters.
sample=<names>: Comma-delimited sample names. If not specified, names are derived from input file names with automatic collision resolution.
vcf0=<pattern>: Per-sample VCF output pattern. Use % as placeholder for sample name (e.g., "individual_%.vcf.gz").
shist=<file>: Output file for variant score histogram analysis.
zhist=<file>: Output file for zygosity distribution histogram.
qhist=<file>: Output file for base quality histogram by variant type.

Processing Parameters

ploidy=1: Critical parameter: Set organism ploidy. Affects allele frequency interpretation and quality scoring. Default assumes haploid organisms.
prefilter=f: Enable memory-efficient k-mer prefiltering. Reduces memory usage by 50-80% at the cost of doubled runtime. Results are identical.
rarity=1.0: Quality penalty threshold for low-frequency variants. Variants below this allele fraction receive score penalties. Should consider ploidy when setting.
minallelefraction=0.1: Minimum allele fraction for variant reporting. Adjust based on expected population diversity and ploidy.
useidentity=t: Include average read identity in quality score calculation. Helps filter alignment artifacts.
usepairing=t: Include proper pairing rate in quality assessment. Automatically disabled for single-end data.
usebias=t: Include strand bias in quality calculation. Disable for strand-specific protocols (minstrandratio=0 usebias=f).
homopolymer=t: Apply quality penalties to variants in homopolymer contexts, which are prone to sequencing errors.
nscan=t: Consider distance from contig ends when calculating strand bias, important for reference quality assessment.

Memory and Performance Parameters

32bit=f: Enable coverage tracking above 65,535x depth. Increases memory usage but necessary for very high coverage datasets.
atomic=auto: Atomic memory operations for thread safety. Automatically enabled for >8 threads, forces 32bit=true.
strandedcov=f: Track per-strand coverage for MCOV and DP4 VCF fields. Increases memory usage but provides additional statistics.
samstreamer=t: Enable multi-threaded SAM reading. Specify number of threads (e.g., ss=6) or disable (ss=f).
streamermf=8: Maximum number of SAM files to read simultaneously. Reduce if memory-constrained.

Quality Filtering Parameters

minreads=2: Minimum number of reads supporting a variant. Core parameter for sensitivity/specificity balance.
mincov=0: Minimum coverage depth for variant calling. Set based on expected coverage distribution.
minqualitymax=15: Minimum maximum base quality among supporting reads. Filters low-quality base calls.
minquality=12.0: Minimum average base quality among supporting reads.
minedist=10.0: Minimum average distance from read ends. Filters end-repair and adapter artifacts.
minpairingrate=0.1: Minimum proper pairing rate among supporting reads. Automatically disabled for unpaired data.
minstrandratio=0.1: Minimum ratio of forward to reverse strand coverage. Set to 0 for strand-specific protocols.
minscore=20.0: Minimum Phred-scaled variant quality score. Primary threshold for variant acceptance.
clearfilters: Reset all quality filters to permissive values. Useful for exploratory analysis or low-quality data.

Realignment Parameters

realign=f: Enable read realignment around indels. Recommended for aligners other than BBMap. Reduces speed but improves accuracy.
unclip=f: Convert clipped bases to matches/mismatches during realignment. Useful for recovering marginally-mapped regions.
repadding=70: Padding around realignment region. Longer padding improves accuracy for large indels but reduces speed.
rerows=602: Maximum read length for realignment. Reads exceeding this length cannot be realigned.
recols=2000: Maximum reference segment length for realignment. Must accommodate read length plus largest expected indel plus 2×padding.

Neural Network Parameters

net=<file>: Path to neural network model file for enhanced variant quality assessment.
netcutoff=auto: Neural network score threshold. Use 'auto' for model default or specify custom float value.
usenet=f: Enable neural network scoring. Requires compatible model file.
netmode=<mode>: Feature extraction mode for neural network input. Affects how variant context is encoded.

Java Parameters

-Xmx: Java heap memory allocation. CallVariants2 auto-detects available memory and uses ~85%. Override for specific requirements (e.g., -Xmx20g).
-eoom: Exit on out-of-memory errors rather than crashing. Requires Java 8u92+.

Usage Examples

Basic Multi-Sample Variant Calling

callvariants2.sh in=sample1.bam,sample2.bam,sample3.bam ref=reference.fa vcf=population.vcf ploidy=2

Standard diploid variant calling across three samples with population-level VCF output.

Large Cohort with Memory Optimization

callvariants2.sh list=samples.txt ref=reference.fa vcf=cohort.vcf ploidy=2 \
    prefilter=t minreads=3 minallelefraction=0.02

Process large cohort using file list and prefiltering. Suitable for population studies with >20 samples.

Low-Frequency Variant Detection

callvariants2.sh in=tumor1.bam,tumor2.bam,normal.bam ref=reference.fa vcf=somatic.vcf \
    ploidy=2 maf=0.02 rarity=0.02 minallelefraction=0.02 minreads=5

Optimized for detecting low-frequency variants (2% allele fraction) in cancer samples.

PacBio Long-Read Variant Calling

callvariants2.sh in=pacbio1.bam,pacbio2.bam ref=reference.fa vcf=longread.vcf \
    ploidy=2 minquality=0 minqualitymax=0 minscore=10 usequality=f

Adapted settings for PacBio data with adjusted quality parameters for long-read characteristics.

Quality Score Recalibration Workflow

# Initial mapping
bbmap.sh ref=reference.fa in=reads.fq out=mapped.sam

# Initial variant calling
callvariants2.sh in=mapped.sam ref=reference.fa vcf=initial.vcf ploidy=2

# Quality recalibration
calctruequality.sh in=mapped.sam vcf=initial.vcf
bbduk.sh in=mapped.sam out=recalibrated.sam recalibrate

# Final high-accuracy variant calling
callvariants2.sh in=recalibrated.sam ref=reference.fa vcf=final.vcf ploidy=2

Complete workflow including quality score recalibration for maximum accuracy.

Per-Sample Output Generation

callvariants2.sh in=a.bam,b.bam,c.bam ref=reference.fa vcf=merged.vcf \
    vcf0=individual_%.vcf.gz sample=SampleA,SampleB,SampleC bgzip=t

Generate both merged population VCF and compressed individual sample VCFs.

Strand-Specific Protocol Adaptation

callvariants2.sh in=stranded1.bam,stranded2.bam ref=reference.fa vcf=stranded.vcf \
    ploidy=2 minstrandratio=0 usebias=f

Disable strand bias filtering for strand-specific RNA-seq or other directional protocols.

Memory-Constrained High-Coverage Analysis

callvariants2.sh in=high_cov.list ref=reference.fa vcf=variants.vcf ploidy=2 \
    prefilter=t -Xmx10g threads=4 streamermf=2

Process high-coverage data with explicit memory limits and reduced thread usage.

Performance and Optimization

Memory Management

Base Requirement: 5 bytes per reference base plus ~200 bytes per candidate variant
Prefiltering: Reduces memory usage by 50-80% for cohorts with >10 samples
Coverage Tracking: Use 32bit=f unless coverage regularly exceeds 65,535x
Auto-Detection: Tool automatically uses ~85% of available system memory

Threading and I/O

Processing Threads: Scales effectively up to 8 cores unless realignment is enabled
I/O Optimization: Uncompressed SAM files are fastest, followed by pigz-compressed files
SamStreamer: Multi-threaded reading improves performance for large files
Realignment: Fully utilizes all available cores when enabled

Input File Optimization

Mapped Reads Only: Files containing only mapped reads process up to 100x faster
Primary Alignments: Remove secondary/supplementary alignments for speed
Compression: Use pigz-compressed files for optimal balance of speed and storage
Sorting: Sorted vs unsorted files have minimal performance difference

Typical Performance Metrics

Small Cohort (3-5 samples, 30x coverage): 2-4 hours per human genome
Large Cohort (50+ samples, 30x coverage): 6-12 hours per genome with prefiltering
Memory Usage: 4-8 GB per sample without prefiltering, 2-4 GB with prefiltering
Throughput: Limited primarily by I/O bandwidth rather than CPU performance

Technical Notes

Platform Compatibility

Illumina Optimization: Quality scoring calibrated for NextSeq and HiSeq 2500 platforms
Long-Read Support: PacBio and Nanopore data require manual quality parameter adjustment
CIGAR Format: Optimized for SAM 1.4+ with explicit match/mismatch operators (= and X)
Legacy Support: Accepts older CIGAR strings (M only) if MD tags are present

Statistical Framework

Quality Integration: Combines base quality, mapping quality, and positional metrics
Bias Detection: Strand bias, pairing bias, and positional bias assessment
Population Context: Allele frequency calculations across the entire sample cohort
Filtering Cascade: Multiple independent and combined statistical filters

Output Formats

VCF Standard: Full compliance with VCF 4.2 specification for multi-sample data
Per-Sample VCFs: Individual files contain only variants passing filters for that sample
Population VCF: Merged file includes all samples with population-level statistics
Native Format: BBTools-specific format with extended statistical annotations

Related Tools

callvariants.sh - Single-sample variant calling with identical algorithm
bbmap.sh - Recommended aligner for optimal CallVariants2 performance
calctruequality.sh - Quality score recalibration for enhanced accuracy
bbduk.sh - Quality recalibration application and adapter trimming
vcffilter.sh - Post-processing VCF filtering and annotation
postfilter.sh - Additional variant filtering with population-specific parameters

Troubleshooting

Memory Issues

Out of Memory: Enable prefilter=t and reduce streamermf parameter
Large Variants: Use prefilter=t to ignore low-frequency variants consuming memory
High Coverage: Consider downsampling input or using 32bit=t with more memory

Quality Issues

Too Many Variants: Check ploidy setting and increase quality thresholds
Too Few Variants: Verify reference genome and consider clearfilters for initial exploration
Strand Bias: For strand-specific protocols, use minstrandratio=0 usebias=f

Performance Issues

Slow Processing: Remove unmapped reads and secondary alignments from input
I/O Bottleneck: Use uncompressed SAM or pigz-compressed files
Thread Scaling: Enable realign=t to fully utilize available CPU cores