CallVariants2

Script: callvariants2.sh Package: var2 Class: CallVariants2.java

Multi-sample variant calling from SAM/BAM files. Each input file represents an individual sample for population-level variant analysis. This tool is equivalent to running callvariants.sh with the "multi" flag, designed specifically for cohort studies and comparative genomics.

Important: CallVariants2 vs CallVariants

Tool Selection Guidelines:

  • Use CallVariants2 when you have multiple SAM/BAM files, each representing a different sample/individual
  • Use CallVariants when you have only one sample, even if that sample consists of multiple SAM/BAM files (e.g., multiple sequencing runs of the same individual)
  • Key Difference: CallVariants2 reports allele frequencies for ALL samples at locations where ANY sample has a variant, enabling population-level comparisons
  • Algorithm: Both tools use identical variant calling algorithms - CallVariants2 is not "better," just designed for multi-sample workflows

Basic Usage

callvariants2.sh in=sample1.bam,sample2.bam,sample3.bam ref=reference.fa vcf=variants.vcf ploidy=2

Essential parameters: Input files may be sorted or unsorted, compressed or uncompressed. The reference should be FASTA format. Ploidy must be set according to your organism - this is crucial for accurate allele frequency assessment.

Algorithm Details

Multi-Sample Variant Discovery

CallVariants2 implements a sophisticated two-pass algorithm specifically designed for population-level variant analysis:

Two-Pass Processing Strategy

Quality Assessment Framework

Integrates multiple statistical dimensions for robust variant filtering, calibrated specifically for Illumina sequencing platforms:

Memory-Efficient Architecture

Scalable design supports both small studies and population-scale analyses:

Platform-Specific Optimization

Currently calibrated for Illumina sequencing data, with specific optimizations:

Ploidy and Population Considerations

Ploidy Configuration

Critical: Accurate ploidy setting is essential for proper variant calling. Allele fractions below expected levels receive quality penalties.

Allele Frequency Analysis

Population-level allele frequency calculation considers sample independence:

Parameters

CallVariants2 accepts all CallVariants parameters. Key parameters are organized by functional category:

I/O Parameters

in=<file>
Input SAM/BAM files; comma-delimited list where each file represents one sample. May be sorted or unsorted, compressed with gzip/bzip2.
list=<file>
Text file containing one input file path per line. Alternative to comma-delimited in= parameter.
out=<file>
Output variant file. Extension determines format: .vcf for VCF, .txt for native BBTools format.
vcf=<file>
Output VCF file with multi-sample genotype columns.
ref=<file>
Reference FASTA file. Required for accurate variant calling and determining contig boundaries for strand bias calculation.
vcfin=<file>
Input VCF file(s) with forced variant positions. These positions will be reported regardless of quality filters.
sample=<names>
Comma-delimited sample names. If not specified, names are derived from input file names with automatic collision resolution.
vcf0=<pattern>
Per-sample VCF output pattern. Use % as placeholder for sample name (e.g., "individual_%.vcf.gz").
shist=<file>
Output file for variant score histogram analysis.
zhist=<file>
Output file for zygosity distribution histogram.
qhist=<file>
Output file for base quality histogram by variant type.

Processing Parameters

ploidy=1
Critical parameter: Set organism ploidy. Affects allele frequency interpretation and quality scoring. Default assumes haploid organisms.
prefilter=f
Enable memory-efficient k-mer prefiltering. Reduces memory usage by 50-80% at the cost of doubled runtime. Results are identical.
rarity=1.0
Quality penalty threshold for low-frequency variants. Variants below this allele fraction receive score penalties. Should consider ploidy when setting.
minallelefraction=0.1
Minimum allele fraction for variant reporting. Adjust based on expected population diversity and ploidy.
useidentity=t
Include average read identity in quality score calculation. Helps filter alignment artifacts.
usepairing=t
Include proper pairing rate in quality assessment. Automatically disabled for single-end data.
usebias=t
Include strand bias in quality calculation. Disable for strand-specific protocols (minstrandratio=0 usebias=f).
homopolymer=t
Apply quality penalties to variants in homopolymer contexts, which are prone to sequencing errors.
nscan=t
Consider distance from contig ends when calculating strand bias, important for reference quality assessment.

Memory and Performance Parameters

32bit=f
Enable coverage tracking above 65,535x depth. Increases memory usage but necessary for very high coverage datasets.
atomic=auto
Atomic memory operations for thread safety. Automatically enabled for >8 threads, forces 32bit=true.
strandedcov=f
Track per-strand coverage for MCOV and DP4 VCF fields. Increases memory usage but provides additional statistics.
samstreamer=t
Enable multi-threaded SAM reading. Specify number of threads (e.g., ss=6) or disable (ss=f).
streamermf=8
Maximum number of SAM files to read simultaneously. Reduce if memory-constrained.

Quality Filtering Parameters

minreads=2
Minimum number of reads supporting a variant. Core parameter for sensitivity/specificity balance.
mincov=0
Minimum coverage depth for variant calling. Set based on expected coverage distribution.
minqualitymax=15
Minimum maximum base quality among supporting reads. Filters low-quality base calls.
minquality=12.0
Minimum average base quality among supporting reads.
minedist=10.0
Minimum average distance from read ends. Filters end-repair and adapter artifacts.
minpairingrate=0.1
Minimum proper pairing rate among supporting reads. Automatically disabled for unpaired data.
minstrandratio=0.1
Minimum ratio of forward to reverse strand coverage. Set to 0 for strand-specific protocols.
minscore=20.0
Minimum Phred-scaled variant quality score. Primary threshold for variant acceptance.
clearfilters
Reset all quality filters to permissive values. Useful for exploratory analysis or low-quality data.

Realignment Parameters

realign=f
Enable read realignment around indels. Recommended for aligners other than BBMap. Reduces speed but improves accuracy.
unclip=f
Convert clipped bases to matches/mismatches during realignment. Useful for recovering marginally-mapped regions.
repadding=70
Padding around realignment region. Longer padding improves accuracy for large indels but reduces speed.
rerows=602
Maximum read length for realignment. Reads exceeding this length cannot be realigned.
recols=2000
Maximum reference segment length for realignment. Must accommodate read length plus largest expected indel plus 2×padding.

Neural Network Parameters

net=<file>
Path to neural network model file for enhanced variant quality assessment.
netcutoff=auto
Neural network score threshold. Use 'auto' for model default or specify custom float value.
usenet=f
Enable neural network scoring. Requires compatible model file.
netmode=<mode>
Feature extraction mode for neural network input. Affects how variant context is encoded.

Java Parameters

-Xmx
Java heap memory allocation. CallVariants2 auto-detects available memory and uses ~85%. Override for specific requirements (e.g., -Xmx20g).
-eoom
Exit on out-of-memory errors rather than crashing. Requires Java 8u92+.

Usage Examples

Basic Multi-Sample Variant Calling

callvariants2.sh in=sample1.bam,sample2.bam,sample3.bam ref=reference.fa vcf=population.vcf ploidy=2

Standard diploid variant calling across three samples with population-level VCF output.

Large Cohort with Memory Optimization

callvariants2.sh list=samples.txt ref=reference.fa vcf=cohort.vcf ploidy=2 \
    prefilter=t minreads=3 minallelefraction=0.02

Process large cohort using file list and prefiltering. Suitable for population studies with >20 samples.

Low-Frequency Variant Detection

callvariants2.sh in=tumor1.bam,tumor2.bam,normal.bam ref=reference.fa vcf=somatic.vcf \
    ploidy=2 maf=0.02 rarity=0.02 minallelefraction=0.02 minreads=5

Optimized for detecting low-frequency variants (2% allele fraction) in cancer samples.

PacBio Long-Read Variant Calling

callvariants2.sh in=pacbio1.bam,pacbio2.bam ref=reference.fa vcf=longread.vcf \
    ploidy=2 minquality=0 minqualitymax=0 minscore=10 usequality=f

Adapted settings for PacBio data with adjusted quality parameters for long-read characteristics.

Quality Score Recalibration Workflow

# Initial mapping
bbmap.sh ref=reference.fa in=reads.fq out=mapped.sam

# Initial variant calling
callvariants2.sh in=mapped.sam ref=reference.fa vcf=initial.vcf ploidy=2

# Quality recalibration
calctruequality.sh in=mapped.sam vcf=initial.vcf
bbduk.sh in=mapped.sam out=recalibrated.sam recalibrate

# Final high-accuracy variant calling
callvariants2.sh in=recalibrated.sam ref=reference.fa vcf=final.vcf ploidy=2

Complete workflow including quality score recalibration for maximum accuracy.

Per-Sample Output Generation

callvariants2.sh in=a.bam,b.bam,c.bam ref=reference.fa vcf=merged.vcf \
    vcf0=individual_%.vcf.gz sample=SampleA,SampleB,SampleC bgzip=t

Generate both merged population VCF and compressed individual sample VCFs.

Strand-Specific Protocol Adaptation

callvariants2.sh in=stranded1.bam,stranded2.bam ref=reference.fa vcf=stranded.vcf \
    ploidy=2 minstrandratio=0 usebias=f

Disable strand bias filtering for strand-specific RNA-seq or other directional protocols.

Memory-Constrained High-Coverage Analysis

callvariants2.sh in=high_cov.list ref=reference.fa vcf=variants.vcf ploidy=2 \
    prefilter=t -Xmx10g threads=4 streamermf=2

Process high-coverage data with explicit memory limits and reduced thread usage.

Performance and Optimization

Memory Management

Threading and I/O

Input File Optimization

Typical Performance Metrics

Technical Notes

Platform Compatibility

Statistical Framework

Output Formats

Related Tools

Troubleshooting

Memory Issues

Quality Issues

Performance Issues