CallVariants

Script: callvariants.sh Package: var2 Class: CallVariants.java

Multithreaded variant caller that processes SAM or BAM files to output VCF files. Originally developed for mapping-based quality-score recalibration and error-rate analysis with CalcTrueQuality and BBDuk, now available as a standalone variant caller supporting indel realignment, multi-sample calling, and arbitrary ploidy.

Development Context

CallVariants was originally developed for use with CalcTrueQuality and BBDuk for mapping-based quality-score recalibration and error-rate analysis in the presence of heterozygous mutations. CalcTrueQuality can call and ignore variants automatically using CallVariants, while BBDuk can accept VCF files and ignore known variants when calculating error statistics.

Platform Compatibility

Important: CallVariants quality scoring is currently calibrated for Illumina data, specifically NextSeq and HiSeq 2500 platforms. Any Illumina platform should produce similar results. While the tool works with other sequencing platforms, PacBio and Nanopore data may require manual adjustment of variant filter settings (particularly minquality, minqualitymax, minscore, and usequality).

Critical Ploidy Setting

Essential: CallVariants defaults to ploidy=1 (haploid organisms) as it was developed for haploid-focused research. For diploid organisms, you must set ploidy=2, and for tetraploids ploidy=4, etc. Allele fractions lower than expected for the ploidy (below 0.5 for diploid, 0.25 for tetraploid) will incur score penalties, making correct ploidy setting crucial for accurate variant calling.

Basic Usage

callvariants.sh in=<file,file,...> ref=<file> vcf=<file>

Input may be sorted or unsorted. The reference should be in FASTA format. For detailed usage information, see bbtools/docs/guides/CallVariantsGuide.txt or bbmap/pipelines/variantPipeline.sh for pipeline examples.

Parameters

I/O parameters

in=<file>
Input; may be one file or multiple comma-delimited files.
list=<file>
Optional text file containing one input file per line. Use list or in, but not both.
out=<file>
Output variant list in var format. If the name ends with .vcf then it will be vcf format.
vcf=<file>
Output variant list in vcf format.
outgff=<file>
Output variant list in gff format.
ref=<file>
Reference fasta. Required to display ref alleles. Variant calling will be more accurate with the reference.
vcfin=<file>
Force calls at these locations, even if allele count is 0.
shist=<file>
(scorehist) Output for variant score histogram.
zhist=<file>
(zygosityhist) Output for zygosity histogram.
qhist=<file>
(qualityhist) Output for variant base quality histogram.
overwrite=f
(ow) Set to false to force the program to abort rather than overwrite an existing file.
extended=t
Print additional variant statistics columns.
sample=
Optional comma-delimited list of sample names.
multisample=f
(multi) Set to true if there are multiple sam/bam files, and each should be tracked as an individual sample.
vcf0=
Optional comma-delimited list of per-sample outputs. Only used in multisample mode.
bgzip=t
Use bgzip for gzip compression.
samstreamer=t
(ss) Load reads multithreaded to increase speed. Disable to reduce the number of threads used. The number of streamer threads can be set with e.g. 'ss=4'; default is 6.
streamermf=8
(ssmf) Allow multiple sam files to be read simultaneously. Set ssmf=X to specify the maximum number or ssmf=f to disable.

Processing Parameters

prefilter=f
Use a Bloom filter to exclude variants seen fewer than minreads times. Doubles the runtime but greatly reduces memory usage. The results are identical.
coverage=t
(cc) Calculate coverage, to better call variants.
ploidy=1
Set the organism's ploidy.
rarity=1.0
Penalize the quality of variants with allele fraction lower than this. For example, if you are interested in 4% frequency variants, you could set both rarity and minallelefraction to 0.04. This is affected by ploidy - a variant with frequency indicating at least one copy is never penalized.
covpenalty=0.8
(lowcoveragepenalty) A lower penalty will increase the scores of low-coverage variants, and is useful for low-coverage datasets.
useidentity=t
Include average read identity in score calculation.
usepairing=t
Include pairing rate in score calculation.
usebias=t
Include strand bias in score calculation.
useedist=t
Include read-end distance in score calculation.
homopolymer=t
Penalize scores of substitutions matching adjacent bases.
nscan=t
Consider the distance of a variant from contig ends when calculating strand bias.
callsub=t
Call substitutions.
calldel=t
Call deletions.
callins=t
Call insertions.
calljunct=f
Call junctions (in development).
nopassdot=f
Use . as genotype for variations failing the filter.

Coverage Parameters (these mainly affect speed and memory use)

32bit=f
Set to true to allow coverage tracking over depth 65535, which increases memory use. Variant calls are impacted where coverage exceeds the maximum.
atomic=auto
Increases multithreaded speed; forces 32bit to true. Defaults to true if there are more than 8 threads.
strandedcov=f
(stranded) Tracks per-strand ref coverage to print the MCOV and DP4 fields. Requires more memory when enabled. Strand of variant reads is tracked regardless of this flag.

Trimming Parameters

border=5
Trim at least this many bases on both ends of reads.
qtrim=r
Quality-trim reads on this end r: right, l: left, rl: both, f: don't quality-trim.
trimq=10
Quality-trim bases below this score.

Realignment Parameters

realign=f
Realign all reads with more than a couple mismatches. Decreases speed. Recommended for aligners other than BBMap.
unclip=f
Convert clip symbols from exceeding the ends of the realignment zone into matches and substitutitions.
repadding=70
Pad alignment by this much on each end. Typically, longer is more accurate for long indels, but greatly reduces speed.
rerows=602
Use this many rows maximum for realignment. Reads longer than this cannot be realigned.
recols=2000
Reads may not be aligned to reference seqments longer than this. Needs to be at least read length plus max deletion length plus twice padding.
msa=
Select the aligner. Options: MultiStateAligner11ts: Default. MultiStateAligner9PacBio: Use for PacBio reads, or for Illumina reads mapped to PacBio/Nanopore reads.

Sam-filtering Parameters

minpos=
Ignore alignments not overlapping this range.
maxpos=
Ignore alignments not overlapping this range.
minreadmapq=4
Ignore alignments with lower mapq.
contigs=
Comma-delimited list of contig names to include. These should have no spaces, or underscores instead of spaces.
secondary=f
Include secondary alignments.
supplementary=f
Include supplementary alignments.
duplicate=f
Include reads flagged as duplicates.
invert=f
Invert sam filters.

Variant-Calling Cutoff Parameters

minreads=2
(minad) Ignore variants seen in fewer reads.
maxreads=BIG
(maxad) Ignore variants seen in more reads.
mincov=0
Ignore variants in lower-coverage locations.
maxcov=BIG
Ignore variants in higher-coverage locations.
minqualitymax=15
Ignore variants with lower max base quality.
minedistmax=20
Ignore variants with lower max distance from read ends.
minmapqmax=0
Ignore variants with lower max mapq.
minidmax=0
Ignore variants with lower max read identity.
minpairingrate=0.1
Ignore variants with lower pairing rate.
minstrandratio=0.1
Ignore variants with lower plus/minus strand ratio.
minquality=12.0
Ignore variants with lower average base quality.
minedist=10.0
Ignore variants with lower average distance from ends.
minavgmapq=0.0
Ignore variants with lower average mapq.
minallelefraction=0.1
Ignore variants with lower allele fraction. This should be adjusted for high ploidies.
minid=0
Ignore variants with lower average read identity.
minscore=20.0
Ignore variants with lower Phred-scaled score.
clearfilters
Clear all filters. Filter flags placed after the clearfilters flag will still be applied.

Other Parameters

minvarcopies=0
If set to 0, a genotype (vcf GT field) of 0 or 0/0 will be called if observed allele frequency suggests this is a minor allele. If set to 1, GT field will contain at least one 1.

Java Parameters

-Xmx
This will set Java's memory usage, overriding autodetection. -Xmx20g will specify 20 gigs of RAM, and -Xmx200m will specify 200 megs. The max is typically 85% of physical memory.
-eoom
This flag will cause the process to exit if an out-of-memory exception occurs. Requires Java 8u92+.
-da
Disable assertions.

Usage Examples

Single-sample variant calling

callvariants.sh in=mapped.sam out=vars.vcf ref=ecoli.fa ploidy=1
callvariants.sh in=lane1.sam,lane2.sam,lane3.sam out=vars.vcf ref=human.fa ploidy=2

Standard variant calling for haploid (E. coli) and diploid (human) organisms.

Single-sample streaming

bbmap.sh in=reads.fq out=stdout.sam ref=ecoli.fa | callvariants.sh in=stdin.sam out=vars.vcf ref=ecoli.fa ploidy=1

Stream reads directly from mapper to variant caller without intermediate files.

Multi-sample variant calling

callvariants.sh in=sample1.sam,sample2.sam,sample3.sam out=vars.vcf ref=ecoli.fa multisample ploidy=1

Process multiple samples simultaneously, each as a separate column in VCF output.

Low-frequency variant calling

callvariants.sh in=mapped.sam out=vars.vcf ref=human.fa maf=0.02 rarity=0.02 ploidy=2

Detect variants at 2% allele frequency for sensitive variant detection.

Raw PacBio variant calling

callvariants.sh in=mapped.sam out=vars.vcf ref=human.fa ploidy=2 minquality=0 minqualitymax=0 minscore=10 usequality=f

PacBio-specific settings that disable quality-based filtering due to different error profiles.

Quality score recalibration workflow

# Step 1: Initial mapping
bbmap.sh ref=ref.fa in=reads.fq out=mapped.sam

# Step 2: Initial variant calling  
callvariants.sh in=mapped.sam out=initial.vcf ref=ref.fa ploidy=1

# Step 3: Quality recalibration
calctruequality.sh in=mapped.sam
bbduk.sh in=mapped.sam out=recal.sam recalibrate

# Step 4: Final variant calling with recalibrated qualities
callvariants.sh in=recal.sam out=final.vcf ref=ref.fa ploidy=1

Complete quality score recalibration workflow for improved variant calling accuracy.

Memory and thread optimization

callvariants.sh in=mapped.sam out=vars.vcf ref=ref.fa ploidy=1 -Xmx10g threads=4 prefilter

Constrain memory to 10 GB with 4 threads and use prefilter to reduce memory usage at cost of speed.

Algorithm Details

Variant Quality Scoring System

CallVariants uses a multi-factor scoring system that evaluates variants based on ten key statistics:

Each metric contributes to the overall Phred-scaled variant score and can also serve as a standalone filter. This dual approach allows both holistic scoring and targeted filtering.

Memory Management and Performance

Automatic Memory Detection

CallVariants automatically detects available physical memory and uses approximately 85% of it unless overridden with -Xmx. Memory requirements are roughly 5 bytes per reference base plus several hundred bytes per variant.

Prefilter Strategy

The optional prefilter uses a Bloom filter-like structure to identify and exclude low-frequency variants before full processing, significantly reducing memory usage for large datasets at the cost of doubled runtime.

Performance Optimization

CallVariants speed is typically limited by input file reading rather than CPU processing. Performance improvements:

Mapping and Realignment Recommendations

BBMap is the recommended mapper for CallVariants. Reads mapped with BBMap should not be realigned as BBMap already produces high-quality alignments. For other mappers, realignment is highly recommended using the realign flag, though this increases runtime.

Platform-Specific Considerations

Illumina Data

CallVariants is calibrated for Illumina platforms and should work well with default settings on NextSeq, HiSeq, and similar instruments.

PacBio and Nanopore

Long-read platforms require parameter adjustment due to different error profiles:

Multi-sample Processing

In multisample mode, each input file becomes a separate sample with its own VCF column. CallVariants automatically delegates to CallVariants2 for multi-sample processing, maintaining memory efficiency regardless of sample count when filters are properly set.

Support

For questions and support: