CallVariants
Multithreaded variant caller that processes SAM or BAM files to output VCF files. Originally developed for mapping-based quality-score recalibration and error-rate analysis with CalcTrueQuality and BBDuk, now available as a standalone variant caller supporting indel realignment, multi-sample calling, and arbitrary ploidy.
Development Context
CallVariants was originally developed for use with CalcTrueQuality and BBDuk for mapping-based quality-score recalibration and error-rate analysis in the presence of heterozygous mutations. CalcTrueQuality can call and ignore variants automatically using CallVariants, while BBDuk can accept VCF files and ignore known variants when calculating error statistics.
Platform Compatibility
Important: CallVariants quality scoring is currently calibrated for Illumina data, specifically NextSeq and HiSeq 2500 platforms. Any Illumina platform should produce similar results. While the tool works with other sequencing platforms, PacBio and Nanopore data may require manual adjustment of variant filter settings (particularly minquality
, minqualitymax
, minscore
, and usequality
).
Critical Ploidy Setting
Essential: CallVariants defaults to ploidy=1 (haploid organisms) as it was developed for haploid-focused research. For diploid organisms, you must set ploidy=2
, and for tetraploids ploidy=4
, etc. Allele fractions lower than expected for the ploidy (below 0.5 for diploid, 0.25 for tetraploid) will incur score penalties, making correct ploidy setting crucial for accurate variant calling.
Basic Usage
callvariants.sh in=<file,file,...> ref=<file> vcf=<file>
Input may be sorted or unsorted. The reference should be in FASTA format. For detailed usage information, see bbtools/docs/guides/CallVariantsGuide.txt
or bbmap/pipelines/variantPipeline.sh
for pipeline examples.
Parameters
I/O parameters
- in=<file>
- Input; may be one file or multiple comma-delimited files.
- list=<file>
- Optional text file containing one input file per line. Use list or in, but not both.
- out=<file>
- Output variant list in var format. If the name ends with .vcf then it will be vcf format.
- vcf=<file>
- Output variant list in vcf format.
- outgff=<file>
- Output variant list in gff format.
- ref=<file>
- Reference fasta. Required to display ref alleles. Variant calling will be more accurate with the reference.
- vcfin=<file>
- Force calls at these locations, even if allele count is 0.
- shist=<file>
- (scorehist) Output for variant score histogram.
- zhist=<file>
- (zygosityhist) Output for zygosity histogram.
- qhist=<file>
- (qualityhist) Output for variant base quality histogram.
- overwrite=f
- (ow) Set to false to force the program to abort rather than overwrite an existing file.
- extended=t
- Print additional variant statistics columns.
- sample=
- Optional comma-delimited list of sample names.
- multisample=f
- (multi) Set to true if there are multiple sam/bam files, and each should be tracked as an individual sample.
- vcf0=
- Optional comma-delimited list of per-sample outputs. Only used in multisample mode.
- bgzip=t
- Use bgzip for gzip compression.
- samstreamer=t
- (ss) Load reads multithreaded to increase speed. Disable to reduce the number of threads used. The number of streamer threads can be set with e.g. 'ss=4'; default is 6.
- streamermf=8
- (ssmf) Allow multiple sam files to be read simultaneously. Set ssmf=X to specify the maximum number or ssmf=f to disable.
Processing Parameters
- prefilter=f
- Use a Bloom filter to exclude variants seen fewer than minreads times. Doubles the runtime but greatly reduces memory usage. The results are identical.
- coverage=t
- (cc) Calculate coverage, to better call variants.
- ploidy=1
- Set the organism's ploidy.
- rarity=1.0
- Penalize the quality of variants with allele fraction lower than this. For example, if you are interested in 4% frequency variants, you could set both rarity and minallelefraction to 0.04. This is affected by ploidy - a variant with frequency indicating at least one copy is never penalized.
- covpenalty=0.8
- (lowcoveragepenalty) A lower penalty will increase the scores of low-coverage variants, and is useful for low-coverage datasets.
- useidentity=t
- Include average read identity in score calculation.
- usepairing=t
- Include pairing rate in score calculation.
- usebias=t
- Include strand bias in score calculation.
- useedist=t
- Include read-end distance in score calculation.
- homopolymer=t
- Penalize scores of substitutions matching adjacent bases.
- nscan=t
- Consider the distance of a variant from contig ends when calculating strand bias.
- callsub=t
- Call substitutions.
- calldel=t
- Call deletions.
- callins=t
- Call insertions.
- calljunct=f
- Call junctions (in development).
- nopassdot=f
- Use . as genotype for variations failing the filter.
Coverage Parameters (these mainly affect speed and memory use)
- 32bit=f
- Set to true to allow coverage tracking over depth 65535, which increases memory use. Variant calls are impacted where coverage exceeds the maximum.
- atomic=auto
- Increases multithreaded speed; forces 32bit to true. Defaults to true if there are more than 8 threads.
- strandedcov=f
- (stranded) Tracks per-strand ref coverage to print the MCOV and DP4 fields. Requires more memory when enabled. Strand of variant reads is tracked regardless of this flag.
Trimming Parameters
- border=5
- Trim at least this many bases on both ends of reads.
- qtrim=r
- Quality-trim reads on this end r: right, l: left, rl: both, f: don't quality-trim.
- trimq=10
- Quality-trim bases below this score.
Realignment Parameters
- realign=f
- Realign all reads with more than a couple mismatches. Decreases speed. Recommended for aligners other than BBMap.
- unclip=f
- Convert clip symbols from exceeding the ends of the realignment zone into matches and substitutitions.
- repadding=70
- Pad alignment by this much on each end. Typically, longer is more accurate for long indels, but greatly reduces speed.
- rerows=602
- Use this many rows maximum for realignment. Reads longer than this cannot be realigned.
- recols=2000
- Reads may not be aligned to reference seqments longer than this. Needs to be at least read length plus max deletion length plus twice padding.
- msa=
- Select the aligner. Options: MultiStateAligner11ts: Default. MultiStateAligner9PacBio: Use for PacBio reads, or for Illumina reads mapped to PacBio/Nanopore reads.
Sam-filtering Parameters
- minpos=
- Ignore alignments not overlapping this range.
- maxpos=
- Ignore alignments not overlapping this range.
- minreadmapq=4
- Ignore alignments with lower mapq.
- contigs=
- Comma-delimited list of contig names to include. These should have no spaces, or underscores instead of spaces.
- secondary=f
- Include secondary alignments.
- supplementary=f
- Include supplementary alignments.
- duplicate=f
- Include reads flagged as duplicates.
- invert=f
- Invert sam filters.
Variant-Calling Cutoff Parameters
- minreads=2
- (minad) Ignore variants seen in fewer reads.
- maxreads=BIG
- (maxad) Ignore variants seen in more reads.
- mincov=0
- Ignore variants in lower-coverage locations.
- maxcov=BIG
- Ignore variants in higher-coverage locations.
- minqualitymax=15
- Ignore variants with lower max base quality.
- minedistmax=20
- Ignore variants with lower max distance from read ends.
- minmapqmax=0
- Ignore variants with lower max mapq.
- minidmax=0
- Ignore variants with lower max read identity.
- minpairingrate=0.1
- Ignore variants with lower pairing rate.
- minstrandratio=0.1
- Ignore variants with lower plus/minus strand ratio.
- minquality=12.0
- Ignore variants with lower average base quality.
- minedist=10.0
- Ignore variants with lower average distance from ends.
- minavgmapq=0.0
- Ignore variants with lower average mapq.
- minallelefraction=0.1
- Ignore variants with lower allele fraction. This should be adjusted for high ploidies.
- minid=0
- Ignore variants with lower average read identity.
- minscore=20.0
- Ignore variants with lower Phred-scaled score.
- clearfilters
- Clear all filters. Filter flags placed after the clearfilters flag will still be applied.
Other Parameters
- minvarcopies=0
- If set to 0, a genotype (vcf GT field) of 0 or 0/0 will be called if observed allele frequency suggests this is a minor allele. If set to 1, GT field will contain at least one 1.
Java Parameters
- -Xmx
- This will set Java's memory usage, overriding autodetection. -Xmx20g will specify 20 gigs of RAM, and -Xmx200m will specify 200 megs. The max is typically 85% of physical memory.
- -eoom
- This flag will cause the process to exit if an out-of-memory exception occurs. Requires Java 8u92+.
- -da
- Disable assertions.
Usage Examples
Single-sample variant calling
callvariants.sh in=mapped.sam out=vars.vcf ref=ecoli.fa ploidy=1
callvariants.sh in=lane1.sam,lane2.sam,lane3.sam out=vars.vcf ref=human.fa ploidy=2
Standard variant calling for haploid (E. coli) and diploid (human) organisms.
Single-sample streaming
bbmap.sh in=reads.fq out=stdout.sam ref=ecoli.fa | callvariants.sh in=stdin.sam out=vars.vcf ref=ecoli.fa ploidy=1
Stream reads directly from mapper to variant caller without intermediate files.
Multi-sample variant calling
callvariants.sh in=sample1.sam,sample2.sam,sample3.sam out=vars.vcf ref=ecoli.fa multisample ploidy=1
Process multiple samples simultaneously, each as a separate column in VCF output.
Low-frequency variant calling
callvariants.sh in=mapped.sam out=vars.vcf ref=human.fa maf=0.02 rarity=0.02 ploidy=2
Detect variants at 2% allele frequency for sensitive variant detection.
Raw PacBio variant calling
callvariants.sh in=mapped.sam out=vars.vcf ref=human.fa ploidy=2 minquality=0 minqualitymax=0 minscore=10 usequality=f
PacBio-specific settings that disable quality-based filtering due to different error profiles.
Quality score recalibration workflow
# Step 1: Initial mapping
bbmap.sh ref=ref.fa in=reads.fq out=mapped.sam
# Step 2: Initial variant calling
callvariants.sh in=mapped.sam out=initial.vcf ref=ref.fa ploidy=1
# Step 3: Quality recalibration
calctruequality.sh in=mapped.sam
bbduk.sh in=mapped.sam out=recal.sam recalibrate
# Step 4: Final variant calling with recalibrated qualities
callvariants.sh in=recal.sam out=final.vcf ref=ref.fa ploidy=1
Complete quality score recalibration workflow for improved variant calling accuracy.
Memory and thread optimization
callvariants.sh in=mapped.sam out=vars.vcf ref=ref.fa ploidy=1 -Xmx10g threads=4 prefilter
Constrain memory to 10 GB with 4 threads and use prefilter to reduce memory usage at cost of speed.
Algorithm Details
Variant Quality Scoring System
CallVariants uses a multi-factor scoring system that evaluates variants based on ten key statistics:
- Base quality: Phred-scaled quality scores of variant bases
- Mapping quality: Alignment confidence scores
- Coverage depth: Total read depth at variant position
- Allele fraction: Proportion of reads supporting the variant
- Read identity: Overall sequence identity to reference
- Pairing rate: Fraction of properly paired reads
- End distance: Distance of variant from read ends
- Strand bias: Balance of variant calls between strands
- Contig end distance: Distance from reference sequence ends
- Homopolymer length: Length of adjacent identical bases
Each metric contributes to the overall Phred-scaled variant score and can also serve as a standalone filter. This dual approach allows both holistic scoring and targeted filtering.
Memory Management and Performance
Automatic Memory Detection
CallVariants automatically detects available physical memory and uses approximately 85% of it unless overridden with -Xmx
. Memory requirements are roughly 5 bytes per reference base plus several hundred bytes per variant.
Prefilter Strategy
The optional prefilter uses a Bloom filter-like structure to identify and exclude low-frequency variants before full processing, significantly reducing memory usage for large datasets at the cost of doubled runtime.
Performance Optimization
CallVariants speed is typically limited by input file reading rather than CPU processing. Performance improvements:
- File format: Uncompressed SAM or pigz-compressed files are faster than BAM
- Filtering: Remove unmapped reads and secondary alignments for significant speedup
- Threading: Scales well up to 8 cores unless realignment is enabled
- Streaming: Direct piping from mappers eliminates I/O bottlenecks
Mapping and Realignment Recommendations
BBMap is the recommended mapper for CallVariants. Reads mapped with BBMap should not be realigned as BBMap already produces high-quality alignments. For other mappers, realignment is highly recommended using the realign
flag, though this increases runtime.
Platform-Specific Considerations
Illumina Data
CallVariants is calibrated for Illumina platforms and should work well with default settings on NextSeq, HiSeq, and similar instruments.
PacBio and Nanopore
Long-read platforms require parameter adjustment due to different error profiles:
- Disable quality-based filtering:
minquality=0 minqualitymax=0 usequality=f
- Lower score thresholds:
minscore=10
- Use PacBio-specific realigner:
msa=MultiStateAligner9PacBio
Multi-sample Processing
In multisample mode, each input file becomes a separate sample with its own VCF column. CallVariants automatically delegates to CallVariants2 for multi-sample processing, maintaining memory efficiency regardless of sample count when filters are properly set.
Support
For questions and support:
- Email: bbushnell@lbl.gov
- Documentation: bbmap.org
- Guide:
bbtools/docs/guides/CallVariantsGuide.txt
- Pipeline Example:
bbmap/pipelines/variantPipeline.sh