CallVariants2
Multi-sample variant calling from SAM/BAM files. Each input file represents an individual sample for population-level variant analysis. This tool is equivalent to running callvariants.sh with the "multi" flag, designed specifically for cohort studies and comparative genomics.
Important: CallVariants2 vs CallVariants
Tool Selection Guidelines:
- Use CallVariants2 when you have multiple SAM/BAM files, each representing a different sample/individual
- Use CallVariants when you have only one sample, even if that sample consists of multiple SAM/BAM files (e.g., multiple sequencing runs of the same individual)
- Key Difference: CallVariants2 reports allele frequencies for ALL samples at locations where ANY sample has a variant, enabling population-level comparisons
- Algorithm: Both tools use identical variant calling algorithms - CallVariants2 is not "better," just designed for multi-sample workflows
Basic Usage
callvariants2.sh in=sample1.bam,sample2.bam,sample3.bam ref=reference.fa vcf=variants.vcf ploidy=2
Essential parameters: Input files may be sorted or unsorted, compressed or uncompressed. The reference should be FASTA format. Ploidy must be set according to your organism - this is crucial for accurate allele frequency assessment.
Algorithm Details
Multi-Sample Variant Discovery
CallVariants2 implements a sophisticated two-pass algorithm specifically designed for population-level variant analysis:
Two-Pass Processing Strategy
- Pass 1: Independent variant discovery across all samples identifies high-confidence variant positions
- Pass 2: Re-analyzes all samples at the union of discovered variant sites for consistent population-level genotyping
- Benefit: Ensures comprehensive variant detection where any sample contributing evidence results in evaluation across the entire cohort
- Memory Management: Each sample maintains independent variant maps and statistics, preventing memory conflicts in large cohorts
Quality Assessment Framework
Integrates multiple statistical dimensions for robust variant filtering, calibrated specifically for Illumina sequencing platforms:
- Base Quality Integration: Incorporates per-base quality scores with mapping quality for confidence assessment
- Strand Bias Detection: Evaluates forward/reverse strand representation to identify systematic biases
- Read Identity Analysis: Uses average read-to-reference identity to filter alignment artifacts
- Pairing Information: Leverages proper pair rates and paired-end consistency
- Positional Metrics: Distance from read ends and contig boundaries for artifact detection
- Homopolymer Context: Special handling of variants in repetitive sequence regions
Memory-Efficient Architecture
Scalable design supports both small studies and population-scale analyses:
- Memory Requirement: Base requirement is 5 bytes per reference base plus ~200 bytes per variant
- Prefiltering System: Optional k-mer counting Bloom filter reduces memory usage by 50-80% for large cohorts
- Atomic Operations: Automatically enabled for >8 threads to prevent memory contention
- Thread-Local Processing: Each processing thread maintains independent variant maps until periodic consolidation
Platform-Specific Optimization
Currently calibrated for Illumina sequencing data, with specific optimizations:
- Illumina Calibration: Quality score interpretation optimized for NextSeq and HiSeq 2500 platforms
- PacBio/Nanopore Support: Requires manual adjustment of quality parameters (minquality=0, usequality=f)
- Paired-End Optimization: Automatically adjusts pairing rate contributions based on overall dataset characteristics
- CIGAR String Handling: Optimized for SAM 1.4+ format with explicit match/mismatch operators
Ploidy and Population Considerations
Ploidy Configuration
Critical: Accurate ploidy setting is essential for proper variant calling. Allele fractions below expected levels receive quality penalties.
- Haploid (ploidy=1): Default setting, appropriate for bacteria, viruses, and organellar genomes
- Diploid (ploidy=2): Standard for most eukaryotic nuclear genomes (mammals, plants, fungi)
- Polyploid (ploidy>2): For wheat, strawberry, and other polyploid organisms
- Impact: Variants with allele fractions below 1/ploidy threshold receive significant score penalties
Allele Frequency Analysis
Population-level allele frequency calculation considers sample independence:
- Minimum Allele Fraction: Set minallelefraction based on expected population diversity
- Rarity Parameter: Penalizes variants below expected frequency for the given ploidy
- Multi-Sample Reporting: VCF output includes per-sample genotypes and population statistics
Parameters
CallVariants2 accepts all CallVariants parameters. Key parameters are organized by functional category:
I/O Parameters
- in=<file>
- Input SAM/BAM files; comma-delimited list where each file represents one sample. May be sorted or unsorted, compressed with gzip/bzip2.
- list=<file>
- Text file containing one input file path per line. Alternative to comma-delimited in= parameter.
- out=<file>
- Output variant file. Extension determines format: .vcf for VCF, .txt for native BBTools format.
- vcf=<file>
- Output VCF file with multi-sample genotype columns.
- ref=<file>
- Reference FASTA file. Required for accurate variant calling and determining contig boundaries for strand bias calculation.
- vcfin=<file>
- Input VCF file(s) with forced variant positions. These positions will be reported regardless of quality filters.
- sample=<names>
- Comma-delimited sample names. If not specified, names are derived from input file names with automatic collision resolution.
- vcf0=<pattern>
- Per-sample VCF output pattern. Use % as placeholder for sample name (e.g., "individual_%.vcf.gz").
- shist=<file>
- Output file for variant score histogram analysis.
- zhist=<file>
- Output file for zygosity distribution histogram.
- qhist=<file>
- Output file for base quality histogram by variant type.
Processing Parameters
- ploidy=1
- Critical parameter: Set organism ploidy. Affects allele frequency interpretation and quality scoring. Default assumes haploid organisms.
- prefilter=f
- Enable memory-efficient k-mer prefiltering. Reduces memory usage by 50-80% at the cost of doubled runtime. Results are identical.
- rarity=1.0
- Quality penalty threshold for low-frequency variants. Variants below this allele fraction receive score penalties. Should consider ploidy when setting.
- minallelefraction=0.1
- Minimum allele fraction for variant reporting. Adjust based on expected population diversity and ploidy.
- useidentity=t
- Include average read identity in quality score calculation. Helps filter alignment artifacts.
- usepairing=t
- Include proper pairing rate in quality assessment. Automatically disabled for single-end data.
- usebias=t
- Include strand bias in quality calculation. Disable for strand-specific protocols (minstrandratio=0 usebias=f).
- homopolymer=t
- Apply quality penalties to variants in homopolymer contexts, which are prone to sequencing errors.
- nscan=t
- Consider distance from contig ends when calculating strand bias, important for reference quality assessment.
Memory and Performance Parameters
- 32bit=f
- Enable coverage tracking above 65,535x depth. Increases memory usage but necessary for very high coverage datasets.
- atomic=auto
- Atomic memory operations for thread safety. Automatically enabled for >8 threads, forces 32bit=true.
- strandedcov=f
- Track per-strand coverage for MCOV and DP4 VCF fields. Increases memory usage but provides additional statistics.
- samstreamer=t
- Enable multi-threaded SAM reading. Specify number of threads (e.g., ss=6) or disable (ss=f).
- streamermf=8
- Maximum number of SAM files to read simultaneously. Reduce if memory-constrained.
Quality Filtering Parameters
- minreads=2
- Minimum number of reads supporting a variant. Core parameter for sensitivity/specificity balance.
- mincov=0
- Minimum coverage depth for variant calling. Set based on expected coverage distribution.
- minqualitymax=15
- Minimum maximum base quality among supporting reads. Filters low-quality base calls.
- minquality=12.0
- Minimum average base quality among supporting reads.
- minedist=10.0
- Minimum average distance from read ends. Filters end-repair and adapter artifacts.
- minpairingrate=0.1
- Minimum proper pairing rate among supporting reads. Automatically disabled for unpaired data.
- minstrandratio=0.1
- Minimum ratio of forward to reverse strand coverage. Set to 0 for strand-specific protocols.
- minscore=20.0
- Minimum Phred-scaled variant quality score. Primary threshold for variant acceptance.
- clearfilters
- Reset all quality filters to permissive values. Useful for exploratory analysis or low-quality data.
Realignment Parameters
- realign=f
- Enable read realignment around indels. Recommended for aligners other than BBMap. Reduces speed but improves accuracy.
- unclip=f
- Convert clipped bases to matches/mismatches during realignment. Useful for recovering marginally-mapped regions.
- repadding=70
- Padding around realignment region. Longer padding improves accuracy for large indels but reduces speed.
- rerows=602
- Maximum read length for realignment. Reads exceeding this length cannot be realigned.
- recols=2000
- Maximum reference segment length for realignment. Must accommodate read length plus largest expected indel plus 2×padding.
Neural Network Parameters
- net=<file>
- Path to neural network model file for enhanced variant quality assessment.
- netcutoff=auto
- Neural network score threshold. Use 'auto' for model default or specify custom float value.
- usenet=f
- Enable neural network scoring. Requires compatible model file.
- netmode=<mode>
- Feature extraction mode for neural network input. Affects how variant context is encoded.
Java Parameters
- -Xmx
- Java heap memory allocation. CallVariants2 auto-detects available memory and uses ~85%. Override for specific requirements (e.g., -Xmx20g).
- -eoom
- Exit on out-of-memory errors rather than crashing. Requires Java 8u92+.
Usage Examples
Basic Multi-Sample Variant Calling
callvariants2.sh in=sample1.bam,sample2.bam,sample3.bam ref=reference.fa vcf=population.vcf ploidy=2
Standard diploid variant calling across three samples with population-level VCF output.
Large Cohort with Memory Optimization
callvariants2.sh list=samples.txt ref=reference.fa vcf=cohort.vcf ploidy=2 \
prefilter=t minreads=3 minallelefraction=0.02
Process large cohort using file list and prefiltering. Suitable for population studies with >20 samples.
Low-Frequency Variant Detection
callvariants2.sh in=tumor1.bam,tumor2.bam,normal.bam ref=reference.fa vcf=somatic.vcf \
ploidy=2 maf=0.02 rarity=0.02 minallelefraction=0.02 minreads=5
Optimized for detecting low-frequency variants (2% allele fraction) in cancer samples.
PacBio Long-Read Variant Calling
callvariants2.sh in=pacbio1.bam,pacbio2.bam ref=reference.fa vcf=longread.vcf \
ploidy=2 minquality=0 minqualitymax=0 minscore=10 usequality=f
Adapted settings for PacBio data with adjusted quality parameters for long-read characteristics.
Quality Score Recalibration Workflow
# Initial mapping
bbmap.sh ref=reference.fa in=reads.fq out=mapped.sam
# Initial variant calling
callvariants2.sh in=mapped.sam ref=reference.fa vcf=initial.vcf ploidy=2
# Quality recalibration
calctruequality.sh in=mapped.sam vcf=initial.vcf
bbduk.sh in=mapped.sam out=recalibrated.sam recalibrate
# Final high-accuracy variant calling
callvariants2.sh in=recalibrated.sam ref=reference.fa vcf=final.vcf ploidy=2
Complete workflow including quality score recalibration for maximum accuracy.
Per-Sample Output Generation
callvariants2.sh in=a.bam,b.bam,c.bam ref=reference.fa vcf=merged.vcf \
vcf0=individual_%.vcf.gz sample=SampleA,SampleB,SampleC bgzip=t
Generate both merged population VCF and compressed individual sample VCFs.
Strand-Specific Protocol Adaptation
callvariants2.sh in=stranded1.bam,stranded2.bam ref=reference.fa vcf=stranded.vcf \
ploidy=2 minstrandratio=0 usebias=f
Disable strand bias filtering for strand-specific RNA-seq or other directional protocols.
Memory-Constrained High-Coverage Analysis
callvariants2.sh in=high_cov.list ref=reference.fa vcf=variants.vcf ploidy=2 \
prefilter=t -Xmx10g threads=4 streamermf=2
Process high-coverage data with explicit memory limits and reduced thread usage.
Performance and Optimization
Memory Management
- Base Requirement: 5 bytes per reference base plus ~200 bytes per candidate variant
- Prefiltering: Reduces memory usage by 50-80% for cohorts with >10 samples
- Coverage Tracking: Use 32bit=f unless coverage regularly exceeds 65,535x
- Auto-Detection: Tool automatically uses ~85% of available system memory
Threading and I/O
- Processing Threads: Scales effectively up to 8 cores unless realignment is enabled
- I/O Optimization: Uncompressed SAM files are fastest, followed by pigz-compressed files
- SamStreamer: Multi-threaded reading improves performance for large files
- Realignment: Fully utilizes all available cores when enabled
Input File Optimization
- Mapped Reads Only: Files containing only mapped reads process up to 100x faster
- Primary Alignments: Remove secondary/supplementary alignments for speed
- Compression: Use pigz-compressed files for optimal balance of speed and storage
- Sorting: Sorted vs unsorted files have minimal performance difference
Typical Performance Metrics
- Small Cohort (3-5 samples, 30x coverage): 2-4 hours per human genome
- Large Cohort (50+ samples, 30x coverage): 6-12 hours per genome with prefiltering
- Memory Usage: 4-8 GB per sample without prefiltering, 2-4 GB with prefiltering
- Throughput: Limited primarily by I/O bandwidth rather than CPU performance
Technical Notes
Platform Compatibility
- Illumina Optimization: Quality scoring calibrated for NextSeq and HiSeq 2500 platforms
- Long-Read Support: PacBio and Nanopore data require manual quality parameter adjustment
- CIGAR Format: Optimized for SAM 1.4+ with explicit match/mismatch operators (= and X)
- Legacy Support: Accepts older CIGAR strings (M only) if MD tags are present
Statistical Framework
- Quality Integration: Combines base quality, mapping quality, and positional metrics
- Bias Detection: Strand bias, pairing bias, and positional bias assessment
- Population Context: Allele frequency calculations across the entire sample cohort
- Filtering Cascade: Multiple independent and combined statistical filters
Output Formats
- VCF Standard: Full compliance with VCF 4.2 specification for multi-sample data
- Per-Sample VCFs: Individual files contain only variants passing filters for that sample
- Population VCF: Merged file includes all samples with population-level statistics
- Native Format: BBTools-specific format with extended statistical annotations
Related Tools
- callvariants.sh - Single-sample variant calling with identical algorithm
- bbmap.sh - Recommended aligner for optimal CallVariants2 performance
- calctruequality.sh - Quality score recalibration for enhanced accuracy
- bbduk.sh - Quality recalibration application and adapter trimming
- vcffilter.sh - Post-processing VCF filtering and annotation
- postfilter.sh - Additional variant filtering with population-specific parameters
Troubleshooting
Memory Issues
- Out of Memory: Enable prefilter=t and reduce streamermf parameter
- Large Variants: Use prefilter=t to ignore low-frequency variants consuming memory
- High Coverage: Consider downsampling input or using 32bit=t with more memory
Quality Issues
- Too Many Variants: Check ploidy setting and increase quality thresholds
- Too Few Variants: Verify reference genome and consider clearfilters for initial exploration
- Strand Bias: For strand-specific protocols, use minstrandratio=0 usebias=f
Performance Issues
- Slow Processing: Remove unmapped reads and secondary alignments from input
- I/O Bottleneck: Use uncompressed SAM or pigz-compressed files
- Thread Scaling: Enable realign=t to fully utilize available CPU cores