VCF2GFF
Generates a GFF3 from a VCF.
Basic Usage
vcf2gff.sh in=<vcf file> out=<gff file>
VCF2GFF converts Variant Call Format (VCF) files to General Feature Format version 3 (GFF3) files. Each variant call in the VCF becomes a sequence_variant_obs feature in the GFF3 output.
Parameters
VCF2GFF has a minimal parameter set focused on file input/output specification.
- in=<file>: Input VCF file. Standard VCF format with variant calls to be converted to GFF3 features. Can be compressed (gzip).
- out=<file>: Output GFF file. Will be written in GFF3 format with sequence_variant_obs features representing each variant.
Examples
Basic VCF to GFF3 Conversion
vcf2gff.sh in=variants.vcf out=variants.gff3
Converts a VCF file containing variant calls to GFF3 format.
Converting Compressed VCF
vcf2gff.sh in=variants.vcf.gz out=variants.gff3
Processes a gzip-compressed VCF file and outputs uncompressed GFF3.
Pipeline Usage
# Call variants and convert to GFF3
callvariants.sh in=reads.sam ref=reference.fa out=variants.vcf
vcf2gff.sh in=variants.vcf out=variants.gff3
Example pipeline showing variant calling followed by GFF3 conversion for annotation workflows.
Algorithm Details
VCF2GFF implements direct format conversion using the GffLine(VCFLine vcf) constructor, which transforms VCF variant records into standardized GFF3 sequence_variant_obs features through byte-level parsing and coordinate transformation.
Conversion Architecture
- Parser Pipeline: VCFLine objects are passed to the GffLine constructor, which maps fields directly
- Coordinate Transformation: The 0-based positions returned by VCFLine are converted to 1-based GFF3 coordinates via start=vcf.start()+1; stop=vcf.stop()+1
- Strand Assignment: All variants are assigned the PLUS constant (0), as variants are position-specific, not directional
- Source Attribution: Source field is set to the DOTS constant ("."), indicating no specific annotation source
Variant Type Classification
Variant types are determined using vcf.type() and encoded via Var.typeArray[vtype] constants with ByteBuilder string construction:
- Substitutions (SUB): bb.append(vcf.ref).append('>').append(vcf.alt) creates "ID=SUB ref>alt"
- Insertions (INS): Extracts the inserted sequence from vcf.alt with a substring operation (vcf.alt, offset, length)
- Deletions (DEL): Calculates the deletion size via vcf.reflen()-vcf.readlen(), producing "ID=DEL length N"
- No-calls (NOCALL): Uses vcf.reflen() as the length of the uncertain region in "ID=NOCALL length N"
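The attribute rules above can be sketched in a few lines of Python. This is an illustration only, not the BBTools Java code: the function name variant_attribute is hypothetical, reflen and readlen are assumed to be the lengths of REF and ALT, the inserted bases are assumed to be the part of ALT after the shared REF prefix, and the "ID=INS <sequence>" layout is an assumption because the text above does not state it.

```python
def variant_attribute(vtype: str, ref: str, alt: str) -> str:
    """Build an attribute string in the style described above (illustrative)."""
    if vtype == "SUB":
        # Substitution: reference allele, '>', alternate allele.
        return f"ID=SUB {ref}>{alt}"
    if vtype == "INS":
        # Assumption: inserted bases follow the shared REF prefix in ALT.
        inserted = alt[len(ref):]
        return f"ID=INS {inserted}"  # layout assumed, not taken from the source
    if vtype == "DEL":
        # Assumption: reflen/readlen correspond to len(REF)/len(ALT).
        return f"ID=DEL length {len(ref) - len(alt)}"
    if vtype == "NOCALL":
        return f"ID=NOCALL length {len(ref)}"
    raise ValueError(f"unsupported variant type: {vtype}")


if __name__ == "__main__":
    print(variant_attribute("SUB", "A", "G"))
    print(variant_attribute("DEL", "ACGT", "A"))
```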
Quality Score Processing
VCF QUAL values are directly cast to float via score=(float)vcf.qual, preserving variant call confidence scores in GFF3 format without transformation or normalization.
Coordinate System Implementation
Coordinate mapping uses VCFLine accessor methods:
- Start Position: vcf.start()+1 converts the 0-based value returned by VCFLine to a 1-based GFF3 coordinate (POS in the VCF file itself is already 1-based)
- End Position: vcf.stop()+1 covers the full variant span, including multi-base substitutions
- Span Calculation: Uses Tools.max(v.start+1, v.stop), which guarantees a value of at least v.start+1 and so keeps the coordinate range valid
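As a worked illustration of the arithmetic, the following Python sketch applies the same two expressions. It assumes that start() and stop() return 0-based inclusive positions, which the text implies but does not state outright; the function names to_gff_interval and clamp_end are hypothetical.

```python
from typing import Tuple


def to_gff_interval(start0: int, stop0: int) -> Tuple[int, int]:
    """Shift 0-based start/stop values by one to get 1-based coordinates."""
    return start0 + 1, stop0 + 1


def clamp_end(start0: int, stop: int) -> int:
    """Mirror max(start + 1, stop): the result is never below start + 1."""
    return max(start0 + 1, stop)


if __name__ == "__main__":
    # A single-base variant at VCF POS 100 (0-based start and stop of 99).
    print(to_gff_interval(99, 99))
    # A three-base substitution starting at VCF POS 100.
    print(to_gff_interval(99, 101))
```

Under these assumptions a single-base variant occupies one position (start equals end), and the max() guard matters only when the stop value would otherwise fall at or before the start.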
Memory Management
Line-by-line processing using ByteFile input streams with ByteBuilder string manipulation (16-byte initial capacity) minimizes memory footprint. The 200MB heap allocation (-Xmx200m) accommodates VCFLine object instantiation and temporary string construction without requiring full file buffering.
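The streaming idea can be illustrated with a short Python generator. It is a sketch of line-by-line reading in general, not of ByteFile itself: the function name iter_vcf_records is hypothetical, and detecting gzip input by the ".gz" suffix is an assumption made for the example.

```python
import gzip
from typing import Iterator


def iter_vcf_records(path: str) -> Iterator[str]:
    """Yield VCF data lines one at a time without loading the whole file."""
    # Assumption: compressed input is recognized by its file extension.
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as handle:
        for line in handle:
            # Skip header lines (## meta and #CHROM) and blank lines.
            if line.startswith("#") or not line.strip():
                continue
            yield line.rstrip("\n")
```

Because only one line is held at a time, memory use stays roughly constant regardless of file size, which is the property the fixed heap setting relies on.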
File Format Details
Input VCF Requirements
- Standard VCF format (version 4.0 or higher recommended)
- Must contain CHROM, POS, REF, ALT, and QUAL fields
- Can be compressed with gzip
- Headers are processed but not required for conversion
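To make the field requirements concrete, here is a minimal Python sketch that pulls the five required fields out of a tab-delimited VCF data line. The names VcfRecord and parse_vcf_line are hypothetical, and treating a QUAL of "." as missing follows the VCF convention rather than anything confirmed about VCF2GFF's parser.

```python
from typing import NamedTuple, Optional


class VcfRecord(NamedTuple):
    chrom: str
    pos: int              # 1-based POS as written in the VCF file
    ref: str
    alt: str
    qual: Optional[float]  # None when QUAL is "."


def parse_vcf_line(line: str) -> VcfRecord:
    """Extract CHROM, POS, REF, ALT and QUAL from one VCF data line."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 6:
        raise ValueError("expected at least 6 tab-separated columns")
    # Standard column order: CHROM, POS, ID, REF, ALT, QUAL, ...
    chrom, pos, _id, ref, alt, qual = fields[:6]
    return VcfRecord(chrom, int(pos), ref, alt,
                     None if qual == "." else float(qual))
```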
Output GFF3 Format
The output follows GFF3 specification with these characteristics:
- Column 1 (seqid): Chromosome/scaffold name from VCF CHROM field
- Column 2 (source): Set to "." (no specific source)
- Column 3 (type): Always "sequence_variant_obs"
- Columns 4-5 (start, end): 1-based coordinates of the variant
- Column 6 (score): VCF QUAL value (or "." if missing)
- Column 7 (strand): Always "+" (plus strand)
- Column 8 (phase): Always "." (not applicable to variants)
- Column 9 (attributes): Variant type and details encoded as described above
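The column layout above can be assembled as follows. This Python sketch is illustrative: the function name format_gff3_line is hypothetical, and the numeric formatting of the score is an assumption, since the exact number format VCF2GFF writes is not specified here.

```python
from typing import Optional


def format_gff3_line(seqid: str, start: int, end: int,
                     score: Optional[float], attributes: str) -> str:
    """Join the nine GFF3 columns described above with tabs."""
    # Assumption: compact general formatting for the score; "." when missing.
    score_text = "." if score is None else f"{score:g}"
    columns = [
        seqid,                   # 1: seqid from VCF CHROM
        ".",                     # 2: source
        "sequence_variant_obs",  # 3: type
        str(start),              # 4: 1-based start
        str(end),                # 5: 1-based end
        score_text,              # 6: score from VCF QUAL
        "+",                     # 7: strand
        ".",                     # 8: phase
        attributes,              # 9: attributes
    ]
    return "\t".join(columns)


if __name__ == "__main__":
    print(format_gff3_line("chr1", 100, 100, 30.0, "ID=SUB A>G"))
```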
Use Cases
- Genome Annotation: Converting variant calls to GFF3 for integration with genome browsers and annotation pipelines
- Comparative Genomics: Standardizing variant representations across different analysis tools
- Data Exchange: Converting VCF variant calls to GFF3 for tools that require GFF3 input
- Visualization: Preparing variant data for genome browsers that prefer GFF3 format
- Annotation Databases: Loading variant information into genomic databases that use GFF3 as their standard format
Support
For questions and support:
- Email: bbushnell@lbl.gov
- Documentation: bbmap.org