VCF playground – Level 1
TL;DR
Order of elements in GC field of gnomAD VCF for multi-allelic entries:
GC = AA, AB, BB, AC, BC, CC, AD, BD, CD, DD, AE, BE, CE, DE, EE, …
Ref. allele: A
Alt. alleles: B, C, D, …
Ever tried to make sense of the infamous VCF format? Found your way through all the ‘/’s and ‘|’s and ‘:’s, etc. …? Well, you can’t really blame it: VCF captures a gorgeous variety of data types and humongous amounts of data, but getting hold of that data can sometimes be pretty… frustrating.
Today I want to talk about how information is structured in the Genotype Count (GC) field for multi-allelic variants. I couldn’t find any documentation about it, so I decided to work it out myself. Long story short, I’ve now found the ‘formula’ and I’d like to share it with anyone else having the same issue 🙂
For an introduction on genotypes and VCFs you can have a look here:
First, let’s get started with the most trivial case:
# 1 alternative allele
- 1 reference (ref.) allele (A) and
- 1 alternative (alt.) allele (B)
It’s pretty standardised / intuitive to say that in this case:
GC=AA,AB,BB
Let’s take it one step further:
# 2 alternative alleles
- 1 ref. allele (A) and
- 2 alt. alleles (B,C)
What would you say in that case?
GC=AA,AB,AC,BB,BC,CC ?
or
GC=AA,AB,BB,AC,BC,CC ?
or even…
GC=AA,AB,BB,AC,CC,BC ?
Now, imagine an even trickier case:
# 3 alternative alleles
- 1 ref. allele (A) and
- 3 alt. alleles (B,C,D)
Is it:
GC=AA,AB,AC,AD,BB,BC,BD,CC,CD,DD ?
or
GC=AA,AB,BB,BC,BD,AC,CC,CD,AD,DD ?
or
GC=AA,AB,BB,AC,BC,CC,AD,BD,CD,DD ?
or ... ?
What would each value in GC=[685, 13609, 272, 26, 17, 0, 6, 3, 0, 0] correspond to in that case?
I guess you get the point by now…
So, since browsing through the official documentation and a handful of forums and mailing lists wasn’t that fruitful, I had to dig into the original VCF files and look for evidence of the most plausible order of elements within the GC field.
In order to do that, let’s think for a moment about how we could calculate the GC field using other fields contained in the INFO column of a variant entry.
Given the Allele Count (AC), the Allele Number (AN) and the Hom value for an allele, the genotype counts for this particular allele can be calculated as follows:
- Hom = {value in Hom field in VCF} –> Homozygous_counts
- Het = AC – (2 x Hom) –> Heterozygous_counts
- Ref = AN / 2 – Hom – Het –> Reference_counts (number of samples carrying no copy of the allele under consideration, i.e. genotypes made up only of the original reference allele and/or the rest of the alternative alleles)
so that:
GC = Ref,Het,Hom
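The three steps above can be sketched in a few lines of Python. This is only a minimal illustration, assuming diploid samples (so AN / 2 is the number of genotyped samples); the function name and the numbers in the usage line are made up for the example and do not come from a real gnomAD entry.

```python
from typing import Tuple


def genotype_counts_for_allele(ac: int, an: int, hom: int) -> Tuple[int, int, int]:
    """Return (Ref, Het, Hom) sample counts for a single alt. allele.

    ac  -- allele count of the allele under consideration (AC)
    an  -- total number of called alleles at the site (AN)
    hom -- number of samples homozygous for the allele (Hom)
    Assumes diploid genotypes, i.e. an == 2 * number_of_samples.
    """
    n_samples, remainder = divmod(an, 2)
    if remainder:
        raise ValueError("AN must be even for diploid samples")

    het = ac - 2 * hom            # each homozygote contributes 2 copies to AC
    ref = n_samples - hom - het   # samples with no copy of this allele
    if het < 0 or ref < 0:
        raise ValueError("AC, AN and Hom are inconsistent with each other")
    return ref, het, hom


# Made-up numbers: 50 samples (AN=100), AC=10, of which 2 homozygotes.
print(genotype_counts_for_allele(ac=10, an=100, hom=2))  # (42, 6, 2)
```

Note that for a multi-allelic site "Ref" here lumps together every genotype that does not contain the allele in question, which is exactly why the per-allele numbers alone do not tell you the order of the full GC field.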
Hmm… What if we then looked at a sufficient number of VCF entries and tried to reproduce the (already calculated) GC field from the (also provided) AC and AN values, assuming a different convention for the order of values within the GC field each time?
That’s right! 😀
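To make the idea concrete, here is a rough sketch of such a check. It is not the exact procedure I followed, just one way it could be automated: read the GC values under each candidate ordering, work out the allele counts that reading implies, and compare them with the AC values reported in the entry. The function names and the tiny data set are hypothetical (synthetic numbers, not taken from gnomAD).

```python
from itertools import combinations_with_replacement


def order_a(n_alleles):
    """Candidate: AA, AB, BB, AC, BC, CC, ... (as allele index pairs)."""
    return [(j, k) for k in range(n_alleles) for j in range(k + 1)]


def order_b(n_alleles):
    """Candidate: AA, AB, AC, BB, BC, CC, ... (as allele index pairs)."""
    return list(combinations_with_replacement(range(n_alleles), 2))


def allele_counts_from_gc(gc, order):
    """Allele count per allele index implied by reading `gc` in `order`."""
    n_alleles = max(k for _, k in order) + 1
    counts = [0] * n_alleles
    for n_samples, (j, k) in zip(gc, order):
        counts[j] += n_samples  # one copy from the first allele
        counts[k] += n_samples  # one from the second (j == k adds two in total)
    return counts


# Synthetic entry with 2 alt. alleles; AC lists alt. alleles only.
gc = [50, 10, 2, 5, 3, 1]
reported_ac = [17, 10]

for name, order in (("order_a", order_a), ("order_b", order_b)):
    implied_ac = allele_counts_from_gc(gc, order(3))[1:]  # drop the ref. allele
    print(name, implied_ac, implied_ac == reported_ac)
```

An ordering only counts as plausible if it matches for every entry tested, and entries with many zeros or repeated values cannot tell the candidates apart, so the choice of variants matters.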
After some digging and careful selection of VCF variants with a sufficient number of non-zero values (to allow inference and validation), I was lucky enough to conclude that the convention used for the GC field is as follows:
The formula:
GC = AA, AB, BB, AC, BC, CC, AD, BD, CD, DD, AE, BE, CE, DE, EE, …
where:
- Reference allele: A
- Alternative alleles: B, C, D, … (in order of appearance in the VCF file)
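If you prefer code to letters, the ordering can be generated for any number of alt. alleles with a short Python sketch like the one below. The helper names are my own invention for illustration, and the letter labels limit it to 25 alt. alleles. As far as I can tell, this is the same pattern the VCF specification uses for ordering diploid genotype likelihoods, where the genotype of alleles j and k (j ≤ k) sits at position k(k+1)/2 + j.

```python
from string import ascii_uppercase


def gc_labels(n_alt):
    """Genotype labels in GC order for 1 ref. allele (A) plus n_alt alt. alleles."""
    alleles = ascii_uppercase[: n_alt + 1]
    return [alleles[j] + alleles[k]
            for k in range(n_alt + 1)
            for j in range(k + 1)]


def gc_index(j, k):
    """0-based position in GC of the genotype made of allele indices j and k."""
    j, k = sorted((j, k))
    return k * (k + 1) // 2 + j


print(gc_labels(2))  # ['AA', 'AB', 'BB', 'AC', 'BC', 'CC']

# Labelling the 3-alt-allele example from earlier in the post:
gc = [685, 13609, 272, 26, 17, 0, 6, 3, 0, 0]
labelled = dict(zip(gc_labels(3), gc))
print(labelled["BC"], gc[gc_index(1, 2)])  # 17 17
```

So in that example AA=685, AB=13609, BB=272, AC=26, BC=17, CC=0, AD=6, BD=3, CD=0 and DD=0.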
Still in doubt? Fair enough… 🙂
I will shortly follow up with a second post on this subject, demonstrating the validity of this convention with a set of representative and quite complex multi-allelic variant examples from a gnomAD VCF file!
I love fools’ experiments. I am always making them. — Charles Darwin
Thank you so much for writing this, it’s really helpful!
you’re welcome! happy to hear that 🙂