VCF Playground – Level 2
Following my previous post on GC elements’ order, I’m now going to present an empirical proof of this convention! As a reminder, the inferred order of elements within a GC field is:
GC=AA, AB,BB, AC,BC,CC, AD,BD,CD,DD, AE,BE,CE,DE,EE, ... (1)
Notation used:
- Reference Allele:
A
- Alternative Alleles:
B
,C
,D
, … (in order of appearance in the VCF file)
We would need to have sufficient example entries from a VCF to show concordance between the hypothesised -but not yet validated- convention (1) and the GC values as calculated using the AC and AN fields from the VCF data (equations from 1st post copied below for convenience):
- Hom = {value in Hom field in VCF or gnomAD browser}
- Het = AC β (2 x Hom)
- Ref = AN / 2 β Hom β Het
So, let’s start our calculations… I will first cite two representative examples from chr21 with 4 alleles that validate our convention (1):
Examples
-
VCF entry:
21, 9416219, G, A,GCATT,GCA
gnomAD browser entry:
http://gnomad.broadinstitute.org/variant/21-9416219-G-A
GC=685, 13609, 272, 26, 17, 0, 6, 3, 0, 0
(GC=AA, AB, BB, AC, BC, CC, AD, BD, CD, DD)
- Allele B:
Hom = 272
Het = AC - 2*Hom = 14173 - 2*272 = 13629
Ref_b = AN/2 - Het - Hom = 29236/2 - Het - Hom = 717 =
= AA + AC + AD = 685 + 26 + 6
# (Rest of A counts, apart from AB which is already counted in Het)
-
VCF entry:
21, 9415561, CG, C,TG,GG
gnomAD browser entry:
http://gnomad.broadinstitute.org/variant/21-9415561-CG-C
GC=2925, 6449, 1, 1825, 84, 0, 14, 1, 0, 0
(GC=AA, AB, BB, AC, BC, CC, AD, BD, CD, DD)
- Allele B:
Het = 6536 - 2*1 = 6534
Hom = 1
Ref_b = 22598/2 - Het - Hom = 4764 = AA + AC + AD = 2925 + 1825 + 14
# (Rest of A counts, apart from AB which is already counted in Het)
- Allele C: (in VCF: CG/TG, in browser: C/T)
Het = 1909 - 2*0 = 1909 = AC + BC + CD = 1825 + 84 + 0 =
Hom = 0
Ref_c = 22598/2 - Het - Hom = 9390 =
= AA + AB + AD + BB + BD = 2925 + 6449 + 14 + 1 + 1
# (Rest of A and B counts, apart from AC and BC,
which are already counted in Het)
- Allele D: (in VCF: CG/GG, in browser: C/T)
Het = 15 - 2*0 = 15 = AD + BD + CD = 14 + 1 + 0
Hom = 0
Ref_d = 22598/2 - Het - Hom = 11284 =
= AA + AB + BB + AC + BC + CC = 2925 + 6449 + 1 + 1825 + 84 + 0
# (Rest of A, B and C counts, apart from AD, BD and CD,
which are already counted in Het)
So far, so good…
However, I’m still not convinced even after several validation examples of 4 alleles, especially when they have several zero values in some pairs within GC…
Let’s try to see if we can get something really solid from chr21, such as an entry with no zero-value at the last pair of the GC field:
$ zcat < gnomad.genomes.r2.0.2.sites.chr21.vcf.bgz | awk -F "GC=" '{print $2}' | cut -d';' -f1 | sort | uniq | grep -v "0$" | head -20
1151,536,2,27,0,1,261,1,0,2
12801,768,5,2,0,0,1,0,0,0,0,0,0,0,1
12863,77,0,44,0,0,21,0,0,1
4517,11,0,125,0,12,12,0,0,1
46,1435,0,439,47,1,2135,70,211,1,1417,788,155,497,49,921,849,310,986,367,117,25,29,15,34,13,14,1
And voila! The last line looks like the perfect example to test our convention: 7 alleles and just one zero-value pair in GC! So, let’s find the relevant entry in the original VCF file (did that already with grep) and then continue with the validation:
ULTIMATE CHECK: 7 alleles
-
VCF entry:
21, 9421430, GTTT, GT,GTT,G,GTTTTTT,GTTTT,GTTTTT
gnomAD browser entry:
http://gnomad.broadinstitute.org/variant/21-9421430-GTT-G
GC=46, 1435, 0, 439, 47, 1, 2135, 70, 211, 1, 1417, 788, 155, 497, 49, 921, 849, 310, 986, 367, 117, 25, 29, 15, 34, 13, 14, 1
(GC=AA, AB, BB, AC, BC, CC, AD, BD, CD, DD, AE, BE, CE, DE, EE, AF, BF, CF, DF, EF, FF, AG, BG, CG, DG, EG, FG, GG)
- Allele B: (in VCF: GTTT/GT, in browser: GTT/G)
Het = 3218 - 2 * 0 = 3218 = 1435 + 47 + 70 + 788 + 849 + 29 =
= AB + BC + BD + BE + BF + BG
- Allele C: (in VCF: GTTT/GTT, in browser: GT/G)
Het = 1179 - 2 = 1177 = 439 + 47 + 211 + 155 + 310 + 15 =
= AC + BC + CD + CE + CF + CG
- Allele D: (in VCF: GTTT/G, in browser: GTTT/G)
Het = 3935 - 2 = 3933 = 2135 + 70 + 211 + 497 + 986 + 34 =
= AD + BD + CD + DE + DF + DG
- Allele E: (in VCF: GTTT/G, in browser: G/GTTT)
Hom = 49
Het = 3335 - 2*49 = 3237 = 1417 + 788 + 155 + 497 + 367 + 13 =
= AE + BE + CE + DE + EF + EG
Ref_e = 21944/2 - Het - Hom = 7686 =
= 46 + 1435 + 439 + 2135 + 921 + 25 + 0 + 47 + 70 + 849 + 29 + 1 + 211 + 310 + 15 + 1 + 986 + 34 + 117 + 14 + 1
= AA + AB + AC + AD + AF + AG + BB + BC + BD + BF + BG + CC + CD + CF + CG + DD + DF + DG + FF + FG + GG =
# (Rest of A, B, C and D counts, apart from AE, BE, CE, DE which are already counted in Het)
So, that was it! π
Feel free to test any other alleles from this example to get a full grasp of these calculations!
SPECIAL CASES
There are some entries in the gnomAD VCF files where allele counts derived from GC do not add up to the total AN
value.
This is due to the presence of alleles with spanning deletions (denoted by STAR_*
fields) that are not counted eventually in the Genotype Count field.
Example
21
,9438414
,ACTTTTTGCGCCCACCGCGG
,A,ACGG,GCTTTTTGCGCCCACCGCGG
gnomAD browser entry:
http://gnomad.broadinstitute.org/variant/21-9438414-ACTTTTTGCGCCCACCGCGG-A
GC=1151, 536, 2, 27, 0, 1, 261, 1, 0, 2
(GC=AA, AB, BB, AC, BC, CC, AD, BD, CD, DD)
AN = 11108 – fields in GC cannot add up to this AN value
- Allele B: (in VCF: ACTTTTTGCGCCCACCGCGG/A, in browser: ACTTTTTGCGCCCACCGCGG/A)
Het = 541 - 2*2 = 537
- Allele C: (in VCF: ACTTTTTGCGCCCACCGCGG/ACGG, in browser: ACTTTTTGCGCCCACCG/A)
Het = 29 - 2*1 = 27
- Allele D: (in VCF: ACTTTTTGCGCCCACCGCGG/GCTTTTTGCGCCCACCGCGG, in browser: A/G)
Het = 266 - 2*2 = 262 == AD + BD