In today’s post, we’ll just be doing some…simple counting with Genotype Counts (GC) at a Cambridge pub. π
The general form of GC with a ref allele A and multiple alternative alleles (B,C,D, etc.) is:
GC = AA, AB,BB, AC,BC,CC, AD,BD,CD,DD, ...
GC fields basically capture the likelihood of two events occurring simultaneously at two spots. These events could be the same or different from each other. A is the most likely event to happen while B, C, D occur (usually) with much less likelihood.
Let’s say you’re visiting Cambridge on a Tuesday evening and you want to go to a pub and watch football. You heard from the locals that there are two cosy sports pubs close to Parker’s piece in the city centre, the Grain Store and the Alma. They told you that both pubs show all sorts of sports, from football and cricket to rugby and basketball (errr….ok maybe not basketball but let’s keep it like that for the sake of our story π).
However, some sports are obviously more popular than others and thus it’s more likely for a pub to show e.g. a football game than a cricket game. You might guess that both pubs would show football more frequently, with rugby, cricket, basketball and others (k sports in total) following in order of decreasing frequencies. So, to formulate that in “plain human language”, we have k types of events that can happen in either of our lovely pubs:
- A: football is shown at either pub
- B: rugby is shown at either pub
- C: cricket is shown at either pub
- … (another sport is shown at either pub)
- N: basketball is shown at either pub
You did some research the night before and based on historical records available on PUBmed, the frequency pe, e={A,B,C,..,N} that each sport is shown either at the Alma or the Grain Store follows the order: pN < … < pC < pB < pA . Additionally, you were able to get the full list of probabilities for the combinations of all sports being shown at these pubs at the same time, i.e. the full series L = (AA, AB, BB, AC, BC, CC, …, AN, BN, CN, …, NN).
You’d probably be lucky enough to go to any of the two pubs and still be able to watch football on that night, so you decide to just go with the Alma. However, the inner statistician in you can’t stop thinking about the probabilities of every possible combination of sports being shown in the Alma and the Grain Store on that night, or actually on every night.
For a given sport x, x β {A, B, C, …, N} being shown in the Alma, you would like to know the combined probabilities of the Grain Store showing sport y β {A, B, C, …, N} at the same time. So, you’d like to be able to retrieve from L all pairs that have either x or y in them.
If we go back to genetics and Genotype Counts, you’d just want to get the probabilities with which a specific alternative allele (e.g. F) occurs in every combination with any of the alleles (including F and the reference allele A).
To solve that issue, I wrote some code recently which I though might be useful to others too:
https://github.com/dvitsios/genomics-factory/blob/master/gc_query_module.py
P.S.: Of course, the sports pub analogy as presented here is not exactly equivalent with the subject of allele frequencies etc. but it still captures some pretty intuitive facts about this area! π
import sys def get_num_of_alleles_from_gc(gc_len): sum = 0 for i in range(1, gc_len): sum += i if sum == gc_len: return i def get_gc_indices_for_alt_allele(num_alleles, alt_idx=0, verbose=True): # --- Sanity check for input arguments --- if (not isinstance(num_alleles, int)) or num_alleles < 2: raise ValueError('num_alleles must be integer >= 2') if (not isinstance(alt_idx, int)) or alt_idx < 0: raise ValueError('alt_idx must be non negative integer') if alt_idx > num_alleles - 2: raise ValueError('alt_idx must be <= (num_alleles - 2)') # ----------------------------------------- # list of uppercase letters: ['A', 'B', 'C', ...] generic_allele_lookup_table = list(map(chr, range(65, 91))) allele_idx = alt_idx + 1 cur_pairs_list = [] for i in range(num_alleles): for j in range(i+1): ref_alt_pair = generic_allele_lookup_table[i] + generic_allele_lookup_table[j] ref_alt_pair = ref_alt_pair[::-1] cur_pairs_list.append(ref_alt_pair) alt_allele = generic_allele_lookup_table[ allele_idx ] indices = [i for i,s in enumerate(cur_pairs_list) if alt_allele in s] if verbose: print(" Alt-allele:",alt_allele) print(" Indices:", indices) print(" All pairs:", cur_pairs_list) print(" Selected pairs:", [cur_pairs_list[i] for i in indices]) return indices def get_limited_gc_indices_for_alt_allele(num_alleles, alt_idx=0, verbose=True): # --- Sanity check for input arguments --- if (not isinstance(num_alleles, int)) or num_alleles < 2: raise ValueError('num_alleles must be integer >= 2') if (not isinstance(alt_idx, int)) or alt_idx < 0: raise ValueError('alt_idx must be non negative integer') if alt_idx > num_alleles - 2: raise ValueError('alt_idx must be <= (num_alleles - 2)') # ----------------------------------------- # list of uppercase letters: ['A', 'B', 'C', ...] generic_allele_lookup_table = list(map(chr, range(65, 91))) allele_idx = alt_idx + 1 cur_pairs_list = [] for i in range(num_alleles): for j in range(i+1): ref_alt_pair = generic_allele_lookup_table[i] + generic_allele_lookup_table[j] ref_alt_pair = ref_alt_pair[::-1] cur_pairs_list.append(ref_alt_pair) alt_allele = generic_allele_lookup_table[ allele_idx ] limited_pairs = [] limited_pairs.append('A' + alt_allele) # AB limited_pairs.append(alt_allele * 2) # BB indices = [i for i,pair in enumerate(cur_pairs_list) if pair in limited_pairs] if verbose: print(" Alt-allele:",alt_allele) print(" Indices:", indices) print(" All pairs:", cur_pairs_list) print(" Limited pairs:", limited_pairs) print(" Selected pairs:", [cur_pairs_list[i] for i in indices]) return indices def run_examples(): # Examples GC_list = [[5601,3018,5795], [685, 13609, 272, 26, 17, 0, 6, 3, 0, 0]] for i in range(len(GC_list)): GC = GC_list[i] print("\n>> Example " + str(i+1) + ":", GC) num_alleles = get_num_of_alleles_from_gc(len(GC)) print(" - Num. of alleles:", num_alleles) for alt_idx in range(num_alleles-1): print("\n - Look up alt-allele at index:", alt_idx) print(" ... get_gc_indices_for_alt_allele(", num_alleles, ",", alt_idx, "):\n") ret_idxs = get_gc_indices_for_alt_allele(num_alleles, alt_idx) ret_idxs = get_limited_gc_indices_for_alt_allele(num_alleles, alt_idx) if __name__ == '__main__': print(">> get_gc_indices_for_alt_allele(num_alleles = 4, alt_idx = 1):\n") ret_idxs = get_gc_indices_for_alt_allele(4, 2) print(ret_idxs) print('+++++++++++++++++++++') ret_limited = get_limited_gc_indices_for_alt_allele(4, 1) print(ret_limited) run_examples()