BioMart is an amazing resource of well-curated genomic annotations – until you need to actually download data programmatically…
I gave it a try for a couple of hours using the biomaRt
R package, only to realise my query wouldn't be served in our lifetime…
However, I then moved on to try the Ensembl REST API instead.
That's a very decent option (in case you are not a Perl user who wants to use BioMart's native Perl API),
and it's pretty fast (always relative to biomaRt) and robust.
For any bulk data query, you can split your requests into chunks of IDs, or issue one request per ID.
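The chunking side of this can be sketched as follows (a minimal helper of my own; the `chunks` name and the batch size are illustrative, not part of any BioMart or Ensembl API):

```python
def chunks(items, size):
    """Yield successive fixed-size chunks from a list; the last may be shorter."""
    for i in range(0, len(items), size):
        yield items[i:i + size]

# e.g. split 10 IDs into batches of 3 -> 4 batches, the last holding the remainder
ids = ['ID%d' % n for n in range(10)]
batches = list(chunks(ids, 3))
```

Each batch can then be sent as one request (for endpoints that accept multiple IDs) or looped over individually.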
For example, to retrieve the full annotation for a list of GO IDs using Python3, you can iterate through each GO ID:
import requests, sys

server = "https://rest.ensembl.org"
ext_prefix = "/ontology/id/"

def get_go_term_by_id(id):
    # fetch the full ontology record for a single GO ID, as JSON
    ext = ext_prefix + id + "?content-type=application/json"
    r = requests.get(server + ext, headers={"Content-Type": "application/json"})
    if not r.ok:
        r.raise_for_status()
        sys.exit()
    decoded = r.json()
    print(repr(decoded))

go_ids = ['GO:0006958', 'GO:0031902', 'GO:0050776']
for id in go_ids:
    get_go_term_by_id(id)
Each individual call is served in under a second (usually a few milliseconds), so there is no need to worry about server timeouts.
Besides, you can wrap your code in try/except
blocks so that, if an individual call fails, the script continues processing the rest as normal.
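One failure mode worth handling explicitly is rate limiting: the Ensembl REST server answers HTTP 429 with a Retry-After header when a client exceeds its quota. A minimal retry wrapper could look like this (a sketch: the function name and the injectable `get` parameter are mine, chosen so the logic can be exercised without a live server; in real use you would pass `requests.get`):

```python
import time

def get_json_with_retry(url, get, max_retries=3):
    """Fetch a JSON resource, honouring 429 Retry-After; `get` is e.g. requests.get."""
    for attempt in range(max_retries):
        r = get(url, headers={"Content-Type": "application/json"})
        if r.status_code == 429:
            # back off for as long as the server asks, then try again
            time.sleep(float(r.headers.get("Retry-After", "1")))
            continue
        r.raise_for_status()   # raise on any other HTTP error
        return r.json()
    raise RuntimeError("rate-limited on every attempt: " + url)
```

With `requests` this would be called as `get_json_with_retry(server + ext, requests.get)` inside the try/except block.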
Simple multi-threading with map and pool in Python3
For real speed, you can submit multiple requests at once using parallel threads in Python (via the thread-based Pool exposed by the multiprocessing.dummy
module):
from multiprocessing.dummy import Pool as ThreadPool
import requests

# making 50 simultaneous requests
num_threads = 50
pool = ThreadPool(num_threads)

server = "https://rest.ensembl.org"
ext_prefix = "/ontology/id/"
go_id_terms_dict = {}

def get_go_term_by_id(id):
    try:
        ext = ext_prefix + id + "?content-type=application/json"
        r = requests.get(server + ext, headers={"Content-Type": "application/json"})
        if not r.ok:
            r.raise_for_status()
            return ""
        decoded = r.json()
        # keep just the term name, stripped of any quote characters
        go_term = decoded['name'].replace("'", "").replace('"', "")
    except Exception:
        go_term = ""
        print('[Warning] Could not fetch GO term for ID:', id)
    go_id_terms_dict[id] = go_term
    print(id + '||' + go_term)

# all_human_go_ids: your list with GO IDs
pool.map(get_go_term_by_id, all_human_go_ids)
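Stripped of the network calls, the map/pool pattern itself amounts to the following (a toy sketch, with a pure function standing in for the per-ID HTTP request):

```python
from multiprocessing.dummy import Pool as ThreadPool  # thread-based Pool

def slow_square(x):
    # stand-in for a per-ID request to the REST server
    return x * x

with ThreadPool(4) as pool:
    # map blocks until all work items are done and preserves input order
    results = pool.map(slow_square, range(8))
# results == [0, 1, 4, 9, 16, 25, 36, 49]
```

Note that pool.map returns results in the same order as the input list even though the calls run concurrently, which makes it easy to zip the results back onto your original list of IDs.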