Get ancient information from ensemble compara#
Try to collect ancient alleles from ensemble compara.
from urllib.parse import urljoin
from tqdm.notebook import tqdm
import pandas as pd
from ensemblrest import EnsemblRest
from tskitetude.smarterapi import VariantsEndpoint
from tskitetude.ensembl import ComparaSheepSNP
If I want to recover the coordinates of old OAR3 I need to connect to the compara
archive:
base_url = 'https://nov2020.rest.ensembl.org'
ensRest = EnsemblRest(base_url=base_url)
session = ensRest.session
Get information on assembly. Mind to the top level coordinates, which are the assembled chromosomes.
data = ensRest.getInfoAssembly(species="ovis_aries")
chromosomes = pd.DataFrame(list(filter(lambda record: record['coord_system'] == "chromosome", data["top_level_region"])))
chromosomes.head()
| coord_system | name | length | |
|---|---|---|---|
| 0 | chromosome | 1 | 275612895 |
| 1 | chromosome | 10 | 86447213 |
| 2 | chromosome | 11 | 62248096 |
| 3 | chromosome | 12 | 79100223 |
| 4 | chromosome | 13 | 83079144 |
A simple test to collect alignments:
url = urljoin(base_url, "alignment/region/ovis_aries/1:649900..650000")
params = [
('method', 'EPO'),
('species_set_group', 'mammals'),
('display_species_set', 'ovis_aries'),
('display_species_set', 'capra_hircus')
]
response = session.get(url, params=params)
response.json()
[{'alignments': [{'species': 'capra_hircus',
'end': 818868,
'seq': 'TGGAAGCAGCAGACGCAAGCCATGCCTG--AAGCT-CTGCAGCCTCGTGGAGGGAAAGAG--A-AAGCGCCAGACCGCAGAGCCAGCTCGAG---GTGA-GGGGTCTGCTA',
'description': '',
'seq_region': '3',
'strand': 1,
'start': 818768},
{'strand': -1,
'start': 37294,
'species': 'Chir-Oari[2]',
'end': 37394,
'seq': 'TGGAAGCAGCAGACGCAAGCCATGCCTG--AAGCT-CTGCAGCCTCGTGGAGGGAAAGAG--A-ACGCGCCAGACCGCAGAGCCAGCTCGAG---GTGA-GGGGTCTGCTA',
'seq_region': 'Ancestor_1888_968487',
'description': ''},
{'species': 'ovis_aries',
'end': 650000,
'seq': 'TGGAAGCAGCAGACGCAAGCCATGCCTG--AAGCT-CTGCAGCCTCGTGGAGGGAAAGAG--A-GCGCGCCAGACCGCAGAGCCAGCTCGAG---GTGA-GGGGTCTGCTA',
'description': '',
'seq_region': '1',
'strand': 1,
'start': 649900}],
'tree': '(capra_hircus_3_818768_818868[+]:0.01130222,ovis_aries_1_649900_650000[+]:0.0111331)Chir-Oari[2]:0.0169365;'}]
Ok, now try to collect some variant from 50K from the SMARTER API endpoint:
variant_api = VariantsEndpoint(species="Sheep", assembly="OAR3")
region = "1:1-5000000"
chip_name = "IlluminaOvineSNP50"
data = variant_api.get_variants(chip_name=chip_name, region=region)
page = page = data["page"]
variants = pd.json_normalize(data["items"])
while data["next"] is not None:
data = variant_api.get_variants(chip_name=chip_name, region=region, page=page+1)
df_page = pd.json_normalize(data["items"])
page = data["page"]
variants = pd.concat([variants, df_page], ignore_index=True)
variants.info()
2025-12-19 13:29:24,134 - tskitetude.smarterapi - INFO - Initialized VariantsEndpoint with URL: https://webserver.ibba.cnr.it/smarter-api/variants/sheep/OAR3
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 90 entries, 0 to 89
Data columns (total 23 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 affy_snp_id 88 non-null object
1 chip_name 90 non-null object
2 cust_id 69 non-null object
3 name 90 non-null object
4 probesets 88 non-null object
5 rs_id 90 non-null object
6 _id.$oid 90 non-null object
7 locations.alleles 90 non-null object
8 locations.chrom 90 non-null object
9 locations.illumina 90 non-null object
10 locations.illumina_forward 90 non-null object
11 locations.illumina_strand 86 non-null object
12 locations.illumina_top 90 non-null object
13 locations.imported_from 90 non-null object
14 locations.position 90 non-null int64
15 locations.ss_id 90 non-null object
16 locations.strand 90 non-null object
17 locations.version 90 non-null object
18 sequence.AffymetrixAxiomBGovis2 76 non-null object
19 sequence.AffymetrixAxiomBGovisNP 84 non-null object
20 sequence.AffymetrixAxiomOviCan 73 non-null object
21 sequence.IlluminaOvineHDSNP 71 non-null object
22 sequence.IlluminaOvineSNP50 90 non-null object
dtypes: int64(1), object(22)
memory usage: 16.3+ KB
Get a sample of the positions of the selected variants:
variants[["rs_id", "locations.chrom", "locations.position", "locations.alleles"]].head()
| rs_id | locations.chrom | locations.position | locations.alleles | |
|---|---|---|---|---|
| 0 | [rs430360910] | 1 | 52854 | A/G |
| 1 | [rs402427398] | 1 | 81978 | C/G |
| 2 | [rs413624639] | 1 | 120098 | C/T |
| 3 | [rs424810706] | 1 | 204694 | A/G |
| 4 | [rs55630911] | 1 | 315497 | A/G |
Get an helper object to query ensemble compara:
compara = ComparaSheepSNP(base_url=base_url)
2025-12-19 13:29:24,336 - tskitetude.ensembl - INFO - Using base URL: https://nov2020.rest.ensembl.org
Now try to collect ancestor alleles from alignments:
results = []
for idx, variant in tqdm(variants.iterrows(), total=variants.shape[0]):
ancestor = compara.get_ancestor(chrom=variant["locations.chrom"], position=variant["locations.position"])
results.append(ancestor)
Get only ancestor allele when an alignment is found:
ancestor_alleles = [None if result is None else result["seq"] for result in results]
Add ancestor allele to the variant dataframe:
variants["ancestor_alleles"] = ancestor_alleles
filtered_variants = variants[variants["ancestor_alleles"].notna()]
filtered_variants[["rs_id", "locations.chrom", "locations.position", "locations.alleles", "ancestor_alleles"]].head()
| rs_id | locations.chrom | locations.position | locations.alleles | ancestor_alleles | |
|---|---|---|---|---|---|
| 16 | [rs399425964] | 1 | 1193562 | C/T | C |
| 17 | [rs404997262] | 1 | 1245222 | A/G | T |
| 18 | [rs430138341] | 1 | 1390945 | A/C | T |
| 19 | [rs401390747] | 1 | 1406764 | C/G | G |
| 20 | [rs398738941] | 1 | 1513820 | A/G | T |
Please note that I can collect an ancestral allele which is not present in the alleles itself