Get ancient information from ensemble compara

Get ancient information from ensemble compara#

Try to collect ancient alleles from ensemble compara.

from urllib.parse import urljoin
from tqdm.notebook import tqdm
import pandas as pd

from ensemblrest import EnsemblRest
from tskitetude.smarterapi import VariantsEndpoint
from tskitetude.ensembl import ComparaSheepSNP

If I want to recover the coordinates of old OAR3 I need to connect to the compara archive:

base_url = 'https://nov2020.rest.ensembl.org'
ensRest = EnsemblRest(base_url=base_url)
session = ensRest.session

Get information on assembly. Mind to the top level coordinates, which are the assembled chromosomes.

data = ensRest.getInfoAssembly(species="ovis_aries")
chromosomes = pd.DataFrame(list(filter(lambda record: record['coord_system'] == "chromosome", data["top_level_region"])))
chromosomes.head()

	coord_system	name	length
0	chromosome	1	275612895
1	chromosome	10	86447213
2	chromosome	11	62248096
3	chromosome	12	79100223
4	chromosome	13	83079144

A simple test to collect alignments:

url = urljoin(base_url, "alignment/region/ovis_aries/1:649900..650000")
params = [
    ('method', 'EPO'),
    ('species_set_group', 'mammals'),
    ('display_species_set', 'ovis_aries'),
    ('display_species_set', 'capra_hircus')
]

response = session.get(url, params=params)
response.json()

[{'alignments': [{'species': 'capra_hircus',
    'end': 818868,
    'seq': 'TGGAAGCAGCAGACGCAAGCCATGCCTG--AAGCT-CTGCAGCCTCGTGGAGGGAAAGAG--A-AAGCGCCAGACCGCAGAGCCAGCTCGAG---GTGA-GGGGTCTGCTA',
    'description': '',
    'seq_region': '3',
    'strand': 1,
    'start': 818768},
   {'strand': -1,
    'start': 37294,
    'species': 'Chir-Oari[2]',
    'end': 37394,
    'seq': 'TGGAAGCAGCAGACGCAAGCCATGCCTG--AAGCT-CTGCAGCCTCGTGGAGGGAAAGAG--A-ACGCGCCAGACCGCAGAGCCAGCTCGAG---GTGA-GGGGTCTGCTA',
    'seq_region': 'Ancestor_1888_968487',
    'description': ''},
   {'species': 'ovis_aries',
    'end': 650000,
    'seq': 'TGGAAGCAGCAGACGCAAGCCATGCCTG--AAGCT-CTGCAGCCTCGTGGAGGGAAAGAG--A-GCGCGCCAGACCGCAGAGCCAGCTCGAG---GTGA-GGGGTCTGCTA',
    'description': '',
    'seq_region': '1',
    'strand': 1,
    'start': 649900}],
  'tree': '(capra_hircus_3_818768_818868[+]:0.01130222,ovis_aries_1_649900_650000[+]:0.0111331)Chir-Oari[2]:0.0169365;'}]

Ok, now try to collect some variant from 50K from the SMARTER API endpoint:

variant_api = VariantsEndpoint(species="Sheep", assembly="OAR3")

region = "1:1-5000000"
chip_name = "IlluminaOvineSNP50"

data = variant_api.get_variants(chip_name=chip_name, region=region)
page = page = data["page"]
variants = pd.json_normalize(data["items"])

while data["next"] is not None:
    data = variant_api.get_variants(chip_name=chip_name, region=region, page=page+1)
    df_page = pd.json_normalize(data["items"])
    page = data["page"]
    variants = pd.concat([variants, df_page], ignore_index=True)

variants.info()

2025-12-19 13:29:24,134 - tskitetude.smarterapi - INFO - Initialized VariantsEndpoint with URL: https://webserver.ibba.cnr.it/smarter-api/variants/sheep/OAR3

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 90 entries, 0 to 89
Data columns (total 23 columns):
 #   Column                            Non-Null Count  Dtype 
---  ------                            --------------  ----- 
 affy_snp_id                       88 non-null     object
 chip_name                         90 non-null     object
 cust_id                           69 non-null     object
 name                              90 non-null     object
 probesets                         88 non-null     object
 rs_id                             90 non-null     object
 _id.$oid                          90 non-null     object
 locations.alleles                 90 non-null     object
 locations.chrom                   90 non-null     object
 locations.illumina                90 non-null     object
locations.illumina_forward        90 non-null     object
locations.illumina_strand         86 non-null     object
locations.illumina_top            90 non-null     object
locations.imported_from           90 non-null     object
locations.position                90 non-null     int64 
locations.ss_id                   90 non-null     object
locations.strand                  90 non-null     object
locations.version                 90 non-null     object
sequence.AffymetrixAxiomBGovis2   76 non-null     object
sequence.AffymetrixAxiomBGovisNP  84 non-null     object
sequence.AffymetrixAxiomOviCan    73 non-null     object
sequence.IlluminaOvineHDSNP       71 non-null     object
sequence.IlluminaOvineSNP50       90 non-null     object
dtypes: int64(1), object(22)
memory usage: 16.3+ KB

Get a sample of the positions of the selected variants:

variants[["rs_id", "locations.chrom", "locations.position", "locations.alleles"]].head()

	rs_id	locations.chrom	locations.position	locations.alleles
0	[rs430360910]	1	52854	A/G
1	[rs402427398]	1	81978	C/G
2	[rs413624639]	1	120098	C/T
3	[rs424810706]	1	204694	A/G
4	[rs55630911]	1	315497	A/G

Get an helper object to query ensemble compara:

compara = ComparaSheepSNP(base_url=base_url)

2025-12-19 13:29:24,336 - tskitetude.ensembl - INFO - Using base URL: https://nov2020.rest.ensembl.org

Now try to collect ancestor alleles from alignments:

results = []
for idx, variant in tqdm(variants.iterrows(), total=variants.shape[0]):
    ancestor = compara.get_ancestor(chrom=variant["locations.chrom"], position=variant["locations.position"])
    results.append(ancestor)

Get only ancestor allele when an alignment is found:

ancestor_alleles = [None if result is None else result["seq"] for result in results]

Add ancestor allele to the variant dataframe:

variants["ancestor_alleles"] = ancestor_alleles
filtered_variants = variants[variants["ancestor_alleles"].notna()]
filtered_variants[["rs_id", "locations.chrom", "locations.position", "locations.alleles", "ancestor_alleles"]].head()

	rs_id	locations.chrom	locations.position	locations.alleles	ancestor_alleles
16	[rs399425964]	1	1193562	C/T	C
17	[rs404997262]	1	1245222	A/G	T
18	[rs430138341]	1	1390945	A/C	T
19	[rs401390747]	1	1406764	C/G	G
20	[rs398738941]	1	1513820	A/G	T

Please note that I can collect an ancestral allele which is not present in the alleles itself