Get ancient information from ensemble compara

Get ancient information from ensemble compara#

Try to collect ancient alleles from ensemble compara.

from urllib.parse import urljoin
from tqdm.notebook import tqdm
import pandas as pd

from ensemblrest import EnsemblRest
from tskitetude.smarterapi import VariantsEndpoint
from tskitetude.ensembl import ComparaSheepSNP

If I want to recover the coordinates of old OAR3 I need to connect to the compara archive:

base_url = 'https://nov2020.rest.ensembl.org'
ensRest = EnsemblRest(base_url=base_url)
session = ensRest.session

Get information on assembly. Mind to the top level coordinates, which are the assembled chromosomes.

data = ensRest.getInfoAssembly(species="ovis_aries")
chromosomes = pd.DataFrame(list(filter(lambda record: record['coord_system'] == "chromosome", data["top_level_region"])))
chromosomes.head()
coord_system name length
0 chromosome 1 275612895
1 chromosome 10 86447213
2 chromosome 11 62248096
3 chromosome 12 79100223
4 chromosome 13 83079144

A simple test to collect alignments:

url = urljoin(base_url, "alignment/region/ovis_aries/1:649900..650000")
params = [
    ('method', 'EPO'),
    ('species_set_group', 'mammals'),
    ('display_species_set', 'ovis_aries'),
    ('display_species_set', 'capra_hircus')
]

response = session.get(url, params=params)
response.json()
[{'alignments': [{'species': 'capra_hircus',
    'end': 818868,
    'seq': 'TGGAAGCAGCAGACGCAAGCCATGCCTG--AAGCT-CTGCAGCCTCGTGGAGGGAAAGAG--A-AAGCGCCAGACCGCAGAGCCAGCTCGAG---GTGA-GGGGTCTGCTA',
    'description': '',
    'seq_region': '3',
    'strand': 1,
    'start': 818768},
   {'strand': -1,
    'start': 37294,
    'species': 'Chir-Oari[2]',
    'end': 37394,
    'seq': 'TGGAAGCAGCAGACGCAAGCCATGCCTG--AAGCT-CTGCAGCCTCGTGGAGGGAAAGAG--A-ACGCGCCAGACCGCAGAGCCAGCTCGAG---GTGA-GGGGTCTGCTA',
    'seq_region': 'Ancestor_1888_968487',
    'description': ''},
   {'species': 'ovis_aries',
    'end': 650000,
    'seq': 'TGGAAGCAGCAGACGCAAGCCATGCCTG--AAGCT-CTGCAGCCTCGTGGAGGGAAAGAG--A-GCGCGCCAGACCGCAGAGCCAGCTCGAG---GTGA-GGGGTCTGCTA',
    'description': '',
    'seq_region': '1',
    'strand': 1,
    'start': 649900}],
  'tree': '(capra_hircus_3_818768_818868[+]:0.01130222,ovis_aries_1_649900_650000[+]:0.0111331)Chir-Oari[2]:0.0169365;'}]

Ok, now try to collect some variant from 50K from the SMARTER API endpoint:

variant_api = VariantsEndpoint(species="Sheep", assembly="OAR3")

region = "1:1-5000000"
chip_name = "IlluminaOvineSNP50"

data = variant_api.get_variants(chip_name=chip_name, region=region)
page = page = data["page"]
variants = pd.json_normalize(data["items"])

while data["next"] is not None:
    data = variant_api.get_variants(chip_name=chip_name, region=region, page=page+1)
    df_page = pd.json_normalize(data["items"])
    page = data["page"]
    variants = pd.concat([variants, df_page], ignore_index=True)

variants.info()
2025-12-19 13:29:24,134 - tskitetude.smarterapi - INFO - Initialized VariantsEndpoint with URL: https://webserver.ibba.cnr.it/smarter-api/variants/sheep/OAR3
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 90 entries, 0 to 89
Data columns (total 23 columns):
 #   Column                            Non-Null Count  Dtype 
---  ------                            --------------  ----- 
 0   affy_snp_id                       88 non-null     object
 1   chip_name                         90 non-null     object
 2   cust_id                           69 non-null     object
 3   name                              90 non-null     object
 4   probesets                         88 non-null     object
 5   rs_id                             90 non-null     object
 6   _id.$oid                          90 non-null     object
 7   locations.alleles                 90 non-null     object
 8   locations.chrom                   90 non-null     object
 9   locations.illumina                90 non-null     object
 10  locations.illumina_forward        90 non-null     object
 11  locations.illumina_strand         86 non-null     object
 12  locations.illumina_top            90 non-null     object
 13  locations.imported_from           90 non-null     object
 14  locations.position                90 non-null     int64 
 15  locations.ss_id                   90 non-null     object
 16  locations.strand                  90 non-null     object
 17  locations.version                 90 non-null     object
 18  sequence.AffymetrixAxiomBGovis2   76 non-null     object
 19  sequence.AffymetrixAxiomBGovisNP  84 non-null     object
 20  sequence.AffymetrixAxiomOviCan    73 non-null     object
 21  sequence.IlluminaOvineHDSNP       71 non-null     object
 22  sequence.IlluminaOvineSNP50       90 non-null     object
dtypes: int64(1), object(22)
memory usage: 16.3+ KB

Get a sample of the positions of the selected variants:

variants[["rs_id", "locations.chrom", "locations.position", "locations.alleles"]].head()
rs_id locations.chrom locations.position locations.alleles
0 [rs430360910] 1 52854 A/G
1 [rs402427398] 1 81978 C/G
2 [rs413624639] 1 120098 C/T
3 [rs424810706] 1 204694 A/G
4 [rs55630911] 1 315497 A/G

Get an helper object to query ensemble compara:

compara = ComparaSheepSNP(base_url=base_url)
2025-12-19 13:29:24,336 - tskitetude.ensembl - INFO - Using base URL: https://nov2020.rest.ensembl.org

Now try to collect ancestor alleles from alignments:

results = []
for idx, variant in tqdm(variants.iterrows(), total=variants.shape[0]):
    ancestor = compara.get_ancestor(chrom=variant["locations.chrom"], position=variant["locations.position"])
    results.append(ancestor)

Get only ancestor allele when an alignment is found:

ancestor_alleles = [None if result is None else result["seq"] for result in results]

Add ancestor allele to the variant dataframe:

variants["ancestor_alleles"] = ancestor_alleles
filtered_variants = variants[variants["ancestor_alleles"].notna()]
filtered_variants[["rs_id", "locations.chrom", "locations.position", "locations.alleles", "ancestor_alleles"]].head()
rs_id locations.chrom locations.position locations.alleles ancestor_alleles
16 [rs399425964] 1 1193562 C/T C
17 [rs404997262] 1 1245222 A/G T
18 [rs430138341] 1 1390945 A/C T
19 [rs401390747] 1 1406764 C/G G
20 [rs398738941] 1 1513820 A/G T

Please note that I can collect an ancestral allele which is not present in the alleles itself