Infer ancestry using EST-SFS

Infer ancestry using EST-SFS#

This was the first attempt to create tstree object using EST-SFS. In this example we collect all background samples from SMARTER database and we use all Ovis aries samples as focal samples, and european, sardinian and spanish mouflon as three different outgroups to make inference with EST-SFS. We will write all selection of samples as CSV files with FID and IID columns in order to extract from the whole genotype files only the samples we need.

Try to collect sheep background samples from SMARTER database:

import pandas as pd

from tskitetude import get_data_dir
from tskitetude.smarterapi import SheepEndpoint, BreedEndpoint

Connect to SMARTER database and retrieve information on background samples:

sheep_api = SheepEndpoint()

data = sheep_api.get_samples(_type="background")
page = 1
sheep = pd.DataFrame(data["items"])

while data["next"] is not None:
    data = sheep_api.get_samples(page=page+1, _type="background")
    df_page = pd.DataFrame(data["items"])
    page = data["page"]
    sheep = pd.concat([sheep, df_page], ignore_index=True)

sheep.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5892 entries, 0 to 5891
Data columns (total 17 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   _id          5892 non-null   object 
 1   breed        5892 non-null   object 
 2   breed_code   5892 non-null   object 
 3   chip_name    5892 non-null   object 
 4   country      5892 non-null   object 
 5   dataset_id   5892 non-null   object 
 6   locations    4849 non-null   object 
 7   metadata     4878 non-null   object 
 8   original_id  5892 non-null   object 
 9   phenotype    577 non-null    object 
 10  smarter_id   5892 non-null   object 
 11  species      5892 non-null   object 
 12  type         5892 non-null   object 
 13  father_id    58 non-null     object 
 14  mother_id    58 non-null     object 
 15  sex          501 non-null    float64
 16  alias        156 non-null    object 
dtypes: float64(1), object(16)
memory usage: 782.7+ KB

Are those all background samples?

sheep.value_counts("type")
type
background    5892
Name: count, dtype: int64

Ok. Let’s collect all available species:

sheep.value_counts("species")
species
Ovis aries                5739
Ovis aries musimon         134
Ovis orientalis             16
Ovis orientalis ophion       3
Name: count, dtype: int64

Ok, now collect all samples which are Ovis aries:

ovis_aries = sheep[sheep["species"] == "Ovis aries"]
ovis_aries.head()
_id breed breed_code chip_name country dataset_id locations metadata original_id phenotype smarter_id species type father_id mother_id sex alias
0 {'$oid': '66544dd515d202ad5f28dc8e'} Texel TEX IlluminaOvineSNP50 Uruguay {'$oid': '604f75a61a08c53cebd09b67'} {'coordinates': [[-54.79980235632396, -32.8596... {'gps_2': 'https://www.google.com/maps/place/E... 20181210002 {'purpose': 'Meat'} UYOA-TEX-000000001 Ovis aries background NaN NaN NaN NaN
1 {'$oid': '66544dd615d202ad5f28dc8f'} Texel TEX IlluminaOvineSNP50 Uruguay {'$oid': '604f75a61a08c53cebd09b67'} {'coordinates': [[-54.79980235632396, -32.8596... {'gps_2': 'https://www.google.com/maps/place/E... 20181210003 {'purpose': 'Meat'} UYOA-TEX-000000002 Ovis aries background NaN NaN NaN NaN
2 {'$oid': '66544dd615d202ad5f28dc90'} Texel TEX IlluminaOvineSNP50 Uruguay {'$oid': '604f75a61a08c53cebd09b67'} {'coordinates': [[-54.79980235632396, -32.8596... {'gps_2': 'https://www.google.com/maps/place/E... 20181210005 {'purpose': 'Meat'} UYOA-TEX-000000003 Ovis aries background NaN NaN NaN NaN
3 {'$oid': '66544dd715d202ad5f28dc91'} Texel TEX IlluminaOvineSNP50 Uruguay {'$oid': '604f75a61a08c53cebd09b67'} {'coordinates': [[-54.79980235632396, -32.8596... {'gps_2': 'https://www.google.com/maps/place/E... 20181210006 {'purpose': 'Meat'} UYOA-TEX-000000004 Ovis aries background NaN NaN NaN NaN
4 {'$oid': '66544dd715d202ad5f28dc92'} Texel TEX IlluminaOvineSNP50 Uruguay {'$oid': '604f75a61a08c53cebd09b67'} {'coordinates': [[-54.79980235632396, -32.8596... {'gps_2': 'https://www.google.com/maps/place/E... 20181210008 {'purpose': 'Meat'} UYOA-TEX-000000005 Ovis aries background NaN NaN NaN NaN

How many breeds I have?

ovis_aries.value_counts("breed")
breed
Frizarta          322
Corriedale        214
Texel             193
Churra            120
Soay              110
                 ... 
Spael-coloured      3
Dubska              2
Recka               2
Privorska           2
Sri Lankan          1
Name: count, Length: 202, dtype: int64

Ensure that there are no mouflon in sheep breed names:

ovis_aries["breed"].str.contains("Mouflon", case=False).any()
np.False_

Ok, now collect Ovis aries musimon samples:

ovis_aries_musimon = sheep[sheep["species"] == "Ovis aries musimon"]
ovis_aries_musimon.head()
_id breed breed_code chip_name country dataset_id locations metadata original_id phenotype smarter_id species type father_id mother_id sex alias
788 {'$oid': '665468ffe91476d662a729a2'} European mouflon EUR IlluminaOvineHDSNP France {'$oid': '604f75a61a08c53cebd09b58'} {'coordinates': [[9.0129, 42.0396]], 'type': '... {'link': 'https://en.wikipedia.org/wiki/Mouflo... EUR1 NaN FROA-EUR-000000789 Ovis aries musimon background NaN NaN NaN NaN
789 {'$oid': '66546903e91476d662a729a3'} European mouflon EUR IlluminaOvineHDSNP France {'$oid': '604f75a61a08c53cebd09b58'} {'coordinates': [[9.0129, 42.0396]], 'type': '... {'link': 'https://en.wikipedia.org/wiki/Mouflo... EUR2 NaN FROA-EUR-000000790 Ovis aries musimon background NaN NaN NaN NaN
4849 {'$oid': '66549c5268b308e4e558b2e1'} Spanish mouflon MSP IlluminaOvineSNP50 Spain {'$oid': '639b4b922cee7f5ed2014011'} NaN NaN EMF1 NaN ESOA-MSP-000010795 Ovis aries musimon background NaN NaN NaN NaN
4850 {'$oid': '66549c5268b308e4e558b2e2'} Spanish mouflon MSP IlluminaOvineSNP50 Spain {'$oid': '639b4b922cee7f5ed2014011'} NaN NaN EMF2 NaN ESOA-MSP-000010796 Ovis aries musimon background NaN NaN NaN NaN
4851 {'$oid': '66549c5268b308e4e558b2e3'} Spanish mouflon MSP IlluminaOvineSNP50 Spain {'$oid': '639b4b922cee7f5ed2014011'} NaN NaN EMF3 NaN ESOA-MSP-000010797 Ovis aries musimon background NaN NaN NaN NaN

How many breeds I have?

ovis_aries_musimon.value_counts("breed")
breed
Sardinian mouflon    79
European mouflon     23
Spanish mouflon      21
Hungarian mouflon     8
Corsican mouflon      3
Name: count, dtype: int64

Ok, try to collect European mouflon:

european_mouflon = ovis_aries_musimon[ovis_aries_musimon["breed"] == "European mouflon"]
european_mouflon.head()
_id breed breed_code chip_name country dataset_id locations metadata original_id phenotype smarter_id species type father_id mother_id sex alias
788 {'$oid': '665468ffe91476d662a729a2'} European mouflon EUR IlluminaOvineHDSNP France {'$oid': '604f75a61a08c53cebd09b58'} {'coordinates': [[9.0129, 42.0396]], 'type': '... {'link': 'https://en.wikipedia.org/wiki/Mouflo... EUR1 NaN FROA-EUR-000000789 Ovis aries musimon background NaN NaN NaN NaN
789 {'$oid': '66546903e91476d662a729a3'} European mouflon EUR IlluminaOvineHDSNP France {'$oid': '604f75a61a08c53cebd09b58'} {'coordinates': [[9.0129, 42.0396]], 'type': '... {'link': 'https://en.wikipedia.org/wiki/Mouflo... EUR2 NaN FROA-EUR-000000790 Ovis aries musimon background NaN NaN NaN NaN
5739 {'$oid': '66549c6eb20eb5227329c2ca'} European mouflon EUR IlluminaOvineSNP50 France {'$oid': '639b4b922cee7f5ed2014012'} {'coordinates': [[9.15, 42.31]], 'type': 'Mult... {'region': 'Corsica', 'type_': 'Mouflon'} EMF1 NaN FROA-EUR-000011685 Ovis aries musimon background NaN NaN NaN NaN
5740 {'$oid': '66549c6eb20eb5227329c2cb'} European mouflon EUR IlluminaOvineSNP50 France {'$oid': '639b4b922cee7f5ed2014012'} {'coordinates': [[9.15, 42.31]], 'type': 'Mult... {'region': 'Corsica', 'type_': 'Mouflon'} EMF2 NaN FROA-EUR-000011686 Ovis aries musimon background NaN NaN NaN NaN
5741 {'$oid': '66549c6eb20eb5227329c2cc'} European mouflon EUR IlluminaOvineSNP50 France {'$oid': '639b4b922cee7f5ed2014012'} {'coordinates': [[9.15, 42.31]], 'type': 'Mult... {'region': 'Corsica', 'type_': 'Mouflon'} EMF3 NaN FROA-EUR-000011687 Ovis aries musimon background NaN NaN NaN NaN

Ok, I’m also interested in Sardinian mouflon:

sardinian_mouflon = ovis_aries_musimon[ovis_aries_musimon["breed"] == "Sardinian mouflon"]
sardinian_mouflon.head()
_id breed breed_code chip_name country dataset_id locations metadata original_id phenotype smarter_id species type father_id mother_id sex alias
4881 {'$oid': '66549c5368b308e4e558b301'} Sardinian mouflon MSA IlluminaOvineSNP50 Italy {'$oid': '639b4b922cee7f5ed2014011'} NaN NaN 72_MufloneS NaN ITOA-MSA-000010827 Ovis aries musimon background NaN NaN NaN NaN
4882 {'$oid': '66549c5368b308e4e558b302'} Sardinian mouflon MSA IlluminaOvineSNP50 Italy {'$oid': '639b4b922cee7f5ed2014011'} NaN NaN 73_MufloneS NaN ITOA-MSA-000010828 Ovis aries musimon background NaN NaN NaN NaN
4883 {'$oid': '66549c5368b308e4e558b303'} Sardinian mouflon MSA IlluminaOvineSNP50 Italy {'$oid': '639b4b922cee7f5ed2014011'} NaN NaN 74_MufloneS NaN ITOA-MSA-000010829 Ovis aries musimon background NaN NaN NaN NaN
4884 {'$oid': '66549c5368b308e4e558b304'} Sardinian mouflon MSA IlluminaOvineSNP50 Italy {'$oid': '639b4b922cee7f5ed2014011'} NaN NaN 75_MufloneS NaN ITOA-MSA-000010830 Ovis aries musimon background NaN NaN NaN NaN
4885 {'$oid': '66549c5368b308e4e558b305'} Sardinian mouflon MSA IlluminaOvineSNP50 Italy {'$oid': '639b4b922cee7f5ed2014011'} NaN NaN 76_MufloneS NaN ITOA-MSA-000010831 Ovis aries musimon background NaN NaN NaN NaN

Should I take Spanish mouflon as third outgroup?

spanish_mouflon = ovis_aries_musimon[ovis_aries_musimon["breed"] == "Spanish mouflon"]
spanish_mouflon.head()
_id breed breed_code chip_name country dataset_id locations metadata original_id phenotype smarter_id species type father_id mother_id sex alias
4849 {'$oid': '66549c5268b308e4e558b2e1'} Spanish mouflon MSP IlluminaOvineSNP50 Spain {'$oid': '639b4b922cee7f5ed2014011'} NaN NaN EMF1 NaN ESOA-MSP-000010795 Ovis aries musimon background NaN NaN NaN NaN
4850 {'$oid': '66549c5268b308e4e558b2e2'} Spanish mouflon MSP IlluminaOvineSNP50 Spain {'$oid': '639b4b922cee7f5ed2014011'} NaN NaN EMF2 NaN ESOA-MSP-000010796 Ovis aries musimon background NaN NaN NaN NaN
4851 {'$oid': '66549c5268b308e4e558b2e3'} Spanish mouflon MSP IlluminaOvineSNP50 Spain {'$oid': '639b4b922cee7f5ed2014011'} NaN NaN EMF3 NaN ESOA-MSP-000010797 Ovis aries musimon background NaN NaN NaN NaN
4852 {'$oid': '66549c5268b308e4e558b2e4'} Spanish mouflon MSP IlluminaOvineSNP50 Spain {'$oid': '639b4b922cee7f5ed2014011'} NaN NaN EMF4 NaN ESOA-MSP-000010798 Ovis aries musimon background NaN NaN NaN NaN
4853 {'$oid': '66549c5268b308e4e558b2e5'} Spanish mouflon MSP IlluminaOvineSNP50 Spain {'$oid': '639b4b922cee7f5ed2014011'} NaN NaN EMF5 NaN ESOA-MSP-000010799 Ovis aries musimon background NaN NaN NaN NaN

Ok, now track those breeds as three different outgroup list:

european_mouflon[["breed_code", "smarter_id"]].to_csv(get_data_dir() / "european_mouflon.tsv", index=False, header=False, sep="\t")
sardinian_mouflon[["breed_code", "smarter_id"]].to_csv(get_data_dir() / "sardinian_mouflon.tsv", index=False, header=False, sep="\t")
spanish_mouflon[["breed_code", "smarter_id"]].to_csv(get_data_dir() / "spanish_mouflon.tsv", index=False, header=False, sep="\t")

Now, create a sample txt file which I can use to extract the focal sample I need from smarter database using plink:

ovis_aries[["breed_code", "smarter_id"]].to_csv(get_data_dir() / "sheep_dataset.tsv", index=False, header=False, sep="\t")

Attempt to limit sample size#

Ok try to download a small dataset to test the pipeline: get information about breeds:

breed_api = BreedEndpoint()

data = breed_api.get_breeds(species="Sheep")
page = 1
breeds = pd.DataFrame(data["items"])

while data["next"] is not None:
    data = breed_api.get_breeds(page=page+1, species="Sheep")
    df_page = pd.DataFrame(data["items"])
    page = data["page"]
    breeds = pd.concat([breeds, df_page], ignore_index=True)

breeds.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 285 entries, 0 to 284
Data columns (total 6 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   _id            285 non-null    object
 1   aliases        285 non-null    object
 2   code           285 non-null    object
 3   n_individuals  285 non-null    int64 
 4   name           285 non-null    object
 5   species        285 non-null    object
dtypes: int64(1), object(5)
memory usage: 13.5+ KB

Try to select samples with a limited number of individuals, for example 50:

breeds[breeds["n_individuals"] == 50]
_id aliases code n_individuals name species
30 {'$oid': '66544d66332b746be8c662d3'} [{'country': 'France', 'dataset_id': {'$oid': ... IDF 50 Île de France Sheep
54 {'$oid': '66544d67682777165324011b'} [{'country': 'Spain', 'dataset_id': {'$oid': '... AME 50 AustralianMerino Sheep

Ok first focus on AustralianMerino breed:

data = sheep_api.get_samples(code="AME")
page = 1
sheep = pd.DataFrame(data["items"])
sheep.head()
_id breed breed_code chip_name country dataset_id locations metadata original_id smarter_id species type sex father_id mother_id
0 {'$oid': '66547865a8ef9ad0ab0bd77f'} AustralianMerino AME IlluminaOvineSNP50 Spain {'$oid': '604f75a61a08c53cebd09b5e'} {'coordinates': [[-6.33333333333333, 39.166666... {'location_source': 'Pedrosa et al. 2007, GSE ... MER31 ESOA-AME-000001530 Ovis aries background NaN NaN NaN
1 {'$oid': '66547866a8ef9ad0ab0bd780'} AustralianMerino AME IlluminaOvineSNP50 Spain {'$oid': '604f75a61a08c53cebd09b5e'} {'coordinates': [[-6.33333333333333, 39.166666... {'location_source': 'Pedrosa et al. 2007, GSE ... MER47 ESOA-AME-000001531 Ovis aries background 2.0 NaN NaN
2 {'$oid': '66547866a8ef9ad0ab0bd781'} AustralianMerino AME IlluminaOvineSNP50 Spain {'$oid': '604f75a61a08c53cebd09b5e'} {'coordinates': [[-6.33333333333333, 39.166666... {'location_source': 'Pedrosa et al. 2007, GSE ... MER1 ESOA-AME-000001532 Ovis aries background NaN NaN NaN
3 {'$oid': '66547866a8ef9ad0ab0bd782'} AustralianMerino AME IlluminaOvineSNP50 Spain {'$oid': '604f75a61a08c53cebd09b5e'} {'coordinates': [[-6.33333333333333, 39.166666... {'location_source': 'Pedrosa et al. 2007, GSE ... MER2 ESOA-AME-000001533 Ovis aries background NaN NaN NaN
4 {'$oid': '66547867a8ef9ad0ab0bd783'} AustralianMerino AME IlluminaOvineSNP50 Spain {'$oid': '604f75a61a08c53cebd09b5e'} {'coordinates': [[-6.33333333333333, 39.166666... {'location_source': 'Pedrosa et al. 2007, GSE ... MER7 ESOA-AME-000001534 Ovis aries background NaN NaN NaN

These samples seem to come from 50K chip:

sheep["chip_name"].value_counts()
chip_name
IlluminaOvineSNP50    50
Name: count, dtype: int64

Track those samples in a CSV file:

sheep[["breed_code", "smarter_id"]].to_csv(get_data_dir() / "AME_50K.tsv", index=False, header=False, sep="\t")

Now on Île de France breed:

data = sheep_api.get_samples(code="IDF")
page = 1
sheep = pd.DataFrame(data["items"])
sheep.head()
_id breed breed_code chip_name country dataset_id locations metadata original_id smarter_id species type
0 {'$oid': '6654662be91476d662a728e5'} Île de France IDF IlluminaOvineHDSNP France {'$oid': '604f75a61a08c53cebd09b58'} {'coordinates': [[2.42944299999999, 48.801148]... {'link': 'http://en.france-genetique-elevage.o... IDF1 FROA-IDF-000000600 Ovis aries background
1 {'$oid': '6654662fe91476d662a728e6'} Île de France IDF IlluminaOvineHDSNP France {'$oid': '604f75a61a08c53cebd09b58'} {'coordinates': [[2.42944299999999, 48.801148]... {'link': 'http://en.france-genetique-elevage.o... IDF2 FROA-IDF-000000601 Ovis aries background
2 {'$oid': '66546633e91476d662a728e7'} Île de France IDF IlluminaOvineHDSNP France {'$oid': '604f75a61a08c53cebd09b58'} {'coordinates': [[2.42944299999999, 48.801148]... {'link': 'http://en.france-genetique-elevage.o... IDF3 FROA-IDF-000000602 Ovis aries background
3 {'$oid': '66546636e91476d662a728e8'} Île de France IDF IlluminaOvineHDSNP France {'$oid': '604f75a61a08c53cebd09b58'} {'coordinates': [[2.42944299999999, 48.801148]... {'link': 'http://en.france-genetique-elevage.o... IDF4 FROA-IDF-000000603 Ovis aries background
4 {'$oid': '6654663ae91476d662a728e9'} Île de France IDF IlluminaOvineHDSNP France {'$oid': '604f75a61a08c53cebd09b58'} {'coordinates': [[2.42944299999999, 48.801148]... {'link': 'http://en.france-genetique-elevage.o... IDF5 FROA-IDF-000000604 Ovis aries background

These samples seem to come from bot 50k and HD chip:

sheep["chip_name"].value_counts()
chip_name
IlluminaOvineSNP50    27
IlluminaOvineHDSNP    23
Name: count, dtype: int64

Ok take only HD samples:

sheep = sheep[sheep["chip_name"] == "IlluminaOvineHDSNP"]
sheep.info()
<class 'pandas.core.frame.DataFrame'>
Index: 23 entries, 0 to 22
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   _id          23 non-null     object
 1   breed        23 non-null     object
 2   breed_code   23 non-null     object
 3   chip_name    23 non-null     object
 4   country      23 non-null     object
 5   dataset_id   23 non-null     object
 6   locations    23 non-null     object
 7   metadata     23 non-null     object
 8   original_id  23 non-null     object
 9   smarter_id   23 non-null     object
 10  species      23 non-null     object
 11  type         23 non-null     object
dtypes: object(12)
memory usage: 2.3+ KB

Track those samples in a CSV file:

sheep[["breed_code", "smarter_id"]].to_csv(get_data_dir() / "IDF_HD.tsv", index=False, header=False, sep="\t")