Infer ancestry using EST-SFS#
This was the first attempt to create tstree object using EST-SFS. In this example
we collect all background samples from SMARTER database and we use all Ovis
aries samples as focal samples, and european, sardinian and spanish mouflon
as three different outgroups to make inference with EST-SFS. We will write all
selection of samples as CSV files with FID and IID columns in order to extract
from the whole genotype files only the samples we need.
Try to collect sheep background samples from SMARTER database:
import pandas as pd
from tskitetude import get_data_dir
from tskitetude.smarterapi import SheepEndpoint, BreedEndpoint
Connect to SMARTER database and retrieve information on background samples:
sheep_api = SheepEndpoint()
data = sheep_api.get_samples(_type="background")
page = 1
sheep = pd.DataFrame(data["items"])
while data["next"] is not None:
data = sheep_api.get_samples(page=page+1, _type="background")
df_page = pd.DataFrame(data["items"])
page = data["page"]
sheep = pd.concat([sheep, df_page], ignore_index=True)
sheep.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5892 entries, 0 to 5891
Data columns (total 17 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 _id 5892 non-null object
1 breed 5892 non-null object
2 breed_code 5892 non-null object
3 chip_name 5892 non-null object
4 country 5892 non-null object
5 dataset_id 5892 non-null object
6 locations 4849 non-null object
7 metadata 4878 non-null object
8 original_id 5892 non-null object
9 phenotype 577 non-null object
10 smarter_id 5892 non-null object
11 species 5892 non-null object
12 type 5892 non-null object
13 father_id 58 non-null object
14 mother_id 58 non-null object
15 sex 501 non-null float64
16 alias 156 non-null object
dtypes: float64(1), object(16)
memory usage: 782.7+ KB
Are those all background samples?
sheep.value_counts("type")
type
background 5892
Name: count, dtype: int64
Ok. Let’s collect all available species:
sheep.value_counts("species")
species
Ovis aries 5739
Ovis aries musimon 134
Ovis orientalis 16
Ovis orientalis ophion 3
Name: count, dtype: int64
Ok, now collect all samples which are Ovis aries:
ovis_aries = sheep[sheep["species"] == "Ovis aries"]
ovis_aries.head()
| _id | breed | breed_code | chip_name | country | dataset_id | locations | metadata | original_id | phenotype | smarter_id | species | type | father_id | mother_id | sex | alias | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | {'$oid': '66544dd515d202ad5f28dc8e'} | Texel | TEX | IlluminaOvineSNP50 | Uruguay | {'$oid': '604f75a61a08c53cebd09b67'} | {'coordinates': [[-54.79980235632396, -32.8596... | {'gps_2': 'https://www.google.com/maps/place/E... | 20181210002 | {'purpose': 'Meat'} | UYOA-TEX-000000001 | Ovis aries | background | NaN | NaN | NaN | NaN |
| 1 | {'$oid': '66544dd615d202ad5f28dc8f'} | Texel | TEX | IlluminaOvineSNP50 | Uruguay | {'$oid': '604f75a61a08c53cebd09b67'} | {'coordinates': [[-54.79980235632396, -32.8596... | {'gps_2': 'https://www.google.com/maps/place/E... | 20181210003 | {'purpose': 'Meat'} | UYOA-TEX-000000002 | Ovis aries | background | NaN | NaN | NaN | NaN |
| 2 | {'$oid': '66544dd615d202ad5f28dc90'} | Texel | TEX | IlluminaOvineSNP50 | Uruguay | {'$oid': '604f75a61a08c53cebd09b67'} | {'coordinates': [[-54.79980235632396, -32.8596... | {'gps_2': 'https://www.google.com/maps/place/E... | 20181210005 | {'purpose': 'Meat'} | UYOA-TEX-000000003 | Ovis aries | background | NaN | NaN | NaN | NaN |
| 3 | {'$oid': '66544dd715d202ad5f28dc91'} | Texel | TEX | IlluminaOvineSNP50 | Uruguay | {'$oid': '604f75a61a08c53cebd09b67'} | {'coordinates': [[-54.79980235632396, -32.8596... | {'gps_2': 'https://www.google.com/maps/place/E... | 20181210006 | {'purpose': 'Meat'} | UYOA-TEX-000000004 | Ovis aries | background | NaN | NaN | NaN | NaN |
| 4 | {'$oid': '66544dd715d202ad5f28dc92'} | Texel | TEX | IlluminaOvineSNP50 | Uruguay | {'$oid': '604f75a61a08c53cebd09b67'} | {'coordinates': [[-54.79980235632396, -32.8596... | {'gps_2': 'https://www.google.com/maps/place/E... | 20181210008 | {'purpose': 'Meat'} | UYOA-TEX-000000005 | Ovis aries | background | NaN | NaN | NaN | NaN |
How many breeds I have?
ovis_aries.value_counts("breed")
breed
Frizarta 322
Corriedale 214
Texel 193
Churra 120
Soay 110
...
Spael-coloured 3
Dubska 2
Recka 2
Privorska 2
Sri Lankan 1
Name: count, Length: 202, dtype: int64
Ensure that there are no mouflon in sheep breed names:
ovis_aries["breed"].str.contains("Mouflon", case=False).any()
np.False_
Ok, now collect Ovis aries musimon samples:
ovis_aries_musimon = sheep[sheep["species"] == "Ovis aries musimon"]
ovis_aries_musimon.head()
| _id | breed | breed_code | chip_name | country | dataset_id | locations | metadata | original_id | phenotype | smarter_id | species | type | father_id | mother_id | sex | alias | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 788 | {'$oid': '665468ffe91476d662a729a2'} | European mouflon | EUR | IlluminaOvineHDSNP | France | {'$oid': '604f75a61a08c53cebd09b58'} | {'coordinates': [[9.0129, 42.0396]], 'type': '... | {'link': 'https://en.wikipedia.org/wiki/Mouflo... | EUR1 | NaN | FROA-EUR-000000789 | Ovis aries musimon | background | NaN | NaN | NaN | NaN |
| 789 | {'$oid': '66546903e91476d662a729a3'} | European mouflon | EUR | IlluminaOvineHDSNP | France | {'$oid': '604f75a61a08c53cebd09b58'} | {'coordinates': [[9.0129, 42.0396]], 'type': '... | {'link': 'https://en.wikipedia.org/wiki/Mouflo... | EUR2 | NaN | FROA-EUR-000000790 | Ovis aries musimon | background | NaN | NaN | NaN | NaN |
| 4849 | {'$oid': '66549c5268b308e4e558b2e1'} | Spanish mouflon | MSP | IlluminaOvineSNP50 | Spain | {'$oid': '639b4b922cee7f5ed2014011'} | NaN | NaN | EMF1 | NaN | ESOA-MSP-000010795 | Ovis aries musimon | background | NaN | NaN | NaN | NaN |
| 4850 | {'$oid': '66549c5268b308e4e558b2e2'} | Spanish mouflon | MSP | IlluminaOvineSNP50 | Spain | {'$oid': '639b4b922cee7f5ed2014011'} | NaN | NaN | EMF2 | NaN | ESOA-MSP-000010796 | Ovis aries musimon | background | NaN | NaN | NaN | NaN |
| 4851 | {'$oid': '66549c5268b308e4e558b2e3'} | Spanish mouflon | MSP | IlluminaOvineSNP50 | Spain | {'$oid': '639b4b922cee7f5ed2014011'} | NaN | NaN | EMF3 | NaN | ESOA-MSP-000010797 | Ovis aries musimon | background | NaN | NaN | NaN | NaN |
How many breeds I have?
ovis_aries_musimon.value_counts("breed")
breed
Sardinian mouflon 79
European mouflon 23
Spanish mouflon 21
Hungarian mouflon 8
Corsican mouflon 3
Name: count, dtype: int64
Ok, try to collect European mouflon:
european_mouflon = ovis_aries_musimon[ovis_aries_musimon["breed"] == "European mouflon"]
european_mouflon.head()
| _id | breed | breed_code | chip_name | country | dataset_id | locations | metadata | original_id | phenotype | smarter_id | species | type | father_id | mother_id | sex | alias | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 788 | {'$oid': '665468ffe91476d662a729a2'} | European mouflon | EUR | IlluminaOvineHDSNP | France | {'$oid': '604f75a61a08c53cebd09b58'} | {'coordinates': [[9.0129, 42.0396]], 'type': '... | {'link': 'https://en.wikipedia.org/wiki/Mouflo... | EUR1 | NaN | FROA-EUR-000000789 | Ovis aries musimon | background | NaN | NaN | NaN | NaN |
| 789 | {'$oid': '66546903e91476d662a729a3'} | European mouflon | EUR | IlluminaOvineHDSNP | France | {'$oid': '604f75a61a08c53cebd09b58'} | {'coordinates': [[9.0129, 42.0396]], 'type': '... | {'link': 'https://en.wikipedia.org/wiki/Mouflo... | EUR2 | NaN | FROA-EUR-000000790 | Ovis aries musimon | background | NaN | NaN | NaN | NaN |
| 5739 | {'$oid': '66549c6eb20eb5227329c2ca'} | European mouflon | EUR | IlluminaOvineSNP50 | France | {'$oid': '639b4b922cee7f5ed2014012'} | {'coordinates': [[9.15, 42.31]], 'type': 'Mult... | {'region': 'Corsica', 'type_': 'Mouflon'} | EMF1 | NaN | FROA-EUR-000011685 | Ovis aries musimon | background | NaN | NaN | NaN | NaN |
| 5740 | {'$oid': '66549c6eb20eb5227329c2cb'} | European mouflon | EUR | IlluminaOvineSNP50 | France | {'$oid': '639b4b922cee7f5ed2014012'} | {'coordinates': [[9.15, 42.31]], 'type': 'Mult... | {'region': 'Corsica', 'type_': 'Mouflon'} | EMF2 | NaN | FROA-EUR-000011686 | Ovis aries musimon | background | NaN | NaN | NaN | NaN |
| 5741 | {'$oid': '66549c6eb20eb5227329c2cc'} | European mouflon | EUR | IlluminaOvineSNP50 | France | {'$oid': '639b4b922cee7f5ed2014012'} | {'coordinates': [[9.15, 42.31]], 'type': 'Mult... | {'region': 'Corsica', 'type_': 'Mouflon'} | EMF3 | NaN | FROA-EUR-000011687 | Ovis aries musimon | background | NaN | NaN | NaN | NaN |
Ok, I’m also interested in Sardinian mouflon:
sardinian_mouflon = ovis_aries_musimon[ovis_aries_musimon["breed"] == "Sardinian mouflon"]
sardinian_mouflon.head()
| _id | breed | breed_code | chip_name | country | dataset_id | locations | metadata | original_id | phenotype | smarter_id | species | type | father_id | mother_id | sex | alias | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 4881 | {'$oid': '66549c5368b308e4e558b301'} | Sardinian mouflon | MSA | IlluminaOvineSNP50 | Italy | {'$oid': '639b4b922cee7f5ed2014011'} | NaN | NaN | 72_MufloneS | NaN | ITOA-MSA-000010827 | Ovis aries musimon | background | NaN | NaN | NaN | NaN |
| 4882 | {'$oid': '66549c5368b308e4e558b302'} | Sardinian mouflon | MSA | IlluminaOvineSNP50 | Italy | {'$oid': '639b4b922cee7f5ed2014011'} | NaN | NaN | 73_MufloneS | NaN | ITOA-MSA-000010828 | Ovis aries musimon | background | NaN | NaN | NaN | NaN |
| 4883 | {'$oid': '66549c5368b308e4e558b303'} | Sardinian mouflon | MSA | IlluminaOvineSNP50 | Italy | {'$oid': '639b4b922cee7f5ed2014011'} | NaN | NaN | 74_MufloneS | NaN | ITOA-MSA-000010829 | Ovis aries musimon | background | NaN | NaN | NaN | NaN |
| 4884 | {'$oid': '66549c5368b308e4e558b304'} | Sardinian mouflon | MSA | IlluminaOvineSNP50 | Italy | {'$oid': '639b4b922cee7f5ed2014011'} | NaN | NaN | 75_MufloneS | NaN | ITOA-MSA-000010830 | Ovis aries musimon | background | NaN | NaN | NaN | NaN |
| 4885 | {'$oid': '66549c5368b308e4e558b305'} | Sardinian mouflon | MSA | IlluminaOvineSNP50 | Italy | {'$oid': '639b4b922cee7f5ed2014011'} | NaN | NaN | 76_MufloneS | NaN | ITOA-MSA-000010831 | Ovis aries musimon | background | NaN | NaN | NaN | NaN |
Should I take Spanish mouflon as third outgroup?
spanish_mouflon = ovis_aries_musimon[ovis_aries_musimon["breed"] == "Spanish mouflon"]
spanish_mouflon.head()
| _id | breed | breed_code | chip_name | country | dataset_id | locations | metadata | original_id | phenotype | smarter_id | species | type | father_id | mother_id | sex | alias | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 4849 | {'$oid': '66549c5268b308e4e558b2e1'} | Spanish mouflon | MSP | IlluminaOvineSNP50 | Spain | {'$oid': '639b4b922cee7f5ed2014011'} | NaN | NaN | EMF1 | NaN | ESOA-MSP-000010795 | Ovis aries musimon | background | NaN | NaN | NaN | NaN |
| 4850 | {'$oid': '66549c5268b308e4e558b2e2'} | Spanish mouflon | MSP | IlluminaOvineSNP50 | Spain | {'$oid': '639b4b922cee7f5ed2014011'} | NaN | NaN | EMF2 | NaN | ESOA-MSP-000010796 | Ovis aries musimon | background | NaN | NaN | NaN | NaN |
| 4851 | {'$oid': '66549c5268b308e4e558b2e3'} | Spanish mouflon | MSP | IlluminaOvineSNP50 | Spain | {'$oid': '639b4b922cee7f5ed2014011'} | NaN | NaN | EMF3 | NaN | ESOA-MSP-000010797 | Ovis aries musimon | background | NaN | NaN | NaN | NaN |
| 4852 | {'$oid': '66549c5268b308e4e558b2e4'} | Spanish mouflon | MSP | IlluminaOvineSNP50 | Spain | {'$oid': '639b4b922cee7f5ed2014011'} | NaN | NaN | EMF4 | NaN | ESOA-MSP-000010798 | Ovis aries musimon | background | NaN | NaN | NaN | NaN |
| 4853 | {'$oid': '66549c5268b308e4e558b2e5'} | Spanish mouflon | MSP | IlluminaOvineSNP50 | Spain | {'$oid': '639b4b922cee7f5ed2014011'} | NaN | NaN | EMF5 | NaN | ESOA-MSP-000010799 | Ovis aries musimon | background | NaN | NaN | NaN | NaN |
Ok, now track those breeds as three different outgroup list:
european_mouflon[["breed_code", "smarter_id"]].to_csv(get_data_dir() / "european_mouflon.tsv", index=False, header=False, sep="\t")
sardinian_mouflon[["breed_code", "smarter_id"]].to_csv(get_data_dir() / "sardinian_mouflon.tsv", index=False, header=False, sep="\t")
spanish_mouflon[["breed_code", "smarter_id"]].to_csv(get_data_dir() / "spanish_mouflon.tsv", index=False, header=False, sep="\t")
Now, create a sample txt file which I can use to extract the focal sample I need from smarter database using plink:
ovis_aries[["breed_code", "smarter_id"]].to_csv(get_data_dir() / "sheep_dataset.tsv", index=False, header=False, sep="\t")
Attempt to limit sample size#
Ok try to download a small dataset to test the pipeline: get information about breeds:
breed_api = BreedEndpoint()
data = breed_api.get_breeds(species="Sheep")
page = 1
breeds = pd.DataFrame(data["items"])
while data["next"] is not None:
data = breed_api.get_breeds(page=page+1, species="Sheep")
df_page = pd.DataFrame(data["items"])
page = data["page"]
breeds = pd.concat([breeds, df_page], ignore_index=True)
breeds.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 285 entries, 0 to 284
Data columns (total 6 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 _id 285 non-null object
1 aliases 285 non-null object
2 code 285 non-null object
3 n_individuals 285 non-null int64
4 name 285 non-null object
5 species 285 non-null object
dtypes: int64(1), object(5)
memory usage: 13.5+ KB
Try to select samples with a limited number of individuals, for example 50:
breeds[breeds["n_individuals"] == 50]
| _id | aliases | code | n_individuals | name | species | |
|---|---|---|---|---|---|---|
| 30 | {'$oid': '66544d66332b746be8c662d3'} | [{'country': 'France', 'dataset_id': {'$oid': ... | IDF | 50 | Île de France | Sheep |
| 54 | {'$oid': '66544d67682777165324011b'} | [{'country': 'Spain', 'dataset_id': {'$oid': '... | AME | 50 | AustralianMerino | Sheep |
Ok first focus on AustralianMerino breed:
data = sheep_api.get_samples(code="AME")
page = 1
sheep = pd.DataFrame(data["items"])
sheep.head()
| _id | breed | breed_code | chip_name | country | dataset_id | locations | metadata | original_id | smarter_id | species | type | sex | father_id | mother_id | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | {'$oid': '66547865a8ef9ad0ab0bd77f'} | AustralianMerino | AME | IlluminaOvineSNP50 | Spain | {'$oid': '604f75a61a08c53cebd09b5e'} | {'coordinates': [[-6.33333333333333, 39.166666... | {'location_source': 'Pedrosa et al. 2007, GSE ... | MER31 | ESOA-AME-000001530 | Ovis aries | background | NaN | NaN | NaN |
| 1 | {'$oid': '66547866a8ef9ad0ab0bd780'} | AustralianMerino | AME | IlluminaOvineSNP50 | Spain | {'$oid': '604f75a61a08c53cebd09b5e'} | {'coordinates': [[-6.33333333333333, 39.166666... | {'location_source': 'Pedrosa et al. 2007, GSE ... | MER47 | ESOA-AME-000001531 | Ovis aries | background | 2.0 | NaN | NaN |
| 2 | {'$oid': '66547866a8ef9ad0ab0bd781'} | AustralianMerino | AME | IlluminaOvineSNP50 | Spain | {'$oid': '604f75a61a08c53cebd09b5e'} | {'coordinates': [[-6.33333333333333, 39.166666... | {'location_source': 'Pedrosa et al. 2007, GSE ... | MER1 | ESOA-AME-000001532 | Ovis aries | background | NaN | NaN | NaN |
| 3 | {'$oid': '66547866a8ef9ad0ab0bd782'} | AustralianMerino | AME | IlluminaOvineSNP50 | Spain | {'$oid': '604f75a61a08c53cebd09b5e'} | {'coordinates': [[-6.33333333333333, 39.166666... | {'location_source': 'Pedrosa et al. 2007, GSE ... | MER2 | ESOA-AME-000001533 | Ovis aries | background | NaN | NaN | NaN |
| 4 | {'$oid': '66547867a8ef9ad0ab0bd783'} | AustralianMerino | AME | IlluminaOvineSNP50 | Spain | {'$oid': '604f75a61a08c53cebd09b5e'} | {'coordinates': [[-6.33333333333333, 39.166666... | {'location_source': 'Pedrosa et al. 2007, GSE ... | MER7 | ESOA-AME-000001534 | Ovis aries | background | NaN | NaN | NaN |
These samples seem to come from 50K chip:
sheep["chip_name"].value_counts()
chip_name
IlluminaOvineSNP50 50
Name: count, dtype: int64
Track those samples in a CSV file:
sheep[["breed_code", "smarter_id"]].to_csv(get_data_dir() / "AME_50K.tsv", index=False, header=False, sep="\t")
Now on Île de France breed:
data = sheep_api.get_samples(code="IDF")
page = 1
sheep = pd.DataFrame(data["items"])
sheep.head()
| _id | breed | breed_code | chip_name | country | dataset_id | locations | metadata | original_id | smarter_id | species | type | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | {'$oid': '6654662be91476d662a728e5'} | Île de France | IDF | IlluminaOvineHDSNP | France | {'$oid': '604f75a61a08c53cebd09b58'} | {'coordinates': [[2.42944299999999, 48.801148]... | {'link': 'http://en.france-genetique-elevage.o... | IDF1 | FROA-IDF-000000600 | Ovis aries | background |
| 1 | {'$oid': '6654662fe91476d662a728e6'} | Île de France | IDF | IlluminaOvineHDSNP | France | {'$oid': '604f75a61a08c53cebd09b58'} | {'coordinates': [[2.42944299999999, 48.801148]... | {'link': 'http://en.france-genetique-elevage.o... | IDF2 | FROA-IDF-000000601 | Ovis aries | background |
| 2 | {'$oid': '66546633e91476d662a728e7'} | Île de France | IDF | IlluminaOvineHDSNP | France | {'$oid': '604f75a61a08c53cebd09b58'} | {'coordinates': [[2.42944299999999, 48.801148]... | {'link': 'http://en.france-genetique-elevage.o... | IDF3 | FROA-IDF-000000602 | Ovis aries | background |
| 3 | {'$oid': '66546636e91476d662a728e8'} | Île de France | IDF | IlluminaOvineHDSNP | France | {'$oid': '604f75a61a08c53cebd09b58'} | {'coordinates': [[2.42944299999999, 48.801148]... | {'link': 'http://en.france-genetique-elevage.o... | IDF4 | FROA-IDF-000000603 | Ovis aries | background |
| 4 | {'$oid': '6654663ae91476d662a728e9'} | Île de France | IDF | IlluminaOvineHDSNP | France | {'$oid': '604f75a61a08c53cebd09b58'} | {'coordinates': [[2.42944299999999, 48.801148]... | {'link': 'http://en.france-genetique-elevage.o... | IDF5 | FROA-IDF-000000604 | Ovis aries | background |
These samples seem to come from bot 50k and HD chip:
sheep["chip_name"].value_counts()
chip_name
IlluminaOvineSNP50 27
IlluminaOvineHDSNP 23
Name: count, dtype: int64
Ok take only HD samples:
sheep = sheep[sheep["chip_name"] == "IlluminaOvineHDSNP"]
sheep.info()
<class 'pandas.core.frame.DataFrame'>
Index: 23 entries, 0 to 22
Data columns (total 12 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 _id 23 non-null object
1 breed 23 non-null object
2 breed_code 23 non-null object
3 chip_name 23 non-null object
4 country 23 non-null object
5 dataset_id 23 non-null object
6 locations 23 non-null object
7 metadata 23 non-null object
8 original_id 23 non-null object
9 smarter_id 23 non-null object
10 species 23 non-null object
11 type 23 non-null object
dtypes: object(12)
memory usage: 2.3+ KB
Track those samples in a CSV file:
sheep[["breed_code", "smarter_id"]].to_csv(get_data_dir() / "IDF_HD.tsv", index=False, header=False, sep="\t")