SMARTER 50K

SMARTER 50K#

This is an attempt to test tstree with different 50K data. A single example of 50 samples is described in notebooks/03-smarter_database.ipynb. Here we want to select different breeds from 50K and test how the tstree will results

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

from tskitetude import get_data_dir
from tskitetude.smarterapi import SheepEndpoint

Collect all sheep samples from 50K:

sheep_api = SheepEndpoint()

data = sheep_api.get_samples(chip_name="IlluminaOvineSNP50")
page = page = data["page"]
sheep = pd.DataFrame(data["items"])

while data["next"] is not None:
    data = sheep_api.get_samples(page=page+1, _type="background")
    df_page = pd.DataFrame(data["items"])
    page = data["page"]
    sheep = pd.concat([sheep, df_page], ignore_index=True)

sheep.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5892 entries, 0 to 5891
Data columns (total 17 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 _id          5892 non-null   object 
 breed        5892 non-null   object 
 breed_code   5892 non-null   object 
 chip_name    5892 non-null   object 
 country      5892 non-null   object 
 dataset_id   5892 non-null   object 
 locations    4849 non-null   object 
 metadata     4878 non-null   object 
 original_id  5892 non-null   object 
 phenotype    577 non-null    object 
smarter_id   5892 non-null   object 
species      5892 non-null   object 
type         5892 non-null   object 
father_id    58 non-null     object 
mother_id    58 non-null     object 
sex          501 non-null    float64
alias        156 non-null    object 
dtypes: float64(1), object(16)
memory usage: 782.7+ KB

Count how many samples I have by breed:

sheep_count = sheep.groupby('breed_code').size().reset_index(name='count')
sheep_count

	breed_code	count
0	AAW	24
1	ADP	21
2	AFS	37
3	AIM	88
4	ALP	24
...	...	...
204	WIL	23
205	WMH	24
206	XIS	24
207	ZEL	24
208	ZUJ	5

209 rows × 2 columns

Plot count distribution:

fig, ax = plt.subplots(figsize=(12, 6))
sns.histplot(data=sheep_count, x='count', kde=True, binwidth=10, ax=ax)
plt.show()

../_images/6a6e7ab23827eace76584676f78ccc6d65b3d09e2c31450fe959508016914595.png

The majority of the breed have <= 50 samples. Try to focus to the higher column, for example where counts > 20 and <=30:

selected = sheep_count[(sheep_count['count'] <= 30) & (sheep_count['count'] > 20)]
print(f"There are {selected['count'].shape[0]} breeds having samples between 20 and 30")

There are 84 breeds having samples between 20 and 30

Now extract randomly a certain number of breeds and make a CSV file that I can use to extract genotypes using nf-treeseq pipeline:

np.random.seed(42)
repeats = 5
breed_size = [1, 2, 5, 10, 15, 20]

for i in breed_size:
    for j in range(repeats):
        sample = selected['breed_code'].sample(n=i, replace=False)
        filename = f"{i}_breeds-{j}-50K.csv"
        sample_df = sheep[sheep['breed_code'].isin(sample)]
        sample_df[["breed_code", "smarter_id"]].to_csv(get_data_dir() / filename, index=False, header=False, sep="\t")

	breed_code	count
0	AAW	24
1	ADP	21
2	AFS	37
3	AIM	88
4	ALP	24
...	...	...
204	WIL	23
205	WMH	24
206	XIS	24
207	ZEL	24
208	ZUJ	5

	breed_code	count
0	AAW	24
1	ADP	21
2	AFS	37
3	AIM	88
4	ALP	24
...	...	...
204	WIL	23
205	WMH	24
206	XIS	24
207	ZEL	24
208	ZUJ	5

	breed_code	count
0	AAW	24
1	ADP	21
2	AFS	37
3	AIM	88
4	ALP	24
...	...	...
204	WIL	23
205	WMH	24
206	XIS	24
207	ZEL	24
208	ZUJ	5