SMARTER 50K

SMARTER 50K#

This is an attempt to test tstree with different 50K data. A single example of 50 samples is described in notebooks/03-smarter_database.ipynb. Here we want to select different breeds from 50K and test how the tstree will results

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

from tskitetude import get_data_dir
from tskitetude.smarterapi import SheepEndpoint

Collect all sheep samples from 50K:

sheep_api = SheepEndpoint()

data = sheep_api.get_samples(chip_name="IlluminaOvineSNP50")
page = page = data["page"]
sheep = pd.DataFrame(data["items"])

while data["next"] is not None:
    data = sheep_api.get_samples(page=page+1, _type="background")
    df_page = pd.DataFrame(data["items"])
    page = data["page"]
    sheep = pd.concat([sheep, df_page], ignore_index=True)

sheep.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5892 entries, 0 to 5891
Data columns (total 17 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   _id          5892 non-null   object 
 1   breed        5892 non-null   object 
 2   breed_code   5892 non-null   object 
 3   chip_name    5892 non-null   object 
 4   country      5892 non-null   object 
 5   dataset_id   5892 non-null   object 
 6   locations    4849 non-null   object 
 7   metadata     4878 non-null   object 
 8   original_id  5892 non-null   object 
 9   phenotype    577 non-null    object 
 10  smarter_id   5892 non-null   object 
 11  species      5892 non-null   object 
 12  type         5892 non-null   object 
 13  father_id    58 non-null     object 
 14  mother_id    58 non-null     object 
 15  sex          501 non-null    float64
 16  alias        156 non-null    object 
dtypes: float64(1), object(16)
memory usage: 782.7+ KB

Count how many samples I have by breed:

sheep_count = sheep.groupby('breed_code').size().reset_index(name='count')
sheep_count
breed_code count
0 AAW 24
1 ADP 21
2 AFS 37
3 AIM 88
4 ALP 24
... ... ...
204 WIL 23
205 WMH 24
206 XIS 24
207 ZEL 24
208 ZUJ 5

209 rows × 2 columns

Plot count distribution:

fig, ax = plt.subplots(figsize=(12, 6))
sns.histplot(data=sheep_count, x='count', kde=True, binwidth=10, ax=ax)
plt.show()
../_images/6a6e7ab23827eace76584676f78ccc6d65b3d09e2c31450fe959508016914595.png

The majority of the breed have <= 50 samples. Try to focus to the higher column, for example where counts > 20 and <=30:

selected = sheep_count[(sheep_count['count'] <= 30) & (sheep_count['count'] > 20)]
print(f"There are {selected['count'].shape[0]} breeds having samples between 20 and 30")
There are 84 breeds having samples between 20 and 30

Now extract randomly a certain number of breeds and make a CSV file that I can use to extract genotypes using nf-treeseq pipeline:

np.random.seed(42)
repeats = 5
breed_size = [1, 2, 5, 10, 15, 20]

for i in breed_size:
    for j in range(repeats):
        sample = selected['breed_code'].sample(n=i, replace=False)
        filename = f"{i}_breeds-{j}-50K.csv"
        sample_df = sheep[sheep['breed_code'].isin(sample)]
        sample_df[["breed_code", "smarter_id"]].to_csv(get_data_dir() / filename, index=False, header=False, sep="\t")