Reading data from chromosome 26

Reading data from chromosome 26#

This notebook demonstrates how to read data from TreeSequence objects and perform some basic statistics on the data. We will use the chromosome 26 data from the SMARTER background dataset, calculated with the compara approach

Displaying data
One-way statistics
Multi-way statistics

import tskit
import tsinfer
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from tskitetude import get_project_dir

loading data from one chromosome:

ts = tskit.load(get_project_dir() / "results-compara/background_samples/tsinfer/SMARTER-OA-OAR3-forward-0.4.10.focal.26.trees")
ts

Tree Sequence
Trees	496
Sequence Length	44 077 779
Time Units	generations
Sample Nodes	11 478
Total Size	89.9 MiB
Metadata	dict

Table	Rows	Size	Has Metadata
Edges	1 838 362	56.1 MiB
Individuals	5 739	353.1 KiB	✅
Migrations	0	8 Bytes
Mutations	524 747	18.5 MiB
Nodes	19 188	932.4 KiB	✅
Populations	202	4.7 KiB	✅
Provenances	4	2.4 KiB
Sites	657	33.5 KiB	✅

Provenance Timestamp	Software Name	Version	Command	Full record
06 September, 2024 at 02:30:20 AM	tsdate	0.1.7	inside_outside	Details dict schema_version: 1.0.0 software: dict name: tsdate version: 0.1.7 parameters: dict mutation_rate: 1e-08 recombination_rate: None time_units: None progress: None population_size: 10000.0 eps: 1e-06 outside_standardize: True ignore_oldest_root: False probability_space: logarithmic num_threads: None cache_inside: False command: inside_outside environment: dict os: dict system: Linux node: node1 release: 5.15.0-58-generic version: #64-Ubuntu SMP Thu Jan 5 11:43:13 UTC 2023 machine: x86_64 python: dict implementation: CPython version: 3.11.9 libraries: dict tskit: dict version: 0.5.8
06 September, 2024 at 02:25:05 AM	tsdate	0.1.7	preprocess_ts	Details dict schema_version: 1.0.0 software: dict name: tsdate version: 0.1.7 parameters: dict minimum_gap: 1000000 remove_telomeres: True filter_populations: False filter_individuals: False filter_sites: False delete_intervals: list list 0 209048.0 list 44004282.0 44077779.0 command: preprocess_ts environment: dict os: dict system: Linux node: node1 release: 5.15.0-58-generic version: #64-Ubuntu SMP Thu Jan 5 11:43:13 UTC 2023 machine: x86_64 python: dict implementation: CPython version: 3.11.9 libraries: dict tskit: dict version: 0.5.8
06 September, 2024 at 02:25:01 AM	tskit	0.5.8	simplify	Details dict schema_version: 1.0.0 software: dict name: tskit version: 0.5.8 parameters: dict command: simplify TODO: add simplify parameters environment: dict os: dict system: Linux node: node1 release: 5.15.0-58-generic version: #64-Ubuntu SMP Thu Jan 5 11:43:13 UTC 2023 machine: x86_64 python: dict implementation: CPython version: 3.11.9 libraries: dict kastore: dict version: 2.1.1
06 September, 2024 at 02:25:00 AM	tsinfer	0.3.3	infer	Details dict schema_version: 1.0.0 software: dict name: tsinfer version: 0.3.3 parameters: dict mismatch_ratio: None command: infer environment: dict libraries: dict zarr: dict version: 2.17.2 numcodecs: dict version: 0.12.1 lmdb: dict version: 1.4.1 tskit: dict version: 0.5.8 os: dict system: Linux node: node1 release: 5.15.0-58-generic version: #64-Ubuntu SMP Thu Jan 5 11:43:13 UTC 2023 machine: x86_64 python: dict implementation: CPython version: list 3 11 9

To cite this software, please consult the citation manual: https://tskit.dev/citation/

samples = tsinfer.load(get_project_dir() / "results-compara/background_samples/tsinfer/SMARTER-OA-OAR3-forward-0.4.10.focal.26.samples")
print(samples.info)

Name        : /
Type        : zarr.hierarchy.Group
Read-only   : True
Store type  : zarr.storage.LMDBStore
No. members : 5
No. arrays  : 0
No. groups  : 5
Groups      : individuals, populations, provenances, samples, sites
--------------------------------------------------------------------------------
Name               : /individuals/flags
Type               : zarr.core.Array
Data type          : uint32
Shape              : (5739,)
Chunk shape        : (1024,)
Order              : C
Read-only          : True
Compressor         : Zstd(level=1)
Store type         : zarr.storage.LMDBStore
No. bytes          : 22956 (22.4K)
No. bytes stored   : 389
Storage ratio      : 59.0
Chunks initialized : 6/6
--------------------------------------------------------------------------------
Name               : /individuals/location
Type               : zarr.core.Array
Data type          : object
Shape              : (5739,)
Chunk shape        : (1024,)
Order              : C
Read-only          : True
Filter [0]         : VLenArray(dtype='<f8')
Compressor         : Zstd(level=1)
Store type         : zarr.storage.LMDBStore
No. bytes          : 45912 (44.8K)
No. bytes stored   : 656
Storage ratio      : 70.0
Chunks initialized : 6/6
--------------------------------------------------------------------------------
Name               : /individuals/metadata
Type               : zarr.core.Array
Data type          : object
Shape              : (5739,)
Chunk shape        : (1024,)
Order              : C
Read-only          : True
Filter [0]         : JSON(encoding='utf-8', allow_nan=True, check_circular=True,
                   : ensure_ascii=True,      indent=None, separators=(',', ':'),
                   : skipkeys=False, sort_keys=True,      strict=True)
Compressor         : Zstd(level=1)
Store type         : zarr.storage.LMDBStore
No. bytes          : 45912 (44.8K)
No. bytes stored   : 8621 (8.4K)
Storage ratio      : 5.3
Chunks initialized : 6/6
--------------------------------------------------------------------------------
Name               : /individuals/population
Type               : zarr.core.Array
Data type          : int32
Shape              : (5739,)
Chunk shape        : (1024,)
Order              : C
Read-only          : True
Compressor         : Zstd(level=1)
Store type         : zarr.storage.LMDBStore
No. bytes          : 22956 (22.4K)
No. bytes stored   : 1270 (1.2K)
Storage ratio      : 18.1
Chunks initialized : 6/6
--------------------------------------------------------------------------------
Name               : /individuals/time
Type               : zarr.core.Array
Data type          : float64
Shape              : (5739,)
Chunk shape        : (1024,)
Order              : C
Read-only          : True
Compressor         : Zstd(level=1)
Store type         : zarr.storage.LMDBStore
No. bytes          : 45912 (44.8K)
No. bytes stored   : 391
Storage ratio      : 117.4
Chunks initialized : 6/6
--------------------------------------------------------------------------------
Name               : /populations/metadata
Type               : zarr.core.Array
Data type          : object
Shape              : (202,)
Chunk shape        : (1024,)
Order              : C
Read-only          : True
Filter [0]         : JSON(encoding='utf-8', allow_nan=True, check_circular=True,
                   : ensure_ascii=True,      indent=None, separators=(',', ':'),
                   : skipkeys=False, sort_keys=True,      strict=True)
Compressor         : Zstd(level=1)
Store type         : zarr.storage.LMDBStore
No. bytes          : 1616 (1.6K)
No. bytes stored   : 1191 (1.2K)
Storage ratio      : 1.4
Chunks initialized : 1/1
--------------------------------------------------------------------------------
Name               : /provenances/record
Type               : zarr.core.Array
Data type          : object
Shape              : (0,)
Chunk shape        : (1024,)
Order              : C
Read-only          : True
Filter [0]         : JSON(encoding='utf-8', allow_nan=True, check_circular=True,
                   : ensure_ascii=True,      indent=None, separators=(',', ':'),
                   : skipkeys=False, sort_keys=True,      strict=True)
Compressor         : Zstd(level=1)
Store type         : zarr.storage.LMDBStore
No. bytes          : 0
No. bytes stored   : 624
Storage ratio      : 0.0
Chunks initialized : 0/0
--------------------------------------------------------------------------------
Name               : /provenances/timestamp
Type               : zarr.core.Array
Data type          : object
Shape              : (0,)
Chunk shape        : (1024,)
Order              : C
Read-only          : True
Filter [0]         : JSON(encoding='utf-8', allow_nan=True, check_circular=True,
                   : ensure_ascii=True,      indent=None, separators=(',', ':'),
                   : skipkeys=False, sort_keys=True,      strict=True)
Compressor         : Zstd(level=1)
Store type         : zarr.storage.LMDBStore
No. bytes          : 0
No. bytes stored   : 624
Storage ratio      : 0.0
Chunks initialized : 0/0
--------------------------------------------------------------------------------
Name               : /samples/individual
Type               : zarr.core.Array
Data type          : int32
Shape              : (11478,)
Chunk shape        : (1024,)
Order              : C
Read-only          : True
Compressor         : Zstd(level=1)
Store type         : zarr.storage.LMDBStore
No. bytes          : 45912 (44.8K)
No. bytes stored   : 6630 (6.5K)
Storage ratio      : 6.9
Chunks initialized : 12/12
--------------------------------------------------------------------------------
Name               : /sites/alleles
Type               : zarr.core.Array
Data type          : object
Shape              : (657,)
Chunk shape        : (1024,)
Order              : C
Read-only          : True
Filter [0]         : JSON(encoding='utf-8', allow_nan=True, check_circular=True,
                   : ensure_ascii=True,      indent=None, separators=(',', ':'),
                   : skipkeys=False, sort_keys=True,      strict=True)
Compressor         : Zstd(level=1)
Store type         : zarr.storage.LMDBStore
No. bytes          : 5256 (5.1K)
No. bytes stored   : 1455 (1.4K)
Storage ratio      : 3.6
Chunks initialized : 1/1
--------------------------------------------------------------------------------
Name               : /sites/ancestral_allele
Type               : zarr.core.Array
Data type          : int8
Shape              : (657,)
Chunk shape        : (1024,)
Order              : C
Read-only          : True
Compressor         : Zstd(level=1)
Store type         : zarr.storage.LMDBStore
No. bytes          : 657
No. bytes stored   : 489
Storage ratio      : 1.3
Chunks initialized : 1/1
--------------------------------------------------------------------------------
Name               : /sites/genotypes
Type               : zarr.core.Array
Data type          : int8
Shape              : (657, 11478)
Chunk shape        : (1024, 1024)
Order              : C
Read-only          : True
Compressor         : Zstd(level=1)
Store type         : zarr.storage.LMDBStore
No. bytes          : 7541046 (7.2M)
No. bytes stored   : 1305890 (1.2M)
Storage ratio      : 5.8
Chunks initialized : 12/12
--------------------------------------------------------------------------------
Name               : /sites/metadata
Type               : zarr.core.Array
Data type          : object
Shape              : (657,)
Chunk shape        : (1024,)
Order              : C
Read-only          : True
Filter [0]         : JSON(encoding='utf-8', allow_nan=True, check_circular=True,
                   : ensure_ascii=True,      indent=None, separators=(',', ':'),
                   : skipkeys=False, sort_keys=True,      strict=True)
Compressor         : Zstd(level=1)
Store type         : zarr.storage.LMDBStore
No. bytes          : 5256 (5.1K)
No. bytes stored   : 696
Storage ratio      : 7.6
Chunks initialized : 1/1
--------------------------------------------------------------------------------
Name               : /sites/position
Type               : zarr.core.Array
Data type          : float64
Shape              : (657,)
Chunk shape        : (1024,)
Order              : C
Read-only          : True
Compressor         : Zstd(level=1)
Store type         : zarr.storage.LMDBStore
No. bytes          : 5256 (5.1K)
No. bytes stored   : 2420 (2.4K)
Storage ratio      : 2.2
Chunks initialized : 1/1
--------------------------------------------------------------------------------
Name               : /sites/time
Type               : zarr.core.Array
Data type          : float64
Shape              : (657,)
Chunk shape        : (1024,)
Order              : C
Read-only          : True
Compressor         : Zstd(level=1)
Store type         : zarr.storage.LMDBStore
No. bytes          : 5256 (5.1K)
No. bytes stored   : 306
Storage ratio      : 17.2
Chunks initialized : 1/1

/home/cozzip/.cache/pypoetry/virtualenvs/tskitetude-hh-GIRXc-py3.12/lib/python3.12/site-packages/tsinfer/formats.py:104: FutureWarning: The LMDBStore is deprecated and will be removed in a Zarr-Python version 3, see https://github.com/zarr-developers/zarr-python/issues/1274 for more information.
  store = zarr.LMDBStore(

Displaying data#

The first thing I see that was different from tutorial is that the first tree looks a lot different from the second:

first = ts.first()
second = ts.at_index(1)

first

Tree
Index	0
Interval	0-209 049 (209 049)
Roots	11 478
Nodes	11 478
Sites	0
Mutations	0
Total Branch Length	0

second

Tree
Index	1
Interval	209 049-268 822 (59 773)
Roots	1
Nodes	11 530
Sites	1
Mutations	1
Total Branch Length	1 513 559.3

Try to modify visualization tutorial to display the second tree:

x_limits = [int(second.interval[0]), int(second.interval[1])]

# Create evenly-spaced y tick positions to avoid overlap
# y_tick_pos = [0, 1000, 2000, 3000, 4000]
y_tick_pos = np.linspace(0, second.span, num=int(second.span / 10_000))

print("The tree sequence between positions {} and {} ...".format(*x_limits))

# this is a single tree, not a tree sequence (see: https://tskit.dev/tskit/docs/stable/python-api.html#tskit.Tree.draw_svg)
display(
    second.draw_svg(
        y_axis=True,
        y_ticks=y_tick_pos,
        size=(1024, 768),
        time_scale="log_time",
    )
)

WARNING:root:Ticks {np.float64(14943.25): '14943.25', np.float64(29886.5): '29886.50', np.float64(44829.75): '44829.75', np.float64(59773.0): '59773.00'} lie outside the plotted axis

The tree sequence between positions 209049 and 268822 ...

../_images/0be84734a4722a97df13bb4ea0840e2aa8913c61bc31dd20cab8e00a10b0f1dd.svg

print("A larger tree, on a log timescale")
wide_fmt = (250000, 2000)

# Create a stylesheet that shrinks labels and rotates leaf labels, to avoid overlap
node_label_style = (
    ".node > .lab {font-size: 80%}"
    ".leaf > .lab {text-anchor: start; transform: rotate(90deg) translate(6px)}"
)
second.draw_svg(
    size=wide_fmt,
    time_scale="log_time",
    y_gridlines=True,
    y_axis=True,
    y_ticks=[1, 10, 100, 1000],
    style=node_label_style,
)

WARNING:root:Ticks {1000: '1000'} lie outside the plotted axis

A larger tree, on a log timescale

../_images/06907b919482c61a7e96035a689e01a9ce10f5c633dfab7635a06fa3cd26f7a4.svg

site-frequency spectrum#

Calculate site-frequency spectrum:

plt.bar(np.array(range(11)), ts.simplify(range(10)).allele_frequency_spectrum(polarised=True, span_normalise=True))
plt.show()

../_images/10b7e3972db3d629276d31921708d1749907bb6129f58959face0afc59a045d2.png

One-way statistics#

We refer to statistics that are defined with respect to a single set of samples as “one-way”. An example of such a statistic is diversity, which is computed using the TreeSequence.diversity() method:

pi = ts.diversity()
print(f"Average diversity per unit sequence length = {pi:.3G}")

Average diversity per unit sequence length = 5.96E-06

Computes mean genetic diversity (also known as pi) in each of the sets of nodes from sample_sets. The statistic is also known as sample heterozygosity; a common citation for the definition is Nei and Li (1979) (equation 22), so it is sometimes called called “Nei’s pi” (but also sometimes “Tajima’s pi”). This tells the average diversity across the whole sequence and returns a single number. We’ll usually want to compute statistics in windows along the genome and we use the windows argument to do this:

print("Sequence length = ", ts.sequence_length)
# windows = np.linspace(0, ts.sequence_length, num=int(ts.sequence_length / 1_000_000) + 1)
# it's seems that windows needs to contain the initial and final positions
windows = np.append(np.arange(0, ts.sequence_length, 5_000_000), ts.sequence_length)
# transform into integer
windows = windows.astype(int)
pi = ts.diversity(windows=windows)
df = pd.DataFrame({"windows": windows[1:], "pi": pi})
df["pi"] = df["pi"].map(lambda x: f"{x:.3G}")
df

Sequence length =  44077779.0

	windows	pi
0	5000000	6.08E-06
1	10000000	5.5E-06
2	15000000	5.55E-06
3	20000000	5.94E-06
4	25000000	6E-06
5	30000000	6.54E-06
6	35000000	5.27E-06
7	40000000	6.08E-06
8	44077779	6.86E-06

Suppose we wanted to compute diversity within a specific subset of samples. We can do this using the sample_sets argument:

A = ts.samples()[:100]
d = ts.diversity(sample_sets=A)
print(d)

5.692838488247723e-06

Here, we’ve computed the average diversity within the first hundred samples across the whole genome. As we’ve not specified any windows, this is again a single value. We can also compute diversity in multiple sample sets at the same time by providing a list of sample sets as an argument:

A = ts.samples()[:100]
B = ts.samples()[100:200]
C = ts.samples()[200:300]
d = ts.diversity(sample_sets=[A, B, C])
print(d)

[5.69283849e-06 5.64950370e-06 5.67518833e-06]

Ok, this was done by following the tutorial an getting samples by indexes. But can I select my data by breeds? this information seems not to be stored in tstree object itself, but in the sample data I used to generate my stuff. Let’s discover the samples by breed. Remember that in my data I have 11477 samples, which stand for a pair of chromosomes for my 5739 individuals:

print(f"I have {ts.samples().size} samples")
print(f"which stand for {sum(1 for _ in samples.individuals())} individuals")

I have 11478 samples
which stand for 5739 individuals

def get_sample_indexes(samples: tsinfer.SampleData, breed: str):
    # return np.where(samples.individuals_metadata["breed"] == breed)[0]
    # get breed index from samples.population_metadata
    breed_idx = next((index for index, d in enumerate(samples.populations_metadata) if d['breed'] == breed), None)

    # get individuals indexes by breed index
    individuals = [i.id for i in filter(lambda i: i.population == breed_idx, samples.individuals())]

    # get samples by individual index
    samples = [s.id for s in filter(lambda s: s.individual in individuals, samples.samples())]

    return samples

Get indexes for MER and TEX and calculate diversity:

TEX = get_sample_indexes(samples, "TEX")
MER = get_sample_indexes(samples, "MER")

pi = ts.diversity(sample_sets=[TEX, MER])
print(pi)

[5.83822688e-06 5.82979559e-06]

Same stuff as before but using windows:

windows = np.append(np.arange(0, ts.sequence_length, 5_000_000), ts.sequence_length)
windows = windows.astype(int)
pi = ts.diversity(sample_sets=[TEX, MER], windows=windows)
df = pd.DataFrame({"windows": windows[1:], "MER_pi": pi[:, 0], "TEX_pi": pi[:, 1]})
df["MER_pi"] = df["MER_pi"].map(lambda x: f"{x:.3G}")
df["TEX_pi"] = df["TEX_pi"].map(lambda x: f"{x:.3G}")
df

	windows	MER_pi	TEX_pi
0	5000000	6.09E-06	6.12E-06
1	10000000	5.38E-06	5.37E-06
2	15000000	5.49E-06	5.35E-06
3	20000000	5.55E-06	5.74E-06
4	25000000	5.94E-06	5.88E-06
5	30000000	6.28E-06	6.39E-06
6	35000000	5.26E-06	5.28E-06
7	40000000	5.99E-06	5.91E-06
8	44077779	6.72E-06	6.54E-06

Multi-way statistics#

Many population genetic statistics compare multiple sets of samples to each other. For example, the TreeSequence.divergence() method computes the divergence between two subsets of samples:

d = ts.divergence([TEX, MER])
print(f"Divergence between TEX and MER: {d:.3G}")

Divergence between TEX and MER: 5.95E-06

The divergence between two sets of samples is a single number, and we we again return a single floating point value as the result. We can also compute this in windows along the genome, as before:

d = ts.divergence([TEX, MER], windows=windows)
print(d)

[6.21551166e-06 5.47266300e-06 5.56901446e-06 5.76647237e-06
 6.07222582e-06 6.44783571e-06 5.35874352e-06 6.06778929e-06
 6.77429612e-06]

A powerful feature of tskit’s stats API is that we can compute the divergences between multiple sets of samples simultaneously using the indexes argument:

CRL = get_sample_indexes(samples, "CRL")

d = ts.divergence([TEX, MER, CRL], indexes=[(0, 1), (0, 2)])
print(d)

[5.95482276e-06 5.96852516e-06]

The indexes argument is used to specify which pairs of sets we are interested in. In this example we’ve computed two different divergence values and the output is therefore a numpy array of length 2.

As before, we can combine computing multiple statistics in multiple windows to return a 2D numpy array:

d = ts.divergence([TEX, MER, CRL], indexes=[(0, 1), (0, 2)], windows=windows)
df = pd.DataFrame({"windows": windows[1:], "TEXvsMER": d[:, 0], "MERvsCRL": d[:, 1]})
for column in df.columns[1:]:
    df[column] = df[column].map(lambda x: f"{x:.3G}")
df

	windows	TEXvsMER	MERvsCRL
0	5000000	6.22E-06	6.21E-06
1	10000000	5.47E-06	5.54E-06
2	15000000	5.57E-06	5.63E-06
3	20000000	5.77E-06	5.64E-06
4	25000000	6.07E-06	6.01E-06
5	30000000	6.45E-06	6.47E-06
6	35000000	5.36E-06	5.44E-06
7	40000000	6.07E-06	6.09E-06
8	44077779	6.77E-06	6.84E-06

Each row again corresponds to a window, which contains the average divergence values between the chosen sets.