About tstree space

About tstree space#

A tstree object created from the smarter dataset occupies a lot of space. Even by providing REF allele as ancestor alleles and dropping mutation in tstree object there is still a lot of space occupied by the object. So try to deal with tables and inspect how much data is stored here

from pprint import pprint
import numpy as np

import tskit

from tskitetude import get_project_dir

ts = tskit.load(str(get_project_dir() / "results-reference/background_samples/tsinfer/SMARTER-OA-OAR3-forward-0.4.10.focal.26.trees"))
ts

Tree Sequence
Trees	659
Sequence Length	44 077 779
Time Units	generations
Sample Nodes	11 478
Total Size	87.9 MiB
Metadata	dict

Table	Rows	Size	Has Metadata
Edges	2 266 939	69.2 MiB
Individuals	5 739	353.1 KiB	✅
Migrations	0	8 Bytes
Mutations	657	23.8 KiB
Nodes	21 091	1.1 MiB	✅
Populations	202	4.7 KiB	✅
Provenances	4	2.4 KiB
Sites	657	32.7 KiB	✅

Provenance Timestamp	Software Name	Version	Command	Full record
06 September, 2024 at 01:19:45 AM	tsdate	0.1.7	inside_outside	Details dict schema_version: 1.0.0 software: dict name: tsdate version: 0.1.7 parameters: dict mutation_rate: 1e-08 recombination_rate: None time_units: None progress: None population_size: 10000.0 eps: 1e-06 outside_standardize: True ignore_oldest_root: False probability_space: logarithmic num_threads: None cache_inside: False command: inside_outside environment: dict os: dict system: Linux node: node1 release: 5.15.0-58-generic version: #64-Ubuntu SMP Thu Jan 5 11:43:13 UTC 2023 machine: x86_64 python: dict implementation: CPython version: 3.11.9 libraries: dict tskit: dict version: 0.5.8
06 September, 2024 at 01:13:11 AM	tsdate	0.1.7	preprocess_ts	Details dict schema_version: 1.0.0 software: dict name: tsdate version: 0.1.7 parameters: dict minimum_gap: 1000000 remove_telomeres: True filter_populations: False filter_individuals: False filter_sites: False delete_intervals: list list 0 209048.0 list 44004282.0 44077779.0 command: preprocess_ts environment: dict os: dict system: Linux node: node1 release: 5.15.0-58-generic version: #64-Ubuntu SMP Thu Jan 5 11:43:13 UTC 2023 machine: x86_64 python: dict implementation: CPython version: 3.11.9 libraries: dict tskit: dict version: 0.5.8
06 September, 2024 at 01:13:05 AM	tskit	0.5.8	simplify	Details dict schema_version: 1.0.0 software: dict name: tskit version: 0.5.8 parameters: dict command: simplify TODO: add simplify parameters environment: dict os: dict system: Linux node: node1 release: 5.15.0-58-generic version: #64-Ubuntu SMP Thu Jan 5 11:43:13 UTC 2023 machine: x86_64 python: dict implementation: CPython version: 3.11.9 libraries: dict kastore: dict version: 2.1.1
06 September, 2024 at 01:13:01 AM	tsinfer	0.3.3	infer	Details dict schema_version: 1.0.0 software: dict name: tsinfer version: 0.3.3 parameters: dict mismatch_ratio: None command: infer environment: dict libraries: dict zarr: dict version: 2.17.2 numcodecs: dict version: 0.12.1 lmdb: dict version: 1.4.1 tskit: dict version: 0.5.8 os: dict system: Linux node: node1 release: 5.15.0-58-generic version: #64-Ubuntu SMP Thu Jan 5 11:43:13 UTC 2023 machine: x86_64 python: dict implementation: CPython version: list 3 11 9

To cite this software, please consult the citation manual: https://tskit.dev/citation/

Inspecting table nodes:

print(f"Nodes are {len(ts.tables.nodes)}")
pprint(ts.tables.nodes.asdict())
pprint(ts.tables.nodes[-1])

Nodes are 21091
{'flags': array([1, 1, 1, ..., 0, 0, 0], shape=(21091,), dtype=uint32),
 'individual': array([ 0,  0,  1, ..., -1, -1, -1], shape=(21091,), dtype=int32),
 'metadata': array([123, 125, 123, ...,  53,  55, 125], shape=(524909,), dtype=int8),
 'metadata_offset': array([     0,      2,      4, ..., 524807, 524857, 524909],
      shape=(21092,), dtype=uint64),
 'metadata_schema': '{"codec":"json","properties":{"ancestor_data_id":{"description":"The '
                    'ID of the tsinfer ancestor data node from which this node '
                    'is derived.","type":"number"}}}',
 'population': array([ 0,  0,  0, ..., -1, -1, -1], shape=(21091,), dtype=int32),
 'time': array([0.00000000e+00, 0.00000000e+00, 0.00000000e+00, ...,
       6.21120927e+01, 1.54211541e+01, 3.54785163e+04], shape=(21091,))}
NodeTableRow(flags=0,
             time=35478.516262884325,
             population=-1,
             individual=-1,
             metadata={'mn': 35478.516262884325, 'vr': 159043851.95505357})

It seems that are edges the most predominant data structure in tstree object:

print(f"Edges are {len(ts.tables.edges)}")
pprint(ts.tables.edges.asdict())
pprint(ts.tables.edges[-1])

Edges are 2266939
{'child': array([    8,    18,    22, ..., 19751, 20181, 20418],
      shape=(2266939,), dtype=int32),
 'left': array([11093140., 11093140., 15399236., ..., 44004281., 44004281.,
       44004281.], shape=(2266939,)),
 'metadata': array([], dtype=int8),
 'metadata_offset': array([0, 0, 0, ..., 0, 0, 0], shape=(2266940,), dtype=uint64),
 'metadata_schema': '',
 'parent': array([11688, 11688, 11688, ..., 21090, 21090, 21090],
      shape=(2266939,), dtype=int32),
 'right': array([21463416., 21463416., 15810520., ..., 44004282., 44004282.,
       44004282.], shape=(2266939,))}
EdgeTableRow(left=44004281.0,
             right=44004282.0,
             parent=21090,
             child=20418,
             metadata=b'')

Ok, try to focus only on a SNP. Get the tree for the first SNP and try to get stuff from tables:

POS = 209049
tree = ts.at(POS)
tree

Tree
Index	1
Interval	209 049-268 822 (59 773)
Roots	1
Nodes	11 605
Sites	1
Mutations	1
Total Branch Length	82 817.577

Now get the intervals of this tree. Then try to filter out edges between those positions:

interval = tree.interval
left_bound = interval.left
right_bound = interval.right

filtered_edges = ts.tables.edges[
    np.logical_and(ts.tables.edges.left >= left_bound, ts.tables.edges.right <= right_bound)]
filtered_edges[:10]

id	left	right	parent	child
0	209,049	268,822	17,327	54
1	209,049	268,822	17,327	59
2	209,049	268,822	17,327	68
3	209,049	268,822	17,327	92
4	209,049	268,822	17,327	110
5	209,049	268,822	17,327	117
6	209,049	268,822	17,327	120
7	209,049	268,822	17,327	131
8	209,049	268,822	17,327	134
9	209,049	268,822	17,327	155

len(filtered_edges)

Can I filter out the nodes in the same way? In this case I don’t have a left and right position like in the edge table. However, from the edge table I can derive which nodes are child of parents:

parents = set(filtered_edges.parent)
childs = set(filtered_edges.child)

node_ids = parents.union(childs)
print(f"Got {len(node_ids)} distinct nodes")

Got 1427 distinct nodes

tree.draw_svg(
    size=(800, 400),
    time_scale="log_time",
)

../_images/7d9245ffeeea0e4e807bfcbe50e89df6789611c8baabf152c4e73fc6a52bd568.svg