About tstree space

About tstree space#

A tstree object created from the smarter dataset occupies a lot of space. Even by providing REF allele as ancestor alleles and dropping mutation in tstree object there is still a lot of space occupied by the object. So try to deal with tables and inspect how much data is stored here

from pprint import pprint
import numpy as np

import tskit

from tskitetude import get_project_dir
ts = tskit.load(str(get_project_dir() / "results-reference/background_samples/tsinfer/SMARTER-OA-OAR3-forward-0.4.10.focal.26.trees"))
ts
Tree Sequence
Trees659
Sequence Length44 077 779
Time Unitsgenerations
Sample Nodes11 478
Total Size87.9 MiB
Metadata
dict
Table Rows Size Has Metadata
Edges 2 266 939 69.2 MiB
Individuals 5 739 353.1 KiB
Migrations 0 8 Bytes
Mutations 657 23.8 KiB
Nodes 21 091 1.1 MiB
Populations 202 4.7 KiB
Provenances 4 2.4 KiB
Sites 657 32.7 KiB
Provenance Timestamp Software Name Version Command Full record
06 September, 2024 at 01:19:45 AM tsdate 0.1.7 inside_outside
Details
dict schema_version: 1.0.0
software:
dict name: tsdate
version: 0.1.7

parameters:
dict mutation_rate: 1e-08
recombination_rate: None
time_units: None
progress: None
population_size: 10000.0
eps: 1e-06
outside_standardize: True
ignore_oldest_root: False
probability_space: logarithmic
num_threads: None
cache_inside: False
command: inside_outside

environment:
dict
os:
dict system: Linux
node: node1
release: 5.15.0-58-generic
version: #64-Ubuntu SMP Thu Jan 5
11:43:13 UTC 2023
machine: x86_64

python:
dict implementation: CPython
version: 3.11.9

libraries:
dict
tskit:
dict version: 0.5.8



06 September, 2024 at 01:13:11 AM tsdate 0.1.7 preprocess_ts
Details
dict schema_version: 1.0.0
software:
dict name: tsdate
version: 0.1.7

parameters:
dict minimum_gap: 1000000
remove_telomeres: True
filter_populations: False
filter_individuals: False
filter_sites: False
delete_intervals:
list
list 0
209048.0

list 44004282.0
44077779.0


command: preprocess_ts

environment:
dict
os:
dict system: Linux
node: node1
release: 5.15.0-58-generic
version: #64-Ubuntu SMP Thu Jan 5
11:43:13 UTC 2023
machine: x86_64

python:
dict implementation: CPython
version: 3.11.9

libraries:
dict
tskit:
dict version: 0.5.8



06 September, 2024 at 01:13:05 AM tskit 0.5.8 simplify
Details
dict schema_version: 1.0.0
software:
dict name: tskit
version: 0.5.8

parameters:
dict command: simplify
TODO: add simplify parameters

environment:
dict
os:
dict system: Linux
node: node1
release: 5.15.0-58-generic
version: #64-Ubuntu SMP Thu Jan 5
11:43:13 UTC 2023
machine: x86_64

python:
dict implementation: CPython
version: 3.11.9

libraries:
dict
kastore:
dict version: 2.1.1



06 September, 2024 at 01:13:01 AM tsinfer 0.3.3 infer
Details
dict schema_version: 1.0.0
software:
dict name: tsinfer
version: 0.3.3

parameters:
dict mismatch_ratio: None
command: infer

environment:
dict
libraries:
dict
zarr:
dict version: 2.17.2

numcodecs:
dict version: 0.12.1

lmdb:
dict version: 1.4.1

tskit:
dict version: 0.5.8


os:
dict system: Linux
node: node1
release: 5.15.0-58-generic
version: #64-Ubuntu SMP Thu Jan 5
11:43:13 UTC 2023
machine: x86_64

python:
dict implementation: CPython
version:
list 3
11
9



To cite this software, please consult the citation manual: https://tskit.dev/citation/

Inspecting table nodes:

print(f"Nodes are {len(ts.tables.nodes)}")
pprint(ts.tables.nodes.asdict())
pprint(ts.tables.nodes[-1])
Nodes are 21091
{'flags': array([1, 1, 1, ..., 0, 0, 0], shape=(21091,), dtype=uint32),
 'individual': array([ 0,  0,  1, ..., -1, -1, -1], shape=(21091,), dtype=int32),
 'metadata': array([123, 125, 123, ...,  53,  55, 125], shape=(524909,), dtype=int8),
 'metadata_offset': array([     0,      2,      4, ..., 524807, 524857, 524909],
      shape=(21092,), dtype=uint64),
 'metadata_schema': '{"codec":"json","properties":{"ancestor_data_id":{"description":"The '
                    'ID of the tsinfer ancestor data node from which this node '
                    'is derived.","type":"number"}}}',
 'population': array([ 0,  0,  0, ..., -1, -1, -1], shape=(21091,), dtype=int32),
 'time': array([0.00000000e+00, 0.00000000e+00, 0.00000000e+00, ...,
       6.21120927e+01, 1.54211541e+01, 3.54785163e+04], shape=(21091,))}
NodeTableRow(flags=0,
             time=35478.516262884325,
             population=-1,
             individual=-1,
             metadata={'mn': 35478.516262884325, 'vr': 159043851.95505357})

It seems that are edges the most predominant data structure in tstree object:

print(f"Edges are {len(ts.tables.edges)}")
pprint(ts.tables.edges.asdict())
pprint(ts.tables.edges[-1])
Edges are 2266939
{'child': array([    8,    18,    22, ..., 19751, 20181, 20418],
      shape=(2266939,), dtype=int32),
 'left': array([11093140., 11093140., 15399236., ..., 44004281., 44004281.,
       44004281.], shape=(2266939,)),
 'metadata': array([], dtype=int8),
 'metadata_offset': array([0, 0, 0, ..., 0, 0, 0], shape=(2266940,), dtype=uint64),
 'metadata_schema': '',
 'parent': array([11688, 11688, 11688, ..., 21090, 21090, 21090],
      shape=(2266939,), dtype=int32),
 'right': array([21463416., 21463416., 15810520., ..., 44004282., 44004282.,
       44004282.], shape=(2266939,))}
EdgeTableRow(left=44004281.0,
             right=44004282.0,
             parent=21090,
             child=20418,
             metadata=b'')

Ok, try to focus only on a SNP. Get the tree for the first SNP and try to get stuff from tables:

POS = 209049
tree = ts.at(POS)
tree
Tree
Index1
Interval209 049-268 822 (59 773)
Roots1
Nodes11 605
Sites1
Mutations1
Total Branch Length82 817.577

Now get the intervals of this tree. Then try to filter out edges between those positions:

interval = tree.interval
left_bound = interval.left
right_bound = interval.right

filtered_edges = ts.tables.edges[
    np.logical_and(ts.tables.edges.left >= left_bound, ts.tables.edges.right <= right_bound)]
filtered_edges[:10]
idleftrightparentchildmetadata
0209,049268,82217,32754
1209,049268,82217,32759
2209,049268,82217,32768
3209,049268,82217,32792
4209,049268,82217,327110
5209,049268,82217,327117
6209,049268,82217,327120
7209,049268,82217,327131
8209,049268,82217,327134
9209,049268,82217,327155
len(filtered_edges)
1426

Can I filter out the nodes in the same way? In this case I don’t have a left and right position like in the edge table. However, from the edge table I can derive which nodes are child of parents:

parents = set(filtered_edges.parent)
childs = set(filtered_edges.child)

node_ids = parents.union(childs)
print(f"Got {len(node_ids)} distinct nodes")
Got 1427 distinct nodes
tree.draw_svg(
    size=(800, 400),
    time_scale="log_time",
)
../_images/7d9245ffeeea0e4e807bfcbe50e89df6789611c8baabf152c4e73fc6a52bd568.svg