About tstree space#
A tstree object created from the smarter dataset occupies a lot of space. Even by providing REF allele as ancestor alleles and dropping mutation in tstree object there is still a lot of space occupied by the object. So try to deal with tables and inspect how much data is stored here
from pprint import pprint
import numpy as np
import tskit
from tskitetude import get_project_dir
ts = tskit.load(str(get_project_dir() / "results-reference/background_samples/tsinfer/SMARTER-OA-OAR3-forward-0.4.10.focal.26.trees"))
ts
|
|
|
|---|---|
| Trees | 659 |
| Sequence Length | 44 077 779 |
| Time Units | generations |
| Sample Nodes | 11 478 |
| Total Size | 87.9 MiB |
| Metadata |
dict |
| Table | Rows | Size | Has Metadata |
|---|---|---|---|
| Edges | 2 266 939 | 69.2 MiB | |
| Individuals | 5 739 | 353.1 KiB | ✅ |
| Migrations | 0 | 8 Bytes | |
| Mutations | 657 | 23.8 KiB | |
| Nodes | 21 091 | 1.1 MiB | ✅ |
| Populations | 202 | 4.7 KiB | ✅ |
| Provenances | 4 | 2.4 KiB | |
| Sites | 657 | 32.7 KiB | ✅ |
| Provenance Timestamp | Software Name | Version | Command | Full record |
|---|---|---|---|---|
| 06 September, 2024 at 01:19:45 AM | tsdate | 0.1.7 | inside_outside |
Detailsdictschema_version: 1.0.0
software:
dictname: tsdateversion: 0.1.7
parameters:
dictmutation_rate: 1e-08recombination_rate: None time_units: None progress: None population_size: 10000.0 eps: 1e-06 outside_standardize: True ignore_oldest_root: False probability_space: logarithmic num_threads: None cache_inside: False command: inside_outside
environment:
dict
os:
dictsystem: Linuxnode: node1 release: 5.15.0-58-generic version: #64-Ubuntu SMP Thu Jan 5 11:43:13 UTC 2023 machine: x86_64
python:
dictimplementation: CPythonversion: 3.11.9
libraries:
dict
tskit:
dictversion: 0.5.8 |
| 06 September, 2024 at 01:13:11 AM | tsdate | 0.1.7 | preprocess_ts |
Detailsdictschema_version: 1.0.0
software:
dictname: tsdateversion: 0.1.7
parameters:
dictminimum_gap: 1000000remove_telomeres: True filter_populations: False filter_individuals: False filter_sites: False
delete_intervals:
listlist0209048.0 list44004282.044077779.0 command: preprocess_ts
environment:
dict
os:
dictsystem: Linuxnode: node1 release: 5.15.0-58-generic version: #64-Ubuntu SMP Thu Jan 5 11:43:13 UTC 2023 machine: x86_64
python:
dictimplementation: CPythonversion: 3.11.9
libraries:
dict
tskit:
dictversion: 0.5.8 |
| 06 September, 2024 at 01:13:05 AM | tskit | 0.5.8 | simplify |
Detailsdictschema_version: 1.0.0
software:
dictname: tskitversion: 0.5.8
parameters:
dictcommand: simplifyTODO: add simplify parameters
environment:
dict
os:
dictsystem: Linuxnode: node1 release: 5.15.0-58-generic version: #64-Ubuntu SMP Thu Jan 5 11:43:13 UTC 2023 machine: x86_64
python:
dictimplementation: CPythonversion: 3.11.9
libraries:
dict
kastore:
dictversion: 2.1.1 |
| 06 September, 2024 at 01:13:01 AM | tsinfer | 0.3.3 | infer |
Detailsdictschema_version: 1.0.0
software:
dictname: tsinferversion: 0.3.3
parameters:
dictmismatch_ratio: Nonecommand: infer
environment:
dict
libraries:
dict
zarr:
dictversion: 2.17.2
numcodecs:
dictversion: 0.12.1
lmdb:
dictversion: 1.4.1
tskit:
dictversion: 0.5.8
os:
dictsystem: Linuxnode: node1 release: 5.15.0-58-generic version: #64-Ubuntu SMP Thu Jan 5 11:43:13 UTC 2023 machine: x86_64
python:
dictimplementation: CPython
version:
list311 9 |
Inspecting table nodes:
print(f"Nodes are {len(ts.tables.nodes)}")
pprint(ts.tables.nodes.asdict())
pprint(ts.tables.nodes[-1])
Nodes are 21091
{'flags': array([1, 1, 1, ..., 0, 0, 0], shape=(21091,), dtype=uint32),
'individual': array([ 0, 0, 1, ..., -1, -1, -1], shape=(21091,), dtype=int32),
'metadata': array([123, 125, 123, ..., 53, 55, 125], shape=(524909,), dtype=int8),
'metadata_offset': array([ 0, 2, 4, ..., 524807, 524857, 524909],
shape=(21092,), dtype=uint64),
'metadata_schema': '{"codec":"json","properties":{"ancestor_data_id":{"description":"The '
'ID of the tsinfer ancestor data node from which this node '
'is derived.","type":"number"}}}',
'population': array([ 0, 0, 0, ..., -1, -1, -1], shape=(21091,), dtype=int32),
'time': array([0.00000000e+00, 0.00000000e+00, 0.00000000e+00, ...,
6.21120927e+01, 1.54211541e+01, 3.54785163e+04], shape=(21091,))}
NodeTableRow(flags=0,
time=35478.516262884325,
population=-1,
individual=-1,
metadata={'mn': 35478.516262884325, 'vr': 159043851.95505357})
It seems that are edges the most predominant data structure in tstree object:
print(f"Edges are {len(ts.tables.edges)}")
pprint(ts.tables.edges.asdict())
pprint(ts.tables.edges[-1])
Edges are 2266939
{'child': array([ 8, 18, 22, ..., 19751, 20181, 20418],
shape=(2266939,), dtype=int32),
'left': array([11093140., 11093140., 15399236., ..., 44004281., 44004281.,
44004281.], shape=(2266939,)),
'metadata': array([], dtype=int8),
'metadata_offset': array([0, 0, 0, ..., 0, 0, 0], shape=(2266940,), dtype=uint64),
'metadata_schema': '',
'parent': array([11688, 11688, 11688, ..., 21090, 21090, 21090],
shape=(2266939,), dtype=int32),
'right': array([21463416., 21463416., 15810520., ..., 44004282., 44004282.,
44004282.], shape=(2266939,))}
EdgeTableRow(left=44004281.0,
right=44004282.0,
parent=21090,
child=20418,
metadata=b'')
Ok, try to focus only on a SNP. Get the tree for the first SNP and try to get stuff from tables:
POS = 209049
tree = ts.at(POS)
tree
|
|
|
|---|---|
| Index | 1 |
| Interval | 209 049-268 822 (59 773) |
| Roots | 1 |
| Nodes | 11 605 |
| Sites | 1 |
| Mutations | 1 |
| Total Branch Length | 82 817.577 |
Now get the intervals of this tree. Then try to filter out edges between those positions:
interval = tree.interval
left_bound = interval.left
right_bound = interval.right
filtered_edges = ts.tables.edges[
np.logical_and(ts.tables.edges.left >= left_bound, ts.tables.edges.right <= right_bound)]
filtered_edges[:10]
| id | left | right | parent | child | metadata |
|---|---|---|---|---|---|
| 0 | 209,049 | 268,822 | 17,327 | 54 | |
| 1 | 209,049 | 268,822 | 17,327 | 59 | |
| 2 | 209,049 | 268,822 | 17,327 | 68 | |
| 3 | 209,049 | 268,822 | 17,327 | 92 | |
| 4 | 209,049 | 268,822 | 17,327 | 110 | |
| 5 | 209,049 | 268,822 | 17,327 | 117 | |
| 6 | 209,049 | 268,822 | 17,327 | 120 | |
| 7 | 209,049 | 268,822 | 17,327 | 131 | |
| 8 | 209,049 | 268,822 | 17,327 | 134 | |
| 9 | 209,049 | 268,822 | 17,327 | 155 |
len(filtered_edges)
1426
Can I filter out the nodes in the same way? In this case I don’t have a left and right position like in the edge table. However, from the edge table I can derive which nodes are child of parents:
parents = set(filtered_edges.parent)
childs = set(filtered_edges.child)
node_ids = parents.union(childs)
print(f"Got {len(node_ids)} distinct nodes")
Got 1427 distinct nodes
tree.draw_svg(
size=(800, 400),
time_scale="log_time",
)