Threads fitting data#

From the threads documentation regarding fitting input mutations to the inferred ARG:

Threads does not by default guarantee that input mutations correspond to monophyletic clades in the inferred ARG. However, this can be achieved by providing the --fit_to_data flag. When threads infer is run with the --fit_to_data flag, the threading instructions are amended post-hoc such that each input mutation is represented by a clade in a marginal coalescence tree. As part of this process, Threads estimates the age of each mutation and then amends the local ARG topology.

However, the time required to fit the data can be significant, especially for large datasets. Another consideration is that when converting the ARG to a tree sequence, I can’t use the --add_mutations if --fit_to_data is not used: this is why the nf-treeseq pipeline currently models the two distinct options with --threads_fit_to_data parameter.

import tszip
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

from tskitetude import get_project_dir

Load threads tree sequences#

Let’s open the simplest results I have, with and without fitting the data, and try to understand the differences:

ts_threads = tszip.load(get_project_dir() / "results-threads/toInfer/threads/ts300I25k.1.trees.tsz")
ts_threads_fit = tszip.load(get_project_dir() / "results-threads/toInfer-fit/threads/ts300I25k.1.trees.tsz")

Let’s dump the two tree sequences:

ts_threads
Tree Sequence
Trees22 554
Sequence Length99 988 756
Time Unitsunknown
Sample Nodes3 200
Total Size145.5 MiB
Metadata
dict chromosome: 1
offset: 6714
Table Rows Size Has Metadata
Edges 3 446 739 105.2 MiB
Individuals 1 600 80.8 KiB
Migrations 0 8 Bytes
Mutations 0 16 Bytes
Nodes 520 417 13.9 MiB
Populations 8 271 Bytes
Provenances 1 385 Bytes
Sites 0 16 Bytes
Provenance Timestamp Software Name Version Command Full record
15 December, 2025 at 02:21:32 PM threads 0.2.1 Unknown
Details
dict
software:
dict name: threads
version: 0.2.1

parameters:
dict input_file: ts300I25k.1.biallelic.vcf.gz
metadata_added: True
populations_added: 8
individuals_added: 1600

timestamp: 2025-12-
15T14:21:32.293625+00:00
description: Tree sequence annotated with
population and individual
metadata
To cite this software, please consult the citation manual: https://tskit.dev/citation/
ts_threads_fit
Tree Sequence
Trees24 857
Sequence Length99 988 756
Time Unitsunknown
Sample Nodes3 200
Total Size354.1 MiB
Metadata
dict chromosome: 1
offset: 6714
Table Rows Size Has Metadata
Edges 8 329 337 254.2 MiB
Individuals 1 600 80.8 KiB
Migrations 0 8 Bytes
Mutations 24 924 900.6 KiB
Nodes 1 304 983 34.8 MiB
Populations 8 271 Bytes
Provenances 1 385 Bytes
Sites 24 924 608.5 KiB
Provenance Timestamp Software Name Version Command Full record
17 December, 2025 at 08:29:37 PM threads 0.2.1 Unknown
Details
dict
software:
dict name: threads
version: 0.2.1

parameters:
dict input_file: ts300I25k.1.biallelic.vcf.gz
metadata_added: True
populations_added: 8
individuals_added: 1600

timestamp: 2025-12-
17T20:29:37.025492+00:00
description: Tree sequence annotated with
population and individual
metadata
To cite this software, please consult the citation manual: https://tskit.dev/citation/

The fitted version seems bigger than the normal one. Let’s open also the tsinfer version of treeseq:

ts_infer = tszip.load(get_project_dir() / "results-reference/toInfer/tsinfer/ts300I25k.1.trees.tsz")
ts_infer
Tree Sequence
Trees22 569
Sequence Length1e+08
Time Unitsgenerations
Sample Nodes3 200
Total Size123.5 MiB
Metadata
dict
Table Rows Size Has Metadata
Edges 1 224 660 37.4 MiB
Individuals 1 600 82.2 KiB
Migrations 0 8 Bytes
Mutations 851 484 68.8 MiB
Nodes 80 645 6.7 MiB
Populations 8 185 Bytes
Provenances 4 2.8 KiB
Sites 24 947 1.2 MiB
Provenance Timestamp Software Name Version Command Full record
05 December, 2025 at 09:41:56 AM tsdate 0.2.4 variational_gamma
Details
dict schema_version: 1.0.0
software:
dict name: tsdate
version: 0.2.4

parameters:
dict mutation_rate: 5.87e-09
recombination_rate: None
time_units: None
progress: None
population_size: None
eps: 1e-10
max_iterations: 25
max_shape: 1000
rescaling_intervals: 1000
rescaling_iterations: 5
match_segregating_sites: False
regularise_roots: True
singletons_phased: True
command: variational_gamma

environment:
dict
os:
dict system: Linux
node: node1
release: 5.15.0-58-generic
version: #64-Ubuntu SMP Thu Jan 5
11:43:13 UTC 2023
machine: x86_64

python:
dict implementation: CPython
version: 3.12.12

libraries:
dict
tskit:
dict version: 1.0.0b3



resources:
dict elapsed_time: 108.51660299301147
user_time: 571.01
sys_time: 14.32
max_memory: 1730285568

05 December, 2025 at 09:40:07 AM tsdate 0.2.4 preprocess_ts
Details
dict schema_version: 1.0.0
software:
dict name: tsdate
version: 0.2.4

parameters:
dict minimum_gap: 1000000
erase_flanks: True
split_disjoint: True
filter_populations: False
filter_individuals: False
filter_sites: False
delete_intervals:
list
list 0
6713.0

list 99995468.0
100000000.0


command: preprocess_ts

environment:
dict
os:
dict system: Linux
node: node1
release: 5.15.0-58-generic
version: #64-Ubuntu SMP Thu Jan 5
11:43:13 UTC 2023
machine: x86_64

python:
dict implementation: CPython
version: 3.12.12

libraries:
dict
tskit:
dict version: 1.0.0b3



resources:
dict elapsed_time: 6.797327280044556
user_time: 462.99
sys_time: 13.57
max_memory: 1596768256

05 December, 2025 at 09:40:00 AM tskit 1.0.0b3 simplify
Details
dict schema_version: 1.0.0
software:
dict name: tskit
version: 1.0.0b3

parameters:
dict command: simplify
TODO: add simplify parameters

environment:
dict
os:
dict system: Linux
node: node1
release: 5.15.0-58-generic
version: #64-Ubuntu SMP Thu Jan 5
11:43:13 UTC 2023
machine: x86_64

python:
dict implementation: CPython
version: 3.12.12

libraries:
dict
kastore:
dict version: 2.1.1



05 December, 2025 at 09:39:55 AM tsinfer 0.5.0 infer
Details
dict schema_version: 1.0.0
software:
dict name: tsinfer
version: 0.5.0

parameters:
dict mismatch_ratio: None
path_compression: True
precision: None
post_process: None
command: infer

environment:
dict
libraries:
dict
zarr:
dict version: 2.18.3

numcodecs:
dict version: 0.13.1

lmdb:
dict version: 1.7.5

tskit:
dict version: 1.0.0b3


os:
dict system: Linux
node: node1
release: 5.15.0-58-generic
version: #64-Ubuntu SMP Thu Jan 5
11:43:13 UTC 2023
machine: x86_64

python:
dict implementation: CPython
version:
list 3
12
12



resources:
dict elapsed_time: 143.15887202322483
user_time: 303.99
sys_time: 10.53
max_memory: 1596768256

To cite this software, please consult the citation manual: https://tskit.dev/citation/

Simplify threads tree sequences#

Are threads file simplified? try to simplify them:

ts_threads_simplified = ts_threads.simplify()
ts_threads_fit_simplified = ts_threads_fit.simplify()

let’s collect those information in a DataFrame for easier comparison:

ts_list = [
    ts_threads,
    ts_threads_simplified,
    ts_threads_fit,
    ts_threads_fit_simplified,
    ts_infer
]

comparison_data = {
    'num_trees': [getattr(ts, 'num_trees') for ts in ts_list],
    'num_edges': [getattr(ts, 'num_edges') for ts in ts_list],
    'num_mutations': [getattr(ts, 'num_mutations') for ts in ts_list],
    'num_nodes': [getattr(ts, 'num_nodes') for ts in ts_list],
    'num_sites': [getattr(ts, 'num_sites') for ts in ts_list]
}

df_comparison = pd.DataFrame(comparison_data, index=[
    'threads',
    'threads-simplified',
    'threads-fit',
    'threads-fit-simplified',
    'tsinfer'])
df_comparison.transpose()
threads threads-simplified threads-fit threads-fit-simplified tsinfer
num_trees 22554 22554 24857 24857 22569
num_edges 3446739 1780159 8329337 4239485 1224660
num_mutations 0 0 24924 24924 851484
num_nodes 520417 520417 1304983 1304983 80645
num_sites 0 0 24924 24924 24947

Plot tree sequence stats#

Let’s plot this dataframe:

# A plot for each metric in a 2x3 grid with colors for each method (log scale)
metrics = df_comparison.columns
fig, axes = plt.subplots(2, 3, figsize=(15, 8))
axes = axes.flatten()

# Define colors for each method
colors = sns.color_palette("Set1", 5)

for i, metric in enumerate(metrics):
    data = df_comparison[[metric]].reset_index()
    data.columns = ['method', 'value']
    if metric in ['num_mutations']:
        data['value'] = np.log10(data['value'] + 1)
        axes[i].set_ylabel('log10(Count)')
    else:
        axes[i].set_ylabel('Count')
    sns.barplot(data=data, x='method', y='value', hue='method', ax=axes[i], palette=colors, legend=False)
    axes[i].set_title(metric)
    axes[i].set_xlabel('')
    axes[i].tick_params(axis='x', rotation=45)

# Hide the last subplot (since we only have 5 metrics)
axes[5].set_visible(False)

plt.tight_layout()
plt.show()
../_images/67cbc2bcc479f059a76012ca57475b8a7a38e1577ae85cd3c38fc6e37742430d.png

The number of trees are similar, between the two approaches (simplified or not). Number of edges differs: The unfitted version has less edges and simplified version has a number of edges similar to the tsinfer version. The fitted version and the fitted-simplified version have four and two times more edges than the tsinfer version. The number of mutations and sites are missed in the unfitted version, while the fitted version has more than half the mutations of the tsinfer version and the number of sites is similar. Lastly, the number of nodes is significantly higher in the unfitted version and more higher in the fitted version compared to the tsinfer one.