TSINFER tutorial#
Toy example#
Supposing to have phased haplotype data for five samples at six sites like this:
sample haplotype
0 AGCGAT
1 TGACAG
2 AGACAC
3 ACCGCT
4 ACCGCT
Before to derive a tstree object that model these data, in need to import data
with tsinfer: this requires to know the ancestral alleles first:
import string
import numpy as np
import tsinfer
import cyvcf2
import json
import tsdate
from tqdm.notebook import tqdm
from tskit import MISSING_DATA
from tskitetude import get_data_dir
with tsinfer.SampleData(sequence_length=6) as sample_data:
sample_data.add_site(0, [0, 1, 0, 0, 0], ["A", "T"], ancestral_allele=0)
sample_data.add_site(1, [0, 0, 0, 1, 1], ["G", "C"], ancestral_allele=0)
sample_data.add_site(2, [0, 1, 1, 0, 0], ["C", "A"], ancestral_allele=0)
sample_data.add_site(3, [0, 1, 1, 0, 0], ["G", "C"], ancestral_allele=MISSING_DATA)
sample_data.add_site(4, [0, 0, 0, 1, 1], ["A", "C"], ancestral_allele=0)
sample_data.add_site(5, [0, 1, 2, 0, 0], ["T", "G", "C"], ancestral_allele=0)
/tmp/ipykernel_44000/2393816675.py:1: DeprecationWarning: SampleData is deprecated
with tsinfer.SampleData(sequence_length=6) as sample_data:
tsinfer.Sampledata is the object required for inferring a tstree object. Using
the add_site() method a can add information for each SNP respectively. The first
argument is the SNP position: here for simplicity we track SNP in positional order
but it can be any positive value (even float). The only requirement is that this
position should be unique and added in increasing order. The 2nd argument is for
the genotypes of each sample in this position: is and index of the allele I can
find in the 3rd argument. If I have a missing data, I need to use the tskit.MISSING_DATA
The last argument is the index of the ancestral allele. Not all the sites are used
to infer the tree object: sites with missing data or ancestral alleles or sites with
more than 2 genotypes are not considered by will be modeled in the resulting tree.
Once we have the SampleData instance, we can infer a tstree object using
tsinfer.infer:
ts = tsinfer.infer(sample_data)
This ts object is a full Tree Sequence object:
ts
|
|
|
|---|---|
| Trees | 1 |
| Sequence Length | 6 |
| Time Units | uncalibrated |
| Sample Nodes | 5 |
| Total Size | 2.3 KiB |
| Metadata |
dict |
| Table | Rows | Size | Has Metadata |
|---|---|---|---|
| Edges | 7 | 232 Bytes | |
| Individuals | 5 | 174 Bytes | ✅ |
| Migrations | 0 | 8 Bytes | |
| Mutations | 7 | 275 Bytes | |
| Nodes | 8 | 469 Bytes | ✅ |
| Populations | 0 | 8 Bytes | |
| Provenances | 1 | 794 Bytes | |
| Sites | 6 | 337 Bytes | ✅ |
| Provenance Timestamp | Software Name | Version | Command | Full record |
|---|---|---|---|---|
| 19 December, 2025 at 01:23:52 PM | tsinfer | 0.5.0 | infer |
Detailsdictschema_version: 1.0.0
software:
dictname: tsinferversion: 0.5.0
parameters:
dictmismatch_ratio: Nonepath_compression: True precision: None post_process: None command: infer
environment:
dict
libraries:
dict
zarr:
dictversion: 2.18.7
numcodecs:
dictversion: 0.15.0
lmdb:
dictversion: 1.7.5
tskit:
dictversion: 1.0.0b3
os:
dictsystem: Linuxnode: node1 release: 5.15.0-58-generic version: #64-Ubuntu SMP Thu Jan 5 11:43:13 UTC 2023 machine: x86_64
python:
dictimplementation: CPython
version:
list312 12
resources:
dictelapsed_time: 1.3854495584964752user_time: 1.2199999999999989 sys_time: 0.10000000000000009 max_memory: 680337408 |
This Tree sequence object can be analyzed as usual:
print("==Haplotypes==")
for sample_id, h in enumerate(ts.haplotypes()):
print(sample_id, h, sep="\t")
ts.draw_svg(y_axis=True)
==Haplotypes==
0 AGCGAT
1 TGACAG
2 AGACAC
3 ACCGCT
4 ACCGCT
If I understand correctly, tsinfer can impute missing data (check this). For the
data I put, there’s a root note with three childs: this is also known as polytomy.
Every internal node represent an ancestral sequence, By default, the time of those
nodes is not measured in years or generations, but is the frequency of the shared
derived alleles on which the ancestral sequence is based. This is why the time is
uncalibrated in the graph above.
# Extra code to label and order the tips alphabetically rather than numerically
labels = {i: string.ascii_lowercase[i] for i in range(ts.num_nodes)}
genome_order = [n for n in ts.first().nodes(order="minlex_postorder") if ts.node(n).is_sample()]
labels.update({n: labels[i] for i, n in enumerate(genome_order)})
style1 = (
".node:not(.sample) > .sym, .node:not(.sample) > .lab {visibility: hidden;}"
".mut {font-size: 12px} .y-axis .tick .lab {font-size: 85%}")
sz = (800, 250) # size of the plot, slightly larger than the default
# ticks = [0, 5000, 10000, 15000, 20000]
# get max generations time:
max_time = ts.node(ts.get_num_nodes() - 1).time
ticks = np.linspace(0, max_time, 5)
ts.draw_svg(
size=sz, node_labels=labels, style=style1, y_label="Time ago (uncalibrated)",
y_axis=True, y_ticks=ticks)
Inferring dates#
To infer dates we can use tsdate.date with default parameter and by
specifying Effective population sizes and mutation rate:
dated_ts = tsdate.date(ts, method="inside_outside", mutation_rate=1e-8, population_size=1e4, progress=True)
dated_ts
|
|
|
|---|---|
| Trees | 1 |
| Sequence Length | 6 |
| Time Units | generations |
| Sample Nodes | 5 |
| Total Size | 3.4 KiB |
| Metadata |
dict |
| Table | Rows | Size | Has Metadata |
|---|---|---|---|
| Edges | 7 | 232 Bytes | |
| Individuals | 5 | 174 Bytes | ✅ |
| Migrations | 0 | 8 Bytes | |
| Mutations | 7 | 275 Bytes | |
| Nodes | 8 | 691 Bytes | ✅ |
| Populations | 0 | 8 Bytes | |
| Provenances | 2 | 1.6 KiB | |
| Sites | 6 | 337 Bytes | ✅ |
| Provenance Timestamp | Software Name | Version | Command | Full record |
|---|---|---|---|---|
| 19 December, 2025 at 01:23:52 PM | tsdate | 0.2.4 | inside_outside |
Detailsdictschema_version: 1.0.0
software:
dictname: tsdateversion: 0.2.4
parameters:
dictmutation_rate: 1e-08recombination_rate: None time_units: None progress: True population_size: 10000.0 eps: 1e-10 outside_standardize: True ignore_oldest_root: False probability_space: logarithmic num_threads: None cache_inside: False command: inside_outside
environment:
dict
os:
dictsystem: Linuxnode: node1 release: 5.15.0-58-generic version: #64-Ubuntu SMP Thu Jan 5 11:43:13 UTC 2023 machine: x86_64
python:
dictimplementation: CPythonversion: 3.12.12
libraries:
dict
tskit:
dictversion: 1.0.0b3
resources:
dictelapsed_time: 0.22470426559448242user_time: 60.7 sys_time: 3.39 max_memory: 681193472 |
| 19 December, 2025 at 01:23:52 PM | tsinfer | 0.5.0 | infer |
Detailsdictschema_version: 1.0.0
software:
dictname: tsinferversion: 0.5.0
parameters:
dictmismatch_ratio: Nonepath_compression: True precision: None post_process: None command: infer
environment:
dict
libraries:
dict
zarr:
dictversion: 2.18.7
numcodecs:
dictversion: 0.15.0
lmdb:
dictversion: 1.7.5
tskit:
dictversion: 1.0.0b3
os:
dictsystem: Linuxnode: node1 release: 5.15.0-58-generic version: #64-Ubuntu SMP Thu Jan 5 11:43:13 UTC 2023 machine: x86_64
python:
dictimplementation: CPython
version:
list312 12
resources:
dictelapsed_time: 1.3854495584964752user_time: 1.2199999999999989 sys_time: 0.10000000000000009 max_memory: 680337408 |
dated_ts.draw_svg(y_axis=True, size=(800, 250))
Data example#
This is the Data example part of the tutorial:
def add_diploid_sites(vcf, samples):
"""
Read the sites in the vcf and add them to the samples object.
"""
# You may want to change the following line, e.g. here we allow
# "*" (a spanning deletion) to be a valid allele state
allele_chars = set("ATGCatgc*")
pos = 0
progressbar = tqdm(total=samples.sequence_length, desc="Read VCF", unit='bp')
for variant in vcf: # Loop over variants, each assumed at a unique site
progressbar.update(variant.POS - pos)
if pos == variant.POS:
print(f"Duplicate entries at position {pos}, ignoring all but the first")
continue
else:
pos = variant.POS
if any([not phased for _, _, phased in variant.genotypes]):
raise ValueError("Unphased genotypes for variant at position", pos)
alleles = [variant.REF.upper()] + [v.upper() for v in variant.ALT]
ancestral = variant.INFO.get("AA", ".") # "." means unknown
# some VCFs (e.g. from 1000G) have many values in the AA field: take the 1st
ancestral = ancestral.split("|")[0].upper()
if ancestral == "." or ancestral == "":
ancestral_allele = MISSING_DATA
# alternatively, you could specify `ancestral = variant.REF.upper()`
else:
ancestral_allele = alleles.index(ancestral)
# Check we have ATCG alleles
for a in alleles:
if len(set(a) - allele_chars) > 0:
print(f"Ignoring site at pos {pos}: allele {a} not in {allele_chars}")
continue
# Map original allele indexes to their indexes in the new alleles list.
genotypes = [g for row in variant.genotypes for g in row[0:2]]
samples.add_site(pos, genotypes, alleles, ancestral_allele=ancestral_allele)
def chromosome_length(vcf):
assert len(vcf.seqlens) == 1
return vcf.seqlens[0]
# NB: could also read from an online version by setting vcf_location to
# "https://github.com/tskit-dev/tsinfer/raw/main/docs/_static/P_dom_chr24_phased.vcf.gz"
vcf_location = get_data_dir() / "P_dom_chr24_phased.vcf.gz"
samples_location = get_data_dir() / "P_dom_chr24_phased.samples"
vcf = cyvcf2.VCF(vcf_location)
with tsinfer.SampleData(
path=str(samples_location), sequence_length=chromosome_length(vcf)
) as samples:
add_diploid_sites(vcf, samples)
print(
"Sample file created for {} samples ".format(samples.num_samples)
+ "({} individuals) ".format(samples.num_individuals)
+ "with {} variable sites.".format(samples.num_sites),
flush=True,
)
# Do the inference
ts = tsinfer.infer(samples)
print(
"Inferred tree sequence: {} trees over {} Mb ({} edges)".format(
ts.num_trees, ts.sequence_length / 1e6, ts.num_edges
)
)
[W::bcf_hrec_check] Missing ID attribute in one or more header lines
[W::bcf_hdr_register_hrec] An INFO field has no Type defined. Assuming String
[W::bcf_hdr_register_hrec] An INFO field has no Number defined. Assuming '.'
/home/cozzip/.cache/pypoetry/virtualenvs/tskitetude-hh-GIRXc-py3.12/lib/python3.12/site-packages/tsinfer/formats.py:530: FutureWarning: The LMDBStore is deprecated and will be removed in a Zarr-Python version 3, see https://github.com/zarr-developers/zarr-python/issues/1274 for more information.
return zarr.LMDBStore(self.path, subdir=False, map_size=map_size)
/tmp/ipykernel_44000/1163235217.py:60: DeprecationWarning: SampleData is deprecated
with tsinfer.SampleData(
Sample file created for 20 samples (20 individuals) with 13192 variable sites.
/home/cozzip/.cache/pypoetry/virtualenvs/tskitetude-hh-GIRXc-py3.12/lib/python3.12/site-packages/tsinfer/formats.py:104: FutureWarning: The LMDBStore is deprecated and will be removed in a Zarr-Python version 3, see https://github.com/zarr-developers/zarr-python/issues/1274 for more information.
store = zarr.LMDBStore(
Inferred tree sequence: 6751 trees over 7.077728 Mb (35722 edges)
There’s also a parallel version of this add_diploid_sites here.
Well, until now I added 20 different individuals (with a single chromosome) instead
of adding 10 diploid individuals. I can change something to add more chromosomes
to the same individual, and even add other meta information to the three:
def add_populations(vcf, samples):
"""
Add tsinfer Population objects and returns a list of IDs corresponding to the VCF samples.
"""
# In this VCF, the first letter of the sample name refers to the population
samples_first_letter = [sample_name[0] for sample_name in vcf.samples]
pop_lookup = {}
pop_lookup["8"] = samples.add_population(metadata={"country": "Norway"})
pop_lookup["F"] = samples.add_population(metadata={"country": "France"})
return [pop_lookup[first_letter] for first_letter in samples_first_letter]
def add_diploid_individuals(vcf, samples, populations):
for name, population in zip(vcf.samples, populations):
samples.add_individual(ploidy=2, metadata={"name": name}, population=population)
# Repeat as previously but add both populations and individuals
vcf_location = get_data_dir() / "P_dom_chr24_phased.vcf.gz"
samples_location = get_data_dir() / "P_dom_chr24_phased.samples"
vcf = cyvcf2.VCF(vcf_location)
with tsinfer.SampleData(
path=str(samples_location), sequence_length=chromosome_length(vcf)
) as samples:
populations = add_populations(vcf, samples)
add_diploid_individuals(vcf, samples, populations)
add_diploid_sites(vcf, samples)
print(
"Sample file created for {} samples ".format(samples.num_samples)
+ "({} individuals) ".format(samples.num_individuals)
+ "with {} variable sites.".format(samples.num_sites),
flush=True,
)
# Do the inference
sparrow_ts = tsinfer.infer(samples)
print(
"Inferred tree sequence `{}`: {} trees over {} Mb".format(
"sparrow_ts", sparrow_ts.num_trees, sparrow_ts.sequence_length / 1e6
)
)
# Check the metadata
for sample_node_id in sparrow_ts.samples():
individual_id = sparrow_ts.node(sample_node_id).individual
population_id = sparrow_ts.node(sample_node_id).population
print(
"Node",
sample_node_id,
"labels a chr24 sampled from individual",
json.loads(sparrow_ts.individual(individual_id).metadata),
"in",
json.loads(sparrow_ts.population(population_id).metadata)["country"],
)
[W::bcf_hrec_check] Missing ID attribute in one or more header lines
[W::bcf_hdr_register_hrec] An INFO field has no Type defined. Assuming String
[W::bcf_hdr_register_hrec] An INFO field has no Number defined. Assuming '.'
/tmp/ipykernel_44000/2373803098.py:26: DeprecationWarning: SampleData is deprecated
with tsinfer.SampleData(
Sample file created for 20 samples (10 individuals) with 13192 variable sites.
Inferred tree sequence `sparrow_ts`: 6751 trees over 7.077728 Mb
Node 0 labels a chr24 sampled from individual {'name': '8934547'} in Norway
Node 1 labels a chr24 sampled from individual {'name': '8934547'} in Norway
Node 2 labels a chr24 sampled from individual {'name': '8L19766'} in Norway
Node 3 labels a chr24 sampled from individual {'name': '8L19766'} in Norway
Node 4 labels a chr24 sampled from individual {'name': '8M31651'} in Norway
Node 5 labels a chr24 sampled from individual {'name': '8M31651'} in Norway
Node 6 labels a chr24 sampled from individual {'name': '8N05890'} in Norway
Node 7 labels a chr24 sampled from individual {'name': '8N05890'} in Norway
Node 8 labels a chr24 sampled from individual {'name': '8N73604'} in Norway
Node 9 labels a chr24 sampled from individual {'name': '8N73604'} in Norway
Node 10 labels a chr24 sampled from individual {'name': 'FR041'} in France
Node 11 labels a chr24 sampled from individual {'name': 'FR041'} in France
Node 12 labels a chr24 sampled from individual {'name': 'FR044'} in France
Node 13 labels a chr24 sampled from individual {'name': 'FR044'} in France
Node 14 labels a chr24 sampled from individual {'name': 'FR046'} in France
Node 15 labels a chr24 sampled from individual {'name': 'FR046'} in France
Node 16 labels a chr24 sampled from individual {'name': 'FR048'} in France
Node 17 labels a chr24 sampled from individual {'name': 'FR048'} in France
Node 18 labels a chr24 sampled from individual {'name': 'FR050'} in France
Node 19 labels a chr24 sampled from individual {'name': 'FR050'} in France
Analysis#
Now analyses can be done with tskit libraries. I can’t show the full tree sequences
for this object, I can focus to a segment however:
colours = {"Norway": "red", "France": "blue"}
colours_for_node = {}
for n in sparrow_ts.samples():
population_data = sparrow_ts.population(sparrow_ts.node(n).population)
colours_for_node[n] = colours[json.loads(population_data.metadata)["country"]]
individual_for_node = {}
for n in sparrow_ts.samples():
individual_data = sparrow_ts.individual(sparrow_ts.node(n).individual)
individual_for_node[n] = json.loads(individual_data.metadata)["name"]
tree = sparrow_ts.at(1e6)
tree.draw(
height=700,
width=1200,
node_labels=individual_for_node,
node_colours=colours_for_node,
)