Getting started with tskit

Getting started with tskit#

This is the step-by-step tutorial found here. Here we generate an alignment using msprime, which is a python package to generate data to be used with tskit stuff

A number of different software programs can generate tree sequences. For the purposes of this tutorial we’ll use msprime to create an example tree sequence representing the genetic genealogy of a 10Mb chromosome in twenty diploid individuals. To make it a bit more interesting, we’ll simulate the effects of a selective sweep in the middle of the chromosome, then throw some neutral mutations onto the resulting tree sequence.

import msprime

pop_size=10_000
seq_length=10_000_000

sweep_model = msprime.SweepGenicSelection(
    position=seq_length/2, start_frequency=0.0001, end_frequency=0.9999, s=0.25, dt=1e-6)

ts = msprime.sim_ancestry(
    20,
    model=[sweep_model, msprime.StandardCoalescent()],
    population_size=pop_size,
    sequence_length=seq_length,
    recombination_rate=1e-8,
    random_seed=1234,  # only needed for repeatabilty
    )
# Optionally add finite-site mutations to the ts using the Jukes & Cantor model, creating SNPs
ts = msprime.sim_mutations(ts, rate=1e-8, random_seed=4321)
ts

Tree Sequence
Trees	11 167
Sequence Length	10 000 000
Time Units	generations
Sample Nodes	40
Total Size	2.4 MiB
Metadata	No Metadata

Table	Rows	Size	Has Metadata
Edges	36 372	1.1 MiB
Individuals	20	584 Bytes
Migrations	0	8 Bytes
Mutations	13 568	490.3 KiB
Nodes	7 342	200.8 KiB
Populations	1	224 Bytes	✅
Provenances	2	1.9 KiB
Sites	13 554	330.9 KiB

Provenance Timestamp	Software Name	Version	Command	Full record
19 December, 2025 at 01:22:45 PM	msprime	1.3.4	sim_mutations	Details dict schema_version: 1.0.0 software: dict name: msprime version: 1.3.4 parameters: dict command: sim_mutations tree_sequence: dict __constant__: __current_ts__ rate: 1e-08 model: None start_time: None end_time: None discrete_genome: None keep: None random_seed: 4321 environment: dict os: dict system: Linux node: node1 release: 5.15.0-58-generic version: #64-Ubuntu SMP Thu Jan 5 11:43:13 UTC 2023 machine: x86_64 python: dict implementation: CPython version: 3.12.12 libraries: dict kastore: dict version: 2.1.1 tskit: dict version: 1.0.0b3 gsl: dict version: 2.6
19 December, 2025 at 01:22:45 PM	msprime	1.3.4	sim_ancestry	Details dict schema_version: 1.0.0 software: dict name: msprime version: 1.3.4 parameters: dict command: sim_ancestry samples: 20 demography: None sequence_length: 10000000 discrete_genome: None recombination_rate: 1e-08 gene_conversion_rate: None gene_conversion_tract_length: None population_size: 10000 ploidy: None model: list dict duration: None position: 5000000.0 start_frequency: 0.0001 end_frequency: 0.9999 s: 0.25 dt: 1e-06 __class__: msprime.ancestry.SweepGenicSel ection dict duration: None __class__: msprime.ancestry.StandardCoale scent initial_state: None start_time: None end_time: None record_migrations: None record_full_arg: None additional_nodes: None coalescing_segments_only: None num_labels: None random_seed: 1234 replicate_index: 0 environment: dict os: dict system: Linux node: node1 release: 5.15.0-58-generic version: #64-Ubuntu SMP Thu Jan 5 11:43:13 UTC 2023 machine: x86_64 python: dict implementation: CPython version: 3.12.12 libraries: dict kastore: dict version: 2.1.1 tskit: dict version: 1.0.0b3 gsl: dict version: 2.6

To cite this software, please consult the citation manual: https://tskit.dev/citation/

We have tousand of trees in ts object. We have 20 dyploid individuals, so 40 nodes (one for genome? have I two genomes per individual as described by the tutorial?)

Processing trees#

Iterate over the trees with the trees() method:

for tree in ts.trees():
    print(f"Tree {tree.index} covers {tree.interval}")
    if tree.index >= 4:
        print("...")
        break
print(f"Tree {ts.last().index} covers {ts.last().interval}")

Tree 0 covers Interval(left=0.0, right=661.0)
Tree 1 covers Interval(left=661.0, right=3116.0)
Tree 2 covers Interval(left=3116.0, right=4451.0)
Tree 3 covers Interval(left=4451.0, right=4673.0)
Tree 4 covers Interval(left=4673.0, right=5020.0)
...
Tree 11166 covers Interval(left=9999635.0, right=10000000.0)

There are also last() and first() methods to access to the last and first trees respectively. Check if trees coalesce (not always true for forward simulations)

import time
elapsed = time.time()
for tree in ts.trees():
    if tree.has_multiple_roots:
        print("Tree {tree.index} has not coalesced")
        break
else:
    elapsed = time.time() - elapsed
    print(f"All {ts.num_trees} trees coalesced")
    print(f"Checked in {elapsed:.6g} secs")

All 11167 trees coalesced
Checked in 0.0111287 secs

Now that we know all trees have coalesced, we know that at each position in the genome all the 40 sample nodes must have one most recent common ancestor (MRCA). Below, we iterate over the trees, finding the IDs of the root (MRCA) node for e ach tree. The time of this root node can be found via the tskit.TreeSequence.node() method, which returns a Node object whose attributes include the node time:

import matplotlib.pyplot as plt

kb = [0]  # Starting genomic position
mrca_time = []
for tree in ts.trees():
    kb.append(tree.interval.right/1000)  # convert to kb
    mrca = ts.node(tree.root)  # For msprime tree sequences, the root node is the MRCA
    mrca_time.append(mrca.time)
plt.stairs(mrca_time, kb, baseline=None)
plt.xlabel("Genome position (kb)")
plt.ylabel("Time of root (or MRCA) in generations")
plt.yscale("log")
plt.show()

../_images/a90a6333ae79ed13976089ee7bc22263ff981cadd5ffcce8fde4eed734da39a3.png

It’s obvious that there’s something unusual about the trees in the middle of this chromosome, where the selective sweep occurred.

Although tskit is designed so that is it rapid to pass through trees sequentially, it is also possible to pull out individual trees from the middle of a tree sequence via the TreeSequence.at() method. Here’s how you can use that to extract the tree at location - the position of the sweep - and draw it using the Tree.draw_svg() method:

swept_tree = ts.at(5_000_000)  # or you can get e.g. the nth tree using ts.at_index(n)
intvl = swept_tree.interval
print(f"Tree number {swept_tree.index}, which runs from position {intvl.left} to {intvl.right}:")
# Draw it at a wide size, to make room for all 40 tips
swept_tree.draw_svg(size=(1000, 200))

Tree number 5382, which runs from position 4998293.0 to 5033047.0:

../_images/a0acd9d762adf650ec5e6e0bdf5804de279e51072e2748c04d302781b578e5c1.svg

This tree shows the classic signature of a recent expansion or selection event, with many long terminal branches, resulting in an excess of singleton mutations.

It can often be helpful to slim down a tree sequence so that it represents the genealogy of a smaller subset of the original samples. This can be done using the powerful TreeSequence.simplify() method.

The TreeSequence.draw_svg() method allows us to draw more than one tree: either the entire tree sequence, or (by using the x_lim parameter) a smaller region of the genome:

reduced_ts = ts.simplify([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])  # simplify to the first 10 samples
print("Genealogy of the first 10 samples for the first 5kb of the genome")
reduced_ts.draw_svg(x_lim=(0, 5000))

Genealogy of the first 10 samples for the first 5kb of the genome

../_images/4d459c04e29dd6ba239c8e274402ce8b6102c228f3d9a7704beeed55c08d83ec.svg

These are much more standard-looking coalescent trees, with far longer branches higher up in the tree, and therefore many more mutations at higher-frequencies.

In this tutorial we refer to objects, such as sample nodes, by their numerical IDs. These can change after simplification, and it is often more meaningful to work with metadata, such as sample and population names, which can be permanently attached to objects in the tree sequence. Such metadata is often incorporated automatically by the tools generating the tree sequence.

Processing sites and mutations#

For many purposes it may be better to focus on the genealogy of your samples, rather than the sites and mutations that define the genome sequence itself. Nevertheless, tskit also provides efficient ways to return Site object and Mutation objects from a tree sequence. For instance, under the finite sites model of mutation that we used above, multiple mutations can occur at some sites, and we can identify them by iterating over the sites using the TreeSequence.sites() method:

import numpy as np

num_muts = np.zeros(ts.num_sites, dtype=int)

for site in ts.sites():
    num_muts[site.id] = len(site.mutations)  # site.mutations is a list of mutations at the site

# Print out some info about mutations per site
for nmuts, count in enumerate(np.bincount(num_muts)):
    info = f"{count} sites"
    if nmuts > 1:
        info += f", with IDs {np.where(num_muts==nmuts)[0]},"
    print(info, f"have {nmuts} mutation" + ("s" if nmuts != 1 else ""))

sites have 0 mutations
sites have 1 mutation
sites, with IDs [ 2275  2287  2658  3789  4043  9694 11023 11780 11853 11890 11999 12023
12310], have 2 mutations

Processing genotypes#

At each site, the sample nodes will have a particular allelic state (or be flagged as Missing data). The TreeSequence.variants() method gives access to the full variation data. For efficiency, the genotypes at a site are returned as a numpy array of integers:

np.set_printoptions(linewidth=200)  # print genotypes on a single line

print("Genotypes")
for v in ts.variants():
    print(f"Site {v.site.id}: {v.genotypes}")
    if v.site.id >= 4:  # only print up to site ID 4
        print("...")
        break

Genotypes
Site 0: [0 0 1 1 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 0 0 1 0 1 0 0 0 0 0]
Site 1: [1 1 0 0 1 1 0 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 0 0 1 0 1 1 1 1 1 0 1 0 1 0 0 1 1]
Site 2: [0 0 1 1 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 0 0 1 0 1 0 0 0 0 0]
Site 3: [0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
Site 4: [0 0 1 1 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 0 0 1 0 1 0 0 0 0 0]
...

Tree sequences are optimised to look at all samples at one site, then all samples at an adjacent site, and so on along the genome. It is much less efficient look at all the sites for a single sample, then all the sites for the next sample, etc. In other words, you should generally iterate over sites, not samples. Nevertheless, all the alleles for a single sample can be obtained via the TreeSequence.haplotypes() method.

To find the actual allelic states at a site, you can refer to the alleles provided for each Variant: the genotype value is an index into this list. Here’s one way to print them out; for clarity this example also prints out the IDs of both the sample nodes (i.e. the genomes) and the diploid individuals in which each sample node resides.

samp_ids = ts.samples()

print("  ID of diploid individual: ", " ".join([f"{ts.node(s).individual:3}" for s in samp_ids]))
print("       ID of (sample) node: ", " ".join([f"{s:3}" for s in samp_ids]))

for v in ts.variants():
    site = v.site
    alleles = np.array(v.alleles)
    print(f"Site {site.id} (ancestral state '{site.ancestral_state}')",  alleles[v.genotypes])
    if site.id >= 4:  # only print up to site ID 4
        print("...")
        break

  ID of diploid individual:    0   0   1   1   2   2   3   3   4   4   5   5   6   6   7   7   8   8   9   9  10  10  11  11  12  12  13  13  14  14  15  15  16  16  17  17  18  18  19  19
       ID of (sample) node:    0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18  19  20  21  22  23  24  25  26  27  28  29  30  31  32  33  34  35  36  37  38  39
Site 0 (ancestral state 'G') ['G' 'G' 'T' 'T' 'G' 'G' 'T' 'G' 'G' 'G' 'G' 'G' 'G' 'G' 'T' 'G' 'G' 'G' 'G' 'G' 'G' 'G' 'G' 'T' 'G' 'G' 'T' 'G' 'G' 'G' 'G' 'G' 'T' 'G' 'T' 'G' 'G' 'G' 'G' 'G']
Site 1 (ancestral state 'T') ['G' 'G' 'T' 'T' 'G' 'G' 'T' 'G' 'G' 'G' 'G' 'G' 'G' 'G' 'T' 'G' 'G' 'G' 'G' 'G' 'G' 'G' 'G' 'T' 'T' 'G' 'T' 'G' 'G' 'G' 'G' 'G' 'T' 'G' 'T' 'G' 'T' 'T' 'G' 'G']
Site 2 (ancestral state 'T') ['T' 'T' 'C' 'C' 'T' 'T' 'C' 'T' 'T' 'T' 'T' 'T' 'T' 'T' 'C' 'T' 'T' 'T' 'T' 'T' 'T' 'T' 'T' 'C' 'T' 'T' 'C' 'T' 'T' 'T' 'T' 'T' 'C' 'T' 'C' 'T' 'T' 'T' 'T' 'T']
Site 3 (ancestral state 'T') ['T' 'T' 'T' 'T' 'T' 'T' 'T' 'T' 'T' 'T' 'T' 'T' 'T' 'T' 'A' 'T' 'T' 'T' 'T' 'T' 'T' 'T' 'T' 'T' 'T' 'T' 'T' 'T' 'T' 'T' 'T' 'T' 'T' 'T' 'T' 'T' 'T' 'T' 'T' 'T']
Site 4 (ancestral state 'T') ['T' 'T' 'A' 'A' 'T' 'T' 'A' 'T' 'T' 'T' 'T' 'T' 'T' 'T' 'A' 'T' 'T' 'T' 'T' 'T' 'T' 'T' 'T' 'A' 'T' 'T' 'A' 'T' 'T' 'T' 'T' 'T' 'A' 'T' 'A' 'T' 'T' 'T' 'T' 'T']
...

Getting started with tskit

Contents

Getting started with tskit#

Processing trees#

Processing sites and mutations#

Processing genotypes#