Nucleotide diversity

Nucleotide diversity#

Using REF as ancestral alleles#

We are wondering how the ancestral allele inference effect the final treesequence object. We are testing a tstree object created by imposing the reference allele as the ancestral allele. The effect is that there are more or less the same mutations as the number of variants (while in the case of the ancestrall allele calculated with est-sfs we can see million of mutations). Let’s start by loading data for this REF-based treeseq object

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

import tskit

from tskitetude import get_project_dir
from tskitetude.helper import create_windows

ts_reference = tskit.load(str(get_project_dir() / "results-reference/background_samples/tsinfer/SMARTER-OA-OAR3-forward-0.4.10.focal.26.trees"))
ts_reference

Tree Sequence
Trees	659
Sequence Length	44 077 779
Time Units	generations
Sample Nodes	11 478
Total Size	87.9 MiB
Metadata	dict

Table	Rows	Size	Has Metadata
Edges	2 266 939	69.2 MiB
Individuals	5 739	353.1 KiB	✅
Migrations	0	8 Bytes
Mutations	657	23.8 KiB
Nodes	21 091	1.1 MiB	✅
Populations	202	4.7 KiB	✅
Provenances	4	2.4 KiB
Sites	657	32.7 KiB	✅

Provenance Timestamp	Software Name	Version	Command	Full record
06 September, 2024 at 01:19:45 AM	tsdate	0.1.7	inside_outside	Details dict schema_version: 1.0.0 software: dict name: tsdate version: 0.1.7 parameters: dict mutation_rate: 1e-08 recombination_rate: None time_units: None progress: None population_size: 10000.0 eps: 1e-06 outside_standardize: True ignore_oldest_root: False probability_space: logarithmic num_threads: None cache_inside: False command: inside_outside environment: dict os: dict system: Linux node: node1 release: 5.15.0-58-generic version: #64-Ubuntu SMP Thu Jan 5 11:43:13 UTC 2023 machine: x86_64 python: dict implementation: CPython version: 3.11.9 libraries: dict tskit: dict version: 0.5.8
06 September, 2024 at 01:13:11 AM	tsdate	0.1.7	preprocess_ts	Details dict schema_version: 1.0.0 software: dict name: tsdate version: 0.1.7 parameters: dict minimum_gap: 1000000 remove_telomeres: True filter_populations: False filter_individuals: False filter_sites: False delete_intervals: list list 0 209048.0 list 44004282.0 44077779.0 command: preprocess_ts environment: dict os: dict system: Linux node: node1 release: 5.15.0-58-generic version: #64-Ubuntu SMP Thu Jan 5 11:43:13 UTC 2023 machine: x86_64 python: dict implementation: CPython version: 3.11.9 libraries: dict tskit: dict version: 0.5.8
06 September, 2024 at 01:13:05 AM	tskit	0.5.8	simplify	Details dict schema_version: 1.0.0 software: dict name: tskit version: 0.5.8 parameters: dict command: simplify TODO: add simplify parameters environment: dict os: dict system: Linux node: node1 release: 5.15.0-58-generic version: #64-Ubuntu SMP Thu Jan 5 11:43:13 UTC 2023 machine: x86_64 python: dict implementation: CPython version: 3.11.9 libraries: dict kastore: dict version: 2.1.1
06 September, 2024 at 01:13:01 AM	tsinfer	0.3.3	infer	Details dict schema_version: 1.0.0 software: dict name: tsinfer version: 0.3.3 parameters: dict mismatch_ratio: None command: infer environment: dict libraries: dict zarr: dict version: 2.17.2 numcodecs: dict version: 0.12.1 lmdb: dict version: 1.4.1 tskit: dict version: 0.5.8 os: dict system: Linux node: node1 release: 5.15.0-58-generic version: #64-Ubuntu SMP Thu Jan 5 11:43:13 UTC 2023 machine: x86_64 python: dict implementation: CPython version: list 3 11 9

To cite this software, please consult the citation manual: https://tskit.dev/citation/

I need to calculate nucleotide diversity per site. The only way to do this seems to be calculating windows containing the SNPs an then calculating the nucleotide diversity with the tskit.TreeSequence.diversity function. I’ve created a function create_windows in helper module:

# the last index is a simply a 2 step starting from position 1
reference_diversity = ts_reference.diversity(windows=create_windows(ts_reference))[1::2]
reference_diversity[:10]

array([0.42881531, 0.48983504, 0.49397942, 0.27006834, 0.45935871,
       0.24743806, 0.14849395, 0.3905915 , 0.46073859, 0.4897602 ])

Now let’s compare the nucleotide diversity calculated using vcftools: here’s the command line to calculate nucleotide diversity per site:

cd results-reference/background_samples/focal
vcftools --gzvcf SMARTER-OA-OAR3-forward-0.4.10.focal.26.vcf.gz --out allsamples_pi --site-pi

The allsamples_pi.sites.pi is a TSV file with the positions and the nucleotide diversity. Read it with pandas:

vcftools_diversity = pd.read_csv(get_project_dir() / "results-reference/background_samples/focal/allsamples_pi.sites.pi", sep="\t")
vcftools_diversity.head()

	CHROM	POS	PI
0	26	209049	0.428815
1	26	268822	0.489835
2	26	285471	0.493979
3	26	405330	0.270068
4	26	528308	0.459359

Are this values similar?

np.isclose(reference_diversity, vcftools_diversity["PI"], atol=1e-6).all()

np.True_

Calculate diversity using branch:

# the last index is a simply a 2 step starting from position 1
reference_diversity_branch = ts_reference.diversity(mode='branch', windows=create_windows(ts_reference))[1::2]
reference_diversity_branch[:10]

array([24.21275953, 24.51004717, 20.56900189, 20.18076343, 23.28547977,
       17.66467911, 37.04506148, 16.55024199, 63.8544266 , 15.75780889])

Try to plot the tow different diversities with vcftools output:

plt.scatter(reference_diversity, vcftools_diversity["PI"])

<matplotlib.collections.PathCollection at 0x7f6f3f007230>

../_images/9a48b0e49f6d9ee0d1a08cab21a293bf7b37855f776249c5a13a9a9b6ad34192.png

plt.scatter(reference_diversity_branch, vcftools_diversity["PI"])
plt.xlim(0, 1000)

(0.0, 1000.0)

../_images/fb8e7050c0c78f1c7203bde327137689f93c8bc78e25dda46b7a7c82239ffb0f.png

EST-SFS output as ancestral alleles#

Can we calculate nucleotide diversity using the tree files generated by the pipeline using the est-sfs output as ancestral alleles?

ts_estsfs = tskit.load(str(get_project_dir() / "results-estsfs/background_samples/tsinfer/SMARTER-OA-OAR3-forward-0.4.10.focal.26.trees"))
ts_estsfs

Tree Sequence
Trees	539
Sequence Length	44 077 779
Time Units	generations
Sample Nodes	11 478
Total Size	87.5 MiB
Metadata	dict

Table	Rows	Size	Has Metadata
Edges	1 962 608	59.9 MiB
Individuals	5 739	353.1 KiB	✅
Migrations	0	8 Bytes
Mutations	321 178	11.3 MiB
Nodes	19 785	977.3 KiB	✅
Populations	202	4.7 KiB	✅
Provenances	4	2.4 KiB
Sites	657	33.3 KiB	✅

Provenance Timestamp	Software Name	Version	Command	Full record
05 September, 2024 at 11:56:43 PM	tsdate	0.1.7	inside_outside	Details dict schema_version: 1.0.0 software: dict name: tsdate version: 0.1.7 parameters: dict mutation_rate: 1e-08 recombination_rate: None time_units: None progress: None population_size: 10000.0 eps: 1e-06 outside_standardize: True ignore_oldest_root: False probability_space: logarithmic num_threads: None cache_inside: False command: inside_outside environment: dict os: dict system: Linux node: node2 release: 5.15.0-58-generic version: #64-Ubuntu SMP Thu Jan 5 11:43:13 UTC 2023 machine: x86_64 python: dict implementation: CPython version: 3.11.9 libraries: dict tskit: dict version: 0.5.8
05 September, 2024 at 11:49:51 PM	tsdate	0.1.7	preprocess_ts	Details dict schema_version: 1.0.0 software: dict name: tsdate version: 0.1.7 parameters: dict minimum_gap: 1000000 remove_telomeres: True filter_populations: False filter_individuals: False filter_sites: False delete_intervals: list list 0 209048.0 list 44004282.0 44077779.0 command: preprocess_ts environment: dict os: dict system: Linux node: node2 release: 5.15.0-58-generic version: #64-Ubuntu SMP Thu Jan 5 11:43:13 UTC 2023 machine: x86_64 python: dict implementation: CPython version: 3.11.9 libraries: dict tskit: dict version: 0.5.8
05 September, 2024 at 11:49:46 PM	tskit	0.5.8	simplify	Details dict schema_version: 1.0.0 software: dict name: tskit version: 0.5.8 parameters: dict command: simplify TODO: add simplify parameters environment: dict os: dict system: Linux node: node2 release: 5.15.0-58-generic version: #64-Ubuntu SMP Thu Jan 5 11:43:13 UTC 2023 machine: x86_64 python: dict implementation: CPython version: 3.11.9 libraries: dict kastore: dict version: 2.1.1
05 September, 2024 at 11:49:43 PM	tsinfer	0.3.3	infer	Details dict schema_version: 1.0.0 software: dict name: tsinfer version: 0.3.3 parameters: dict mismatch_ratio: None command: infer environment: dict libraries: dict zarr: dict version: 2.17.2 numcodecs: dict version: 0.12.1 lmdb: dict version: 1.4.1 tskit: dict version: 0.5.8 os: dict system: Linux node: node2 release: 5.15.0-58-generic version: #64-Ubuntu SMP Thu Jan 5 11:43:13 UTC 2023 machine: x86_64 python: dict implementation: CPython version: list 3 11 9

To cite this software, please consult the citation manual: https://tskit.dev/citation/

Same stuff as before

# the last index is a simply a 2 step starting from position 1
estsfs_diversity = ts_estsfs.diversity(windows=create_windows(ts_estsfs))[1::2]
estsfs_diversity[:10]

array([0.42881531, 0.48983504, 0.49397942, 0.27006834, 0.45935871,
       0.24743806, 0.14849395, 0.3905915 , 0.46073859, 0.4897602 ])

Are this values similar to the values calculated using VCFtools?

np.isclose(estsfs_diversity, vcftools_diversity["PI"], atol=1e-6).all()

np.True_

So nucleotide diversity is the same in both cases (using the REF as ancestral allele and using the est-sfs output as ancestral allele). Calculate diversity using branch:

# the last index is a simply a 2 step starting from position 1
estsfs_diversity_branch = ts_estsfs.diversity(mode='branch', windows=create_windows(ts_estsfs))[1::2]
estsfs_diversity_branch[:10]

array([374.15499399, 374.15499399, 374.15499399, 308.89499625,
       312.0500738 , 314.27230249, 292.80638906, 294.28489121,
       482.83633406, 482.83633406])

Try to plot the tow different diversities with vcftools output:

plt.scatter(estsfs_diversity, vcftools_diversity["PI"])

<matplotlib.collections.PathCollection at 0x7f6f3d5cde50>

plt.scatter(estsfs_diversity_branch, vcftools_diversity["PI"])
plt.xlim(0, 1000)

(0.0, 1000.0)

../_images/40b4fcd566130f115575adb09827368a43e465e2a9dbd7dcf89ab76c2d254cb9.png

Compara output as ancestral alleles#

Calculate nucleotide diversity using the tree files generated by the pipeline using the compara approach as ancestral alleles.

ts_compara = tskit.load(str(get_project_dir() / "results-compara/background_samples/tsinfer/SMARTER-OA-OAR3-forward-0.4.10.focal.26.trees"))
ts_compara

Tree Sequence
Trees	496
Sequence Length	44 077 779
Time Units	generations
Sample Nodes	11 478
Total Size	89.9 MiB
Metadata	dict

Table	Rows	Size	Has Metadata
Edges	1 838 362	56.1 MiB
Individuals	5 739	353.1 KiB	✅
Migrations	0	8 Bytes
Mutations	524 747	18.5 MiB
Nodes	19 188	932.4 KiB	✅
Populations	202	4.7 KiB	✅
Provenances	4	2.4 KiB
Sites	657	33.5 KiB	✅

Provenance Timestamp	Software Name	Version	Command	Full record
06 September, 2024 at 02:30:20 AM	tsdate	0.1.7	inside_outside	Details dict schema_version: 1.0.0 software: dict name: tsdate version: 0.1.7 parameters: dict mutation_rate: 1e-08 recombination_rate: None time_units: None progress: None population_size: 10000.0 eps: 1e-06 outside_standardize: True ignore_oldest_root: False probability_space: logarithmic num_threads: None cache_inside: False command: inside_outside environment: dict os: dict system: Linux node: node1 release: 5.15.0-58-generic version: #64-Ubuntu SMP Thu Jan 5 11:43:13 UTC 2023 machine: x86_64 python: dict implementation: CPython version: 3.11.9 libraries: dict tskit: dict version: 0.5.8
06 September, 2024 at 02:25:05 AM	tsdate	0.1.7	preprocess_ts	Details dict schema_version: 1.0.0 software: dict name: tsdate version: 0.1.7 parameters: dict minimum_gap: 1000000 remove_telomeres: True filter_populations: False filter_individuals: False filter_sites: False delete_intervals: list list 0 209048.0 list 44004282.0 44077779.0 command: preprocess_ts environment: dict os: dict system: Linux node: node1 release: 5.15.0-58-generic version: #64-Ubuntu SMP Thu Jan 5 11:43:13 UTC 2023 machine: x86_64 python: dict implementation: CPython version: 3.11.9 libraries: dict tskit: dict version: 0.5.8
06 September, 2024 at 02:25:01 AM	tskit	0.5.8	simplify	Details dict schema_version: 1.0.0 software: dict name: tskit version: 0.5.8 parameters: dict command: simplify TODO: add simplify parameters environment: dict os: dict system: Linux node: node1 release: 5.15.0-58-generic version: #64-Ubuntu SMP Thu Jan 5 11:43:13 UTC 2023 machine: x86_64 python: dict implementation: CPython version: 3.11.9 libraries: dict kastore: dict version: 2.1.1
06 September, 2024 at 02:25:00 AM	tsinfer	0.3.3	infer	Details dict schema_version: 1.0.0 software: dict name: tsinfer version: 0.3.3 parameters: dict mismatch_ratio: None command: infer environment: dict libraries: dict zarr: dict version: 2.17.2 numcodecs: dict version: 0.12.1 lmdb: dict version: 1.4.1 tskit: dict version: 0.5.8 os: dict system: Linux node: node1 release: 5.15.0-58-generic version: #64-Ubuntu SMP Thu Jan 5 11:43:13 UTC 2023 machine: x86_64 python: dict implementation: CPython version: list 3 11 9

To cite this software, please consult the citation manual: https://tskit.dev/citation/

Calculate nucleotide diversity as before:

# the last index is a simply a 2 step starting from position 1
compara_diversity = ts_compara.diversity(windows=create_windows(ts_compara))[1::2]
compara_diversity[:10]

array([0.42881531, 0.48983504, 0.49397942, 0.27006834, 0.45935871,
       0.24743806, 0.14849395, 0.3905915 , 0.46073859, 0.4897602 ])

Are this values similar to the values calculated using VCFtools?

np.isclose(compara_diversity, vcftools_diversity["PI"], atol=1e-6).all()

np.True_

So nucleotide diversity is the same in both cases (using the REF as ancestral allele and using the compara as ancestral allele). Calculate diversity using branch:

# the last index is a simply a 2 step starting from position 1
compara_diversity_branch = ts_compara.diversity(mode='branch', windows=create_windows(ts_compara))[1::2]
compara_diversity_branch[:10]

array([480.63266349, 363.72632856, 356.52374401, 376.91612567,
       381.58518179, 415.57735315, 419.10475683, 419.5249502 ,
       419.5249502 , 419.5249502 ])

Try to plot the tow different diversities with vcftools output:

plt.scatter(compara_diversity, vcftools_diversity["PI"])

<matplotlib.collections.PathCollection at 0x7f6f3ec6e5d0>

plt.scatter(compara_diversity_branch, vcftools_diversity["PI"])
plt.xlim(0, 1000)

(0.0, 1000.0)

../_images/1bfc1de96c6e6def34f05eb3392f763f54495246b1017e6961bf837cf98e81f4.png

Nucleotide diversity

Contents

Nucleotide diversity#

Using REF as ancestral alleles#

EST-SFS output as ancestral alleles#

Compara output as ancestral alleles#