# Dataset for TS inference

Diploid, phased VCF files with varying numbers of individuals. All are BGZIP-compressed and indexed with CSI. A key to the populations is in the file `popKey`.

These files are intended for tree sequence (TS) inference with SINGER, threads, and tsinfer/tsdate. For each file, the REF allele can be assumed to be ancestral; the imputed files with `Maj` in their name instead have the REF allele set to the major allele, in order to assess the effect of wrongly specified ancestral alleles.
**Complete VCF**

- `tsm100M300I.vcf.gz` (exported from msprime with REF allele = ancestral)

**Array-density files**

- `ts300I2k.vcf.gz`
- `ts300I25k.vcf.gz`

**Imputed datasets for TS inference**

- `panelLarge25kImputed.vcf.gz`
- `panelLarge25kMajImputed.vcf.gz`
- `panelLarge2kImputed.vcf.gz`
- `panelLarge2kMajImputed.vcf.gz`
- `panelNoBLarge25kImputed.vcf.gz`
- `panelNoBLarge25kMajImputed.vcf.gz`
- `panelNoBLarge2kImputed.vcf.gz`
- `panelNoBLarge2kMajImputed.vcf.gz`
- `panelSmall25kImputed.vcf.gz`
- `panelSmall25kMajImputed.vcf.gz`
- `panelSmall2kImputed.vcf.gz`
- `panelSmall2kMajImputed.vcf.gz`
- `panelNoBSmall25kImputed.vcf.gz`
- `panelNoBSmall25kMajImputed.vcf.gz`
- `panelNoBSmall2kImputed.vcf.gz`
- `panelNoBSmall2kMajImputed.vcf.gz`
## Fix file names

The files have been downloaded and extracted into the `data/toInfer/` folder, but their names contain a duplicated ".vcf" extension. You can fix this by creating a simple script like this in the `data/toInfer/` folder:
```bash
#!/bin/bash
# Replace the duplicated .vcf.vcf extension with .vcf,
# e.g. panelLarge25kImputed.vcf.vcf.gz -> panelLarge25kImputed.vcf.gz
set -euo pipefail
shopt -s nullglob  # skip the loop entirely if nothing matches
for file in *.vcf.vcf.*; do
    mv "$file" "${file/.vcf.vcf/.vcf}"
done
```
Make the script executable with `chmod +x fix_filenames.sh` and run it with
`./fix_filenames.sh`.
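If you want to try the renaming logic safely first, the same substitution can be exercised on throwaway files in a scratch directory (the file names below are just examples):

```bash
# Demo of the .vcf.vcf -> .vcf substitution on dummy files in a
# scratch directory, leaving the real data untouched.
workdir=$(mktemp -d)
cd "$workdir"
touch panelLarge25kImputed.vcf.vcf.gz ts300I2k.vcf.vcf.gz.csi
for file in *.vcf.vcf.*; do
    mv "$file" "${file/.vcf.vcf/.vcf}"
done
ls
```

Note that the glob also matches index files such as `.csi`, so their names are fixed in the same pass.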
## Create a TSV file with sample information

These files contain 1600 individuals sampled from the original 2405 individuals in the msprime simulation. We need to create a TSV file with sample information using FID and IID columns: you can create it with a utility script in this folder:
```bash
python scripts/createFID-IID.py \
    --indiv-list data/toInfer/popKey \
    --directory data/toInfer
```
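For reference, the resulting table pairs a population/family ID (FID) with each individual ID (IID). A minimal sketch of the same transformation with `awk`, assuming a popKey-like file with individual and population columns (the exact `popKey` format is an assumption; the real table is produced by `scripts/createFID-IID.py`):

```bash
# ASSUMPTION: popKey has two tab-separated columns, individual and population.
tmp=$(mktemp -d)
printf 'ind1\tpopA\nind2\tpopA\nind3\tpopB\n' > "$tmp/popKey.example"
{
    printf 'FID\tIID\n'                        # header row
    awk -F'\t' '{ print $2 "\t" $1 }' "$tmp/popKey.example"
} > "$tmp/samples.example.tsv"
cat "$tmp/samples.example.tsv"
```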
## Create a fake FASTA file

We need to create a fake FASTA in order to exploit `bcftools reheader` to add contig length information to the VCF files: this is required by the `create_tstree` script installed in this project (and used by the cnr-ibba/nf-treeseq Nextflow pipeline). Again, this can be done with a utility script in this folder:
```bash
python scripts/fakeFastaFromVCF.py \
    --vcf data/toInfer/tsm100M300I.vcf.gz \
    --output data/toInfer/tsm100M300I.fa.gz
```
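For reference, the reheader step that consumes this fake FASTA can be sketched manually (a sketch only, not a pipeline replacement; it assumes `samtools` and `bcftools` are installed and that the fake FASTA is bgzip-compressed so it can be indexed):

```bash
# Index the fake FASTA to produce a .fai file with contig names and lengths
samtools faidx data/toInfer/tsm100M300I.fa.gz
# Rewrite the VCF header, taking ##contig lines from the .fai
bcftools reheader --fai data/toInfer/tsm100M300I.fa.gz.fai \
    --output data/toInfer/tsm100M300I.contigs.vcf.gz \
    data/toInfer/tsm100M300I.vcf.gz
```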
## Call nextflow pipeline (tskit reference approach)

You need two Nextflow configuration files for running the cnr-ibba/nf-treeseq pipeline: see `config/samples_toInfer.csv` and `config/samples_toInfer-reference.json` for how to set them up. Then you can call the pipeline with:
```bash
nextflow run cnr-ibba/nf-treeseq -r v0.3.0 \
    -config config/custom.config -profile ibba,core -resume \
    -params-file config/samples_toInfer-reference.json
```
## Call nextflow pipeline (threads approach)

You also need two configuration files for running the cnr-ibba/nf-treeseq pipeline with the threads approach: the `config/samples_toInfer.csv` file is the same as before, but you need to create a new JSON configuration file (see `config/samples_toInfer-threads.json` for how to set it up). You can call the pipeline with:
```bash
nextflow run cnr-ibba/nf-treeseq -r v0.3.0 \
    -config config/custom.config -profile ibba,core -resume \
    -params-file config/samples_toInfer-threads.json
```
## Call nextflow pipeline (threads with fit to data)

You need to create another JSON configuration file for running the cnr-ibba/nf-treeseq pipeline with threads and the fit-to-data option enabled: see `config/samples_toInfer-threads-fit.json` for how to set it up. You can call the pipeline with:
```bash
nextflow run cnr-ibba/nf-treeseq -r v0.3.0 \
    -config config/custom.config -profile ibba,core -resume \
    -params-file config/samples_toInfer-threads-fit.json
```
> **Warning:** The `threads_fit_to_data` option requires more resources (RAM and time)
> than the other pipeline runs, so make sure you have enough resources allocated.
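If runs fail with out-of-memory or timeout errors, per-process limits can be raised in `config/custom.config`. A hypothetical sketch using Nextflow's `process` configuration scope (the values are guesses to adapt, not taken from this repository):

```groovy
// Hypothetical additions to config/custom.config: raise default
// limits for the heavier threads_fit_to_data run.
process {
    memory = 64.GB
    time   = 72.h
}
```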
> **Error:** It was not possible to run the fit-to-data option on the smallest dataset (`ts300I2k`)
> using the current parameter set.