To follow this and other practicals, a working *NIX environment is required. We suggest to use a Linux distribution, MacOS or Windows with WSL2.
If you don’t have a working environment, you can use GitPod, a cloud-based IDE that allows you to work on your projects from anywhere. We suggest:
Go to https://www.gitpod.io/ and select the Try for free button
Add your linkedin account to get 50 hours of free usage per month
See your resource usage at https://gitpod.io/billing
Click here to create a new workspace for this project!
Take a look at VS Code tutorial, for example this one

You can install the GitPod CLI to manage your workspaces from the terminal
/workspace directory are preserved.For scientific computing, a command-line interface (CLI) is often essential. This means typing out commands instead of using a graphical user interface (GUI).
The filesystem is organized in a tree-like structure, with the root directory / at the top.
/ is the root directory./home contains user directories./usr contains user programs./bin contains essential binaries./etc contains system configuration files./var contains variable data./tmp contains temporary files./workspace specific to GitPod.
/.
/, /home/user/file.txt, /home/user/data/.file.txt, user/data/file.txt.. refers to the current directory (e.g., ./file.txt).. refers to the parent directory (e.g., ../file.txt). Can be chained multiple times (e.g., ../../file.txt)Here are some basic commands to navigate the filesystem: each command can accept a path and additional option(s) as an argument. The general rule is:
command [option(s)] [path]
pwd: Print current working directory (absolute path).ls: List files and directories.cd: Change directory.mkdir: Make directory.rmdir: Remove directory (safer).touch: Create an empty file - set the timestamp of a file to the current time.rm: Remove files.mv: Move or rename files.cp: Copy files.ls useful optionsls shows files and directories in the current directory. You can provide a path to list files in a different directory. Here are some useful options:
ls -l: List files with details.ls -a: List all files, including hidden ones.ls -lh: List files with human-readable sizes.ls -t: List files sorted by modification time.ls -S: List files sorted by size.ls -R: List files recursively.ls -r: List sorted files in reverse order.ls -1: List files in a single column.You can combine options, e.g., ls -lhSr /home/user/data to list files in /home/user/data folder with human-readable sizes ordered by size in ascending order (bigger files on bottom).
cd: (without any arguments) change to your home directorycd -: change to the previous directorymkdir -p: create a directory with its parents if they do not existrm -r: remove directories and their contents recursively (use with caution)rm -i: prompt before removing filesmv -i: prompt before overwriting filescp -r: copy directories and their contents recursivelycp -i: prompt before overwriting filesCan you guess the difference between rm -r and rmdir?
mydirmydir directorymyfile.txtmyfile.txt to the home directorymydir directoryWildcards are characters that help match file names based on patterns. Ex:
*: Matches any number of characters (file* matches file1, file2, fileverylong etc.)?: Matches a single character (file? matches file1, file2, but not fileverylong)[ ]: Matches any character within the brackets (file[12] matches file1, file2, but not file3 or fileverylong. file[1-9] to match file with any digit){ }: Matches any of the comma-separated word (file{1,2,verylong} matches file1, file2 and fileverylong)Some characters have special meanings in the shell:
~: expands to the home directory (cd ~, cd ~/data)$: refers to an environment variable (echo $HOME, echo $PWD);: separates commands in one line (execute commands with this order e.g. cd /tmp; ls first change to /tmp then list files)\: escapes the next character (ls file\ with\ spaces.txt)': preserves the literal value of all characters enclosed (echo 'Today is $(date)' will print Today is $(date))": preserves the literal value of all characters enclosed, but allow for variable expansion, command substitution, and escape sequences (echo "Today is $(date)" will print Today is <current date>)#: comments the rest of the line (# this is a comment). Not executed.Each file has three types of permissions: read, write, and execute. These permissions are set for three types of users: owner, group, and others.
r: read permissionw: write permissionx: execute permission-: no permissionYou can inspect permissions using ls -l, near the file name. For example, rwxr-xr-- means:
You can change file permissions using chmod command. The general syntax is:
chmod [options] mode file
where mode can be:
u: user (owner)g: groupo: othersa: all (u, g, o)+: add permission-: remove permission=: set permissionFor example, to give execute permission to the owner of a file:
chmod u+x file
Every process in Unix has three standard streams:
By default, stdin is the keyboard, and stdout and stderr are the terminal. You can redirect these streams:
>: Redirect stdout to a file (ls > files.txt)>>: Append stdout to a file (ls >> files.txt)2>: Redirect stderr to a file (ls non_existent_file 2> errors.txt)&>: Redirect both stdout and stderr to a file (ls non_existent_file &> output.txt). Its equivalent to > output.txt 2>&1<: Redirect stdin from a file (wc -l < files.txt)Pipes (|) connect the stdout of one command to the stdin of another. For example:
ls -l | wc -l
This command lists files in the current directory and counts the number of lines in the output.
You can chain multiple commands using pipes:
ls -l | grep myfile | wc -l
This command lists files in the current directory, filters lines containing myfile, and counts the number of lines.
Environment variables are key-value pairs that store information about the environment. Some common environment variables:
HOME: Home directoryPATH: List of directories to search for executable filesPWD: Present working directoryOLDPWD: Previous working directoryUSER: Current userSHELL: Current shellYou can access environment variables using $, for example echo $HOME prints the home directory (to stdout).
You can use environment variables in scripts or commands, for example: cd $HOME or cp $OLDPWD/file.txt .
Aliases are shortcuts for commands. You can define aliases in the shell configuration file (e.g. ~/.bashrc). For example:
This command creates an alias ll for ls -l. You can use ll instead of ls -l.
You can access the content of a file using cat, less, more, or head and tail commands.
cat: Concatenate and display file contentless: Display file content page by pagemore: Display file content page by pagehead: Display the first lines of a filetail: Display the last lines of a fileFor example, to display the first 10 lines of a file:
head file.txt
You can search for text in files using grep command. The general syntax is:
grep [options] pattern file
where pattern is the text to search for. For example:
grep 'pattern' file.txt
This command searches for pattern in file.txt and prints matching lines.
You can use regular expressions in grep to search for more complex patterns. For example:
grep -E 'pattern1|pattern2' file.txt
This command searches for pattern1 or pattern2 in file.txt.
You can find files in the filesystem using find command. The general syntax is:
find [path] [options]
where path is the directory to search in. For example:
find /tmp -iname '*.txt'
This command searches for files with .txt extension in /tmp directory. More options:
-name: Search by exact name-type: Search by file type (e.g. f for file, d for directory)-size: Search by file size (+ for larger, - for smaller. e.g. +1M)-exec: Execute a command on found files (requires {} as a placeholder and \; at the end of the command)There are many utilities available in Unix-like systems. Some common utilities:
awk: A powerful text processing toolsed: A stream editor for filtering and transforming textcut: Extract columns from each line of filessort: Sort lines of text filesuniq: Report or omit repeated lineswc: Print newline, word, and byte counts for each filediff: Compare files line by linefile: Determine file typedu: Estimate file space usageman: Display manual pageswhich: Locate a commandls command and redirect the output to files.txt.ls and wc commands..txt extension in the current directory using find command.files.txt using head command.files.txt to all_files.txt using mv command.ls a non-existent file and redirect the error to errors.txt.all_files.txt and errors.txt using cat command and redirect the output to all_files_errors.txt.bash (man bash). Search for Commands for Moving section: How I can move to the beginning of the line? How I can move to the next word?LSPAN24-practical1.qmd, grep for ## characters (mind # is a comment character), then order titles alphabetically using sortcurl and wget are command-line tools for transferring data with URLs. Some differences:
curl: Supports multiple protocols (HTTP, HTTPS, FTP, etc.), more flexible, but less user-friendly.wget: Supports HTTP and FTP, more user-friendly, but less flexible.curl is more suitable for scripting and automation.You can use curl and wget to download files from the web. For example:
curl -O https://example.com/file.txt
This command downloads file.txt from https://example.com to the current directory.
You can compress and decompress files using gzip, bzip2, and xz commands. For example:
gzip file.txt
This command compresses file.txt to file.txt.gz. To decompress:
gunzip file.txt.gz (or gzip -d file.txt.gz)
You can use bzip2 and xz commands similarly. For example to compress:
bzip2 file.txt
xz file.txt
To decompress:
bunzip2 file.txt.bz2 (or bzip2 -d file.txt.bz2)
unxz file.txt.xz (or xz -d file.txt.xz)
You can create (-c) and extract (-x) tar archives using tar command. For example:
tar -cvf archive.tar file1 file2
This command creates archive.tar containing file1 and file2. This archive is not compressed and have the same size of the sum of file1 and file2 sizes. To extract:
tar -xvf archive.tar
You can compress and extract compressed tar archives in one step. For example:
tar -czvf archive.tar.gz file1 file2
This command creates archive.tar.gz containing file1 and file2. To extract:
tar -xzvf archive.tar.gz
Mind to the z option for gzip, j for bzip2, and J for xz. f need to be followed by the archive name.
Credits programmerhumor.io
Credits xkcd.com
There are many text editors available in Unix-like systems. Here some common editors available in terminal:
nano: Simple and user-friendly text editorvim: Powerful and highly configurable text editoremacs: Extensible and customizable text editorTip
GitPod users have access to Visual Studio Code, a powerful and highly configurable text editor.
Virtual environments are isolated environments for software development. They allow you to install dependencies and packages without affecting the system-wide installation. Some common tools for creating virtual environments:
We use conda command to manage environments with conda / miniconda:
mamba if you have mamba installed.--name: required to specify the name of the environment.We can also create an environment from a file. Note the env before the create command:
--file: specify the file containing the environment specifications--name: override the name of the environment in the fileconda env list: list all environmentsconda activate <env_name>: activate an environmentNormally the prompt will change to show the active environment. Default installation have the base environment active at login
To deactivate an environment, use conda deactivate:
You can activate more than one environment at a time. The last activated environment will be active. When you deactivate an environment, the previous environment will be activated.
defaults: the default channel for conda packagesR: a channel for R packagesbioconda: a channel for bioinformatics softwareconda-forge: a community-driven collection of conda packages# add a channel to the list of channels
conda config --add channels conda-forge
# add a channel to the list of channels
conda config --add channels biocondaTip
GitPod users: we have set up channels required for the practical with the suggested priority settings. You can see the settings in the .condarc file in your home directory. See more information on conda Managing Channels
Use conda search to search for packages in the configured channels. Wildcards can be used in the search:
# search for a package in the configured channels
conda search samtools
# search for a package in a specific channel
conda search --channel bioconda samtoolsTip
GitPod users: the channels are already configured for the practicals. You can search any package in the configured channels without specifying the channel.
Search for the package samtools in the bioconda channel:
r-base package.conda install: install packages in the active environment
--name: specify the environment to install the package--file: specify a file with the list of packages to install# install a package in the active environment
# base environment is read-only for GitPod users!
conda install pandas
# install a package in a specific environment
conda install --name python3.10 pandas
# install packages from a file in the active environment
# (python format)
conda install --file requirements.txtCreate an environment with samtools, tabix and bcftools packages. Activate the environment and check if the packages are installed. Then install seqkit package in the same environment.
samtools executable is located?samtools executable is available. Why is it not available?Tip
which to find the location of the executable.$PATH.conda env export: export the environment to a file--name: specify the environment to export--file: specify the file to export the environment to--no-builds: exclude the build string from the exported file# export the active environment to a file (using STDOUT)
conda env export > environment.yml
# export a specific environment to a specific file
conda env export --name python3.10 --file environment.ymlExport an environment to a file; then export the same environment with --no-builds option to another file. Compare the two files with diff (try diff -y for a more readable output).
conda list: list all packages installed in the active environmentconda install pandas=1.3.3conda-forge channel usage if possibleconda list --revision: list all revisions of the environmentconda install --revision 1: revert environment to a previous revisionconda env remove --name <env_name>: remove an environmentconda create --clone <env_name> --name <new_env>: clone an environmentconda clean --all: clean the cache and unused packagesLet’s collect some genome data to make an example. We will use the CLI tools made available by NCBI, datasets and dataformat, to collect ARS-UCD1.2 data from the NCBI database. Next, we will use seqkit to manipulate file headers and then we will bgzip to pack the sequence files.
Create one conda environment with ncbi-datasets-cli, jq, seqkit and tabix packages
Go the NCBI Datasets page and search for cow (Bos taurus (cattle) will be suggested). You will open the new page for NCBI Taxonomy ID: 9913.
Click on ARS-UCD2.0 link, below the Genome section an over the Download button (don’t download the genome from this page)
In the practical we will use the ARS-UCD1.2 (GCF_002263795.1) version of the genome, however the latest version is ARS-UCD2.0 (GCF_002263795.3):
datasets is a command-line tool that is used to query and download biological sequence data across all domains of life from NCBI databases. See the documentation for more information.
For example, retrieve the same information as before using the accession number: pipe the result to jq to format the output:
The same command can be use to download the genome data: paste the command you’ve copied from the NCBI Taxonomy page and add two additional options:
datasets download genome accession GCF_002263795.1 \
--include gff3,rna,cds,protein,genome,seq-report \
--dehydrated --filename ARS-UCD1.2.zipNote
The \ (escape character) is used to break the command in multiple lines: it is not necessary if you paste the command in one line: but if you paste the command in multiple lines it prevents the command from being executed before you finish typing it.
--dehydrated: download the data in a dehydrated format: the data is downloaded in a format that can be rehydrated using the datasets tool. This is required if you need to download a lot of data. See Download large genome data packages for more information--filename: specify the name of the file to download the data to--include: specify the data types to download: if we are not interested in all data types we can exclude some of themNow unzip the downloaded archive in a new directory (since the archive will place stuff in the current directory):
-d: specify the directory to extract the files toTake a look to see the downloaded data: we have not any data! the only data we have are some metadata and the URL were files can be downloaded.
Is now time to rehydrate data:
--directory: this is the top level directory in which you have decompressed data with unzipNow the data should be downloaded in the same directory where the metadata is located. You can check the data with ls or tree:
tree: list the directory structure in a tree-like format
-p: show permissions-h: show sizes in human readable format-C: colorize the outputTake a minute to look at the files you have downloaded, especially the genome sequence: how many sequences are in the file? what is the format of the file?
We can use grep '>' to extract the sequence names from a fasta file and then counting the number of lines, however we can use a fasta/fastq manipulation program like seqkit, which can do this and much more:
seqkit stats: show statistics for the input fileThere are a lot of sequences in the file, you can inspect the sequence names with
seqkit seq: transform sequences (extract ID, filter by length, remove gaps, reverse complement…)
-n: only print names/sequence headers-i: print IDs instead of full headersThe sequence_report.jsonl file contains information about the sequences in the genome assembly. We can use jq to inspect the file, or we can use the dataformat command to transform this information in a table:
dataformat tsv: transform the input data in a tab-separated format
genome-seq: the type of source data to transform--inputfile: the input file to transform--fields: specify the fields to include in the outputSuppose we need to get rid of un-assembled sequences from the genome: we can use seqkit to extract sequences by name. First, extract the sequences we want by ids:
jq -c 'select(.role == "assembled-molecule")' sequence_report.jsonl \
| dataformat tsv genome-seq --fields refseq-seq-acc > ids.txt-c: compact output (required by dataformat as input)select(.role == "assembled-molecule"): select only sequences that are assembled molecules by their rolerefseq-seq-acc: the field to extract from the input dataNow we can use seqkit to extract the sequences by ids:
awk, cut or grep instead of jq?Now the sequence names are the RefSeq accession numbers: Let’s rename the sequences to include the chromosome name as id: this could be done again with seqkit, with a text file with the old name as key and the new name as value:
jq -c 'select(.role == "assembled-molecule")' sequence_report.jsonl \
| dataformat tsv genome-seq --fields refseq-seq-acc,chr-name > alias.txtThis is the same command we use before, but now we are extracting also the chromosome number. Now we can use seqkit to rename the sequences:
seqkit replace -p '^(\S+)(.+?)$' -r '{kv} \$1\${2}' -k alias.txt \
-o ARS-UCD1.2_chromosomes_renamed.fna ARS-UCD1.2_chromosomes.fnaseqkit replace: replace patterns in sequences
-p: the pattern to search for-r: the replacement pattern-k: the key-value file with the replacements-o: the output fileThe last operation can be to compress the genome sequence file: we can do this with bgzip in order to have a compressed file that can be indexed with other software like samtools:
Check that sequence names are in the proper format
seqkit stats ARS-UCD1.2_chromosomes_renamed.fna.gz
seqkit seq -n ARS-UCD1.2_chromosomes_renamed.fna.gzCan you optimize the previous commands to avoid creating intermediate files?
Tip
You can use pipes to connect the output of a command to the input of another command. Ideally, the original file downloaded from NCBI should not be modified (but can be compressed)

Livestock pangenomes 2024 - Practical 1 - 2024/07/22