Analysis of insertions in the genomes and proteins of coronaviruses. Illustration: AvH. (2020). Covid-19 cracked
This page provides the instructions to install the software environment used for the analysis of SARS-CoV-2 origins, and some basic instructions to run the analyses.
git clone https://github.com/jvanheld/coronavirus_insertions.git
The code can then be updated as follows
cd coronavirus_insertions
git pull
The whole software environment required to reproduce these analyses can be easily installed with miniconda, which needs to be installed beforehand.
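Since the installation relies on conda and make being available, a quick check can save a failed run. The following is a minimal sketch; the helper name `check_tool` is an illustration and is not part of the repository.

```shell
# Report whether a command is available on the PATH.
# check_tool is a hypothetical helper, not provided by the repository.
check_tool() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "$1: found"
    else
        echo "$1: missing"
        return 1
    fi
}

# Both tools are needed before running the install_env target.
check_tool conda || echo "Install miniconda before running install_env"
check_tool make || echo "Install make before running the makefiles"
```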
## List the targets
make -f scripts/makefiles/01_software-environment.mk
## Install the environment
make -f scripts/makefiles/01_software-environment.mk install_env
The software environment can then be loaded with the command
conda activate covid-19
Additional tasks are described in the help message
make -f scripts/makefiles/01_software-environment.mk
The analyses can be reproduced by combining:
A series of scripts in the makefiles folder. The makefiles are numbered to indicate their order (there are some dependencies between scripts). Running a makefile without specifying a target will list the available targets and their short description.
Some R markdown notebooks in the reports folder. The name of each Rmd file indicates its goal.
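Because running a makefile without a target lists its targets, an overview of all steps can be obtained by looping over the makefiles in numbered order. This is a sketch only: the function names are illustrative, and the default folder `scripts/makefiles` is taken from the commands shown on this page.

```shell
# Print the makefiles of a folder in numbered order
# (shell glob expansion is sorted, so 01_ comes before 02_).
ordered_makefiles() {
    dir="${1:-scripts/makefiles}"
    for mk in "$dir"/*.mk; do
        [ -e "$mk" ] && echo "$mk"
    done
    return 0
}

# Run each makefile without a target, which prints its list of targets.
list_all_targets() {
    ordered_makefiles "$1" | while read -r mk; do
        echo "== $mk =="
        make -f "$mk"
    done
}
```

From the root of the repository, `list_all_targets` would then print the help message of each makefile in turn.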
Before each working session, you need to restart the conda environment.
First, list the targets:
make -f scripts/makefiles/01_software-environment.mk
Targets:
    links         list relevant links for this analysis
    install_env   install the conda environment
    update_env    update the conda environment
    start_env     start the conda environment
Beware: the target start_env does not actually start the environment, but indicates the commands required to start it.
## List the commands required to start the environment
make -f scripts/makefiles/01_software-environment.mk start_env
## Run these commands (the path to conda.sh depends on where miniconda is installed on your machine)
source /Users/jvanheld/miniconda3/etc/profile.d/conda.sh
conda activate covid-19
This script runs blastn to find matches between the HIV genome and all the Betacoronavirus genomes available at NCBI. The goal is to evaluate the claim made in some media that SARS-CoV-2 contains insertions from the HIV genome. We showed that these matches are not statistically significant (all of them have an e-value higher than 1).
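The significance criterion can be checked directly on a BLAST result table. The sketch below assumes the hits were saved in the BLAST+ tabular format (`-outfmt 6`), where column 11 holds the e-value; the repository script may use a different output format, and the function name and the file name in the usage line are hypothetical.

```shell
# Count hits whose e-value is at or below a threshold (default: 1).
# Assumes BLAST+ tabular output (-outfmt 6) on standard input,
# where column 11 is the e-value.
count_significant_hits() {
    threshold="${1:-1}"
    awk -F '\t' -v max="$threshold" \
        '$11 + 0 <= max + 0 { n++ } END { print n + 0 }'
}

# Usage (hypothetical file name):
#   count_significant_hits 1 < hiv_vs_betacoronavirus.tsv
```

A count of 0 with a threshold of 1 corresponds to the situation described above, where every match has an e-value higher than 1.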
The commands to run the phylogenetic analysis of coronavirus genomic sequences can be listed as follows.
make -f scripts/makefiles/02_genome-analysis.mk
The script includes parameters that can be modified to address specific queries or to tune the computation to your local configuration.
make -f scripts/makefiles/02_genome-analysis.mk list_param
In particular, the variable PHYML_THREADS should be adapted to the number of CPUs of your computer.
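One way to pick a value for PHYML_THREADS is to ask the system for its number of CPUs and pass the result on the make command line. This is a sketch: `detect_cpus` is an illustrative helper, and the command is only printed so it can be reviewed before being run.

```shell
# Detect the number of online CPUs (getconf works on Linux and macOS);
# fall back to 1 if the value cannot be determined.
detect_cpus() {
    n=$(getconf _NPROCESSORS_ONLN 2>/dev/null) || n=1
    echo "${n:-1}"
}

PHYML_THREADS=$(detect_cpus)

# Print the command that overrides the variable, for review.
echo "make -f scripts/makefiles/02_genome-analysis.mk PHYML_THREADS=${PHYML_THREADS} list_param"
```

On a shared machine it may be preferable to use fewer threads than the detected number.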
The aim of this program is to retrieve from UniProt all the available sequences of spike proteins belonging to the specified taxa, to align them, and to identify the insertions in one of these proteins (by default the spike of SARS-CoV-2).
The commands are specified in the makefile scripts/makefiles/03_protein-alignments.mk.
The list of targets can be obtained with the following command.
make -f scripts/makefiles/03_protein-alignments.mk
You should first run one of the “uniprot_” targets, then “align_muscle” and finally “identify_insertion”. For example:
make -f scripts/makefiles/03_protein-alignments.mk uniprot_sarbecovirus
make -f scripts/makefiles/03_protein-alignments.mk align_muscle
make -f scripts/makefiles/03_protein-alignments.mk identify_insertion
Results will be found in the “results/spike_protein” folder (namely the multiple alignment file in several formats, and a .csv file describing the position of the insertions in the reference protein).
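Since each step depends on the previous one, the three targets can be chained so that the run stops at the first failure. The following is a sketch; `run_targets` and the `MAKE_CMD` override are illustrative names that do not exist in the repository.

```shell
# Run several targets of one makefile in order, stopping at the first failure.
# MAKE_CMD is a hypothetical override, handy for a dry run (MAKE_CMD=echo).
run_targets() {
    makefile="$1"
    shift
    for target in "$@"; do
        ${MAKE_CMD:-make} -f "$makefile" "$target" || {
            echo "Target $target failed" >&2
            return 1
        }
    done
}

# Usage, from the root of the repository:
#   run_targets scripts/makefiles/03_protein-alignments.mk \
#       uniprot_sarbecovirus align_muscle identify_insertion
```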
The color_insertions.pml script makes it possible to visualize the insertions in the SARS-CoV-2 spike on a 3D structural model.
pymol scripts/pymol/color_insertions.pml
ACE2 is the receptor of SARS-CoV-2. To determine which animals are susceptible to infection by SARS-CoV-2 or similar viruses, we gathered the ACE2 sequences of numerous animals in the data/ACE2/ACE2.fa file. Our aim is to align them, to construct a phylogenetic tree, and to determine the similarity of each protein with the human protein on the residues involved in spike binding.
To build the multiple alignment (saved in fasta and phylip format) between the ACE2 sequences:
make -f scripts/makefiles/05_ACE2_analysis.mk align_muscle_fasta_phylip
Once this is done, the corresponding tree can be built with:
make -f scripts/makefiles/05_ACE2_analysis.mk phyml_ACE2
and the comparison with the human protein can be performed with:
make -f scripts/makefiles/05_ACE2_analysis.mk compare_ACE2_with_human
## Define your login on the IFB-core cluster
IFB_LOGIN=[your_login]
## open a connection to the cluster
ssh ${IFB_LOGIN}@core.cluster.france-bioinformatique.fr
cd coronavirus_insertions
module load conda
conda activate covid-19
Never run the tasks on the login node!
## Run genome alignments
srun --cpus-per-task=50 --mem=32GB --partition=fast \
make -f scripts/makefiles/02_genome-analysis.mk \
PHYML_THREADS=50 TIME='' \
Sgenes_around-cov2