T 4.76 Failure of administration servers for virtualisation systems

Several virtualisation servers may be combined to form a virtual infrastructure. For this purpose, the virtualisation servers are connected in such a way that the virtual IT systems running on them are always executed on the virtualisation server that is able to provide the optimal performance for the respective IT system. If another virtualisation server is able to provide a running virtual IT system with more resources (dynamic assignment of resources, e.g. Citrix XenServer Workload Balancing or VMware Distributed Resource Scheduler), this IT system can even be moved to the server with the available resources during operation (Live Migration).

Additionally, the availability of the virtual IT systems can be increased by high-availability mechanisms such as the automatic restart of failed virtual machines. For the majority of the virtualisation products, these functions require a central administration server that coordinates the operation of the individual virtual machines and the virtualisation servers. Virtualisation products capable of using such a central administration server include Citrix XenServer, Microsoft Hyper-V, and VMware ESX, for example. The administration server (Citrix XenCenter, Microsoft System Center Virtual Machine Manager, SUN Management Center, or VMware vCenter) is normally also equipped with a monitoring component that can be used to monitor the function of the virtual IT systems and the virtualisation servers.

Since the administration server controls and administers all functions of a virtual infrastructure, a failure of this server means that configuration changes to the virtual infrastructure can no longer be made. During this period, the administrators cannot react to problems such as resource bottlenecks or the failure of individual virtualisation servers, nor can they integrate a new virtualisation server into the infrastructure or create new virtual IT systems.

Functions such as Live Migration, and therefore the dynamic assignment of resources to individual guest systems, are no longer available either, since the instance coordinating such functions is no longer operational. As a consequence, the virtual infrastructure is no longer able to react automatically to resource bottlenecks, which has adverse effects on both the performance and the availability of individual virtual IT systems. This applies in particular if the resources of the virtualisation servers have been overbooked, i.e. if more resources have been assigned to the virtual IT systems than are physically available.
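The degree of overbooking can be illustrated with a small calculation. The following is a minimal sketch in Python, assuming that memory is the limiting resource; the names Host and overbooking_ratio as well as all figures are hypothetical and are not taken from any virtualisation product.

```python
from dataclasses import dataclass


@dataclass
class Host:
    """A virtualisation server with its physical memory (hypothetical model)."""
    name: str
    memory_gb: int


def overbooking_ratio(hosts, vm_memory_gb):
    """Return assigned VM memory divided by physical memory of the farm.

    A value above 1.0 means the farm is overbooked: the virtual IT systems
    have been promised more memory than the hosts physically provide.
    """
    capacity = sum(host.memory_gb for host in hosts)
    if capacity <= 0:
        raise ValueError("farm has no physical memory")
    return sum(vm_memory_gb) / capacity


# Illustrative figures only: two hosts with 64 GB each, four virtual IT systems.
farm = [Host("vs01", 64), Host("vs02", 64)]
ratio = overbooking_ratio(farm, [32, 32, 32, 64])
print(f"overbooking ratio: {ratio:.2f}")
```

With a ratio above 1.0, the farm depends on the dynamic assignment of resources. If the administration server fails, a peak load on one server can no longer be compensated for by moving virtual IT systems to another server.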

Additionally, the administration server is used to monitor the virtualisation servers and the virtual IT systems operated on these servers. If the administration server or its monitoring component provides incorrect data or no data at all, the administrators are no longer able to monitor the functionality of the virtual infrastructure appropriately. Thus, there is the risk that resource bottlenecks in the virtual infrastructure remain unnoticed and that the virtual infrastructure is expanded too late. The failure of individual virtual IT systems may also be noticed too late if the monitoring function of the virtual infrastructure has failed.

Furthermore, the failure of a virtualisation server may even remain unnoticed: if the IT systems running on this server have been migrated to another virtualisation server, no services fail in the computer centre, and if the administration and monitoring software does not indicate the failure due to an error, nothing draws attention to it. The resulting reduction of redundancy may massively reduce the overall availability of the virtual infrastructure.
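One way to make missing monitoring data visible is to treat "no data" and "outdated data" as separate states instead of silently assuming that everything is in order. The following Python sketch illustrates this idea. The function name classify_hosts, the input format (a mapping from server name to the time of its last report in seconds) and the threshold are assumptions made for this illustration only.

```python
def classify_hosts(last_report, now, max_age_seconds):
    """Classify each virtualisation server by the age of its last report.

    last_report maps a server name to the time of its last monitoring
    report (seconds), or None if no report has ever been received.
    Returns a dict mapping each server name to 'ok', 'stale' or 'no data'.
    """
    status = {}
    for host, timestamp in last_report.items():
        if timestamp is None:
            # Never heard from this server: must not be counted as healthy.
            status[host] = "no data"
        elif now - timestamp > max_age_seconds:
            # The last report is too old to be trusted.
            status[host] = "stale"
        else:
            status[host] = "ok"
    return status


# Illustrative values only: reports are expected at least every 60 seconds.
reports = {"vs01": 100.0, "vs02": 10.0, "vs03": None}
print(classify_hosts(reports, now=120.0, max_age_seconds=60))
```

Such a check is only useful if it runs independently of the administration server. Otherwise, it fails together with the component it is supposed to supervise.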

Example:

An organisation operates several virtualisation servers consolidated in two farms. In each of these farms, several virtual IT systems are operated. The virtualisation servers were distributed to two farms, since certain virtual IT systems must not be operated together with other IT systems due to different protection requirements.

During the planning phase of the two farms, the number of virtualisation servers required in each case was determined based on a forecast of the future performance requirements. After a while, the forecast turns out to be incorrect. It is determined that an additional virtualisation server is required in the first of the two farms in order to cover the performance requirements of the virtual IT systems.

After having analysed the performance data of the second farm, the administrators of the virtualisation servers determine that the utilisation of these servers is far lower than predicted in the performance forecast. Therefore, it is decided not to procure a new virtualisation server, but to move one virtualisation server from the second farm to the first one.

Now, the virtual IT systems on the virtualisation server to be moved are migrated to the other virtualisation servers in the second farm, and the server is added to the first farm. As a consequence, the resources of the second farm are massively overbooked. This was not to be expected according to the results of the performance analysis.

The reason for the massive losses in performance of the virtual IT systems in the second farm was that the administration system for this farm incorrectly processed the performance data of the individual virtualisation servers and indicated values for resource consumption that were significantly lower than they should have been.
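The effect described in this example can be reproduced with a short calculation. The following Python sketch uses purely hypothetical figures (four servers with 100 capacity units each, a reported load of 240 units and an actual load of 360 units); the function names utilisation_after_removal and deviates are likewise assumptions made for this illustration.

```python
def utilisation_after_removal(host_capacities, load, removed_index):
    """Return the farm utilisation after one server has been removed.

    host_capacities is a list of capacity units per server, load is the
    total resource consumption of the virtual IT systems in the same units.
    A result above 1.0 means the remaining servers are overbooked.
    """
    remaining = sum(
        capacity
        for index, capacity in enumerate(host_capacities)
        if index != removed_index
    )
    if remaining <= 0:
        raise ValueError("no capacity left after removal")
    return load / remaining


def deviates(reported, independent, tolerance):
    """Return True if the reported value differs from an independently
    measured value by more than the given relative tolerance."""
    if independent == 0:
        return reported != 0
    return abs(reported - independent) / abs(independent) > tolerance


# Hypothetical figures for the second farm.
farm = [100, 100, 100, 100]
reported_load = 240   # value shown by the faulty administration system
actual_load = 360     # value that an independent measurement would show

print(utilisation_after_removal(farm, reported_load, removed_index=3))
print(utilisation_after_removal(farm, actual_load, removed_index=3))
print(deviates(reported_load, actual_load, tolerance=0.1))
```

Based on the reported load, the farm appears to be utilised at about 80 % after the server has been removed. Based on the actual load, it is utilised at about 120 % and is therefore overbooked. Comparing the values of the administration system with a measurement taken independently of it, as sketched in deviates, would have revealed the discrepancy before the decision was made.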