T 4.75 Failure of the network infrastructure of virtualisation environments

Several virtualisation servers can be consolidated to form a so-called virtual infrastructure. In such a virtual infrastructure, the virtual IT systems can be distributed to the individual virtualisation servers as required. Furthermore, it is possible to migrate the virtual machines between the virtualisation servers. For some products, this may also be performed while the virtual IT system is being executed (examples: Microsoft Hyper-V Live Migration, VMware VMotion, XEN LiveMigration). Such a process, hereinafter referred to as Live Migration, is normally transparent for the virtual IT system, i.e. it does not notice the migration process. Many further functions of a virtual infrastructure build upon this migration technology. These include functions such as the dynamic assignment of processor and main memory resources. Here, the virtual IT system is always migrated to the virtualisation server that is able to optimally provide the required resources. This way, a virtual IT system is always provided with the best possible assignment of resources.

Furthermore, some virtualisation products compensate for the failure of a virtualisation server by automatically restarting the virtual IT systems affected by the failure on another virtualisation server.

To coordinate these functions (automatic restart, Live Migration), a communication network is required between the virtualisation servers involved. If failures occur in this network, the functions coordinated over it will fail as well.

A failure in the communication between virtualisation servers may cause a Live Migration to be cancelled. If the migration was triggered because a resource bottleneck required the virtual machine to be moved to a different destination server, the dynamic load balancing mechanism fails as well. As a consequence, the resource bottleneck on the source server cannot be eliminated, and the IT system that could not be migrated has only limited availability.
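The effect described above can be illustrated with a minimal sketch. It assumes a hypothetical `Host` record and a hypothetical `pick_migration_target` function; neither belongs to any virtualisation product, and free main memory stands in for whatever resource metric a real product would use. If no destination server is reachable over the migration network, no target is found and the virtual IT system stays on the overloaded source server.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Host:
    """Hypothetical view of a virtualisation server as seen from the source."""
    name: str
    free_mem_mb: int
    reachable: bool  # False if the migration network to this host is down


def pick_migration_target(hosts: Iterable[Host], source: str,
                          needed_mem_mb: int) -> Optional[Host]:
    """Return the reachable host with the most free memory, or None."""
    candidates = [
        h for h in hosts
        if h.name != source and h.reachable and h.free_mem_mb >= needed_mem_mb
    ]
    # With no candidate the migration cannot take place and the
    # bottleneck on the source server remains.
    return max(candidates, key=lambda h: h.free_mem_mb, default=None)
```

In this sketch a network failure does not need to affect the destination servers themselves; it is enough that the source server can no longer reach them.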

In order to increase the availability of virtual IT systems, several virtualisation servers can be connected to form a cluster. The systems participating in such a server cluster require trouble-free communication with each other. They use this communication, for example, for mutual monitoring and for checking whether the virtual IT systems running on their partners are still available (Heartbeat). If one of the partners in the cluster fails, the virtual IT systems that failed along with it are restarted on a different virtualisation server wherever possible.
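Heartbeat monitoring of this kind can be sketched as follows. The class `HeartbeatMonitor`, its method names and the timeout of five seconds are assumptions chosen for illustration and do not reflect the mechanism of any particular product.

```python
import time
from typing import Dict, List, Optional


class HeartbeatMonitor:
    """Tracks when each cluster partner was last heard from."""

    def __init__(self, timeout_s: float = 5.0) -> None:
        self.timeout_s = timeout_s
        self._last_seen: Dict[str, float] = {}

    def record(self, peer: str, now: Optional[float] = None) -> None:
        """Note that a heartbeat from `peer` has just arrived."""
        self._last_seen[peer] = time.monotonic() if now is None else now

    def suspected_failed(self, now: Optional[float] = None) -> List[str]:
        """Return the partners whose last heartbeat is older than the timeout."""
        now = time.monotonic() if now is None else now
        return sorted(
            peer for peer, seen in self._last_seen.items()
            if now - seen > self.timeout_s
        )
```

A missing heartbeat only shows that a partner can no longer be heard. On its own it does not reveal whether the partner has failed or whether only the network between the systems has failed.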

If the communication network of the cluster fails, for example due to a hardware error on a switch, the failure compensation function of the cluster no longer works. The availability of the virtual IT systems running on the virtualisation servers that are members of the cluster may then also be endangered.

In addition to the functions mentioned above, the communication network between the systems participating in a high-availability cluster has a further important role. If communication between several systems of a cluster fails simultaneously, every system must be able to decide whether it is itself affected by the failure or whether the other systems are (isolation problem). If two or more virtualisation servers in a high-availability cluster each start the same virtual IT system independently, the data representing this virtual system could be damaged. This may render the virtual IT system unusable. Failures may also occur if one and the same IT system exists several times in the network (e.g. due to duplicate IP or MAC addresses).
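One common way of approaching the isolation problem is a majority rule: a system only restarts virtual IT systems if it can still see more than half of the cluster. The following sketch illustrates this rule; it is an assumption for illustration and not the procedure prescribed by this catalogue or by any specific product. The function name `may_restart_vms` is hypothetical.

```python
def may_restart_vms(cluster_size: int, reachable_peers: int) -> bool:
    """Decide whether this system may take over virtual IT systems.

    cluster_size    -- total number of systems in the cluster
    reachable_peers -- number of OTHER systems this system can still reach
    """
    if cluster_size < 1 or not 0 <= reachable_peers < cluster_size:
        raise ValueError("inconsistent cluster size or peer count")
    visible = reachable_peers + 1  # count this system itself
    # Strict majority: at most one partition of the cluster can satisfy this,
    # so the same virtual IT system is not started twice.
    return 2 * visible > cluster_size
```

With an even split, for example two against two in a cluster of four, neither side reaches a majority under this rule and no side restarts the virtual IT systems. This protects the data at the cost of availability.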

Connection of storage networks

Virtual IT systems are normally represented physically by a set of files. Along with the configuration of the virtual IT system, these files also contain, for example, the containers for its virtual hard disks. If snapshots of a virtual IT system are created in any operating state, including while it is running, the virtualisation server also stores the resulting data in files. These files may be stored either on the virtualisation server itself or in the connected central storage network.

Virtual server environments consisting of several virtualisation servers are often connected to central storage networks so that the files representing the virtual IT systems can be accessed from several locations. If the connection to these storage resources is interrupted, this affects the virtual IT systems as if the hard disk were removed from a physical server during live operation. Since more than one virtual IT system is frequently stored on the storage resources of a storage network, the operational reliability of many virtual IT systems is endangered in the event of a failure. A failure may cause file system inconsistencies in the affected virtual IT systems and virtualisation servers, which may require comprehensive recovery measures.