T 4.74 Failure of IT components in a virtualised environment

Within a classic IT infrastructure, server operating systems and their services, as well as the operating systems of workstation computers, run on physical IT systems. The infrastructure components (network components, storage networks, and the like) required to operate the server systems are likewise distributed across different physical IT systems.

In a virtualised environment, however, the server systems and parts of the required infrastructure components are largely provided by the virtualisation servers as stand-alone server instances. If a virtual server accesses the network, for example, it does not access a physical IT system such as a switch, but a component provided by the virtualisation server that exists only as software and not as stand-alone hardware.

If a physical IT system fails, it is often possible to continue working with the remaining systems. The services provided by the failed server can no longer be used, but this does not necessarily affect the other installed servers. For example, if a database server fails, access to the file server may still be possible. The failure therefore does not affect all of the business processes supported by the information system.

In contrast, in a virtualised IT infrastructure a large number of different virtualised IT systems (guests) are normally consolidated on a few physical machines. This significantly increases the impact on availability if a virtualisation server malfunctions. If physical components of a virtualisation server are damaged or its operating system malfunctions, all virtual IT systems running on that server are affected.

If an IT system fails, the data processed by this system may be damaged. The time and expense required to re-commission the system may be higher, since data recovery from the data backup may be required. Data may also have been lost irretrievably. If several virtual IT systems fail simultaneously due to an error of a virtualisation server, this increases the likelihood of at least one of the failed systems suffering such damage. Therefore, such a case may result in a longer interruption of operations when compared to the failure of only one IT system.
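The growth of this likelihood with the number of simultaneously failing guests can be illustrated with a short calculation. The following is only a rough sketch: it assumes that each failed system suffers data damage independently and with the same probability, which is a simplification, and both the function name and the value of 5 % are hypothetical figures chosen for illustration.

```python
def prob_at_least_one_damaged(p_damage: float, n_systems: int) -> float:
    """Probability that at least one of n failed systems suffers data damage.

    Assumption (not from the source text): damage occurs independently per
    system and with the same probability p_damage.
    """
    if not 0.0 <= p_damage <= 1.0:
        raise ValueError("p_damage must be between 0 and 1")
    if n_systems < 0:
        raise ValueError("n_systems must not be negative")
    # Complement of "no system is damaged".
    return 1.0 - (1.0 - p_damage) ** n_systems


if __name__ == "__main__":
    p = 0.05  # hypothetical per-system damage probability
    for n in (1, 5, 10):
        print(f"{n:2d} failed systems: {prob_at_least_one_damaged(p, n):.3f}")
```

Under these assumptions, the likelihood rises from 5 % for a single failed system to roughly 23 % for five and roughly 40 % for ten systems failing together.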

Many services depend on each other in computer centre operations. For example, an email system requires a directory service in order to assign the recipient addresses to the mailboxes. A task management system requires the email system in order to process incoming and outgoing tasks. Furthermore, the system automatically creates tasks using the ERP system in order to support the processing of the customers' orders. Moreover, the ERP system accesses the warehouse management database in order to monitor the stocks.

The failure of individual components of the information system may cause the partial failure of services provided in the computer centre. If several IT systems are operated as virtual IT systems on one virtualisation server, several components of an information system fail simultaneously with the virtualisation server. This may result in stronger adverse effects on IT operations when compared to classic, non-virtualised computer centre operations.

Example:

A medium-sized company decided to use a virtualisation solution. The plan was to procure several very powerful servers and thereby significantly reduce the number of physical systems.

The IT systems and their services were distributed across the virtualisation servers according to criteria such as processor load and memory consumption, with the aim of distributing the virtual IT systems optimally.

The company uses an email system based on a directory service. In addition, it operates an accounting system that is split across an application server and a database server. The ERP and warehousing systems, which are also in use, exchange data via the accounting database.

Since the database server of the accounting system and the email server are among the IT systems with the highest performance requirements, it was decided to operate them on separate virtualisation servers in order to prevent the systems from interfering with each other during operation. The in-depth analysis of the systems' performance requirements showed that an optimal consolidation effect could be achieved if the virtual IT systems were distributed as follows:

First virtualisation server: database, directory service

Second virtualisation server: email system, accounting system

Third virtualisation server: ERP system, warehousing system

The first virtualisation server failed due to a damaged electrolytic capacitor on its mainboard. This server hosted the database of the accounting system and the company's directory service, each in its own virtual machine.

The failure of this physical server had wide-ranging consequences for IT operations as a whole. The application servers for accounting, warehousing and logistics, and enterprise resource planning are located on other virtualisation servers, but they depend on exchanging data with the database in order to work properly. Central processes in the company failed completely, so that the delivery of customers' orders stopped and a loss of production lasting several hours had to be accepted as a consequence of the failure of the ERP and warehousing systems.

Furthermore, it was not possible to inform the company's customers of the loss of production immediately via email, since the email system, which depends on the failed directory service, was also unavailable. As a result, the company violated essential duties under its supply agreements and had to bear contractual penalties in addition to the costs incurred as a consequence of the loss of production.
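The chain of failures in this example can be reproduced with a small dependency analysis. The following is a minimal sketch, not a tool referenced by this catalogue: the placement and the dependencies are taken from the example above, while the identifiers (server_1, accounting_db, affected_systems and so on) are hypothetical names introduced for illustration.

```python
from collections import deque

# Placement from the example: virtualisation server -> virtual IT systems.
PLACEMENT = {
    "server_1": ["accounting_db", "directory_service"],
    "server_2": ["email", "accounting_app"],
    "server_3": ["erp", "warehousing"],
}

# Dependencies from the example: system -> systems it needs in order to work.
DEPENDS_ON = {
    "email": ["directory_service"],
    "accounting_app": ["accounting_db"],
    "erp": ["accounting_db"],
    "warehousing": ["accounting_db"],
}


def affected_systems(failed_host):
    """Return every system that stops working when failed_host fails."""
    # Invert the dependency map: system -> systems that depend on it.
    dependents = {}
    for system, requirements in DEPENDS_ON.items():
        for requirement in requirements:
            dependents.setdefault(requirement, []).append(system)

    # Start with the guests on the failed host, then follow dependents.
    affected = set(PLACEMENT[failed_host])
    queue = deque(affected)
    while queue:
        current = queue.popleft()
        for dependent in dependents.get(current, []):
            if dependent not in affected:
                affected.add(dependent)
                queue.append(dependent)
    return affected


if __name__ == "__main__":
    for host in PLACEMENT:
        print(host, "->", sorted(affected_systems(host)))
```

With this placement, the failure of the first virtualisation server affects all six systems, whereas the failure of the second or third server affects only the two systems hosted on it. A check of this kind during planning would show that a placement based purely on performance criteria can concentrate the systems that most others depend on onto a single server.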