T 4.77 Resource bottlenecks due to improperly functioning guest tools in virtual environments

Many virtualisation products allow so-called guest tools to be installed in the virtual IT systems. These tools serve two purposes. First, they provide specific, optimised device drivers for the virtual hardware components of a virtual machine. Second, in some products the virtualisation server uses the guest tools to control the resource consumption of a virtual IT system. This is particularly necessary if the virtualisation product allows resources such as main memory or hard disk space to be overbooked. For example, if two virtual IT systems compete for main memory, the host operating system or the hypervisor may instruct the guest tools in one of the virtual IT systems to reserve virtual RAM and therefore its physical equivalent. The virtual IT system no longer uses the physical memory behind this reservation. Instead, the hypervisor controls it by means of the guest tools and can provide it to the other virtual IT system as virtual RAM. Conversely, a virtual IT system may use the guest tools to request main memory. Such a technology is used, for example, in the ESX product from VMware. There, the memory is reserved by a so-called ballooning driver, which is included in the guest tools (VMware Tools).
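The ballooning mechanism described above can be illustrated with a small simulation. This is a minimal sketch under simplifying assumptions, not the implementation used by VMware or any other product: the class `Guest`, the function `rebalance`, and all megabyte figures are hypothetical and chosen purely for illustration.

```python
from dataclasses import dataclass


@dataclass
class Guest:
    """A virtual IT system with a (hypothetical) balloon driver."""
    name: str
    configured_mb: int   # virtual RAM the guest was configured with
    used_mb: int         # memory in use by processes inside the guest
    balloon_mb: int = 0  # memory currently reserved by the balloon driver

    @property
    def usable_mb(self) -> int:
        """Memory the guest can actually use and that needs physical backing."""
        return self.configured_mb - self.balloon_mb

    def inflate(self, mb: int) -> int:
        """Reserve up to `mb` of currently unused guest memory."""
        unused = self.usable_mb - self.used_mb
        granted = max(0, min(mb, unused))
        self.balloon_mb += granted
        return granted

    def deflate(self, mb: int) -> int:
        """Release up to `mb` of reserved memory back to the guest."""
        released = max(0, min(mb, self.balloon_mb))
        self.balloon_mb -= released
        return released


def rebalance(donor: Guest, receiver: Guest, mb: int) -> int:
    """Move up to `mb` of physical memory from `donor` to `receiver`.

    The hypervisor inflates the balloon in the donor and deflates the
    balloon in the receiver by the same amount, so the total amount of
    physical memory in use stays constant.
    """
    wanted = min(mb, receiver.balloon_mb)
    reclaimed = donor.inflate(wanted)
    receiver.deflate(reclaimed)
    return reclaimed


# Two guests with 6144 MB each on a host with 8192 MB (overbooked).
vm_a = Guest("vm-a", configured_mb=6144, used_mb=2048, balloon_mb=2048)
vm_b = Guest("vm-b", configured_mb=6144, used_mb=4000, balloon_mb=2048)
moved = rebalance(vm_a, vm_b, 1024)
print(moved, vm_a.usable_mb, vm_b.usable_mb)
```

In this model the sum of `usable_mb` over both guests remains 8192 MB before and after the call. A defect in `inflate` or `deflate` would break exactly this invariant, which is why errors in the guest tools can affect several virtual IT systems at once.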

Furthermore, in some virtualisation products the device drivers of the guest tools may be used to restrict access to resources. For example, the bandwidth with which a virtual IT system accesses the network or the storage network can be limited.

Because the guest tools perform such a wide range of functions, programming errors in them may have far-reaching consequences for the operation of the affected virtual IT systems, particularly since in most cases numerous IT systems are affected simultaneously.

Device drivers

The most common purpose of the guest tools is to provide optimised device drivers for the emulated hardware (graphics card, network card, mass storage) that the virtualisation server presents to the virtual IT systems. A virtual IT system can also use the emulated hardware with the drivers supplied with most common operating systems, but optimal use is only possible with specifically adapted drivers. Since these adapted drivers are normally installed in all virtual IT systems, an error in them affects all virtual machines.

Overbooking of memory resources

If the main memory of the virtualisation server is overbooked and the guest tools process memory requests within a virtual IT system improperly, processes may not be provided with enough memory.
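The effect on a single virtual IT system can be made concrete with a little arithmetic. The following is a minimal sketch under assumed figures. The function names and all values are hypothetical, and the model ignores real-world mechanisms such as swapping or page sharing.

```python
def overbooking_ratio(configured_mb: list[int], physical_mb: int) -> float:
    """Configured guest memory relative to the host's physical memory."""
    return sum(configured_mb) / physical_mb


def allocatable_mb(configured_mb: int, balloon_mb: int, used_mb: int) -> int:
    """Memory a guest can still hand out to its processes."""
    return max(0, configured_mb - balloon_mb - used_mb)


def process_can_start(request_mb: int, configured_mb: int,
                      balloon_mb: int, used_mb: int) -> bool:
    """True if the guest can satisfy a process's memory request."""
    return request_mb <= allocatable_mb(configured_mb, balloon_mb, used_mb)


# Three guests with 4096 MB each on a host with 8192 MB: overbooked by 1.5.
print(overbooking_ratio([4096, 4096, 4096], 8192))

# Faulty guest tools: the reserved memory (1536 MB) is never released,
# so a 1024 MB request fails although 4096 MB are configured.
print(process_can_start(1024, configured_mb=4096, balloon_mb=1536, used_mb=2048))

# Correctly working guest tools release the reservation on request.
print(process_can_start(1024, configured_mb=4096, balloon_mb=0, used_mb=2048))
```

The second call returns `False` and the third returns `True`: the only difference is whether the reservation held by the guest tools was released when the guest asked for memory.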

Errors in bandwidth management

If the bandwidth management functions in the guest tools are programmed incorrectly, the guidelines defined for them may be ineffective. Conversely, a virtual IT system may be provided with far too little bandwidth or none at all.

For example, if a virtual IT system continuously generates heavy network traffic and places a high load on the physical resources, the connections of other virtual IT systems may be impaired to the point of being interrupted, which endangers the availability of these systems.

The administrator of the virtualisation server can then use the guest tools either to restrict the bandwidth available to the first virtual IT system or to guarantee a certain minimum bandwidth for the other systems. If the bandwidth control guidelines have become ineffective due to a programming error, for example after an update of the guest tools, the objectives pursued by these guidelines are not attained. The availability of the systems therefore continues to be limited.

If, despite correct guidelines, an error in the guest tools leaves the first IT system in this scenario with too little bandwidth, the availability of this system may be limited because it cannot obtain the bandwidth it needs to access the network. The same applies to all the other IT systems whose communication was supposed to be protected.
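Both failure modes of the scenario can be reproduced with a simple allocation model. This is only a sketch: the function `allocate`, the system names, and the assumption that an overloaded link is shared in proportion to demand are hypothetical and do not describe how any particular virtualisation product enforces its limits.

```python
from typing import Optional


def allocate(capacity: float, demands: dict[str, float],
             limits: Optional[dict[str, float]] = None) -> dict[str, float]:
    """Distribute link capacity among virtual IT systems.

    Each demand is first capped by the system's limit, if one is set.
    If the capped demands still exceed the capacity, the link is shared
    in proportion to the capped demands (an assumption of this sketch).
    """
    limits = limits or {}
    capped = {vm: min(d, limits.get(vm, d)) for vm, d in demands.items()}
    total = sum(capped.values())
    if total <= capacity:
        return capped
    scale = capacity / total
    return {vm: d * scale for vm, d in capped.items()}


# Demands in Mbit/s on a 1000 Mbit/s link; vm1 generates heavy traffic.
demands = {"vm1": 2000, "vm2": 100, "vm3": 100}

# Guideline enforced correctly: vm1 is limited, vm2 and vm3 get what they need.
print(allocate(1000, demands, limits={"vm1": 600}))

# Guideline ineffective after a faulty update: vm2 and vm3 are squeezed.
print(allocate(1000, demands))

# Guideline enforced far too strictly: vm1 receives no bandwidth at all.
print(allocate(1000, demands, limits={"vm1": 0}))
```

The first call satisfies all three systems, the second leaves `vm2` and `vm3` with less than half of their demand, and the third cuts `vm1` off from the network entirely.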

Example:

A medium-sized company operates a number of virtualisation servers in order to provide its usual server infrastructure efficiently. All services used in the company depend directly or indirectly on the virtual IT systems in this virtual infrastructure. Systems such as the directory service, the central email server, and the ERP system are operated there. The file and print servers are also operated as virtual IT systems.

The systems run without failures for some time. After an update of the virtualisation software on the virtualisation servers has been completed, the central administration software of the virtualisation servers indicates that the guest tools of the virtual IT systems are no longer up to date. The administrator responsible decides to update the guest tools in the virtual IT systems, having had no negative experiences with such updates in the past. Since the administrator does not have administrator rights on all virtual IT systems, he/she uses a function of the virtualisation servers that enables the tools to be updated across all virtualisation servers without any interaction with the individual virtual IT systems. The administrator starts the update two hours before the general start of work on a working day. He/she observes how the guest tools are reinstalled in the virtual IT systems and initially does not detect any obvious errors, since no error messages are logged on the consoles of the virtual systems.

However, after a certain number of virtual IT systems have been updated, he/she notices that they are no longer connected to the network. He/she investigates the problem and finds that the network card drivers of the virtual IT systems were also updated as part of the guest tools. The manufacturer made an error here, as a result of which the operating systems of the virtual IT systems detect the virtual network card as new hardware. The network card is therefore unconfigured. The network cards can only be reconfigured once the other administrators have arrived at the company. Until then, many employees in the corporate administration department are unable to access their data, and a large amount of working time is lost.