T 2.149 Insufficient storage capacity for virtual IT systems

Virtualisation servers need disk space, provided either locally in the virtualisation server or in a storage network, in order to operate the virtual IT systems. If the required storage capacity is planned inadequately, the availability of the virtual IT systems and the integrity of the information they process are at considerable risk. This applies in particular if special virtualisation functions such as snapshots or the overbooking of disk space are used. Bottlenecks can affect not only disk space on hard disks or in storage networks, but also main memory (RAM).

Virtualisation functions such as snapshots occupy additional disk space

Freezing and storing the operating state of a virtual IT system (a snapshot) requires sufficient disk space. For example, when a snapshot is created, the content of the virtual mass storage and possibly also the state of the main memory and processor are written to hard disk. In addition, some virtualisation solutions generate a differential file while the guest system is running. Together with the original state of the data on the virtual data medium before the snapshot was created, this differential file forms the current content of the virtual hard disk. Standby functions that allow virtual machines to be suspended during live operation use similar technology and therefore occupy storage resources until operation is resumed.
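The interplay of frozen original state and differential file can be illustrated with a small toy model. The following Python sketch is not taken from any virtualisation product; the class SnapshotDisk and its block-level bookkeeping are assumptions made purely for illustration.

```python
class SnapshotDisk:
    """Toy virtual disk: a frozen base image plus a differential file."""

    def __init__(self, base_blocks):
        # Block number -> content, frozen at the time of the snapshot.
        self.base = dict(base_blocks)
        # Blocks written after the snapshot (the differential file).
        self.delta = {}

    def write(self, block, data):
        # Writes never touch the base image; they land in the differential file.
        self.delta[block] = data

    def read(self, block):
        # Current content: differential file first, then the frozen base.
        return self.delta.get(block, self.base.get(block))

    def physical_blocks(self):
        # Both the frozen base and the differential file occupy physical space.
        return len(self.base) + len(self.delta)


disk = SnapshotDisk({0: "os", 1: "app-v1", 2: "data"})
before = disk.physical_blocks()
disk.write(1, "app-v2")  # overwrite an existing block after the snapshot
after = disk.physical_blocks()
print(before, after)
```

In this model, overwriting a block that already exists increases the physical consumption, because the old content is retained for the snapshot while the new content is stored in addition.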

Overbooking of disk space

Another particularity of virtual environments is that disk space can be overbooked. This means that no fixed amount of disk space is reserved when a virtual IT system is assigned a certain storage capacity. Instead, physical disk space is only allocated to the virtual IT system when the system actually uses it. For example, one hundred gigabytes may be visible to the virtual system while it physically occupies only the disk space currently in use.

Overbooked disk space can be implemented, for example, by a growing file container stored on a hard disk physically installed in the virtualisation server or in a storage network. This container grows as more of it is used. However, if data is deleted within the virtual IT system using this container, the container is normally not reduced in size automatically.
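The behaviour of such a container can be sketched as follows. This is a simplified model under stated assumptions: the class GrowingContainer is hypothetical, sizes are in gigabytes, and the guest is assumed to reuse freed space before the container has to grow again.

```python
class GrowingContainer:
    """Toy model of a growing file container for a virtual hard disk."""

    def __init__(self, visible_gb):
        self.visible_gb = visible_gb  # capacity visible to the guest
        self.used_gb = 0              # space in use from the guest's point of view
        self.file_gb = 0              # physical size of the container file

    def guest_write(self, gb):
        if self.used_gb + gb > self.visible_gb:
            raise ValueError("virtual hard disk is full")
        self.used_gb += gb
        # The container only grows once guest usage exceeds its previous peak.
        self.file_gb = max(self.file_gb, self.used_gb)

    def guest_delete(self, gb):
        # Deleting frees space inside the guest only; the container keeps its size.
        self.used_gb = max(0, self.used_gb - gb)


container = GrowingContainer(visible_gb=100)
container.guest_write(40)
container.guest_delete(30)
container.guest_write(10)
print(container.used_gb, container.file_gb)
```

After these three operations the guest reports 20 gigabytes in use, while the container still occupies the 40 gigabytes it reached at its peak.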

Regardless of whether the data medium holding the container of the virtual IT system is present locally or in the network, the size of the container is limited by the physically available disk space. Without prudent planning of the maximum capacities required, this can easily lead to problems. If the storage has been overbooked excessively, the free space may be used up sooner than expected. The storage requirements of the virtual IT system can then no longer be met by the physical medium, and an error occurs in the affected virtual machine. This is because the virtualisation server cannot provide any additional storage to the guest system, although free space still appears to be available from the point of view of the virtualised IT system.

In such a situation, many virtualisation products only permit read access to the virtual hard disk affected by the overbooking in order to protect the data written up to this point. As a consequence, data on these virtual hard disks may become inconsistent. The virtual IT system may even fail completely, for example if its operating system cannot compensate for the resulting errors. Other virtualisation solutions automatically create a snapshot of the affected systems and then shut these systems down when no more physical storage is available.
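A simple capacity check can make the degree of overbooking visible during planning. The following sketch is an illustration only: the function overcommit_report, the system names and all figures are hypothetical, and sizes are given in gigabytes.

```python
def overcommit_report(physical_gb, provisioned_gb, used_gb):
    """Compare assigned and used virtual disk space with physical capacity.

    provisioned_gb and used_gb map a system name to a size in gigabytes.
    """
    total_provisioned = sum(provisioned_gb.values())
    total_used = sum(used_gb.values())
    return {
        # Values above 1.0 mean more space has been assigned than exists.
        "overcommit_ratio": total_provisioned / physical_gb,
        # Physical space still free at the moment.
        "free_physical_gb": physical_gb - total_used,
        # Space that would be missing if every guest filled its disk.
        "worst_case_shortfall_gb": max(0, total_provisioned - physical_gb),
    }


report = overcommit_report(
    physical_gb=500,
    provisioned_gb={"erp": 200, "ts1": 200, "ts2": 200},
    used_gb={"erp": 120, "ts1": 60, "ts2": 70},
)
print(report)
```

With these assumed figures the guests currently fit comfortably, but the storage would fall 100 gigabytes short if all of them used the space that appears available to them.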

This approach impairs the availability of the services provided by these virtual IT systems. Moreover, the operation of all guest systems run by the virtualisation server is impaired in the same way if all available physical resources of the virtualisation server are exhausted.

Example:

An internationally operating trading company uses an ERP (Enterprise Resource Planning) system in order to automate and support various processes, including purchasing. To give its field service agents access to the ERP system, the company operates a terminal server farm, which the agents use to book their purchases and to take part in corporate communication (intranet and email). The platform must be available at all times, since the agents work in the field of commodity futures and the exact time of a purchase is therefore decisive for achieving a good price.

For cost reasons, the management of the company decides to operate the terminal server farm and the ERP systems as virtual IT systems in the future. While analysing the existing physical systems, the planning team determines that the hard disks of the existing systems are only lightly utilised. However, some database systems occasionally require more space when the purchasing figures are analysed at monthly intervals. This disk space is released immediately once the analysis is finished.

Moreover, the plan is to use the snapshot function of the virtualisation servers when changing ERP system versions. Since errors occasionally occur during updates, this function is intended to allow changes to be undone quickly. In the field of commodity futures, a time-consuming recovery of the pre-update status can quickly have adverse effects on business success. For this reason, the snapshot functions of the virtualisation servers are an important factor in the decision to introduce virtualisation technology in this company.

Since the hard disk space of the physical systems is only lightly utilised, the unused space is assumed to be a sufficient reserve for the snapshots. The team therefore decides to give the storage network configured for the virtual IT systems only as much capacity as is currently available in the physical systems as a whole. This is deemed sufficient, since updating the physical systems could never consume more disk space than was physically present.

Shortly before the end of a month, the ERP software is updated. Both the ERP systems and the terminal servers must be updated, since the new and urgently required functions can only be used if the client software on the terminal servers is replaced as well. In order to guard against malfunctions, a snapshot of all systems, that is, the ERP systems and the terminal servers, is created prior to the update. The snapshot is created after the monthly operating figures have been generated, so that the figures based on the old software are quickly available if the update fails.

From this point on, all changes to the hard disk containers of the virtual IT systems are written to a differential file, and storage consumption in the storage network rises sharply. It had not been taken into consideration that files replaced during the software update are not physically overwritten while a snapshot exists, but remain part of the snapshot. The storage required for updating the virtual IT systems therefore doubles.

During the monthly analysis, the disk space in the storage network is completely exhausted, so that no further data can be written. Here, too, it had not been taken into consideration that the space for the analysis has to be allocated anew in the differential file. The administrator of the virtual IT system responsible for the analysis notices that space on the virtual hard disk is scarce and therefore deletes the old analysis before creating the new one. However, this has no effect on the physically occupied disk space, since the physical disk space used by the old analysis data is now part of the snapshot.
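The course of events in this example can be traced with some simple arithmetic. The sketch below uses purely hypothetical figures in gigabytes (1000 physical capacity, 700 in use when the snapshot is taken, 200 replaced by the update, 150 for one analysis); none of these values come from the scenario itself. It also assumes that new data does not reuse blocks already held in the differential file.

```python
def simulate(capacity_gb, frozen_gb, update_gb, analysis_gb):
    """Return (step, guest_used_gb, physical_gb, fits) for each step after a snapshot."""
    guest = frozen_gb  # space in use from the guest's point of view
    delta = 0          # size of the differential file
    steps = []

    def record(name):
        physical = frozen_gb + delta  # snapshot content plus differential file
        steps.append((name, guest, physical, physical <= capacity_gb))

    record("snapshot taken")
    delta += update_gb      # replaced files are written to the differential file
    record("software updated")
    guest -= analysis_gb    # old analysis deleted in the guest ...
    record("old analysis deleted")  # ... but its data stays in the snapshot
    guest += analysis_gb
    delta += analysis_gb    # new analysis is written to the differential file
    record("new analysis written")
    return steps


timeline = simulate(capacity_gb=1000, frozen_gb=700, update_gb=200, analysis_gb=150)
for step in timeline:
    print(step)
```

With these assumed values, deleting the old analysis lowers the usage seen by the guest from 700 to 550 gigabytes, while the physical consumption stays at 900 gigabytes. Writing the new analysis then requires 1050 gigabytes and exceeds the physical capacity.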

The virtualisation software automatically protects the virtual IT systems against a loss of data and data inconsistency by stopping the virtual IT systems. This causes the complete and simultaneous failure of all terminal servers and ERP systems. The agents are completely cut off from the corporate communication and cannot be informed about the failure. This causes a delay in the transactions performed in commodity futures and the company must pay significantly higher prices for the purchased goods.

Before the failed systems could be restarted, free disk space had to be created for the virtual IT systems. The administrators had the choice of either resetting the virtual IT systems to the snapshot or expanding the capacity of the storage network. Since the terminal server farm and the ERP systems had to be available again quickly, they decided to reset the systems to the snapshot. The personnel costs incurred for updating the systems therefore had to be written off.

After the disk space actually required in order to update the ERP software with the help of snapshots had been determined properly, the storage capacity of the storage network was then expanded. It was only possible to use the urgently required function extensions of the updated software after this expansion had been performed.