T 4.78 Failure of virtual machines due to unfinished data backup processes

Classic data backup methods are based on agents installed on the IT systems to be backed up. These agents transmit the data to be backed up from the IT system to the data backup server. This server in turn forwards the data to the data backup devices.

IT systems and the mass storage they use can be decoupled by introducing storage networks. This means that the data to be backed up can be transmitted to the data backup server via the storage network rather than by the IT system being backed up. For some storage network products, the data backup devices themselves are part of the storage network and are only controlled by the data backup server. This relieves both the backed-up IT system and the data backup server of transporting the backup data.

This concept is emulated and extended by some virtualisation products. For example, the virtualisation servers may provide the mass storage of virtual IT systems (virtual hard disks) to a data backup system so that this system can back up the data stored there. The virtual hard disk must be in a consistent state so that no inconsistent data is backed up. To achieve this, the content of the virtual hard disk is frozen (snapshot). This procedure is completely transparent to the virtual IT system being backed up. Since that virtual IT system continues to run and continues to change the hard disk, these changes are written to a differential file. As a result, the total disk space required by this IT system increases. The final size of the differential file depends on how many changes are made in the file system of the virtual IT system during the backup procedure. Once the data backup is complete, the changes made in the meantime are applied to the frozen state and the differential file is deleted.
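
The snapshot procedure described above can be illustrated with a minimal sketch. It is a toy model, not the mechanism of any particular virtualisation product: the class `VirtualDisk` and all of its method names are hypothetical, and blocks are represented as dictionary entries purely for illustration.

```python
class VirtualDisk:
    """Toy model of a virtual hard disk with at most one snapshot (hypothetical)."""

    def __init__(self, blocks):
        self.base = dict(blocks)  # disk content; frozen while a snapshot exists
        self.delta = None         # differential file; None when no snapshot exists

    def create_snapshot(self):
        """Freeze the base content and start an empty differential file."""
        if self.delta is not None:
            raise RuntimeError("a snapshot already exists")
        self.delta = {}

    def write(self, block_no, data):
        """Writes go to the differential file while a snapshot exists."""
        if self.delta is None:
            self.base[block_no] = data
        else:
            self.delta[block_no] = data

    def read(self, block_no):
        """The running system always sees its most recent write."""
        if self.delta is not None and block_no in self.delta:
            return self.delta[block_no]
        return self.base[block_no]

    def backup_view(self):
        """What the backup system reads: the frozen, consistent state."""
        return dict(self.base)

    def delete_snapshot(self):
        """Apply the collected changes to the frozen state and drop the delta."""
        if self.delta is None:
            raise RuntimeError("no snapshot exists")
        self.base.update(self.delta)
        self.delta = None
```

In this model, a write issued after `create_snapshot()` is visible to the running system through `read()` but not to the backup system through `backup_view()`. The differential file grows with every newly changed block until `delete_snapshot()` is called.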

If the backup of the data in the virtualisation environment does not complete, because the backup procedure runs for a long time or because of communication problems in the network, the differential file created with the snapshot may become very large. If the data backup process is aborted abruptly, this file may remain in place permanently. As a result, the disk space holding the virtual hard disks of the virtual machines being backed up may be completely exhausted, particularly if several virtual IT systems are backed up in this way simultaneously.
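
A rough calculation shows how quickly this can happen. The sketch below assumes that each differential file grows linearly with the runtime of the backup, which is a simplification (repeated changes to the same block may not need additional space, depending on the product). The function names and all figures are hypothetical.

```python
def delta_size_gib(change_rate_gib_per_h, backup_hours):
    """Estimated size of one differential file, assuming linear growth."""
    return change_rate_gib_per_h * backup_hours


def datastore_exhausted(free_gib, vms):
    """Check whether the differential files of all VMs exceed the free space.

    `vms` is a list of (change_rate_gib_per_h, backup_hours) tuples,
    one per virtual IT system that is backed up at the same time.
    """
    total = sum(delta_size_gib(rate, hours) for rate, hours in vms)
    return total > free_gib
```

For example, two virtual IT systems changing 5 GiB and 10 GiB per hour during a four-hour backup need an estimated 60 GiB in total, so `datastore_exhausted(50, [(5, 4), (10, 4)])` evaluates to `True`.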

If the disk space used for the differential file is exhausted, the virtualisation server denies any further write access to the virtual hard disk of the virtual IT system, and the system enters an error state. This may cause the virtual IT system to crash if its operating system cannot compensate for this error state.
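
The denial of write access can be sketched as follows. This is an illustrative model only: `Datastore`, `DatastoreFull` and `write_during_snapshot` are hypothetical names, and space is counted in whole blocks for simplicity.

```python
class DatastoreFull(Exception):
    """Raised when the disk space holding the differential file is exhausted."""


class Datastore:
    """Toy model of the disk space shared by the virtual hard disks."""

    def __init__(self, free_blocks):
        self.free_blocks = free_blocks

    def allocate(self, n=1):
        if n > self.free_blocks:
            raise DatastoreFull("no space left for the differential file")
        self.free_blocks -= n


def write_during_snapshot(datastore, delta, block_no, data):
    """Write one block to the differential file `delta` (a dict).

    Assumption: only a block that is changed for the first time
    consumes additional space.
    """
    if block_no not in delta:
        datastore.allocate(1)  # raises DatastoreFull when space runs out
    delta[block_no] = data
```

In this model, the exception reaches the virtual IT system as a failed write. Whether the system survives depends on how its operating system handles that failure.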

Example:

The operator of a computer centre has virtualised a large number of its server systems. These servers process large amounts of data every day, and this data must be backed up daily.

Due to the high data volume, the data backup puts a considerable load on the virtual IT systems and can no longer be completed during the night alone. As a result, performance losses occur during normal working hours. It is therefore decided to no longer perform the data backup in the classic, agent-based manner, but to use snapshots instead. The snapshots are created every evening at a fixed time, the data is backed up, and each snapshot is deleted as soon as the backup procedure is complete.

This solution runs smoothly for some time, but the data backup volume soon increases so much that a new data backup process is triggered before the previous one is complete. Shortly afterwards, all virtual IT systems on the virtualisation server fail because the available disk space is exhausted.
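
The point at which backup runs start to overlap can be estimated with a small simulation. The sketch assumes a constant backup throughput and a data volume that grows by a fixed amount each day. Both are simplifications, and the function name and all parameters are hypothetical.

```python
def first_overlap_day(initial_volume_gib, daily_growth_gib,
                      throughput_gib_per_h, interval_h=24, max_days=3650):
    """Return the first day on which a backup runs longer than the interval
    between two scheduled backups, or None if this never happens within
    `max_days`.
    """
    volume = initial_volume_gib
    for day in range(1, max_days + 1):
        duration_h = volume / throughput_gib_per_h
        if duration_h > interval_h:
            # The next backup is triggered while this one is still running,
            # so a second snapshot is created before the first is deleted.
            return day
        volume += daily_growth_gib
    return None
```

With an initial volume of 2000 GiB, growth of 100 GiB per day and a throughput of 100 GiB per hour, `first_overlap_day(2000, 100, 100)` returns 6: on the sixth day the backup takes 25 hours and no longer fits into the 24-hour interval.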