S 6.98 Contingency planning for storage systems

Initiation responsibility: Head of IT, IT Security Officer

Implementation responsibility: Administrator, Head of IT

Troubleshooting storage systems

Malfunctions, ranging from sporadic errors in components to failures limited to a single device, can occur in any IT operation. Secure IT operation depends on being prepared for them. This includes preparing for the failure of or damage to hardware and software, for example due to defects or compromised systems.

To be able to react quickly and effectively in such situations, diagnostics and troubleshooting must be planned and prepared in advance. Instructions should be created for typical failure scenarios and for failures that have already occurred in the organisation. Cookbook-style documentation containing the measures to take and commands to execute to support error analysis and correction is particularly helpful.

Documenting the links and dependencies, which differ from one organisation to the next, is critical for evaluating malfunctions and for intervening quickly and safely, especially in complex systems such as storage systems.

A suitable logging function running during operations is also a prerequisite for successful diagnostics (see also S 2.359 Monitoring and administration of storage systems). In addition, suitable tools should be used for error handling. Both free and commercial programs are available for this purpose, often from the manufacturer of the device. Suitable tools become even more important in complex systems, where the task is not to control and operate the individual components but to obtain an overview of the interaction between the hardware and software of the often very heterogeneous overall system.

It must be clear that storage systems in particular can only be returned to normal operation after malfunctions and emergencies if a usable data backup is available. Tests of the ability to restore the data backups must be performed regularly (see S 6.22 Sporadic checks of the restorability of backups).
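A restore test of this kind can be partly automated. The following is a minimal sketch, assuming the backup has been restored into a separate directory and that the reference data is still readable for comparison. The function names and the directory layout are illustrative assumptions and are not taken from this safeguard.

```python
import hashlib
from pathlib import Path


def file_digest(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1024 * 1024), b""):
            digest.update(chunk)
    return digest.hexdigest()


def compare_restore(original: Path, restored: Path) -> list[str]:
    """List files that are missing from, or differ in, the restored tree.

    Both arguments are hypothetical directory roots chosen for this sketch.
    Files that exist only in the restored tree are not reported.
    """
    problems = []
    for source in sorted(p for p in original.rglob("*") if p.is_file()):
        relative = source.relative_to(original)
        target = restored / relative
        if not target.is_file():
            problems.append(f"missing: {relative.as_posix()}")
        elif file_digest(source) != file_digest(target):
            problems.append(f"differs: {relative.as_posix()}")
    return problems
```

An empty result only shows that the sampled files were restored unchanged. It does not replace a full test of the restore procedure itself.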

The procedure for handling errors in storage systems can be divided into the areas of administration, performance measurement, and diagnostics. The aspects to be taken into account in each of these three areas are explained below:

Administration

All commands necessary for administration and configuration must be documented in an operating manual.

The following aspects must be taken into account:

Performance

The following aspects should be taken into account when measuring and reporting the performance:

Diagnostics

All commands necessary for diagnosing errors as well as the expected output and the meaning of this output should be documented. This includes, for example, information relating to the status of the various system components and interfaces, as well as information on the current configurations.
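One way to keep the command, the expected output, and its meaning together is to record each diagnostic step as a structured entry. The sketch below is illustrative only: the class, the field names, and the example command text are assumptions, and the real commands depend on the storage product in use.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class DiagnosticStep:
    """One documented diagnostic step (hypothetical structure for this sketch)."""
    command: str           # command text as documented in the operating manual
    expected_pattern: str  # regular expression that healthy output should match
    meaning: str           # what a matching output tells the administrator


def evaluate(step: DiagnosticStep, actual_output: str) -> str:
    """Compare captured command output with the documented expectation."""
    if re.search(step.expected_pattern, actual_output):
        return f"OK: {step.meaning}"
    return f"DEVIATION: expected /{step.expected_pattern}/ from '{step.command}'"


# Usage with a made-up command name and made-up output text.
step = DiagnosticStep(
    command="show-controller-status",
    expected_pattern=r"state:\s*online",
    meaning="controller is reachable and serving I/O",
)
print(evaluate(step, "controller A state: online"))
```

Recording the expectation as a pattern rather than as literal text tolerates harmless variations such as timestamps in the output.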

The following information, amongst other things, is relevant when diagnosing errors:

Contingency planning to increase availability

Planning the procedure to follow when malfunctions occur can minimise the restoration time and, under some circumstances, may even be the only way to make a solution possible. The planning must be coordinated with the overall malfunction and contingency planning and should be based on the general business continuity planning concept (see module S 1.3 Business continuity management). That concept formulates the general specifications for business continuity documents for the entire IT system. Ideally, it specifies uniform and binding requirements as well as the layout, contents, and form of the documents.

The exact availability requirements for the storage systems must be clearly defined.
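When an availability requirement is stated as a percentage, it helps to translate it into the downtime it actually permits. The following sketch shows the arithmetic. The function name and the default period of one non-leap year are assumptions made for illustration.

```python
def allowed_downtime_hours(availability_percent: float,
                           period_hours: float = 365 * 24) -> float:
    """Return the downtime, in hours, that a given availability permits.

    The default period of 8760 hours (one non-leap year) is an assumption.
    """
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability must be between 0 and 100 percent")
    return period_hours * (1 - availability_percent / 100)


# 99.9 % availability over one year permits roughly 8.76 hours of downtime.
print(round(allowed_downtime_hours(99.9), 2))
```

Whether planned maintenance windows count towards this downtime has to be specified as part of the requirement.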

The following questions are relevant to contingency planning:

Administration of Service Level Agreements:

An SLA generally runs for a limited term and is not always renewed automatically. Furthermore, the cost often increases significantly as the term of an SLA lengthens, and SLAs may no longer be offered at all for outdated systems, so that investing in a new storage system is often less expensive. This must be taken into account in good time and planned for accordingly.
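Tracking SLA expiry dates can be supported by a simple list that is checked regularly. The sketch below is a minimal illustration. The record fields and the lead time of 180 days are assumptions, not requirements of this safeguard.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)
class ServiceLevelAgreement:
    """Minimal SLA record (hypothetical fields for this sketch)."""
    system: str
    expires_on: date
    auto_renews: bool


def needing_attention(agreements, today: date, lead_time_days: int = 180):
    """Return SLAs without automatic renewal that end within the lead time.

    Already expired SLAs are included, earliest expiry date first.
    """
    deadline = today + timedelta(days=lead_time_days)
    due = [a for a in agreements
           if not a.auto_renews and a.expires_on <= deadline]
    return sorted(due, key=lambda a: a.expires_on)
```

The lead time should be long enough to cover procurement of a replacement system in case the SLA cannot be extended on acceptable terms.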

Documentation of the contingency planning

The exact procedure in certain emergency situations must be described in a contingency plan. This plan contains the following points:

Care must be taken when drawing up the procedure descriptions necessary for contingency planning, and the procedures must be tested regularly. In some cases, different procedures must be written for different types of devices and operating systems.

The documentation must not be available in electronic form only. Instructions should also be available at least in paper form. If necessary, configuration files can also be stored separately on CD-ROM.

Keeping a reserve of spare parts is probably the most important safeguard for increasing availability, since it minimises the downtime in the event of a hardware defect. As an alternative or in addition, service contracts that ensure availability through guaranteed response times or even guaranteed repair times can be signed with the manufacturer. In this way, the costs of stocking spare parts can be reduced or an even higher level of hardware availability can be attained. The supply of software updates can also be regulated within the framework of such a contract.

Review questions: