S 6.98 Contingency planning for storage systems

Initiation responsibility: Head of IT, IT Security Officer

Implementation responsibility: Administrator, Head of IT

Troubleshooting storage systems

Malfunctions ranging from sporadic errors in components to failures limited to a single device can occur in every IT operation. Secure IT operations therefore depend on being prepared for such malfunctions. This includes preparation for the failure of or damage to hardware and software, for example due to defects or compromised systems.

To be able to react quickly and effectively in such situations, diagnostics and troubleshooting must be planned and prepared in advance. Instructions should be created for typical failure scenarios and for failures that have already occurred in the organisation. Cookbook-style documentation containing the measures to take and commands to execute to support error analysis and correction is particularly helpful.

Documentation of the links and dependencies, which differ in each organisation, is critical to evaluating malfunctions and to intervening quickly and securely, especially for complex systems such as storage systems.

A suitable logging function running during operations is also a prerequisite for the success of the diagnostic procedures (see also S 2.359 Monitoring and administration of storage systems). In addition, suitable tools should be used for error handling. Both free and commercial programs are available for this purpose, often from the manufacturer of the device. The use of suitable tools becomes even more important in complex systems, since the task there is not to control and operate the individual components but to obtain an overview of the interaction between the hardware and software of the often very heterogeneous overall system.

It must be clear that storage systems in particular can only be returned to normal operation after malfunctions and emergencies if a usable data backup is available. Tests of the ability to restore the data backups must be performed regularly (see S 6.22 Sporadic checks of the restorability of backups).
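A restorability test can be partly automated. The following is a minimal sketch only, assuming that a SHA-256 checksum of each file was recorded at backup time; the function names sha256_of and verify_restore are hypothetical and not taken from any particular backup product.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        # Read in fixed-size chunks so large files do not have to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_restore(restored: Path, recorded_digest: str) -> bool:
    """True if the restored file matches the digest recorded at backup time.

    Assumption: recorded_digest is a SHA-256 hex string stored with the backup.
    """
    return sha256_of(restored) == recorded_digest.lower()
```

A check of this kind only confirms that individual files were restored unchanged; it does not replace a full restore test of the storage system.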

The procedure for handling errors in storage systems can be divided into the areas of administration, performance measurement, and diagnostics. The aspects to be taken into account in each of these three areas are explained below:

Administration

All commands necessary for administration and configuration must be documented in an operating manual.

The following aspects must be taken into account:

Setup of (administrative) users, granting of authorisations
Firmware and operating system updates
Configuration
- of storage resources
- of administrative access
- of the connected servers and backup devices
Logging

Performance

The following aspects should be taken into account when measuring and reporting the performance:

Allocation of the media (for each logical or physical device)
Throughput of each interface
Statistical utilisation information
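The allocation figures listed above could be evaluated as in the following sketch. It is an illustration only: the class DeviceUsage, the function devices_over_threshold, and the 80 % threshold are assumptions, and the figures would in practice come from the storage system's own reporting tools.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DeviceUsage:
    name: str             # hypothetical label of a logical or physical device
    capacity_gib: float   # total capacity, assumed to be greater than zero
    allocated_gib: float  # capacity currently allocated

    @property
    def allocation_pct(self) -> float:
        """Allocated share of the capacity in percent."""
        return 100.0 * self.allocated_gib / self.capacity_gib


def devices_over_threshold(
    devices: list[DeviceUsage], threshold_pct: float = 80.0
) -> list[str]:
    """Return the names of devices whose allocation reaches the threshold."""
    return [d.name for d in devices if d.allocation_pct >= threshold_pct]
```

Reporting such values regularly makes it easier to recognise early when the capacity of the current design is becoming too low.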

Diagnostics

All commands necessary for diagnosing errors as well as the expected output and the meaning of this output should be documented. This includes, for example, information relating to the status of the various system components and interfaces, as well as information on the current configurations.

The following information, amongst other things, is relevant when diagnosing errors:

Status of the network interfaces and the other connections
Status of the network services (TCP/IP for NAS systems; specific information for SANs, e.g. the status of the SAN switches)
Overview of the overall configuration
Processes
Assignment
Users logged in
Logging (use of the log levels, interpretation of the log information)
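One possible way to record a diagnostic command together with its expected output and the meaning of that output is sketched below. The structure DiagnosticStep and its field names are assumptions chosen for illustration; the actual commands are vendor-specific and must be taken from the manufacturer's documentation.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class DiagnosticStep:
    description: str       # what the step checks
    command: str           # command as documented in the operating manual
    expected_pattern: str  # regular expression that healthy output should match
    meaning: str           # how to interpret output that does not match

    def output_is_expected(self, actual_output: str) -> bool:
        """True if the captured output contains the documented pattern."""
        return re.search(self.expected_pattern, actual_output) is not None


# Hypothetical entry; the command is a placeholder, not a real CLI call.
port_check = DiagnosticStep(
    description="Status of a SAN switch port",
    command="<vendor-specific port status command>",
    expected_pattern=r"\bonline\b",
    meaning="Any other state indicates a link or switch problem.",
)
```

Keeping the expected output next to the command allows an administrator to compare the actual output directly, for example with port_check.output_is_expected(captured_text).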

Contingency planning to increase availability

Planning the procedure to follow when malfunctions occur can minimise the restoration time and, under some circumstances, may even be the only way to make a solution possible. The planning must be coordinated with the overall malfunction and contingency planning and should be based on the general business continuity planning concept (see module S 1.3 Business continuity management). That concept formulates the general specifications for business continuity documents for the entire IT system. Ideally, these specify uniform and binding requirements as well as the layout, contents, and form of the documents.

The exact availability requirements for the storage systems must be clearly defined.

The following questions are relevant to contingency planning:

What are possible reasons for malfunctions?
- hardware defects
- capacity of the current design is too low (malfunctions or failures when usage increases)
What are the monitoring requirements?
How can it be ensured that malfunctions are detected early?
What information must always be evaluated by the personnel responsible for the operation of the storage systems?
What safeguards can be taken?
- replacement devices
- replacement parts
- implementation of failover solutions which enable switching to an alternative device during live operations
- maintenance contracts
- employee training
Which Service Level Agreements (SLAs) should be concluded?

Administration of Service Level Agreements:

SLAs generally run for a limited term and are not always renewed automatically. Furthermore, the cost often increases significantly with the length of the term, or SLAs are no longer offered at all for outdated systems, so that investing in a new storage system is often less expensive. This must be taken into account in due time and planned for accordingly.
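Expiring agreements can be tracked with a simple check such as the following sketch. The record ServiceContract, the function contracts_needing_action, and the default lead time of 180 days are assumptions for illustration; the appropriate lead time depends on the organisation's procurement process.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)
class ServiceContract:
    system: str        # hypothetical name of the covered storage system
    expires_on: date   # last day of the SLA term
    auto_renews: bool  # whether the SLA is renewed automatically


def contracts_needing_action(
    contracts: list[ServiceContract],
    today: date,
    lead_time: timedelta = timedelta(days=180),
) -> list[ServiceContract]:
    """Return contracts that expire within the lead time and do not auto-renew.

    Contracts that have already expired are included as well.
    """
    return [
        c for c in contracts
        if not c.auto_renews and c.expires_on - today <= lead_time
    ]
```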

Documentation of the contingency planning

The exact procedure in certain emergency situations must be described in a contingency plan. This plan contains the following points:

How will diagnostics be performed? The following information can be helpful in this case:
- status queries
- display of the configuration
- display of the processes currently running
- users logged in
- logging
What correction procedures must be performed?
- procedure if the entire system fails (restoration of the operating system and configuration)
- procedure if subcomponents fail, for example the storage system
Who must be informed in the event of damage?
- server and application administration
- hardware supplier/contact person for the maintenance contract
- all necessary information on the maintenance contracts and Service Level Agreements, hotline numbers, customer or device identification numbers
What documents must be available if damage occurs?
- basic configuration for (re)starting operation
- changes to the basic configuration to set up the current operating configuration
- rules for controlling access (Access Control Lists or ACLs)
- users configured and their authorisations
- passwords for emergency access
How is a restart performed?
- dependencies on other systems of the overall IT system
- reinstallation of the operating system and configuration
- restoring a backed up configuration
- possibility of limited operations
- remote operation at another location

Care must be taken when drawing up the procedure descriptions necessary for contingency planning, and the procedures must be tested regularly. In some cases, different procedures must be written for different types of devices and operating systems.

The documentation must not be available in electronic form only; the instructions should also be available at least in paper form. If necessary, configuration files can also be stored separately on CD-ROM.

Probably the most important safeguard for increasing availability is keeping a reserve of spare parts in order to minimise the downtime in the event of a hardware defect. As an alternative or in addition to this, service contracts can be signed with the manufacturer that ensure availability through guaranteed response times or even guaranteed repair times. In this way, the costs of stocking spare parts can be reduced or an even higher level of hardware availability can be attained. The supply of software updates can also be regulated within the framework of such a contract.

Review questions:

Are instructions created for typical failure scenarios and for failures that have already occurred in the organisation?
Is the restorability of data backups tested at least sporadically?
Are all commands necessary for administration and configuration documented in the operating manual?
Is the procedure to follow when malfunctions occur coordinated with the overall malfunction planning and based on the general business continuity planning concept?
Are SLAs planned in time, concluded and, if necessary, renewed?
Are the procedure descriptions created for emergencies tested at regular intervals?
Is the contingency planning documented at least electronically and in paper form?