S 6.92 Contingency planning for routers and switches

Initiation responsibility: Head of IT, IT Security Officer

Implementation responsibility: Administrator

Troubleshooting for routers and switches

Malfunctions can occur in every IT system. They range from sporadic incorrect behaviour of individual components, through failures clearly limited to a single device, to network failures caused by such a device. Secure IT operations therefore depend on being prepared for malfunctions. This includes preparing for the failure of or damage to hardware and software, for example due to defects or compromised systems.

To be able to react quickly and effectively in such situations, diagnostics and troubleshooting must be planned and prepared in advance. Instructions should be created for typical failure scenarios and for failures that have already occurred in the organisation. Cookbook-like documentation of all necessary commands, their use, and the outputs to be expected is particularly helpful in situations that call for a fast response. This covers not only diagnostics and error handling, but also the administrative actions necessary in normal operation. The latter are typically described in the documentation provided by the manufacturer. For daily operations, however, it is sensible to combine everything into one set of documents in the form of an operating manual.
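As an illustration of what a cookbook-like entry could contain, the following minimal sketch stores a command together with its expected output and the documented next step. The class name, the field names, the interface name `uplink1`, and the sample outputs are assumptions made for this example only. The command text is a placeholder and is not taken from any vendor's CLI.

```python
import re
from dataclasses import dataclass


@dataclass
class RunbookEntry:
    """One cookbook-style step: what to run and what a healthy result looks like."""
    scenario: str          # failure scenario this step belongs to
    command: str           # command to enter on the device (placeholder text)
    expected_pattern: str  # regular expression a healthy output should match
    on_mismatch: str       # documented next action if the output differs


def check_output(entry: RunbookEntry, actual_output: str) -> str:
    """Return "ok" if the output matches the documented expectation,
    otherwise return the documented next action."""
    if re.search(entry.expected_pattern, actual_output):
        return "ok"
    return entry.on_mismatch


# Hypothetical entry; all values are illustrative.
entry = RunbookEntry(
    scenario="Uplink port down",
    command="<vendor command to display interface status>",
    expected_pattern=r"uplink1\s+up",
    on_mismatch="Check cabling, then replace the module from the spare stock",
)

print(check_output(entry, "uplink1   up   1000 Mbit/s"))
print(check_output(entry, "uplink1   down"))
```

Keeping the expected output next to the command means that, under time pressure, the administrator does not have to judge from memory whether an output is normal. The same entries can be printed for the paper copy of the operating manual.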

Suitable logging during operations is also a prerequisite for successful diagnostics (see also S 4.205 Logging on routers and switches). In addition, suitable tools should be used for error handling. Both free and commercial programs are available for this purpose, often from the manufacturer of the device. Such tools are all the more important because system commands cannot always display every configuration setting. Sometimes only the settings that deviate from the defaults are shown.

The approach to handling errors can be divided into the areas of administration, performance measurement, and diagnostics. The aspects to be taken into account in each of these three areas are described below:

Administration

All commands necessary for administration and configuration must be documented in an operating manual.

The following aspects must be taken into account:

Performance

The following aspects should be taken into account when reporting the performance:

Diagnostics

For diagnostic purposes, all commands needed to view the status of the entire system, the interfaces, and their configuration should be documented together with the outputs to be expected. Moreover, many commands offer a debug mode that outputs comprehensive status information.

The following information, amongst other things, is relevant when diagnosing errors:

S 2.215 Error handling should be taken into consideration as a further safeguard.

Contingency planning to increase availability

Planning the procedure to follow when malfunctions occur can minimise the restoration time and, under some circumstances, may even be the only way to make a solution possible. The planning must be coordinated with the overall malfunction and contingency planning and should be based on the general business continuity planning concept (see module S 1.3 Business continuity management). The general specifications for business continuity documents for the entire IT system are formulated there. Ideally, they specify uniform and binding requirements regarding the layout, contents, and form of the documents.

The following questions are relevant to contingency planning:

The documentation must not be available in electronic form only. Instructions should be available at least in paper form as well. If necessary, configuration files can also be stored separately on CD-ROMs or other data media.
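A separately stored configuration file is only useful if it is still intact when it is needed. One possible way to check this, sketched below, is to record a SHA-256 checksum when the copy is written and compare it before the file is restored. The function names, the file name `router.cfg`, and its contents are assumptions for illustration. This safeguard does not prescribe any particular checksum procedure.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 digest of a file as a hexadecimal string."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def backup_is_intact(backup: Path, recorded_digest: str) -> bool:
    """Compare the backup's current digest with the digest noted
    when the backup was written."""
    return sha256_of(backup) == recorded_digest.strip().lower()


# Demonstration with a temporary file standing in for the stored copy.
with tempfile.TemporaryDirectory() as tmp:
    cfg = Path(tmp) / "router.cfg"           # hypothetical file name
    cfg.write_text("hostname example\n")     # placeholder content
    digest = sha256_of(cfg)                  # note this in the paper records
    unchanged = backup_is_intact(cfg, digest)

    cfg.write_text("hostname changed\n")     # simulate a damaged or altered copy
    altered = backup_is_intact(cfg, digest)

print(unchanged, altered)
```

The recorded digest can be written into the paper documentation next to the description of the data medium, so that the check does not depend on the system that has failed.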

Care must be taken when drawing up the procedure descriptions necessary for contingency planning, and the procedures must be tested regularly. In some cases, different procedures must be written for different types of devices and operating systems.

Probably the most important safeguard for increasing availability is keeping a reserve of spare parts to minimise the downtime in the event of a hardware defect. As an alternative or in addition, service contracts can be signed with the manufacturer that ensure availability through guaranteed response times or even guaranteed repair times. In this way, storage costs can be reduced or an even higher level of hardware availability can be attained. The supply of software updates can also be regulated within the framework of such a contract.
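The effect of spare parts or guaranteed repair times on availability can be estimated roughly. The sketch below uses the common steady-state approximation, availability = MTBF / (MTBF + MTTR), where MTBF is the mean time between failures and MTTR the mean time to repair. All figures are invented for illustration and must be replaced with values from the manufacturer and the organisation's own planning.

```python
HOURS_PER_YEAR = 8760


def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability: MTBF / (MTBF + MTTR)."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


def annual_downtime_hours(avail: float) -> float:
    """Expected downtime per year for a given availability."""
    return (1.0 - avail) * HOURS_PER_YEAR


# Assumed example values, not taken from any data sheet.
MTBF = 50_000.0           # mean time between failures, in hours
MTTR_WITH_SPARE = 2.0     # spare device on the shelf, swapped by the administrator
MTTR_WITHOUT_SPARE = 72.0 # replacement has to be ordered and delivered

with_spare = availability(MTBF, MTTR_WITH_SPARE)
without_spare = availability(MTBF, MTTR_WITHOUT_SPARE)

print(f"with spare:    {annual_downtime_hours(with_spare):.2f} h/year")
print(f"without spare: {annual_downtime_hours(without_spare):.2f} h/year")
```

A guaranteed repair time from a service contract can be entered as the MTTR in the same way, which allows a contract to be compared with the cost of keeping spare parts in stock.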

Review questions: