S 6.92 Contingency planning for routers and switches

Initiation responsibility: Head of IT, IT Security Officer

Implementation responsibility: Administrator

Troubleshooting for routers and switches

Malfunctions can occur in every IT system. They range from sporadic incorrect behaviour of individual components, through failures clearly limited to a single device, to network failures caused by such a device. The basis of secure IT operations is preparation for the situation in which a malfunction occurs. This includes preparation for the failure of, or damage to, hardware and software, for example due to defects or compromised systems.

To be able to react quickly and effectively in such situations, diagnostics and troubleshooting must be planned and prepared in advance. Instructions should be created for typical failure scenarios and for failures that have already occurred in the organisation. Cookbook-like documentation of all necessary commands, how they are applied, and the outputs to be expected is particularly helpful in situations that call for a fast response. Such documentation should cover not only diagnostics and error handling, but also the administrative actions necessary in normal operation. The latter are typically described in the documentation provided by the manufacturer. However, for daily operations it is sensible to compile an overall set of documents in the form of an operating manual.
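A cookbook entry of this kind can be kept in a structured form in which every command is stored together with the output expected from it. The following is a minimal sketch only. The scenario name, the command placeholders, and the expected-output patterns are hypothetical and must be replaced by the commands and outputs of the devices actually in use.

```python
import re
from dataclasses import dataclass, field
from typing import List


@dataclass
class Step:
    """One documented command together with the output expected from it."""
    command: str           # hypothetical placeholder, not a real device command
    expected_pattern: str  # regular expression the observed output should match
    note: str = ""         # free-text hint for the administrator


@dataclass
class RunbookEntry:
    """Instructions for one failure scenario."""
    scenario: str
    steps: List[Step] = field(default_factory=list)

    def check(self, step_index: int, actual_output: str) -> bool:
        """Return True if the observed output matches the documented expectation."""
        step = self.steps[step_index]
        return re.search(step.expected_pattern, actual_output) is not None


# Hypothetical example entry; commands and patterns are illustrative only.
entry = RunbookEntry(
    scenario="Uplink interface down",
    steps=[
        Step("<interface status command>", r"uplink\s+up", "Uplink should report 'up'"),
        Step("<log display command>", r"link state changed", "Look for link state changes"),
    ],
)
```

Storing the expected output as a pattern allows an administrator, or a script, to compare what the device actually returns with what the operating manual says it should return.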

A suitable logging function running during operations is also a prerequisite for successful diagnostics (see also S 4.205 Logging on routers and switches). In addition, suitable tools should be used for error handling. Both free and commercial programs are available for this purpose, often from the manufacturer of the device. The use of suitable tools is all the more important because the system commands cannot always display all configuration settings. Sometimes only the settings that deviate from the defaults are displayed.

The approach to handling errors can be divided into three areas: administration, performance measurement, and diagnostics. The aspects to be taken into account in each of these areas are described below:

Administration

All commands necessary for administration and configuration must be documented in an operating manual.

The following aspects must be taken into account:

setup of users, granting of authorisations
updating the operating system
configuration
- interface
- line ports
- access control lists
- routing
logging

Performance

The following aspects should be taken into account when reporting the performance:

incoming and outgoing traffic (per interface or port)
throughput or traffic per interface
statistical information about the protocols used
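
Throughput per interface is usually derived from two readings of a traffic counter taken a known time apart. The following sketch assumes that the device exposes a cumulative octet counter per interface which wraps around at a fixed width (32 or 64 bits, as is common for SNMP interface counters) and that the counter wraps at most once between the two readings. The function name and parameters are illustrative.

```python
def throughput_bps(count_start: int, count_end: int,
                   interval_s: float, counter_bits: int = 64) -> float:
    """Estimate throughput in bits per second from two octet counter readings.

    Assumes a cumulative, unsigned counter that wraps at 2**counter_bits
    and wraps at most once between the two readings.
    """
    if interval_s <= 0:
        raise ValueError("interval_s must be positive")
    modulus = 1 << counter_bits
    # The modulo operation yields the correct difference even after a wrap.
    delta_octets = (count_end - count_start) % modulus
    return delta_octets * 8 / interval_s
```

If the polling interval is too long for a 32-bit counter on a fast interface, the counter may wrap more than once and the result will be too low. In that case a shorter interval or a 64-bit counter should be used.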

Diagnostics

For diagnostic purposes, all necessary commands, together with the outputs to be expected, for viewing the status of the entire system, the interfaces, and their configuration should be documented. Moreover, many commands offer a debug mode that outputs comprehensive status information.

The following information, amongst other things, is relevant when diagnosing errors:

status of the network interfaces and the other connections
status of the TCP and UDP network services
overview of the overall configuration
processes
routing table and routing protocols used
ARP table
users logged in
DNS and nslookup information
logging (use of the log levels, interpretation of the log information)
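
To illustrate the interpretation of log levels: where a device forwards its log messages via syslog, each message starts with a priority value that combines a facility and a severity (priority = facility × 8 + severity, with severity 0 being the most urgent and 7 the least). The following sketch decodes this value. Whether a given device uses this format is an assumption that must be checked against the manufacturer's documentation. The function names and the sample message are illustrative.

```python
import re
from typing import Optional, Tuple

# Syslog severity names, indexed by severity value 0 (most urgent) to 7.
SEVERITY_NAMES = [
    "emergency", "alert", "critical", "error",
    "warning", "notice", "informational", "debug",
]

_PRI_PREFIX = re.compile(r"^<(\d{1,3})>")


def decode_pri(pri: int) -> Tuple[int, str]:
    """Split a syslog priority value into (facility number, severity name)."""
    if not 0 <= pri <= 191:
        raise ValueError("priority value must be between 0 and 191")
    facility, severity = divmod(pri, 8)
    return facility, SEVERITY_NAMES[severity]


def severity_of(message: str) -> Optional[str]:
    """Return the severity name of a message with a '<PRI>' prefix, else None."""
    match = _PRI_PREFIX.match(message)
    if match is None:
        return None
    pri = int(match.group(1))
    if pri > 191:
        return None
    return decode_pri(pri)[1]
```

For example, a message beginning with `<165>` has facility 20 and severity "notice". An operating manual can use such a mapping to state which severities require immediate action.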

S 2.215 Error handling should be taken into consideration as a further safeguard.

Contingency planning to increase availability

Planning the procedure to follow when malfunctions occur can minimise the restoration time and, in some circumstances, may be the only way to make a solution possible at all. The planning must be coordinated with the overall malfunction and contingency planning and should be based on the general business continuity planning concept (see module B 1.3 Business continuity management). The general specifications for business continuity documents for the entire IT system are formulated there. Ideally, they specify uniform and binding requirements regarding the layout, contents, and form of the documents.

The following questions are relevant to contingency planning:

What are the monitoring requirements?
- Compilation of the information that must always be evaluated by the personnel responsible for the operation of the network components (see also the Logging section)
- How can the early detection of errors be guaranteed?
What are possible reasons for malfunctions?
- hardware defects
- inadequate dimensioning (failure when the load increases)
What safeguards can be taken?
- standby equipment
- spare parts
- implementation of failover solutions that make it possible to switch over to an alternative unit during live operation
- maintenance agreements
What Service Level Agreements (SLAs) are there or should be concluded?
- hardware suppliers (for example, on-site replacement with response time guarantee for certain components)
- internal service level requirements
How will diagnostics be performed?
- status queries
- display of configuration
- processes
- routing
- users logged on
- logging
What correction procedures must be performed?
- procedures in the event of failure of the complete system (restoration of operating system and configuration)
- procedure in the event of failure of sub-components, e.g. memory
Who must be informed in the event of damage?
- server and application administration
- hardware supplier / contact person for maintenance agreement
What documents must be available when damage occurs?
- configuration
- ACLs (rules)

The documentation must not be available in electronic form only. The instructions should be available in paper form at least. If necessary, configuration files can also be stored separately on CD-ROMs or other data media.
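
When configuration files are stored separately on data media, it is useful to be able to confirm before playback that the stored copy is unchanged. One possible approach, sketched here as an assumption rather than a requirement of this safeguard, is to record a SHA-256 checksum when the backup is created and to compare it later. The function names and the sample configuration line are hypothetical.

```python
import hashlib


def config_digest(config_text: str) -> str:
    """Return the SHA-256 checksum of a configuration as a hexadecimal string."""
    return hashlib.sha256(config_text.encode("utf-8")).hexdigest()


def backup_is_intact(config_text: str, recorded_digest: str) -> bool:
    """Compare a stored configuration with the checksum recorded at backup time."""
    return config_digest(config_text) == recorded_digest


# Hypothetical example: record the checksum when the backup is made ...
stored_config = "hostname example-router\n"
recorded = config_digest(stored_config)
# ... and verify it before the configuration is played back.
intact = backup_is_intact(stored_config, recorded)
```

The recorded checksum can be noted in the paper documentation so that it remains available if the electronic records are not.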

What is the recovery sequence?
- dependencies on other network components / areas of the IT network
- reinstallation of operating system and configuration
- playback of a backed up configuration
- scope for limited operation
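
The recovery sequence follows from the dependencies between the network components: a component can only be restored once the components it depends on are available again. The following sketch derives one valid sequence from a documented dependency list using Python's standard `graphlib` module (Python 3.9 or later). The component names and dependencies are hypothetical.

```python
from graphlib import TopologicalSorter
from typing import Dict, List, Set

# Each component maps to the set of components that must be restored before it.
# Names and dependencies are illustrative only.
dependencies: Dict[str, Set[str]] = {
    "core-switch": set(),
    "edge-router": {"core-switch"},
    "access-switch-1": {"core-switch"},
    "firewall": {"edge-router"},
}


def recovery_order(deps: Dict[str, Set[str]]) -> List[str]:
    """Return one valid recovery sequence in which prerequisites come first.

    Raises graphlib.CycleError if the documented dependencies are circular.
    """
    return list(TopologicalSorter(deps).static_order())


order = recovery_order(dependencies)
```

A circular dependency reported by this check indicates an error in the documentation or a design problem that should be resolved before an emergency occurs.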

Care must be taken when drawing up the procedure descriptions necessary for contingency planning, and the procedures must be tested regularly. In some cases, different procedures must be written for different types of devices and operating systems.

Probably the most important safeguard for increasing availability is to keep a reserve of spare parts so that downtime in the event of a hardware defect is minimised. As an alternative or in addition, service contracts can be concluded with the manufacturer that ensure availability through guaranteed response times or even guaranteed repair times. In this way, the costs for storage can be reduced or an even higher level of hardware availability can be attained. The supply of software updates can also be regulated within the framework of such a contract.
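
The effect of spare parts or guaranteed repair times on availability can be estimated with the commonly used steady-state formula: availability = MTBF / (MTBF + MTTR), where MTBF is the mean time between failures and MTTR the mean time to repair. The following sketch is an illustration under this assumption. The figures used are invented for the example and are not taken from this safeguard.

```python
HOURS_PER_YEAR = 8760  # 365 days, ignoring leap years


def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability from mean time between failures and mean time to repair."""
    if mtbf_hours <= 0 or mttr_hours < 0:
        raise ValueError("mtbf_hours must be positive and mttr_hours non-negative")
    return mtbf_hours / (mtbf_hours + mttr_hours)


def annual_downtime_hours(avail: float) -> float:
    """Expected downtime per year for a given availability."""
    return (1.0 - avail) * HOURS_PER_YEAR


# Illustrative comparison: the same device with a spare part on site
# (repair within 1 hour) and without one (repair within 48 hours).
with_spare = availability(mtbf_hours=999.0, mttr_hours=1.0)
without_spare = availability(mtbf_hours=999.0, mttr_hours=48.0)
```

With these example figures, a repair time of 1 hour gives an availability of 0.999, which corresponds to about 8.76 hours of downtime per year. A longer repair time lowers the availability accordingly.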

Review questions:

Have corresponding instructions been defined for diagnostics and troubleshooting on routers and switches in advance?
Have the administrative activities required during normal operations of the routers and switches been defined in an operating manual?
Have all commands required for diagnostics and the related indication of the status of the entire system, the interfaces, and their configuration been documented?
Is there a business continuity concept coordinated with the overall malfunction and contingency planning?
Has it been ensured that the contingency planning documentation and the instructions contained therein exist in paper form?
Are the procedures described in the contingency planning tested regularly?