S 2.498 Handling warnings and error messages

Initiation responsibility: Head of IT, IT Security Officer

Implementation responsibility: Administrator

Structured, comprehensible processes for handling warnings and error messages must be introduced, and the safeguards implemented must be documented.

These processes should describe who is responsible for processing the message (roles or persons) and how the information about the message is transmitted (e.g. email, SMS, generation of a trouble ticket).
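The assignment of responsibility and transmission channel described above can be written down as a simple lookup table. The following is only a minimal sketch: the role names, the channels chosen per message type, and the function name `route_message` are illustrative assumptions, not part of this safeguard.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Route:
    role: str     # who is responsible for processing the message
    channel: str  # how the information about the message is transmitted


# Hypothetical routing table; each organisation defines its own entries.
ROUTES = {
    "warning": Route(role="network administrator", channel="email"),
    "error": Route(role="on-call administrator", channel="trouble ticket"),
}


def route_message(message_type: str) -> Route:
    """Return the responsible role and channel for a message type."""
    try:
        return ROUTES[message_type]
    except KeyError:
        # A message type without a defined process is a gap in the concept.
        raise ValueError(f"no process defined for message type {message_type!r}")
```

Raising an error for an unknown message type, rather than silently dropping it, makes gaps in the documented process visible.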

If the organisation already has an alarm concept, the warnings and error messages of the network management must be embedded in it.

Possible events, causes, and reactions for warnings and error messages are listed below.

Warnings

A warning may be triggered by different events, for example:

The thresholds defined in the network concept and configured in the network management system are exceeded or undershot.
Offered services are not provided with the required quality.
Anomalies occur in the network traffic that are detected by the network management system.
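The first of these events, a measured value leaving its configured range, can be illustrated with a short check. This is a sketch under stated assumptions: the function name `check_threshold` and the example limits are hypothetical, and a real network management system performs this evaluation internally.

```python
from typing import Optional


def check_threshold(value: float, lower: float, upper: float) -> Optional[str]:
    """Return the kind of threshold violation, or None if the value is in range."""
    if value > upper:
        return "exceeded"
    if value < lower:
        return "undershot"
    return None


# Example: link utilisation in percent with assumed limits of 10 and 90.
violation = check_threshold(95.0, lower=10.0, upper=90.0)
if violation is not None:
    print(f"warning: utilisation threshold {violation}")
```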

Possible causes may include:

Newly introduced business processes require an unexpectedly high bandwidth.
Many employees use a popular internet offering, for example the live stream of a football World Cup match.
A computer in the internal network is infected with malware that tries to communicate using inadmissible ports.
The organisation is attacked from the outside.
Peer-to-peer (P2P) services are used without authorisation and overload the internet connection.
An attempt was made to connect an IT system to the internal network without authorisation.
Inadmissible protocols are used (e.g. remote desktop connection to a computer outside of the internal network).
Somebody tries different passwords in order to log in to an active network component in an unauthorised manner.

Depending on the cause of the warning, the persons in charge must initiate the corresponding safeguards:
If the warning is triggered because a threshold was exceeded or undershot, for example due to an intensively used internet offering, technical and/or organisational safeguards may be taken as a remedy. If it is foreseeable that the incident will not be repeated in this form, no reaction is required.
If a virus infection is suspected, the system should be checked for malware.
If it must be assumed that the problem will occur again, it must be clarified whether a modified configuration of active network components could be a remedy, for example by enabling or disabling services. If it is an already known problem, updates or patches made available by the manufacturer may be installed (see S 1.14 Patch and change management).
If thresholds are violated continuously, it must be considered whether the impairment of services within the network can be counteracted by additional or more powerful hardware. Such a measure could include the migration from Fast Ethernet to a connection with higher transmission rates in certain sections of the network. Once the safeguards are completed, the network plan must be updated.
If the warning cannot be eliminated by individual safeguards, it may be necessary to consider a change to the topology and to initiate a network design process (see S 4.1 Heterogeneous networks). For example, services could be designed redundantly or an advanced network topology may be necessary.
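The reactions above depend on whether a threshold violation is a one-off incident or a continuous one. One way to make that distinction explicit is to look at the longest run of consecutive violations in a series of measurements. The sketch below assumes a hypothetical function `classify_breaches` and an arbitrary default of three consecutive samples; the appropriate value depends on the measurement interval and must be chosen by the organisation.

```python
from typing import Iterable


def classify_breaches(samples: Iterable[float], upper: float,
                      min_consecutive: int = 3) -> str:
    """Classify upper-threshold violations as 'none', 'isolated' or 'continuous'."""
    longest = 0  # longest run of consecutive violations seen so far
    run = 0      # length of the current run
    for value in samples:
        run = run + 1 if value > upper else 0
        longest = max(longest, run)
    if longest == 0:
        return "none"
    return "continuous" if longest >= min_consecutive else "isolated"
```

An "isolated" result corresponds to the case in which no reaction may be required, while a "continuous" result suggests reviewing the configuration, the hardware, or ultimately the topology.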

Error messages

Error messages always indicate the failure of an active network component or of a service monitored by network management. A failure may occur with or without outside influence. Typical failures include:

Failure of IT systems offering services in the network (e.g. email servers).
Failure of active network components (e.g. a port at the switch is faulty).
Failure of passive network components (e.g. a cable was accidentally damaged during conversion work).

An error may have many different causes, including:

Environmental factors such as heat or water cause a hardware fault, or the technical infrastructure fails, e.g. due to a power failure.
A software vulnerability causes an IT system or an active network component to crash.
An IT system was attacked successfully from the inside or the outside and fails.
Production systems were disrupted while security facilities were being tested. For example, the emergency power supply does not start during a test.

Finding the cause is very important. The objective must be to avoid such errors in the future or to at least remedy such errors as quickly as possible if they do occur again. If several unfavourable circumstances come together, it is difficult to find the causes and their interactions. In order to remedy an error, the following safeguards may be successful, for example:

A faulty hardware component is replaced or crashed software is re-installed.
It may also be possible or desirable to repair a faulty hardware component.
If there is a standby system for a failed IT system (cold or hot standby), this system is used instead of the faulty IT system.

It must be the primary objective to remedy occurring errors. Nevertheless, it is also important to learn how such errors can be avoided in the future. The analysis of the error and the initiated measures should be documented.
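Documenting the error analysis and the measures taken is easier when every report follows the same structure. The following sketch shows one possible record format; the class name `ErrorReport`, its fields, and the sample values are assumptions chosen for illustration, and the safeguard does not prescribe any particular format.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass
class ErrorReport:
    component: str                 # failed component or service
    cause: str                     # result of the error analysis
    measures: List[str] = field(default_factory=list)            # measures initiated
    preventive_actions: List[str] = field(default_factory=list)  # how to avoid recurrence


def to_json(report: ErrorReport) -> str:
    """Serialise a report so it can be stored with the other documentation."""
    return json.dumps(asdict(report), indent=2, sort_keys=True)


# Hypothetical example entry.
report = ErrorReport(
    component="access switch, port 12",
    cause="faulty port",
    measures=["client moved to spare port", "switch replacement ordered"],
)
print(to_json(report))
```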

Review questions:

Were comprehensible processes introduced and the implemented safeguards documented for handling warnings and error messages?
Were the warnings and error messages of the network management integrated into an already existing alarm concept?