S 1.52 Redundancy, modularity, and scalability in the technical infrastructure

Initiation responsibility: Top Management

Implementation responsibility: Planner

The most effective way to ensure the availability of technical equipment is redundancy. Redundancy means having more of something than is actually necessary to perform a given task (from Latin: "redundare", to overflow; be in excess). In the IT industry, redundancy means the existence of functionally equivalent or comparable resources in a technical system. The main problem with redundancy is therefore immediately clear: redundancy can only be achieved by creating overcapacity.

A system is referred to as modular when a required technical output is generated by several small units rather than by one large unit. Clever application of modularity may significantly reduce the overcapacity needed to achieve redundancy.

Even the most far-sighted planning cannot rule out the need to adapt the existing technical systems to changed, usually higher, performance requirements after some time. The easier it is to expand a system simply by adding units, the more scalable the system is. Modularity can also have a positive effect on scalability.

The simplest form of redundancy is N+1 redundancy. In this case, the number of units needed (N, usually N=1) is complemented by one additional unit. If one of the units needed fails, the additional unit takes over its function. This redundancy offers adequate protection against malfunctions in the technical equipment itself. N+1 redundancy is therefore also referred to as operational redundancy.

Figure 1: N+1 redundancy with N=1

However, if one of the two units is being serviced and therefore not ready for operation, there is no redundancy during this time. Furthermore, an overcapacity of 100% is necessary in this model to achieve simple operational redundancy.

If redundancy also needs to be guaranteed when performing maintenance, N+2 redundancy must be established. In this case, two additional systems are installed as backups for the active system (N=1).

Figure 2: N+2 redundancy with N=1

One redundant system will always be available even when one of the three systems is unavailable because it is being serviced. However, an overcapacity of 200% is necessary for this. For this reason, such solutions quickly reach spatial and financial limits.

Modularity is a valuable aid in this case. For example, if N=2 is used instead of N=1, so that each unit provides 50% of the total capacity required, N+2 redundancy becomes clearly more favourable.

Figure 3: N+2 redundancy with N=2

The overcapacity required is reduced from 200% to 100% while providing the same level of redundancy (operational and maintenance redundancy). If the level of modularity is increased to N=4, for example, the result is even more favourable:

Figure 4: N+2 redundancy with N=4

There are 4 units available to cover the base load, each of which provides 25% of the total capacity required. Another two units, each of which provides 25% as before, are added to achieve operational and maintenance redundancy. Only 50% overcapacity is required in this case.

The higher the value of N, the lower the level of overcapacity required. However, it should be clear that this cannot be done ad infinitum. Increasing the level of modularity lowers the costs for the required overcapacity. However, the costs for installing and operating the units increase at the same time. All units (the last example already uses 6 units) must be installed and supplied with power in such a way that a single external event will not affect all units at the same time.
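The relationship between N and the required overcapacity can be illustrated with a short calculation. The following is a minimal sketch, not part of the safeguard itself: the function name `overcapacity_percent` and its parameters are hypothetical, and it assumes N identical units that each provide 1/N of the required capacity.

```python
from fractions import Fraction


def overcapacity_percent(n: int, spares: int = 2) -> Fraction:
    """Overcapacity of an N+spares design, in percent of the required capacity.

    Assumption: n identical units each supply 1/n of the required capacity,
    and `spares` additional units of the same size are installed.
    """
    if n < 1 or spares < 0:
        raise ValueError("n must be >= 1 and spares must be >= 0")
    # Each spare adds 1/n of the required capacity, i.e. 100/n percent.
    return Fraction(100 * spares, n)


# The three N+2 cases discussed above (N=1, N=2, N=4).
for n in (1, 2, 4):
    print(f"N={n}: {n + 2} units installed, "
          f"{float(overcapacity_percent(n)):.0f}% overcapacity")
```

For N=1, 2, and 4 this reproduces the figures of 200%, 100%, and 50% given above. The calculation covers capacity only; it does not reflect the rising installation and operating costs per unit.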

Modularity also automatically offers the advantage of scalability. As soon as the performance requirements are raised, another small unit (25%) can be added to the 4 active units. With N=1, it would be necessary to duplicate the first system to maintain the current level of redundancy.

Figure 5: Easy scalability
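The scalability argument can be sketched in the same way. The example below is illustrative only: the function names are hypothetical, capacities are expressed as whole percentages of the original requirement, and a demand increase to 125% is assumed purely for the purpose of the comparison.

```python
def active_units(demand_percent: int, unit_percent: int) -> int:
    """Number of units needed to cover the demand (rounded up)."""
    if demand_percent < 0 or unit_percent <= 0:
        raise ValueError("demand must be >= 0 and unit size must be > 0")
    # Integer ceiling division: -(-a // b) rounds up without using floats.
    return -(-demand_percent // unit_percent)


def installed_percent(demand_percent: int, unit_percent: int,
                      spares: int = 2) -> int:
    """Total installed capacity of an N+spares design for the given demand."""
    return (active_units(demand_percent, unit_percent) + spares) * unit_percent


# Assumed scenario: demand rises from 100% to 125% of the original requirement.
for unit_percent in (25, 100):
    n = active_units(125, unit_percent)
    total = installed_percent(125, unit_percent)
    print(f"unit size {unit_percent}%: N={n}, installed capacity {total}%")
```

Under these assumptions, the modular design grows from 6 to 7 small units, whereas the design based on 100% units needs a second full-size active unit in addition to its two spares.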

Modularity also has the additional advantage that the capacity still available after more than 2 units fail is higher.

N+2 redundancy guarantees that when two units fail, the remaining capacity (100%) will be sufficient to continue normal operations. In the case of N+2 redundancy with N=1, the capacity remaining after the failure of a third unit is zero. In contrast, when N+2 redundancy is used with N=4 and 3 of the 6 units present fail, 75% of the capacity is still available. With appropriate load management, it is then possible to maintain largely trouble-free operation.

Figure 6: Diagram of the remaining capacity as the value of N increases when using N+2 redundancy
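The remaining capacity after unit failures can be calculated as follows. This is a minimal sketch under the same assumption of identical units that each provide 1/N of the required capacity; the function name `remaining_capacity_percent` is hypothetical.

```python
def remaining_capacity_percent(n: int, failed: int, spares: int = 2) -> float:
    """Capacity still installed and working, in percent of the required capacity.

    Assumption: n + spares identical units, each supplying 100/n percent.
    The result can exceed 100 while spare units are still intact.
    """
    total_units = n + spares
    if n < 1 or spares < 0 or not 0 <= failed <= total_units:
        raise ValueError("invalid combination of n, spares and failed")
    working_units = total_units - failed
    return 100 * working_units / n


# Compare N=1 and N=4 (both N+2) as more and more units fail.
for n in (1, 4):
    for failed in (2, 3):
        pct = remaining_capacity_percent(n, failed)
        print(f"N={n}, {failed} units failed: {pct:.0f}% remaining")
```

With two failed units both designs still deliver 100%. With three failed units the N=1 design drops to 0%, while the N=4 design retains 75%, matching the figures in the text.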

Since the resources available are usually limited, it is not always possible to actually install 2 additional units to achieve operational and maintenance redundancy. Because maintenance can generally be planned well in advance, the second additional unit could instead be provided as a mobile unit and connected temporarily when needed.

Such a mobile unit can be held in reserve in the organisation itself or can be leased from an external service provider. In this case, corresponding SLAs must be drawn up with the service provider and the connection points needed must be prepared accordingly.

Examples:

Adequate redundancy should be provided for air conditioning systems. If 6 units of a given component are needed, 7 units should be purchased. This also enables the organisation to handle peak loads, for example on hot summer days, and to maintain the overall availability of the air conditioning system if a unit fails or is undergoing maintenance.
It should also be examined which areas require redundant communication connections (see also S 6.18 Provision of redundant lines). This is especially important when central network nodes or central active components are located in unmonitored areas.
The power supply for a computer centre should be designed redundantly. Recommendations regarding this can be found in S 1.56 Emergency power system. If the secondary power supply is not located in an adjacent fire zone, consideration should be given to installing redundant power supply cables.

Review questions:

Was a load determination performed to ensure that IT operations will remain unaffected by the failure of redundant systems even under unfavourable conditions?