S 2.314 Use of high-availability architectures for servers

Initiation responsibility: Head of IT, IT Security Officer

Implementation responsibility: IT Security Officer, Administrator

The availability of business processes, applications, and services often depends on a central server functioning correctly. The more applications run on a server, the more reliable that server must be. A server normally contains several single points of failure, i.e. components whose failure may cause the overall system to fail: CPU, hard disks, power supplies, fans, backplane, etc. In this case, restoring the overall system may take a considerable amount of time. In addition to keeping spare parts available, the following options can be used to increase availability:

cold standby
hot standby (manual switchover)
cluster (automatic switchover)
- load-balanced cluster
- failover cluster

Each of these techniques offers a different level of availability and normally incurs different costs.
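
To illustrate why single points of failure matter, the following minimal sketch estimates the availability of a server whose components must all work, and of a pair of such servers of which one suffices. It assumes independent failures and an ideal switchover. The component availabilities are hypothetical illustrative values, not measurements, and the function names are chosen for this example only.

```python
from math import prod


def series_availability(component_availabilities):
    """All components must work: multiply the availabilities.

    Assumes that component failures are independent of each other.
    """
    return prod(component_availabilities)


def redundant_availability(unit_availability, units=2):
    """At least one of `units` identical units must work.

    Assumes independent failures and an ideal (instant, error-free) switchover.
    """
    return 1 - (1 - unit_availability) ** units


# Hypothetical values for CPU, hard disk, power supply, fan, backplane.
components = [0.999, 0.995, 0.998, 0.999, 0.9995]
single_server = series_availability(components)
server_pair = redundant_availability(single_server)

hours_per_year = 24 * 365
for label, value in (("single server", single_server), ("redundant pair", server_pair)):
    downtime = (1 - value) * hours_per_year
    print(f"{label}: availability {value:.4%}, about {downtime:.2f} h downtime per year")
```

Real standby and cluster solutions fall short of the ideal pair modelled here, because switchover takes time and adds components that can fail themselves.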

Cold standby

For cold standby, a replacement system of identical construction is kept available alongside the productive system but is not active. Should the primary system fail, the replacement system can be booted and integrated into the network manually.

Along with the provision of individual spare parts, this is the simplest redundancy solution, with the corresponding advantages and disadvantages:

Advantages of a cold standby solution

Cold standby solutions do not increase the complexity for the overall system.
The costs incurred by a cold standby system only amount to the costs for the additional hardware and therefore are lowest amongst the presented options.
New installation of or changes to the system can be performed without any losses in availability. For this, productive operation is switched to the cold standby system during the changes.

Disadvantages of a cold standby solution

A secondary system must be provided in addition to the primary system.
The replacement system must constantly be provided with the latest configuration and patch status.
Since the replacement system requires manual activation, administrators must continuously monitor the system and intervene in an emergency.
If the application data is not stored to an external storage system so that access directly from the replacement system is possible, the data must be migrated to the cold standby system.

Table: Advantages and disadvantages of a cold standby solution

These solutions are well suited for servers running applications for which short or limited downtimes until an administrator intervenes are not critical. Examples include:

servers in smaller networks (intranet)
sparsely frequented servers on the internet

Hot standby (manual switchover)

For hot standby, a replacement system is also provided, but it is kept in operation in parallel with the productive system. The functionality of the productive system is monitored and the replacement system is activated in the event of a failure. Switchover may be manual or automatic. Automatic switchover requires additional functionality in the overall system, e.g. automatic detection of failures. This case is addressed in the section "Cluster" below.

In order to ensure that the downtimes are as short as possible, the condition of the replacement system must be checked continuously.
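
The continuous check of the replacement system could be sketched as follows. This is a minimal illustration that only tests whether the services on the standby system accept TCP connections. It says nothing about configuration or patch status. The function names are hypothetical and chosen for this example.

```python
import socket


def is_reachable(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution errors.
        return False


def check_standby(targets, probe=is_reachable):
    """Return the (host, port) pairs of the standby system that failed the probe."""
    return [(host, port) for host, port in targets if not probe(host, port)]
```

A scheduled job could call `check_standby` with the services the replacement system has to provide, for example `check_standby([("standby.example.internal", 443)])` with an assumed host name, and alert the administrator if the returned list is not empty.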

Advantages of a hot standby solution

The downtimes are shorter when compared to a cold standby solution.
Just like cold standby, this solution is relatively cheap when compared to the higher quality availability solutions described in the following.
The replacement system is operating and may also be used for data replication.
New installation of or changes to the system can be performed without any loss of availability. For this, productive operation is switched to the hot standby system during the changes.

Disadvantages of a hot standby solution

Only half of the existing hardware is used at all times.
The replacement system must be kept up to date constantly.
If the hot standby system is activated manually, continuous monitoring by a person in charge of the system is required.

Table: Advantages and disadvantages of a hot standby solution

Hot standby systems are suitable for applications for which short downtimes are not critical. If the hot standby server is activated manually, the need for continuous system monitoring must be taken into consideration. Possible fields of application include, for example:

web servers with frequently varying content
servers in small networks (application servers, mail servers)
database servers and file servers (e.g. a secondary server continuously replicates a primary server and becomes the primary server in the event of a failure).

Cluster (automatic switchover)

A cluster consists of a group of two or more computers operated in parallel in order to increase the availability or the performance of an application or a service. The application or service may run actively on one of the computers or be distributed across several computers (performance enhancement).

Depending on the mode of operation, clusters are divided into

load-balanced clusters and
failover clusters.

Load-balanced clusters

For load-balanced clusters, instances of an application or service are distributed amongst the servers depending on their utilisation. Where an application or service allows this, it can be used not only to achieve load balancing and thus enhance performance, but also to reduce the problems caused by failures.

One of the prerequisites for using load balancing is that the respective applications or services must not require write access to data.

In this case, redundancy can be provided by operating systems of similar performance "next to each other" with the help of a load balancing process and by ensuring that the other servers compensate for the failure of one server.
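
How the remaining servers compensate for a failed one can be illustrated with a minimal sketch. It assumes a simple round-robin distribution rather than distribution by utilisation, and it assumes that some external monitoring reports failed servers. The class and server names are hypothetical.

```python
from itertools import cycle


class RoundRobinBalancer:
    """Distribute requests across servers in turn, skipping servers marked as failed."""

    def __init__(self, servers):
        self.servers = list(servers)
        self.healthy = set(self.servers)
        self._ring = cycle(self.servers)

    def mark_failed(self, server):
        """Take a server out of rotation (reported by external monitoring)."""
        self.healthy.discard(server)

    def mark_recovered(self, server):
        """Put a known server back into rotation."""
        if server in self.servers:
            self.healthy.add(server)

    def next_server(self):
        """Return the next healthy server, or raise if none is left."""
        for _ in range(len(self.servers)):
            candidate = next(self._ring)
            if candidate in self.healthy:
                return candidate
        raise RuntimeError("no healthy server available")


balancer = RoundRobinBalancer(["web1", "web2", "web3"])
balancer.mark_failed("web2")
# Requests are now served by web1 and web3 only.
print([balancer.next_server() for _ in range(4)])
```

The sketch keeps no session state, which matches the prerequisite that the balanced services do not require write access to data.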

Advantages of a load-balanced cluster

Both the availability and the performance can be increased using load-balanced clusters.
All available resources are used permanently.
The solution is highly scalable.
The complexity of the overall system is lower when compared to a failover cluster.

Disadvantages of a load-balanced cluster

This cluster cannot be used for all kinds of applications. In particular, applications that do not exclusively use read access and that require all servers to access the same storage resources simultaneously are not suitable for load balancing.

Table: Advantages and disadvantages of a load-balanced cluster

If, along with the availability, performance is important and if the application allows for distributed use, a load-balanced cluster is the ideal solution. This may be the case for the following, for example:

web servers, frontend applications with exclusive read accesses (e.g. web server farms)

Failover clusters

In this document, the term failover cluster refers to a cluster where active operation of the application or service is taken over automatically by another part of the cluster if one of the cluster systems fails. The term failover refers to the automatic takeover of services by a functionally equivalent component when a system component fails. For the failover function, a dedicated heartbeat connection is usually used to ensure communication between the cluster servers. In addition to the connection to the client network, the cluster servers must also have a dedicated connection to the administration network in order to provide for direct access in the event of an emergency.

Automatic failover assumes that all software and hardware components are monitored appropriately. Therefore, it is important to ensure that the failover mechanism is not based on any incorrect assumptions.
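
The principle of heartbeat monitoring can be sketched as follows. This is a minimal illustration only: a node counts as failed if no heartbeat has been recorded for longer than an assumed timeout. The class name, the timeout value, and the node names are hypothetical, and the decision logic of real failover software is considerably more involved.

```python
import time


class HeartbeatMonitor:
    """Track the last heartbeat per node and report nodes whose heartbeat is overdue."""

    def __init__(self, nodes, timeout_seconds=3.0, clock=time.monotonic):
        self.timeout_seconds = timeout_seconds
        self._clock = clock
        now = clock()
        # Every node is treated as alive at start-up.
        self._last_seen = {node: now for node in nodes}

    def record_heartbeat(self, node):
        """Call this whenever a heartbeat message from `node` arrives."""
        self._last_seen[node] = self._clock()

    def failed_nodes(self):
        """Return the nodes that have been silent for longer than the timeout."""
        now = self._clock()
        return sorted(
            node
            for node, seen in self._last_seen.items()
            if now - seen > self.timeout_seconds
        )


monitor = HeartbeatMonitor(["node-a", "node-b"], timeout_seconds=3.0)
monitor.record_heartbeat("node-a")
print(monitor.failed_nodes())  # empty directly after start-up
```

A silent heartbeat connection does not prove that the other server has failed, since the connection itself may be down. This is one example of an incorrect assumption that a failover mechanism must not rely on.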

The following items must be taken into consideration when using a failover cluster:

Access to shared storage:

In addition to the servers' own hard disks containing the operating system and the data required for operation, it is advisable to keep the application data on shared storage in a cluster. The currently active part of the cluster is given access to these hard disks. It is also possible to use replicated hard disks instead of shared hard disks. This makes sense if the failover is performed from a remote location. For local failover, it should be considered whether the complexity created by replication and the resulting dependencies constitute an additional threat to availability.

Portability of the application:

Installing and commissioning an application in parallel on two or more servers requires additional licenses in most cases. Furthermore, it must be checked whether the application supports failover.

NSPoF (No Single Points of Failure):

If the failover functionality of the cluster can be affected adversely by the failure of a single component, this contradicts the actual purpose of the cluster architecture. In order to avoid single points of failure, the overall system must be analysed and the failure of individual components (power adapters, system memories, main memories, network cards, switches, hubs, etc.) must be taken into account.

Operating system and configuration of the cluster servers:

The cluster servers should be equipped with identical operating system versions, patches, libraries, and application versions. If the hardware and software configurations are as identical as possible, the behaviour in the event of a failover will also be as identical as possible. Identical systems therefore reduce the complexity of the overall system (use of the same failover software, network interfaces, compatibility of the shared storage system, administration, service).

Dedicated and redundant connection between the servers:

Communication between the cluster servers must be as instantaneous as possible regardless of the network load so that the failover can be performed as quickly as possible. Redundancy is also required due to the high availability requirements.

Use of sophisticated software products for failover management:

The decision as to whether or not failover must be performed is very complex. New and self-developed tools may contain errors and thereby ultimately reduce the availability of the overall system.

Comprehensive testing of all possible failover aspects:

Amongst other things, comprehensive tests are required in order to determine that no unexpected single points of failure are present. In particular, server monitoring and failover management must be checked for any possible errors.
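
The requirement that the cluster servers carry identical operating system versions, patches, libraries, and application versions lends itself to an automated comparison. The following minimal sketch assumes that the version information has already been collected per server into a dictionary. The function name, the node names, and the version strings are hypothetical.

```python
def configuration_drift(node_configs):
    """Return {item: {node: value}} for every item that is not identical on all nodes.

    `node_configs` maps a node name to a dictionary of configuration items,
    e.g. {"os": "version string", "app": "version string"}.
    An item missing on a node is reported with the value None.
    """
    all_items = set()
    for config in node_configs.values():
        all_items.update(config)

    drift = {}
    for item in sorted(all_items):
        values = {node: config.get(item) for node, config in node_configs.items()}
        if len(set(values.values())) > 1:
            drift[item] = values
    return drift


# Hypothetical inventory of two cluster servers.
inventory = {
    "node-a": {"os": "10.2", "patch_level": "2024-05", "app": "3.1.4"},
    "node-b": {"os": "10.2", "patch_level": "2024-03", "app": "3.1.4"},
}
print(configuration_drift(inventory))
```

An empty result means that the compared items are identical. It does not replace the failover tests described above.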

Advantages of a failover cluster

The availability may be increased significantly by automatic takeover.
No manual interventions are required.

Disadvantages of a failover cluster

This solution is highly complex.
Failover clusters are difficult to scale.
The resources are always only partially utilised.
Additional hardware and software incur high costs.

Table: Advantages and disadvantages of a failover cluster

As the comparison of the advantages and disadvantages shows, using a failover cluster only makes sense if one or several applications have very high availability requirements. In addition to the high expenditure, the personnel responsible must have very good knowledge of the operating systems and applications used and of the failover functionality. Furthermore, using failover solutions for servers only makes sense if all dependencies, such as the network connection or the availability of the client, are also designed with the corresponding redundancy.

Areas where failover clusters are typically used in the event of high availability requirements include, for example:

database applications
file storage
dynamic content applications
email servers

If business processes, applications, or services are characterised by high availability requirements, it must be considered how these requirements may be met. The persons responsible for IT and the security management team should draw up a concept and select appropriate architectures for the corresponding servers.

Review questions:

Does the selected server architecture take the availability requirements into consideration?