S 2.314 Use of high-availability architectures for servers
Initiation responsibility: Head of IT, IT Security Officer
Implementation responsibility: IT Security Officer, Administrator
The availability of business processes, applications, and services often depends on a central server functioning properly. The more applications run on a server, the more reliable this server must be. A server normally contains several potential single points of failure, i.e. components whose failure may cause the overall system to fail: CPU, hard disks, power supplies, fans, backplane, etc. In this case, restoring the overall system may take a considerable amount of time. In addition to keeping spare parts available, the following options may be used to increase availability:
- cold standby
- hot standby (manual switchover)
- cluster (automatic switchover)
  - load-balanced cluster
  - failover cluster
Every single one of these techniques offers a different level of availability and is normally related to different costs.
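The effect of single points of failure and of redundancy on availability can be illustrated with a short calculation. The following is a minimal sketch using the common steady-state formula MTBF / (MTBF + MTTR); the component figures are purely illustrative assumptions, and the model assumes independent failures and ignores the switchover time, which is exactly where cold standby, hot standby, and clusters differ.

```python
from math import prod


def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability of one component: MTBF / (MTBF + MTTR)."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


def series(availabilities) -> float:
    """Every component is required, so each one is a single point of failure."""
    return prod(availabilities)


def parallel(availabilities) -> float:
    """Redundant units: the service is up while at least one unit is up."""
    return 1.0 - prod(1.0 - a for a in availabilities)


# Illustrative figures only (assumptions, not measured values).
components = {
    "power supply": availability(100_000, 8),
    "hard disk": availability(50_000, 8),
    "fan": availability(80_000, 4),
    "backplane": availability(200_000, 24),
}

single_server = series(components.values())
redundant_pair = parallel([single_server, single_server])

print(f"single server : {single_server:.6f}")
print(f"redundant pair: {redundant_pair:.6f}")
```

In practice, the time needed to detect the failure and to switch over would have to be added to the downtime, so the real gain depends on which of the techniques listed above is used.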
Cold standby
For cold standby, a second replacement system of identical construction is kept alongside the actual productive system, but it is not active. Should the primary system fail, the replacement system can be booted and integrated into the network manually.
Apart from keeping individual spare parts available, this is the simplest redundancy solution, with the corresponding advantages and disadvantages:
Advantages of a cold standby solution | Disadvantages of a cold standby solution |
---|---|
|
|
Table: Advantages and disadvantages of a cold standby solution
These solutions are well suited for servers running applications for which short or limited downtimes until an administrator intervenes are not critical. Examples include:
- servers in smaller networks (intranet)
- sparsely frequented servers on the internet
Hot standby (manual switchover)
For hot standby, a replacement system must also be provided; unlike cold standby, however, it is kept in operation in parallel with the productive system. The functionality of the productive system is monitored, and the replacement system is activated in the event of a failure. Switchover may be manual or automatic. For automatic switchover, the overall system must provide additional functionality, e.g. automatic detection of failures. This case is addressed in the "Cluster" section below.
In order to ensure that the downtimes are as short as possible, the condition of the replacement system must be checked continuously.
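Monitoring both systems in this way could be sketched as follows. This is a minimal illustration, not a complete monitoring solution: it assumes that an open TCP port is an adequate sign of health, and the function names, host names, and status texts are hypothetical. The decision to switch over remains with the administrator, as in the manual variant described above.

```python
import socket
from typing import Callable

# A probe takes (host, port) and reports whether the system responds.
Probe = Callable[[str, int], bool]


def tcp_probe(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def check_pair(primary: str, standby: str, port: int,
               probe: Probe = tcp_probe) -> str:
    """Check both systems and return a status message for the administrator."""
    primary_ok = probe(primary, port)
    standby_ok = probe(standby, port)
    if primary_ok and standby_ok:
        return "ok"
    if primary_ok:
        # The standby must be repaired before it is needed.
        return "standby unavailable: repair replacement system"
    if standby_ok:
        return "primary failed: switch over to replacement system"
    return "both systems unavailable"
```

A call such as `check_pair("srv-primary.example", "srv-standby.example", 443)` would be run periodically, for example by a scheduler; the probe is passed in as a parameter so that a more meaningful application-level check can replace the simple TCP test.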
Advantages of a hot standby solution | Disadvantages of a hot standby solution |
---|---|
|
|
Table: Advantages and disadvantages of a hot standby solution.
Hot standby systems are suitable for applications for which short downtimes are not critical. How the system is monitored while the hot standby server is being activated must be taken into consideration here. Possible fields of application include, for example:
- web servers with frequently varying content
- servers in small networks (application servers, mail servers)
- database servers and file servers (e.g. a secondary server continuously replicates a primary server and becomes the primary server in the event of a failure).
Cluster (automatic switchover)
A cluster consists of a group of two or more computers operated in parallel in order to increase the availability or the performance of an application or a service. The application or service may run actively on one of the computers or be distributed across several computers (performance enhancement).
Depending on the mode of operation, a distinction is made between
- load-balanced clusters and
- failover clusters.
Load-balanced clusters
For load-balanced clusters, instances of an application or of a service are distributed amongst the servers depending on the utilisation. If an application or a service allows this, it can be used not only to achieve load balancing and therefore performance enhancement, but also to reduce the problems occurring during failures.
One of the prerequisites for using load balancing is that the respective applications or services must not require write access to data.
In this case, redundancy may be provided by installing systems with similar performance "next to each other" with the help of a load balancing process and by guaranteeing that the other servers will compensate for the failure of one server.
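The idea of distributing requests by utilisation while compensating for a failed server could be sketched as follows. This is a simplified illustration under stated assumptions: utilisation is approximated by the number of requests currently assigned to each server, and the class and server names are hypothetical rather than taken from any specific load balancing product.

```python
class LeastLoadedBalancer:
    """Assign each request to the healthy server with the fewest active requests."""

    def __init__(self, servers):
        self.active = {s: 0 for s in servers}  # active requests per server
        self.healthy = set(self.active)

    def mark_down(self, server):
        """Exclude a failed server from further assignments."""
        self.healthy.discard(server)

    def mark_up(self, server):
        """Re-admit a repaired server that belongs to this cluster."""
        if server in self.active:
            self.healthy.add(server)

    def acquire(self):
        """Pick a server for a new request and count the request against it."""
        candidates = [s for s in self.active if s in self.healthy]
        if not candidates:
            raise RuntimeError("no healthy server available")
        server = min(candidates, key=lambda s: self.active[s])
        self.active[server] += 1
        return server

    def release(self, server):
        """Signal that a request on the given server has finished."""
        self.active[server] = max(0, self.active[server] - 1)


balancer = LeastLoadedBalancer(["web1", "web2", "web3"])
balancer.mark_down("web2")  # simulate the failure of one server
assigned = [balancer.acquire() for _ in range(4)]
print(assigned)  # requests are spread over the remaining servers only
```

Because the remaining servers take over the load of the failed one, they need enough spare capacity for this case; the sketch does not model capacity limits.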
Advantages of a load-balanced cluster | Disadvantages of a load-balanced cluster |
---|---|
|
|
Table: Advantages and disadvantages of a load-balanced cluster
If, along with the availability, performance is important and if the application allows for distributed use, a load-balanced cluster is the ideal solution. This may be the case for the following, for example:
- web servers, frontend applications with exclusively read access (e.g. web server farms)
Failover clusters
In this document, the term failover cluster refers to a cluster where active operation of the application or service is taken over automatically by another part of the cluster if one of the cluster systems fails. The term failover refers to the automatic takeover of services by a functionally equivalent component when a system component fails. For the failover function, a dedicated heartbeat connection is normally used to ensure communication between the cluster servers. In addition to the connection to the client network, the cluster servers must also have a dedicated connection to the administration network in order to allow direct access in the event of an emergency.
Automatic failover assumes that all software and hardware components are monitored appropriately. Therefore, it is important to ensure that the failover mechanism is not based on any incorrect assumptions.
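A heartbeat-based failure detector of the kind mentioned above could be sketched as follows. This is a minimal illustration, assuming that a peer counts as failed once it has been silent for a configurable number of heartbeat intervals; the class name, parameters, and default values are hypothetical. A missing heartbeat may also be caused by a failed heartbeat link rather than a failed server, which is one example of an assumption the failover mechanism must not rely on blindly.

```python
import time


class HeartbeatMonitor:
    """Track the last heartbeat of each peer and report peers that fell silent."""

    def __init__(self, peers, interval=1.0, max_missed=3, clock=time.monotonic):
        # A peer is considered failed after max_missed intervals without a beat.
        self.timeout = interval * max_missed
        self.clock = clock
        now = clock()
        self.last_seen = {peer: now for peer in peers}

    def beat(self, peer):
        """Record a heartbeat received from a peer."""
        self.last_seen[peer] = self.clock()

    def failed_peers(self):
        """Return the peers whose last heartbeat is older than the timeout."""
        now = self.clock()
        return sorted(p for p, seen in self.last_seen.items()
                      if now - seen > self.timeout)
```

The clock is passed in as a parameter so that the timeout logic can be tested without waiting; in operation, `beat` would be called whenever a heartbeat message arrives over the dedicated connection, and `failed_peers` would feed the failover decision.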
The following items must be taken into consideration when using a failover cluster:
- Access to shared storage:
  Along with the server's own hard disks containing the operating system and the data required for operations, it is advisable to manage the application data on shared storage in a cluster. The part of the cluster currently active is provided with access to these hard disks. It is also possible to use replicated hard disks instead of shared hard disks. This makes sense if the failover is performed from a remote location. For local failover, it should be considered whether the complexity created by replication and the resulting dependencies constitute an additional threat to availability.
- Portability of the application:
  Installing and commissioning an application in parallel on two or more servers requires additional licenses in most cases. Furthermore, it must be checked whether the application allows for failover functionality.
- NSPoF (No Single Points of Failure):
  If the failover functionality of the cluster can be adversely affected by the failure of a single component, this contradicts the actual purpose of the cluster architecture. In order to avoid single points of failure, the overall system must be analysed and the failure of individual components (power adapters, system memories, main memories, network cards, switches, hubs, etc.) must be taken into account.
- Operating system and configuration of the cluster servers:
  The cluster servers should be equipped with identical operating system versions, patches, libraries, and application versions. If the hardware and software configurations are as identical as possible, the behaviour in the event of a failover will also be as identical as possible. Identical systems therefore reduce the complexity of the overall system (use of the same failover software, network interfaces, compatibility of the shared storage system, administration, service).
- Dedicated and redundant connection between the servers:
  Communication between the cluster servers must be as instantaneous as possible regardless of the network load so that the failover can be performed as quickly as possible. Redundancy is also required due to the high availability requirements.
- Use of sophisticated software products for failover management:
  The decision as to whether or not failover must be performed is very complex. New and self-developed tools may contain errors and thereby ultimately reduce the availability of the overall system.
- Comprehensive testing of all possible failover aspects:
  Amongst other things, comprehensive tests are required in order to determine that no unexpected single points of failure are present. In particular, server monitoring and failover management must be checked for any possible errors.
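The analysis for single points of failure can be supported by a simple model. The following sketch assumes that the service is reachable over several alternative paths, each described as the set of components it depends on; any component that appears in every path is then a single point of failure. The function name and the component names are hypothetical, and such a model only complements the practical tests described above, it does not replace them.

```python
from typing import Iterable, Set


def single_points_of_failure(paths: Iterable[Iterable[str]]) -> Set[str]:
    """Return the components that every path depends on.

    Each path is the collection of components needed for one way of
    delivering the service. If a component is part of all paths, its
    failure takes down every path at once.
    """
    path_sets = [set(p) for p in paths]
    if not path_sets:
        return set()
    return set.intersection(*path_sets)


# Two cluster nodes that still share one switch and one storage system.
shared = single_points_of_failure([
    {"node1", "nic1", "switchA", "storage"},
    {"node2", "nic2", "switchA", "storage"},
])
print(sorted(shared))

# The same cluster with a second switch and replicated storage.
redundant = single_points_of_failure([
    {"node1", "nic1", "switchA", "storage1"},
    {"node2", "nic2", "switchB", "storage2"},
])
print(sorted(redundant))
```

In the first layout the shared switch and storage system are reported, although the servers themselves are redundant; in the second layout no common component remains.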
Advantages of a failover cluster | Disadvantages of a failover cluster |
---|---|
|
|
Table: Advantages and disadvantages of a failover cluster
As the comparison of the advantages and disadvantages shows, using a failover cluster only makes sense if one or several applications have very high availability requirements. In addition to the high expenditure, the personnel responsible must have very good knowledge of the operating systems and applications used and of the failover functionality. Furthermore, using failover solutions for servers only makes sense if all dependencies, such as the network connection or the availability of the clients, are also designed with the corresponding redundancy.
Areas where failover clusters are typically used in the event of high availability requirements include, for example:
- database applications
- file storage
- dynamic content applications
- email servers
If business processes, applications, or services are characterised by high availability requirements, it must be considered how these requirements may be met. The persons responsible for IT and the security management team should draw up a concept and select appropriate architectures for the corresponding servers.
Review questions:
- Does the selected server architecture take the availability requirements into consideration?