S 2.354 Use of a high availability SAN configuration

Initiation responsibility: IT Security Officer, Top Management

Implementation responsibility: Head of IT, IT Security Officer

If systems and applications whose data is to be stored in the SAN have very high protection requirements in terms of availability, the use of a high-availability SAN configuration must be taken into consideration.

The term "high availability" in this case means having a high resistance to damage events and is also referred to as "disaster-tolerant". In terms of the stored data of an organisation, this means that a storage system is built with the help of SAN components in such a way that

all data is stored at two locations,
the SAN components at the two locations are connected but not dependent on each other, and
a damage event at one location will not seriously affect the functionality of the components at the second location.

The parameters used to determine whether such an architecture is necessary and appropriate include:

The maximum recovery time (often referred to as the RTO: recovery time objective) specifies how much time may elapse after a damage event occurs before the IT systems are again available with sufficient functionality to support business processes.
The maximum acceptable loss of data (often referred to as the RPO: recovery point objective) is based on the amount of "work lost" after a damage event occurs, which can be measured by the age of the last consistent set of data available. It basically describes the amount, or even the level of complexity, of the work that can be recovered within an acceptable amount of time and at an acceptable expense for the organisation.
The affected environment comprises the spatial scope of the damage event. A location, including its systems, only remains useful when it is outside of the sphere of influence of the event.
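
The three parameters above can be compared against a planned configuration. The following is a minimal sketch only: the class names, field names, and units are hypothetical, and the values an organisation uses must come from its own requirements analysis.

```python
from dataclasses import dataclass


@dataclass
class Requirement:
    """Hypothetical container for the three planning parameters."""
    rto_hours: float        # maximum recovery time (RTO)
    rpo_minutes: float      # maximum acceptable age of the last consistent data (RPO)
    event_radius_km: float  # assumed spatial scope of the damage event


@dataclass
class Design:
    """Hypothetical estimates for a planned two-location configuration."""
    expected_recovery_hours: float
    expected_data_age_minutes: float
    site_distance_km: float


def unmet_parameters(req: Requirement, design: Design) -> list[str]:
    """Return the names of the parameters the design does not satisfy."""
    unmet = []
    if design.expected_recovery_hours > req.rto_hours:
        unmet.append("RTO")
    if design.expected_data_age_minutes > req.rpo_minutes:
        unmet.append("RPO")
    # The second location is only useful outside the sphere of influence of the event.
    if design.site_distance_km <= req.event_radius_km:
        unmet.append("affected environment")
    return unmet
```

An empty result means that, under the stated estimates, all three parameters are met. A non-empty result names the parameters that need a different architecture or revised requirements.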

SAN storage systems are a key technology used to meet very high availability requirements for IT systems:

When equipped with high performance links, the systems can be physically separated far enough from one another that it is possible to meet protection objectives even in the case of a large-scale event.

The high performance link can also be used to keep the amount of data lost in a damage event to a minimum.

However, the SAN configuration can only control the maximum downtime of an application to a limited extent. Since downtime can only be measured from the point of view of the users, it depends not only on the availability of the stored data, but equally on the availability of the remaining IT infrastructure (servers, network, PCs, etc.) that the SAN components supply with data.
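
The dependency on the remaining infrastructure can be illustrated with a simple calculation. This is a sketch under two assumptions that are not taken from the text above: every component is required for the application to be usable, and the components fail independently of one another. The function names are hypothetical.

```python
import math

HOURS_PER_YEAR = 8760  # 365 days, ignoring leap years


def chain_availability(availabilities: dict[str, float]) -> float:
    """Availability seen by the users when every component is required.

    Assumes independent failures, so the individual availabilities multiply.
    """
    return math.prod(availabilities.values())


def annual_downtime_hours(availability: float) -> float:
    """Expected downtime per year for a given availability (0.0 to 1.0)."""
    return (1.0 - availability) * HOURS_PER_YEAR
```

Because each factor is at most 1, the result can never exceed the availability of the weakest component. A highly available SAN therefore cannot compensate for servers, networks, or PCs with lower availability.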

Configuration capabilities

There are various ways to configure a SAN system for high availability.

Mirroring by the server

The simplest way to achieve high availability in SAN usage is to connect a server that stores its data in a SAN storage system to a second storage system at a separate location.

Every write access by the server is performed on both storage systems. The disadvantage of this solution is that the "storage" instance is configured in part on the server.

This means that administration is performed on the server. The advantage of central storage systems, i.e. that they can also be administered centrally, is therefore wasted. In addition, the cabling design is more complex, since each server is connected to both storage systems. In simple terms, a second cable must be installed to connect the server directly to the second, remote storage system in addition to installing the cable for the connection between the server and the primary storage system.
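
The following sketch illustrates why the mirroring logic ends up on the server in this variant. It is a toy model with hypothetical class names, not the interface of any real volume manager: the server itself holds the list of storage systems and issues every write twice.

```python
class StorageSystem:
    """Toy stand-in for a storage system at one location."""

    def __init__(self, name: str):
        self.name = name
        self.blocks: dict[int, bytes] = {}
        self.online = True

    def write(self, block_id: int, data: bytes) -> None:
        if not self.online:
            raise IOError(f"{self.name} is unavailable")
        self.blocks[block_id] = data


class MirroringServer:
    """Server that performs every write access on both storage systems."""

    def __init__(self, primary: StorageSystem, secondary: StorageSystem):
        # The mirror configuration lives on the server, not in the SAN.
        self.targets = [primary, secondary]

    def write(self, block_id: int, data: bytes) -> list[str]:
        written = []
        for target in self.targets:
            try:
                target.write(block_id, data)
                written.append(target.name)
            except IOError:
                # A real implementation would mark this mirror as degraded
                # and resynchronise it later; that is omitted here.
                pass
        if not written:
            raise IOError("no storage system reachable")
        return written
```

Each server needs its own instance of this configuration, which is the administrative overhead described above.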

Replication

Replication can be performed by the server or by the storage system.

Server-based replication is generally implemented by separate software, the application, or the operating system. However, this approach usually results in a high load on the CPU, main memory, and bandwidth.

When replication is performed by the storage system, the servers are connected to one storage system, and this storage system synchronises its data, either completely or as configured, with an additional storage system installed at a remote location.

If the locations are close enough to each other, synchronous data replication is possible. "Synchronous data replication" means that each write access by the server is only acknowledged as complete by the storage system it is directly connected to after the second, remote storage system has sent the first storage system confirmation that the data has been recorded successfully.

This means that hard disk accesses are slower from the point of view of the server, since two disk systems need to write the data and because the signal transmission time between the storage systems at location A and location B must also be included.
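
The effect of distance on synchronous writes can be estimated with a rough model. The sketch below rests on assumptions that are not part of the text above: a one-way signal delay of about 5 microseconds per kilometre of optical fibre, one round trip per write, and local and remote writes performed one after the other (real storage systems may overlap them, so the result is closer to an upper bound). The function name and parameters are hypothetical.

```python
# Approximate one-way propagation delay in optical fibre (assumption).
FIBRE_DELAY_US_PER_KM = 5.0


def sync_write_latency_us(local_write_us: float,
                          remote_write_us: float,
                          distance_km: float,
                          round_trips: int = 1) -> float:
    """Estimate the time until the server receives the write acknowledgement.

    Simplified model: local write, then transfer to the remote system,
    remote write, and confirmation back to the first storage system.
    """
    link_us = 2 * distance_km * FIBRE_DELAY_US_PER_KM * round_trips
    return local_write_us + link_us + remote_write_us
```

With write times of 500 microseconds on each system and 10 km between the locations, this model yields 1100 microseconds per write. At 100 km, the link alone contributes 1000 microseconds, which illustrates why synchronous replication is only practical when the locations are close enough to each other.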

When asynchronous data replication is used, special replication software on the storage systems ensures that the storage system at location A regularly transmits its changed data to the storage system at location B.

The server in this case has a storage system available that is not slowed down by replication. Another advantage of this method is that a government agency or company is no longer forced to have two exactly identical storage systems available at two different locations for contingency planning. Instead, one high-performance system is installed at the main location. A more economical system can then be installed at the second location so that the main tasks to be performed in case of an emergency are still guaranteed.

The disadvantage of asynchronous replication is that the second storage system always has an older set of data. The amount of data lost when the primary system fails depends on the technology used.
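
How old the data at the second location can become may be estimated as follows. This is a simplified sketch that assumes periodic replication: a consistent set of changes is captured at fixed intervals and is only usable at location B once its transfer has finished. Actual products may replicate continuously or differently, and the function names are hypothetical.

```python
def worst_case_data_age_s(interval_s: float, transfer_s: float) -> float:
    """Worst-case age of the data at location B when location A fails.

    If A fails just before the current transfer completes, B still holds
    the previous set of data, which was captured one full interval earlier.
    """
    return interval_s + transfer_s


def meets_rpo(interval_s: float, transfer_s: float, rpo_s: float) -> bool:
    """Check the worst-case data age against the maximum acceptable loss of data."""
    return worst_case_data_age_s(interval_s, transfer_s) <= rpo_s
```

Under this model, a replication interval of 10 minutes with a transfer time of 2 minutes can lose up to 12 minutes of work, so it would not satisfy a maximum acceptable loss of data of 10 minutes.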

The use of synchronous data replication of storage systems only makes sense when redundant server systems are also available that can take over operations directly. A situation in which the storage system at one location fails completely but the connected servers and network components (e.g. the components of the SAN) do not is rather rare.

When planning a high availability SAN configuration, the entire contingency planning concept for the IT systems of the organisation must be checked first. The availability requirements for the SAN and the connected servers must be specified in writing.

The planning of a high-availability SAN, adapted to the requirements and risk policies of the organisation, is only the first step towards high availability. At the same time, the future development of the entire IT environment and the emergency plan for the organisation also must be planned appropriately.

The use of a high-availability SAN only makes sense when there are also servers available for recovery and when the users at intact workstations have access to the data over a functioning network.

It must be noted that a test and consolidation system must also be included in a high-availability SAN. Configuration changes and software updates must never be performed directly on the production system when a high-availability configuration is used. The organisation must have reserve systems on which all changes can be tested. This is the only way to ensure that operations are not endangered by administrative intervention.

Review questions:

Has it been ensured that a damage event at one location will not affect the functionality of the SAN components at the second location?
Has the contingency concept of the organisation been incorporated in planning the high-availability SAN configuration?
Is there a test and consolidation system for the high-availability SAN system with regard to configuration changes and software updates?