S 6.93 Contingency planning for z/OS systems

Initiation responsibility: IT Security Officer, Head of IT

Implementation responsibility: Administrator

Secure z/OS operation includes being prepared for different kinds of emergency. The preparations include, for example:

an emergency user procedure necessary if no ID with a certain functionality is available any more,
a procedure for restoring a functional RACF database,
a z/OS backup system that can be activated immediately, and
a business continuity system that may be required for stand-alone systems in order to be able to correct errors.

The individual recommendations for contingency planning are described in more detail below:

Emergency user procedure

An emergency user procedure must be established for contingency planning. The emergency user can be used if no RACF (Resource Access Control Facility) administrator is available in an emergency and/or if all IDs with SPECIAL rights are disabled. One or several emergency user IDs may be configured.

The following rules must be observed:

Access to the emergency user ID

Since the emergency user ID has very high authorisations (SPECIAL) in the system, access to it must be granted restrictively.

The emergency user must only be accessible to persons specified in advance. It should only be available to RACF administrators and system programmers with RACF training.

Reporting and documenting the use of the emergency user

When using the emergency user, the RACF administration, the auditor, and the security management team must be informed as soon as possible. The following information must be reported:

Who used the emergency user?
What was the emergency user needed for?
When did the access happen?
What were the authorisations of the emergency user used for?
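The four reporting questions above could be captured in a fixed record so that no report is filed incomplete. The following is a minimal sketch only; the class name `EmergencyUseReport` and its field names are hypothetical and are not part of RACF or of this safeguard.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List


@dataclass(frozen=True)
class EmergencyUseReport:
    """One report on a use of the emergency user ID (hypothetical structure)."""
    used_by: str           # who used the emergency user
    reason: str            # what the emergency user was needed for
    accessed_at: datetime  # when the access happened
    actions: str           # what the authorisations were used for

    def missing_fields(self) -> List[str]:
        """Return the names of the text fields that were left empty."""
        text_fields = {
            "used_by": self.used_by,
            "reason": self.reason,
            "actions": self.actions,
        }
        return [name for name, value in text_fields.items() if not value.strip()]


# Example: a report that does not yet say what the authorisations were used for.
report = EmergencyUseReport(
    used_by="jdoe",
    reason="all IDs with SPECIAL rights disabled",
    accessed_at=datetime(2024, 1, 15, 3, 30),
    actions="",
)
print(report.missing_fields())  # ['actions']
```

A report would only be forwarded to the RACF administration, the auditor, and the security management team once `missing_fields()` returns an empty list.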

All procedures regarding the emergency user ID must be documented and archived comprehensibly.

Password of the emergency user ID

When logging in with the emergency user ID, the user must immediately change the password to a new one. This is enforced by RACF if the emergency user was equipped with a new initial password.

After having used the emergency user ID, the related password must be reset and stored by the RACF administration.

Misuse of the emergency user procedure

The emergency user procedure must not be misused to extend one's authorisations when there is no emergency. Use of the emergency user for reasons of convenience, in order to bypass defined administration and decision-making paths, must be prevented.

Preventing the emergency user blocking

All IDs can be blocked automatically after a specified period of inactivity; the corresponding setting is made in the SETROPTS parameters of RACF. Such blocking may also affect emergency user IDs if they are not used for extended periods of time. Consideration must therefore be given to preventing this automatic blocking with a batch job that uses the emergency user IDs regularly (e.g. once a month), which updates the time stamps in the RACF database. The batch job can be initiated by a job scheduler. It must be ensured that the password of the emergency user is not disclosed to anyone except the employees expressly authorised in this respect. For this purpose, the RACF class SURROGAT should be used so that no password needs to be configured in the Job Control Language.
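The arithmetic behind the monthly batch job can be illustrated with a small check that flags IDs approaching the inactivity limit. This is only a sketch: the function names, the ID names, and the 90-day limit are assumptions for illustration, and the dates would in practice have to be taken from the RACF database rather than from a Python dictionary.

```python
from datetime import date
from typing import Dict, List


def days_until_block(last_used: date, today: date, inactive_days: int) -> int:
    """Days left before an ID last used on `last_used` exceeds the inactivity limit."""
    return inactive_days - (today - last_used).days


def ids_at_risk(last_used_by_id: Dict[str, date], today: date,
                inactive_days: int, warn_days: int = 7) -> List[str]:
    """Return the IDs that would be blocked within `warn_days` days."""
    return sorted(
        user_id
        for user_id, last_used in last_used_by_id.items()
        if days_until_block(last_used, today, inactive_days) <= warn_days
    )


# Hypothetical emergency IDs and an assumed inactivity limit of 90 days.
last_used = {"EMERG1": date(2024, 1, 1), "EMERG2": date(2024, 3, 1)}
print(ids_at_risk(last_used, today=date(2024, 3, 28), inactive_days=90))  # ['EMERG1']
```

With a job that uses each emergency ID once a month, the remaining margin never drops below roughly two months under these assumed values.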

Procedures for restoring z/OS RACF databases

The RACF database is the most important and central storage location for the security settings of a z/OS system. If secure operations are to be guaranteed, the RACF database must work properly. In order to counteract problems related to unavailable or faulty RACF databases, the following recommendations must be taken into consideration:

Backing up the RACF databases

It is important that the synchronisation of the RACF databases works properly. Therefore, in order to back up active databases (the databases identified as being active in the RVARY display), either the RACF utility IRRUT200 (recommended by IBM) or IRRUT400 must be used at all times.

During the backup, numerous LOCK functions are executed. Therefore, the batch job performing the backup should be scheduled in a time window with the lowest possible utilisation.

The backups must not be stored on the same hard disk as the RACF databases in use.

Keeping several generations of the backups should be considered. Weekends must also be taken into account.

Like the RACF databases themselves, the backup copies of the databases must be protected using corresponding RACF profiles (see S 4.211 Use of the z/OS security system RACF).

RACF database recovery

In the z/OS system, there is a primary and a backup RACF database. These may be switched during operation. For reasons of security, the two databases must be stored on different disks. If errors occur in the primary database, an RVARY SWITCH command can be used to make the backup database the primary and the former primary the backup. The faulty database, which is now the backup, can then normally be deleted and replaced by a new one.

If both RACF databases are faulty, the faulty RACF database can be replaced by a valid backup copy (possibly from another system) in order to restore system operation. For stand-alone systems, a business continuity system is required for this (see below: Setting up a z/OS business continuity system).

Traceability in the event of an error

A procedure for backing up and restoring the RACF database must be set up.

A procedure must be set up so that changes made to the RACF database between the most recent backup and the time of the emergency can be traced. One possibility is to allow RACF changes only by means of documented batch jobs. Analysing the SMF records directly after RACF changes is an alternative. Both procedures must be documented comprehensibly. The documentation must be available to the administrators.
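Whichever source is used (documented batch jobs or SMF records), the tracing step amounts to selecting the changes made after the most recent backup and up to the failure. A minimal sketch follows, assuming the changes have already been extracted into a simple list; the `RacfChange` record and the function name are hypothetical, and the command strings are treated as opaque text.

```python
from datetime import datetime
from typing import Iterable, List, NamedTuple


class RacfChange(NamedTuple):
    """One documented RACF change (hypothetical record layout)."""
    timestamp: datetime
    admin: str
    command: str


def changes_to_replay(log: Iterable[RacfChange], backup_time: datetime,
                      failure_time: datetime) -> List[RacfChange]:
    """Return the changes made after the backup and up to the failure, oldest first."""
    return sorted(
        (change for change in log if backup_time < change.timestamp <= failure_time),
        key=lambda change: change.timestamp,
    )


log = [
    RacfChange(datetime(2024, 5, 1, 11, 0), "adm2", "CONNECT USER1 GROUP(PAYROLL)"),
    RacfChange(datetime(2024, 5, 1, 1, 0), "adm1", "ADDUSER USER0"),
    RacfChange(datetime(2024, 5, 1, 9, 0), "adm1", "ADDUSER USER1"),
]
for change in changes_to_replay(log, datetime(2024, 5, 1, 2, 0), datetime(2024, 5, 1, 14, 0)):
    print(change.timestamp, change.admin, change.command)
```

Only the two changes made after the 02:00 backup are listed, in chronological order, so that they can be reviewed and reapplied to the restored database.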

z/OS backup system

If the z/OS system (or even an entire Parallel Sysplex cluster) cannot be started after system errors, it is important to return the system or systems to an operational state as quickly as possible. Such failures may be caused, for example, by a technical error or by incorrect manual input. Therefore, a separate set of hard disks containing a copy of the current operating system should be provided. In most cases, this allows a z/OS operating system to be reactivated quickly simply by changing the IPL (Initial Program Load) address. The following recommendations must be taken into account:

Hard disk concept

The hard disk concept for the z/OS operating system and the related program products (such as schedulers, output managers, and the like) must be designed logically and be clearly discernible. Files belonging together, e.g. those of the operating system, must not be distributed across a large number of different hard disks. The number of hard disks used should be as low as possible so that complete backups can be performed relatively easily.

Cloning process

A cloning process performing at least the following actions should be set up for creating the backup hard disk:

copying the system residences,
copying the program product hard disks,
copying the HFS hard disks (Hierarchical File System),
copying the SMP/E hard disks (System Modification Program/Extended),
changing the volume information in SMP/E using the ZONEEDIT function (replacing old volume information by new), and
adapting the volume information in the IEASYMnn member of the Parmlib.
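The last two steps both replace old volume information with new. The principle can be sketched as a whole-word substitution over a text copy of the definitions. This is an illustration only: the function name, the volume serials, and the symbol `&PRODVOL` are hypothetical, and the sketch is no substitute for the SMP/E ZONEEDIT function or for editing the real Parmlib member.

```python
import re
from typing import Dict


def remap_volumes(text: str, volume_map: Dict[str, str]) -> str:
    """Replace each old volume serial by its new one, matching whole words only."""
    if not volume_map:
        return text
    alternatives = "|".join(re.escape(old) for old in volume_map)
    pattern = re.compile(r"\b(" + alternatives + r")\b")
    return pattern.sub(lambda match: volume_map[match.group(1)], text)


# Hypothetical excerpt of symbol definitions and a hypothetical old-to-new mapping.
definitions = "SYMDEF(&SYSR2='RESA02')\nSYMDEF(&PRODVOL='PRDA01')"
print(remap_volumes(definitions, {"RESA02": "RESB02", "PRDA01": "PRDB01"}))
```

Matching whole words prevents a serial from being rewritten when it only appears as part of a longer name.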

Maintenance concept

In order not to endanger live operations, a separate set of hard disks is normally used for servicing the z/OS operating system. Upon completion of the maintenance work, consideration must be given to using this set as the new active disk set and the previously active disks as the backup set.

Use of system variables

In order to simplify the definitions, symbolic variables should be used wherever technically possible and reasonable (in z/OS 1.4 and higher, up to 800 such variables can be defined). Consideration should be given to defining the entries of the master catalogue and its ALIAS entries by means of such variables so that a switchover is possible at any time without manual intervention. Symbolic variables can be used in numerous definitions; however, some definitions do not yet support them.
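The benefit of symbolic variables for a switchover can be shown with a simplified model of symbol substitution: the definition text stays the same and only the symbol values differ between the active and the backup disk set. The sketch below is not the z/OS implementation; it assumes symbols of the form `&NAME` with an optional closing period, and the symbol `&SYSR2` and all values are examples.

```python
import re
from typing import Dict

# A symbol is '&' followed by a name; one trailing period, if present, ends the symbol.
_SYMBOL = re.compile(r"&([A-Z0-9@#$_]+)\.?")


def substitute(text: str, symbols: Dict[str, str]) -> str:
    """Replace known symbols by their values; unknown symbols are left unchanged."""
    def replace(match):
        return symbols.get(match.group(1), match.group(0))
    return _SYMBOL.sub(replace, text)


definition = "SYS1.&SYSNAME..PARMLIB on &SYSR2"
print(substitute(definition, {"SYSNAME": "PRDA", "SYSR2": "RESA02"}))
# SYS1.PRDA.PARMLIB on RESA02
print(substitute(definition, {"SYSNAME": "PRDA", "SYSR2": "RESB02"}))
# SYS1.PRDA.PARMLIB on RESB02
```

Switching to the backup set then changes one symbol value instead of every definition that names a volume.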

Maintenance of work files

In order to avoid unnecessary maintenance work, work files such as catalogues, Parmlibs, Proclibs, and databases of program products should not be maintained in two or more versions.

Setting up a z/OS business continuity system

Errors in critical software components, e.g. RACF (Resource Access Control Facility) or the master catalogue, may cause the entire system to fail. For stand-alone systems, a business continuity system must therefore be available at short notice that can be started without major problems and enables repair of the faulty system.

As opposed to backup systems, the business continuity system is not intended for production operations. The following information must be taken into account when setting up business continuity systems:

Independence

The business continuity system must be configured completely independently of the files and definitions of the production systems.

Reduction to the essential

The business continuity system should not contain more software functions than absolutely required for the repair work, so that not more than one hard disk is required for the system. The required functions include JESx (Job Entry Subsystem), VTAM (Virtual Telecommunications Access Method), and TSO (Time Sharing Option) with the related ISPF (Interactive System Productivity Facility) files. It must be considered whether a system without JES is sufficient; in that case, however, no batch jobs can be used.

Volume information

All procedures must be equipped with volume information in order to avoid dependencies on catalogues. Therefore, no SMS files (System Managed Storage) should be used either.

VTAM terminals

The VTAM procedure set up must be as simple as possible. It must provide at least one VTAM Local Node containing the address of an MCS (Multiple Console Support) console. This makes it possible to establish a VTAM connection and log in to the faulty system. In the event of changes to the VTAM configuration, the definition of the VTAM Local Node must be updated accordingly.

Components of the business continuity system

The business continuity system should be installed on a hard disk containing the following files and components as a minimum:

IPL text,
master catalogue,
JESx checkpoint and spool file,
page dataset,
system files (MANx, STGINDEX, LOGREC, DAE),
Parmlib, Proclib (do not forget the logon procedure),
SMF files (SYS1.MANx),
BROADCAST and UADS files, and
RACF database.
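The minimum list above lends itself to a simple completeness check during the periodic review of the business continuity system. The following sketch compares an inventory against the list; the component labels are taken from the list, while the function name and the way the inventory is obtained are assumptions.

```python
from typing import Iterable, List

# Minimum components of the business continuity disk, as listed above.
REQUIRED_COMPONENTS = frozenset({
    "IPL text",
    "master catalogue",
    "JESx checkpoint and spool file",
    "page dataset",
    "system files",
    "Parmlib",
    "Proclib",
    "SMF files",
    "BROADCAST and UADS files",
    "RACF database",
})


def missing_components(present: Iterable[str]) -> List[str]:
    """Return the required components that are absent from the inventory."""
    return sorted(REQUIRED_COMPONENTS - set(present))


# Hypothetical inventory in which the logon-relevant Proclib was forgotten.
inventory = REQUIRED_COMPONENTS - {"Proclib"}
print(missing_components(inventory))  # ['Proclib']
```

An empty result only confirms that every item is present; whether the system actually starts must still be checked by a test IPL.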

User IDs for cases of emergency

At least two user IDs must be available on the business continuity system that are handled like the emergency user.

Permanent service

Site and system access to the business continuity system must be controlled. Changes to the normal system must also be performed promptly in the business continuity system, if they are relevant for the business continuity system. The functionality of the business continuity system must be checked periodically.

Review questions:

Has an emergency user procedure been established for the z/OS systems?
Are all procedures regarding the emergency user ID documented and archived comprehensibly during contingency planning for the z/OS system?
Are the backup copies of the RACF databases, as well as the RACF databases themselves, protected using corresponding RACF profiles?
Has a procedure for backing up and restoring the RACF database been set up?
Is a z/OS backup system held available on a separate set of hard disks containing a copy of the current operating system?
Has a z/OS business continuity system been set up for stand-alone systems?