S 2.258 Consistent indexing of documents during archiving

Initiation responsibility: Head of IT, Archive Administrator

Implementation responsibility: Archive Administrator, Head of IT, Administrator

When operating the archive, it is important to index all documents and datasets unambiguously so that they can be found reliably in later archive queries. Archive systems provide search functions for this purpose. Because a full-text search can take a very long time depending on the type and volume of the archived data, archive systems store, for each document, a separate record containing index information in a dedicated search database. The structure and extent of the index information can normally be configured and should have the following properties:

Unambiguousness: The document identifiers must be unambiguous.
Support of expected search queries: The context information is intended to speed up future search queries. Because the future search context is not known in advance, future search queries can only be estimated, and the context information should be designed to be as meaningful as possible.
Low extent: A small amount of index data speeds up later search queries. If the amount is too small, however, search queries may be impaired and documents may be harder to find. The extent of the context information must ultimately be defined depending on the expected data volume.
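
The three properties can be illustrated with a minimal sketch. It assumes an in-memory stand-in for the search database; the class names, the field names in ALLOWED_FIELDS and the identifiers are hypothetical and are not taken from any particular archive product.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IndexRecord:
    """Hypothetical index entry kept separately from the archived document."""
    doc_id: str      # unambiguous document identifier
    context: tuple   # small set of (field, value) pairs used for searching


class SearchIndex:
    """In-memory stand-in for the separate search database."""

    # Assumed context fields; a real archive would configure its own set.
    ALLOWED_FIELDS = ("business_process", "doc_type", "year")

    def __init__(self):
        self._records = {}

    def add(self, doc_id, **context):
        # Unambiguousness: an identifier may be assigned only once.
        if doc_id in self._records:
            raise ValueError(f"duplicate document identifier: {doc_id}")
        # Low extent: only the configured fields may be indexed.
        unknown = set(context) - set(self.ALLOWED_FIELDS)
        if unknown:
            raise ValueError(f"unconfigured index fields: {sorted(unknown)}")
        self._records[doc_id] = IndexRecord(doc_id, tuple(sorted(context.items())))

    def find(self, **criteria):
        # Expected queries: match on context fields, not on full text.
        wanted = set(criteria.items())
        return sorted(
            r.doc_id for r in self._records.values() if wanted <= set(r.context)
        )
```

Restricting the index to a configured field list keeps the index data small, at the cost that a later change to the field list requires existing records to be re-indexed.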

As a matter of principle, these parameters must be defined before commissioning the archive. Nevertheless, it may become necessary to change the properties over the course of time. Depending on the extent and type of changes to the index data, this may require very time-consuming re-indexing of the archive databases.

The context information for individual documents to be archived can be generated in different ways. Three procedures are distinguished:

Manual generation: At the document management system level, index information is generated manually for every document using the input mask. With this procedure, particularly with large amounts of data, there is a risk of the index information being inconsistent.
Semi-automatic generation: These procedures automate the assignment of index data, but provide for manual control and correction options.
Fully automatic generation: Here, the document indices are assigned fully automatically without any manual intervention.

The selection of the procedure depends on the expected data volume. If individual documents are archived irregularly, a manual procedure that follows defined specifications for generating the context is sufficient.

If large data volumes are archived regularly, a semi-automatic procedure should be selected for generating the index data. This provides the option of manually checking and correcting this information before the document and its index are archived, after which they might no longer be changeable.
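
A semi-automatic procedure could be sketched as follows. The sketch assumes that documents carry header lines of the form "Key: value"; the field names "Process" and "Date" and both function names are hypothetical. Index data is proposed automatically and is released for archiving only after a manual review step.

```python
import re

# Assumed index fields, expected as "Key: value" header lines in the document.
FIELDS = ("Process", "Date")


def propose_index(text):
    """Automatically propose index data; fields that are not found stay None."""
    proposal = {}
    for key in FIELDS:
        match = re.search(rf"^{key}:[ \t]*(\S.*)$", text, flags=re.MULTILINE)
        proposal[key.lower()] = match.group(1).strip() if match else None
    return proposal


def review(proposal, corrections):
    """Apply manual corrections and refuse to release incomplete index data."""
    final = {**proposal, **corrections}
    missing = sorted(k for k, v in final.items() if not v)
    if missing:
        raise ValueError(f"index data incomplete: {missing}")
    return final
```

In this sketch the reviewer supplies corrections as a dictionary, for example review(propose_index(text), {"date": "2021-03-01"}). A document whose index data is still incomplete after the review is not released.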

During fully automatic generation of index data, errors cannot be detected and/or corrected. In this case, it is not possible to detect or rule out an erroneous assignment of documents to be archived, for example to business processes. Therefore, this procedure should only be used if all documents are structured in such a way that all index data can be extracted reliably and without any doubt in every case.
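
One way to approximate the condition "extractable without any doubt" is a strict structural check before archiving. This minimal sketch again assumes "Key: value" header lines; the field names, the exception and the function name are hypothetical. It only rejects documents in which a required field is missing or occurs more than once. It cannot detect values that are well-formed but wrong.

```python
import re

# Assumed required index fields, expected as "Key: value" header lines.
REQUIRED_FIELDS = ("Process", "Date")


class AmbiguousIndexData(Exception):
    """Raised when a required index field is missing or occurs more than once."""


def extract_strict(text):
    """Return index data only if every required field occurs exactly once."""
    index = {}
    for key in REQUIRED_FIELDS:
        values = re.findall(rf"^{key}:[ \t]*(\S.*)$", text, flags=re.MULTILINE)
        if len(values) != 1:
            raise AmbiguousIndexData(
                f"{key}: expected exactly one value, found {len(values)}"
            )
        index[key.lower()] = values[0].strip()
    return index
```

Documents rejected by such a check would have to be handled by a manual or semi-automatic procedure instead of being archived with doubtful index data.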

Review questions:

Are all stored documents and datasets indexed unambiguously during operation of the archive?
Were the structure and the extent of the index information of the archive defined prior to commissioning?
Regular archiving of large data volumes and configurable index information: Is a semi-automatic procedure used for generating the index data?
Fully automatic index data generation: Is it possible to extract all index data reliably and without any doubt?