S 4.170 Selection of suitable data formats for the archival storage of documents

Initiation responsibility: Head of IT

Implementation responsibility: Head of IT, Administrator

When archiving electronic documents, suitable archive data formats must be selected. The data format should allow long-term reproduction of the original version of the archive data as well as reproduction of selected characteristics of the original document medium (for example the paper format, colours, logos, number of pages, watermarks, signatures). The data formats currently used for archiving are suited to different purposes, and the suitability of each format depends heavily on the reasons for archiving the data and on the original data medium. When switching media and data formats, though, it is generally impossible to preserve all structural features of the original medium at the same time.

Since it is usually impossible to predict in advance which characteristics of the original document will need to be verified, and with what level of certainty, when the document is reproduced later, documents are normally archived simultaneously in several different electronic data formats. This is intended to ensure that as many characteristics of the original document as possible are archived. The conversion process is often referred to as rendition.
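The idea of archiving several renditions to cover as many characteristics as possible can be illustrated with a minimal sketch. The mapping of formats to preserved characteristics below is purely illustrative and an assumption of this example, as are the names PRESERVES, covered and missing; a real assessment depends on the documents and the archiving purpose.

```python
# Illustrative assumption: which characteristics each rendition preserves.
# These sets are examples only, not an authoritative classification.
PRESERVES = {
    "XML": {"logical structure", "text content"},
    "TIFF": {"page layout", "logos", "number of pages"},
}


def covered(formats: list[str]) -> set[str]:
    """Return every characteristic preserved by at least one rendition."""
    return set().union(*(PRESERVES[f] for f in formats))


def missing(required: set[str], formats: list[str]) -> set[str]:
    """Return the required characteristics that no chosen rendition preserves."""
    return required - covered(formats)


required = {"text content", "page layout"}
print(missing(required, ["XML"]))
print(missing(required, ["XML", "TIFF"]))
```

With only the structural rendition, the layout requirement remains uncovered; adding the image rendition closes the gap.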

The following points out the primary criteria to apply when selecting suitable data formats:

For paper documents, a graphic representation of the document is usually archived in addition to a structural representation (in a structure description language). Under some circumstances, electronic signatures are also archived to verify the authenticity of the corresponding documents.

Some typical data formats are described and their suitability for use in electronic archives is discussed in the following sections.

A. Structure formats

SGML

SGML (Standard Generalized Markup Language) is a document description language that describes the logical structure and contents of electronic documents. SGML is standardised in ISO Standard 8879.

In addition to the structure (syntax) of documents, SGML describes in particular the semantics of the structure elements of the electronic document. However, SGML does not specify the actual presentation or format of the contents of a document when a document is displayed or otherwise reproduced.

The following are the most important features of SGML:

SGML can be used as a format for the long-term archiving of electronic documents. When archiving, though, it is absolutely necessary to archive the semantic specification (DTD) as well. Since SGML does not contain any layout information, it is recommended to also archive a graphic representation of the original document, for example in the TIFF format.

HTML

HTML (Hypertext Markup Language) is a structure description language for electronic documents. HTML is based on a subset of the SGML description elements and has become the standard for presenting and exchanging documents on the World Wide Web.

HTML offers a very limited number of possible structural features for documents and can be understood as a special version of SGML with an implicit DTD.

The following are the most important features of HTML:

HTML is not recommended as a format for long-term archiving. Because HTML lacks syntactic and semantic flexibility, extensions to the HTML standard can be expected to be released at ever shorter intervals in the future.

HTML is also unsuitable for archiving purposes because of the dynamic structure of HTML documents: the entire document must be archived, including all linked images, subdocuments, and cross-references. Archived HTML documents should not contain any active links to non-archived parts of the document, since it is impossible to ensure that such external parts will still be available when the document is reproduced later.
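A check for links to non-archived parts can be sketched as follows, using only the Python standard library. This is a minimal sketch under stated assumptions: only href and src attributes are examined, every absolute URL is treated as external, and the names ReferenceCollector and unarchived_references as well as the file names are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse


class ReferenceCollector(HTMLParser):
    """Collect the targets of all href and src attributes."""

    def __init__(self) -> None:
        super().__init__()
        self.references: list[str] = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in ("href", "src") and value:
                self.references.append(value)


def unarchived_references(html_text: str, archived_files: set[str]) -> list[str]:
    """Return references that point outside the set of archived files."""
    collector = ReferenceCollector()
    collector.feed(html_text)
    collector.close()
    problems = []
    for ref in collector.references:
        parsed = urlparse(ref)
        if parsed.scheme or parsed.netloc:
            problems.append(ref)          # absolute URL: treated as external
        elif parsed.path == "":
            continue                      # pure "#fragment": same document
        elif parsed.path not in archived_files:
            problems.append(ref)          # relative target that was not archived
    return problems


page = '<a href="https://example.org/x">x</a><img src="logo.png"><a href="annex.html">a</a>'
print(unarchived_references(page, {"logo.png"}))
```

Any reference returned by such a check would need to be archived together with the document or removed before archiving.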

XML

Because of the restricted functionality of HTML, the W3C developed XML (Extensible Markup Language) as a subset of SGML. XML allows users to utilise the advantages of the SGML language while reducing its complexity at the same time.

The following are the most important features of XML:

XML can be used as a format for the long-term archiving of electronic documents. When archiving a document, though, it is absolutely necessary to archive the semantic specification (DTD, Document Type Definition) as well, and possibly even the description of the layout information in XSL.
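Whether the DTD referenced by an XML document is stored alongside it can be checked with a short sketch like the one below. It assumes that the document declares an external DTD through a DOCTYPE system identifier and that archived files are tracked by name; the function names declared_dtd and dtd_is_archived and the file name note.dtd are hypothetical.

```python
from typing import Optional
from xml.dom import minidom


def declared_dtd(xml_text: str) -> Optional[str]:
    """Return the system identifier of the DOCTYPE declaration, if any."""
    document = minidom.parseString(xml_text)
    doctype = document.doctype
    return doctype.systemId if doctype is not None else None


def dtd_is_archived(xml_text: str, archived_files: set[str]) -> bool:
    """True only if the document names an external DTD that is in the archive."""
    dtd = declared_dtd(xml_text)
    return dtd is not None and dtd in archived_files


xml_text = '<?xml version="1.0"?><!DOCTYPE note SYSTEM "note.dtd"><note>text</note>'
print(declared_dtd(xml_text))
print(dtd_is_archived(xml_text, {"note.dtd"}))
```

The sketch only confirms that a DTD file of the declared name is present. It does not validate the document against the DTD, and it does not cover XSL style sheets.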

PDF

PDF (Portable Document Format) is a document format in which the most important layout information of an electronic document is stored together with the structural information.

PDF was developed by Adobe based on the PostScript data format.

The appearance of a PDF document is described by a stream of data containing a series of graphical objects. A document is fully specified by this description. The decision of how the document should appear when displayed is made in this case when the document is created, and the resulting appearance is then permanent. Documents in the PDF format usually require significantly less storage space than documents stored as an image (i.e. represented by pixels).

The goal when using PDF is to maintain the appearance of an electronic document regardless of the application software, the hardware platform, or the operating system used to create the document. The PDF format is therefore primarily suitable for archiving documents whose appearance is intended to match their appearance on paper or that have the character of letters and business documents.

PDF/A is a version of PDF designed especially to meet the requirements of long-term archiving and was standardised in ISO 19005-1:2005. PDF/A (the "A" stands for archiving in this case) defines a stable subset of PDF that can be used to describe the documents to be archived so that the file itself contains all necessary information in an unambiguous, accessible, and usable form.

PDF/A can be used as a format for the long-term archiving of electronic documents. When PDF/A is used, the documents must be checked to ensure that they conform to the PDF/A specification.
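As a first screening step before a full conformity check, the PDF/A level that a file claims for itself can be read from its XMP metadata, where PDF/A files carry the pdfaid:part and pdfaid:conformance properties. The following is a minimal sketch, not a validator: it assumes the XMP packet is stored uncompressed in a single-byte-compatible encoding, it only reports the claim made by the file, and the function name claimed_pdfa_level is hypothetical. Actual conformity must be established with a dedicated PDF/A validation tool.

```python
import re
from typing import Optional

# The properties may appear as XML elements or as attributes in the XMP packet.
_PDFA_PART = re.compile(rb'pdfaid:part\s*(?:=\s*["\']|>)\s*(\d+)')
_PDFA_CONFORMANCE = re.compile(rb'pdfaid:conformance\s*(?:=\s*["\']|>)\s*([A-Za-z])')


def claimed_pdfa_level(pdf_bytes: bytes) -> Optional[str]:
    """Return e.g. 'PDF/A-1b' if the file claims a PDF/A level, else None."""
    part = _PDFA_PART.search(pdf_bytes)
    if part is None:
        return None  # no PDF/A identification found in the raw bytes
    level = part.group(1).decode("ascii")
    conformance = _PDFA_CONFORMANCE.search(pdf_bytes)
    if conformance is not None:
        level += conformance.group(1).decode("ascii").lower()
    return "PDF/A-" + level
```

A file for which this function returns None can be rejected immediately; a file that claims a level still has to pass full validation.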

B. Image formats

TIFF

The TIFF format (Tagged Image File Format) is used to store raster images (bitmaps). A TIFF file consists of a file header and the image information. The header contains tags that specify the properties of the image recorded, for example its resolution or the compression method used.
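The header and tag structure described above can be illustrated with a short sketch that reads the tags of the first image file directory (IFD) and looks up the compression method (tag 259, where the value 4 denotes CCITT Group 4). This is a minimal sketch, assuming a classic (non-BigTIFF) file and considering only single-value SHORT and LONG entries; the function name read_simple_tags is hypothetical.

```python
import struct

COMPRESSION_TAG = 259
COMPRESSION_NAMES = {1: "uncompressed", 3: "CCITT Group 3", 4: "CCITT Group 4",
                     5: "LZW", 7: "JPEG"}


def read_simple_tags(tiff_bytes: bytes) -> dict[int, int]:
    """Return {tag: value} for single-value SHORT/LONG entries of the first IFD."""
    byte_order = tiff_bytes[:2]
    if byte_order == b"II":
        endian = "<"   # little-endian
    elif byte_order == b"MM":
        endian = ">"   # big-endian
    else:
        raise ValueError("not a TIFF file: unknown byte-order mark")
    magic, ifd_offset = struct.unpack_from(endian + "HI", tiff_bytes, 2)
    if magic != 42:
        raise ValueError("not a classic TIFF file: wrong magic number")
    (entry_count,) = struct.unpack_from(endian + "H", tiff_bytes, ifd_offset)
    tags = {}
    for index in range(entry_count):
        offset = ifd_offset + 2 + 12 * index   # each IFD entry is 12 bytes long
        tag, field_type, count = struct.unpack_from(endian + "HHI", tiff_bytes, offset)
        if count != 1:
            continue                           # multi-value fields are skipped here
        if field_type == 3:                    # SHORT, left-justified in value field
            (value,) = struct.unpack_from(endian + "H", tiff_bytes, offset + 8)
        elif field_type == 4:                  # LONG
            (value,) = struct.unpack_from(endian + "I", tiff_bytes, offset + 8)
        else:
            continue                           # other field types are ignored
        tags[tag] = value
    return tags


def compression_name(tiff_bytes: bytes) -> str:
    """Name the compression method; tag 259 defaults to 1 (uncompressed)."""
    code = read_simple_tags(tiff_bytes).get(COMPRESSION_TAG, 1)
    return COMPRESSION_NAMES.get(code, "other (code %d)" % code)
```

Such a check could be run when documents are accepted into the archive to confirm that the expected compression method was actually used.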

The following are the most important features of TIFF:

In compressed form, TIFF is suitable for use as a format for the long-term archiving of images and graphical representations of documents. It is recommended to use a lossless compression method such as ITU/CCITT Group 4 in order to minimise the amount of storage space needed.

GIF

The GIF format (Graphics Interchange Format) is used to store bitmap images.

The following are the most important features of GIF:

The use of the GIF format is not recommended for long-term archiving, but GIF can be used for short- and medium-term archiving.

JPEG

JPEG was developed by the Joint Photographic Experts Group and is especially suitable for colour and grey-scale images. In this area, the JPEG compression method is just as effective as ITU Group 4 compression.

JPEG compression can be configured using a few parameters, and different compression rates can be selected through their settings. Depending on the settings, however, image data can be lost.

The following are the most important features of JPEG:

JPEG is a suitable format for the long-term archiving of images and graphical representations of documents. For audit-proof archiving, it is recommended to select a lossless compression level.

C. Audio and video formats

When processing audio and video data digitally, very large amounts of data can be generated quickly even for short recordings. For this reason, effective compression is especially important.

Lossless compression methods for audio and video data can currently only reach compression ratios of about 2:1. It is much more common to use a lossy method, which achieves a compression ratio of up to 200:1 at the cost of a loss of information. The (in some cases considerable) loss of information resulting from compression is typically accepted as long as it is not perceptible to the human eye or ear or is not considered troublesome.
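The practical effect of these ratios on storage requirements can be estimated with a simple calculation. The sketch below uses the 2:1 and 200:1 ratios from the text; the video parameters (720 x 576 pixels, 3 bytes per pixel, 25 frames per second, 60 seconds) and the function names are assumptions chosen purely for illustration.

```python
def raw_video_bytes(width: int, height: int, bytes_per_pixel: int,
                    frames_per_second: int, seconds: int) -> int:
    """Size of an uncompressed video recording in bytes."""
    return width * height * bytes_per_pixel * frames_per_second * seconds


def compressed_bytes(raw_bytes: int, ratio: float) -> float:
    """Estimated size after compression at the given ratio (e.g. 200 for 200:1)."""
    return raw_bytes / ratio


# Hypothetical one-minute recording used only as an example.
raw = raw_video_bytes(720, 576, 3, 25, 60)
for ratio in (2, 200):
    print("%d:1 -> %.1f MB" % (ratio, compressed_bytes(raw, ratio) / 1_000_000))
```

The difference of two orders of magnitude explains why lossy methods dominate in practice and why their suitability has to be assessed for each archiving application.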

The suitability of lossy compression methods for archiving video and audio material must be examined based on the application.

The following presents a few typical formats:

MPEG

The Moving Picture Experts Group (MPEG) is responsible within ISO for developing global standards for the compression of digitised moving images.

There are currently three different MPEG methods available:

ITU H.261

In 1990, the H.261 standard was adopted by the ITU for the coding of video signals. H.261 coding was developed and optimised for transmission over ISDN channels.

ITU H.263

ITU standard H.263, approved in 1995/96, is a refinement of the H.261 standard. It was originally developed for data rates of up to 64 kbit/s; this restriction no longer exists today. Compared to the H.261 standard, image quality was increased significantly while compression was improved considerably at the same time.

Review questions: