S 4.170 Selection of suitable data formats for the archival storage of documents
Initiation responsibility: Head of IT
Implementation responsibility: Head of IT, Administrator
When archiving electronic documents, it is necessary to select suitable data archive formats. The data format should allow long-term reproduction of the original version of the archive data as well as reproduction of selected characteristics of the original document medium (for example the paper format, colours, logos, number of pages, watermarks, signatures). The data formats currently used for archiving are suited to different purposes, and the suitability of each format depends highly on the reasons for archiving the data and on the original data media. When switching media and data formats, though, it is generally impossible to represent all structural features of the original medium at the same time.
Since it is usually impossible to predict in advance which characteristics of the original document will need to be verified, and with what level of certainty, when the document is reproduced later, documents are normally archived simultaneously in several different electronic data formats. This is intended to ensure that as many characteristics of the original document as possible are archived. The conversion process is often referred to as rendition.
The primary criteria to apply when selecting suitable data formats are as follows:
- The data format should remain relevant for as long as possible
- It should be possible to unambiguously interpret the document structure
- It should be possible to process the contents of the document electronically
- All statutory regulations must be followed
- The grammar and syntax of the data format must be documented in detail so that the data format can be migrated later without any problems
- It should be possible to verify the characteristics of the original document (e.g. electronic or paper form) with complete certainty, even when the original document is not available any more
For paper documents, a graphic representation of the document is usually archived in addition to a structural representation (in a structure description language). Under some circumstances, electronic signatures are also archived to verify the authenticity of the corresponding documents.
Some typical data formats are described and their suitability for use in electronic archives is discussed in the following sections.
A. Structure formats
SGML
SGML (Standard Generalized Markup Language) is a document description language that describes the logical structure and contents of electronic documents. SGML is standardised in ISO Standard 8879.
In addition to the structure (syntax) of documents, SGML describes in particular the semantics of the structure elements of the electronic document. However, SGML does not specify the actual presentation or format of the contents of a document when a document is displayed or otherwise reproduced.
The following are the most important features of SGML:
- The semantics of the SGML elements are defined separately in the DTD (Document Type Definition). The DTD is used as the basis for exchanging documents between organisations and applications.
- SGML is suitable for use as an independent representation of structured text documents and for storing these documents since the layout information is handled separately from the contents of the document.
- SGML can be used directly to reproduce the structures in document management systems.
SGML can be used as a format for the long-term archiving of electronic documents. When archiving, though, it is absolutely necessary to archive the semantic specification (DTD) as well. Since SGML does not contain any layout information, it is recommended to archive a graphic representation of the original document alongside each SGML document, for example in the TIFF format.
HTML
HTML (Hypertext Markup Language) is a structure description language for electronic documents. HTML is based on a subset of the SGML description elements and has become the standard for presenting and exchanging documents on the World Wide Web.
HTML offers a very limited number of possible structural features for documents and can be understood as a special version of SGML with an implicit DTD.
The following are the most important features of HTML:
- With HTML, parts of a document can be integrated to form an overall document structure using hyperlinks. As a result, it is possible to integrate images and text sections that are physically stored on distributed servers into the text flow. Due to the dynamic integration of text and images, it is possible for parts of the overall document to change without the knowledge of the owner of the overall document if linked sections or images are changed or are not accessible any more.
- HTML is restricted to the currently existing structural features. It is not possible to modify or expand the syntax or semantics of the HTML tags.
- Due to the lack of flexibility of HTML, the HTML standard has to be revised whenever the requirements change. The standardisation body responsible for the HTML standard, the World Wide Web Consortium (W3C), has revised it regularly in the last few years. In addition, manufacturers of HTML browsers have implemented their own extensions. It can be assumed that new extensions to the language will continue to be issued in the future.
HTML is not recommended as a format for long-term archiving. Because the language lacks syntactic and semantic flexibility, extensions to the HTML standard can be expected at even shorter intervals in the future.
The dynamic structure of HTML documents also makes HTML unsuitable for archiving purposes, because the entire document must be archived, including all linked images, subdocuments, and cross-references. An archived HTML document should not contain any active links to non-archived parts of the document, since it is impossible to ensure that such external parts will still be available when the document is reproduced later.
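The check for links to non-archived parts can be partly automated. The following is a minimal sketch in Python, using only the standard library, that lists every `href` or `src` value pointing to a network location. The names `ReferenceCollector` and `external_references` and the sample page are assumptions made for illustration; relative references such as `logo.png` are not reported, but they still have to be archived together with the document.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse

# Attributes that can pull in or point to content stored outside the document.
REFERENCE_ATTRS = {"href", "src"}


class ReferenceCollector(HTMLParser):
    """Collects every href/src value found in an HTML document."""

    def __init__(self):
        super().__init__()
        self.references = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in REFERENCE_ATTRS and value:
                self.references.append((tag, value))


def external_references(html_text):
    """Return (tag, url) pairs whose URL names a network location."""
    collector = ReferenceCollector()
    collector.feed(html_text)
    collector.close()
    return [(tag, url) for tag, url in collector.references
            if urlparse(url).netloc]


# Hypothetical sample page with local, external, and in-page references.
sample = """
<html><body>
  <img src="logo.png">
  <img src="http://example.org/chart.png">
  <a href="http://example.org/terms.html">Terms</a>
  <a href="#section2">Section 2</a>
</body></html>
"""

for tag, url in external_references(sample):
    print(tag, url)
```

Each reported reference would have to be either archived along with the document or removed before archiving.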
XML
To overcome the restricted functionality of HTML, the W3C developed XML (Extensible Markup Language) as a subset of SGML. XML allows users to utilise the advantages of the SGML language while reducing its complexity at the same time.
The following are the most important features of XML:
- In contrast to HTML, it is possible in XML to define new tags and attributes. As a result, it is possible to modify the syntax and semantics of the description elements.
- As with HTML, links can be integrated into the document structure. It is therefore easy to reference existing documents and embed images in documents, for example.
- XML can be displayed directly in newer web browser versions. To display a document, a separate definition of its layout is required using the XSL (Extensible Stylesheet Language) description language.
XML can be used as a format for the long-term archiving of electronic documents. When archiving a document, though, it is absolutely necessary to archive the semantic specification (DTD or Document Type Definition) and possibly even the description of the layout information in XSL.
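To illustrate how self-defined tags and attributes make the contents of a document electronically processable, the following minimal sketch parses a small XML document with Python's standard `xml.etree.ElementTree` module. The invoice vocabulary (`invoice`, `customer`, `item`, `quantity`, `unitPrice`) and the helper `invoice_total` are invented for this example and do not come from any standard. Note that `ElementTree` does not validate against a DTD, so the semantic specification still has to be archived and checked separately.

```python
import xml.etree.ElementTree as ET

# Hypothetical invoice vocabulary; all element and attribute names are made up.
document = """<invoice number="2024-0017">
  <customer>Example GmbH</customer>
  <item quantity="2" unitPrice="19.50">Archive box</item>
  <item quantity="1" unitPrice="5.00">Label set</item>
</invoice>"""

root = ET.fromstring(document)


def invoice_total(invoice):
    """Sum quantity * unitPrice over all item elements of the invoice."""
    return sum(
        int(item.get("quantity")) * float(item.get("unitPrice"))
        for item in invoice.findall("item")
    )


print("Invoice:", root.get("number"))
print("Customer:", root.findtext("customer"))
print("Total:", invoice_total(root))
```

Because the meaning of `quantity` or `unitPrice` is not evident from the markup alone, a later reader needs the archived specification to interpret such a document unambiguously.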
PDF
PDF (Portable Document Format) is a document format in which the most important layout information of an electronic document is stored together with the structural information.
PDF was developed by Adobe based on the PostScript data format.
The appearance of a PDF document is described by a stream of data containing a series of graphical objects. A document is fully specified by this description. The decision of how the document should appear when displayed is made in this case when the document is created, and the resulting appearance is then permanent. Documents in the PDF format usually require significantly less storage space than documents stored as an image (i.e. represented by pixels).
The goal when using PDF is to maintain the appearance of an electronic document regardless of the application software, the hardware platform, or the operating system used to create the document. The PDF format is therefore primarily suitable for archiving documents whose appearance is intended to match their appearance on paper or that have the character of letters and business documents.
PDF/A is a version of PDF designed especially to meet the requirements of long-term archiving and was standardised in ISO 19005-1:2005. PDF/A (the "A" stands for archiving in this case) defines a stable subset of PDF that can be used to describe the documents to be archived so that the file itself contains all necessary information in an unambiguous, accessible, and usable form.
PDF/A can be used as a format for the long-term archiving of electronic documents. When used, the documents must be examined to ensure they conform to the PDF/A specification.
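A first, very coarse step of such an examination is to read the PDF/A level that a file declares about itself in its XMP metadata (the `pdfaid:part` and `pdfaid:conformance` properties). The sketch below is a heuristic only: it assumes the metadata is stored uncompressed and UTF-8 encoded, the function name `declared_pdfa_level` and the sample bytes are made up for illustration, and a declaration does not prove conformance. Actual conformance has to be checked with a dedicated PDF/A validator.

```python
import re

# The identification may appear as XML elements or as attributes.
PART_PATTERN = re.compile(
    rb"<pdfaid:part>\s*(\d)\s*</pdfaid:part>"
    rb"|pdfaid:part\s*=\s*[\"'](\d)[\"']"
)
CONFORMANCE_PATTERN = re.compile(
    rb"<pdfaid:conformance>\s*([A-Za-z])\s*</pdfaid:conformance>"
    rb"|pdfaid:conformance\s*=\s*[\"']([A-Za-z])[\"']"
)


def declared_pdfa_level(data):
    """Return the declared level, e.g. '1B', or None if nothing is declared."""
    part = PART_PATTERN.search(data)
    if part is None:
        return None
    level = (part.group(1) or part.group(2)).decode("ascii")
    conformance = CONFORMANCE_PATTERN.search(data)
    if conformance is not None:
        letter = conformance.group(1) or conformance.group(2)
        level += letter.decode("ascii").upper()
    return level


# Hypothetical fragment standing in for the bytes of a real file.
sample = (
    b"%PDF-1.4\n"
    b"<rdf:Description rdf:about='' "
    b"xmlns:pdfaid='http://www.aiim.org/pdfa/ns/id/'>"
    b"<pdfaid:part>1</pdfaid:part>"
    b"<pdfaid:conformance>B</pdfaid:conformance>"
    b"</rdf:Description>\n"
)

print(declared_pdfa_level(sample))
```

Files for which the function returns `None` make no PDF/A claim at all and can be sorted out before the more expensive validation step.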
B. Image formats
TIFF
The TIFF format (Tagged Image File Format) is used to store raster images (bitmaps). A TIFF file consists of a file header and the image information. The header contains tags that specify the properties of the image recorded, for example its resolution or the compression method used.
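The header and tag structure can be read with a few lines of code. The following sketch, using only Python's standard `struct` module, decodes the 8-byte file header and the entries of the first tag directory. It is deliberately limited: only single SHORT and LONG values are decoded, only a handful of tag numbers are named, and the sample bytes are a hand-built header without image data. The function name `read_tiff_tags` is an assumption made for illustration.

```python
import struct

# A few baseline tag numbers and compression codes from the TIFF specification.
TAG_NAMES = {256: "ImageWidth", 257: "ImageLength", 259: "Compression"}
COMPRESSION_NAMES = {1: "none", 4: "CCITT Group 4", 5: "LZW"}


def read_tiff_tags(data):
    """Decode single SHORT/LONG tag values from the first directory."""
    byte_order = data[:2]
    if byte_order == b"II":
        endian = "<"  # little-endian
    elif byte_order == b"MM":
        endian = ">"  # big-endian
    else:
        raise ValueError("not a TIFF file: unknown byte order mark")
    magic, ifd_offset = struct.unpack(endian + "HI", data[2:8])
    if magic != 42:
        raise ValueError("not a TIFF file: wrong magic number")
    (entry_count,) = struct.unpack(endian + "H", data[ifd_offset:ifd_offset + 2])
    tags = {}
    for index in range(entry_count):
        start = ifd_offset + 2 + index * 12  # each entry is 12 bytes long
        tag, field_type, count = struct.unpack(endian + "HHI", data[start:start + 8])
        if count == 1 and field_type == 3:    # SHORT
            (value,) = struct.unpack(endian + "H", data[start + 8:start + 10])
        elif count == 1 and field_type == 4:  # LONG
            (value,) = struct.unpack(endian + "I", data[start + 8:start + 12])
        else:
            continue  # other field types are skipped in this sketch
        tags[TAG_NAMES.get(tag, tag)] = value
    return tags


def short_entry(tag, value):
    """Build one 12-byte little-endian directory entry holding a SHORT."""
    return struct.pack("<HHIHH", tag, 3, 1, value, 0)


# Hand-built sample: header, three tags, end-of-directory marker, no pixels.
sample = (
    struct.pack("<2sHI", b"II", 42, 8)
    + struct.pack("<H", 3)
    + short_entry(256, 2480)
    + short_entry(257, 3508)
    + short_entry(259, 4)
    + struct.pack("<I", 0)
)

tags = read_tiff_tags(sample)
print(tags)
print("Compression:", COMPRESSION_NAMES.get(tags["Compression"], "other"))
```

Reading the compression tag in this way allows an archive to check which compression method was actually used before a file is accepted.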
The following are the most important features of TIFF:
- Image information can be stored without any loss of information in black and white as well as grey scale, but only when a colour depth of 24 bits (true colour) is selected. This is because it is only possible to reproduce all shades of grey in grey scale when using this colour depth. To record and store colour information true to the original colours, though, it is necessary to adjust the optical sensors regularly so that the colour information is not distorted due to colour shifts. This can be achieved by comparing the colours to the reference colour white, for example.
- All common graphics and presentation programs support the TIFF format. In addition, it is also supported by archive and workflow systems.
- Fax machines commonly use TIFF as their standard data format.
- The image data can be stored in compressed form. TIFF is compatible with most compression methods. Two of the most important compression methods are illustrated briefly below:
  - ITU/CCITT Group 4:
    ITU compression takes TIFF as the input format. With normal text documents, a compression factor of around 1:40 is achieved. It is thus ideally suited for black-and-white documents.
    Compression is lossless. ITU compression is a global standard in the area of archiving.
  - JBIG:
    JBIG is a lossless compression technique for black-and-white images in TIFF format. It is standardised in ISO/IEC standard 11544. Compared with ITU Group 4 compression, it is up to 70% more effective.
    JBIG is not currently as widely used as the ITU method and is not supported by every manufacturer.
In compressed form, TIFF is suitable for use as a format for the long-term archiving of images and graphical representations of documents. It is recommended to use a lossless compression method such as ITU/CCITT Group 4 in order to minimise the amount of storage space needed.
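The effect of the compression factor of around 1:40 quoted above on storage requirements can be estimated with simple arithmetic. The sketch below assumes a black-and-white scan of an A4 page at 300 dpi (2480 x 3508 pixels, 1 bit per pixel); the page size, the resolution, and the function names are assumptions chosen for illustration, and real compression results depend on the page content.

```python
def raw_bitonal_bytes(width_px, height_px):
    """Uncompressed size of a 1-bit-per-pixel image, rows padded to whole bytes."""
    bytes_per_row = (width_px + 7) // 8
    return bytes_per_row * height_px


def compressed_bytes(raw_bytes, ratio):
    """Estimated size after compression by the given factor (e.g. 40 for 1:40)."""
    return raw_bytes / ratio


# Assumed scan: A4 page at 300 dpi, black and white.
raw = raw_bitonal_bytes(2480, 3508)
estimate = compressed_bytes(raw, 40)

print(f"uncompressed: {raw} bytes")
print(f"Group 4 at about 1:40: {estimate:.0f} bytes")
```

Under these assumptions, a page of roughly 1.1 MB shrinks to about 27 KB, which is why lossless compression is recommended for large document archives.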
GIF
The GIF format (Graphics Interchange Format) is used to store bitmap images.
The following are the most important features of GIF:
- All common graphics and presentation programs support the GIF format. In addition, it is also supported by archive and workflow systems.
- Because GIF is limited to a palette of 256 colours, image information is lost when images with a greater colour depth are converted to the GIF format, but the resulting files are smaller.
- A licence was historically required to use the LZW compression of the GIF format in applications; the relevant patents have since expired.
The use of the GIF format is not recommended for long-term archiving, but GIF can be used for short- and medium-term archiving.
JPEG
JPEG was developed by the Joint Photographic Experts Group and is especially suitable for colour and grey-scale images. In this area, the JPEG compression method is just as effective as ITU Group 4 compression.
JPEG compression is controlled by a small number of parameters, which determine the compression rate. Depending on the settings, however, image data can be lost.
The following are the most important features of JPEG:
- All common graphics and presentation programs support the JPEG format.
- At some levels of compression, conversion to the JPEG format results in losses of information, and essential image information can be lost, although the file sizes are smaller in this case.
JPEG is a suitable format for the long-term archiving of images and graphical representations of documents. For audit-proof archiving, a lossless compression setting should be selected.
C. Audio and video formats
When processing audio and video data digitally, very large amounts of data can be generated quickly even for short recordings. For this reason, effective compression is especially important.
Lossless compression methods for audio and video data can currently only reach compression rates of about 2:1. It is much more common to use a method that achieves a compression rate of up to 200:1 but which does not operate without a loss of information. The (in some cases considerable) loss of information resulting from compression is typically accepted as long as it is not perceptible to the human eye or ear or is not considered troublesome.
The suitability of lossy compression methods for archiving video and audio material must be examined based on the application.
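The data volumes involved can be made concrete with a short calculation. The sketch below assumes one minute of uncompressed standard-definition video at 720 x 576 pixels, 24 bits per pixel, and 25 frames per second; these figures and the function name `raw_video_bytes` are assumptions chosen for illustration. The ratios 2:1 and 200:1 are the ones quoted above.

```python
def raw_video_bytes(width_px, height_px, bytes_per_pixel, frames_per_second, seconds):
    """Size of uncompressed video, ignoring audio and container overhead."""
    return width_px * height_px * bytes_per_pixel * frames_per_second * seconds


# Assumed recording: one minute of 720 x 576 video, 3 bytes per pixel, 25 fps.
raw = raw_video_bytes(720, 576, 3, 25, 60)
print(f"uncompressed: {raw / 1e6:.1f} MB")

for label, ratio in (("lossless, about 2:1", 2), ("lossy, up to 200:1", 200)):
    print(f"{label}: {raw / ratio / 1e6:.1f} MB")
```

Under these assumptions a single minute occupies close to 1.9 GB uncompressed, which shows why lossy methods are common and why their suitability has to be assessed for each application.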
The following presents a few typical formats:
MPEG
The Moving Picture Experts Group (MPEG) is responsible within ISO/IEC for developing global standards for the compression of digitised moving images.
There are currently three different MPEG methods available:
- MPEG1: There are three different layers of this format available. Layer 3 is known as MP3 for short and is widely used to compress audio data.
- MPEG2: This format is currently used to store video data on DVD and is accepted as a standard.
- MPEG4: This format is still in development and a standard has not been finalised yet.
ITU H.261
In 1990, the H.261 standard was adopted by the ITU for the coding of video signals. H.261 coding was developed and optimised for transmission over ISDN channels.
ITU H.263
ITU standard H.263, approved in 1995/96, is a refinement of the H.261 standard. It was originally developed for data rates up to 64 kbit/s. This restriction no longer exists today. Compared with H.261, image quality was increased significantly while compression was improved considerably at the same time.
Review questions:
- Does the selected data format allow long-term reproduction of the original version of the archive data as well as reproduction of selected characteristics of the original document medium?
- Is it possible to unambiguously interpret and to process the document structure of the selected data format electronically?
- Are the syntax and semantics of the data formats used documented for archiving?
- Is a lossless image compression method used for audit-proof archiving?