Sustainability of Digital Formats: Planning for Library of Congress Collections

Sustainability Factors

Overview of factors
In considering the suitability of particular digital formats for the purposes of preserving digital information as an authentic resource for future generations, it is useful to articulate important factors that affect choices. The seven sustainability factors listed below apply across digital formats for all categories of information. These factors influence the likely feasibility and cost of preserving the information content in the face of future change in the technological environment in which users and archiving institutions operate. They are significant whatever strategy is adopted as the basis for future preservation actions: migration to new formats, emulation of current software on future computers, or a hybrid approach.

Additional factors will come into play relating to the ability to represent significant characteristics of the content. They reflect the quality and functionality that future users will expect, and they vary by genre or form of expression. For example, significant characteristics of sound are different from those of still pictures, whether digital or not, and not all digital formats for images are appropriate for all genres of still pictures. These additional factors are discussed in the sections of this Web site devoted to particular Content Categories.

Disclosure
Disclosure refers to the degree to which complete specifications and tools for validating technical integrity exist and are accessible to those creating and sustaining digital content. Preservation of content in a given digital format over the long term is not feasible without an understanding of how the information is represented (encoded) as bits and bytes in digital files.

A spectrum of disclosure levels can be observed for digital formats. Non-proprietary, open standards are usually more fully documented and more likely to be supported by tools for validation than proprietary formats. However, what is most significant for this sustainability factor is not approval by a recognized standards body, but the existence of complete documentation, preferably subject to external expert evaluation. The existence of tools from various sources is valuable in its own right and as evidence that specifications are adequate. The existence and exploitation of underlying patents is not necessarily inconsistent with full disclosure but may inhibit the adoption of a format, as indicated below. In the future, deposit of full documentation in escrow with a trusted archive would provide some degree of disclosure to support the preservation of information in proprietary formats for which documentation is not publicly available. Availability, or deposit in escrow, of source code for associated rendering software, validation tools, and software development kits also contributes to disclosure.
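As a small illustration of why complete, public documentation matters for validation, the sketch below checks whether a byte stream begins with the 8-byte signature published in the PNG specification. PNG is used here only as an example of a fully disclosed format; the function name and sample data are hypothetical, and a real validation tool would check far more than the signature.

```python
# The 8-byte signature is taken from the published PNG specification.
PNG_SIGNATURE = bytes([137, 80, 78, 71, 13, 10, 26, 10])

def looks_like_png(data: bytes) -> bool:
    """Return True if the data begins with the published PNG signature."""
    return data[:8] == PNG_SIGNATURE

# Hypothetical sample data: only the leading bytes matter for this check.
sample_ok = PNG_SIGNATURE + b"placeholder for the rest of the file"
sample_bad = b"GIF89a" + b"placeholder for the rest of the file"

print(looks_like_png(sample_ok))
print(looks_like_png(sample_bad))
```

A check like this can be written by anyone who has access to the specification, which is the practical benefit of disclosure: validation does not depend on a single vendor's software.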

Adoption
Adoption refers to the degree to which the format is already used by the primary creators, disseminators, or users of information resources. This includes use as a master format, for delivery to end users, and as a means of interchange between systems. If a format is widely adopted, it is less likely to become obsolete rapidly, and tools for migration and emulation are more likely to emerge from industry without specific investment by archival institutions.

Evidence of wide adoption of a digital format includes bundling of tools with personal computers, native support in Web browsers or market-leading content creation tools, including those intended for professional use, and the existence of many competing products for creation, manipulation, or rendering of digital objects in the format. In some cases, the existence and exploitation of underlying patents may inhibit adoption, particularly if license terms include royalties based on content usage. A format that has been reviewed by other archival institutions and accepted as a preferred or supported archival format also provides evidence of adoption.

Transparency
Transparency refers to the degree to which the digital representation is open to direct analysis with basic tools, including human readability using a text-only editor. Digital formats in which the underlying information is represented simply and directly will be easier to migrate to new formats and more susceptible to digital archaeology; development of rendering software for new technical environments or conversion software based on the "universal virtual computer" concept proposed by Raymond Lorie will be simpler. [1]

Transparency is enhanced if textual content (including metadata embedded in files for non-text content) is encoded in standard character encodings (e.g., Unicode in the UTF-8 encoding) and stored in natural reading order. For preserving software programs, source code is much more transparent than compiled code. For non-textual information, standard or basic representations are more transparent than those optimized for more efficient processing, storage, or bandwidth. Examples of direct forms of encoding include, for raster images, an uncompressed bit-map and, for sound, pulse code modulation with linear quantization. For numeric data, standard representations exist for signed integers, decimal numbers, and binary floating point numbers of different precisions (e.g., IEEE 754-1985 and 854-1987, currently undergoing revision).
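To make "open to direct analysis with basic tools" concrete, the following minimal sketch decodes three directly encoded values using only the Python standard library: UTF-8 text, 16-bit linear PCM samples, and an IEEE 754 double. The byte values, the byte orders, and the function names are assumptions chosen for illustration; they are not drawn from any particular file format.

```python
import struct

# Hypothetical raw bytes as they might be found inside a file.
text_bytes = "Café".encode("utf-8")                       # UTF-8 text
pcm_bytes = struct.pack("<4h", 0, 1000, -1000, 32767)     # four 16-bit PCM samples, little-endian (assumed)
float_bytes = struct.pack(">d", 0.5)                      # one IEEE 754 double, big-endian (assumed)

def decode_text(data: bytes) -> str:
    """Decode UTF-8 bytes to a string."""
    return data.decode("utf-8")

def decode_pcm16(data: bytes) -> list:
    """Decode little-endian signed 16-bit linear PCM samples."""
    count = len(data) // 2
    return list(struct.unpack(f"<{count}h", data[: count * 2]))

def decode_double(data: bytes) -> float:
    """Decode one big-endian IEEE 754 double-precision value."""
    return struct.unpack(">d", data)[0]

print(decode_text(text_bytes))
print(decode_pcm16(pcm_bytes))
print(decode_double(float_bytes), float_bytes.hex())
```

Each decoder is a line or two of code because the representation is standard and direct. Formats optimized for processing, storage, or bandwidth generally require substantially more logic before any value can be inspected.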

Many digital formats used for disseminating content employ encryption or compression. Encryption is incompatible with transparency; compression inhibits transparency. However, for practical reasons, some digital audio, images, and video may never be stored in an uncompressed form, even when created. Archival repositories must certainly accept content compressed using publicly disclosed and widely adopted algorithms that are either lossless or have a degree of lossy compression that is acceptable to the creator, publisher, or primary user as a master version.
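The effect of compression on transparency can be shown with a short sketch. It compresses a passage of text with zlib (a publicly disclosed, lossless algorithm) and then tries to find a phrase by plain byte search, which is roughly what inspection with a basic tool amounts to. The sample text and the helper name are illustrative assumptions.

```python
import zlib

# Illustrative sample text, repeated so that compression is effective.
original = (
    "Transparency is enhanced if textual content is stored "
    "in natural reading order. " * 4
).encode("utf-8")
compressed = zlib.compress(original)

def is_directly_readable(data: bytes, phrase: bytes) -> bool:
    """Return True if the phrase can be found by a plain byte search."""
    return phrase in data

phrase = b"natural reading order"
print(is_directly_readable(original, phrase))
print(is_directly_readable(compressed, phrase))
print(zlib.decompress(compressed) == original)
```

The content is fully recoverable because the algorithm is lossless and documented, but it is no longer visible to direct inspection: recovery depends on having a correct decompressor. Encryption removes even that option for anyone without the key.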

The transparency factor relates to formats used for archival storage of content. Use of lossless compression or encryption for the express purpose of efficient and secure transmission of content objects to or from a repository is expected to be routine.

Self-documentation
Digital objects that are self-documenting are likely to be easier to sustain over the long term and less vulnerable to catastrophe than data objects that are stored separately from all the metadata needed to render the data as usable information or understand its context. A digital object that contains basic descriptive metadata (the analog to the title page of a book) and incorporates technical and administrative metadata relating to its creation and early stages of its life cycle will be easier to manage and monitor for integrity and usability and to transfer reliably from one archival system to its successor system. Such metadata will also allow scholars of the future to understand how what they observe relates to the object as seen and used in its original technical environment. The ability of a digital format to hold (in a transparent form) metadata beyond that needed for basic rendering of the content in today's technical environment is an advantage for purposes of preservation.

The value of richer capabilities for embedding metadata in digital formats has been recognized in the communities that create and exchange digital content. This is reflected in capabilities built into newer formats and standards (e.g., TIFF/EP, JPEG2000, and the Extensible Metadata Platform [XMP], used in PDF) and also in the emergence of metadata standards and practices to support exchange of digital content in industries such as publishing, news, and entertainment. Archival institutions should take advantage of, and encourage, these developments. The Library of Congress will benefit if the digital object files it receives include metadata that identifies and describes the content, documents the creation of the digital object, and provides technical details to support rendering in future technical environments. For operational efficiency of a repository system used to manage and sustain digital content, some of the metadata elements are likely to be extracted into a separate metadata store. Some elements will also be extracted for use in the Library's catalog and other systems designed to help users find relevant resources.

Many of the metadata elements that will be required to sustain digital objects in the face of technological change are not typically recorded in library catalogs or records intended to support discovery. The OAIS Reference Model for an Open Archival Information System recognizes the need for supporting information (metadata) in several categories: representation (to allow the data to be rendered and used as information); reference (to identify and describe the content); context (for example, to document the purpose for the content's creation); fixity (to permit checks on the integrity of the content data); and provenance (to document the chain of custody and any changes since the content was originally created). Digital formats in which such metadata can be embedded in a transparent form without affecting the content are likely to be superior for preservation purposes. Such formats will also allow metadata significant to preservation to be recorded at the most appropriate point, usually as early as possible in the content object's life cycle. For example, identifying that a digital photograph has been converted from the RGB colorspace, output by most cameras, to CMYK, the colorspace used by most printing processes, is most appropriately recorded automatically by the software application used for the transformation. By encouraging use of digital formats that are designed to hold relevant metadata, it is more likely that this information will be available to the Library of Congress when needed.
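The five OAIS categories named above can be pictured as a simple record that travels with the content. The sketch below is one hypothetical layout, not a schema defined by OAIS or used by the Library of Congress; the field names, sample values, and the choice of SHA-256 for fixity are all assumptions made for illustration.

```python
import hashlib
from dataclasses import dataclass, field

@dataclass
class PreservationMetadata:
    """Hypothetical record with one field per OAIS information category."""
    representation: str          # how to render the data
    reference: str               # identifies and describes the content
    context: str                 # why the content was created
    fixity_sha256: str           # checksum used for integrity checks
    provenance: list = field(default_factory=list)  # custody and change history

def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 digest of the data as a hex string."""
    return hashlib.sha256(data).hexdigest()

def verify_fixity(data: bytes, record: PreservationMetadata) -> bool:
    """Return True if the data still matches the recorded checksum."""
    return sha256_hex(data) == record.fixity_sha256

# Illustrative content and metadata values.
content = b"example image bytes"
record = PreservationMetadata(
    representation="TIFF, uncompressed",
    reference="Photograph, item 0001 (hypothetical identifier)",
    context="Created for a print publication",
    fixity_sha256=sha256_hex(content),
    provenance=["Converted from RGB to CMYK by the editing application"],
)

print(verify_fixity(content, record))
print(verify_fixity(content + b"x", record))
```

The provenance entry mirrors the colorspace example in the text: the software performing the transformation is the component best placed to record it, at the moment the change is made.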

External dependencies
External dependencies refers to the degree to which a particular format depends on particular hardware, operating system, or software for rendering or use and the predicted complexity of dealing with those dependencies in future technical environments. Some forms of interactive digital content, although not tied to particular physical media, are designed for use with specific hardware, such as a microphone or a joystick. Scientific datasets built from sensor data may be useless without specialized software for analysis and visualization, software that may itself be very difficult to sustain, even with source code available.

This factor is primarily relevant for categories of digital content beyond those considered in more detail in this document, for which static media-independent formats exist. It is, however, worth including here, since dynamic content is likely to become commonplace as part of electronic publications. Sustaining dynamic content with such dependencies is more difficult than sustaining static content and will therefore be much more costly.

Impact of patents
Patents related to a digital format may inhibit the ability of archival institutions to sustain content in that format. Although the costs for licenses to decode current formats are often low or nil, the existence of patents may slow the development of open source encoders and decoders, and prices for commercial software for transcoding content in obsolescent formats may incorporate high license fees. When license terms include royalties based on use (e.g., a royalty fee when a file is encoded or each time it is used), costs could be high and unpredictable. It is not the existence of patents that is a potential problem, but the terms that patent-holders might choose to apply.

The core components of emerging ISO formats such as JPEG2000 and MPEG4 are associated with "pools" that offer licensing on behalf of a number of patent-holders. The license pools simplify licensing and reduce the likelihood that one patent associated with a format will be exploited more aggressively than others. However, there is a possibility that new patents will be added to a pool as the format specifications are extended, presenting the risk that the pool will continue far longer than the 20-year life of any particular patent it contains. Mitigating such risks is the fact that patents require a level of disclosure that should facilitate the development of tools once the relevant patents have expired.

The impact of patents may not be significant enough in itself to warrant treatment as an independent factor. Patents that are exploited with an eye to short-term cash flow rather than market development will be likely to inhibit adoption. Widespread adoption of a format may be a good indicator that there will be no adverse effect on the ability of archival institutions to sustain access to the content through migration, dynamic generation of service copies, or other techniques.

Technical protection mechanisms
To preserve digital content and provide service to users and designated communities decades hence, custodians must be able to replicate the content on new media, migrate and normalize it in the face of changing technology, and disseminate it to users at a resolution consistent with network bandwidth constraints. Content for which a trusted repository takes long-term responsibility must not be protected by technical mechanisms such as encryption, implemented in ways that prevent custodians from taking appropriate steps to preserve the digital content and make it accessible to future generations.

No digital format that is inextricably bound to a particular physical carrier is suitable as a format for long-term preservation; nor is an implementation of a digital format that constrains use to a particular device or prevents the establishment of backup procedures and disaster recovery operations expected of a trusted repository.

Some digital content formats have embedded capabilities to restrict use in order to protect intellectual property. Use may be limited, for example, to a time period or to a particular computer or other hardware device, or may require a password or an active network connection. In most cases, exploitation of the technical protection mechanisms is optional. Hence this factor applies to the way a format is used in business contexts for particular bodies of content rather than to the format itself.

The embedding of information into a file that does not affect the use or quality of rendering of the work will not interfere with preservation, e.g., data that identifies rights-holders or the particular issuance of a work. The latter type of data indicates that this copy of this work was produced for a specific individual or other entity, and can be used to trace the movement of this copy if it is passed to another entity.

[1] For examples of Lorie's treatment of this subject, see his "Long Term Preservation of Digital Information" in E. Fox and C. Borgman, editors, Proceedings of the First ACM/IEEE Joint Conference on Digital Libraries (JCDL'01), pages 346-352, Roanoke, VA, June 24-28, 2001, http://doi.acm.org/10.1145/379437.379726; and The UVC: a Method for Preserving Digital Documents: Proof of Concept (December 2002), http://www.kb.nl/hrd/dd/dd_onderzoek/reports/4-uvc.pdf.


Last Updated: 01/05/2017