Sustainability of Digital Formats: Planning for Library of Congress Collections


Sound >> Quality and Functionality Factors


Scope
This discussion concerns media-independent sound content in two format categories. The first category consists of formats that represent recorded sound, often called waveform sound. Such formats are employed for applications like popular music recordings, recorded books, and digital oral histories. The second category consists of formats that provide data to support dynamic construction of sound through combinations of software and hardware. Such software includes sequencers and trackers that use data to control when individual sound elements start and stop, to set attributes such as volume and pitch, and to apply other effects to those elements. The sound elements may be short sections of waveform sound (sometimes called samples or loops) or data elements that characterize a sound so that a synthesizer (which may be in software or hardware) or sound generator (usually hardware) can produce the actual sound. The data are brought together when the file is played, i.e., the sounds are generated in a dynamic manner at runtime. This second category is sometimes called structured audio.

Within the structured audio format category, the focus for this Web site is on note-based formats generated by systems for music composition and also used for playback. The most prominent note-based formats are associated with MIDI, the Musical Instrument Digital Interface, although there are many devotees of formats called MODs (from modules), sometimes called tracker files. Other formats contain data used by synthesizers that simulate the human voice, e.g., for telephone company directory services. Since voice-synthesis content is less likely to be added to the collections of the Library of Congress, this topic is not discussed here. This document also omits discussion of compound documents that combine sound, images, and other forms of expression, as well as multi-track recordings in which individual tracks are managed as separate bitstreams (files). This latter subject is part of the topic associated with what this Web site calls bundling or packaging formats, e.g., AES-31 and METS.

Normal rendering for sound
Normal rendering for sound allows playback in mono or stereo through one or two speakers (or equivalent headphones) or, in the case of surround sound, through multiple loudspeakers carefully arranged in a room (or through specialized headphones). Playback software provides user control over volume, tone, balance, etc., and the means to fast forward and to find a specific location (typically a time offset), track, or other segment, e.g., a chapter in a recorded book. Normal rendering (application software permitting) also includes playback through software that offers the analysis of sound and excerpting. Most personal computers have sound cards that can support normal rendering of note-based formats; more elaborate, specialized devices simulate instruments or produce invented sounds in a richer manner.

Normal rendering must not be limited to specific hardware models or devices and must be feasible for current users and future users and scholars. This level of functionality is expected of any candidate digital format for preserving sound content, and is not mentioned as a factor for choosing among formats.

Fidelity (support for high audio resolution)
Fidelity is a factor associated with waveform formats and refers to the degree to which "high fidelity" content may be reproduced in a given format. In this context, the term is meant broadly, referring to characteristics that will influence a careful (even expert) listening experience. The real test of fidelity occurs when the reproduction is repurposed, e.g., when a "master file" is used as the basis for the production master of a new audio-CD music release. The outcome of such repurposing is likely to be more successful when the master has very high fidelity, particularly bearing in mind that advances in technology will likely raise user expectations.

The two characteristics most often associated with fidelity for sound waveforms represented as Linear Pulse Code Modulated (LPCM) data are sampling frequency and word length (i.e., bit depth).1  Other factors may also influence fidelity, such as the presence of distortion, watermarking, or--in lossy compressed renderings derived from LPCM files--audible artifacts that result from the application of lossy compression. In general, uncompressed or lossless-compressed data offers the highest fidelity; however, lossy compression based on understanding of human perception of sound provides a high level of fidelity in normal playback conditions. New techniques for lossy compression are being developed, tested, and standardized and specifications for what constitutes high-quality lossy compression can be expected to change over time.2
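
The arithmetic behind these two characteristics is straightforward: the uncompressed LPCM data rate is the product of sampling frequency, word length, and channel count. The short Python sketch below illustrates the calculation for audio-CD parameters and for one common higher-resolution choice; the figures are offered only as an illustration, not as recommendations of this document.

def lpcm_data_rate(sampling_hz: int, bits_per_sample: int, channels: int) -> float:
    """Uncompressed LPCM data rate in kilobits per second."""
    return sampling_hz * bits_per_sample * channels / 1000

# Audio CD: 44,100 samples/second, 16-bit words, two channels
cd_rate = lpcm_data_rate(44_100, 16, 2)        # 1,411.2 kbit/s

# A common higher-resolution choice: 96 kHz sampling, 24-bit words, two channels
hires_rate = lpcm_data_rate(96_000, 24, 2)     # 4,608.0 kbit/s

print(f"CD audio:      {cd_rate:,.1f} kbit/s")
print(f"96 kHz/24-bit: {hires_rate:,.1f} kbit/s")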

The effects of lossy compression may be reflected in other factors relating to choice of digital formats. In general, lossy compression detracts from transparency, desirable in any format chosen for long-term preservation. Also, constraints on data transfer rates may require balancing support for multi-channel audio (surround sound) with fidelity limitations in individual channels.

The treatment of spoken word content or artificially generated speech in the context of telephony (and related activities) has a special character, using coding techniques such as µ-law, LPC (linear predictive coding), and GSM (Global System for Mobile telecommunications) compression. Content from these modes of communication is unlikely to find its way into Library of Congress collections.3

Support for multiple channels
Multiple channel support refers to the degree to which formats may represent multi-channel audio, which is presented to the end user in at least two ways. The first is in terms of aural space, or what engineers call sound field, e.g., stereo or surround sound. The second form of presentation consists of two or more signal streams that provide alternate or supplemental content, e.g., narration in French and German, sound effects separate from music, "music minus one" content, or the like.

Waveform bitstreams generally encode multiple channels in interleaved or matrixed structures.4 For example, stereo or two-channel sound represented in linear PCM content typically employs an interleaved structure (alternating the information from the two channels), while surround-sound or multi-channel content employs additional data that is matrixed into two interleaved channels and decoded at playback time. (Multiplexing may also be used; this encoding can be compared to the heterodyning used in radio broadcasting.) The terminology for surround sound includes 5.1 (audio channels are sent to five directional loudspeakers at front center, front left, front right, back left, and back right, and one non-directional low-frequency loudspeaker or woofer), 7.1 (the same, with two added side loudspeakers), and so on. In many cases, the practical constraints of digital data transfer rates during real-time playback force the creators of these formats to limit the fidelity of some channels, typically channels other than front left and front right.
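
To make the interleaved structure concrete, the following minimal Python sketch separates alternating left and right samples, assuming 16-bit little-endian LPCM of the kind carried in common WAVE files; it is an illustration of the layout described above, not a reference implementation of any particular format.

import struct

def deinterleave_stereo(pcm_bytes: bytes) -> tuple[list[int], list[int]]:
    """Split interleaved 16-bit little-endian stereo LPCM into two channel lists.

    Interleaved layout: L0 R0 L1 R1 L2 R2 ... (one 16-bit word per sample).
    """
    samples = struct.unpack("<" + "h" * (len(pcm_bytes) // 2), pcm_bytes)
    left = list(samples[0::2])    # even-indexed words carry the left channel
    right = list(samples[1::2])   # odd-indexed words carry the right channel
    return left, right

# Four stereo sample frames packed in interleaved order
raw = struct.pack("<8h", 100, -100, 200, -200, 300, -300, 400, -400)
left, right = deinterleave_stereo(raw)
print(left)   # [100, 200, 300, 400]
print(right)  # [-100, -200, -300, -400]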

Note-based files formatted in the General MIDI System can be organized in as many as sixteen channels to allow separate "instruments" to play simultaneously for a polyphonic effect, which may additionally be structured to represent aural space. Some composers have even positioned loudspeakers or synthesized instruments on a stage and then played note-based files through a controller, thereby replicating the sound of an ensemble.
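
As an illustration of the channel structure, the Python sketch below assembles raw MIDI note-on, note-off, and program-change messages for two of the sixteen channels. The status-byte layout (0x90, 0x80, and 0xC0 plus the channel number) follows the MIDI specification; the particular program and note numbers are illustrative choices, and a Standard MIDI File would additionally carry delta-times and other structural data.

def note_on(channel: int, note: int, velocity: int = 96) -> bytes:
    """MIDI note-on: status byte 0x90 plus the channel number (0-15)."""
    return bytes([0x90 | channel, note, velocity])

def note_off(channel: int, note: int) -> bytes:
    """MIDI note-off: status byte 0x80 plus the channel number (0-15)."""
    return bytes([0x80 | channel, note, 0])

def program_change(channel: int, program: int) -> bytes:
    """Select an instrument ("program") for one channel."""
    return bytes([0xC0 | channel, program])

# Two independent channels sounding together for a polyphonic effect:
# channel 0 carries one voice, channel 1 another.
events = [
    program_change(0, 0),    # a piano-family program on channel 0
    program_change(1, 40),   # a string-family program on channel 1
    note_on(0, 60),          # middle C on channel 0
    note_on(1, 67),          # the G above middle C on channel 1
    note_off(0, 60),
    note_off(1, 67),
]
print(b"".join(events).hex(" "))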

Support for downloadable or user-defined sounds, samples, and patches
Support for downloadable or user-defined sounds, samples (loops), and patches applies to note-based formats, and refers to the degree to which a format permits references to or the inclusion of digital sound data and the articulation parameters needed to create one or more voices or instruments in a musical presentation.5
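
As a purely illustrative sketch of the kind of data such support implies, the Python structure below pairs a waveform sample with articulation parameters of the sort an instrument definition might carry. The field names are hypothetical and are not drawn from any particular downloadable-sounds or sound-font specification.

from dataclasses import dataclass

@dataclass
class Patch:
    """Hypothetical instrument definition pairing a sample with articulation data."""
    name: str
    sample: bytes          # waveform segment (the "sample" or loop), e.g., LPCM data
    sample_rate_hz: int    # rate at which the sample was recorded
    root_note: int         # MIDI note at which the sample plays back unshifted
    loop_start: int        # sample offsets bounding the sustained loop region
    loop_end: int
    attack_ms: float       # simple envelope parameters governing articulation
    release_ms: float

violin_like = Patch(
    name="bowed string (illustrative)",
    sample=b"\x00\x00" * 1024,   # placeholder waveform data
    sample_rate_hz=44_100,
    root_note=69,                # A above middle C (440 Hz)
    loop_start=256,
    loop_end=1000,
    attack_ms=40.0,
    release_ms=120.0,
)
print(violin_like.name, violin_like.root_note)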

Functionality beyond normal sound rendering
Note-based formats may be used in a number of interesting applications that are at the edge of normal rendering or lie beyond it. For example, specialized applications use note-based formats to produce notation on screen or on paper. Such applications may also permit file playback with selective control over the number of channels, e.g., to suppress the synthesized violin when a live musician is present, and adjustments of pitch and tempo. Specialized formats and applications feature karaoke content, in which texts are synchronized with the music. Other applications control performances or equipment, e.g., MIDI Show Control for live theater or multimedia, MIDI Machine Control for tape recorders and their digital successors, or playback that takes the role of an instrument in the midst of a group of live performers.

A certain level of navigation is expected in normal rendering, but some formats provide or link to data that supports more elaborate capabilities, e.g., playlists; texts for recorded books; or descriptive information, including names of authors, performers, and narrators; names of chapters or sections; and additional information of the types familiar to library users.

Normal rendering is provided by formats as presented to end users. Normal rendering may not be provided (or at least not conveniently provided) by waveform formats that support rich-data representations of sound, i.e., representations sometimes called masters in a preservation-oriented activity. For example, rich-data versions of a linear PCM sound item may use very high sampling frequencies and/or long word lengths, and may consist of bundled packages that contain synchronized multi-track elements. It may not be possible to "play" such a file in real time on end-user equipment over a network. Such masters are typically used for repurposing and for the production of end-user forms of the same content that contain less data but offer convenient playability. The relationship between rich-data formats and normal rendering is that the rich-data item can be used to produce an end-user copy that successfully supports normal rendering.

1 An alternate approach to LPCM has been implemented by Sony under the name DSD (Direct Stream Digital). Audio engineers also refer to this encoding as pulse-density modulation (PDM) or delta-sigma (sigma-delta) modulation. DSD is a one-bit-deep coding and, in the Sony implementation, employs a data rate of 2.8224 megabits per second. At this writing, DSD is exclusively heard on Sony SACDs (Super Audio Compact Disks), and the writers of this document are not aware of any media-independent DSD format. Comments in the engineering press vary; some state a clear preference for high-resolution LPCM sampling, while others politely say that both approaches will be acceptable to most audiophiles.

2 As of 2003, the Library of Congress Motion Picture, Broadcasting and Recorded Sound Division considers an MP3 file with compression that supports a 128 kilobits/second playback data rate per channel (256 kilobits/second for stereo) to be at the lower end of quality acceptable for published music. Lower-quality compression would be acceptable for culturally significant but home-produced sound.

3 Speech-compression formats are used for some digital recorded books and other recordings. For example, the downloadable books and radio broadcasts available from Audible, Inc. (http://www.audible.com) are offered in five formats, including formats based in speech compression and others based in the formats typically used for music, i.e., MP3. The Library's preference is likely to be for the higher-quality version (MP3-based, in the case of Audible, Inc.) rather than a lower-quality, speech-compressed version.

4 Stereo or surround sound may also be represented by multiple separate waveform files synchronized within wrapper formats like AES-31 or SMIL 2.1 (Synchronized Multimedia Integration Language, Version 2.1). These wrapped sets of files may represent the aural space of a sound field, or may simply contain alternate content, like English- and Spanish-language versions, narrative commentaries on a parallel stream, or the like. Multi-file wrapper formats are not covered in this document. The AES-31 format, apparently not widely adopted at this time, is intended for management of multi-track sound content produced in recording studios, typically with the intention of later mixing to stereo or surround.

5 Two techniques for sound generation are employed, depending on the format: direct inclusion of waveform segments and synthesis. The preferred method for synthesis of instrument sounds (in 2004) is wavetable synthesis. Formats exist for the exchange of wavetable sounds (sometimes called sound fonts) independent of music formats. An older synthesis technique is called frequency modulation synthesis (FM).



Last Updated: 01/21/2022