Following is an interview with Howard Besser, professor of Cinema Studies and director of New York University's Moving Image Archiving & Preservation Program, as well as senior scientist for Digital Library Initiatives for NYU's Library.
Should you care about what Howard Besser has to say about digital preservation because he is a prolific author, honored professor and scholar, groundbreaking researcher, in-demand speaker or often-quoted visionary? Or because his encyclopedic knowledge and expertise range from film and art to multimedia and museums, from databases to practical applications in distance learning? Or because of his decades of involvement with information technology and metadata standards?
Maybe. Or maybe the reason you should care about what Howard has to say is because of his collection of over 2,000 T-shirts.
After all, he needs to keep track of all those T-shirts. So far about half of them have been cataloged: 533 are in an online searchable database and another 600 are imaged and cataloged in another database. At some point soon Howard will merge the two databases.
"It has been an interesting issue in preservation," Howard says, "keeping the access to this database running over a 12-year period. Due to both my own job changes and server replacements, the database has moved to new machines four or five times. Each new machine has required tweaking of the database software and/or file names, and changes in HTML have also necessitated changes. The original home-grown database … turned out to be unsustainable over server changes."
So the T-shirts themselves are stored safely but over the years the information about them has required vigilant custodianship, changes in digital preservation strategy and upgrades to both hardware and software. Adhering to standards should be an important consideration for a digital library professional of Howard's caliber; after all, he has participated in the development of such international standards as Dublin Core and the Metadata Encoding and Transmission Standard (METS). "The database was first put up on the Web in 1995," Howard says, "when Web browsers were very new and before there was a Dublin Core, so the metadata does not adhere to Dublin Core."
He says one of the daunting issues with digital preservation is that you need an entire complex infrastructure to display the contents of a digital file: the "right" player, driver, interface, operating system and software (and version). That infrastructure is fragile and continually changing. If even a single element within it breaks or becomes obsolete (which inevitably will happen within a decade or two), the file's contents can become unviewable. Today we might not be able to read the contents of a 10-year-old floppy disk or disk drive or decode a mid-1990s version of Microsoft Word.
By contrast, we can still read something as tangible and simple as a papyrus scroll thousands of years after its creation. Humankind's least sophisticated media – chiseled or painted stone – might last for millennia, and our most sophisticated media – to date, digital – might last a few decades.
However, Howard is encouraged by our progress in digital preservation over the past 10 years. He believes we have a long way to go but we have come a long way already. Progress in the field of digital preservation has accelerated significantly in just the past five years alone. And the conversation among digital-preservation professionals has become sophisticated and productive. Howard says that increasingly he's talking to people who know quite a bit about digital preservation; as recently as five years ago he had to explain what it is. But he points out that we still have to determine and agree upon appropriate practices and standards. This work is continuing.
He cautions against carrying old physical archiving paradigms into the realm of digital archiving. We should not be lulled by the concept of "permanent" persistently usable digital objects or think that any digital object would maintain its exact form even 100 years from now, because components of the digital-preservation infrastructure constantly change. "You cannot just lock it away and it'll be OK," he says. "Digital [archiving] requires ongoing management." But he says that you can assure some level of longevity.
Yet it is also possible that we may be taking some faulty approaches in digital preservation, ones that we may not discover for years. And it may be inevitable. Howard says that in the art-preservation world, for example, "Certain techniques that we thought were good we later found out are bad, such as removing varnish from a painting. We need to revise our approaches periodically, since we can only assume we're doing OK."
The concept of digital preservation gets tricky and often divorced from the realm of preserving physical objects. Of course, as with archiving a physical object, you need to maintain a digital object's authenticity, accuracy and reliability. But that does not mean you never change the actual digital object. Howard believes that, in addition to changes we might make in migrating a file so it is viewable with contemporary software, as a safety valve we need to try to keep another copy of the object in the original bit order – preserve the bits and bytes of a JPEG image, for example. In any case, the most important data is the metadata that gets wrapped around the object.
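One common way to confirm that a retained copy is still in its original bit order is to record a checksum when the object is archived and recompute it later. The following is only a minimal sketch of that idea, not a description of Howard's own workflow; the function names and the placeholder bytes standing in for an image are assumptions made for illustration.

```python
import hashlib


def fingerprint(data: bytes) -> str:
    """Return a SHA-256 digest that changes if any bit of the data changes."""
    return hashlib.sha256(data).hexdigest()


def is_intact(data: bytes, recorded_digest: str) -> bool:
    """Compare the current bits against the digest recorded at archiving time."""
    return fingerprint(data) == recorded_digest


# Placeholder bytes standing in for an archived image (hypothetical content).
original_bits = bytes([0xFF, 0xD8, 0xFF, 0xE0]) + b"placeholder image data"
recorded = fingerprint(original_bits)  # stored alongside the object

# Years later: an untouched copy passes, a copy with one flipped bit fails.
damaged = bytearray(original_bits)
damaged[10] ^= 0x01
untouched_ok = is_intact(original_bits, recorded)
damaged_ok = is_intact(bytes(damaged), recorded)
print(untouched_ok, damaged_ok)
```

A check like this only detects change; it does not repair anything, so it is useful mainly when more than one copy is kept.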
If, for example, the body of a JPEG is missing a few bits, then the image may not render completely but you can conceivably still display a satisfactory chunk of it. Think of "Venus de Milo" and how its missing arms do not take away from the value of the rest of the statue. Howard points out that if a digital file's header data is missing or corrupted or if the JPEG's database record is missing or misnamed, or if a database index is corrupted, then the object is essentially lost. "Even a fraction of a percent [of bit loss] can affect everything," he says. "If that bit corrupts just one character in a text, it may be OK. But if it's in the header, that throws off the header. And then it is unacceptable."
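The difference between damage in the content and damage in the header can be illustrated with a toy example. The sketch below does not use the JPEG format; it assumes a made-up container (a hypothetical four-byte signature `TOY1`, a length field, then text) purely to show how the same single-bit error has very different consequences depending on where it lands.

```python
import struct

MAGIC = b"TOY1"  # hypothetical signature for this toy format, not a real standard


def pack(text: str) -> bytes:
    """Wrap text in a tiny header: signature plus a 4-byte body length."""
    body = text.encode("ascii")
    return MAGIC + struct.pack(">I", len(body)) + body


def unpack(blob: bytes) -> str:
    """Read the text back, refusing files whose header cannot be trusted."""
    if blob[:4] != MAGIC:
        raise ValueError("unrecognized header")
    (length,) = struct.unpack(">I", blob[4:8])
    body = blob[8:]
    if length != len(body):
        raise ValueError("header length does not match body")
    return body.decode("ascii", errors="replace")


def flip_bit(blob: bytes, byte_index: int, bit: int = 0) -> bytes:
    """Return a copy of blob with one bit inverted."""
    damaged = bytearray(blob)
    damaged[byte_index] ^= 1 << bit
    return bytes(damaged)


original = pack("T-shirt 0533: tie-dye, 1995")
body_damage = flip_bit(original, 8)    # first byte of the content
header_damage = flip_bit(original, 0)  # first byte of the signature

print(unpack(body_damage))  # one wrong character: U-shirt 0533: tie-dye, 1995
try:
    unpack(header_damage)
except ValueError as err:
    print("lost:", err)  # the whole object is rejected
```

In the first case the reader gets a nearly complete record; in the second the software cannot even tell what kind of file it is holding.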
Howard refers to the recent findings of David Rosenthal, which observe that we cannot have 100 percent preservation. (Rosenthal is a co-founder of LOCKSS, a partner in the National Digital Information Infrastructure and Preservation Program.) Howard says this implies that we need to think about where loss may occur and prioritize what we can save. Certain areas within digital files – like the header data – are a higher priority than others – like the content. He advises that we accept a certain amount of loss in content and focus on preserving the indexing information.
Wrappers need more research and standardization to ensure that future digital objects are able to self-identify and thus be renderable and accessible. But Howard gives assurance that "we are agreeing on standards on how to wrap a file." For example, he is involved in a standards-setting project with the Library of Congress's Motion Picture, Broadcasting and Recorded Sound Division and other participants to wrap several layers of metadata, one within the other: MXF (used in digital distribution to move files from one TV station to another), AAF (used in digital video editing) and METS.
Archiving More than the End Product
Another of Howard's concerns is the focus on the end artistic product and the potential loss of all the contributing material that went into it. He speaks of the need for libraries to archive more raw materials, such as movie outtakes and versions and drafts of literary works. He says, "It is the raw material upon which we build further research, which ties in with social information." It gives the art or literary work social and historic context. And researchers might want to look at earlier drafts of the artist's works to see how they change over time.
What about correspondence? In traditional archives and in library special collections, the correspondence of, say, a literary figure would be guarded and preserved. But how about blogs, wikis and e-mail? "Who is preserving those?" Howard asks. "No one. E-mail is private communication. Where do they go? If a library got them, what would they do with them? Historically, analog production of these are a significant part of the archives, yet we don't have a plan on what to do with these when they originate in digital form."
Howard says it's also easy to imagine that "someone dies and leaves their hard disk to their relatives, who in turn donates the hard disk to the Library of Congress years later when it may not be readable, when it's too late." So he makes a case for a more proactive approach in capturing cultural digital data from their creators. "A repository has to be active while folks are still alive and their cultural treasures are still fresh. If we wait it may be too late…. Librarians have to be aggressive about seeking it out from people while they're still alive."
The 1996 report to the Commission on Preservation and Access that Howard participated in outlined three approaches for keeping digital information alive over time. These approaches are now a standard part of the digital-preservation community's dialog about digital archiving:
- Refreshing: periodically moving a file from one physical storage medium to another to avoid the physical decay or the obsolescence of that medium.
- Migration: periodically moving files from one file-encoding format to another that is usable in a more modern computing environment.
- Emulation: software that mimics every type of application that has ever been written for every type of file format and makes them run on whatever the current computing environment is.
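The first two approaches can be contrasted in a few lines of code. This is a rough sketch under stated assumptions rather than anything taken from the 1996 report: the directory names stand in for old and new storage media, the Latin-1 to UTF-8 conversion stands in for a change of file-encoding format, and all function names are hypothetical. Emulation is left out because it cannot be shown meaningfully at this scale.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Digest of a file's bytes, used to confirm a copy is bit-identical."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def refresh(src: Path, new_medium: Path) -> Path:
    """Refreshing: move the same bits to a new storage location."""
    dest = new_medium / src.name
    shutil.copyfile(src, dest)
    if sha256_of(dest) != sha256_of(src):
        raise IOError("refreshed copy does not match the source")
    return dest


def migrate_latin1_to_utf8(src: Path, dest: Path) -> Path:
    """Migration: re-encode the content in a more current format."""
    text = src.read_bytes().decode("latin-1")
    dest.write_bytes(text.encode("utf-8"))
    return dest


with tempfile.TemporaryDirectory() as tmp:
    old_medium = Path(tmp) / "old_disk"  # stand-in for an aging medium
    new_medium = Path(tmp) / "new_disk"  # stand-in for its replacement
    old_medium.mkdir()
    new_medium.mkdir()

    record = old_medium / "shirt.txt"
    record.write_bytes("Café T-shirt, 1995".encode("latin-1"))

    refreshed = refresh(record, new_medium)
    same_bits = refreshed.read_bytes() == record.read_bytes()

    migrated = migrate_latin1_to_utf8(refreshed, new_medium / "shirt.utf8.txt")
    bits_changed = migrated.read_bytes() != refreshed.read_bytes()
    same_text = migrated.read_text(encoding="utf-8") == "Café T-shirt, 1995"

print(same_bits, bits_changed, same_text)
```

Refreshing leaves the bits untouched and changes only where they live; migration deliberately changes the bits while trying to keep the content the same.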
Howard began the T-shirt database around that same time too, in the mid-'90s, with the help of his students. "I strongly believe that the T-shirt database was an important learning tool for my students," he says. "Not only did they learn important issues of scanning and file naming, but they also had extensive experience in creating a data dictionary with controlled vocabulary from scratch."
He has since run into former students who are carrying forward the digital preservation mission. Howard says with pride, "They are employed by Portico, JSTOR, the California Digital Library and the Getty Conservation Institute. Other students became some of the first metadata librarians in their institutions because their cataloging experience was much more versatile than merely following AACR2 rules." Thus a somewhat personal project became a real-world model for digital preservation and an inspiration for its creators.
So he will continue to maintain the longevity of the digital data about his collection according to the evolving best practices. And if it's done right, researchers may be asking in 2107, "But why did Howard Besser have 2,000 T-shirts?"