Digital preservation is broadly defined as actions taken to ensure the longevity of information created in or converted to digital formats. Preserving digital information requires continuing investment in and professional management of reliable infrastructure and application of effective preservation strategies for the products of obsolete or even very new technologies.
Corruption and Loss of Digital Information
Physical Degradation of Storage Media
Most creators of digital information store their work on the hard drives of personal computers or in servers, but often they then move the products to storage media such as compact discs (CDs), DVDs, or data tapes. The many varieties of each of these media employ different methods of affixing data to their recording surfaces, but all of these media are subject to physical decay or degradation. Data errors or media failures are often caused by improper physical handling such as scratches on the recording surface, bent or warped base layers caused by excessive humidity or temperature variation, or use of adhesive labels or hard-tip pens. Data errors are also commonly encountered during the data-recording process, and error-correcting code is used to identify errors or bad sectors in the recording surface and to relocate data bits to healthy parts of the medium.
Redundant storage of data in several physical locations or in offline hard media such as discs and tapes is generally recommended to mitigate the effects of natural disasters or media failures. If hard media is used for archival storage, it should be carefully labeled using Sharpie® CD/DVD markers, tested periodically, and copied to new media on a regular schedule.
Obsolescence of Storage Media
Manufacturers of storage media are constantly seeking to increase the storage capacity and reduce the size of their products. The size of the media sometimes constrains efforts to increase capacity, so new data-recording technologies are developed to enable higher storage density in the same size or even smaller carriers. Efforts to increase capacity have resulted in the rapid and ongoing development of new storage media of different sizes and shapes. CDs and DVDs are available in several sizes, and some early DVD formats employed cartridges to hold the recording media.
As a result, no media reader or recording device is compatible with all the existing physical forms. One technology magazine reported in 2002 that the production life of a digital recording device is about ten years, and that manufacturers offer product support and parts for five to seven years after production of the device has been terminated, for a supported product life span of roughly fifteen years. Accelerated production of new proprietary media storage formats in the last several years has significantly reduced the supported life span of media readers and recorders.
Storage media may also become obsolete when current media readers cannot read the data because of the recording surface materials or the recording process employed. 5.25-inch floppy discs were produced in several recording densities, and incompatibility between disc drives and floppy discs was a common problem in the 1990s. More recently, Blu-ray and HD DVD discs have employed completely different recording processes from the DVD-R and DVD-RW formats. Some hardware manufacturers have licensed a variety of recording technologies to make their reading or writing devices compatible across several proprietary formats, but other devices read only one or two digital recording formats.
The rapid development of new media formats presents a challenge for digital preservation because custodians of digital materials must monitor new technology developments and move current and legacy data to modern, but mature, storage formats. If hard media is used, the custodian should wait for new technologies to demonstrate reliability and a wide installed base, rather than adopting the newest storage technology as soon as it is available.
Whether stored on a server disc array or on the hard drive of a personal computer, virtually all digital information requires software to render the information on the computer screen in a way that can be understood by users. Operating systems and software provide the instructions that tell the computer how to display or render digital information in the form of words, images, sounds, or videos. Data is often described by digital preservation specialists as “software-dependent,” meaning that the data is formatted in a way that requires a particular operating system and software package for accurate display. This combination of the data format and the software enables us to perceive and understand digital information.
Migration is the process by which files created in one software program are moved to a more recent version of the same program, moved to a completely different program, or moved to a data format standard that may be read by several programs. Total migration failure, or the inability to read a file originally produced in a different program, is becoming less frequent with textual/numeric data, but it is a more common problem with digital audio and video information.
As software developers strive to enable more efficient, flexible, and creative data processing, they create new products that may not correctly render a file created in another, similar software program, or even in an older version or release of the same program. In the early days of personal computing it was not uncommon for a word processing file to be completely unreadable by different word processing software. Many current programs can at least display information produced in other software, but the rendering is often not completely accurate; in word processing it is common for errors to be introduced when a file is opened with a program other than the one used to create it. Formatting errors and incorrect replacement of certain characters are common results of migrating text files from one software environment to another. Checksum tests can be used with textual, audio, or video files to determine whether the total number of characters or bytes has changed as a result of migration.
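As a minimal sketch of such a checksum test (in Python, with hypothetical file contents standing in for real files), a custodian can compare byte counts and digests before and after a migration to detect silent character replacement:

```python
import hashlib

def fingerprint(data: bytes) -> tuple[int, str]:
    """Return the byte count and SHA-256 digest of a file's contents."""
    return len(data), hashlib.sha256(data).hexdigest()

# Contents as saved by the source program (UTF-8, with a curly quote and a dash).
original = "It\u2019s a test \u2014 with special characters.".encode("utf-8")

# Simulated migration result: the target program re-encoded the text and
# silently replaced an unsupported character with '?'.
migrated = "It's a test ? with special characters.".encode("ascii")

size_before, digest_before = fingerprint(original)
size_after, digest_after = fingerprint(migrated)

if size_before != size_after or digest_before != digest_after:
    print("Migration changed the file: byte count",
          size_before, "->", size_after)
```

A matching digest shows only that the bytes are unchanged; detecting a *rendering* error still requires inspection, which is why byte-level tests complement rather than replace quality control.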
Textual data standards such as ASCII text and Hypertext Markup Language (HTML) are supported by many software manufacturers, but saving standard-format files in software can introduce proprietary functionalities and codes. These unique features of software products are owned by the software developer, are not shared with other software products, and are not compliant with industry, national, or international standards. They set the software product apart from those of competitors, but they may result in rendering errors or compatibility problems. As a result, software may be advertised as compliant with certain data standards, yet the resulting data may not be completely compatible with other “compliant” software. Text encoding standards such as HTML and Extensible Markup Language (XML) are maintained by the World Wide Web Consortium (W3C), whose website offers validation services that can be used to test the compliance of specific web pages.
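The W3C services check full standards conformance; as a more modest local sketch (using only Python's standard library, with hypothetical sample records), a well-formedness check for XML can catch basic structural damage before files enter an archive:

```python
import xml.etree.ElementTree as ET

def is_well_formed(xml_text: str) -> bool:
    """Check that an XML document is well-formed.

    Note: this tests syntax only, not validity against a schema,
    which is what full validation services verify.
    """
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False

good = "<record><title>Annual Report</title></record>"
bad = "<record><title>Annual Report</record>"  # mismatched closing tag

print(is_well_formed(good))  # True
print(is_well_formed(bad))   # False
```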
There are a variety of file formats for digital photographs; some are proprietary, but most digital cameras offer a choice of proprietary or standard file formats such as TIFF, JPEG, and JPEG 2000. These image file formats may employ compression algorithms that reduce file size, either by encoding redundant information more compactly or by discarding detail. Compression is generally characterized as “lossy” or “lossless”: lossy compression discards some image detail, and quality degrades further each time a lossy file is edited and re-saved, whereas lossless compression allows the original data to be recovered exactly. In general the choice of digital image file formats represents a balance between storage efficiency and the need for accurate rendering. Compressed files are often used for efficient delivery of images in websites, but lossy compression is generally not recommended for archival versions of digital image files because of the cumulative loss of image quality that occurs as the files are repeatedly edited and saved.
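The lossless case can be demonstrated with a round-trip test: compressing and then decompressing recovers every byte exactly. In this sketch, Python's zlib stands in for the lossless codecs used in formats such as TIFF or PNG, and the input is arbitrary sample bytes rather than a real image; a lossy codec such as baseline JPEG would not pass this test:

```python
import zlib

# Stand-in data for an uncompressed image file (a repeating byte pattern).
original = bytes(range(256)) * 100  # 25,600 bytes

compressed = zlib.compress(original, level=9)
restored = zlib.decompress(compressed)

# Lossless round trip: every byte of the original is recovered exactly.
assert restored == original
print(len(original), "->", len(compressed), "bytes")
```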
Digital audio and video file formats have a very high level of incompatibility at the present time. Consequently, audio and video data is often maintained by upgrading to the newest versions of the same software, rather than attempting to migrate the files to completely new software. This is an acceptable preservation strategy as long as the software manufacturer continues to support its product.
Because migration often causes some degree of corruption or loss of formatting, digital preservation becomes a more complex task. Simple “bit-stream” preservation refers to preservation of files outside of their software and operating system environments, and it relies on checksum tests to ensure that no bits are lost; however, this approach ensures neither accurate rendering of the content nor longevity, because software is required to view the files. When accurate rendering is desired, the digital preservation strategy must usually include preservation of specific software, and acceptable error rates or tolerances for accuracy should be established for that context. File format registries are now being developed to enable validation of software-dependent files and to document how those files behave during migrations.
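Bit-stream preservation in practice means recording a digest for every file when it enters the archive and re-checking those digests later to detect silent corruption. A sketch of such a fixity routine, assuming Python and hypothetical helper names (`make_manifest`, `verify`):

```python
import hashlib
from pathlib import Path

def make_manifest(folder: Path) -> dict[str, str]:
    """Record a SHA-256 digest for every file in an archival folder."""
    return {p.name: hashlib.sha256(p.read_bytes()).hexdigest()
            for p in sorted(folder.iterdir()) if p.is_file()}

def verify(folder: Path, manifest: dict[str, str]) -> list[str]:
    """Return the names of files whose current digest no longer matches."""
    current = make_manifest(folder)
    return [name for name, digest in manifest.items()
            if current.get(name) != digest]
```

A clean `verify` result confirms only that the bits are intact; it says nothing about whether software still exists to render them, which is the distinction drawn above.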
Another approach to the preservation of software-dependent content is emulation. Emulation involves writing software and/or building hardware that allows a modern computer to imitate the appearance and functionality of obsolete software. This approach has been employed for several decades (especially for obsolete hardware such as early computer terminals), but recent research suggests that emulation has not achieved completely accurate reproduction of the original appearance and functionality even of simple video games. Further research and testing of emulation solutions is in progress.
Preservation of software-dependent digital information is the most difficult challenge in digital preservation. There are currently no solutions that enable perfectly accurate reproduction of appearance and functionality, but sometimes a degree of data loss is acceptable. In small production environments when errors or formatting problems can be detected by manual inspection, it is often possible to write simple search-and-replace routines to correct the errors. As the quantity of digital information grows, however, human intervention becomes impractical, and automated error detection and correction solutions must be employed.
Metadata is defined as information created and maintained to describe the content and context of a unit of digital information or a digital object. Descriptive metadata is information that describes the author, title, and other contexts of a digital object in a way that is similar to how the library catalog describes a book. Administrative or technical metadata is like a medical chart for a patient, describing what software and hardware is necessary to view the files, what migrations of this data have occurred in the past, and what errors resulted from past migrations. Structural metadata is used to describe relationships between components of related digital information so that they may be reconstructed to assemble a complex object consisting of several digital files. For example, structural metadata is required to present or maintain the correct order of chapters and pages of a book. Preservation metadata includes elements of descriptive, technical, and structural metadata specifically needed for preservation. Several technical standards are available for metadata, including Dublin Core, the Metadata Encoding and Transmission Standard (METS), and PREMIS (PREservation Metadata: Implementation Strategies), the current standard for preservation metadata.
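As an illustration, a simplified descriptive record using Dublin Core elements might look like the following (the record contents are hypothetical, and a real METS package would wrap such a record together with technical and structural sections):

```xml
<record xmlns:dc="http://purl.org/dc/elements/1.1/">
  <dc:title>Annual Report 2006</dc:title>
  <dc:creator>Example University Archives</dc:creator>
  <dc:date>2006-12-31</dc:date>
  <dc:format>application/pdf</dc:format>
  <dc:identifier>report-2006.pdf</dc:identifier>
</record>
```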
Some metadata is created automatically by software and/or hardware when digital information is created (e.g., digital cameras often add the date of the photo to the image file). Other metadata must be created by librarians or other technical specialists upon inspection of the digital information. Sometimes metadata is created by individuals who add descriptive information when they contribute digital information to a website; this is known as “author-supplied” metadata.
The absence or loss of metadata is a serious digital preservation challenge. Well-formed and standardized metadata enables automated preservation actions, but it usually requires professional expertise applied by someone other than the creator or author. Author-supplied metadata provides some context for the digital information, but lack of standardization makes it much more difficult to conduct searching and automated preservation actions with accuracy. Digital cameras, scanners, and production software usually provide metadata attached directly to the digital files created with these products, but inconsistent support for metadata standards among manufacturers continues to impede application of automated preservation routines.
A specific concern is the loss of addressing or linkage metadata. A Uniform Resource Locator (URL) is the commonly used address for websites on the Internet, but addresses often change as servers or file systems are reconfigured. Several studies of URL persistence place the approximate life span of an address between six weeks and nine years, depending on the age of the address and its location in a file system hierarchy. (Older addresses are more stable than newer ones, and file addresses at the top of a file system tend to be more stable than those at deeper levels.) In recent years, link resolvers such as SFX have used the OpenURL framework to provide a standards-based, sustainable addressing solution for electronic resources on the web. Complex digital objects present another addressing challenge, because their component files must often be found and assembled from several servers before they can be rendered and used correctly. Metadata standards for file names and folders are needed to reconstruct the relationships between the files of complex digital objects consistently.
In general, metadata enables efficient discovery of digital information, accurate rendering, automated error detection and correction, and ultimately more reliable digital preservation. The absence or loss of metadata is a serious impediment to successful digital preservation and leads to expensive and at times unsuccessful human intervention. Metadata can be as simple as accurate labeling on storage media or backup tapes, or as complex as a qualified Dublin Core record and PREMIS information wrapped into a METS-encoded complex object, but some amount of metadata is always necessary to enable effective digital preservation.
Backups, Crawling, and Snapshots
When information technology professionals are asked about digital preservation, they often point to their server backups as their preservation solution. Backups are an important part of the professional management of any server or personal computer, and they are essential in cases of operator error, network failure, or natural disaster; but backups should not be considered a digital preservation solution because they don’t account for obsolescence of software-dependent content. In addition to backups made on a regular schedule for disaster or short-term file recovery purposes, files of enduring value should be copied to a trustworthy record-keeping system along with sufficient metadata to enable active management of the files. Standards for trustworthy record-keeping systems are currently in development.
“Crawling” refers to technologies called “crawlers” or “spiders” that search the Internet and find websites that meet certain specifications. Often these tools are used to copy important website files to a record-keeping system for archival purposes. The best example of this work is Brewster Kahle’s Internet Archive. Copying a website at a given moment produces what is called a “snapshot.”
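The link-following step at the heart of a crawler can be sketched with Python's standard library. In this simplified example the page content is hard-coded rather than fetched over the network, and a real crawler would also queue and fetch each discovered address:

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collect href targets from anchor tags, as a crawler would."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

# A page as it might be retrieved during a crawl (hard-coded here).
page = ('<html><body><a href="/about.html">About</a>'
        '<a href="reports/2006.pdf">Report</a></body></html>')

parser = LinkCollector()
parser.feed(page)
print(parser.links)  # ['/about.html', 'reports/2006.pdf']
```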
Because the data on a server or in a website changes constantly, the timing and scope of backups and snapshots is an important part of digital preservation. To ensure that a complete and useful version of a website is preserved, it is necessary to ensure that all the relevant files are saved at the best moment. The timing of a snapshot should be established in relation to the updating schedule or the frequency of content changes in that website. Because crawlers are sometimes blocked from accessing the “deep” or “hidden web,” they must often be specially programmed to access and copy files from those areas of a server. Sometimes the sheer size of digital image or video files makes automated acquisition and preservation problematic, so the scope and formats of files selected for preservation become important considerations.
The final frontier in digital preservation does not involve technology by itself, but the interaction of humans with it. Our technology is only as good as the humans who make, maintain, and operate it, and humans do indeed make errors. Quality control and quality assurance are extremely important parts of digital preservation work, even when digital preservation routines are automated. Verifying the timing and scope of snapshots and backups, the accuracy of metadata, and the physical condition of storage media are important human interventions to ensure that our technology is working properly. As the quantity of digital information increases, these human interventions are typically managed through sampling rather than comprehensive inspection.
Human factors also include attempted theft and destruction of digital information. The steady increase in viruses, worms, Trojan horses, and other damaging technologies has been well documented by the businesses dedicated to detecting and preventing such infections. New viruses are discovered weekly. Server management and data security are a complex set of professional services that require constant software and operating-system upgrades to patch insecure systems and prevent virus infections. System firewalls are used to protect data from outsiders, and data security firms actively monitor servers for unauthorized intrusions (although most system security breaches are discovered after the fact). Maintenance of virus-detection software, use of firewalls, and redundant storage are now essential parts of digital preservation work.
Finally, the most important human factor to consider may be the will to preserve. Many digital preservation specialists agree that the technologies, work flows, and data-management practices employed by creators of digital information have a significant impact on the longevity of their products. Often busy developers of digital products are not aware of the long-term significance of their products at the time they create them, and they don’t take the time to apply metadata and ensure that their products are maintained in a trustworthy record-keeping system. As a result, some archivists and digital preservation specialists have attempted to help creators with their production work flows so that sustainable production formats are selected and the content is moved to reliable infrastructure in a timely way.
Effective digital preservation requires an enduring commitment to active management, access to trustworthy infrastructure, and continuing investments in technical staff, software, and hardware. Long-term digital preservation is usually a cumulative endeavor; thus as more content and more diverse formats are acquired, the resources needed to assure the survival of digital information grow over time. In addition, continuing changes in technology will require research into and development of migration and storage solutions that will work with the technologies of the future.
Written by Robert P. Spindler