Fundamentals of AV Preservation - Chapter 4


Section 4: Active Management

Digital preservation requires active management to ensure that digital assets persist over time. Compared to analog counterparts that may need little beyond stable environmental conditions, digital content requires constant monitoring for changes and mitigation should changes arise. 

Change can occur in a variety of ways. Files may be altered intentionally, for legitimate or nefarious purposes. Natural disaster, hardware and software failure, bit rot, and human error can all affect the stability of digital content as well. A key component of active management is awareness of the threats posed by change and the use of systems that reduce the chance that change will occur.

This section describes some of the key functions needed to actively and fully manage digital content for preservation.

4.1 Redundancy and Geographic Separation

Maintaining multiple copies of digital content in different geographic locations is a fundamental practice of preservation, whether in the physical domain (more than one institution may preserve copies of the same film) or in the digital domain (the same audio file may be stored on servers in Los Angeles and New York). Redundancy also means ensuring that your digital content is stored on more than one type of media; for example, spinning disk and data tape.

In the digital realm, it is ideal to maintain three copies of your digital content, stored in different geographic locations, on different types of media, and maintained in such a way that the copies are always the same.7 This ensures that if something happens to one copy in one location or on one type of media, at least two other unchanged copies of the digital content persist. 

1. Preservation Data Backups

Most of us have accidentally deleted a file that our IT department was able to restore, either in its entirety or in an earlier state based on when they last performed a backup of the server on which the file was stored. Active data backups involve copying actively used production files, often on a daily basis, for the purpose of short-term retention while the files are in use. These backups might be saved for a week or a month, but after a period of time they are overwritten by new backups. Access copies of digital audiovisual content—the files that you use and share on a frequent basis—are typically backed up in this way. 

Preservation data backups are slightly different. Instead of files in active use, the best-quality versions of the digital files that are in a finished, inactive state (often referred to as “preservation masters”) are copied from a primary storage location to a secondary (and ideally, tertiary) storage location. All three copies of replicated data are typically composed of identical “packages” that contain the digitized preservation masters, as well as preservation metadata to help identify and use the files when, for example, the access copies are no longer viable and must be replaced. The preservation masters are not meant to be accessed in the near term, however. The length of time that these preservation packages are backed up is dependent on their retention requirements. In the case of digital audiovisual content, the retention schedule is often “for as long as possible.”
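A minimal sketch of this replication step, in Python. It copies a finished preservation package to a second storage location and verifies each copy byte-for-byte before trusting it; the directory names are hypothetical, and in practice the secondary location would be a separate device or site.

```python
import filecmp
import shutil
from pathlib import Path

def replicate_package(package: Path, secondary: Path) -> Path:
    """Copy a finished preservation package to a second storage
    location, then verify every copied file before trusting it."""
    dest = secondary / package.name
    shutil.copytree(package, dest)
    for src in package.rglob("*"):
        if src.is_file():
            copy = dest / src.relative_to(package)
            # shallow=False forces a byte-by-byte comparison,
            # not just a check of size and modification time.
            if not filecmp.cmp(src, copy, shallow=False):
                raise IOError(f"{copy} does not match the original")
    return dest
```

The same routine would be run again against the tertiary location, so that all three copies of the package remain identical.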

2. Geographic Distribution

Ensure that copies are in geographically disparate storage locations/systems to decrease the likelihood of loss due to localized disaster or service interruption. For example, if your data center is located along the coastal United States, it is best to store duplicate copies of your digital content on servers in another region where hurricanes are not a concern. Alternatively, a university may store digital content on one type of media in campus buildings, on premises, and on another type of media in a satellite location, off site, 20 or 30 miles away. It may also maintain a third copy of the digital content even further away to provide geographic diversity, or even in the cloud (see Section 3: Storage Infrastructure).

Ideally, storage devices will allow preservation packages to be actively monitored so that errors in data can be identified and repaired. How that identification and repair happens is described in the next section.

4.2 File Fixity and Data Integrity

In the context of digital preservation, fixity describes the unchanged state or “fixed-ness” of a digital file. Fixity monitoring can identify whether a file has changed for any number of reasons, such as human error, hardware failure, or bit rot, so that action can be taken to repair it.

1. Zeros and Ones

Digital files are, at their most basic level, made up of a series of zeros and ones.8 Each number is a “bit.” A file is essentially a very long list of bits stored in a particular order that differs from that of every other digital file, and this ordering enables a computer to access and play it. A small change, such as turning a single zero into a one, changes the makeup of the entire file, sometimes in catastrophic ways.

Keeping track of all of the bits is hard. Files can be thousands or millions of zeros and ones long, and it is important to ensure that files are stable, or “fixed,” so that they can be accessed. Instead of keeping track of the bits themselves, we can check the files’ fixity by watching or listening to each of them on a regular basis, but that is time consuming and unrealistic in large collections. Luckily, there is another method for monitoring fixity to ensure that if a file does change, we catch that change and can repair it.

2. Monitoring Fixity with Checksums

To make monitoring files for changes to the bits easier than keeping track of millions of zeros and ones, we use shorter, alphanumeric strings that reflect the uniqueness of every digital file. These strings are called checksums and are generated by a program that reads the zeros and ones of a file and creates a string of characters that is, for practical purposes, unique to that file. This checksum becomes the file’s signature and will remain the same as long as the bits do not change.

If a file is changed, even in seemingly insignificant ways (you may not even be able to tell just by playing a file), a completely different checksum will be produced by the checksum generator.
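Generating a checksum takes only a few lines of code. The sketch below uses Python's standard `hashlib` library to read a file's bits and produce its signature; the function name is illustrative, not part of any particular tool.

```python
import hashlib

def checksum(path: str, algorithm: str = "sha256") -> str:
    """Read a file's bits and return its checksum signature."""
    h = hashlib.new(algorithm)
    with open(path, "rb") as f:
        # Read in chunks so even very large AV files fit in memory.
        for chunk in iter(lambda: f.read(65536), b""):
            h.update(chunk)
    return h.hexdigest()
```

Running this twice on an unchanged file always returns the same signature; flipping even a single bit in the file produces a completely different one.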

 

[Image: a list of checksums, with the signatures on the left and their associated files on the right. The algorithm used to create the checksums in this example is SHA-256, although others, such as MD5 and SHA-1, are in wide use as well.]

 

Checksums are valuable for several reasons. For example, they can be used to authenticate a file. If a file is the official version of a video, it can be authenticated by first creating a checksum signature and then running the checksum program later to be sure that the signature has not changed. Or, if a file is being deposited in an archive, a checksum may be produced before deposit and then after deposit. This one-time test authenticates that the file is what the depositor understands it to be and that it was not corrupted upon ingest. 

One of the greatest values of checksums is their use in monitoring file fixity, which tests the fixedness, or stability, of the bits over time. It is important to note that while different files have different checksum signatures, exact copies of a file will have the same signature; as long as a file doesn’t change, its checksum will always match that of its identical copies. This means files can be monitored for change on an ongoing basis (e.g., once every month, six months, or a year), and when a checksum no longer matches, the corrupt file can be repaired by replacing it with an unchanged copy.

Finally, while checksums are the primary mechanism for monitoring fixity at the bit level, they can also be used to monitor file attendance: identifying whether a file is new (its checksum signature has never been produced before), removed (a checksum is missing from a list), or moved (the checksum appears with files in another location). Tracking and reporting on file attendance is a fundamental component of digital collection management and fixity.

3. The Fixity Tool

Some institutions have sophisticated systems and workflows that automate the monitoring of fixity and file attendance in their digital collections; however, many do not. A good place to start with fixity and tracking checksum signatures, if technology systems are not available to automate this process, is by maintaining an inventory in spreadsheet software that lists the files in your collections, their locations on your servers, and their associated checksums and the dates that they were produced. Over time, the inventory will help you identify changes at both the bit and file level so that you can find and repair the errors in your collections. 

There are a variety of tools for creating and verifying checksums. Some of these include:

     ExactFile (Windows, http://www.exactfile.com/). Calculates a variety of checksums.

     FastSum (Windows, http://www.fastsum.com/). Calculates MD5 checksums.

     HashMyFiles (Windows, http://www.nirsoft.net/utils/hash_my_files.html). Calculates SHA1 and MD5 checksums. 

In addition, all operating systems have built-in checksum generation and validation functionality; however, using it requires working at a command-line interface.

An institution might also consider the free and open-source tool Fixity.9 Fixity was created with smaller and/or lesser-resourced organizations in mind and is a simple application that enables automated checksum production and file attendance monitoring and reporting. Fixity scans a folder or file directory and creates a manifest of the files including their file paths and checksums, against which a regular comparative analysis can be run. Fixity monitors file integrity through generation and validation of checksums and file attendance through monitoring and reporting on new, missing, moved, and renamed files. Fixity emails a report to the user that documents flagged items along with the reason for a flag, such as that a file has been moved to a new location in the directory, has been edited, or has failed a checksum comparison for other reasons.  

4.3 Information Security

The ultimate goal of preservation is to ensure that collections remain unchanged and accessible over time. Information security protocols help to minimize accidental or nefarious changes by users and the public and to track how files are changed, who changed them, and when the alterations were made. Protocols help ensure that authenticity is maintained, and when it isn’t, that a change is documented so that an organization can act on that change by, for example, replacing corrupted files with backup copies. This section describes some of the ways that information security is used to maintain the authenticity of digital collections.

1. User Permissions

In a digital preservation environment, organizations must control which users are accessing and manipulating data. Some users may have access to view files, while others may have controls over where files reside, their formats, and who can access them. Creating, assigning, logging, and managing permissions and restrictions are critical to mitigating the risk of intentional or unintentional data corruption and misuse of content. Many preservation management systems make permissions management easy. However, when data management happens manually or outside of a management system, then setting access permissions on directories on networked servers can be a good way to create a secure space to store digital collections. IT staff can usually help implement these strategies.

Computer systems offer varying levels of permissions that enable or restrict access to digital collections. Four types of permissions are read, write, move, and delete. These functions are more or less what they sound like, although the concepts of “read” and “write” may not be entirely obvious. In this context, “read” access refers to the ability to view files in a directory without being able to edit or delete them. “Write” access means that a user can add files to a directory or edit files within it. The levels of permissions cascade, starting with the lowest level of access (none) and ending with the greatest (“delete”). This means that a user with “delete” access also has “move,” “write,” and “read” access. Conversely, a user may have only “read” access, or no access at all. If you are able to set permissions on directories that contain preservation copies of digital files, consider doing so, but with caution. Always be sure that an administrator has full access to the collection, so that if the digital content needs to be moved, migrated, or monitored for fixity, there are no restrictions on doing so.
 

Staff position             | Read | Write | Move | Delete
---------------------------|------|-------|------|-------
Digital preservation staff |  X   |   X   |  X   |   X
IT support staff           |  X   |   X   |  X   |
Other internal users       |  X   |       |      |

An example of how digital preservation access permissions can be applied to users within an organization. Permissions will differ for each organization, and it is important that staff responsible for digital preservation have a voice in how they are applied.
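On systems without a preservation management system, these levels map loosely onto filesystem permission bits. The sketch below is a POSIX-style example that makes preservation files readable by everyone but writable only by the owner; note that on POSIX systems the ability to move or delete a file is actually governed by write access on the containing directory, so real policies are set with IT staff.

```python
import os
import stat
from pathlib import Path

def make_read_only(preservation_dir: str) -> None:
    """Give the owner read/write access and everyone else
    read-only access to files in a preservation directory."""
    for path in Path(preservation_dir).rglob("*"):
        if path.is_file():
            # Mode 0o644: owner may read/write; group and
            # other users may only read.
            os.chmod(path, stat.S_IRUSR | stat.S_IWUSR |
                           stat.S_IRGRP | stat.S_IROTH)
```

An administrator account retains the ability to restore write access when files must be moved, migrated, or repaired.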

 

As with all digital preservation activities, it is important to document decisions about permissions. It is one thing to assign permissions, but having information at hand that documents who has been assigned which permissions will help identify whom to contact when diagnosing data errors. Logging access and internal actions taken on digital collections is equally important, although challenging, in an environment where logging is not an automated process. Logs, or audit trails, enable digital collections administrators to audit actions taken and attribute changes to a user and date, which can be valuable when trying to understand when and where errors have occurred. Although by no means impossible, these activities can be time consuming in a manual workflow. An example of a manually constructed audit trail might include the following notations:

User       | File         | Action                                     | Date
-----------|--------------|--------------------------------------------|-----------
Jane Smith | f10d6b9e.wav | Moved file from Directory 1 to Directory 2 | 2016-09-07


The good news is that many preservation management systems automate this work. Whatever preservation tools you have at your disposal, always remember to do the best you can do now, so that you’re ready to do more when the resources and infrastructure enable it. 
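Where no management system is available, an audit trail like the example above can be maintained with a small script. This sketch appends one row per action to a CSV log; the column names mirror the example table and are otherwise arbitrary.

```python
import csv
import os
from datetime import date

def log_action(audit_csv: str, user: str,
               filename: str, action: str) -> None:
    """Append one row to a manually maintained audit trail,
    writing a header row the first time the log is created."""
    new_file = not os.path.exists(audit_csv)
    with open(audit_csv, "a", newline="") as f:
        writer = csv.writer(f)
        if new_file:
            writer.writerow(["User", "File", "Action", "Date"])
        writer.writerow([user, filename, action,
                         date.today().isoformat()])
```

Calling `log_action` after every move, edit, or deletion builds the audit trail incrementally, with no special infrastructure beyond a shared spreadsheet file.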

2. Protecting from External Threats 

Every organization needs to be concerned not only with controlling access to digital collections from within, but also with protecting assets from external threats. This is particularly important for storage devices such as servers that are connected via networks and to the internet. Controlling access to these devices with good password and username practices is imperative. Passwords should be unique, high-strength (upper- and lowercase letters, numbers, and symbols), long (12-15 characters is recommended),10 and changed routinely.

Operating systems, when not updated routinely, can be another potential risk to a network. For example, at the time of this writing in 2017, Windows XP runs on 7% of computers across the globe, although support for it from Microsoft officially ended in 2014.11 This means that security patches, created in response to active virus threats and other nefarious software, are rarely produced. This puts these computers and the networks on which they run at risk of security breaches. Taking a proactive approach to keeping operating systems up to date decreases the chances of data breaches.

3. Virus Scanning 

Another important component of information security is monitoring digital collections for viruses and other corrupting malware. Virus scanning should be performed on files that are being brought into a digital preservation environment from external sources to avoid infecting existing files and systems. If you are performing digitization internally and have full control over the files being created, this is less of an issue. It becomes a greater concern when accepting files from external sources, such as donors or even other units within an organization. Virus scanning should be performed on all external data transfers into the environment and then routinely in the digital preservation environment as an added precautionary measure. One approach is to have a dedicated “clean” computer that is not connected to a network or the internet. Files can be virus tested on this machine before transferring them to networked storage. This may be especially useful to institutions where IT support is not readily available. 

It is a good idea to create a folder that is specifically used to store newly acquired files while preparing them for ingest into your preservation environment. The folder should not be directly connected to your preservation environment. If you cannot check the files on the media on which they are delivered, transfer new files directly into this folder and immediately perform virus checking on them. That way, if corruption is identified, the files cannot infect your existing digital collections.

There are many low-cost virus scanning software options available on the market today. If you have an IT department, talk to them about virus scanning before you purchase your own software. They might have options available that are already in use in your organization.

4.4 Resources

Fixity. AVPreserve.
https://www.avpreserve.com/tools/fixity/

Information Security. Digital Preservation Handbook.
https://www.dpconline.org/handbook/technical-solutions-and-tools/information-security

Storage. Digital Preservation Handbook.
https://www.dpconline.org/handbook/organisational-activities/storage

“What is Fixity, and When Should I be Checking It?” Checking your Digital Content: A NDSA Publication, 2014. 
http://www.digitalpreservation.gov/documents/NDSA-Fixity-Guidance-Report-final100214.pdf

 

 
