Section 3: Storage Infrastructure
In the digital environment, how digital content is stored is paramount. As noted above, standards such as ISO 14721: Open Archival Information System (OAIS) and ISO 16363: Audit and Certification of Trustworthy Digital Repositories (TDR) provide a framework for how repositories should be structured and managed and what actions should be taken on the digital content within them. For example, best practice suggests that digital content be stored on “active” servers that are backed up and managed with preservation in mind. Storage on fixed devices, such as DVDs or external hard drives that are not monitored or backed up, does not meet the requirements of a TDR. Such devices also have high failure rates: the bits rot and the media inevitably degrades, which can mean catastrophic loss of the collections stored on them.[6]
The often-heard “storage is cheap” aphorism is true when it comes to per-bit storage costs at scale. However, the reality for many institutions, especially small and mid-sized organizations, is different. The cost of digital collections storage can be significant, especially for audiovisual collections that are terabytes or petabytes in size. For this reason, cost often plays a significant role when an institution selects the type of storage and determines how it is managed. The good news is that there are less-expensive options that still allow for an acceptable level of maintenance of digital collections. Understanding what is available, the associated costs, and the risks involved in the various options is key to choosing the best solution for an institution.
At a high level, storage options can be broken down into two categories: local or on-premise storage, and cloud or outsourced storage. Based on an institution’s requirements, technical infrastructure, and resources, one or both options may be feasible. Decisions about what type of storage works best for an institution’s needs should be influenced by such factors as:
- The level of reliability or “uptime” required. Do you need immediate access to your digital content or can there be delays of minutes or hours in retrieving it?
- The number and types of users that need access to it. Who will take responsibility for managing the digital content—digital collections managers only, the entire archives staff, or someone else? Will a version of the content also be publicly accessible?
- Types and amount of digital content. How much storage do you need? At what rate will it grow?
- Redundancy. Is an institution capable of safely managing two or more copies of its digital content locally, or must it rely on cloud storage?
Considering these issues alongside best practices such as those in the NDSA Levels of Digital Preservation (see Section 1.5), levels of effort required, and the resources in place to support them will help an institution identify the best storage options for its situation.
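The questions about how much storage is needed and how fast it will grow can be turned into a rough projection. The following is a minimal sketch, assuming steady compound growth and the same number of copies every year; the function name, the growth rate, and the default of three copies are illustrative assumptions, not figures from any standard.

```python
def project_storage_tb(current_tb: float, annual_growth_rate: float,
                       years: int, copies: int = 3) -> list[float]:
    """Project total storage needed (in TB) at the end of each year.

    Assumes compound growth at a constant annual rate and that every
    copy holds the full collection. Both are simplifying assumptions.
    """
    if current_tb < 0 or annual_growth_rate < 0 or years < 0 or copies < 1:
        raise ValueError("inputs must be non-negative and copies must be >= 1")

    projections = []
    size = current_tb
    for _ in range(years):
        size *= 1 + annual_growth_rate          # collection size after one more year
        projections.append(round(size * copies, 2))  # total across all copies
    return projections


if __name__ == "__main__":
    # Hypothetical example: 10 TB today, growing 20% per year, three copies.
    for year, total in enumerate(project_storage_tb(10, 0.20, 5), start=1):
        print(f"Year {year}: {total} TB across all copies")
```

Actual growth is rarely this smooth, so a projection like this is best treated as a starting point for conversations with IT staff or vendors.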
In the rest of this section, storage architectures, storage media, and options for storing digital collections are described in order to provide practical guidance on best practices.
3.1 Storage Architectures
Online, nearline, and offline are terms used to describe different types of storage architectures. These terms speak to the ease and immediacy with which data can be accessed as well as the varying costs and scalability of storage.
Online: In this context, online means that the data is immediately available to users on a storage system. Servers that host an institution’s networked drives are examples of online storage systems. This is the fastest, but also the most costly, of the three architectures. It is also the most common. Examples of online storage include flash and spinning disk, both described below in Section 3.2.
Nearline: In this case, digital content is available to users with some lag time, which can be a few seconds to a minute or longer. It is automated and networked in the same way that online storage is, but the media is different, typically a magnetic tape library. (Magnetic tape is described below in Section 3.2.) This tends to be an option used by larger institutions that have the resources to diversify their storage architectures.
Offline: Here, digital content is stored on a piece of media that requires a human to connect it to a computer in order to access the data on it. The most common offline media type used for digital preservation is magnetic tape. Offline storage is often used to back up digital content for long periods of time. This kind of storage architecture is cost effective, but it takes time to access because it is not connected to a network. For digital preservation, offline storage is often used for the third-copy backup, or disaster recovery copy, of digital content.
Why use one architecture over another for audiovisual content? There are a number of reasons, including the following:
- Cost. Offline storage tends to be the cheapest, but it has drawbacks in that it is not actively managed by automated digital preservation processes like fixity monitoring.
- Immediacy. Online storage has the quickest response time: content is typically available immediately when needed. However, not all content needs to be accessed immediately. Second or third copies of digital files are often stored on nearline servers or offline Linear Tape Open (LTO) tape because they will not be accessed regularly, so the higher latency is acceptable.
- Bandwidth constraints. Larger files, such as high resolution video, take time to transfer over networks. If quick access to high resolution files is required, the cloud might not be an ideal solution because it requires transfer over the internet, meaning that the available bandwidth in and out of your facility provides additional constraints.
- Scale. The scale of audiovisual collections can be massive. It may not be cost effective to store all digital content on online storage. As long as two copies are actively managed using online or nearline architectures, other copies can be stored offline (e.g., on LTO tape), which tends to be the most cost effective method of storage.
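The bandwidth constraint described above can be estimated with simple arithmetic. This is a rough sketch, assuming decimal units (1 GB = 1,000 megabytes, 8 bits per byte) and a hypothetical efficiency factor to account for protocol overhead and shared links; the function name and the default of 0.8 are assumptions for illustration only.

```python
def transfer_time_hours(size_gb: float, bandwidth_mbps: float,
                        efficiency: float = 0.8) -> float:
    """Estimate the hours needed to move size_gb over a network link.

    size_gb        -- amount of data in decimal gigabytes
    bandwidth_mbps -- nominal link speed in megabits per second
    efficiency     -- fraction of the nominal speed actually achieved
                      (an assumed value; measure your own link if possible)
    """
    if size_gb < 0 or bandwidth_mbps <= 0 or not 0 < efficiency <= 1:
        raise ValueError("check size, bandwidth, and efficiency values")

    size_megabits = size_gb * 1000 * 8                 # GB -> megabits
    seconds = size_megabits / (bandwidth_mbps * efficiency)
    return seconds / 3600


if __name__ == "__main__":
    # Hypothetical example: 1 TB of video over a 100 Mbps connection.
    print(f"{transfer_time_hours(1000, 100):.1f} hours")
```

Running the numbers for a typical batch of files before committing to cloud storage gives a sense of whether retrieval delays of this size are acceptable.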
3.2 Storage Media
There are a variety of storage media options available to digital collections managers. Some are widely accepted as preservation-appropriate, while others are recognized as problematic due to their susceptibility to failure and obsolescence.
Two predominant storage media that are considered good options for digital preservation management are referred to colloquially as “spinning disks” and “magnetic tape.” This is by no means an exhaustive list, but it provides some idea of the options available for the purposes of managing digital content.
Spinning disk storage that is part of a networked storage environment is commonly used in digital preservation environments. Spinning disk storage has quick response times and allows for active monitoring, such as fixity checking, to take place. This type of storage is often highest in cost because the media is expensive, the servers are always on and must be maintained in an environmentally controlled and secure area, and staff must be available to keep the servers up and running.
Magnetic data tape, most commonly LTO tape, is typically used either for nearline or offline storage (described in Section 3.1). Magnetic tape media is less expensive than spinning disk or other low latency storage options, and the cost of managing it over time is greatly reduced, especially for offline storage. Like other removable media, its mediated nature slows access and preservation activities such as active fixity monitoring. However, magnetic tape is much more reliable and far less prone to failure than portable drives and optical disc media such as CDs and DVDs. Magnetic data tape can be stored in what is known as a tape robot, which can provide some automation for access and preservation activities.
Media that enables digital collections managers to actively monitor the health of their collections is always the best choice when deciding on storage options. Luckily, this type of storage also tends to be the most prevalent. No matter which storage option you select, though, always back up your data at least once and ideally twice (three copies total) or more.
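Fixity checking, mentioned above as a form of active monitoring, generally means recording a checksum for each file and periodically recomputing it to detect change or loss. The following is a minimal sketch using SHA-256 from the Python standard library; the function names and the manifest structure are illustrative assumptions, and dedicated preservation tools offer more complete implementations.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1024 * 1024) -> str:
    """Compute the SHA-256 checksum of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root: Path) -> dict[str, str]:
    """Record a checksum for every file under root, keyed by relative path."""
    return {
        p.relative_to(root).as_posix(): sha256_of(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def verify_manifest(root: Path, manifest: dict[str, str]) -> dict[str, list[str]]:
    """Compare current files against a stored manifest.

    Returns the relative paths that are unchanged, changed, or missing.
    """
    report: dict[str, list[str]] = {"ok": [], "changed": [], "missing": []}
    for rel_path, expected in manifest.items():
        target = root / rel_path
        if not target.is_file():
            report["missing"].append(rel_path)
        elif sha256_of(target) == expected:
            report["ok"].append(rel_path)
        else:
            report["changed"].append(rel_path)
    return report
```

In practice the manifest would be saved somewhere separate from the content it describes, and the verification step would be run on a schedule. For offline tape, each check requires mounting the media, which is why fixity monitoring is slower there.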
3.3 Options for Storing Digital Collections
A major factor in deciding what type of storage to choose for digital content management, beyond what your institution already has in place, is cost. When determining cost, it is important to take all of the costs of managing storage into account. The total cost of ownership (TCO) considers all of the media, labor, and overhead costs that go into installation, ongoing management, and even migration from one storage option to another at some point in the future. Of all the costs, ongoing management is the highest, so institutions increasingly consider cloud storage as a way to alleviate the day-to-day costs and responsibilities of storing digital content. Whether cloud storage is the answer depends on each institution’s local organizational, resource, and technology infrastructure.
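A TCO estimate can be sketched as one-time costs plus recurring costs over the service life of the storage system. The example below is a simplified illustration, not a complete cost model: the class name, the cost categories, and all dollar figures are hypothetical, and real budgets will include items not shown here.

```python
from dataclasses import dataclass


@dataclass
class StorageCosts:
    """Hypothetical cost categories for a local storage system."""
    hardware_and_media: float   # one-time purchase
    installation: float         # one-time setup labor
    annual_staffing: float      # recurring management labor
    annual_facilities: float    # recurring power, cooling, space
    migration: float            # one-time move to the next system


def total_cost_of_ownership(costs: StorageCosts, service_years: int) -> float:
    """Sum one-time and recurring costs over the system's service life."""
    if service_years < 1:
        raise ValueError("service_years must be at least 1")
    one_time = costs.hardware_and_media + costs.installation + costs.migration
    recurring = (costs.annual_staffing + costs.annual_facilities) * service_years
    return one_time + recurring


def cost_per_tb_per_year(costs: StorageCosts, service_years: int,
                         capacity_tb: float) -> float:
    """Express TCO as an annual cost per terabyte for easier comparison."""
    if capacity_tb <= 0:
        raise ValueError("capacity_tb must be positive")
    return total_cost_of_ownership(costs, service_years) / (service_years * capacity_tb)


if __name__ == "__main__":
    # All figures below are invented for illustration.
    example = StorageCosts(20_000, 2_000, 15_000, 3_000, 5_000)
    print(total_cost_of_ownership(example, service_years=5))
    print(cost_per_tb_per_year(example, service_years=5, capacity_tb=100))
```

Expressing the result per terabyte per year makes it easier to compare a local system against a cloud provider's published rates.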
Local Storage
Storage offerings are as diverse as the institutions that they serve. They may be online only, or some combination of online, nearline, and offline. In all cases, there are associated costs to managing the servers and media on which digital content is stored. Staffing, facilities, and ongoing management of--and upgrades to--technology must all be factored into the costs of maintaining storage locally (i.e. “on premise”). Digital collections managers should develop strong relationships with IT staff who manage storage at their institution, so that they can work together to build the best local storage environment possible for the digital content they wish to preserve over time.
Cloud Storage
Cloud storage is a service model in which digital content is maintained, managed, backed up remotely, and made available to users over the internet. Examples of cloud storage include Amazon S3, Amazon Glacier, and Google Cloud Storage.
Cloud providers offer different services, features, and performance levels based on costs and the intended market. A few considerations when assessing cloud storage options are:
- Latency. How quickly does the system respond to requests for access to a digital file?
- Geographic diversity. Will your data be stored in one location or backed up to multiple locations?
- Security. What services are in place to ensure your data is safe?
- Disaster recovery. What happens if systems fail?
- Exit path policies. How difficult is it to get your data out, either in chunks or as a whole?
- Costs. What are the costs to upload data into the cloud? What are the ongoing service costs? What does it cost to download your data or exit the service entirely?
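The cost questions above (upload, ongoing service, and download or exit) can be combined into a single estimate. This is a sketch only: the function name and every rate in the example are hypothetical placeholders, not any provider's actual pricing, and it assumes the stored amount stays constant over the period.

```python
def estimate_cloud_cost(size_tb: float, months: int,
                        storage_rate_per_tb_month: float,
                        upload_rate_per_tb: float = 0.0,
                        download_rate_per_tb: float = 0.0,
                        full_downloads: int = 0) -> dict[str, float]:
    """Estimate cloud storage costs over a period, broken down by type.

    full_downloads is the number of times the whole collection is
    retrieved, for example once when exiting the service.
    """
    values = (size_tb, months, storage_rate_per_tb_month,
              upload_rate_per_tb, download_rate_per_tb, full_downloads)
    if min(values) < 0:
        raise ValueError("all inputs must be non-negative")

    upload = size_tb * upload_rate_per_tb
    storage = size_tb * storage_rate_per_tb_month * months
    download = size_tb * download_rate_per_tb * full_downloads
    return {
        "upload": upload,
        "storage": storage,
        "download": download,
        "total": upload + storage + download,
    }


if __name__ == "__main__":
    # Hypothetical rates: $5 per TB per month to store, $90 per TB to download.
    estimate = estimate_cloud_cost(50, 36, 5.0,
                                   download_rate_per_tb=90.0, full_downloads=1)
    for item, amount in estimate.items():
        print(f"{item}: ${amount:,.2f}")
```

Including at least one full download in the estimate helps surface the exit cost before a contract is signed rather than after.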
Comparison of Cloud and Local Storage
Before deciding on one solution over another, a comparison of the features of each, in relation to the need for long-term management of digital collections, should be undertaken. Some considerations include:
| | Cloud | Local |
| --- | --- | --- |
| Cost | Changes in the cost of services over time are unknown. Pricing is frequently akin to early cell phone plans; i.e., there are lots of unknowns until you’re “in.” Pay as you go: you only pay for what you use. The amount of administration required is typically unknown up front. | Most storage systems last 5–7 years. The cost of replacement must be taken into consideration. A significant portion of costs are upfront to pay for new technology, although ongoing costs for staffing and facilities are also a factor. |
| Staffing | Requires some staff to configure options, troubleshoot with technical support, and coordinate efforts. This is less of a staffing burden than local storage. | Requires dedicated staffing to manage infrastructure and users. |
| Support | Support differs depending on the service provider. Because the user bases tend to be large, generalized services such as knowledge bases or FAQ pages are available. | Support is dependent on the IT staff responsible for managing the storage environment. |
| Exit Path | Many cloud storage plans make it cheap to upload content but very expensive to download it. Different cost models exist, and it is important to consider them carefully. | There is a clear exit path that is straightforward, although it requires more logistical planning and coordination on the part of the IT staff and the digital collections manager. |
| Scalability | It is relatively easy to increase the amount of storage you need; the limit is cost. | Typically scalable but may take more staff time and financial and computing resources to grow storage capacity. |
| Forward Looking | Storage and computing in general are trending toward the cloud. | It may pay off to take a “wait and see” approach with cloud storage, so you have more time to understand the true nature of cloud storage and computing as it matures and is tested over time. |
| Sustainability | Pay-as-you-go provides for more predictable financial planning over a longer period of time but requires continual investment. If funding stops or goes away, there are few or no options for what to do with the stored content. | Ongoing funding is required, and when technology must be updated, there are short-term capital cost increases. If funding stops or goes away suddenly, the infrastructure exists to buy time while you come up with alternatives. |
Each type of storage has its own financial and organizational implications, and each institution will need to weigh the factors above to come up with a solution that best suits its needs. In some cases, it will not be an either/or decision but a solution that uses both types of storage to their best effect for the institution’s unique situation.
For example, one institution might have a mandate to maintain all collections, whether digital or physical, onsite. In this case, they may opt for a local-only storage solution. Another institution might not have the infrastructure and staff to manage collections onsite, due to costs or personnel restrictions, and may opt instead for cloud storage (from Amazon, Microsoft, Google, etc.) or even a provider like Preservica or DuraCloud that offers a set of preservation services in addition to cloud storage. And, as is more and more often the case, an organization might opt for a hybrid approach. In this case, they may choose to keep a single online copy on local storage so they have quick access to files when they need them. Secondary and tertiary copies may be stored locally on online or nearline storage or in the cloud. Often, yet another copy is stored on magnetic data tape (such as LTOs) in a different geographic location. These second and third backup copies tend to be versions of files that do not need to be accessed readily except for periodic fixity checks. This hybrid approach is an excellent way of (a) alleviating single points of technology failure by distributing content across storage solutions and (b) distributing content across geographically diverse locations.
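The guidance in this section (three copies in total, at least two of them actively managed on online or nearline storage, and copies spread across locations and storage types) can be expressed as a simple checklist. The sketch below is one hypothetical way to review a storage plan against those points; the class, field names, and warning messages are illustrative assumptions rather than part of any standard.

```python
from dataclasses import dataclass

ARCHITECTURES = {"online", "nearline", "offline"}


@dataclass(frozen=True)
class Copy:
    """One stored copy of a collection (all fields are illustrative)."""
    label: str
    architecture: str   # "online", "nearline", or "offline"
    location: str       # e.g., a city, building, or cloud region
    medium: str         # e.g., "spinning disk", "LTO tape", "cloud"


def review_copy_plan(copies: list[Copy]) -> list[str]:
    """Return warnings where a plan falls short of the guidance above."""
    for copy in copies:
        if copy.architecture not in ARCHITECTURES:
            raise ValueError(f"unknown architecture: {copy.architecture}")

    warnings = []
    if len(copies) < 3:
        warnings.append("Fewer than three copies in total.")
    managed = [c for c in copies if c.architecture in {"online", "nearline"}]
    if len(managed) < 2:
        warnings.append("Fewer than two copies on online or nearline storage.")
    if len({c.location for c in copies}) < 2:
        warnings.append("All copies are in one geographic location.")
    if len({c.medium for c in copies}) < 2:
        warnings.append("All copies rely on a single storage type.")
    return warnings


if __name__ == "__main__":
    # A hypothetical hybrid plan similar to the one described above.
    plan = [
        Copy("primary", "online", "main campus", "spinning disk"),
        Copy("secondary", "nearline", "cloud region A", "cloud"),
        Copy("tertiary", "offline", "offsite vault", "LTO tape"),
    ]
    print(review_copy_plan(plan) or "No warnings.")
```

A check like this does not replace a full assessment against the NDSA Levels of Digital Preservation, but it can flag obvious gaps early.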
3.4 Resources
“Nearline storage.” Wikipedia.
https://en.wikipedia.org/wiki/Nearline_storage
“Seagate slapped with a class action lawsuit over hard drive failure rates.” PCWorld.com, February 2, 2016.
http://www.pcworld.com/article/3028981/storage/seagate-slapped-with-a-class-action-lawsuit-over-hard-drive-failure-rates.html
“Storage.” Digital Preservation Handbook.
https://www.dpconline.org/handbook/organisational-activities/storage
“What Is Cloud Computing?” PCMag.com, May 3, 2016.
http://www.pcmag.com/article2/0,2817,2372163,00.asp
Standards
ISO 14721:2012, Space data and information transfer systems -- Open archival information system (OAIS) -- Reference model.
http://public.ccsds.org/publications/archive/650x0m2.pdf
ISO 16363:2012, Space data and information transfer systems -- Audit and certification of trustworthy digital repositories.
https://public.ccsds.org/pubs/652x0m1.pdf