Getting Started with DPN

 

By Ashley Adair, Digital Archivist

Below is an article written by Ashley Adair, a Digital Archivist at the University of Texas at Austin Libraries. Ashley writes about her experience navigating UT Libraries’ first deposit for the Digital Preservation Network – the first deposit of any member of the Texas Digital Library consortium.

In 2016, UT Libraries in Austin began preparing our inaugural deposit for the Digital Preservation Network (DPN). To set the stage for a discussion of the work that we’ve done to date on our DPN deposit, I’ll first offer some background. I’ll then describe two different sets of content that we are currently ingesting. Finally, I’ll note where we are now and touch on next steps.

Background: UT Libraries, TDL, TACC, and DPN

UT Libraries, as most people know, was a founding member of the Texas Digital Library and is its physical home. We’re also a charter member of DPN. I work as Digital Archivist in the Digital Stewardship unit of the Libraries, leading digital preservation efforts. A colleague from my unit, Benn Chang, provides technology coordination and support, and he has been working closely with TDL to test and facilitate DPN ingests through the DuraCloud console. The Texas Advanced Computing Center (TACC) figures in DPN work in two ways: it provides part of the Texas Preservation Node storage architecture, and I collaborate with TACC staff on a research data cyberinfrastructure from which we selected some legacy data to include in UT Austin’s first-year DPN deposit. We have a total allocation of 5 TB, split between UT Libraries collection material and this research data.

Step 1: Deciding Content to Preserve

UT Libraries’ content selection was guided by many of the same principles that other users of DPN have noted, such as whether the content is unique to our collection, whether it is generally regarded as being of high informational value or cultural interest, and whether it is already preserved in another long-term solution. We were also realistic that our first year of depositing in DPN might bring challenges of its own, so we took a pragmatic approach and identified content that fit these broader criteria and was already in good shape to be readied for ingest.

In the end, we selected our collection of scanned Relaciones Geográficas and our scanned University of Texas Bulletins and Publications. The Relaciones Geográficas are 16th-century manuscripts and maps ordered by King Philip II of Spain as a survey of holdings in what was then New Spain. They are routinely cited as one of the ‘gems’ of UT Libraries: the physical items are irreplaceable and hold high scholarly interest. The scans were created in recent memory to good technical specifications and came with some readily available basic metadata, so they were a good fit for this effort. The University of Texas Bulletins and Publications represent many decades of the scholarly work of the University in all subject areas. They were scanned in their entirety for inclusion in our institutional repository, again with metadata handy, making them another good candidate for our first-year deposit.

Step 2: Bagging a “Gem” of the Library

For each of these, we created packages for DPN that are enhanced compared to what we have stored on LTO tape, which is our general archival storage. We recalled all of the digital assets for the collections from tape, along with the technical metadata that we stored with them. We then augmented these with copies of descriptive metadata records where available, and with a copy of the DPN contract with UT. We knew that we would bag the content for DPN, so for each content set we devised an approach to the bag-info file (the bag’s basic, plain text metadata file) that would give us each element of a basic OAIS Archival Information Package.

Carrying out the approach on each item represented a lot of work by various technicians in our unit. They rounded up OCLC numbers or DSpace handles, as the case may be, and carefully transferred individual item titles into each item’s bag-info file. In general, they worked through the content in a very routinized, granular way, making the file into a record that makes it perfectly clear what the data in each bag is, what its context is, and to what other items it relates, without anyone from our institution needing to be there to describe or explain it. Keeping in mind that DPN is meant to outlast technological changes and staffing turnovers, our goal was truly to make simple, straightforward data packages that would be self-evident and useful on their own. Again, this was a lot of work, but it also helped push us to improve local digital preservation practices in ways that we are already implementing.
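As a rough illustration of the kind of record described above, here is a minimal Python sketch that writes a bag-info file as plain "Label: value" lines. The field labels are reserved names from the BagIt specification; the choice of which labels to use, the function name write_bag_info, and every value shown are placeholders for illustration, not the fields or identifiers actually used in our bags.

```python
from datetime import date
from pathlib import Path


def write_bag_info(bag_dir, fields):
    """Write label/value pairs to bag-info.txt, one 'Label: value' per line."""
    lines = [f"{label}: {value}" for label, value in fields.items()]
    path = Path(bag_dir) / "bag-info.txt"
    path.write_text("\n".join(lines) + "\n", encoding="utf-8")
    return path


# Hypothetical example values; none of these are real identifiers.
example_fields = {
    "Source-Organization": "Example University Libraries",
    "External-Identifier": "example-item-0001",
    "External-Description": "Item title transcribed from the catalog record",
    "Internal-Sender-Identifier": "OCLC number or DSpace handle goes here",
    "Bagging-Date": date.today().isoformat(),
}
```

Calling write_bag_info(some_bag_directory, example_fields) produces a small text file that a person can read without any special software, which is the property we cared about most.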

In terms of local recordkeeping for our deposits, we are retaining vestiges or ‘shells’ of our DPN bags in a network storage space. These are the bags minus their actual data content. They include the bag-info file and all of the bag tag files, so that we can reference them later should we need to. It is easy to make a connection between these information packages and the assets on tape, so we really haven’t seen a need to re-vault the entire augmented DPN packages to tape.
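The ‘shell’ idea can be sketched in a few lines of Python. This assumes the standard BagIt layout, in which tag files (bagit.txt, bag-info.txt, and the manifests) sit at the top level of the bag and the content lives under data/. The function name make_shell_bag is hypothetical, and this is an illustration of the concept rather than the tooling we used.

```python
import shutil
from pathlib import Path


def make_shell_bag(bag_dir, shell_dir):
    """Copy a bag's top-level tag files to shell_dir, leaving out data/."""
    bag_dir, shell_dir = Path(bag_dir), Path(shell_dir)
    shell_dir.mkdir(parents=True, exist_ok=True)
    copied = []
    for entry in sorted(bag_dir.iterdir()):
        # Only top-level files are tag files; directories such as the
        # data/ payload are skipped entirely.
        if entry.is_file():
            shutil.copy2(entry, shell_dir / entry.name)
            copied.append(entry.name)
    return copied
```

Because the manifests travel with the shell, the checksums recorded there can later be compared against the copies of the assets held elsewhere.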

Step 3: Bagging TACC Research Data

As I mentioned, TACC research data also made up a significant portion of our first DPN deposit allocation. In this case, TACC staff members identified legacy data for which they currently have stewardship as a good proof-of-concept case for DPN. I work with TACC on data curation and preservation aspects of a natural hazards engineering cyberinfrastructure called DesignSafe, which is funded by the National Science Foundation (NSF). As part of winning the grant for this project, TACC took responsibility for all of the data stored in DesignSafe’s predecessor system, the NEESHub. These are data generated in experimental facilities, such as tsunami wave basins, wind tunnels, or shaking and centrifuge machines that replicate earthquake forces. The data selected form only a part of the total NEESHub data, and were chosen for this first year based on size, to fit within the allocation.

I worked with a TACC systems administrator, who pulled copies of the data from storage at TACC and prepared bag-info metadata according to a specification that I provided, using the bag-info file in much the same way that we had with the Libraries content. We had to strategize how we would actually bag the data, since the projects are too large to fit through the ingest pipeline in one go. Someone can reuse this data only if they have all of a project’s components, so being able to piece things back together on the other end of DPN was very important. We used the bag group identifier and bag count fields to make the relationships among bags and projects clear. We found that queries of the legacy system’s database to gather comprehensive metadata for each project were largely unsuccessful, so the systems administrator also did web scrapes of each project’s information in its native NEESHub environment for good measure and included a copy in one bag for each project.
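The way group identifiers and counts tie a split project back together can be sketched as follows. Bag-Group-Identifier and Bag-Count (in the form "N of T") are reserved bag-info labels in the BagIt specification. The function name plan_bags, the simple fill-in-order grouping, and the size limit are assumptions made for illustration and do not describe the procedure actually used at TACC.

```python
def plan_bags(project_id, file_sizes, max_bytes):
    """Group a project's files into bags of at most max_bytes and label them.

    file_sizes maps file name to size in bytes. A single file larger than
    max_bytes is placed in a bag of its own, which will exceed the limit.
    """
    groups, current, current_size = [], [], 0
    for name, size in sorted(file_sizes.items()):
        # Start a new bag when adding this file would pass the limit.
        if current and current_size + size > max_bytes:
            groups.append(current)
            current, current_size = [], 0
        current.append(name)
        current_size += size
    if current:
        groups.append(current)

    total = len(groups)
    return [
        {
            "Bag-Group-Identifier": project_id,
            "Bag-Count": f"{n} of {total}",
            "files": files,
        }
        for n, files in enumerate(groups, start=1)
    ]
```

With this labeling, anyone who retrieves a bag marked "2 of 5" knows that four more bags sharing the same group identifier are needed before the project is complete.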

For recordkeeping in this instance, we had to take a different approach, because NSF projects change homes every five to ten years by design, while DPN is a much longer-term effort. We assigned an ARK to each bag in EZID and used that as the external identifier in the bag’s bag-info file. This way, when the project moves to a new home and someone else is custodian of the data, it will be straightforward to contact EZID and have the ownership of those identifiers moved over. We also created the same sort of ‘shell’ bags, with all of the metadata and none of the content, and put these, as well as documentation from the project, into the cyberinfrastructure itself so that it can all port to any eventual new home, regardless of staffing changes in the meantime.

Finally, UT Libraries agreed to handle the actual ingest for TACC as part of our role in partnering with TDL in the Texas Preservation Node. We encountered about as many issues as one would expect when trying to complete a data transfer this large over the network from TACC (located about ten miles away) to the UT campus, even though we’re part of the same institution. The issues were not insurmountable, just something to realistically anticipate. We also ended up having to repackage almost everything that we received from TACC when we discovered that the DuraCloud console calculates data sizes differently from the tools that were used to create the bags. This wasn’t insurmountable either, but it does mean that we would recommend much smaller data packages going forward, just to be safe.
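A simple pre-flight check can help avoid this kind of surprise. The sketch below measures a bag’s payload in raw bytes and leaves headroom under a limit. As an assumption only, it also shows one common way two tools can disagree about the "same" size: decimal units (1 GB = 10^9 bytes) versus binary units (1 GiB = 2^30 bytes). This is not a statement about how DuraCloud or any bagging tool computes sizes, and the function names and the 10% margin are hypothetical.

```python
from pathlib import Path


def payload_bytes(bag_dir):
    """Sum the sizes, in bytes, of all files under the bag's data/ directory."""
    data_dir = Path(bag_dir) / "data"
    return sum(p.stat().st_size for p in data_dir.rglob("*") if p.is_file())


def describe_size(n_bytes):
    """Report one byte count in both decimal (GB) and binary (GiB) units."""
    return {"bytes": n_bytes, "GB": n_bytes / 10**9, "GiB": n_bytes / 2**30}


def fits(n_bytes, limit_bytes, margin=0.10):
    """Return True if the payload stays under the limit with headroom to spare."""
    return n_bytes <= limit_bytes * (1 - margin)
```

Comparing raw byte counts, rather than rounded figures in gigabytes, sidesteps any difference in unit conventions, and the margin leaves room for whatever overhead a given tool adds.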

Step 4: Thinking Ahead

So, where are we now? Nearly all of our packages, both for the Libraries and TACC, are ready to go in a staging area for ingest into DPN. To date, we have ingested about 60% of the 2016 allocation. There have been some technical delays in the process, but we’re moving along. We’re also now looking carefully at what the metadata for the ingests looks like on the other side, in terms of what we get back from DuraCloud and the bridge, and deciding what we want to capture from there, and where and how. Our goals here are to augment local recordkeeping about content that has gone into DPN and to devise a sort of deposit ‘receipt’ for partners for whom we facilitate ingests.

Finally, we’re thinking ahead to formalizing and enhancing, as an organization, our content selection process now that we have a year of pragmatically focused work to learn from. Depositing content in DPN ensures that students and faculty of the University of Texas at Austin, as well as researchers from the wider world, will have access to digital assets reflecting our world-class collections and scholarly output in the long term, guarded against potential technical, economic or natural difficulties in our local context. Enhancing the processes by which we prioritize and prepare content for DPN will help us make the best use of this opportunity.

Through Texas Digital Library’s membership in the Digital Preservation Network, DPN storage is available to members for a one-time fee of $2,750 per TB, which covers preservation services for 20 years. Learn more about TDL’s partnership with DPN and other TDL digital preservation services at tdl.org.

Ashley encourages fellow TDL members (current or future) to email her with questions about how her team successfully ingested our first DPN deposit. You can reach Ashley at a.adair@austin.utexas.edu.
