Born Digital Processing and Preservation Collaborative Project

By: Ima Oduok, Digital Librarian

As I started my second year of residency, I looked for a longer-term project that matched my interest in gaining hands-on experience in digital processing and preservation. It worked out that a TDL member, the University of Houston Libraries, had a backlog of born digital materials that had been donated to Special Collections. These objects had been inventoried and transferred to the network drive but had not yet been added to the preservation system, finding aids, or reading room computers.

Originally, the plan was to process two collections of born digital items held in the Special Collections backlog. Because of limits on staff time, workloads, and academic holidays, the scope was narrowed to processing one collection, focusing on ingesting it into a digital preservation system and updating the finding aid.

Accessioning and appraisal steps had been taken by UH Libraries Special Collections prior to the start of this project. Updates and additions to a draft procedure document for born digital processing are in progress and will be available later this year. 

Selecting a collection to process began with reviewing the Special Collections born digital objects inventory, saved as an Excel spreadsheet. I first went through the information in each row, comparing it to what I found saved on the UH Libraries network drive. Dedicating about 2-4 hours per week to reviewing the inventory spreadsheet, I completed my notes on 87 columns after a month, not including university holidays (the review started just before the weeklong winter break). I also added a tab in the spreadsheet for digital collections I found on the network drive that did not have a row in the inventory. 

[Figure: Screenshot of the born digital inventory spreadsheet for University of Houston collections]
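A comparison like this one, between inventory rows and what is actually on the drive, could also be given a scripted first pass. The sketch below is only an illustration and was not part of the project. It assumes the inventory has been exported from Excel to CSV, that a column (hypothetically named "Collection") holds each collection's folder name, and that each collection is a top-level folder under one drive location.

```python
import csv
from pathlib import Path


def reconcile(inventory_csv, drive_root, name_column="Collection"):
    """Compare collection names in the inventory with top-level folders on the drive.

    The column name "Collection" is a hypothetical placeholder; use whatever
    column in the exported spreadsheet holds the folder name.
    """
    with open(inventory_csv, newline="", encoding="utf-8") as handle:
        listed = {
            row[name_column].strip()
            for row in csv.DictReader(handle)
            if row.get(name_column)
        }
    on_drive = {p.name for p in Path(drive_root).iterdir() if p.is_dir()}
    return {
        # In the inventory, but no matching folder was found on the drive.
        "missing_from_drive": sorted(listed - on_drive),
        # On the drive, but no row in the inventory (candidates for a new tab).
        "missing_from_inventory": sorted(on_drive - listed),
    }
```

A report like this only catches exact name matches, so it would supplement the row-by-row review rather than replace it.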

From there, three collections were identified as good candidates for this project. The Gene Green collection consists of two item transfers from the politician’s social media accounts (Facebook and Twitter). The George Baker collection contains over 200 Outlook email files. Finally, the New Music America collection holds transfers of multiple media types from a USB drive and 3.5” floppy disks. Together, these three collections represent a variety of media types and file formats, providing a foundation for testing a range of born digital scenarios. We started with the Gene Green collection and adjusted the scope of the project to focus on processing this one collection and crafting workflow recommendations that could be adapted for other born digital collections.

The digital folder structure of the Gene Green collection was already set up. In consultation with Bethany Scott, the Head of the Preservation and Reformatting Department (PARD), I copied the original files into a “working files” folder so that I could open them with minimal risk of corrupting or changing the originals. For this collection, the separate folder turned out to be unnecessary: the original transfer consists of zip files, and the extracted version already on the drive was serving as the workspace for our experimentation. Creating a separate “working files” folder may still be useful for other types of transfers.
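For transfers where a working copy is useful, one possible approach is to copy the folder and then confirm that every copied file matches its original by checksum. This is a minimal sketch, not the procedure used in this project; the function names are illustrative, and it assumes the working folder does not yet exist and sits outside the original folder.

```python
import hashlib
import shutil
from pathlib import Path


def sha256_of(path):
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1024 * 1024), b""):
            digest.update(chunk)
    return digest.hexdigest()


def make_working_copy(original_dir, working_dir):
    """Copy original_dir to working_dir and list any files that fail verification."""
    original_dir, working_dir = Path(original_dir), Path(working_dir)
    # copytree raises an error if working_dir already exists, which avoids
    # silently overwriting an earlier working copy.
    shutil.copytree(original_dir, working_dir)
    mismatches = []
    for source in original_dir.rglob("*"):
        if source.is_file():
            relative = source.relative_to(original_dir)
            copy = working_dir / relative
            if not copy.is_file() or sha256_of(source) != sha256_of(copy):
                mismatches.append(str(relative))
    return mismatches
```

An empty list from `make_working_copy` means every file in the original was found in the copy with an identical checksum.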

Transfer into Archivematica requires a restructure of the collection subfolders. Before copying the collection folder into the AMSource folder, all item files must be placed within a subfolder labeled “objects”. Any submission documentation, such as the XML file generated by DataAccessioner, should be saved under a “metadata” subfolder. The structure of collection files and folders should be maintained within these subfolders. Lastly, the downloaded configuration file must be added to the transfer folder. 
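The restructuring described above can be sketched as a small script. This is an illustration under stated assumptions rather than the procedure UH uses: the function and parameter names are hypothetical, submission documents are assumed to be stored outside the collection folder, and the configuration file is saved as processingMCP.xml, the name Archivematica's documentation uses for a per-transfer processing configuration (worth confirming against your own instance and version).

```python
import shutil
from pathlib import Path


def prepare_transfer(collection_dir, transfer_dir, submission_docs=(), processing_config=None):
    """Build a transfer folder with "objects" and "metadata" subfolders.

    collection_dir:    folder holding the item files, in their original structure
    transfer_dir:      new folder to create (later copied into the AMSource location)
    submission_docs:   paths to files such as the DataAccessioner XML
    processing_config: path to the downloaded processing configuration, if any
    """
    collection_dir, transfer_dir = Path(collection_dir), Path(transfer_dir)
    objects_dir = transfer_dir / "objects"
    metadata_dir = transfer_dir / "metadata"

    # Copy the whole collection under objects/, keeping its folder structure.
    # copytree creates transfer_dir as needed and fails if objects/ already exists.
    shutil.copytree(collection_dir, objects_dir)

    # Submission documentation goes under metadata/.
    metadata_dir.mkdir(parents=True, exist_ok=True)
    for doc in submission_docs:
        shutil.copy2(doc, metadata_dir / Path(doc).name)

    # Assumption: the configuration file belongs at the top of the transfer folder.
    if processing_config is not None:
        shutil.copy2(processing_config, transfer_dir / "processingMCP.xml")
    return transfer_dir
```

Working on a copy, as this sketch does, leaves the original collection folder untouched if the restructure needs to be redone.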

[Figure: Archivematica processing configuration for born digital materials in the UH test environment, shown in two parts]

UH maintains Archivematica version 1.12, a legacy version; the most recent release is 1.16. UH has a test environment for Archivematica that is tied to a specific location on the network drive. Moving the prepared folder into this location automatically starts the transfer, and progress can be viewed in the Transfer tab of the Archivematica web interface.

[Figure: Transfer page for the Gene Green social media archive, showing the completed transfer in the Archivematica test environment]

Once the transfer is complete, the package can be viewed in the Backlog tab. Based on the current configuration for born digital materials in UH’s Archivematica instance, all transfers are sent to the backlog for additional curatorial steps before ingest. Using the Appraisal tab, I was able to view all the files included in the item transfers and analyze them for sensitive information such as personally identifiable information (PII) or credit card numbers. For the Gene Green collection, I went through this process to experiment with it but determined that it was an unnecessary step here, since the content consists of publicly available social media posts. For collections that contain files that were never public (emails, for example), analysis for PII and credit card information is recommended. I did note that the analysis pane in Archivematica flagged one Facebook .html file as containing credit card numbers. I manually scanned the file for potential matches but did not find any sensitive information in it. Exploring how to control and manage false positives for this task may be of interest for future collection processing.
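One possible way to triage hits like this outside Archivematica is to run flagged digit strings through the Luhn checksum, which valid payment card numbers satisfy. The sketch below is an assumption-laden illustration, not a description of how Archivematica's analysis works: the pattern and function names are hypothetical, and the pattern simply looks for runs of 13 to 19 digits with optional spaces or hyphens.

```python
import re

# Hypothetical pattern: 13 to 19 digits, optionally separated by spaces or hyphens,
# and not embedded in a longer run of digits (such as a long numeric ID).
CANDIDATE = re.compile(r"(?<!\d)(?:\d[ -]?){12,18}\d(?!\d)")


def luhn_valid(digits):
    """Return True if a string of digits passes the Luhn checksum."""
    total = 0
    for position, char in enumerate(reversed(digits)):
        value = int(char)
        if position % 2 == 1:  # double every second digit, counting from the right
            value *= 2
            if value > 9:
                value -= 9
        total += value
    return total % 10 == 0


def triage_card_hits(text):
    """Return (matched text, passes Luhn) for each card-like digit run in text."""
    results = []
    for match in CANDIDATE.finditer(text):
        digits = re.sub(r"\D", "", match.group())
        results.append((match.group(), luhn_valid(digits)))
    return results
```

A string that fails the checksum is very likely a false positive, such as an account or post identifier. A string that passes still needs manual review, since roughly one in ten arbitrary digit runs passes by chance.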

In the Appraisal tab, there is also an arrangement pane, which can be used to organize files for submission. This is particularly useful if submitted files need reorganization before being ingested. Files in the backlog pane can be dragged and dropped into arrangement, after creating a directory folder to contain them. In the case of the Gene Green collection, the SIP folder structure did not change from the original acquisition. For each item in the collection, I created a SIP directory in the arrangement pane and dragged and dropped the folder structure from the Backlog pane, maintaining the original order. In the first item, the Facebook files, the folder of photos and videos was too large, or contained too many subfolders, to drag and drop the entire folder. Instead, I created a subdirectory in the arrangement pane, and dragged and dropped each subfolder to its corresponding location. For larger collections that may need analysis before ingest, the Archivematica instance should be modified to allow for larger files and folders to be dragged and dropped under the Appraisal tab. 

After a SIP is created in the Appraisal tab, the package moves to the Ingest tab. This process is automated: microservices run, and the status of each job is shown. Any decision points that were not automated in the processing configuration will need archivist intervention as they come up. Once the ingest process has completed, the resulting archival packages (AIPs) can be found in the Archival storage tab and downloaded for review.

[Figure: Ingest page for the Gene Green social media archive, showing the successful ingest in the Archivematica test environment]

Having completed this process in the test instance of Archivematica, we moved the items into the production instance. Before the prepared items were moved from test to production, Bethany modified the “borndigital” workflow so that the process would skip the backlog. Then, she updated the finding aid in UH’s ArchivesSpace to include the URIs for the digital objects so that the items are now discoverable by researchers.

Next steps for the Gene Green social media archives will include preparation for digital access. Our Archivematica workflow included the creation of normalized access copies during processing, but we were not able to explore them further due to time constraints. Bethany and I presented on this project at the Texas Conference on Digital Libraries 2024 last week, and our slides will be available in the TDL repository later this summer. The procedure documentation we worked on as part of this project will also be available for public view and use later this year through the TDL Digital Preservation Policy, Workflow, and Documentation Sharing wiki.
