By Nicholas Woodward, Sr. Software Engineer, Texas Digital Library
I’ll start with a little backstory.
Texas Digital Library was formed in 2005 among four ARL libraries in Texas in order to pool resources to build capacity for preserving, managing, and providing access to unique digital collections of enduring value. Among other things, we host institutional repositories, a consortial data repository, Open Access journals, and ETD management tools.
TDL currently has 23 institutions as members, and we are adding a 24th member in Fall 2019. Seventeen of our members use our repository hosting service, which includes:
- 21 hosted DSpace installations
- 10 TB of content and approximately 195,000 items
Our newest member using our repository hosting service is the University of North Texas Health Science Center (UNTHSC). There is no such thing as a “typical” repository migration, but UNTHSC’s onboarding and migration was more challenging than others because their content was stored on a bepress repository.
The case study below outlines TDL’s process for migrating UNTHSC’s content from bepress to a TDL-hosted DSpace. I originally presented this case study at the DSpace North American User Group hosted by the University of Minnesota Libraries in September 2019.
The University of North Texas Health Science Center (UNTHSC) approached the Texas Digital Library (TDL) in spring 2018 about migrating their scholarly repository from the bepress Digital Commons platform to a new DSpace 6.3 repository instance hosted by TDL.
UNTHSC and TDL agreed to collaboratively develop a workflow and timeline for the migration that incorporated existing DSpace tooling, custom code and shared documents.
In case you’re unfamiliar with bepress, here is a screenshot from UNTHSC’s former repository home page. In the image below, you can see the hierarchy of communities/collections that we knew we would need to recreate in DSpace.
Below is a typical item view page in bepress with the standard metadata and option to download. Notice below the Download link the count of the number of times the document has been downloaded.
At the outset of the project we worked with UNTHSC to develop a migration workflow that we could test from beginning to end. We began by working through each step of the process with a subset of the repository to create a minimum viable product, or MVP, of the migration. From there we could then iteratively develop greater capabilities until every step of the migration worked for the entire repository.
Here’s what that workflow looked like:
- Transfer digital objects along with their metadata
- Generate communities and collections in DSpace
- Create Simple Archive Format packages
- Ingest the packages into DSpace
- Customize the look-and-feel and configuration for UNTHSC
Step 1 | Transfer Digital Objects Along with Their Metadata
Step one involved working with UNTHSC’s bepress Archive:
- Complete up-to-date backup of the repository
- AWS S3 storage
- Accessible with AWS credentials
- Includes metadata-only items
Below is an example of the metadata files that are in the bepress Archive. One thing to notice is that there are no namespaces, so it is not validated against any schemas. The other thing, and you can’t see it in this graphic, is that the XML is encoded in ISO Latin 1.
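Since the XML is declared as ISO Latin 1, one plausible first step is to transcode it to UTF-8 before any further processing. The following is a minimal Ruby sketch of that idea, not the code TDL ran; the helper name `latin1_to_utf8` and the sample record are assumptions made for illustration.

```ruby
# Hypothetical helper: treat the raw bytes as ISO-8859-1 and transcode to UTF-8.
def latin1_to_utf8(raw_bytes)
  # force_encoding only relabels the bytes; dup keeps the caller's string untouched.
  text = raw_bytes.dup.force_encoding(Encoding::ISO_8859_1)
  text = text.encode(Encoding::UTF_8)
  # Keep the XML declaration consistent with the new encoding.
  text.sub(/\A(<\?xml[^>]*encoding=["'])[^"']+(["'])/i, '\1UTF-8\2')
end

# Illustrative input: in ISO Latin 1, "é" is the single byte 0xE9.
raw = "<?xml version=\"1.0\" encoding=\"iso-8859-1\"?><title>Caf\xE9</title>".b
utf8 = latin1_to_utf8(raw)
puts utf8
puts utf8.encoding
```

Relabeling the bytes before transcoding matters: calling `encode` on a string that Ruby believes is already UTF-8 would either raise an error or leave the invalid bytes in place.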
Step 2 | Generate Communities and Collections in DSpace
As with all good software projects, we began with a spreadsheet (see image below). The metadata in the bepress Archive links items to collections in a sort of roundabout fashion. Additionally, UNTHSC wanted to rearrange their repository, creating several new communities/collections and moving some existing items around.
Below is a closer view where you can see instances of what will become top-level communities, subcommunities, and collections in DSpace. We needed a way to specify the end nodes of the tree, meaning the collections. We settled on the pipe character, which would eventually serve a second purpose.
If you look after the pipe characters in the spreadsheet, you’ll see there are paths that correspond to the digital objects in the bepress Archive. In this way we could match item with collection, even in cases where the items would be moving to a different collection.
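To make the pipe convention concrete, here is a minimal Ruby sketch of turning such rows into a tree. The row layout is an assumption (cells run from the top-level community downward, and the last cell holds "collection name|archive path"); the names `Node`, `Collection`, and `build_tree` and the sample rows are hypothetical and do not come from TDL's spreadsheet.

```ruby
# Assumed row layout: community cells first, then "Collection name|archive/path".
Node = Struct.new(:name, :communities, :collections)
Collection = Struct.new(:name, :archive_path)

def build_tree(rows)
  root = Node.new('root', {}, [])
  rows.each do |cells|
    *community_names, leaf = cells.map(&:strip)
    # The pipe marks a collection and separates its name from the archive path.
    name, archive_path = leaf.split('|', 2).map(&:strip)
    raise ArgumentError, "missing pipe in #{leaf.inspect}" if archive_path.nil?

    # Walk (or create) the chain of communities above this collection.
    parent = community_names.reduce(root) do |node, community|
      node.communities[community] ||= Node.new(community, {}, [])
    end
    parent.collections << Collection.new(name, archive_path)
  end
  root
end

# Illustrative rows only.
rows = [
  ['Community A', 'Subcommunity A1', 'Collection X|community_a/collection_x'],
  ['Community A', 'Collection Y|community_a/collection_y'],
  ['Community B', 'Collection Z|community_b/collection_z']
]
tree = build_tree(rows)
tree.communities.each_value do |community|
  puts "#{community.name}: #{community.collections.map(&:name).inspect}"
end
```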
The image below shows the Ruby code that is used to transform the spreadsheet into an input XML file containing the communities and collections that DSpace uses to create them in the repository.
And below you can see the output of the Ruby code that the DSpace command line job uses as input. Two things to notice here: the DSpace job gives us the new handles of the collections, and we’ve stored the bepress Archive paths in the short description field of the collection.
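For readers who have not used it, the DSpace structure-builder job reads an `import_structure` XML file of nested `community` and `collection` elements. The sketch below shows one way such a file could be generated in Ruby, with the bepress Archive path written into each collection's `description` element (which DSpace stores as the short description). The input hash shape, the method names, and the sample data are assumptions, not TDL's published code.

```ruby
XML_ESCAPES = { '&' => '&amp;', '<' => '&lt;', '>' => '&gt;' }.freeze

def xml_escape(text)
  text.gsub(/[&<>]/, XML_ESCAPES)
end

# community: { name:, communities: [...], collections: [{ name:, archive_path: }] }
def community_xml(community, depth = 1)
  pad = '  ' * depth
  lines = ["#{pad}<community>", "#{pad}  <name>#{xml_escape(community[:name])}</name>"]
  community.fetch(:communities, []).each { |sub| lines << community_xml(sub, depth + 1) }
  community.fetch(:collections, []).each do |coll|
    lines << "#{pad}  <collection>"
    lines << "#{pad}    <name>#{xml_escape(coll[:name])}</name>"
    # The archive path rides along in the short description for later matching.
    lines << "#{pad}    <description>#{xml_escape(coll[:archive_path])}</description>"
    lines << "#{pad}  </collection>"
  end
  lines << "#{pad}</community>"
  lines.join("\n")
end

def structure_xml(communities)
  ['<import_structure>', *communities.map { |c| community_xml(c) }, '</import_structure>'].join("\n")
end

# Illustrative hierarchy only.
puts structure_xml([
  { name: 'Community A',
    communities: [{ name: 'Subcommunity A1',
                    collections: [{ name: 'Collection X', archive_path: 'community_a/collection_x' }] }],
    collections: [{ name: 'Collection Y', archive_path: 'community_a/collection_y' }] }
])
```

A file like this would then be passed to the command line job, for example `[dspace]/bin/dspace structure-builder -f structure.xml -o structure-out.xml -e admin@example.org` (file names and e-mail address are placeholders). The output file repeats the structure with an `identifier` attribute carrying each new handle.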
Step 3 | Create Simple Archive Format Packages
In order to import the digital objects from bepress into DSpace we quickly settled on the Simple Archive Format (SAF), a DSpace standard for representing items. Each package contains the metadata in a standardized format, the digital objects themselves, and a list of the collection(s) the item belongs to.
Back to the metadata stored in the bepress Archive… one thing to notice is there are no namespaces, HTML entities have been converted, and most importantly, the encoding is specified as ISO Latin 1.
Thankfully we have some experience with mapping metadata to a range of standards and formats and metadata application profiles. We repurposed some existing code to map the bepress archive metadata into both unqualified Dublin Core and the thesis namespace for UNTHSC’s electronic theses and dissertations.
The end result is metadata in a format that DSpace is expecting for the Simple Archive Format packages.
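As a rough illustration of what one package holds, the Ruby sketch below builds the text files of a single SAF item directory: `dublin_core.xml`, an optional `metadata_thesis.xml`, `contents` (bitstream file names), and `collections` (collection handles). The item hash shape, the method names, and every sample value are assumptions; the bitstream files themselves would be copied into the same directory.

```ruby
SAF_ESCAPES = { '&' => '&amp;', '<' => '&lt;', '>' => '&gt;', '"' => '&quot;' }.freeze

def saf_escape(text)
  text.to_s.gsub(/[&<>"]/, SAF_ESCAPES)
end

# fields: array of [element, qualifier_or_nil, value]
def metadata_xml(schema, fields)
  rows = fields.map do |element, qualifier, value|
    %(  <dcvalue element="#{element}" qualifier="#{qualifier || 'none'}">#{saf_escape(value)}</dcvalue>)
  end
  [%(<?xml version="1.0" encoding="UTF-8"?>),
   %(<dublin_core schema="#{schema}">), *rows, '</dublin_core>'].join("\n") + "\n"
end

# Returns { relative file name => file body } for one SAF item directory.
def saf_files(item)
  files = {
    'dublin_core.xml' => metadata_xml('dc', item[:dc]),
    'contents' => item[:bitstreams].join("\n") + "\n",
    'collections' => item[:collection_handles].join("\n") + "\n"
  }
  # Non-DC schemas go in their own metadata_<schema>.xml file.
  files['metadata_thesis.xml'] = metadata_xml('thesis', item[:thesis]) if item[:thesis]&.any?
  files
end

# Illustrative item only.
item = {
  dc: [['title', nil, 'An Example Thesis'], ['creator', nil, 'Doe, Jane']],
  thesis: [['degree', 'name', 'Master of Science']],
  bitstreams: ['thesis.pdf'],
  collection_handles: ['123456789/11']
}
saf_files(item).each { |name, body| puts "== #{name}", body }
```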
Step 4 | Ingest SAF Packages into DSpace
By matching the bepress Archive path in the metadata with the short description of the corresponding collection, we can match the two in the SAF package.
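One way to picture that matching step is a longest-prefix lookup from an item's archive path to a collection handle. This Ruby sketch assumes that item paths begin with their collection's path and that the handle and short description pairs have already been read from the structure-builder output; both assumptions, the method name, and the sample handles are illustrative.

```ruby
# Pairs assumed to come from the structure-builder output (handles are made up).
COLLECTIONS = [
  { handle: '123456789/11', archive_path: 'community_a/collection_x' },
  { handle: '123456789/12', archive_path: 'community_a/collection_x/special' }
].freeze

# Returns the handle of the most specific collection whose path prefixes item_path.
def collection_handle_for(item_path, collections)
  match = collections
          .select { |c| item_path == c[:archive_path] || item_path.start_with?("#{c[:archive_path]}/") }
          .max_by { |c| c[:archive_path].length }
  match && match[:handle]
end

puts collection_handle_for('community_a/collection_x/42', COLLECTIONS)
puts collection_handle_for('community_a/collection_x/special/7', COLLECTIONS)
```

Appending the trailing slash before comparing keeps a path such as `community_a/collection_xyz/1` from being matched to `community_a/collection_x`.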
The metadata stored in the bepress Archive is almost complete, but early on in the process we discovered that there were a handful of fields that weren’t present in S3 but were available via the OAI-PMH feed. These included the dc.type and dc.format fields. We were able to associate the items in the bepress Archive with their metadata records in OAI-PMH and add those fields to the final metadata.
Additionally, UNTHSC wanted to preserve the download statistics their items had accumulated in Digital Commons. This information is not stored in the bepress Archive or the OAI-PMH feed, but it is available via custom metadata exports in bepress. So again, we found a way to associate all of the download statistics with an item, and after UNTHSC determined they wanted to store that info in dc.provenance.legacyDownloads we added it to the mix:
OAI-PMH metadata feed | dc.type and dc.format fields
Custom metadata reports from bepress | Legacy download statistics – dc.provenance.legacyDownloads
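The merge of these two supplementary sources can be sketched in Ruby as follows. How the records were actually associated is not described here, so the sketch simply assumes all three sources share a key (the item's archive path); the `enrich` method, the hash shapes, and the sample values are hypothetical.

```ruby
# item:            { archive_path:, fields: [[element, qualifier, value], ...] }
# oai_records:     { key => { type:, format: } }   (assumed to be pre-parsed)
# download_counts: { key => Integer }              (assumed to be pre-parsed)
def enrich(item, oai_records, download_counts)
  key = item[:archive_path] # assumed shared key across all three sources
  extra = oai_records.fetch(key, {})
  fields = item[:fields].dup
  fields << ['type', nil, extra[:type]] if extra[:type]
  fields << ['format', nil, extra[:format]] if extra[:format]
  if download_counts.key?(key)
    # Stored as dc.provenance.legacyDownloads.
    fields << ['provenance', 'legacyDownloads', download_counts[key].to_s]
  end
  item.merge(fields: fields)
end

# Illustrative data only.
item = { archive_path: 'community_a/collection_x/42', fields: [['title', nil, 'An Example Item']] }
oai = { 'community_a/collection_x/42' => { type: 'text', format: 'application/pdf' } }
downloads = { 'community_a/collection_x/42' => 128 }
enrich(item, oai, downloads)[:fields].each { |field| p field }
```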
Step 5 | Customize Configuration for UNTHSC
In order to put it all together, we:
- Utilized Ansible playbooks to launch a DSpace 6.3 instance with Shibboleth authentication, ORCID integration, etc.
- Executed a final sync process to get UNTHSC’s repository from the bepress Archive
- Generated communities/collections hierarchy from the spreadsheet
- Built SAF packages for all items
- Executed DSpace command line jobs to ingest all repository content, run Solr indexing, and perform media filtering
- Enabled GUI customization
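The ingest, indexing, and media filtering steps in the list above map onto standard DSpace command line jobs. The Ruby sketch below only assembles the command lines and prints them; the install path, e-mail address, and directory names are placeholders, and it assumes each SAF package carries its own `collections` file so that no `--collection` option is passed.

```ruby
# Builds (but does not run) the DSpace CLI invocations for a batch ingest.
def ingest_commands(dspace_dir:, eperson:, source_dir:, mapfile:)
  dspace = File.join(dspace_dir, 'bin', 'dspace')
  [
    # Batch import of SAF packages; the mapfile records item-to-handle mappings.
    [dspace, 'import', '--add', "--eperson=#{eperson}", "--source=#{source_dir}", "--mapfile=#{mapfile}"],
    # Rebuild the Discovery (Solr) index.
    [dspace, 'index-discovery'],
    # Generate thumbnails and extracted text for the new bitstreams.
    [dspace, 'filter-media']
  ]
end

commands = ingest_commands(
  dspace_dir: '/dspace',               # placeholder install path
  eperson: 'admin@example.org',        # placeholder submitter account
  source_dir: '/tmp/saf_packages',     # placeholder SAF directory
  mapfile: '/tmp/saf_packages.map'     # placeholder mapfile path
)
commands.each { |cmd| puts cmd.join(' ') }
# To actually run them on a DSpace host:
# commands.each { |cmd| system(*cmd, exception: true) }
```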
Ta da! Have a look at the final product: A TDL-hosted repository for UNTHSC.
SUCCESSES AND CHALLENGES
Our successes included:
- Migrated ~3,700 items w/metadata and supporting objects
- Developed workflow using existing DSpace tooling, spreadsheets, and custom Ruby to “glue” the steps together
- Aligned UNTHSC’s repository with the open, supportive consortium of the Texas Digital Library
Our biggest challenges were:
- Text encoding issues with the ISO Latin 1 format
- New items and metadata edits were delayed in reaching the bepress Archive
- Metadata in the bepress Archive lacked key fields like type and format
UNTHSC’s new DSpace repository was launched on September 16. We finished DSpace services setup, including Google Analytics, RDF, etc., and completed the remaining GUI and configuration customization. We will soon publish our workflow Ruby code to Texas Digital Library’s GitHub repository.
Production release was October 21st. You can view the UNTHSC Scholar repository at https://unthsc-ir.tdl.org/.