Data Management Plans

In developing a data management plan, keep in mind that funding agencies, such as the National Science Foundation (NSF), may have specific content requirements. Certain data management considerations are fundamental across all disciplines, however, and planning for them will be beneficial to any data gathering project, regardless of funding requirements.

Generally speaking, any data management plan should touch on the topics listed below. Information in blue boxes is relevant specifically to those using a DSpace repository to store and disseminate data. Please see Data Management and the TDL for more information about the benefits and limitations of DSpace as a data management tool.

1. Describe the project, its purpose, and the organizations and staff involved.

2. Describe the data to be collected, the method of collection, the nature of the data, and its format.

Areas to address include:

Type of data (e.g. numerical, image, text, modeling)
Special storage needs required by your data
Amount of data produced and the growth rate (how fast will your data grow?)
Frequency of change and processes for keeping track
Potential uses for your data and needs for access
Person(s) in charge of data
Data retention (will you keep data for 3-5 years? 20 years? Indefinitely?)
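
The growth-rate question above can be turned into a rough storage estimate. The sketch below assumes simple compound annual growth; the starting size (500 GB), growth rate (20% per year), and five-year horizon are made-up illustration values, not recommendations.

```python
from typing import List


def project_storage_gb(initial_gb: float, annual_growth_rate: float, years: int) -> List[float]:
    """Project dataset size at the end of each year, assuming compound growth.

    annual_growth_rate is a fraction (0.20 means 20% growth per year).
    The returned list starts with year 0 (the initial size).
    """
    return [
        round(initial_gb * (1 + annual_growth_rate) ** year, 1)
        for year in range(years + 1)
    ]


# Hypothetical project: 500 GB today, growing 20% per year, planned over 5 years.
projection = project_storage_gb(500, 0.20, 5)
for year, size_gb in enumerate(projection):
    print(f"Year {year}: about {size_gb} GB")
```

An estimate like this is only as good as the growth assumption behind it, so it is worth revisiting whenever collection methods or instruments change.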

3. Identify the person(s) responsible for data management.

Who are the people in charge of data management during your research project and over time?

If you use a TDL DSpace repository to store your data, you may list your repository’s manager as one of the responsible parties in your long-term data management plan. If you do not know who manages your repository, contact your library or the TDL Program Coordinator at info@tdl.org for this information.

4. Explain how data will be documented throughout the research project.

Data documentation (or “metadata”) is an important component in providing usable access to your data. Without documenting the data, potential users will not be able to find, understand, cite, or properly use the data you have published.

Numerous documentation/metadata standards exist. The DDI (Data Documentation Initiative) specification is one such standard that is designed for social and behavioral science data.

The MIT Libraries Data Management Guide provides the following guidelines for aspects of your data you should plan to document:

Title: Name of the dataset or research project that produced it
Creator: Names and addresses of the organization or people who created the data
Identifier: Number used to identify the data, even if it is just an internal project reference number
Subject: Keywords or phrases describing the subject or content of the data
Funders: Organizations or agencies who funded the research
Rights: Any known intellectual property rights held for the data
Access information: Where and how your data can be accessed by other researchers
Language: Language(s) of the intellectual content of the resource, when applicable
Dates: Key dates associated with the data, including: project start and end date; release date; time period covered by the data; and other dates associated with the data lifespan, e.g., maintenance cycle, update schedule
Location: Where the data relates to a physical location, record information about its spatial coverage
Methodology: How the data was generated, including equipment or software used, experimental protocol, other things one might include in a lab notebook
Data processing: Along the way, record any information on how the data has been altered or processed
Sources: Citations to material for data derived from other sources, including details of where the source data is held and how it was accessed
List of file names: List of all data files associated with the project, with their names and file extensions (e.g. ‘NWPalaceTR.WRL’, ‘stone.mov’)
File formats: Format(s) of the data, e.g. FITS, SPSS, HTML, JPEG, and any software required to read the data
File structure: Organization of the data file(s) and the layout of the variables, when applicable
Variable list: List of variables in the data files, when applicable
Code lists: Explanation of codes or abbreviations used in either the file names or the variables in the data files (e.g. ‘999 indicates a missing value in the data’)
Versions: Date/time stamp for each file, and use a separate ID for each version
Checksums: To test if your file has changed over time (see section on backups)
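
As an illustration of the checksum entry above, the following sketch computes a checksum for a file and shows that the value changes when the file's contents change. SHA-256 is an assumption made for this example; the guide does not prescribe a particular algorithm, and the file name used here is hypothetical.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of_file(path, chunk_size=65536):
    """Return the SHA-256 checksum of a file as a hexadecimal string.

    The file is read in chunks so that large data files do not need
    to fit in memory.
    """
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Demonstration on a throwaway file (hypothetical name and contents).
with tempfile.TemporaryDirectory() as tmp_dir:
    data_file = Path(tmp_dir) / "observations.csv"
    data_file.write_text("site,value\nA,1\nB,999\n")
    recorded = sha256_of_file(data_file)   # store this alongside the data

    data_file.write_text("site,value\nA,1\nB,998\n")  # simulate a change
    file_has_changed = sha256_of_file(data_file) != recorded
    print("File changed since checksum was recorded:", file_has_changed)
```

Recording the checksum together with the date it was computed makes it possible to tell later whether a file has been altered or corrupted.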

If you are storing your data in a TDL DSpace repository, some of this metadata will be captured as part of the file record (title and creator information, for example), and checksums are routinely run for all files in TDL-hosted DSpace repositories to detect corrupted data. Other metadata fields can be uploaded in a readme.txt file or the equivalent along with the dataset. You should consult with your institution’s repository manager to determine best practices for documenting data in your institution’s DSpace repository.
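
One possible way to produce such a readme.txt is sketched below. The "Field: value" layout and every metadata value shown (project title, creator, dates) are hypothetical; only the field names and the sample file names and missing-value code come from the list above. Your repository manager may prefer a different layout.

```python
def build_readme(metadata):
    """Format a dictionary of metadata fields as plain 'Field: value' lines.

    List or tuple values are joined with semicolons so that each
    field stays on one line.
    """
    lines = []
    for field, value in metadata.items():
        if isinstance(value, (list, tuple)):
            value = "; ".join(str(item) for item in value)
        lines.append(f"{field}: {value}")
    return "\n".join(lines) + "\n"


# Hypothetical example values, for illustration only.
example_metadata = {
    "Title": "Example Palace Reconstruction Project",
    "Creator": "Example University, Department of Archaeology",
    "Identifier": "internal-project-0001",
    "Dates": "Project start 2011-01-01; project end 2012-12-31",
    "List of file names": ["NWPalaceTR.WRL", "stone.mov"],
    "Code lists": "999 indicates a missing value in the data",
}

readme_text = build_readme(example_metadata)
print(readme_text)

# To save the file next to the dataset:
# with open("readme.txt", "w", encoding="utf-8") as handle:
#     handle.write(readme_text)
```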

Additional Metadata Resources:

University of Oregon Metadata Information: http://libweb.uoregon.edu/datamanagement/metadata.html

5. Describe your short-term and long-term storage plans, including backup procedures.

Short-term storage may be handled by the project team; for instance, the team might store data initially on an office computer and perform periodic backups on an external hard drive.

For longer term storage or for storing large amounts of data, however, your project may require institutional support.

Researchers at TDL member institutions have the option to deposit datasets in a DSpace institutional repository. This may not be an ideal solution for large amounts of data. If you have a large dataset to store, especially if you are actively using the data, you can consult with the TDL to see if our cloud storage capabilities or storage agreement with the Texas Advanced Computing Center can help meet your needs for data storage. (See Data Management and the TDL for more information.)

If you choose to store and disseminate your data in a TDL-hosted DSpace repository, website, wiki, or journal, your data will be backed up on a regular basis.

TDL Backup Policy, including for DSpace Repositories

Files deposited in the Texas Digital Library (TDL) are written to two (2) ‘On-Line’ copies within the data center and two (2) ‘Off-Line Archive’ copies outside of the data center.

The ‘On-Line’ data is written to two separate storage volumes. A storage volume is highly available, highly reliable, persistent storage. Each storage volume is replicated within the Data Center. This prevents data loss due to failure of any single hardware component.

The ‘Off-Line Archive’ copies are made nightly and weekly and are available across multiple data centers. One (1) nightly copy is kept. Three (3) weekly copies are kept. The integrity of the ‘Off-Line’ data is verified using checksums. If corruption is detected, it is repaired using an alternate copy.
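
The verify-and-repair idea described above can be illustrated in general terms as follows. This is not TDL's actual implementation, which the policy does not describe; it is a minimal sketch assuming SHA-256 checksums, a previously recorded checksum for each file, and one alternate copy. The function names are hypothetical.

```python
import hashlib
import shutil
from pathlib import Path


def file_checksum(path):
    """Return the SHA-256 checksum of a file (reads the whole file into memory)."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def verify_and_repair(copy_path, alternate_path, expected_checksum):
    """Check one stored copy against its recorded checksum.

    Returns "ok" if the copy is intact, or "repaired" if it was corrupted
    and has been overwritten from the alternate copy. Raises RuntimeError
    if the alternate copy does not match the recorded checksum either.
    """
    if file_checksum(copy_path) == expected_checksum:
        return "ok"
    if file_checksum(alternate_path) != expected_checksum:
        raise RuntimeError("No intact copy available for " + str(copy_path))
    shutil.copyfile(alternate_path, copy_path)
    return "repaired"
```

The same pattern applies to a project team's own short-term backups: record a checksum when a file is first stored, and compare against it before trusting any copy.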

Publishing data in a subject-specific repository or online journal could also provide reliable long-term storage for your data. Listed below are resources for finding an appropriate subject repository:

https://www.lib.umn.edu/datamanagement/datacenters

6. Explain your plan for making the data available for public use and potential secondary uses.

The Internet provides numerous possibilities for publishing data and research, including on websites, in wikis, and in e-journals. The Texas Digital Library, in fact, provides services for setting up a project website or wiki and also hosts e-journals through the TDL Electronic Press.

Institutional or subject repositories are another common and effective way to store and disseminate data, as they provide ways of curating, preserving, and facilitating discovery of the data that other online publication tools do not.

The TDL hosts DSpace repositories for each of its member institutions that provide an effective means of archiving and disseminating research, datasets, and other materials.

DSpace and Data Access

DSpace is one of the most widely used platforms for open access digital repositories at colleges and universities. It contains built-in workflows for submitting data in any file format and makes the data easily discoverable and accessible online. Content stored in a DSpace repository is indexed by commercial search engines (such as Google) and OAI-PMH harvesters and can be accessed via the repository’s search and browse functions.

Additionally, the TDL repository service provides enhanced file access controls and a third-party, persistent URL service for citing and linking to data. Access restrictions and embargo periods for data can be applied as requested by the researcher, and data files can be easily downloaded by either authorized users or the public.

7. Explain your plans to ensure a long-lived format.

File format is an important consideration when it comes to ensuring future availability of your data. As technologies change, some software, and thus the file formats it supports, will become obsolete. As a result, it is important to store your data in a format that has the best chance of remaining accessible far into the future.

The best file formats are ones that have the following characteristics:

Non-proprietary
Open, documented standard
Common usage by research community
Standard representation (ASCII, Unicode)
Unencrypted
Uncompressed

For long-term storage, you should consider moving your data to a file format that has these characteristics.

Examples of preferred format choices for TDL-hosted DSpace repositories:

PDF/A (not Word or other proprietary word processing software)
ASCII (not Excel)
MPEG-4 (not Quicktime)
TIFF or JPEG2000 (not GIF or JPG)
XML or RDF (not RDBMS)
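
Before depositing, it can help to scan a list of file names for formats that the list above advises against. The sketch below is one hypothetical way to do that. The mapping from file extensions to formats (for example, treating .mov as QuickTime and .xls as Excel) is an assumption made for illustration and is not an official TDL list.

```python
from pathlib import Path

# Assumed mapping from common file extensions to the preferred
# alternative named in the list above. Extend as needed.
PREFERRED_ALTERNATIVE = {
    ".doc": "PDF/A",
    ".docx": "PDF/A",
    ".xls": "ASCII",
    ".xlsx": "ASCII",
    ".mov": "MPEG-4",
    ".gif": "TIFF or JPEG2000",
    ".jpg": "TIFF or JPEG2000",
    ".jpeg": "TIFF or JPEG2000",
}


def suggest_conversions(file_names):
    """Return {file name: preferred format} for files in a discouraged format.

    Files whose extension is not in the mapping are left out; that does
    not mean their format is suitable, only that this sketch has no
    suggestion for them.
    """
    suggestions = {}
    for name in file_names:
        extension = Path(name).suffix.lower()
        if extension in PREFERRED_ALTERNATIVE:
            suggestions[name] = PREFERRED_ALTERNATIVE[extension]
    return suggestions


# Hypothetical file list, reusing the sample names from the metadata section.
suggestions = suggest_conversions(["NWPalaceTR.WRL", "stone.mov", "results.XLSX"])
for name, preferred in suggestions.items():
    print(f"{name}: consider converting to {preferred}")
```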

Checksums are routinely run for all files in TDL-hosted DSpace repositories to detect corrupted data. Files will be refreshed by TDL staff as required.

8. Describe arrangements to protect participant confidentiality and/or intellectual property, and outline other legal or ethical considerations.

As you collect and publish data, you must consider your and others’ rights and responsibilities in regard to the confidentiality of research subjects, as well as intellectual property rights.

Confidentiality

There are numerous factors to consider in maintaining the confidentiality of research subjects. You can publish data on research subjects if you take the necessary steps to maintain participant confidentiality. Make sure you comply with university regulations, health research regulations, and ethical guidelines for responsible collection, retention, and sharing of data.

Tips for keeping confidential information secure:

Keep confidential data off the Internet, where it can be discovered and viewed.
Store sensitive materials on computers not connected to the Internet if they contain direct or indirect identifiers that could be combined with other public information to identify research participants.
Restrict access to buildings and rooms where computers or media are kept.
Only let trusted individuals troubleshoot computer problems.
Keep virus protection up to date on computers that contain your data.
Don’t send confidential data via e-mail or FTP (use encryption, if you must).
Use passwords on files and computers.

For more information about legal and ethical considerations for research involving human and non-human animal participants, consult your institution’s Institutional Review Board guidelines. When obtaining informed consent from research subjects, be sure to tell participants about how data will be stored and used and how confidentiality will be maintained.

ICPSR (the Inter-university Consortium for Political and Social Research) also provides guidance on protecting privacy and confidentiality of study participants.

Additional Resources:

Online Ethics Center for Engineering and Research
MIT Data Management Guide
HIPAA Privacy Rules for Researchers

Intellectual Property

There are two sets of intellectual property issues around sharing data – one for sharing data that you have produced or collected and one for sharing data that you gathered from other sources.

When using a TDL DSpace repository, intellectual property information can be included in the metadata record for your data.

Data in and of itself is not copyrightable, but it can be licensed and the license attached to data may limit how it can be used by others.

For data you produced or collected yourself:

You cannot copyright your data (though you can copyright a published chart or table as a tangible graphic expression).
You can license your data in ways that limit how the data can be used (e.g. you may need to protect privacy of participants; you may want to require attribution; you may want to forbid for-profit use).
You can promote sharing and unlimited use of your data by publishing it under a Creative Commons CC0 license.

For data you have collected from other sources:

You must determine whether you have the right to redistribute the data, based on the license (if any) attached to it.