Preparing for data archiving
Archiving data usually involves making them available for other people to consult and use, and you should put as much care into preparing your dataset as you would an article or any other research publication.
If you will be depositing your data in a data repository, you should first look at its guidance on depositing data and note any requirements it may have. Data repositories may have content and metadata requirements for certain types of data, require submission of data in specific formats, and place limitations on the volume of data that can be deposited. Some repositories may also charge for deposit of data (although many do not).
For those intending to deposit data in the University's Research Data Archive we provide a Data Deposit Checklist (PDF) and a more detailed Data Preparation Guide (PDF). While these are aimed at Research Data Archive users, much of the guidance is applicable to anyone preparing a dataset for deposit in a data repository.
These are some of the main things you will need to consider as you prepare your data for archiving.
Form the dataset
Archiving data is not as straightforward as transferring the files from your active storage location into a data repository. Your data will need to be tidied up, put into order, and documented. When forming the dataset, consider the following:
- Define the dataset: identify all the files that will compose it. These might include: raw data files (in the initial collection format); processed data files (e.g. cleaned data; raw data saved to another format; statistical analyses and visualisations); documentation; programming code (e.g. analysis scripts).
- Ensure the data are stored in suitable formats for preservation, for example by saving tabular data in an open format such as CSV. Guidance is provided on suitable file formats for preservation.
- Make sure your data files are well-formed and readable. Poorly-presented data are harder to read, more likely to contain errors, and inspire less trust. Check the data for errors. Apply consistent style and formatting, and spellcheck your text. Format code files legibly and include comments to explain what the code is doing. Ensure relevant information is clearly presented in data files, e.g. variable names and definitions, units of measurement, missing value codes, etc. Present actual values; avoid encoded content, such as formulae in spreadsheets and conditional formatting. There is useful guidance on preparing spreadsheet data from the Wellcome Trust.
- Redact the data as necessary. Data collected from research participants may need to be anonymised. There is guidance on anonymisation provided by the UK Data Service. Other kinds of information may also need to be removed or obscured, such as commercially-confidential information, locations of endangered species, etc.
- If the dataset is composed of multiple files, make sure they are organised in a logical fashion.
- Use appropriate and consistent file names, which are descriptive of the file contents, formatted without spaces or special characters, and not longer than 32 characters. Guidance on file naming is provided on the Organising your data web page.
- Check the size of the dataset and make sure it does not exceed any size limitations specified by your chosen data repository. If you have a large dataset and/or a large number of files, it may be easier for both you and prospective users of the data to use an archive format to package/compress the files. Zip and tar.gz are good choices, as they provide lossless compression.
- You could ask a colleague or peer to review your dataset. A pair of eyes unfamiliar with the data may spot mistakes and things you have overlooked. Remember that the people reading your data will not have your experience of the research context.
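The file naming, size and packaging checks above can be partly automated. The following is a minimal sketch, not part of the official guidance. It assumes that "no special characters" means letters, digits, hyphens, underscores and dots only, and that the 32-character limit applies to the full file name including its extension. The function names `file_name_problems`, `audit_dataset` and `package_dataset` are hypothetical.

```python
import re
import shutil
from pathlib import Path

# Assumption: only letters, digits, dots, hyphens and underscores count as
# "safe". Adjust this pattern to match your repository's own rules.
SAFE_NAME = re.compile(r"[A-Za-z0-9._-]+")
MAX_NAME_LENGTH = 32  # limit stated in the guidance above


def file_name_problems(name: str) -> list[str]:
    """Return a list of naming problems for a single file name."""
    problems = []
    if len(name) > MAX_NAME_LENGTH:
        problems.append(f"longer than {MAX_NAME_LENGTH} characters")
    if not SAFE_NAME.fullmatch(name):
        problems.append("contains spaces or special characters")
    return problems


def audit_dataset(root: str) -> tuple[dict[str, list[str]], int]:
    """Walk a dataset folder.

    Returns a mapping of relative file path to naming problems,
    and the total size of all files in bytes.
    """
    base = Path(root)
    issues: dict[str, list[str]] = {}
    total_bytes = 0
    for path in sorted(base.rglob("*")):
        if path.is_file():
            total_bytes += path.stat().st_size
            problems = file_name_problems(path.name)
            if problems:
                issues[path.relative_to(base).as_posix()] = problems
    return issues, total_bytes


def package_dataset(root: str, archive_base: str, fmt: str = "zip") -> str:
    """Package the folder as zip ("zip") or tar.gz ("gztar").

    archive_base is the output path without extension; keep it outside
    the dataset folder so the archive does not try to include itself.
    Returns the path of the archive that was written.
    """
    return shutil.make_archive(archive_base, fmt, root_dir=root)
```

A typical use would be to call `audit_dataset("my_dataset")`, fix any reported names, compare the total size against the repository's limit, and only then call `package_dataset("my_dataset", "../my_dataset_v1")`.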
Prepare the documentation
Every dataset should have at least a basic manual or user guide. This should include the following:
- citation metadata for the dataset (creators, title, publication year);
- identification of the rights-holder(s) with licence statements;
- an abstract for the dataset providing details of the research project and the purpose for which the data were collected;
- a description of the contents of the dataset, e.g. as a file listing;
- key interpretative information, e.g. a full definition of variables and units used, such as a codebook or data dictionary;
- details of the methods and instruments used to collect, process and analyse the data, and relevant supporting information, such as analysis scripts;
- references to any secondary data sources used;
- references to related publications. If a publication is in process, as much information as possible should be provided to enable identification of the published item, e.g. authors, provisional title, journal (if known), year and status (in preparation, under review, in press).
For deposits in the University's Research Data Archive, a README template (txt) is provided, which can be used to record basic documentation. Documentation can be saved in PDF, Word or another text format as preferred.
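If you are writing the documentation from scratch, the file listing and headings can be generated rather than typed by hand. The sketch below is illustrative only: it does not reproduce the University's README template, and the section labels, the `TODO` placeholders and the function names `file_listing` and `readme_skeleton` are assumptions chosen to mirror the list above.

```python
from pathlib import Path


def file_listing(root: str) -> list[str]:
    """Return one 'relative/path  (N bytes)' line per file under root."""
    base = Path(root)
    lines = []
    for path in sorted(base.rglob("*")):
        if path.is_file():
            rel = path.relative_to(base).as_posix()
            lines.append(f"{rel}  ({path.stat().st_size} bytes)")
    return lines


def readme_skeleton(root: str, title: str, creators: list[str], year: int) -> str:
    """Build a plain-text README outline with TODO markers to fill in by hand."""
    sections = [
        f"Title: {title}",
        f"Creators: {'; '.join(creators)}",
        f"Publication year: {year}",
        "Rights-holder(s) and licence: TODO",
        "Abstract: TODO",
        "Variables and units: TODO",
        "Methods and instruments: TODO",
        "Secondary data sources: TODO",
        "Related publications: TODO",
        "",
        "File listing:",
        *file_listing(root),
    ]
    return "\n".join(sections) + "\n"
```

The returned text could be written to a file with, for example, `Path("README.txt").write_text(readme_skeleton("my_dataset", "My dataset", ["A. Researcher"], 2024), encoding="utf-8")`; the title, creator and year here are placeholder values.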
Check your consents
If data have been collected from living persons, you should check that you have properly-documented consent for data sharing. It is acceptable to disclose data obtained from human subjects without consent if the data have been fully anonymised, but it is good practice to inform participants of your intention to do this. Note that it is not acceptable to disclose even anonymised data if in your consent procedure you specifically stated that the data would not be disclosed, or would be destroyed at a given time.
Consent is best obtained before data collection, but it may be possible to obtain consent retrospectively. In some cases, for example in qualitative research involving the collection of sensitive information, a process consent model may be appropriate. This might involve, for example, obtaining consent prior to conducting an interview, and later seeking consent to publish the prepared transcript of the interview.
Identify dataset creators
You should be clear about who the creators of the dataset are, as ownership rights and permission to distribute the dataset will be associated with them (see following). A dataset may be the work of many hands, and it is not always easy to distinguish its creators from other people who contributed to the work of the project.
Creators are those who have had direct input in creating the dataset. In most cases, a project PI or student supervisor will not be a creator of the dataset, unless they had a direct hand in its creation. Technicians, contractors and others collecting data under instruction are not usually creators of the dataset, as they generally have no creative input into how the data are selected and presented.
Identify the rights-holders
This is important, because your authorisation to archive and distribute the dataset depends on the permission of the rights-holders.
Owners of intellectual property rights (IPR) in the data will be associated with the creators of the dataset. In general, an employer will own IPR created by its employees: the University is ordinarily the rights-holder in a dataset created by its employees. Research contracts generally allow IPR to reside with the originating institutions. Students own the IP they create by default, but this may not be the case if they are sponsored by a third party, e.g. under a CASE or industrial sponsorship, or if they have assigned their IP to the University.
If a dataset has multiple creators, it may also have multiple rights-holders, which may include the University, students in their own right, and collaborating and partner organisations. There is more guidance on the Intellectual property rights and research data web page.
You may need to investigate any applicable research contracts or studentship agreements to establish what parties hold rights in a dataset. Contact us if you have questions about a research contract. If you need to locate a copy of a contract, contact your Contracts Manager.
Where datasets incorporate secondary data, the owners of these data will also have the rights to determine how and on what terms their data are distributed by you.
Decide your licensing preferences
Be aware that where permission must be sought from other parties to publish the data, some negotiation over licence terms may be necessary.
Obtain permissions if necessary
You must ensure that you have permission to archive and distribute the dataset from: the creators; the rights-holders; parties with contractual rights regarding publication of research outputs; secondary data owners.
As a matter of course and courtesy, you should ensure the dataset is archived with the knowledge and permission of its creators, who will be publicly identified as such.
Where the employer is a University or publicly-funded research organisation, permission to publish the data can be inferred from their policy position on research data, which is, certainly in the case of universities, to promote the public sharing of data supporting research outputs wherever possible. Other parties, including students, industrial studentship sponsors and commercial research partners, will need to give written consent to publication of the dataset.
Parties to contracts
Research and studentship contracts have Publication clauses, which generally grant other parties the right to be notified of and have the opportunity to approve or delay any intended publication. This right exists irrespective of who owns the IP created under the contract. The standard notice period is 30 days.
Secondary data owners
To seek permissions, you should write to the parties concerned, and request permission in writing. Research contracts and sponsorship agreements will nominate a legal officer or other contact for each party, to whom any notices under the contract can be directed.
When contacting other parties for permission to archive and distribute data, it is important to identify the data unambiguously, and to be clear how the data will be made available and on what terms they will be licensed for use. While you should always seek to license the dataset on the most open terms, other rights-holders may legitimately require more restrictive licensing. For example, a commercial partner may not be willing to distribute a dataset under terms that permit re-use for commercial purposes.