Preparing for data archiving
Datasets can be valuable research outputs, and you should put as much care into preparing a dataset as you would any other research output.
This web page provides guidance on some of the main things to consider and address before you start to deposit a dataset in a data repository. A more detailed version of this guidance is provided in Preparing for Data Archiving (PDF). For those intending to deposit data in the University's Research Data Archive we also provide a Data Deposit Checklist (PDF).
1. Define the dataset
Before you deposit your dataset, you will need to define it. It is important to identify the contents of your dataset, as these will also determine what preparation is necessary. A systematic value assessment of your data can help you to make an informed decision about what to preserve and share. We provide a set of appraisal criteria in our Data Selection and Appraisal Checklist.
2. Identify the repository and check its requirements
We provide guidance on choosing a data repository.
You should check your preferred repository's guidance on depositing data and note any requirements it may have. Repositories may have content and metadata requirements for certain types of data, require submission of data in specific formats, and place limitations on the volume of data that can be deposited. Some repositories may also charge for deposit of data (although many do not).
Some repositories, including the Research Data Archive, can manage higher-risk anonymised data and data containing identifiable information under a controlled access procedure. See Preparing for Data Archiving (PDF) for more information.
3. Check your consents
If data have been collected from living persons, you should check that you have properly-documented consent for data sharing. It is acceptable to disclose data obtained from human subjects without consent if the data have been fully anonymised, but it is good practice to inform participants of your intention to do this. It is not acceptable to disclose even anonymised data if in your consent procedure you specifically stated that the data would not be disclosed, or would be destroyed at a given time. Identifiable data can be disclosed under a controlled access procedure, providing that participants have consented to participate in the study on the understanding that data would be shared in this way. The University provides a sample consent form including statements suitable for open data sharing and sharing of data subject to safeguards.
If you are depositing data collected from participants in the Research Data Archive, you will be required to submit your information sheet(s) and sample consent form(s) used in participant recruitment with your data files, so that we can confirm you have a basis for data sharing. These documents will be stored in the dataset as Documentation files. Access to them will be restricted, meaning they will not be available for users to download.
Consent is best obtained before data collection, but it may be possible to obtain consent retrospectively. In some cases, for example in qualitative research involving the collection of sensitive information, a process consent model may be appropriate. This might involve, for example, obtaining consent prior to conducting an interview, and later seeking approval of the anonymised transcript of the interview prior to archiving.
4. Identify dataset creators
It is important to understand who is a creator of your dataset – as well as who is not – because intellectual property rights and permission to distribute the data will be associated with its creators. Creators of datasets also have the moral right to be identified as such. A dataset may be the work of many hands, and it is not always easy to distinguish its creators from other people who contributed to the work of the project.
According to the Copyright, Designs and Patents Act 1988 a database is ‘a collection of independent works, data or other materials which – (a) are arranged in a systematic or methodical way, and (b) are individually accessible by electronic or other means’. It is ‘the selection or arrangement of the contents of the database’ that constitutes the creative act which attracts copyright.
Therefore, creators are those who have had a direct creative role in the selection and arrangement of data in the dataset. This is not the same as being involved in the design of the research or in the original data collection. In most cases, a project PI or student supervisor will not be a creator of the dataset, unless they had a direct authorial hand in its creation. Technicians, contractors and others involved in the collection of data are not usually creators of a dataset, unless they had creative input into the selection and arrangement of the data points.
Anyone who does not meet the definition of a Creator but has contributed to the production of the dataset can still be acknowledged for their contribution in the dataset documentation. The Research Data Archive includes a Contributors field in its metadata schema.
5. Identify the rights-holders
You must clearly identify rights-holders, because your authorisation to archive the dataset depends on their permission. Remember that by archiving data you are also distributing them, and doing this without the authorisation of the rights-holder will be a breach of copyright law.
Owners of intellectual property rights (IPR) in the data will be associated with the creators of the dataset.
In general, an employer will own IPR created by its employees: the University is ordinarily the rights-holder in IP created by members of staff. Research contracts generally allow ownership of ‘arising IP’ (i.e. created under the contract) to reside with the originating institution.
Students registered with the University own the IP they create by default, but this may not be the case if they are funded under a third-party sponsorship agreement (excluding public funders such as Research Councils, which do not assign student IP to other parties), or if they have assigned their IP to the University. Usually the third-party sponsor is a company, e.g. Syngenta or Waitrose, but it may also be a Government-funded agency, such as the Met Office, or a charity that is not primarily a research funder, such as the Donkey Sanctuary. A sponsorship agreement will include Intellectual Property clauses stating which party has ownership of arising IP. Ownership of IP created by a student at another institution will be subject to that institution's IP policy and any relevant agreements.
If a dataset has multiple creators, it may also have multiple rights-holders, which may include the University, students in their own right, and collaborating and partner organisations. There is more guidance on the Intellectual property rights and research data web page.
You may need to investigate any applicable research contracts or studentship agreements to establish what parties hold rights in a dataset. Students and/or their supervisors should have copies of any contracts relating to their research programmes. If you need to locate a copy of a contract, contact your Contracts Manager. Contact us if you have questions about a research contract.
Where datasets incorporate secondary data, the owners of these data will also have the rights to determine how and on what terms their data are distributed by you.
6. Decide your licensing preferences
In order to license the data you must be the data owner or authorised to assign a licence on behalf of the data owner, so the choice of licence may be subject to the permission of other parties. For example: a third-party co-creator with commercial interests may request the application of a non-commercial licence; if the dataset incorporates third-party materials these may be made available on an ‘All Rights Reserved’ basis.
Data held under a controlled access policy (such as UK Data Service safeguarded data and restricted datasets in the Research Data Archive) will be made available under special licence terms. The Data Access Agreement for restricted datasets deposited in the Research Data Archive allows data to be used, subject to authorisation, in confidence for non-commercial research and learning purposes only. The Agreement will be made between the University and the organisation to which the authorised user is affiliated.
As a general rule we recommend you use the Creative Commons Attribution licence for open data, and this is the default applied to uploaded files in the Research Data Archive. More restrictive licences should only be used if there is a justification for doing so, for example, to protect commercial or other confidential interests.
7. Obtain permissions if necessary
You must ensure that you have permission to archive and distribute the dataset from: the creators; the rights-holders; parties with contractual rights regarding publication of research outputs; secondary data owners.
Creators of datasets have the moral right in copyright law to be identified as such. Individuals also have the moral right not to have a work falsely attributed to them as an author. You must therefore ensure that the dataset is archived with the knowledge and permission of its creators.
Where the employer is a University or publicly-funded research organisation, permission to publish the data can be inferred from their policy position on research data, which is, certainly in the case of universities, to promote the public sharing of data supporting research outputs wherever possible. Other parties, including students, industrial studentship sponsors and commercial research partners, will need to give written consent to publication of the dataset.
Parties to contracts
Research and studentship contracts have Publication clauses, which generally grant other parties the right to be notified of and have the opportunity to approve or delay any intended publication. This right exists irrespective of who owns the IP created under the contract. The standard notice period is 30 days.
Secondary data owners
To seek permission, you should write to the parties concerned, and request permission in writing. Research contracts and sponsorship agreements will nominate a legal officer or other contact for each party, to whom any notices under the contract can be directed.
When contacting other parties for permission to archive and distribute data, it is important to identify the data unambiguously, and to be clear how the data will be made available, and on what terms they will be licensed for use. While you should always seek to license the dataset on the most open terms, other rights-holders may legitimately require more restrictive licensing. For example, a commercial partner may not be willing to distribute a dataset under terms that permit re-use for commercial purposes.
8. Form the dataset
Archiving data is not as straightforward as transferring the files from your active storage location into a data repository. Your data will need to be tidied up, put into order, and documented. When forming the dataset, consider the following:
- Define the dataset: identify all the files that will compose it. These might include: raw data files (in the initial collection format); processed data files (e.g. cleaned data; raw data saved to another format; statistical analyses and visualisations); documentation; programming code (e.g. analysis scripts).
- Ensure the data are stored in suitable formats for preservation, for example by saving tabular data in an open format such as CSV. You may need to check for any file format requirements specified by your chosen repository. Guidance is provided on suitable file formats for preservation in the Research Data Archive.
- Make sure your data files are well-formed and readable. Poorly-presented data are harder to read, more likely to contain errors, and inspire less trust. Check the data for errors. Apply consistent style and formatting, and spellcheck your text. Format code files legibly and include comments to explain what the code is doing. Ensure relevant information is clearly presented in data files, e.g. variable names and definitions, units of measurement, missing value codes, etc. Present actual values; avoid encoded content, such as formulae in spreadsheets and conditional formatting. There is useful guidance on preparing spreadsheet data from the Wellcome Trust.
- Redact the data as necessary. Data collected from research participants may need to be anonymised. There is guidance on anonymisation provided by the UK Data Service. Other kinds of information may also need to be removed or obscured, such as commercially-confidential information, locations of endangered species, etc. Link-coded data, where data records are identified by a unique code which is linked to identifiable participant information held in a separate table, are in data protection law still personal data. They are pseudonymised, not anonymised. For a dataset to be anonymised, and suitable for sharing as open data, you will need to remove any means of linking data records to identifiable participants, e.g. by destroying all documented records of the link, or by replacing linked IDs in the dataset with unlinked IDs.
- If the dataset is composed of multiple files, make sure they are organised in a logical fashion.
- Use appropriate and consistent file names, which are descriptive of the file contents, formatted without spaces or special characters, and not longer than 32 characters. Guidance on file naming is provided on the Organising your data web page.
- Check the size of the dataset and make sure it does not exceed any size limitations specified by your chosen data repository. The Research Data Archive allows the deposit of datasets up to 20 GB free of charge and recommends that individual files be no larger than 4 GB. If you have a large dataset and/or a large number of files, it may be easier for both you and prospective users of the data to use an archive format to package/compress the files. Zip and tar.gz are good choices, as they provide lossless compression.
- You could ask a colleague or peer to review your dataset. A pair of eyes unfamiliar with the data may spot mistakes and things you have overlooked. Remember that the people reading your data will not have your experience of the research context.
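The file naming and size checks in the list above can be partly automated before deposit. The following Python sketch is illustrative only and is not a tool provided by the Research Data Archive. It assumes that "without spaces or special characters" means letters, digits, hyphens, underscores and dots only, that the 32-character limit includes the file extension, and that GB is counted in decimal units; the function names and the folder name are hypothetical.

```python
import re
from pathlib import Path

# Limits taken from the guidance above; adjust them for other repositories.
MAX_NAME_LENGTH = 32
MAX_FILE_BYTES = 4 * 1000**3      # recommended maximum for one file (4 GB, decimal: an assumption)
MAX_DATASET_BYTES = 20 * 1000**3  # free deposit limit (20 GB, decimal: an assumption)

# Assumption: only letters, digits, dots, underscores and hyphens are "safe".
SAFE_NAME = re.compile(r"[A-Za-z0-9._-]+")


def name_problems(name: str) -> list[str]:
    """Return the problems found in one file name (empty list if none)."""
    problems = []
    if len(name) > MAX_NAME_LENGTH:
        problems.append(f"longer than {MAX_NAME_LENGTH} characters")
    if not SAFE_NAME.fullmatch(name):
        problems.append("contains spaces or special characters")
    return problems


def check_dataset(root: Path) -> list[str]:
    """Walk a dataset folder and report file naming and size problems."""
    report = []
    total = 0
    for path in sorted(p for p in root.rglob("*") if p.is_file()):
        size = path.stat().st_size
        total += size
        rel = path.relative_to(root).as_posix()
        for problem in name_problems(path.name):
            report.append(f"{rel}: name {problem}")
        if size > MAX_FILE_BYTES:
            report.append(f"{rel}: larger than 4 GB")
    if total > MAX_DATASET_BYTES:
        report.append("dataset: larger than 20 GB in total")
    return report
```

Calling `check_dataset(Path("my_dataset"))` on a local copy of the dataset folder returns a list of messages, one per problem; an empty list means no naming or size problem was found under these assumptions. A check of this kind does not replace a review of the file contents.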
9. Prepare the documentation
Every dataset should have at least a basic manual or user guide. This should include the following:
- citation metadata for the dataset (creators, title, publication year);
- identification of the rights-holder(s) with licence statements;
- a brief description of the dataset. This might include summary information about what and how much data were collected, the research context in which they were collected, the purpose for which they were collected, and the instruments and methods used;
- information about the project in which data were collected, with any external funding details;
- a description of the contents of the dataset, e.g. as a file listing;
- key interpretative information, e.g. a full definition of variables and units used, such as a codebook or data dictionary;
- details of the methods and instruments used to collect, process and analyse the data, and relevant supporting information, such as analysis scripts;
- references to any secondary data sources used;
- references to related publications. If a publication is in process, as much information as possible should be provided to enable identification of the published item, e.g. authors, provisional title, journal (if known), year and status (in preparation, under review, in press).
For deposits in the University's Research Data Archive, a README template (txt) is provided, which can be used to record basic documentation. Documentation can be saved in PDF, Word or another text format as preferred.
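The file listing mentioned above as one way to describe the contents of the dataset can be generated rather than typed by hand. The following Python sketch shows one possible approach; the function names, the tab-separated layout and the folder name are assumptions, and they are not part of the README template.

```python
from pathlib import Path


def format_listing(entries: list[tuple[str, int]]) -> str:
    """Format (relative path, size in bytes) pairs as sorted text lines."""
    return "\n".join(f"{name}\t{size} bytes" for name, size in sorted(entries))


def file_listing(root: Path) -> str:
    """Return a plain-text listing of every file under the dataset folder."""
    entries = [
        (path.relative_to(root).as_posix(), path.stat().st_size)
        for path in root.rglob("*")
        if path.is_file()
    ]
    return format_listing(entries)
```

The output of `file_listing(Path("my_dataset"))` can be pasted into the documentation and annotated with a short description of each file.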