Documentation and metadata

Would you understand your data in five or ten years' time? If somebody else wanted to use your data in their own research, or wished to replicate your results, what information would they need?

Documentation makes the raw data meaningful and provides the means to validate them and the analyses on which your findings are based. You should record relevant information as soon as possible, and ensure it is stored and organised efficiently. This will make it much easier to use when you need it later on - for example, when you are preparing a dataset for deposit in a data repository at the end of your project.

It can be useful to think of documentation in terms of four levels: variable, file/database, project, and metadata.

Variable

Variable-level documentation defines your variables, and specifies units of measurement and permitted values (including missing value codes). This information is usually embedded within data files, e.g. as a header, or in column labels. Separate worksheets in a spreadsheet file might contain a list of variables with their full definitions and information about units of measurement and permitted values. Variable information may also be recorded as a separate codebook or data dictionary.
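As an illustration, a data dictionary can be kept in a machine-readable form, so that the same definitions both document the data and check them. The following is a minimal sketch in Python, not a prescribed method: the variable names, ranges and the missing value code -99 are all hypothetical, and the sketch assumes every variable is numeric.

```python
import csv
import io

# Hypothetical data dictionary: one entry per variable, giving its
# definition, unit of measurement, permitted range and missing value code.
DATA_DICTIONARY = {
    "participant_id": {
        "definition": "Anonymised participant identifier",
        "unit": None, "min": 1, "max": 9999, "missing": None,
    },
    "height_cm": {
        "definition": "Standing height without shoes",
        "unit": "cm", "min": 50, "max": 250, "missing": -99,
    },
}

FIELDS = ["variable", "definition", "unit", "min", "max", "missing"]


def codebook_csv():
    """Return the data dictionary as CSV text, ready to save as a codebook."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS)
    writer.writeheader()
    for name, spec in DATA_DICTIONARY.items():
        writer.writerow({"variable": name, **spec})
    return buffer.getvalue()


def check_row(row):
    """Return a list of problems found in one row (a dict of column -> text)."""
    problems = []
    for name, spec in DATA_DICTIONARY.items():
        if name not in row:
            problems.append(f"{name}: column not present")
            continue
        value = float(row[name])
        if spec["missing"] is not None and value == spec["missing"]:
            continue  # a declared missing value code is always permitted
        if not spec["min"] <= value <= spec["max"]:
            problems.append(
                f"{name}: {value:g} outside {spec['min']}-{spec['max']}"
            )
    return problems


print(check_row({"participant_id": "12", "height_cm": "300"}))
```

Saving the output of codebook_csv() alongside the data file gives you a separate codebook, while check_row() flags values that fall outside the documented ranges.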

File/database

File or database-level information describes the components and logical structure of the dataset. This could be as simple as a listing of files with details of their contents, or it could be a database schema. This information is typically recorded in a separate readme file.
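A simple file listing of this kind can be generated rather than typed by hand. The sketch below is one possible approach, assuming your dataset sits in a single folder; the function name file_listing, the tab-separated layout and the placeholder text for undescribed files are illustrative choices, not a standard.

```python
from pathlib import Path


def file_listing(root, descriptions):
    """Build readme text listing every file under root.

    descriptions maps a relative path (with forward slashes) to a short
    note on the file's contents; files without a note are flagged.
    """
    root = Path(root)
    lines = [f"Contents of {root.name}", ""]
    for path in sorted(p for p in root.rglob("*") if p.is_file()):
        relative = path.relative_to(root).as_posix()
        size = path.stat().st_size
        note = descriptions.get(relative, "NO DESCRIPTION YET")
        lines.append(f"{relative}\t{size} bytes\t{note}")
    return "\n".join(lines)
```

You might call file_listing("my_dataset", {"data/survey.csv": "Survey responses"}) and paste the result into your readme file, filling in any entries still marked as lacking a description.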

Project

Project-level information describes the research questions and hypotheses the data are collected to answer or test, the research methodologies, the instruments used to collect and process the data, and records of the research process. There may be standard experimental reporting protocols in your field that you can use to document your methods and instruments. Documentation might include laboratory notebooks, interview schedules, instrument or software specifications and guides, in-line comments in software code written for the research, and interview transcription and anonymisation guidelines. In scientific research the documentation of this information may be more formalised, and may be supported by specific processes or tools. For example, it is increasingly common for study protocols to be publicly pre-registered, and there are a number of online tools such as protocols.io, Benchling, Labstep or RSpace that can be used to record and publish experimental protocols and lab notes.

Metadata

Metadata is a structured description of an information item, such as a dataset, made up of a set of defined elements. It is usually created when a dataset is deposited into a data repository or described in a data catalogue, and will be composed of information generated at the first three levels of documentation. The metadata description enables a dataset to be discovered online and provides key information to support continued curation and use of the dataset. Core metadata properties are typically: Creator(s), Title, Publisher, Publication Year, Resource Type, and a Unique Identifier (e.g. a DOI). Additional properties may be included to facilitate discovery and use, such as description, keywords, temporal and geographical references, rights and licence information, and links to related publications.
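To make this concrete, a metadata record can be pictured as a small set of named fields. The sketch below is illustrative only: the field names are hypothetical, loosely modelled on the core properties listed above, and are not the schema of any particular repository. All of the values are invented.

```python
import json

# Hypothetical field names for the core properties listed above.
CORE_PROPERTIES = [
    "creators", "title", "publisher",
    "publication_year", "resource_type", "identifier",
]


def missing_core_properties(record):
    """Return the core properties that are absent or empty in a record."""
    return [name for name in CORE_PROPERTIES if not record.get(name)]


# Invented example record; every value here is a placeholder.
record = {
    "creators": ["Example, Researcher A."],
    "title": "Example survey dataset",
    "publisher": "Example University",
    "publication_year": 2024,
    "resource_type": "Dataset",
    "identifier": None,  # assumed here to be assigned on deposit
    # Additional properties that support discovery and use.
    "keywords": ["example", "survey"],
    "licence": "CC BY 4.0",
}

print(missing_core_properties(record))  # prints ['identifier']
print(json.dumps(record, indent=2))
```

A check like this can tell you what information is still outstanding before you deposit, whichever repository you use.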

You will not need to create a metadata record for your data until you have completed data collection and analysis, and are in the final stages of your research or preparing a publication. At this stage you should be thinking of depositing your data in a relevant repository. But if you have already identified a specific disciplinary repository that you plan to deposit data in, it is worth familiarising yourself with its metadata requirements early, so that you have all the information you need when the time comes. For example, if you are conducting microarray or next-generation sequencing experiments and plan to deposit data in ArrayExpress, you should be prepared to record your experiment using the Minimum Information About a Microarray Experiment (MIAME) or Minimum Information About a Sequencing Experiment (MINSEQE) guidelines.

Robert Darby, Research Data Manager

researchdata@reading.ac.uk

Tel. 0118 378 6161