Research data defined
'the evidence that underpins the answer to the research question' (Concordat on Open Research Data)
'recorded factual material commonly retained by and accepted in the scientific [research] community as necessary to validate research findings' (EPSRC Policy Framework on Research Data)
Research data are the raw materials collected, processed and studied in the undertaking of research. They are the evidential basis that substantiates published research findings.
They may be primary data generated or collected by the researcher, or secondary data collected from existing sources and processed as part of the research activity.
In addition to the 'raw' data, research data include information about the means necessary to generate data or replicate results, such as computer code, experimental methods and instruments used, and essential interpretive and contextual information, e.g. specifications of variables.
The raw data of research may exist in digital and non-digital formats, and may be broadly divided into five classes:
Facts recorded directly in real time from the physical and social environment, e.g. measurements collected by weather sensors, species abundance surveys, archaeological samples, brain scan images, experience and opinion surveys in the social sciences. These data are often unique to time and place and by definition cannot be reproduced.
Data collected as the outputs of field or laboratory experiments and complex analytical processes, e.g. clinical trial data, chemical analyses of physical samples, DNA sequencing of organic material, field trial results. These data are generally reproducible in principle, provided the experimental conditions can be replicated.
Data generated by means of computational 'virtual experiments', often used to model complex systems and processes, e.g. climate and weather simulations, models of market processes. These data are usually reproducible, given information about the model, the code and computing environment used to execute the model, and any input conditions. This information may in fact be more important than the output data.
Derived or compiled
Datasets produced by processing or combining source data, e.g. databases compiled by extraction of information from multiple secondary sources, collections of digitized materials, corpora collected by means of text mining.
Published and curated data, usually existing as part of managed collections, e.g. national statistics archives, crystallographic databases, gene banks.
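For data generated by computational 'virtual experiments', the point that the model, code, computing environment and input conditions may matter more than the output can be illustrated with a small sketch. The example below is illustrative only: the toy model, the function names (run_model, provenance_record) and the fields of the record are assumptions made for this example, not a prescribed metadata standard.

```python
import hashlib
import json
import platform
import random


def run_model(seed, steps):
    """Toy stand-in for a simulation: a seeded random walk (hypothetical model)."""
    rng = random.Random(seed)
    position = 0.0
    path = []
    for _ in range(steps):
        position += rng.uniform(-1.0, 1.0)
        path.append(position)
    return path


def provenance_record(inputs, output):
    """Collect the information someone would need to regenerate the output."""
    # A checksum lets a later rerun be compared against the original output.
    digest = hashlib.sha256(json.dumps(output).encode("utf-8")).hexdigest()
    return {
        "model": "toy_random_walk",  # assumed name, for illustration only
        "inputs": inputs,  # input conditions
        "python_version": platform.python_version(),  # computing environment
        "platform": platform.platform(),
        "output_sha256": digest,
    }


inputs = {"seed": 42, "steps": 100}
output = run_model(**inputs)
record = provenance_record(inputs, output)
print(json.dumps(record, indent=2))
```

Keeping a record of this kind alongside the code means the output could, in principle, be regenerated and checked against the stored checksum rather than preserved in full. A real simulation would typically also need to record the code version and the versions of any libraries used.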