Quality control is essential to maintaining the integrity and viability of data. Every action performed on data, from the point of collection onwards, presents an opportunity for errors to be introduced. You must therefore implement procedures to reduce the risk of introducing errors in the data, and to mitigate the impact of errors when they occur.
Various quality control strategies can be used:
- Map the entire data workflow, from the point of collection to the dataset in its final format, and break it down into the actions performed on the data (see the example below). For every action, identify the quality control procedure to be applied. Ensure the quality control procedures are performed consistently and documented where relevant.
- Ensure actions are reversible. Do not overwrite raw data; save modified files with new names, ideally using a version number or date in the filename to allow clear version identification.
- Standardise and document your workflows so that another person could follow your instructions and achieve the same result. For example, write a step-by-step protocol for data collection, or guidelines for formatting and anonymising interview transcriptions. Follow established procedures where relevant, such as laboratory Standard Operating Procedures, which have been tried and tested.
- If you are conducting experimental scientific research, consider using an online tool such as protocols.io, which allows you to record, annotate and publish detailed information about experimental procedures. You can develop and annotate your protocols (in a closed group or in public) over time in a version-controlled process, and published versions can be assigned DOIs and linked from the methods section in a paper.
- Define your data structures and data collection forms or templates in advance. For example, set up a spreadsheet with variables clearly labelled in column headings, including units of measurement. Provide instructions for data entry in a separate worksheet or document. Your documentation should include a full definition of variables, and information about permitted values for given variables (including missing value codes).
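- Make use of any data validation functions in your software. For example, Excel allows you to specify permitted values for a cell or range of cells.
- Use double entry of data (entering the same data twice, independently, and comparing the two versions) and random sample checking (verifying a random selection of records against the source) to reduce the incidence of error.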
- Review data to check they make sense. Data visualisation can help to identify outliers and anomalies: for example, an obvious spike in a trendline may indicate a suspicious value.
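Two of the strategies above, checking values against documented permitted values and comparing double-entered data, can also be scripted when data are held outside a spreadsheet. The following is a minimal Python sketch, not a definitive implementation. The variable names (`age_years`, `sex`), the permitted ranges, and the missing value code `-999` are hypothetical and would come from your own data documentation.

```python
# Hypothetical missing-value code; use whatever your documentation defines.
MISSING = -999

# Hypothetical codebook: one rule per variable describing its permitted values.
CODEBOOK = {
    "age_years": lambda v: isinstance(v, int) and (v == MISSING or 0 <= v <= 120),
    "sex": lambda v: v in {"F", "M", "X", "NA"},
}


def validate(rows):
    """Return (row_index, variable, value) for every value that breaks the codebook."""
    problems = []
    for i, row in enumerate(rows):
        for variable, is_permitted in CODEBOOK.items():
            value = row.get(variable)
            if value is None or not is_permitted(value):
                problems.append((i, variable, value))
    return problems


def compare_entries(first, second):
    """Double entry check: list every cell where two independent entries disagree."""
    if len(first) != len(second):
        raise ValueError("the two entries have different numbers of rows")
    mismatches = []
    for i, (a, b) in enumerate(zip(first, second)):
        for variable in sorted(set(a) | set(b)):
            if a.get(variable) != b.get(variable):
                mismatches.append((i, variable, a.get(variable), b.get(variable)))
    return mismatches


# Illustrative data only: the same two records typed in twice by different people.
entry_1 = [{"age_years": 34, "sex": "F"}, {"age_years": 210, "sex": "M"}]
entry_2 = [{"age_years": 34, "sex": "F"}, {"age_years": 21, "sex": "M"}]

print(validate(entry_1))                  # values outside the permitted range
print(compare_entries(entry_1, entry_2))  # cells to check against the source
```

In this illustration the mistyped age is caught twice: once because it falls outside the permitted range, and once because the two entries disagree. Neither check says which value is correct, so each flagged cell still needs to be verified against the original record.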
The UK Data Service provides guidance on quality control.
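Reviewing data for suspicious values, as recommended above, can be supported by a simple numerical screen alongside visual inspection. The sketch below uses the modified z-score based on the median absolute deviation, which is one common approach and is an assumption here rather than a method prescribed by this guidance. The function name `flag_outliers`, the threshold of 3.5, and the sample readings are illustrative.

```python
from statistics import median


def flag_outliers(values, threshold=3.5):
    """Return (index, value) pairs whose modified z-score exceeds the threshold.

    The modified z-score is 0.6745 * |x - median| / MAD, where MAD is the
    median absolute deviation. A threshold of 3.5 is a commonly used default.
    """
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        # At least half of the values equal the median, so the score is undefined.
        return []
    return [
        (i, v)
        for i, v in enumerate(values)
        if 0.6745 * abs(v - med) / mad > threshold
    ]


# Illustrative readings only: one value is far from the rest.
readings = [20.1, 20.4, 19.8, 20.0, 48.7, 20.3, 19.9]
print(flag_outliers(readings))
```

A flagged value is a prompt to check the record, not proof of an error: genuine extreme observations should be kept and documented rather than silently removed.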
Example data workflow