Organising your data
Implementing a logical and consistent system to organise your data files allows you and others to locate and use them effectively, and helps to preserve the integrity of your data.
There are three main elements to data organisation:
- a filing system;
- file naming rules;
- a version control policy.
These elements are discussed below. When considering how you will organise your data, bear the following principles in mind:
- Use established conventions and procedures if they exist and meet your needs. Your research group or laboratory may already have standard protocols.
- Ensure everyone involved in the project is aware of and follows the policy. If a policy is not observed or is applied inconsistently it has little value.
- Keep your policies and practices under review. Don't leave files unsorted or sitting loose in top-level folders; weed and tidy your folders periodically, removing redundant files.
- You may want to maintain a retention schedule, with retention and review periods for designated files. This would be particularly important if you are collecting personal data, which will need to be processed lawfully and securely destroyed when no longer required. A simple spreadsheet can be used as a retention schedule.
These principles apply to information in whatever format it is held, whether physical or digital.
The following guidance mainly concerns the storage of digital information. For further information on good practice, see the guidance on data organisation from the UK Data Service and MIT Libraries.
Use a logical, hierarchical folder structure to store your files, grouping files in categories, and descending from broad high-level categories to more specific folders within these. There is no single right way to do this; the important thing is that the structure is logical, legible, and meaningful for its purpose. For example, you could organise files into folders according to task (e.g. work package, experiment), then a significant defining property (e.g. location, sample number, run, company name) or type of data (e.g. raw, processed, final). You would probably have separate folders for data, administrative documentation, publications, etc.
Don't let your folder structure become too complicated, and avoid too many layers in your hierarchy (three is comfortable; ideally no more than four).
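As an illustration of the kind of structure described above, the following is a minimal Python sketch that creates a shallow folder skeleton. The folder names, the `LAYOUT` dictionary and the helper functions are hypothetical examples, not a recommended standard; adapt them to your own project.

```python
from pathlib import Path

# Hypothetical layout: broad categories first, then type of data.
# All names here are illustrative assumptions.
LAYOUT = {
    "data": ["raw", "processed", "final"],
    "documentation": [],
    "publications": [],
}


def create_skeleton(root: Path, layout: dict) -> list:
    """Create the folders in `layout` under `root` and return them, sorted."""
    created = []
    for top, subfolders in layout.items():
        for folder in [root / top] + [root / top / sub for sub in subfolders]:
            # exist_ok=True makes the function safe to run more than once
            folder.mkdir(parents=True, exist_ok=True)
            created.append(folder)
    return sorted(created)


def depth_below(root: Path, folder: Path) -> int:
    """Number of folder levels between `root` and `folder`."""
    return len(folder.relative_to(root).parts)
```

Calling `create_skeleton(Path("ABCProject"), LAYOUT)` would create six folders, none more than two levels below the project root, which keeps the hierarchy well within the three or four layers suggested above.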
Confidential information, such as participant records, should be stored in separate folders with appropriate access controls. The owner of a fileshare on the University network can manage individual access permissions to the fileshare and the folders within it. For more information, consult IT guidance on managing the security groups for collaborative file shares (login required).
Raw data and milestone document versions should be saved as read-only files, ideally in separate folders. Contact IT if you require assistance managing file permissions.
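On a local drive, a file can be marked read-only with a few lines of Python. This is a minimal sketch: the function names are hypothetical, and it assumes you own the file. It guards against accidental overwriting only, since the owner can restore write permission, and permissions on a network fileshare may be managed differently.

```python
import stat
from pathlib import Path

# All three write bits: owner, group and others
WRITE_BITS = stat.S_IWUSR | stat.S_IWGRP | stat.S_IWOTH


def make_read_only(path: Path) -> None:
    """Remove write permission from `path`, leaving other bits unchanged."""
    mode = path.stat().st_mode
    path.chmod(mode & ~WRITE_BITS)


def is_read_only(path: Path) -> bool:
    """True if no write bit is set on `path`."""
    return not (path.stat().st_mode & WRITE_BITS)
```

For example, `make_read_only(Path("data/raw/run_001.csv"))` would protect a (hypothetical) raw data file once it has been saved.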
Intelligent use of file naming enables you and others to easily identify the contents of a file, and can be used to organise and version-control files. This principle applies whether you are storing digital or physical materials. It can be very important if you are generating large numbers of files, for example by some automated process.
You don't have to force all files into a rigid convention, but if you adopt some basic standards they will help you find and organise files. For example, by always writing dates in YYYYMMDD format, you will be able to sort files chronologically. The following suggestions may help you to develop a serviceable file naming protocol:
- Use short but meaningful file names built from significant elements, e.g. ABCProject_Interview_SmithJohn_2014-06-18. You should be able to tell what is in a file by looking at the filename. Some properties you might use include: project identifier, data collection method or instrument, data type, location, subject, date, version number.
- Don't make file names too long (32 characters should be the maximum). Avoid redundant information in file names and file paths.
- Avoid spaces in file names; you can use _ or - to separate elements, or run them together using CamelCase.
- Consider the sort order of your files, as this will aid identification and retrieval. Files will sort according to the types of characters used in their names, with special characters first (e.g. @), followed by numbers, and then alphabetic characters. For example, renamed copies of the file datafile.txt would sort in this order: @_datafile.txt, 001_datafile.txt, 20190731_datafile.txt. Sort order can differ between tools, so check how it works in the software you use.
- Write dates from the largest unit to the smallest (e.g. 2014-06-18) to sort chronologically; write numbers using leading zeros (i.e. 001, 002 etc., not 1, 2, etc.) to sort numerically; and enter names surname first (e.g. SmithJohn).
- Embed version control in filenames where this is relevant: date and time or version numbers will enable accurate identification of current and previous versions of files.
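If files are generated by a script, the naming suggestions above can be applied automatically. The following Python sketch is one possible approach: the function `build_file_name`, its parameters and the trailing sequence number are hypothetical choices for illustration, not a prescribed convention.

```python
from datetime import date


def build_file_name(project: str, method: str, surname: str, forename: str,
                    when: date, sequence: int, extension: str) -> str:
    """Join name elements with underscores, always in the same order."""
    elements = [
        project,
        method,
        f"{surname}{forename}",   # surname first
        when.isoformat(),         # YYYY-MM-DD sorts chronologically
        f"{sequence:03d}",        # leading zeros sort numerically
    ]
    name = "_".join(elements)
    if " " in name:
        raise ValueError(f"file name contains spaces: {name!r}")
    return f"{name}.{extension}"


names = [
    build_file_name("ABCProject", "Interview", "Smith", "John", date(2014, 6, 18), 10, "txt"),
    build_file_name("ABCProject", "Interview", "Smith", "John", date(2014, 6, 18), 2, "txt"),
    build_file_name("ABCProject", "Interview", "Smith", "John", date(2013, 12, 1), 1, "txt"),
]

# A plain alphabetical sort now gives date order, then sequence order
for name in sorted(names):
    print(name)
# ABCProject_Interview_SmithJohn_2013-12-01_001.txt
# ABCProject_Interview_SmithJohn_2014-06-18_002.txt
# ABCProject_Interview_SmithJohn_2014-06-18_010.txt
```

Without the leading zeros, sequence 10 would sort before sequence 2, because names are compared character by character.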
Version control or versioning is a system to record changes to a file or set of files over time. It is important at all times when working with digital items, which can easily be modified. It is essential if you are working in a research group and sharing and modifying files between yourselves. Uncontrolled versions of files modified by different people can easily proliferate, causing you to lose track of your data and the transformations they have gone through. In the worst-case scenario this can compromise the integrity of the data - for example, if a raw data file is overwritten.
There are some simple things you can do to put in place effective version control. You need not use all of the following; which you choose will depend on the nature of the work and the processes the data undergo. More detailed guidance on version control is available from the UK Data Service.
- Only allow authorised users to modify files. Use access control and read/write permissions on files and storage areas to restrict the ability to modify files to authorised users only.
- Raw data files, and master and milestone versions of files, should be made read-only and stored in separate designated locations under a nominated authority.
- Store non-current versions of files in separate folders. You may not need to keep all old versions of files, but it may be good practice to retain milestone versions or old master files.
- Use file sharing services such as Dropbox or Google Drive to synchronise versions of files stored in multiple locations, or use versioning software, e.g. Subversion (SVN), MS SyncToy.
- Document changes in a version control table within the document. This should contain headings for Version number, Author, Purpose/Change, and Date.
- Use file names to identify versions, e.g. draft, final, v_001.
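Where version numbers are embedded in file names, a small helper can work out the next number so that an existing version is never overwritten. This Python sketch assumes a hypothetical suffix pattern of `_v001`, `_v002` and so on; the function name and the pattern are illustrative, and it is not a substitute for dedicated versioning software.

```python
import re
from pathlib import Path

# Assumed naming pattern: <stem>_v<three digits>, e.g. report_v001
VERSION_PATTERN = re.compile(r"^(?P<stem>.+)_v(?P<number>\d{3})$")


def next_version_name(folder: Path, stem: str, extension: str) -> str:
    """Return the next unused versioned file name for `stem` in `folder`."""
    highest = 0
    for path in folder.glob(f"{stem}_v*.{extension}"):
        match = VERSION_PATTERN.match(path.stem)
        # Ignore files such as report_vdraft.docx that don't fit the pattern
        if match and match.group("stem") == stem:
            highest = max(highest, int(match.group("number")))
    return f"{stem}_v{highest + 1:03d}.{extension}"
```

If a folder already holds `report_v001.docx` and `report_v002.docx`, then `next_version_name(folder, "report", "docx")` returns `report_v003.docx`; in an empty folder it returns `report_v001.docx`.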