The file formats you use for your data may affect what you can do with the data and how effectively they can be preserved and shared. In practice your choice of file formats may be governed by the standards in your discipline, or the types of hardware and software you use in your research, but you should follow best practice principles as far as possible.
The UK Data Service provides detailed advice on formatting your data, including recommendations on optimal formats for preservation. The University provides guidance on recommended file formats (PDF) for deposit of data in the Research Data Archive.
Proprietary and open formats
File formats may be proprietary, such as Microsoft Excel and Adobe PDF, or open, such as comma-separated values (CSV) or Open Document Format (ODF).
The best formats for data collection and analysis may not be the most suitable formats for long-term data preservation. Proprietary formats can provide rich, highly specified functionality, but may limit the usability of your data and be high-risk in the long term, as they are commercial products, available under licence only and prone to obsolescence.
Open formats may lack rich functionality and be more generic, but they provide high usability and carry a low risk over the long term because there are no licence fees, their specifications are publicly available, and they can be rendered by multiple software packages.
Working and preservation formats
For day-to-day working, use file formats that are fit for purpose and accessible to your research group. For example, you may use Microsoft Excel for quantitative data analysis and visualisation.
For long-term preservation, where possible, you should store data in open or widely-used formats, and plan for conversion from proprietary formats where necessary. For detailed information about any of the formats mentioned below, refer to Library of Congress format assessments.
Suitable preservation formats may be:
- open formats, such as CSV for tabular data, ASCII text (.txt) and PDF/A for text and documentation, XML with an appropriate Document Type Definition (DTD) for structured machine-readable information, JPEG for images, FLAC for audio, and MPEG-4 for video. Included in this category are self-describing formats encoded in text files, where the file contains a header with information about the variables reported in the body of the file: examples include the NetCDF format used in climate system models, and the FASTA format for representing nucleotide or peptide sequences;
- widely-used proprietary formats, such as MS Excel and MS Access for tabular data and databases, MS Word for text, TIFF 6.0 uncompressed for images, and MP3 or WAV for audio.
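The self-describing formats mentioned above can be illustrated with a short sketch. The header convention below is invented for illustration and is not the actual NetCDF or FASTA specification; the point is that a plain-text file can carry its own documentation alongside the data:

```python
import csv
import io

# A hypothetical self-describing text format: comment lines beginning
# with '#' form a header documenting the variables reported in the
# body, which is plain CSV readable by any software.
raw = """\
# title: Daily rainfall observations
# variables: date (ISO 8601), rainfall_mm (millimetres)
date,rainfall_mm
2024-03-01,4.2
2024-03-02,0.0
"""

# Separate the header metadata from the data rows.
metadata = {}
data_lines = []
for line in raw.splitlines():
    if line.startswith("#"):
        key, _, value = line.lstrip("# ").partition(": ")
        metadata[key] = value
    else:
        data_lines.append(line)

rows = list(csv.DictReader(io.StringIO("\n".join(data_lines))))
print(metadata["title"])       # Daily rainfall observations
print(rows[0]["rainfall_mm"])  # 4.2
```

Because both the metadata and the data are plain text, a future user needs no special software to understand what the variables mean.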
For example, raw instrument data in a proprietary format may be preserved as-is but also, or instead, converted into an ASCII/CSV format to be more widely accessible; data analysed in proprietary software, such as MATLAB or SPSS, should be preserved in a format accessible to users without a software licence.
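As a sketch of such a conversion, the snippet below writes some invented tabular records to CSV using only the Python standard library (the field names and filename are hypothetical):

```python
import csv

# Hypothetical tabular data, as might be exported from a proprietary
# analysis package before deposit in an archive.
records = [
    {"sample_id": "S001", "ph": 6.8, "temperature_c": 21.5},
    {"sample_id": "S002", "ph": 7.1, "temperature_c": 22.0},
]

# Write the data to CSV: an open, plain-text format that any future
# user can read without a software licence.
with open("samples.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["sample_id", "ph", "temperature_c"])
    writer.writeheader()
    writer.writerows(records)
```

Note that a conversion like this captures the values but not any formulas, formatting, or cell comments that may have existed in the original file, which is why the richer proprietary copy can be retained alongside it.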
In some conversions you may lose rich features and formatting, but you have a greater chance of retaining the integrity of the content in the long run. If the richer features provided by a proprietary format add value to your data, you can always retain the data in that format as well. Popular formats such as Adobe PDF and those of Microsoft applications are likely to endure for many years.
Image and audiovisual files may need to be preserved at the most information-rich level in order to support future uses, but practical considerations of usability may also apply. For example, an uncompressed TIFF file will preserve the highest level of information; by comparison, a lossy compressed format such as JPEG will preserve less information, but has the practical benefit that file sizes are smaller and faster to serve online.
Use of open programming languages such as Python and R to process and analyse data can have functional advantages over 'point-and-click' proprietary software, as well as making the analysis intrinsically reproducible.
For example, to undertake statistical analysis of your data you could use SPSS, which is proprietary software and requires a licence. Because operations are performed by interacting with a graphical user interface, there is no script of your operations that can be automated. Anyone wishing to replicate your analysis would need to access SPSS, import your data, and reconstruct the analysis on the basis of information provided by you.
If instead you use the free programming language R, you can conduct your analysis without having to access proprietary software, and you can preserve the full analysis workflow by saving your scripts as text files. You or anyone else with these scripts can then re-run exactly the same analysis by executing the code; because the analysis is automated, it is reproducible. Because a software licence is not required to run the analysis, the method is also more transparent.
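The same principle applies to any open scripting language. As a minimal illustration in Python (the data values here are invented), the entire analysis lives in a script that anyone can re-run with no licence required:

```python
import statistics

# Invented example data: reaction times in milliseconds.
reaction_times = [312, 298, 305, 330, 287, 301]

# Because the whole analysis is expressed as code, re-running this
# script reproduces exactly the same results every time.
mean_rt = statistics.mean(reaction_times)
sd_rt = statistics.stdev(reaction_times)

print(f"mean = {mean_rt:.1f} ms, sd = {sd_rt:.1f} ms")
# prints "mean = 305.5 ms, sd = 14.6 ms"
```

Depositing a script like this alongside the data preserves not just the results but the method itself.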