LibGuides: Research Data Management: Data Formats

Choosing File Formats

Decide which file formats to use for your research data when you start planning a research project, because the choice determines how the data can be used, analysed, stored, and reused in the future.

Here are some questions you may need to consider.

What types of data will be generated?
Are you using file formats that are standard for your type of data?
Are you using file formats that are commonly used in your research areas?
Are these formats easy to share with your colleagues or others who need access to the data?
Do these formats facilitate use and re-use of your data in the future (e.g., open standard/non-proprietary)?
Are there any special requirements for reading and manipulating your research data (e.g., operating systems, software or tools)?

Changing File Formats

Keep a copy of the data in its original file format when you convert it to another format. The original file lets you recover from any unexpected damage caused by the conversion. File conversion carries the following risks of information loss:

Loss of content (data)
Loss of characteristics of the file stored within the file (metadata)
Loss of layout or formatting (e.g. in text files)
Loss of quality (e.g. in graphic or video files)

Source: Research data management - looking after file formats, University of Amsterdam
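
The advice above on keeping the original file during a conversion can be illustrated with a short Python sketch. It is a minimal example under stated assumptions, not a prescribed workflow: it assumes the data is a UTF-8, tab-delimited text file being converted to CSV, and the function names (`sha256_of`, `read_rows`, `convert_tab_to_csv`) are hypothetical. It writes the converted copy to a new file, checks that no cell content was lost, and confirms that the original is unchanged.

```python
import csv
import hashlib
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file's bytes."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def read_rows(path: Path, delimiter: str) -> list[list[str]]:
    """Read every row of a delimited text file as a list of strings."""
    with path.open(newline="", encoding="utf-8") as handle:
        return list(csv.reader(handle, delimiter=delimiter))


def convert_tab_to_csv(original: Path, converted: Path) -> None:
    """Write a CSV copy of a tab-delimited file; the original is only read."""
    checksum_before = sha256_of(original)
    rows = read_rows(original, delimiter="\t")
    with converted.open("w", newline="", encoding="utf-8") as handle:
        csv.writer(handle).writerows(rows)
    # Check for loss of content: the converted file must yield the same cells.
    if read_rows(converted, delimiter=",") != rows:
        raise ValueError("converted file does not match the original content")
    # Check that the original copy is still intact.
    if sha256_of(original) != checksum_before:
        raise ValueError("original file changed during conversion")
```

A check like this only covers loss of content. Loss of metadata, layout, or quality depends on the formats involved and usually needs a format-specific comparison.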

Guidelines for Selecting File Formats

Open/Non-proprietary formats

Unlike proprietary formats, which are owned by individuals or corporations, open formats are developed and maintained by communities of interest.

Use open, non-proprietary formats where possible, as they are more likely to remain usable in the long term. For example, if your research data is created by a proprietary programme that is the only way to access and analyse the data, the data may become unusable or inaccessible once that programme is no longer available.

Examples of proprietary formats: .psd, .xlsx
Examples of non-proprietary formats: .tiff, .txt, .csv

"Lossless" formats

Please be aware that some compressed formats are smaller in size but may lose information. Use "lossless" formats when you store your research data.

Examples of lossy formats: .mp3, .jpeg
Examples of lossless formats: .wav, .tiff
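
The difference between lossless and lossy storage can be shown with a small Python sketch. It is an illustration only: `zlib` is a general-purpose lossless compressor used here as a stand-in, not an audio or image codec, and the rounding step is a simplified analogy for what lossy formats do. The sample data and variable names are made up for the example.

```python
import zlib

# Made-up sample data: the same reading repeated many times.
original = b"temperature,21.437\n" * 1000

# Lossless: compression shrinks the data, and decompression restores every byte.
compressed = zlib.compress(original)
restored = zlib.decompress(compressed)
lossless_ok = restored == original
smaller = len(compressed) < len(original)

# Lossy (analogy only): rounding discards precision that cannot be recovered
# from the stored value afterwards.
measurement = 21.437
stored = round(measurement, 1)
precision_lost = stored != measurement

print(lossless_ok, smaller, precision_lost)
```

In the lossless case the restored bytes are identical to the original, so nothing is lost by storing the smaller version. In the lossy case the discarded digits are gone for good, which is why lossless formats are preferred for data you intend to keep.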

Unencrypted and uncompiled formats

Unencrypted and uncompiled formats are more sustainable: unencrypted files can be read without a key, and uncompiled source code can be recompiled on different architectures and platforms.

Commonly used by the research community

Select a file format that is commonly used in your research area. This makes it easier for research communities and other interested parties to share and re-use your data.

Adapted from: File formats, Library of Congress

Optimal File Formats

The table below lists optimal file formats, adapted from the UK Data Archive.

Type of data	Acceptable formats for sharing, reuse and preservation	Other acceptable formats for data preservation
Quantitative tabular data with extensive metadata (a dataset with variable labels, code labels, and defined missing values, in addition to the matrix of data)	SPSS portable format (.por); delimited text and command ('setup') file (SPSS, Stata, SAS, etc.) containing metadata information; some structured text or mark-up file containing metadata information, e.g. DDI XML file	proprietary formats of statistical packages, e.g. SPSS (.sav), Stata (.dta); MS Access (.mdb/.accdb)
Quantitative tabular data with minimal metadata (a matrix of data with or without column headings or variable names, but no other metadata or labelling)	comma-separated values (CSV) file (.csv); tab-delimited file (.tab); delimited text of given character set with SQL data definition statements where appropriate	delimited text of given character set - only characters not present in the data should be used as delimiters (.txt); widely-used formats, e.g. MS Excel (.xls/.xlsx), MS Access (.mdb/.accdb), dBase (.dbf) and OpenDocument Spreadsheet (.ods)
Geospatial data (vector and raster data)	ESRI Shapefile (essential - .shp, .shx, .dbf, optional - .prj, .sbx, .sbn); geo-referenced TIFF (.tif, .tfw); CAD data (.dwg); tabular GIS attribute data	ESRI Geodatabase format (.mdb); MapInfo Interchange Format (.mif) for vector data; Keyhole Mark-up Language (KML) (.kml); Adobe Illustrator (.ai), CAD data (.dxf or .svg); binary formats of GIS and CAD packages
Qualitative data (textual)	eXtensible Mark-up Language (XML) text according to an appropriate Document Type Definition (DTD) or schema (.xml); Rich Text Format (.rtf); plain text data, ASCII (.txt)	Hypertext Mark-up Language (HTML) (.html); widely-used proprietary formats, e.g. MS Word (.doc/.docx); some proprietary/software-specific formats, e.g. NUD*IST, NVivo and ATLAS.ti
Digital image data	TIFF version 6 uncompressed (.tif)	JPEG (.jpeg, .jpg) but only if created in this format; TIFF (other versions) (.tif, .tiff); Adobe Portable Document Format (PDF/A, PDF) (.pdf); standard applicable RAW image format (.raw); Photoshop files (.psd)
Digital audio data	Free Lossless Audio Codec (FLAC) (.flac)	MPEG-1 Audio Layer 3 (.mp3) but only if created in this format; Audio Interchange File Format (AIFF) (.aif); Waveform Audio Format (WAV) (.wav)
Digital video data	MPEG-4 (.mp4); OGG video (.ogv, .ogg); motion JPEG 2000 (.mj2)	MOV (.mov); Windows Media Video (WMV) (.wmv); WebM (.webm)
Documentation and scripts	Rich Text Format (.rtf); PDF/A or PDF (.pdf); HTML (.htm); OpenDocument Text (.odt)	plain text (.txt); some widely-used proprietary formats, e.g. MS Word (.doc/.docx) or MS Excel (.xls/.xlsx); XML marked-up text (.xml) according to an appropriate DTD or schema, e.g. XHTML 1.0

Source: File formats recommended by the UK Data Service
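
The table's note on delimited text, that only characters not present in the data should be used as delimiters, can be sketched in Python as follows. This is a minimal illustration, not part of the UK Data Service guidance: the candidate list and the function names (`pick_delimiter`, `to_delimited_text`) are hypothetical, and the sketch assumes every cell is already a string.

```python
import csv
import io

# Hypothetical preference order of delimiters to try.
CANDIDATE_DELIMITERS = [",", "\t", ";", "|"]


def pick_delimiter(rows):
    """Return the first candidate delimiter that appears in no cell."""
    for delimiter in CANDIDATE_DELIMITERS:
        if not any(delimiter in cell for row in rows for cell in row):
            return delimiter
    raise ValueError("every candidate delimiter occurs in the data")


def to_delimited_text(rows):
    """Serialise rows of strings as delimited text; return (text, delimiter)."""
    delimiter = pick_delimiter(rows)
    buffer = io.StringIO(newline="")
    writer = csv.writer(buffer, delimiter=delimiter, lineterminator="\n")
    writer.writerows(rows)
    return buffer.getvalue(), delimiter
```

When saving the result to disk, pass an explicit character set (for example `encoding="utf-8"` to `open`) and record both the delimiter and the encoding in the accompanying documentation, so the file can be read back without guesswork.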