Decide which file formats to use for your research data when you start planning a research project, because the choice determines how the data can be used, analysed, stored, and reused in the future.
Here are some questions you may need to consider.
Keep a copy of the data in its original file format whenever you convert it to another format. The original file can be used to repair any unexpected damage caused by the conversion, because file conversions carry a risk of information loss.
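As a minimal sketch of keeping the original before a conversion, the Python snippet below copies a file into a separate folder and records a SHA-256 checksum beside it, so the untouched original can be verified later. The function names `sha256_of` and `preserve_original`, the folder layout, and the `.sha256` sidecar file are illustrative assumptions rather than a prescribed workflow.

```python
import hashlib
import shutil
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def preserve_original(source: Path, archive_dir: Path) -> Path:
    """Copy `source` into `archive_dir` and write its checksum beside it."""
    archive_dir.mkdir(parents=True, exist_ok=True)
    copy_path = archive_dir / source.name
    shutil.copy2(source, copy_path)  # copy2 also keeps timestamps where possible

    # Confirm the copy is byte-identical before relying on it.
    original_digest = sha256_of(source)
    if sha256_of(copy_path) != original_digest:
        raise IOError(f"archived copy of {source} does not match the original")

    # Hypothetical sidecar file: "<digest>  <file name>".
    checksum_file = archive_dir / (source.name + ".sha256")
    checksum_file.write_text(f"{original_digest}  {source.name}\n")
    return copy_path
```

Calling `preserve_original(Path("survey.sav"), Path("originals"))` before running a conversion leaves an unmodified copy in `originals/`; recomputing the checksum later shows whether that copy has changed.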
Format characteristic | Recommendation |
---|---|
Open/non-proprietary formats | Unlike proprietary formats, which are owned by individuals or corporations, open formats are developed and maintained by communities of interest. Use open, non-proprietary formats, which are more likely to remain sustainable in the long term. For example, if your research data is created by a proprietary programme that is the only means of accessing and analysing it, the data may become unusable or inaccessible once that programme is no longer available. |
"Lossless" formats | Be aware that some compressed formats, although smaller in size, can lose information. Use "lossless" formats when you keep your research data. |
Unencrypted and uncompiled formats | Unencrypted and uncompiled formats are more sustainable because recompiling is possible on different architectures and platforms. |
Commonly used by the research community | Select a file format that is commonly used in your research area. This makes it easier for research communities and other interested parties to share and reuse your data. |
Adapted from: File formats, Library of Congress
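To illustrate what "lossless" means in practice, the short Python sketch below compresses some bytes and checks that decompression restores them exactly. It uses the standard-library `gzip` module only as a convenient example of lossless compression; the helper name `is_lossless_round_trip` and the sample data are assumptions made for this illustration.

```python
import gzip


def is_lossless_round_trip(data: bytes) -> bool:
    """Compress and decompress `data`, then check that nothing changed."""
    compressed = gzip.compress(data)
    restored = gzip.decompress(compressed)
    return restored == data


# Made-up tabular content, repeated so that compression has something to gain.
sample = b"id,score\n1,0.25\n2,0.75\n" * 1000

print(is_lossless_round_trip(sample))            # True
print(len(gzip.compress(sample)) < len(sample))  # True: smaller, yet nothing lost
```

A lossy format, by contrast, cannot pass this kind of check, because the discarded information is not recoverable from the compressed file.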
The following table of recommended file formats is adapted from the UK Data Archive.
Type of data | Acceptable formats for sharing, reuse and preservation | Other acceptable formats for data preservation |
---|---|---|
Quantitative tabular data with extensive metadata: a dataset with variable labels, code labels, and defined missing values, in addition to the matrix of data | SPSS portable format (.por); delimited text and command ('setup') file (SPSS, Stata, SAS, etc.) containing metadata information; some structured text or mark-up file containing metadata information, e.g. DDI XML file | proprietary formats of statistical packages, e.g. SPSS (.sav), Stata (.dta); MS Access (.mdb/.accdb) |
Quantitative tabular data with minimal metadata: a matrix of data with or without column headings or variable names, but no other metadata or labelling | comma-separated values (CSV) file (.csv); tab-delimited file (.tab), including delimited text of given character set with SQL data definition statements where appropriate | delimited text of given character set - only characters not present in the data should be used as delimiters (.txt); widely-used formats, e.g. MS Excel (.xls/.xlsx), MS Access (.mdb/.accdb), dBase (.dbf) and OpenDocument Spreadsheet (.ods) |
Geospatial data: vector and raster data | ESRI Shapefile (essential - .shp, .shx, .dbf, optional - .prj, .sbx, .sbn); geo-referenced TIFF (.tif, .tfw); CAD data (.dwg); tabular GIS attribute data | ESRI Geodatabase format (.mdb); MapInfo Interchange Format (.mif) for vector data; Keyhole Mark-up Language (KML) (.kml); Adobe Illustrator (.ai), CAD data (.dxf or .svg); binary formats of GIS and CAD packages |
Qualitative data: textual | eXtensible Mark-up Language (XML) text according to an appropriate Document Type Definition (DTD) or schema (.xml); Rich Text Format (.rtf); plain text data, ASCII (.txt) | Hypertext Mark-up Language (HTML) (.html); widely-used proprietary formats, e.g. MS Word (.doc/.docx); some proprietary/software-specific formats, e.g. NUD*IST, NVivo and ATLAS.ti |
Digital image data | TIFF version 6 uncompressed (.tif) | JPEG (.jpeg, .jpg) but only if created in this format; TIFF (other versions) (.tif, .tiff); Adobe Portable Document Format (PDF/A, PDF) (.pdf); standard applicable RAW image format (.raw); Photoshop files (.psd) |
Digital audio data | Free Lossless Audio Codec (FLAC) (.flac) | MPEG-1 Audio Layer 3 (.mp3) but only if created in this format; Audio Interchange File Format (AIFF) (.aif); Waveform Audio Format (WAV) (.wav) |
Digital video data | MPEG-4 (.mp4); OGG video (.ogv, .ogg); motion JPEG 2000 (.mj2) | MOV (.mov); Windows Media Video (WMV) (.wmv); WebM (.webm) |
Documentation and scripts | Rich Text Format (.rtf); PDF/A or PDF (.pdf); HTML (.htm); OpenDocument Text (.odt) | plain text (.txt); some widely-used proprietary formats, e.g. MS Word (.doc/.docx) or MS Excel (.xls/.xlsx); XML marked-up text (.xml) according to an appropriate DTD or schema, e.g. XHTML 1.0 |
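The table can also be used programmatically, for example to flag files in a project folder whose format is not listed. The Python sketch below is a rough illustration that transcribes only three rows of the table (minimal-metadata tabular data, audio, and video); the dictionary `FORMAT_TIERS`, the short data-type keys, and the function `classify` are hypothetical names chosen for this example. Because the same extension can sit in different columns for different data types, the lookup is keyed by data type as well as extension.

```python
from pathlib import Path

# Partial transcription of the table above; extend it for other data types.
# Caveats such as "only if created in this format" (.mp3) are not modelled.
FORMAT_TIERS = {
    "tabular": {
        "preferred": {".csv", ".tab"},
        "other": {".txt", ".xls", ".xlsx", ".mdb", ".accdb", ".dbf", ".ods"},
    },
    "audio": {
        "preferred": {".flac"},
        "other": {".mp3", ".aif", ".wav"},
    },
    "video": {
        "preferred": {".mp4", ".ogv", ".ogg", ".mj2"},
        "other": {".mov", ".wmv", ".webm"},
    },
}


def classify(filename: str, data_type: str) -> str:
    """Return 'preferred', 'other', or 'not listed' for a file of a data type."""
    suffix = Path(filename).suffix.lower()  # compare extensions case-insensitively
    tiers = FORMAT_TIERS[data_type]
    for tier in ("preferred", "other"):
        if suffix in tiers[tier]:
            return tier
    return "not listed"


print(classify("interview01.WAV", "audio"))  # other
print(classify("responses.csv", "tabular"))  # preferred
print(classify("clip.avi", "video"))         # not listed
```

Here "preferred" corresponds to the column of formats acceptable for sharing, reuse and preservation, and "other" to the column of other formats acceptable for preservation.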