Your research activity will generally create a lot of material, and understanding how to handle this is not always straightforward.
Creating and Using Research Data
Understanding the difference between research “data” and research “records” is often the first hurdle. Two questions help:
“Will I need this material to support a publication, or validate my research findings?”
“Will this item form part of a finalised data set once my work is complete?”
If the answer to either question is “yes”, the material is part of your research data. Research records will usually need to be kept too, for audit purposes.
Research data can include:
Funding bodies usually like to see that the data you gather or create fills a gap in knowledge and require you to demonstrate this. It is often cost effective to re-use data created elsewhere in different ways, perhaps creating a “mash-up” of data from different sources to demonstrate something new. This is attractive to funding bodies, because it means they are not funding the same data gathering exercises twice.
Many public research funding bodies and publishers now require that data be made publicly available. You need to understand the terms of your funding agreement before you start, to make sure you take this into account.
Data Management Planning
A Data Management Plan (DMP) helps researchers and research students with their research methodology. Data management planning is an RGU requirement and in many cases it is now becoming a funder requirement at the point of submission.
A DMP covers the following basics:
To help researchers, templates are available via the DMPonline tool. The tool includes video tuition and RGU users can log in using their institutional credentials. Researchers who plan to submit to a funder for which there is no prepared template can still use this tool, which will provide a simple standard template. Research students can also use this template for planning data handling during their studies.
Workshops on data management planning and data handling are held regularly throughout the academic year.
Know your legal, ethical and other obligations regarding research data, towards research participants, colleagues, research funders and institutions
In April 2010, the Digital Curation Centre (DCC) launched DMP Online, a web-based tool designed to help researchers and other data stakeholders develop data management plans according to the requirements of major research funders.
Using the tool researchers can create, store and update multiple versions of a data management plan at the grant application stage and during the research cycle. Plans can be customised and exported in various formats. Funder- and institution-specific best practice guidance is available.
The tool combines the DCC’s comprehensive ‘Checklist for a Data Management Plan’ with an analysis of research funder requirements. The DCC is working with partner organisations to include domain- and subject- specific guidance in the tool.
The Rural Economy and Land Use (RELU) Programme has been at the forefront of implementing data management planning for research projects since 2004. Drawing on best practice in data management and sharing across three research councils (ESRC, NERC and BBSRC), RELU requires that all funded projects develop and implement a Data Management Plan to ensure that data are well managed throughout the duration of a research project. In a data management plan researchers describe:
The format and software used to create research data usually depend on the hardware or software available, on how researchers plan to analyse the data and, in some cases, on discipline-specific standards and customs.
Despite the backward compatibility of many software packages to import data created in previous software versions, the safest option to guarantee long-term data access is to convert data to standard formats.
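As an illustration of converting data to a standard format, the sketch below exports one table from a database file to plain CSV. It is a minimal example only: it assumes the data happen to sit in a SQLite database, and the function name `export_table_to_csv` is hypothetical. Other source formats would need their own export route.

```python
import csv
import sqlite3
from pathlib import Path


def export_table_to_csv(db_path: Path, table: str, csv_path: Path) -> int:
    """Write every row of `table` to a UTF-8 CSV file with a header row.

    Returns the number of data rows written.
    """
    conn = sqlite3.connect(str(db_path))
    try:
        # Table names cannot be bound as SQL parameters, so check the name
        # against the database's own catalogue before interpolating it.
        known = {row[0] for row in conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table'")}
        if table not in known:
            raise ValueError(f"no such table: {table}")
        cursor = conn.execute(f'SELECT * FROM "{table}"')
        header = [column[0] for column in cursor.description]
        count = 0
        with open(csv_path, "w", newline="", encoding="utf-8") as handle:
            writer = csv.writer(handle)
            writer.writerow(header)
            for row in cursor:
                writer.writerow(row)
                count += 1
        return count
    finally:
        conn.close()
```

Keeping the header row in the CSV file means the column names travel with the data, although units and coding schemes still need to be described in the documentation.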
Well-organised file names and folder structures make it easier to find and keep track of data files. Develop a system that works for your project and use it consistently. Whilst computers add basic information and properties to a file, this is not reliable data management. It is better to record essential information in file names or through the folder structure. Think carefully about how best to structure files in folders, so that files and versions are easy to locate and organise. When working in collaboration, the need for an orderly structure is even greater.
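One way to apply a naming system consistently is to generate file names rather than type them by hand. The sketch below assumes a hypothetical convention of project, short description, creation date and version number, in that order; both the convention and the function name `build_file_name` are illustrative, not an RGU standard.

```python
import re
from datetime import date


def build_file_name(project: str, description: str, created: date,
                    version: int, extension: str) -> str:
    """Return a name such as 'reef_site-survey_2024-03-05_v02.csv'."""
    def slug(text: str) -> str:
        # Lower-case the text and replace each run of characters other
        # than letters and digits with a single hyphen.
        return re.sub(r"[^a-z0-9]+", "-", text.lower()).strip("-")

    if version < 1:
        raise ValueError("version numbers start at 1")
    return (f"{slug(project)}_{slug(description)}_"
            f"{created.isoformat()}_v{version:02d}."
            f"{extension.lstrip('.').lower()}")
```

Writing the date as year-month-day and padding the version number means that an alphabetical file listing is also a chronological one.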
It is important to ensure that different versions of files, related files held in different locations, and information that is cross-referenced between files are all subject to version control. It can be difficult to locate a correct version or to know how versions differ after some time has elapsed.
It is important to keep track of master versions of files, for example the latest iteration, especially where data files are shared between people or locations, e.g. on both a PC and a laptop. Checks and procedures may also need to be put in place to make sure that if the information in one file is altered, the related information in other files is also updated.
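Keeping a simple version log alongside the data is one way to make it clear which file is the master version and how versions differ. The sketch below appends an entry to a CSV log and works out the next version number for a file. The log layout (`file`, `version`, `date`, `changes`) and the function name `record_version` are assumptions made for illustration.

```python
import csv
from datetime import date
from pathlib import Path

FIELDS = ["file", "version", "date", "changes"]


def record_version(log_path: Path, file_name: str, changes: str,
                   on: date) -> int:
    """Append a log entry for `file_name` and return its new version number."""
    is_new_log = not log_path.exists()
    rows = []
    if not is_new_log:
        with open(log_path, newline="", encoding="utf-8") as handle:
            rows = list(csv.DictReader(handle))
    # The next version is one more than the highest already logged for this file.
    previous = [int(row["version"]) for row in rows if row["file"] == file_name]
    version = max(previous, default=0) + 1
    with open(log_path, "a", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=FIELDS)
        if is_new_log:
            writer.writeheader()
        writer.writerow({"file": file_name, "version": version,
                         "date": on.isoformat(), "changes": changes})
    return version
```

Because the log is a plain CSV file, collaborators can also read it in a spreadsheet without any special software.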
Because digital information can be copied or altered so easily, it is important to be able to demonstrate the authenticity of data and to be able to prevent unauthorised access to data that may potentially lead to unauthorised changes.
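A common way to demonstrate that a file has not been altered is to record a checksum when the master copy is created and to recompute it later. The sketch below uses SHA-256 from Python's standard library; the function names are hypothetical, and where the recorded checksum is kept (for example in a read-only log) is left to the project.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 checksum of a file as a hexadecimal string."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read in chunks so that large data files do not have to fit in memory.
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_unchanged(path: Path, recorded_checksum: str) -> bool:
    """True if the file still matches the checksum recorded earlier."""
    return sha256_of(path) == recorded_checksum
```

A matching checksum shows that the content is identical to the recorded version. It does not by itself show who made the file, so access controls are still needed.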
Quality control of data is an integral part of all research and takes place at various stages. It is important to assign clear roles and responsibilities for data quality assurance at all stages of research and to develop suitable procedures before data gathering starts. Quality control measures during data collection may include:
Good quality and consistent transcription conventions include transcription instructions or guidelines and a template to ensure uniformity across a collection. Full transcription is recommended for data sharing. If transcription is outsourced take care with:
Data documentation explains how data were created, what they mean, and their content and structure. It is part of good practice when creating, organising and managing data, and it is important to create enough contextual information to make sense of the data. Documentation may include:
Metadata is the label attached to data to describe it. It is extremely important, because most people will forget the details of what a data file or data set contains. Typically metadata will include information on
Good data documentation includes:
Good file naming conventions:
Best practice to ensure authenticity is to:
Version Control tips include:
Researchers using qualitative data analysis packages, such as NVivo 9, to analyse data can use a range of the software’s features to describe and document data. Such descriptions both help during analysis and result in essential documentation when data is shared, as they can be exported from the project file alongside data at the end of research. Researchers can create classifications for persons (e.g. interviewees), data sources (e.g. interviews) and coding. Classifications can contain attributes such as the demographic characteristics of interviewees, pseudonyms used, and the date, time and place of interview. If researchers create generic classifications beforehand, attributes can be standardised across all sources or persons throughout the project. Existing template and pre-populated classification sheets can be imported into NVivo.
Documentation files like the methodology description, project plan, interview guidelines and consent form templates can be imported into the NVivo project file and stored in a ‘documentation’ folder in the Memos folder or linked from NVivo 9 externally. Additional documentation about analyses or data manipulations can be created in NVivo as memos. A date- and time-stamped project event log can record all project events carried out during the NVivo project cycle. Additional descriptions can be added to all objects created in, or imported to, the project file such as the project file itself, data, documents, memos, nodes and classifications. All textual documentation compiled during the NVivo project cycle can later be exported as textual files; classifications and event logs can be exported as spreadsheets to document preserved data collections. The structure of the project objects can be exported in groups or individually. Summary information about the project as a whole or groups of objects can be exported via project summary extract reports as a text, MS Excel or XML file.
Online documentation for a data collection in the UK Data Archive Catalogue can include project instructions, questionnaires, technical reports, and user guides. Researchers typically create metadata records for their data by completing a data centre’s data deposit form or metadata editor, or by using a metadata creation tool, like Go-Geo! GeoDoc or the UK Location Metadata Editor. Providing detailed and meaningful dataset titles, descriptions, keywords and other information enables data centres to create rich resource-discovery metadata for archived data collections. Data centres accompany each dataset with a bibliographic citation that users are required to cite in research outputs to reference and acknowledge accurately the data source used. A citation gives credit to the data source and distributor and identifies data sources for validation.
The Wessex Archaeology Metric Archive Project has brought together metric animal bone data from a range of archaeological sites in England into a single database format. The dataset contains a selection of measurements commonly taken during Wessex Archaeology zooarchaeological analysis of animal bone fragments found during field investigations. It was created by the researchers in MS Excel and MS Access formats and deposited with the Archaeology Data Service (ADS) in the same formats. ADS has preserved the dataset in Oracle and in comma-separated values format (CSV) and disseminates the data both via an Oracle/Cold Fusion live interface and as downloadable CSV files.
The JISC-funded Data Management for Bio-Imaging project at the John Innes Centre developed Bioformats Converter software to batch convert bio-images from a variety of proprietary microscopy image formats to the Open Microscopy Environment format, OME-TIFF. OME-TIFF, an open file format that enables data sharing across platforms, maintains the original image metadata in the file in XML format.
You’ve invested a lot of time and effort in creating your data, so keep it safe. Throughout the life of your project you need to continuously think about solutions for storing data carefully. Many forms of storage media are inherently unreliable, and all file formats and physical storage media will ultimately become obsolete.
Making back-ups of files is an essential element of data management, which protects against accidental or malicious data loss through:
It is worthwhile checking that you can recover the files you have backed up. External cloud based storage is a good solution, but double check the security features offered, including recovery of files. If you plan to store any business critical or personal information make sure your chosen method complies with Data Protection legislation and best practice.
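Checking that backed-up files can be recovered can be partly automated: restore the backup to a scratch folder and compare it with the working copy. The sketch below is one minimal way to do that with Python's standard library. The function name `check_restore` is hypothetical, and it assumes both folders are reachable as ordinary directories.

```python
import filecmp
from pathlib import Path


def check_restore(original: Path, restored: Path) -> list[str]:
    """Return relative paths that are missing from, or differ in, `restored`."""
    problems = []
    for source in sorted(original.rglob("*")):
        if not source.is_file():
            continue
        relative = source.relative_to(original)
        copy = restored / relative
        # shallow=False compares file contents, not just size and timestamp.
        if not copy.is_file() or not filecmp.cmp(source, copy, shallow=False):
            problems.append(relative.as_posix())
    return problems
```

An empty list means every file in the working copy was found, unchanged, in the restored copy. Files that exist only in the restored copy are not reported by this sketch.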
Sharing data between collaborators is a challenge. Anything sent by email persists in a number of unknown exchange servers – the sender’s, the receiver’s and others in-between – so relying on this as a method of data transfer is not good practice. Cloud-based or online file sharing services may be suitable for sharing certain types of data, but they are not recommended for data that may be confidential, because users do not control where data is ultimately stored. Researchers should be aware of the risks and benefits of each type of solution so they can make informed decisions about which to use.
In terms of long term storage of complete data sets once you are ready to publish, RGU library can help you protect, preserve, archive, and share your research data.
All research activity associated with RGU is an asset of the University and so RGU has a responsibility to secure, store and access all research data, within the bounds of any IP or confidentiality agreement.
To ensure this, RGU is providing R:\drives for researchers, including research students. These provide additional basic data storage space, which can be shared with named individuals who have an RGU login e.g. PIs, research team members, research students and supervisors. They do not provide additional processing or compute power.
Research students will have an R:\drive created for them shortly after they commence their studies, typically when they have completed Module 1 of PGCert.
Data is held securely and privately and so the R:\drive is ideal for confidential or sensitive data. The R:\drive can be accessed via Citrix remotely in the same way as H:\drives.
A research team carrying out coral reef research collects field data using handheld Personal Digital Assistants (PDAs). Digital data are transmitted daily to the institution’s network drive, where they are held in password-protected files. All data files are identified by an individual version number and creation date. Version information (version numbers and notes detailing differences between versions) is stored in a spreadsheet, also on the network drive. The institution’s network drive is fully backed-up onto Ultrium LTO2 data tapes. Incremental back-ups are made daily Monday to Thursday; full server back-ups are made from Friday to Sunday. Tapes are securely stored in a separate building. Upon completion of the research the data are deposited in the institution’s digital repository.
In February 2008 the British Library (BL) received the recorded output of the Survey of Anglo-Welsh Dialects (SAWD), carried out by University College, Swansea, between 1969 and 1995. This survey recorded the English spoken in Wales by interviewing and tape-recording elderly speakers on topics including the farm and farming, the house and housekeeping, nature, animals, social activities and the weather. The collection was deposited in the form of 503 digital audio files, which were accessioned as .wav files in the BL’s Digital Library. Digital clones of all files are held at the Archive of Welsh English, alongside the original master recordings on 151 audio cassettes, from which the digital copies were created.
The BL’s Digital Library is mirrored on four sites – at Boston Spa, St Pancras, Aberystwyth and a ‘dark’ archive which is provided by a third party. Each of these servers has inbuilt integrity checks. The BL makes available access copies for users, in the form of .mp3 audio files, in the British Library Reading Rooms via the Soundserver system. A small set of audio extracts from the SAWD recordings is also available online on the BL’s Accents and Dialects web site, Sounds Familiar.
Research data is a valuable resource, requiring a lot of time, money and effort to produce. Data often has a significant value beyond the original research. Sharing data:
Many funders have adopted research data sharing policies and require researchers to share data and outputs. Journals increasingly require that the data that forms the basis for publications be shared or deposited within an accessible database or repository.
Many researchers at the start of their career believe that the best way of handling confidential data is to destroy it. This is usually completely unnecessary and can invalidate research outputs, including theses. Even personal and confidential data can be openly shared provided researchers have taken care to observe the law and obtain the right level of consent that takes into account plans for dissemination.
Key legislation that may impact on the sharing of confidential data
In many cases, data obtained from people can be shared while upholding both the letter and the spirit of data protection and research ethics principles:
Researchers, and the institutions for which they work or where they study, typically hold copyright in their data. In the case of collaborative research or derived data, copyright may be held jointly by various researchers or institutions. Secondary users of data must obtain copyright clearance from the rights holder before data can be reproduced. Data can be copied for non-commercial teaching or research purposes without infringing copyright, under the fair dealing concept, provided that the owner of the data is acknowledged. When research data is submitted to a journal, researchers need to verify whether the publisher expects copyright transfer of the data.
There are various ways to share research data, including:
Each of these ways of sharing data has advantages and disadvantages: data centres may not be able to accept all data submitted to them; institutional repositories may not be able to afford long-term maintenance of data or support for more complex research data; and websites are often ephemeral with little sustainability. Approaches to data sharing may vary according to research environments and disciplines, due to the varying nature of data types and their characteristics.
RGU is the lead partner in RiCORE, a Horizon 2020 project examining a number of aspects of the consenting process associated with the development of offshore energy installations. As part of the project, the team produced a video in which the partners discuss the aims and achievements of the project. For some sections of the video the film company took footage at one of the expert workshops and has included that footage in the video, while the sound track discusses the workshops.
The footage in question includes the name badges and employing organisations of some of the participants. All participants signed a detailed consent form (see the section on consent) which allows the use of their image. However, the project manager queried whether it is acceptable for participants’ names and organisation names to be shown in the video, or whether they should be blurred. The University’s Data Protection Officer advised that it comes down to what the individual delegate expects, having signed the consent form in which participants were given the option to agree, or not, to their ‘identification as a contributor in reports, publications, written web material, video material, photographs and images’. Where participants have agreed, his advice was that it is not necessary to blur out details on individual badges.
It is a usual expectation that delegates at conferences who have agreed to the use of their image are also agreeing to their organisation’s name or initials being visible in the likes of photographs and seminar shots. From a Data Protection perspective, the University’s Data Protection Officer suggests that one needs to consider whether the use of the delegate’s personal data is ‘fair and lawful’; having the individual’s consent ensures that this requirement is met. In his view, displaying an individual’s organisation’s name or initials (which are not the delegate’s personal data) would not necessarily constitute a breach of privacy or confidentiality, again provided the individual expects that this is likely to be made public.
The Stockholm Environmental Institute (SEI) has created an integrated spatial database, Social and Environmental Conditions in Rural Areas (SECRA). This contains a wide range of socio-economic and environmental characteristics for all rural Census 2001 Super Output Areas (SOAs) for England. Multiple third party data sources were used, such as Census 2001 data, Land Cover Map data and data from the Land Registry, Environment Agency, Automobile Association, Royal Mail and British Trust for Ornithology. Derived data have been calculated and mapped onto SOAs. The researchers would like to distribute the database for wider use. Whilst the database contains no original third party data, only derived data, there is still joint copyright shared between the SEI and the various copyright holders of the third party data. The researchers have sought permission from all data owners to distribute the data and the copyright of all third party data is declared in the documentation. The database can therefore be distributed.
A researcher has interviewed five retired cabinet ministers about their careers, producing audio recordings and full transcripts. The researcher then analyses the data and offers the recordings and transcripts to a data centre for preservation. However, the researcher did not get signed copyright transfers for further use of the interviewees’ words. In this case it would be problematic for a data centre to accept the data. Large extracts of the data cannot be quoted by secondary users. To do so would breach the interviewees’ copyright over their recorded words. This is equally a problem for the primary researcher. The researcher should have asked for transfer of copyright or a licence to use the data obtained through interviews, as the possibility exists that an interviewee may at some point wish to assert the right over their words, e.g. when publishing memoirs.
A researcher subscribes to access spatial AgCensus data from the data centre EDINA (Edinburgh). These data are then integrated with data collected by the researcher. As part of the ESRC research award contract the data has to be offered for archiving at the UK Data Archive. Can such integrated data be offered? The subscription agreement on accessing AgCensus data states that data may not be transferred to any other person or body without prior written permission from EDINA. Therefore, the UK Data Archive cannot accept the integrated data, unless the researcher obtains permission from EDINA. The researcher’s partial data, with the AgCensus data removed, can be archived. Secondary users could then re-combine these data with the AgCensus data, if they were to obtain their own AgCensus subscription.
A researcher has collated articles about the Prime Minister from The Guardian over the past ten years, using the LexisNexis newspaper database to source articles. They are then transcribed/copied by the researcher into a database so that content analysis can be applied. The researcher offers a copy of the database together with the original transcribed text to a data centre. Researchers cannot share either of these data sources as they do not have copyright in the original material. A data centre cannot accept these data as to do so would be a breach of copyright. The rights holders, in this case The Guardian and LexisNexis, would need to provide consent for archiving.
The Publishing Network for Geoscientific and Environmental Data (PANGAEA) is an open access repository for various journals. Each deposited dataset is given a DOI, so it acquires a unique and persistent identifier and the underlying data can be directly connected to the corresponding article. For example, PANGAEA and the publisher Elsevier have reciprocal linking between research data deposited with PANGAEA and corresponding articles in Elsevier journals.
Nature journals have a policy that requires authors to make data and materials available to readers, as a condition of publication, preferably via public repositories. Appropriate discipline-specific repositories are suggested. Specifications regarding data standards, compliance or formats may also be provided.
For example, for research on small molecule crystal structures, authors should submit the data and materials to the Cambridge Structural Database (CSD) as a Crystallographic Information File, a standard file structure for the archiving and distribution of crystallographic information. After publication of a manuscript, deposited structures are included in the CSD, from where bona fide researchers can retrieve them for free. CSD has similar deposition agreements with many other journals.
UK Biobank aims to collect medical and genetic data from 500,000 middle-aged people across the UK in order to create a research resource to study the prevention and treatment of serious diseases. Stringent security, confidentiality and anonymisation measures are in place. UK Biobank holds personal data on recruited patients, their medical records and blood, urine and genetic samples, with data made available to approved researchers. Data or samples provided to researchers never include personal identifying details.
All data and samples are stored anonymously by removing any identifying information. This identifying information is encrypted and stored separately in a restricted access database that is controlled by senior UK Biobank staff. Identifying data and samples are only linked using a code that has no external meaning. Only a few people within UK Biobank have access to the key to the code for re-linking participants’ identifying information with data and samples. All staff sign confidentiality agreements as part of their employment contracts.
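The general technique described here, replacing identifying details with a meaningless code and holding the linking key separately, can be sketched in a few lines. This is an illustration of the idea only and not UK Biobank's actual system: the function name `pseudonymise` and the field name `participant_code` are hypothetical, and encrypting and access-controlling the key table is not shown.

```python
import secrets


def pseudonymise(records, identifying_fields):
    """Split records into research data and a separately held linking key.

    Returns (research_rows, key_table). Each research row carries a random
    code in place of the identifying fields; key_table maps each code back
    to those fields and must be stored apart from the research data.
    """
    research_rows, key_table = [], {}
    for record in records:
        code = secrets.token_hex(8)  # random, so it has no external meaning
        while code in key_table:     # extremely unlikely, but keep codes unique
            code = secrets.token_hex(8)
        key_table[code] = {field: record[field] for field in identifying_fields}
        row = {name: value for name, value in record.items()
               if name not in identifying_fields}
        row["participant_code"] = code
        research_rows.append(row)
    return research_rows, key_table
```

Removing direct identifiers in this way does not guarantee anonymity on its own, because combinations of the remaining fields can still identify a person in a small sample.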
The Biological Records Centre (BRC) is the national custodian of data on the distribution of wildlife in the British Isles. Data are provided by volunteers, researchers and organisations. BRC disseminates data for environmental decision-making, education and research. Data whose publication could present a significant threat to a species or habitat (e.g. nesting location of birds of prey) will be treated as confidential.
The BRC provides access to the data it holds via the National Biodiversity Network Gateway. Standard access controls are as follows:
Working with data owners, the Secure Data Service provides researchers with secure access to data that are too detailed, sensitive or confidential to be made available under the standard licences operated by its sister service, the Economic and Social Data Service (ESDS). The service’s security philosophy is based upon training and trust, leading-edge technology, licensing and legal frameworks (including the 2007 Statistics Act), and strict security policies and penalties endorsed by both the ONS and the ESRC. The technical model shares many similarities with the ONS Virtual Microdata Laboratory and the NORC Secure Data Enclave. It is based around a Citrix infrastructure which turns the end user’s computer into a remote terminal. All data processing is carried out on a central secure server; no data travels over the network. Outputs for publication are only released subject to Statistical Disclosure Control checks by trained Service staff.
Secure Data Service data cannot be downloaded. Researchers analyse the data remotely from their home institution at their desktop or in a safe room. The Service provides a ‘home away from home’ research facility with familiar statistical software and MS Office tools to make remote collaboration and analysis secure and convenient.
The clearing-house mechanism established following the Convention on Biological Diversity to promote information sharing has resulted in an exponential increase in openly accessible biodiversity and ecosystem data since 1992. The Forest Spatial Information Catalogue is a web-based portal, developed by the Center for International Forestry Research (CIFOR), for public access to spatial data and maps. The catalogue holds satellite images, aerial photographs, land usage and forest cover maps, maps of protected areas, agricultural and demographic atlases and forest boundaries. For example, forest cover maps for the entire world, produced by the World Conservation Monitoring Centre in 1997, can be downloaded freely as digital vector data.
The Global Biodiversity Information Facility (GBIF) strives to make the world’s biodiversity data accessible everywhere in the world. The facility holds millions of species occurrence records based on specimens and observations, scientific and common names and classifications of living organisms and map references for species records. Data are contributed by numerous international data providers. Geo-referenced records can be mapped to Google Earth.