CINECA launches a training resource to help researchers learn about accessing and sharing sensitive human data across the world

To help researchers learn about finding, accessing and analysing sensitive human data in a federated fashion, the CINECA project has developed a learning pathway. In this blog, we describe why federated data analysis is important, what is included in the learning pathway and how it can be accessed.

Introduction

Researchers are generating huge amounts of human genetic data and cohort datasets around the world. These datasets contain rich information that is potentially extremely valuable to other researchers who want to reanalyse them to generate hypotheses and answer new research questions. However, these datasets have become so large that it is often difficult and costly for researchers to share them. Emerging national data protection regulations - while protecting the rights of individuals to access and protect their personal and genomic data - introduce additional barriers to sharing data for research purposes.

To overcome these challenges, new approaches are being developed that allow researchers to analyse data at its source by bringing workflows to the data. These “federated data analysis” approaches allow researchers to work within restricted jurisdictions in line with data protection regulations, and they remove the costs and challenges associated with transferring big datasets to different locations. Another challenge that researchers face when working with sensitive human data is the process of applying for access, which can often take a lot of time. The use of “synthetic datasets”, which mimic real data but contain no identifiable information, offers a way to test workflows without ethical or legal concerns while awaiting approval to access real data.
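To make the idea of bringing the workflow to the data concrete, here is a minimal Python sketch. It is an illustration only and not part of the CINECA learning pathway: the names `SiteSummary`, `local_summary` and `federated_mean` are hypothetical, and the values are made-up synthetic numbers. Each site computes a small aggregate next to its own data, and only those aggregates are combined centrally.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class SiteSummary:
    """Aggregate statistics a site releases; no individual-level records."""
    n: int
    total: float


def local_summary(values: Iterable[float]) -> SiteSummary:
    """Hypothetical step that runs inside a site's own environment."""
    values = list(values)
    return SiteSummary(n=len(values), total=sum(values))


def federated_mean(summaries: List[SiteSummary]) -> float:
    """Combine per-site aggregates into a pooled mean without moving raw data."""
    n = sum(s.n for s in summaries)
    if n == 0:
        raise ValueError("no observations across sites")
    return sum(s.total for s in summaries) / n


# Made-up synthetic measurements held at two separate sites (assumption).
site_a = [1.0, 2.0, 3.0]
site_b = [4.0, 5.0]

summaries = [local_summary(site_a), local_summary(site_b)]
pooled_mean = federated_mean(summaries)
print(pooled_mean)
```

In a real federated setup, the local step would typically be a workflow submitted to each site's compute environment, and sites may apply further safeguards before releasing any aggregate.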

As part of the CINECA project, a training resource was developed to help researchers learn about finding, accessing and analysing sensitive human data in a federated fashion. The learning pathway contains explanations, exercises and links to useful resources, as well as use cases to exemplify the researchers’ journey to perform a federated data analysis.

Go to the federated data analysis pathway

What is covered in the learning pathway

The journey for performing a federated data analysis on sensitive human data includes four key steps: data discovery, data access, data analysis, and data management. All of these steps can be done with synthetic data as well as with real data in cases where data access is approved for a researcher.

Data Discovery

Image: The CINECA project’s learning pathway, developed and hosted on the EMBL-EBI Training portal.

The first step is discovering or finding data of interest for a particular research question or to help develop a specific research tool or software. The CINECA Learning Pathway shows how data can be found by querying popular sensitive data archives like the EGA or other repositories that store research-related artefacts like Zenodo. Because synthetic datasets do not contain identifiable information, they can be found in multiple public repositories, whereas “real” data are only found in archives with controlled access.
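As a small illustration of programmatic discovery, the sketch below builds a search URL for Zenodo's public records endpoint. The search phrase and the helper name `build_search_url` are assumptions made for this example, and the learning pathway itself may describe a different route, for instance searching through the repository's web interface.

```python
from urllib.parse import urlencode

# Zenodo's public REST endpoint for searching records.
ZENODO_RECORDS_API = "https://zenodo.org/api/records"


def build_search_url(query: str, size: int = 10) -> str:
    """Return a search URL; `q` is the search string, `size` the page size."""
    return f"{ZENODO_RECORDS_API}?{urlencode({'q': query, 'size': size})}"


# Hypothetical search phrase, chosen only for illustration.
url = build_search_url("synthetic human genomic data", size=5)
print(url)
```

The resulting URL can be opened in a browser or fetched with any HTTP client; the sketch itself makes no network request.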

Data Access

After a researcher finds data that meets their requirements, the next step is to obtain access. As the CINECA Learning Pathway explains, for controlled access repositories this step usually involves filling out an application requesting access, which is sent to the owner of the data for approval. Synthetic data is, by definition, public and therefore does not require approval, but researchers may still need to register with the repository to make use of its services.

Data Analysis

When a researcher has accessed a dataset, they can analyse the data using different kinds of workflows and tools. The CINECA learning pathway shows two examples, one using Nextflow and one using Galaxy, and includes links to tutorials to learn about these tools before diving into the federated analysis itself.

Data Management (ELSI and FAIR)

Managing research data responsibly is a process that has to be present throughout the research cycle. It is essential to uphold the FAIR principles (Findable, Accessible, Interoperable, Reusable) and to keep in mind the ethical, legal and societal implications (ELSI) of using and reusing research data. The CINECA learning pathway provides explanations, videos, and resources for researchers to strengthen their knowledge and skills in FAIR and ELSI best practices.

Conclusion

By using the CINECA learning pathway, researchers will be able to understand the basic concepts of finding, accessing, and analysing sensitive human data. They can then feel empowered to apply these concepts, alongside FAIR principles and ELSI best practices, to address their research questions and maximise the value of human research data.

CINECA Short Reports, Blog, Learning Pathway, WP6
Mallory Freeberg (EMBL-EBI), Daniel Thomas López (EMBL-EBI)
June 12, 2023
Federated Data Sharing, Access, permission, AAI, Authorisation, learning pathway, WP6