Beacon cohorts: A model for cohort discovery in CINECA and beyond

This month’s blog was written by Lauren Fromont (CRG), a member of the EGA team at CRG and a member of CINECA WP1 - Federated Data Discovery and Querying. This blog is the second in our GA4GH standards series, presenting an overview of how GA4GH standards are being developed and implemented by CINECA. For the first blog in the series, giving a broad overview of how CINECA is facilitating federated data discovery, access and analysis, please see Dylan Spalding’s blog Implementation of GA4GH standards in CINECA.


Introduction

Personalised medicine is making tremendous progress globally, and sustaining that progress depends on giving researchers and clinicians secure access to human data in an ethically and legally responsible manner. The CINECA project aims to promote this by delivering a federated infrastructure for the discovery of human genetic and phenotypic data, and has already assembled a virtual cohort of 1.4 million individuals from population, longitudinal and disease studies such as those in CanDIG, H3Africa, and the European Genome-phenome Archive (EGA).

In February’s blog post, Dylan Spalding (EMBL-EBI) explained how the CINECA project is implementing GA4GH standards across the federated network, so that similar analyses can be performed across cohorts. Here, we focus on the data discovery part; specifically, how we implement the Beacon cohort model, a tool that will help users get information about cohorts of interest.

Beacon v1 was designed to find individual cases or genomic datasets. The Beacon cohort protocol allows entire cohorts to be discovered, which addresses a specific need of clinical researchers. Within the CINECA project, the Beacon team is taking its first steps with four use-cases: synthetic datasets created to aid the development of technical implementations. These four datasets consist entirely of synthesised data based on the characteristics of four cohorts participating in the CINECA project: H3ABioNet, CHILD, CoLaus, and UK Biobank. As an example, the CINECA_synthetic_cohort_EUROPE_UK1 dataset is currently available here, and will be the first use-case made available via Beacon v2 at EGA-CRG.


Beacon in CINECA

The Beacon project was one of the first Global Alliance for Genomics and Health (GA4GH) pilot projects. From the very beginning, it aligned well with the principles of open, distributed life sciences data services; it now stands as the “default” data discovery solution in the GA4GH portfolio, and is increasingly adopted by third-party projects, including CINECA.

The original Beacon implementation was conceived as a protocol for sharing the presence or absence of a specific genomic mutation in a set of data; it was not designed for clinical use. To make Beacon more powerful and more useful in healthcare environments, the second version of Beacon (v2) includes features that aim to benefit the clinical research community (see Figure 1):

  • Allowing more informative queries, e.g. filtering by gender or age;

  • Triggering the next step in the data access process, e.g. indicating who to contact or what the data use conditions are;

  • Linking to another system where the data can be accessed, e.g. a Beacon internal to a hospital could provide the ID of the Electronic Health Record of patients having the mutation of interest, while preserving confidentiality;

  • Including annotations about the variants found, such as the expert/clinician conclusion about the pathogenicity of a given mutation in a given individual or its role in producing a given phenotype.

Beacon v2 is built on a default schema, to which alternative schemas can be added to fit each project’s purposes.
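To make the contrast between the original presence/absence query and the more informative v2 queries concrete, here is a minimal sketch. It is not the Beacon API: the record layout, the function names and the variant coordinates are all invented for illustration, and a real Beacon would answer over HTTP against its own schema.

```python
# Hypothetical in-memory records; the variant tuples are made-up values.
RECORDS = [
    {"variants": {("17", 1000, "G", "A")}, "gender": "female", "age": 38},
    {"variants": {("17", 1000, "G", "A")}, "gender": "male", "age": 61},
    {"variants": {("13", 2000, "C", "T")}, "gender": "female", "age": 55},
]


def variant_exists(records, variant):
    """Original-style question: is this variant present in the data at all?"""
    return any(variant in r["variants"] for r in records)


def variant_exists_filtered(records, variant, gender=None, max_age=None):
    """Same yes/no question, narrowed by optional metadata filters."""
    for r in records:
        if variant not in r["variants"]:
            continue
        if gender is not None and r["gender"] != gender:
            continue
        if max_age is not None and r["age"] > max_age:
            continue
        return True
    return False


query = ("17", 1000, "G", "A")
present = variant_exists(RECORDS, query)
present_in_young_women = variant_exists_filtered(RECORDS, query, gender="female", max_age=40)
present_in_young_men = variant_exists_filtered(RECORDS, query, gender="male", max_age=40)
```

In this toy data the unfiltered query and the query restricted to women aged 40 or under both find a match, while the same query restricted to men aged 40 or under does not.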


Figure 1. Beacon v2 API allows for more extended queries while keeping its design simple and easy to use.



As the CINECA project takes on the challenge of building a federated data discovery service, Beacon is central to making that happen: it will allow access to single variants, backed by access to physical samples and rich metadata about those samples. Further, CINECA provides new levels of data, sample and cohort integration to deliver a virtual cohort. This created the need for a model that allows searches among cohorts.

Why do we need a Beacon cohort model?

Let’s say you are a clinical researcher in oncology. You are evaluating the efficacy of chemotherapy on specific cancer types, depending on ancestry and on the age at which the cancer was diagnosed. You spend a lot of time and resources going through dataset descriptions and metadata to find appropriate cohorts that will help you answer your research questions. The Beacon cohort model exists as a discovery tool to facilitate that process.

The Beacon cohort model allows researchers to find cohorts, that is, sets of individuals that satisfy one or more inclusion criteria for a duration of time (Observational Health Data Sciences and Informatics). In the simple use-case presented above, as an investigator, you might be interested in a cohort of individuals who were diagnosed with cancer before the age of 40, and another cohort diagnosed with the same cancer later in life.

After you make the query, Beacon v2 will return a description of the cohort of interest in the form of aggregated metadata that includes:

  • Binary information about whether or not certain data types are available (e.g., genomics data, clinical measurements, lifestyle);

  • The number of individuals with each data type (e.g., 10,000 individuals with genomics data);

  • Summary statistics of individual-level phenotypic fields (e.g., gender, age, ancestry, diseases, phenotypes).
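The three kinds of aggregated metadata listed above can be sketched as a small summary function. This is an illustration only: the field names, the data type labels and the output layout are assumptions, not the Beacon v2 response schema.

```python
from collections import Counter
from statistics import mean

# Hypothetical individual-level records; field names are illustrative only.
INDIVIDUALS = [
    {"gender": "female", "age": 34, "data_types": {"genomics", "lifestyle"}},
    {"gender": "male", "age": 52, "data_types": {"genomics"}},
    {"gender": "female", "age": 47, "data_types": {"clinical_measurements"}},
]

DATA_TYPES = ("genomics", "clinical_measurements", "lifestyle")


def summarise_cohort(individuals):
    """Return aggregated metadata only; no individual record is exposed."""
    counts = Counter(dt for person in individuals for dt in person["data_types"])
    return {
        "cohort_size": len(individuals),
        # Binary availability of each data type
        "data_availability": {dt: counts[dt] > 0 for dt in DATA_TYPES},
        # Number of individuals with each data type
        "data_type_counts": {dt: counts[dt] for dt in DATA_TYPES},
        # Summary statistics of individual-level fields
        "gender_distribution": dict(Counter(p["gender"] for p in individuals)),
        "mean_age": mean([p["age"] for p in individuals]) if individuals else None,
    }


summary = summarise_cohort(INDIVIDUALS)
```

The point of the design is that the caller only ever sees counts and statistics, which is what lets a cohort be described for discovery purposes without releasing individual-level data.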

The Beacon cohorts model is included in the Beacon v2 default schema, which is available for consultation; the next section describes its main components.


How does the Beacon cohorts schema work?

Defining cohorts

Cohorts fall into two types: study-defined cohorts (which include Beacon-defined ones) and user-defined cohorts.

Back to our imaginary use-case. Imagine that you first query for individuals diagnosed with cancer before the age of 40, and there happens to be an existing study that purposely recruited this population. This is an example of a study-defined cohort, which can either be based on a pre-existing study or be implemented as such for Beacon purposes (e.g. by timeline, gender or country, whichever the implementer decides to use). In that case, cohort inclusion criteria are defined by the study design (or by the Beacon implementer). For those types of cohorts, the response schema includes both study values (i.e., values that pertain to the original design) and values calculated from cohort fields, so the user will know that the response is showing a study-defined cohort.

Then, for your second cohort of interest, you might query for individuals who were diagnosed after the age of 40. There is no existing study that recruited this population on purpose, but there are individuals who meet these criteria. In that case, a user-defined cohort is created on the fly by the Beacon user upon query; that is, the query itself implicitly sets the cohort criteria. Unlike study-defined cohorts, user-defined ones do not have a priori defined values; instead, they bear the values of the individuals that meet the criteria set by the query.
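The difference between the two cohort types can be sketched in a few lines. Everything here is hypothetical: the `Individual` fields, the function names, the study name "EarlyOnsetStudy" and the response layout are invented for illustration and are not taken from the Beacon v2 schema.

```python
from dataclasses import dataclass
from statistics import median
from typing import Callable, Optional


@dataclass(frozen=True)
class Individual:
    # All field names are hypothetical, chosen for illustration.
    individual_id: str
    age_at_diagnosis: int
    study: Optional[str] = None  # name of the recruiting study, if any


def calculated_values(members):
    """Aggregate values computed from the individuals in the cohort."""
    ages = [m.age_at_diagnosis for m in members]
    return {
        "size": len(members),
        "median_age_at_diagnosis": median(ages) if ages else None,
    }


def study_defined_cohort(individuals, study, study_values):
    """Membership is fixed a priori by the study design."""
    members = [i for i in individuals if i.study == study]
    return {
        "cohort_type": "study-defined",
        "study_values": study_values,  # values from the original design
        "calculated_values": calculated_values(members),
    }


def user_defined_cohort(individuals, criteria: Callable[[Individual], bool]):
    """Membership is set implicitly by the criteria of the query."""
    members = [i for i in individuals if criteria(i)]
    return {
        "cohort_type": "user-defined",
        "calculated_values": calculated_values(members),
    }


PEOPLE = [
    Individual("P1", 35, study="EarlyOnsetStudy"),
    Individual("P2", 38, study="EarlyOnsetStudy"),
    Individual("P3", 57),
    Individual("P4", 63),
    Individual("P5", 41),
]

early = study_defined_cohort(PEOPLE, "EarlyOnsetStudy", {"inclusion": "diagnosed before age 40"})
late = user_defined_cohort(PEOPLE, lambda i: i.age_at_diagnosis > 40)
```

In this sketch the study-defined response carries both `study_values` and `calculated_values`, whereas the user-defined response carries only values calculated from the individuals matched by the query.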

In both cases, Beacon cohorts are defined by fields. In the Beacon cohorts schema for CINECA, fields were selected according to the minimal cohort metadata model developed by CINECA WP3 (Cohort Level Metadata), in collaboration with cohort consortia and initiatives such as the International HundredK+ Cohorts Consortium and the Health Data Research Innovation Gateway. This minimal metadata model takes into account the efforts from GA4GH towards standardising cohort definition. The fields selected for the CINECA Beacons were also determined by project use cases defined by CINECA WP4 and WP5 (Federated Joint Cohort Analysis and Clinical Applications), such as federated joint cohort variant genotyping.

Finally, the Beacon cohort model uses a specific ontology to represent genomics cohort attributes: GECKO, maintained by CINECA.


Defining cohort fields

According to the Beacon cohort model, a response to a query will provide cohort fields that will allow the user to get information about the cohort, in the form of aggregated metadata. Figure 2 defines six main blocks that characterise a given response.

Figure 2. Beacon cohort model field blocks 


These fields are particularly important because they allow the user, at a glance, to access extensive information about a given cohort and assess whether it suits their research question. For this reason, the Beacon cohort model relies on standards such as the GECKO ontology and GA4GH standards. Standardisation has a double advantage: it makes the tool easy to use and understand, and it makes it transposable beyond the CINECA project. The cohort model will feature in version 2 of Beacon in its upcoming submission for GA4GH approval.

Conclusion: Cohorts part of Beacon v2

For Beacon v2, CINECA is an extremely interesting use-case that allows the introduction of cohorts into the Beacon schema. Cohorts will be included in the default schema, connected to individuals. Figure 3 shows a simplified illustration of how cohorts will be articulated in the default schema.

Figure 3. Cohorts in the Beacon v2 schema (simplified version)


CINECA provided Beacon with use-cases that brought up the challenge of how to search among cohorts, either pre-defined in existing studies or defined ad hoc by a Beacon user. This significantly broadens the scope of Beacon as a GA4GH Discovery stream, from discovering cases to entire cohorts.