Posts tagged WP3
Access representation ontology developed for project's cohorts - D3.5

Authors: Melanie Courtot (EMBL-EBI), Jonathan Dursi (SickKids), Nicky Mulder (UCT), Morris Swertz (UMCG)

Access, reuse and integration of biomedical datasets are critical to advancing genomics research and realising benefits to human health. However, obtaining controlled-access human data in a timely fashion can be challenging, as neither the access requests nor the data use conditions are standardised: their manual review and evaluation by a Data Access Committee (DAC) to determine whether access should be granted can significantly delay the process, typically by at least 4 to 6 weeks once the dataset of interest has been identified.

To address this, we have contributed to the development of the Data Use Ontology (DUO), which was approved as a Global Alliance for Genomics and Health (GA4GH) standard and has been used in over 200,000 annotations worldwide. DUO is a machine-readable structured vocabulary containing "Permission terms" (which describe data use permissions) and "Modifier terms" (which describe data use requirements, limitations or prohibitions). It has already been implemented in some CINECA cohorts and cohort data sharing resources (e.g. EGA, H3Africa, synthetic datasets); additional cohorts are reviewing their data access policies with a view to applying DUO terms to their datasets.
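To illustrate what machine-readable data use conditions make possible, the sketch below checks a hypothetical access request against a dataset annotated with DUO terms. The term IDs are real DUO identifiers, but the matching logic is a deliberately simplified illustration, not a GA4GH-endorsed matching algorithm, and the disease codes (MONDO) are only examples:

```python
# Minimal sketch of machine-readable access matching with DUO terms.
# Term IDs are real DUO identifiers; the matching logic is simplified
# for illustration and is not a GA4GH-endorsed algorithm.

DUO_GRU = "DUO:0000042"  # general research use (permission term)
DUO_DS = "DUO:0000007"   # disease specific research (permission term)
DUO_NCU = "DUO:0000046"  # non-commercial use only (modifier term)

def access_compatible(dataset_terms, request):
    """Return True if a request is compatible with a dataset's DUO terms."""
    permission = dataset_terms["permission"]
    modifiers = dataset_terms.get("modifiers", [])
    # Permission check: GRU allows any research purpose; DS requires the
    # requested disease to match the one the consent covers.
    if permission == DUO_DS:
        if request.get("disease") != dataset_terms.get("disease"):
            return False
    elif permission != DUO_GRU:
        return False  # unknown permission: fall back to manual DAC review
    # Modifier check: e.g. non-commercial use only.
    if DUO_NCU in modifiers and request.get("commercial", False):
        return False
    return True

dataset = {"permission": DUO_DS,
           "disease": "MONDO:0004975",  # example: Alzheimer disease
           "modifiers": [DUO_NCU]}
print(access_compatible(dataset, {"disease": "MONDO:0004975"}))  # True
print(access_compatible(dataset, {"disease": "MONDO:0005015"}))  # False
```

Because both the dataset annotation and the request are structured terms rather than free-text consent language, checks like this can run before a DAC ever sees the application, which is where the time savings come from.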

https://doi.org/10.5281/zenodo.5795449

Text mining processing pipeline for semi-structured data - D3.3

Authors: Jenny Copara (SIB), Nona Naderi (SIB), Alexander Kellmann (UMCG), Gurinder Gosal (SFU), William Hsiao (SFU), Douglas Teodoro (SIB)

Unstructured and semi-structured cohort data contain relevant information about the health condition of a patient, e.g., free text describing disease diagnoses, drugs, and reasons for medication, which is often not available in structured formats. One of the challenges posed by medical free text is that the same concept can be mentioned in several different ways. Encoding free text into unambiguous descriptors therefore allows us to leverage the value of the cohort data, in particular by facilitating its findability and interoperability across cohorts in the project.

Named entity recognition and normalization enable the automatic conversion of free text into standard medical concepts. Given the volume of available data shared in the CINECA project, the WP3 text mining working group has developed named entity normalization techniques to obtain standard concepts from unstructured and semi-structured fields available in the cohorts. In this deliverable, we present the methodology used to develop the different text mining tools created by the dedicated SFU, UMCG, EBI, and HES-SO/SIB groups for specific CINECA cohorts.
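As a toy example of what entity normalisation does (this is not one of the deliverable's pipelines, just a dictionary-lookup sketch; the SNOMED CT codes are real concepts used for illustration), different surface mentions of the same condition can be mapped to a single standard concept ID:

```python
import re

# Toy lexicon mapping free-text surface forms to standard concept IDs.
# Synonyms normalise to the same concept, which is the point of the exercise.
LEXICON = {
    "myocardial infarction": "SNOMED:22298006",
    "heart attack": "SNOMED:22298006",       # synonym of the line above
    "type 2 diabetes": "SNOMED:44054006",
    "diabetes mellitus type 2": "SNOMED:44054006",
}

def normalise(text):
    """Find lexicon mentions in free text; return (mention, concept) pairs."""
    text = re.sub(r"\s+", " ", text.lower())  # case/whitespace normalisation
    return [(surface, concept) for surface, concept in LEXICON.items()
            if surface in text]

note = "Pt admitted after a Heart Attack; history of Type 2 Diabetes."
print(normalise(note))
```

Real normalisation systems go well beyond exact dictionary lookup (handling abbreviations, misspellings, and context), but the output is the same in kind: unambiguous concept identifiers that can be queried consistently across cohorts.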

https://doi.org/10.5281/zenodo.5795433

Semantic and harmonisation best practice - D3.2

Authors: Melanie Courtot (EMBL-EBI), Isuru Liyanage (EMBL-EBI)

To support human cohort genomic and other omic data discovery and analysis across jurisdictions, basic data such as cohort participants’ demographic data, diseases, medication, etc. (termed “minimal metadata”) needs to be harmonised. Individual cohorts are constrained by size, ancestral origins, and geographic boundaries that limit the subgroups, exposures, outcomes, and interactions which can be examined. Combining data across large cohorts to address questions none of them can answer alone enhances the value of each and leverages the enormous investments already made in them to address pressing questions in global health. By capturing genomic, epidemiological, clinical and environmental data from genetically and environmentally diverse populations, including populations that are traditionally under-represented, we will be able to capture novel factors associated with health and disease that are applicable to both individuals and communities globally.

We provide best practices for cohort metadata harmonisation, using the semantic platform we deployed in the cloud to enable cohort owners to map their data and harmonise against the GECKO (GEnomics Cohorts Knowledge Ontology) we developed. GECKO is derived from the CINECA minimal metadata model of the basic set of attributes that should be recorded with all cohorts and is critical to aid initial querying across jurisdictions for suitable dataset discovery. We describe how this minimal metadata model was formalised using modern semantic standards, making it interoperable with external efforts and machine readable. Furthermore, we present how those practices were successfully used at scale, both within CINECA for data discovery in WP1 and in the synthetic datasets constructed by WP3, and outside of CINECA such as in the International HundredK+ Cohorts Consortium (IHCC) and the Davos Alzheimer’s Collaborative (DAC). Finally, we highlight ongoing work for alignment with other efforts in the community and future opportunities.
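The mapping step at the heart of this harmonisation can be sketched as follows. The shared term labels and cohort variable names below are hypothetical placeholders; a real workflow would map to curated GECKO term IDs via the semantic platform described above:

```python
# Minimal sketch of variable-level harmonisation against a shared vocabulary.
# Shared labels and cohort variable names are hypothetical; real mappings
# would use curated GECKO ontology term IDs.

# Each cohort's data dictionary maps its local variable names to shared terms.
COHORT_MAPPINGS = {
    "cohort_a": {"age_at_visit": "age at enrolment", "sex_cd": "biological sex"},
    "cohort_b": {"AGE": "age at enrolment", "gender": "biological sex"},
}

def harmonise(cohort, record):
    """Rename a cohort-specific record's fields to the shared vocabulary."""
    mapping = COHORT_MAPPINGS[cohort]
    return {mapping.get(k, k): v for k, v in record.items()}

a = harmonise("cohort_a", {"age_at_visit": 54, "sex_cd": "F"})
b = harmonise("cohort_b", {"AGE": 61, "gender": "F"})
# Both records now share field names, so a single query spans both cohorts.
print(a, b)
```

Anchoring both data dictionaries to one ontology-backed vocabulary is what lets a discovery query written once run across jurisdictions, which is the use case WP1 and the IHCC/DAC adoptions exercise at scale.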

https://doi.org/10.5281/zenodo.5055308

Useful ontologies for harmonizing cohort data

This video describes the OBO Foundry, a community of practice for interoperable ontology building, and highlights a number of well-curated and actively maintained ontologies that are useful for annotating cohort data. The video is aimed at anyone interested in data standardization and/or the ontology approach (i.e. the general public and end users). No prerequisite knowledge is required, but viewers may also find our previous videos useful. This video is part of the CINECA online training series, where you can learn about concepts and tools relevant to federated analysis of cohort data.
