CINECA
Home
History Partners Cohorts Related projects
Governance Work packages Deliverables EC Working Groups
News & Events Webinars Short videos Blog
Scientific impact Publications
Learning Pathway Synthetic Cohort Datasets Federated Analysis Platform
| Events | Contacts
Common Infrastructure for National Cohorts in Europe, Canada, and Africa
Home About CINECA History Partners Cohorts Related projects How we work Governance Work packages Deliverables EC Working Groups Dissemination News & Events Webinars Short videos Blog Impact Scientific impact Publications Virtual Platform Learning Pathway Synthetic Cohort Datasets Federated Analysis Platform
CINECA
Common Infrastructure for National Cohorts in Europe, Canada, and Africa

Uncovering metadata from semi-structured cohort data

Harmonisation of attributes across different cohorts is very challenging and labour intensive, but critical to leverage the collective potential of the data. While this problem is common across different domains, their individual needs and requirements make it difficult to employ a common set of tools or to directly apply standard machine learning techniques. For example, besides well  structured metadata, cohorts are also described in scientific publications, and supplementary sections of research papers often contain large amounts of semi-structured data. This supplementary data could contain additional information which could both reaffirm and enhance original data and thus would be very helpful in search and data discovery. Due to its nature and scale, this data is not amenable to manual processing, and instead requires application of machine learning and Natural Language Processing (NLP) techniques to extract any useful information. 

The CINECA text mining group aims to provide common tools and methods to extract additional metadata from structured and semi-structured fields in cohorts’ data.


Text mining team

The text-mining team encompasses members from CINECA participant institutions such as SFU, EMBL-EBI, HES-SO, and UCT. The teams are initially focusing on building tools based on their prior experience/knowledge targeting specific cohorts. The goal is to expose each of them as (micro)services, which will then be made available via a collective interface for general cohort text analysis related tasks.

Focusing on the CoLaus/PsyCoLaus cohort data, HES-SO/SIB has developed a pipeline using MetaMap, a machine learning and rule based framework for assigning unambiguous metadata descriptors to free text. SFU has developed a rule-based text mining tool LexMapr that cleans up and parses shorter form unstructured text to extract biomedical entities and map these to standard ontology terms. EMBL-EBI has developed Zooma and Curami pipelines to annotate and curate semi-structured data, which we describe further below.

Zooma pipeline

Zooma is an ontology annotation tool developed at EMBL-EBI for mapping free text to ontology terms. Zooma is backed by a linked data repository of annotation knowledge which contains curated annotations derived from many publicly available data sources such as Expression Atlas, Open Targets and GWAS Catalog. Therefore Zooma can facilitate annotations relating to a diverse range of topics including disease and phenotypes, drug treatments, anatomical components, species, cell types, etc. Furthermore Zooma can be easily configured to use new data sources or prioritise certain data sources over others to enhance context sensitivity. To increase FAIRness (1) of its data, BioSamples has developed a pipeline to automatically annotate sample attribute-values with ontologies using Zooma. This allows researchers to do complex queries using ontology expansion and synonyms - for example, searching for heart diseases will return samples annotated with myocardial infarction using ontology expansion.

Figure 1: Ontology annotation of Sample attribute-value using Zooma

Figure 1: Ontology annotation of Sample attribute-value using Zooma

Curami pipeline

Curami is a machine assisted curation tool for harmonising data in the EBI BioSamples database. Curami aims to both identify erroneous attributes in the dataset and find a suitable replacement for it. Attribute pairing strategy is based on the assumption that, in most cases, the replacement for an erroneous attribute can be found within the dataset. After the text normalisation process Curami applies different NLP techniques to identify semantically and syntactically similar pairs of attributes. Filtered attribute pairs are loaded into neo4j graph database for identifying clusters of similar attributes. Sorted pairs based on curation impact are manually curated before feeding back to the BioSamples database. In the future we expect to use these curation results for machine learning analysis.

Figure 2: Neo4j cluster for lexically similar pairs

Figure 2: Neo4j cluster for lexically similar pairs

Conclusion

Using Zooma we have been able to add 15 million ontology annotations onto the 14 million public samples that BioSamples contains as of September 2020.  We will apply Zooma for annotating cohort data with existing and new ontologies developed as a part of CINECA project to streamline mapping to the CINECA metadata model and improve data integration. Curami has allowed for automated curation of in excess of 40 million attributes. We recently deployed a curami-based recommendation service to improve the comprehensiveness of the metadata at submission time. Utilised together, Zooma and Curami increase data FAIRness at scale both on existing data and for future submissions.

Text-mining series, CINECA Short Reports, WP3, BlogIsuru LiyanageSeptember 15, 2020text mining, metadata, ontology, cohort, WP3
Twitter LinkedIn0 0 Likes
Previous

Annotating data using next generation biobanking ontology (NGBO)

CINECA Short VideosGuest UserSeptember 16, 2020training materials, WP3
Next

Annotating data using ontologies

CINECA Short VideosEmma GriffithsSeptember 10, 2020

Get In Touch

Email: info@cineca-project.eu

Keep up to date with all our latest news and events on:

QUICk Links

History
Partners
Cohorts
Scientific Impact
Events

How do we work

Work Packages
Related Projects
Synthetic Datasets
Webinars
Short training Videos

langfr-225px-Flag_of_Europe.svg_-e1550656025583 (1).png

This project has received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement No 825775.

Canadian Institute of Health Research.png

This project has received funding from the Canadian Institute of Health Research under CIHR grant number # 404896

info@cineca-project.eu
Hours

© Copyright 2022 CINECA project.