CINECA
Common Infrastructure for National Cohorts in Europe, Canada, and Africa

Assigning standard descriptors to free text

This post is part of a series on a text-mining pipeline being developed by CINECA in Work Package 3. In previous instalments, the Zooma and Curami pipelines were described in "Uncovering metadata from semi-structured cohort data", and LexMapr was introduced in "LexMapr - A rule-based text-mining tool for ontology term mapping and classification". In this third instalment we explain the normalisation pipeline developed at SIB/HES-SO.


Normalisation pipeline

Our text mining team has been working on a normalisation pipeline for biomedical free text. This work was primarily focused on the free text fields identified in the CoLaus/PsyCoLaus cohort. The pipeline is summarised in Figure 1. It combines MetaMap, a machine learning and rule-based tool developed by the US National Library of Medicine, with a learning-to-rank model to extract information from free text and map it to concepts of the UMLS Metathesaurus, a meta-ontology covering and integrating NCI Thesaurus, ICD-10 and HPO concepts, among others. The first step in the pipeline is to query MetaMap with a free text passage. MetaMap returns a ranked list of candidate concepts for the input text. To improve MetaMap's precision, the next step reorders this candidate list using a learning-to-rank model. Finally, the first k candidate concepts in the re-ordered list are taken as the normalised concepts for the input text.
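The three steps above can be sketched as follows. This is only an illustrative outline: the MetaMap call and the learned scorer are mocked here (the real pipeline queries MetaMap itself and scores candidates with the trained learning-to-rank model), and the candidate CUIs and scores are made up for the example.

```python
# Sketch of the normalisation pipeline: query MetaMap, rerank the
# candidate list with a learned scorer, keep the top-k concepts.
# query_metamap and learned_score are stand-ins, not real MetaMap output.

def query_metamap(text):
    """Mocked MetaMap call: returns (CUI, preferred name, MetaMap score)."""
    return [
        ("C0010054", "Coronary Arteriosclerosis", 0.86),
        ("C1956346", "Coronary Artery Disease", 0.83),
        ("C0018799", "Heart Diseases", 0.61),
    ]

def rerank(candidates, score_fn):
    """Reorder MetaMap's candidates by the learned relevance score."""
    return sorted(candidates, key=score_fn, reverse=True)

def normalise(text, score_fn, k=1):
    """Top-k concepts for a free text passage after reranking."""
    return rerank(query_metamap(text), score_fn)[:k]

# Hypothetical learned scores that promote the correct concept.
learned_score = {"C1956346": 0.95, "C0010054": 0.40, "C0018799": 0.10}
top = normalise("coronary artery disease",
                lambda c: learned_score[c[0]], k=1)
```

Note that the reranker only reorders candidates that MetaMap already produced, which is why the quality of MetaMap's candidate set bounds the pipeline's recall.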

Figure 1: Biomedical normalisation pipeline with an example

Several neural-network-based learning-to-rank algorithms were explored, including RankMSE, RankNet, LambdaRank, ListNet and ListMLE. In our experiments, RankMSE achieved the best performance on the learning-to-rank task. We are also investigating specific MetaMap parameters that could generate a better set of candidates for the learning-to-rank step.
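RankMSE is a pointwise objective: a scoring model is trained to regress each candidate's relevance label with mean squared error, and candidates are then ranked by predicted score. The tiny linear model, hand-rolled gradient step and toy features below are illustrative only, not the project's actual network or features.

```python
# Minimal pointwise "RankMSE" sketch: fit scores to 0/1 relevance
# labels with MSE, then rank candidates by predicted score.

def score(w, features):
    """Linear scoring model: dot product of weights and features."""
    return sum(wi * xi for wi, xi in zip(w, features))

def mse_loss(preds, labels):
    """Mean squared error between predicted scores and labels."""
    return sum((p - y) ** 2 for p, y in zip(preds, labels)) / len(labels)

def sgd_step(w, feats, label, lr=0.1):
    """One SGD step on 0.5 * (score - label)^2 with respect to w."""
    err = score(w, feats) - label
    return [wi - lr * err * xi for wi, xi in zip(w, feats)]

# Toy candidates: two feature values each; label 1.0 marks the
# correct concept, 0.0 an incorrect one.
data = [([1.0, 0.2], 1.0), ([0.3, 0.9], 0.0)]
w = [0.0, 0.0]
for _ in range(200):
    for feats, label in data:
        w = sgd_step(w, feats, label)

# After training, ranking by predicted score puts the correct
# candidate first.
ranked = sorted(data, key=lambda d: score(w, d[0]), reverse=True)
```

Pairwise (RankNet) and listwise (ListNet, ListMLE) alternatives instead optimise the relative order of candidates rather than their absolute scores.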


Dataset

To assess the pipeline, we are using the Medical Concept Normalisation (MCN) corpus (1), in which UMLS concepts are manually annotated. The corpus contains 100 discharge summaries with 3,792 unique annotated concepts, including medical problems, treatments and tests. See Table 1 for an example of an MCN annotation.

Sentence in data: The patient is a 60-year-old male with a past medical history notable for coronary artery disease and CABG x2 in 2001.
Annotated biomedical phrase: coronary artery disease
UMLS CUI: C1956346

Table 1: Annotation example in MCN

Results and next steps

For the normalisation task, we achieved an accuracy of 74.21% using MetaMap alone and 76.12% when the learning-to-rank step is included. Our next step will be to apply this pipeline to the free text fields of CoLaus/PsyCoLaus. This will enhance the FAIRness of the data, in particular its Findability: once mapped to standard concepts, the cohort's free text fields become searchable through controlled vocabularies.
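The accuracy figures above correspond to top-1 accuracy: the fraction of annotated mentions whose top-ranked candidate CUI matches the gold UMLS CUI. A minimal sketch, with made-up mention IDs and predictions:

```python
# Top-1 normalisation accuracy: share of mentions where the
# top-ranked predicted CUI equals the gold annotation.

def accuracy(predictions, gold):
    """predictions/gold: mention id -> CUI."""
    correct = sum(1 for m, cui in gold.items() if predictions.get(m) == cui)
    return correct / len(gold)

# Illustrative data: 3 of 4 mentions are normalised correctly.
gold = {"m1": "C1956346", "m2": "C0027051", "m3": "C0018801", "m4": "C0011849"}
preds = {"m1": "C1956346", "m2": "C0027051", "m3": "C0018802", "m4": "C0011849"}
acc = accuracy(preds, gold)  # 0.75
```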


Putting the Pipeline Together

Together with other specialised text normalisers developed in the project, such as LexMapr, Zooma and Curami, our approach will be exposed as web services to be consumed by cohort data owners who need to normalise free text attributes in their data. The SIB/HES-SO text normalisation pipeline will specialise in extracting diagnosis concepts from text data.



Access

The code of the normalisation pipeline developed by SIB/HES-SO will become available in a public repository.



  1. Luo YF, Sun W, Rumshisky A. MCN: A Comprehensive Corpus for Medical Concept Normalization. Journal of biomedical informatics. 2019 Feb 22:103132.



Text-mining series, CINECA Short Reports, WP3, Blog
Jenny Copara, SIB/HES-SO
November 3, 2020
Tags: text mining, zooma, metadata, cohort, ontology, WP3

This project has received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement No 825775.
This project has received funding from the Canadian Institutes of Health Research under CIHR grant number 404896.