CINECA
Common Infrastructure for National Cohorts in Europe, Canada, and Africa

LexMapr - A rule-based text-mining tool for ontology term mapping and classification

This post is the second in a series on a text-mining pipeline being developed by CINECA, following the previous instalment, "Uncovering metadata from semi-structured cohort data".

The Common Infrastructure for National Cohorts in Europe, Canada, and Africa (CINECA) has a vision of a federated, cloud-enabled infrastructure that makes population-scale genomic and biomolecular data accessible across international borders, accelerating research and improving the health of individuals across continents. CINECA’s Work Package 3 (WP3) addresses the metadata representation needs of cohort aggregate and individual data across studies and over time. The CINECA text-mining group in WP3 is collaborating to provide tools and methods that harmonize data from structured, semi-structured and narrative content fields in cohort data. The group has initiated development of a text-mining pipeline based on the different approaches of the contributing partners, initially focusing on linked cohorts. The integrated pipeline draws from the MetaMap rule-based framework (HES-SO/SIB), the LexMapr rule-based text-mining tool (SFU) and the Zooma entity annotation tool (EMBL-EBI).


Background to the development of LexMapr 

At SFU, we have developed LexMapr, a rule-based text-mining tool that cleans up and parses short-form unstructured text to extract biomedical entities and map them to standard ontology terms. To complement the other text-mining tools in WP3, LexMapr focuses on harmonizing narrative content, i.e. the field values of cohort data. It combines basic lexicographic transformation with Natural Language Processing (NLP), synonymy, ontology and other resource lexicons to produce a tokenized equivalent description suitable for keyword- and ontology-driven search of database contents. The LexMapr pipeline addresses many challenges in the processing of short biomedical phrases. The tool was initially developed to fulfill the biosample metadata harmonization objectives of public health surveillance networks such as the US FDA’s GenomeTrakr system and the US National Antimicrobial Resistance Monitoring System (NARMS), whose objective is to harmonize short phrases describing food pathogen source data using standard vocabularies, for reporting of transmission dynamics in public health foodborne pathogen surveillance and investigation.


Cleaning and ontology term mapping

The initial focus of LexMapr development has been on providing a text-mining tool that cleans up short free-text biosample metadata containing inconsistent punctuation, abbreviations and typos, and maps the identified entities to standard terms from ontologies. Because biosample phrases occupy a very focused semantic domain, and short phrases pose very specific processing challenges, we have employed a rule-based approach that draws upon wide-ranging lexical resources. LexMapr implements different rules for the pre-processing, normalization, entity recognition and ontology term mapping tasks, and makes use of domain-specific customized lexicons for abbreviation and acronym normalization, non-English usage, and spelling correction.

LexMapr pre-processes the input biosample descriptions through a series of steps for data cleaning, punctuation and case treatment, singularization and spelling correction. This pre-processing phase improves output by providing cleaned phrases for the subsequent entity recognition and term mapping steps. The normalization phase then transforms entities to their normalized forms before term mapping is performed: LexMapr normalizes abbreviations, acronyms and non-English usage in biosample descriptions by successively applying rules to the pre-processed phrases. In the term mapping phase, LexMapr applies several rules to the pre-processed and normalized samples to detect relevant entities and map them to ontology terms. Two key food biosample domain ontologies, FoodOn and GenEpiO, which cover clinical, epidemiological and food semantics, have been selected as the target ontologies for standardizing biosamples.
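The cleaning and normalization phases described above can be sketched in a few lines of Python. This is a minimal illustration under assumed toy lookup tables; the real LexMapr ships much larger, domain-curated lexicons, and none of the function names here reflect its actual API.

```python
import re

# Hypothetical miniature lexicons standing in for LexMapr's customized resources.
ABBREVIATIONS = {"frz": "frozen", "frz.": "frozen"}
SPELLING_FIXES = {"mackeral": "mackerel", "spaiens": "sapiens"}

def singularize(token):
    # Deliberately naive plural handling, for illustration only.
    if token.endswith("ies"):
        return token[:-3] + "y"
    if token.endswith("s") and not token.endswith("ss"):
        return token[:-1]
    return token

def preprocess(phrase):
    """Clean a short biosample description: case treatment, punctuation
    treatment, abbreviation expansion, spelling correction, singularization."""
    text = phrase.lower()                                # change of case
    text = re.sub(r"[;,/-]", " ", text)                  # punctuation treatment
    tokens = text.split()
    tokens = [ABBREVIATIONS.get(t, t) for t in tokens]   # abbreviation/acronym treatment
    tokens = [SPELLING_FIXES.get(t, t) for t in tokens]  # spelling correction
    tokens = [singularize(t) for t in tokens]            # singularization
    return tokens

print(preprocess("Frz. Lobster Tails"))  # ['frozen', 'lobster', 'tail']
```

The cleaned token list is what downstream entity recognition and term mapping operate on; each treatment corresponds to a rule column in Table 1 below.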

Different rules have been implemented to deal with irregular case usage, long names, naming variations and word ordering in input phrases and ontology term labels, as well as suffix addition to input text. When a biosample description does not directly mention a standard term, LexMapr attempts an indirect mapping to ontology terms by making use of synonyms. For synonym substitution, LexMapr primarily uses the exact synonyms for standard terms available in the selected ontologies. If a synonym is not found in the ontologies, LexMapr consults SynLex, a customized lookup table housing additional specimen-domain synonyms that are not yet available in the ontologies and are candidates for curation and inclusion in them. The mapped set of terms is further refined with ontology-driven pruning (using the hierarchical structure of the ontologies) to retain the more specific terms when multiple mappings are obtained. Figure 1 shows the high-level architecture of LexMapr and its different enabling components.

Figure 1: High-Level Architecture of LexMapr
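The synonym lookup order and ontology-driven pruning described above can be sketched as follows. The term IDs are taken from Table 1, but the lookup tables and the one-edge hierarchy are hypothetical stand-ins for the real ontology resources and SynLex; this is not LexMapr's actual code.

```python
# Exact synonyms recorded in the selected ontologies (toy excerpt).
ONTOLOGY_SYNONYMS = {"feces": "uberon_0001988"}
# SynLex: curated specimen-domain synonyms not yet in the ontologies.
SYNLEX = {"stool": "feces"}
# Child -> parent edges of the ontology hierarchy (illustrative only).
PARENTS = {"foodon_03301620": "foodon_00001131"}  # fish meal is-a meat food product

def map_term(token):
    """Direct ontology synonym lookup first, then SynLex fallback."""
    if token in ONTOLOGY_SYNONYMS:
        return ONTOLOGY_SYNONYMS[token]
    if token in SYNLEX:
        return ONTOLOGY_SYNONYMS.get(SYNLEX[token])
    return None

def prune(term_ids):
    """Ontology-driven pruning: when both an ancestor and its descendant
    were mapped, keep only the more specific (descendant) term."""
    ancestors = set()
    for t in term_ids:
        p = PARENTS.get(t)
        while p is not None:
            ancestors.add(p)
            p = PARENTS.get(p)
    return [t for t in term_ids if t not in ancestors]

print(map_term("stool"))                              # uberon_0001988
print(prune(["foodon_03301620", "foodon_00001131"]))  # ['foodon_03301620']
```

The SynLex fallback doubles as a curation feed: any synonym resolved through it, rather than through the ontology itself, is a candidate for inclusion in the corresponding ontology.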

Table 1 shows a snapshot of term mapping and classification results exemplifying the usage of different rules and treatments.

| Specimen description | Matched ontology terms with IDs | Rule applied | IFSAC+ classification |
|---|---|---|---|
| Fish-meal | fish meal:foodon_03301620 | Punctuation treatment | ['fish'] |
| quail, frozen | quail:foodon_03411346, frozen:pato_0001985 | Punctuation treatment | ['avian'] |
| Soil | soil:envo_00001998 | Change of case | ['environmental'] |
| Garlic Powder | garlic powder:foodon_03301844 | Change of case | ['root/underground (bulbs)'] |
| stool | feces:uberon_0001988 | Synonym substitution | ['clinical/research'] |
| bird | avian animal:foodon_00002616 | Synonym substitution | ['companion animal'] |
| Pecans | pecan (whole, raw):foodon_03315232 | Singularization | ['nuts'] |
| sesame seeds | sesame seed:foodon_03310306 | Singularization | ['seeds'] |
| Mackeral | mackerel:foodon_03411043 | Spelling correction treatment | ['fish'] |
| homo spaiens; Stool | homo sapiens:ncbitaxon_9606, feces:uberon_0001988 | Spelling correction treatment | ['clinical/research', 'human'] |
| spice mix | spice mixture:foodon_03304292 | Abbreviation-Acronym treatment | ['herbs'] |
| frz frog legs | frog leg (frozen):foodon_03305167 | Abbreviation-Acronym treatment | ['other aquatic animals'] |
| poultry | poultry meat food product:foodon_00001131 | Suffix Addition ("meat food product" appended to the input) | ['poultry'] |
| sediment, stream | stream sediment:envo_00002127 | Permutation of tokens in input text | ['environmental'] |
| sesame seeds, hulled | sesame seed (hulled):foodon_03304876 | Permutation of tokens in bracketed resource term | ['seeds'] |
| methi | fenugreek food product:foodon_00001837 | Non-English substitution treatment | ['herbs'] |
| tulsi powder | basil food product:foodon_00003044, food (powdered):foodon_00002976 | Non-English substitution treatment | ['herbs'] |
| frz. Fish | fish (frozen):foodon_03301083 | Multiple rules: Abbreviation-Acronym treatment, Permutation of tokens in bracketed resource term | ['fish'] |
| frz cooked shrimp | shrimp (cooked, frozen):foodon_03308827 | Multiple rules: Abbreviation-Acronym treatment, Permutation of tokens in bracketed resource term | ['crustaceans'] |
| frz. lobster tails | lobster tail (frozen):foodon_03305435 | Multiple rules: Permutation of tokens in bracketed resource term, Inflection (plural) treatment, Abbreviation treatment | ['crustaceans'] |
Table 1. A snapshot of term mapping and classification results obtained by LexMapr based on the application of different rules/treatments.

Ontology-Driven Classification

Once LexMapr has accomplished its primary task of linking free text to standard ontology terms within a mapping and reporting framework, it provides a platform for many potential ontology-driven applications. The tool has been equipped with functionality to classify input phrases according to institution-specific classification schemes. This classification functionality has initially been used for ontology-driven classification of biosample metadata against the epidemiology-focused food classification scheme of the Interagency Food Safety Analytics Collaboration (IFSAC), used for categorizing foods implicated in outbreaks and provided by GenomeTrakr and NARMS.

For the classification task, LexMapr uses predefined nodes of ontologies as buckets (containers) to characterise specific third-party classes (Figure 2). The LexMapr pipeline classification component provides functionality for biosamples to be linked to these buckets (ontology IDs) and hence to be categorized according to third-party classes (IFSAC initially). 

Figure 2: LexMapr Ontology-Driven Classification

To support the specific requirements of third-party classification schemas, LexMapr applies multiple classification rules to further refine the preliminary classification results. For example, a post-refinement rule classifies an input phrase such as "macaroni and cheese" as "Multi-ingredient" (an IFSAC class) when it contains more than one food ingredient. When applied to real-world GenomeTrakr and NARMS biosample metadata, LexMapr exposed and reported the incompleteness of the existing third-party scheme in describing a variety of biosamples. Subsequent deliberations with GenomeTrakr and NARMS have led to the development of an enhanced and improved classification scheme, "IFSAC+", with greater coverage of the biosamples. LexMapr has enabled the reorganization of classes and the introduction of many new classes in the schema, and has generated many new candidate terms for curation and inclusion in the ontologies. For example, LexMapr has helped ontologies such as FoodOn add new terms, and the use of their synonyms has enabled the capture of previously missed terms in GenomeTrakr food descriptions. Work is in progress to equip LexMapr with a mechanism for performing ontology-driven classification configured to any institution-specific classification schema provided by the user.
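The bucket mechanism and a post-refinement rule can be sketched as below. The ontology IDs come from Table 1, but the bucket assignments and the simplified multi-ingredient rule are illustrative assumptions, not LexMapr's actual configuration.

```python
# Hypothetical bucket table: predefined ontology nodes (IDs) mapped to
# third-party (IFSAC-style) classes.
BUCKETS = {
    "foodon_03411043": "fish",                       # mackerel
    "uberon_0001988": "clinical/research",           # feces
    "foodon_03301844": "root/underground (bulbs)",   # garlic powder
}

def classify(mapped_term_ids):
    """Link mapped ontology IDs to buckets, then apply a post-refinement
    rule: more than one food class in a phrase -> 'Multi-ingredient'."""
    classes = [BUCKETS[t] for t in mapped_term_ids if t in BUCKETS]
    # Non-food classes (e.g. clinical/research) do not trigger the rule,
    # so 'homo sapiens; stool' stays multi-class rather than multi-ingredient.
    food_classes = [c for c in classes if c != "clinical/research"]
    if len(food_classes) > 1:
        return ["Multi-ingredient"]
    return classes

print(classify(["uberon_0001988"]))                      # ['clinical/research']
print(classify(["foodon_03411043", "foodon_03301844"]))  # ['Multi-ingredient']
```

Because buckets are just ontology node IDs, swapping in a different institution's schema amounts to supplying a different bucket table, which is the configurability the next-steps work targets.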



Next steps

Although LexMapr was initially developed to serve the biosample domain, our general approach to cleaning and harmonizing data can be applied to different cohort domains in the CINECA project by adding selected domain-specific ontologies and rules. The tool has recently been configured to allow customized selection of ontologies and lexical resources. LexMapr is being adapted to provide a framework for automated clean-up of common errors and inconsistencies in the field descriptions that describe different features in cohort domains such as disease, physical environment and laboratory measures. We will also be enhancing the tool to accept additional input document types, e.g. Case Report Forms, cohort text descriptions from cohort-specific databases, and quality reports on samples.


Access

The LexMapr source code is publicly available at https://github.com/Public-Health-Bioinformatics/LexMapr. LexMapr is available both as a locally installable command-line tool and via a Django-based website providing a simple graphical interface (http://watson.bccdc.med.ubc.ca:8000/lexmapr/), whose usability and functionality are being enhanced. We would like to acknowledge our funding bodies: USDA (Agreement Number: 58-8040-8-014-F), CIHR (Award reference: PJT-159456), and Genome British Columbia / Genome Canada (286GET).

Text-mining series, CINECA Short Reports, WP3 | Gurinder Gosal, University of British Columbia | October 8, 2020
Tags: cohort, ontology, text mining, metadata, WP3

This project has received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement No 825775.


This project has received funding from the Canadian Institutes of Health Research under CIHR grant number 404896.

info@cineca-project.eu

© Copyright 2022 CINECA project.