Dataset Information

Extracting medication information from unstructured public health data: a demonstration on data from population-based and tertiary-based samples.

ABSTRACT:

Background

Unstructured data from clinical epidemiological studies can be valuable and easy to obtain. However, it requires further extraction and processing for data analysis. Doing this manually is labor-intensive, slow and subject to error. In this study, we propose an automation framework for extracting and processing unstructured data.

Methods

The proposed automation framework consisted of two natural language processing (NLP) based tools for unstructured text data for medications and reasons for medication use. We first checked spelling using a spell-check program trained on publicly available knowledge sources and then applied NLP techniques. We mapped medication names into generic names using vocabulary from publicly available knowledge sources. We used WHO's Anatomical Therapeutic Chemical (ATC) classification system to map generic medication names to medication classes. We processed the reasons for medication with the Lancaster stemmer method and then grouped and mapped to disease classes based on organ systems. Finally, we demonstrated this automation framework on two data sources for Mylagic Encephalomyelitis/ Chronic Fatigue Syndrome (ME/CFS): tertiary-based (n?=?378) and population-based (n?=?664) samples.

Results

A total of 8681 raw medication records were used for this demonstration. The 1266 distinct medication names (omitting supplements) were condensed to 89 ATC classification system categories. The 1432 distinct raw reasons for medication use were condensed to 65 categories via NLP. Compared to completion of the entire process manually, our automation process reduced the number of the terms requiring manual labor for mapping by 84.4% for medications and 59.4% for reasons for medication use. Additionally, this process improved the precision of the mapped results.

Conclusions

Our automation framework demonstrates the usefulness of NLP strategies even when there is no established mapping database. For a less established database (e.g., reasons for medication use), the method is easily modifiable as new knowledge sources for mapping are introduced. The capability to condense large features into interpretable ones will be valuable for subsequent analytical studies involving techniques such as machine learning and data mining.

SUBMITTER: Chen R

PROVIDER: S-EPMC7559204 | biostudies-literature | 2020 Oct

REPOSITORIES: biostudies-literature

ACCESS DATA

Publications

Extracting medication information from unstructured public health data: a demonstration on data from population-based and tertiary-based samples.

Chen Robert R Ho Joyce C JC Lin Jin-Mann S JS

BMC medical research methodology 20201015 1

<h4>Background</h4>Unstructured data from clinical epidemiological studies can be valuable and easy to obtain. However, it requires further extraction and processing for data analysis. Doing this manually is labor-intensive, slow and subject to error. In this study, we propose an automation framework for extracting and processing unstructured data.<h4>Methods</h4>The proposed automation framework consisted of two natural language processing (NLP) based tools for unstructured text data for medica ...[more]

PMID: 33059588

Similar Datasets

Project description:BackgroundMerging disparate and heterogeneous datasets from clinical routine in a standardized and semantically enriched format to enable a multiple use of data also means incorporating unstructured data such as medical free texts. Although the extraction of structured data from texts, known as natural language processing (NLP), has been researched at least for the English language extensively, it is not enough to get a structured output in any format. NLP techniques need to be used together with clinical information standards such as openEHR to be able to reuse and exchange still unstructured data sensibly.ObjectivesThe aim of the study is to automatically extract crucial information from medical free texts and to transform this unstructured clinical data into a standardized and structured representation by designing and implementing an exemplary pipeline for the processing of pediatric medical histories.MethodsWe constructed a pipeline that allows reusing medical free texts such as pediatric medical histories in a structured and standardized way by (1) selecting and modeling appropriate openEHR archetypes as standard clinical information models, (2) defining a German dictionary with crucial text markers serving as expert knowledge base for a NLP pipeline, and (3) creating mapping rules between the NLP output and the archetypes. The approach was evaluated in a first pilot study by using 50 manually annotated medical histories from the pediatric intensive care unit of the Hannover Medical School.ResultsWe successfully reused 24 existing international archetypes to represent the most crucial elements of unstructured pediatric medical histories in a standardized form. The self-developed NLP pipeline was constructed by defining 3.055 text marker entries, 132 text events, 66 regular expressions, and a text corpus consisting of 776 entries for automatic correction of spelling mistakes. A total of 123 mapping rules were implemented to transform the extracted snippets to an openEHR-based representation to be able to store them together with other structured data in an existing openEHR-based data repository. In the first evaluation, the NLP pipeline yielded 97% precision and 94% recall.ConclusionThe use of NLP and openEHR archetypes was demonstrated as a viable approach for extracting and representing important information from pediatric medical histories in a structured and semantically enriched format. We designed a promising approach with potential to be generalized, and implemented a prototype that is extensible and reusable for other use cases concerning German medical free texts. In a long term, this will harness unstructured clinical data for further research purposes such as the design of clinical decision support systems. Together with structured data already integrated in openEHR-based representations, we aim at developing an interoperable openEHR-based application that is capable of automatically assessing a patient's risk status based on the patient's medical history at time of admission.

Project description:Background:Within the global endeavour of improving population health, one major challenge is the identification and integration of medical knowledge spread through several information sources. The creation of a comprehensive dataset of diseases and their clinical manifestations based on information from public sources is an interesting approach that allows one not only to complement and merge medical knowledge but also to increase it and thereby to interconnect existing data and analyse and relate diseases to each other. In this paper, we present DISNET (http://disnet.ctb.upm.es/), a web-based system designed to periodically extract the knowledge from signs and symptoms retrieved from medical databases, and to enable the creation of customisable disease networks. Methods:We here present the main features of the DISNET system. We describe how information on diseases and their phenotypic manifestations is extracted from Wikipedia and PubMed websites; specifically, texts from these sources are processed through a combination of text mining and natural language processing techniques. Results:We further present the validation of our system on Wikipedia and PubMed texts, obtaining the relevant accuracy. The final output includes the creation of a comprehensive symptoms-disease dataset, shared (free access) through the system's API. We finally describe, with some simple use cases, how a user can interact with it and extract information that could be used for subsequent analyses. Discussion:DISNET allows retrieving knowledge about the signs, symptoms and diagnostic tests associated with a disease. It is not limited to a specific category (all the categories that the selected sources of information offer us) and clinical diagnosis terms. It further allows to track the evolution of those terms through time, being thus an opportunity to analyse and observe the progress of human knowledge on diseases. We further discussed the validation of the system, suggesting that it is good enough to be used to extract diseases and diagnostically-relevant terms. At the same time, the evaluation also revealed that improvements could be introduced to enhance the system's reliability.

Dataset Information

Extracting medication information from unstructured public health data: a demonstration on data from population-based and tertiary-based samples.

Background

Methods

Results

Conclusions

Publications

Extracting medication information from unstructured public health data: a demonstration on data from population-based and tertiary-based samples.

Similar Datasets

OmicsDI is part of the ELIXIR infrastructure

Tweets