Dataset Information

medExtractR: A targeted, customizable approach to medication extraction from electronic health records.

ABSTRACT:

Objective

We developed medExtractR, a natural language processing system to extract medication information from clinical notes. Using a targeted approach, medExtractR focuses on individual drugs to facilitate creation of medication-specific research datasets from electronic health records.

Materials and methods

Written using the R programming language, medExtractR combines lexicon dictionaries and regular expressions to identify relevant medication entities (eg, drug name, strength, frequency). MedExtractR was developed on notes from Vanderbilt University Medical Center, using medications prescribed with varying complexity. We evaluated medExtractR and compared it with 3 existing systems: MedEx, MedXN, and CLAMP (Clinical Language Annotation, Modeling, and Processing). We also demonstrated how medExtractR can be easily tuned for better performance on an outside dataset using the MIMIC-III (Medical Information Mart for Intensive Care III) database.
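The general idea of pairing lexicon dictionaries with regular expressions can be sketched as follows. This is a minimal illustration in Python, not the medExtractR implementation (which is written in R) and not its API: the word lists, the strength pattern, the `extract_entities` function, and the 60-character search window are all assumptions made for the example.

```python
import re

# Hypothetical mini-lexicons; a real system would load much larger dictionaries.
DRUG_NAMES = ["tacrolimus", "prograf"]
FREQUENCY_TERMS = ["twice daily", "twice a day", "bid", "daily", "every 12 hours"]

# Assumed pattern for strength: a number followed by a unit, e.g. "1 mg" or "0.5mg".
STRENGTH_RE = re.compile(r"\b\d+(?:\.\d+)?\s?(?:mg|mcg|g|ml)\b", re.IGNORECASE)


def _lexicon_pattern(terms):
    """Compile a lexicon into one regex, longest terms first so that
    'twice daily' is preferred over the shorter 'daily'."""
    ordered = sorted(terms, key=len, reverse=True)
    body = "|".join(re.escape(t) for t in ordered)
    return re.compile(r"\b(?:" + body + r")\b", re.IGNORECASE)


DRUG_RE = _lexicon_pattern(DRUG_NAMES)
FREQ_RE = _lexicon_pattern(FREQUENCY_TERMS)


def extract_entities(note, window=60):
    """For each drug mention, search a fixed-size window of text after it
    for strength and frequency expressions.

    Returns one list per drug mention, each holding (label, text, offset) tuples.
    """
    results = []
    for drug in DRUG_RE.finditer(note):
        start = drug.end()
        context = note[start:start + window]
        found = [("DrugName", drug.group(), drug.start())]
        for label, pattern in (("Strength", STRENGTH_RE), ("Frequency", FREQ_RE)):
            for m in pattern.finditer(context):
                found.append((label, m.group(), start + m.start()))
        results.append(found)
    return results


if __name__ == "__main__":
    example = "Patient takes tacrolimus 1 mg twice daily with food."
    for mention in extract_entities(example):
        print(mention)
```

Restricting the search to text near a named drug is one simple way to express a targeted, drug-focused approach; the window size is a tunable assumption here.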

Results

On 50 test notes per development drug and 110 test notes for an additional drug, medExtractR achieved high overall performance (F-measures >0.95), exceeding performance of the 3 existing systems across all drugs. MedExtractR achieved the highest F-measure for each individual entity, except drug name and dose amount for allopurinol. With tuning and customization, medExtractR achieved F-measures >0.90 in the MIMIC-III dataset.
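For readers unfamiliar with the metric, the sketch below shows how an F-measure is commonly computed from entity-level counts. It assumes the balanced F1 form (the harmonic mean of precision and recall); the abstract does not state which variant was used, and the function name and the counts in the usage line are hypothetical, not results from the study.

```python
def f_measure(true_pos, false_pos, false_neg):
    """Return (precision, recall, F1) from entity-level counts.

    Assumes the balanced F1 form; returns 0.0 where a ratio is undefined.
    """
    predicted = true_pos + false_pos
    actual = true_pos + false_neg
    precision = true_pos / predicted if predicted else 0.0
    recall = true_pos / actual if actual else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


if __name__ == "__main__":
    # Illustrative counts only: 95 correct extractions, 5 spurious, 5 missed.
    print(f_measure(95, 5, 5))
```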

Discussion

The medExtractR system successfully extracted entities for medications of interest. High performance in entity-level extraction provides a strong foundation for developing robust research datasets for pharmacological research. When working with new datasets, medExtractR should be tuned on a small sample of notes before being broadly applied.

Conclusions

The medExtractR system achieved high performance extracting specific medications from clinical text, leading to higher-quality research datasets for drug-related studies than some existing general-purpose medication extraction tools.

SUBMITTER: Weeks HL

PROVIDER: S-EPMC7025369 | biostudies-literature | 2020 Mar

REPOSITORIES: biostudies-literature


Publications

medExtractR: A targeted, customizable approach to medication extraction from electronic health records.

Weeks HL, Beck C, McNeer E, Williams ML, Bejan CA, Denny JC, Choi L

Journal of the American Medical Informatics Association: JAMIA, 2020-03-01, issue 3


PMID: 31943012

Similar Datasets

Project description:

Background

Multi-center electronic health records (EHR) can support quality improvement initiatives and comparative effectiveness research in stroke care. However, limitations of EHR-based research include challenges in abstracting key clinical variables from non-structured data at scale. This is further compounded by missing data. Here we develop a natural language processing (NLP) model that automatically reads EHR notes to determine the NIH stroke scale (NIHSS) score of patients with acute stroke.

Methods

The study included notes from acute stroke patients (>= 18 years) admitted to the Massachusetts General Hospital (MGH) (2015-2022). The MGH data were divided into training (70%) and hold-out test (30%) sets. A two-stage model was developed to predict the admission NIHSS. A linear model with the least absolute shrinkage and selection operator (LASSO) was trained within the training set. For notes in the test set where the NIHSS was documented, the scores were extracted using regular expressions (stage 1); for notes where the NIHSS was not documented, LASSO was used for prediction (stage 2). The reference standard for NIHSS was obtained from the Get With The Guidelines Stroke Registry. The two-stage model was tested on the hold-out test set and validated in the MIMIC-III dataset (Medical Information Mart for Intensive Care III, 2001-2012, v1.4), using root mean squared error (RMSE) and Spearman correlation (SC).

Results

We included 4,163 patients (MGH = 3,876; MIMIC = 287); average age of 69 [SD 15] years; 53% male, and 72% white. 90% of patients had ischemic stroke and 10% hemorrhagic stroke. The two-stage model achieved an RMSE [95% CI] of 3.13 [2.86-3.41] (SC = 0.90 [0.88-0.91]) in the MGH hold-out test set and 2.01 [1.58-2.38] (SC = 0.96 [0.94-0.97]) in the MIMIC validation set.

Conclusions

The automatic NLP-based model can enable large-scale stroke severity phenotyping from EHR and therefore support real-world quality improvement and comparative effectiveness studies in stroke.