Dataset Information

Machine learning in medicine: a practical introduction to natural language processing.

ABSTRACT:

Background

Unstructured text, including medical records, patient feedback, and social media comments, can be a rich source of data for clinical research. Natural language processing (NLP) describes a set of techniques used to convert passages of written text into interpretable datasets that can be analysed by statistical and machine learning (ML) models. The purpose of this paper is to provide a practical introduction to contemporary techniques for the analysis of text-data, using freely-available software.

Methods

We performed three NLP experiments using publicly-available data obtained from medicine review websites. First, we conducted lexicon-based sentiment analysis on open-text patient reviews of four drugs: Levothyroxine, Viagra, Oseltamivir and Apixaban. Next, we used unsupervised ML (latent Dirichlet allocation, LDA) to identify similar drugs in the dataset, based solely on their reviews. Finally, we developed three supervised ML algorithms to predict whether a drug review was associated with a positive or negative rating. These algorithms were: a regularised logistic regression, a support vector machine (SVM), and an artificial neural network (ANN). We compared the performance of these algorithms in terms of classification accuracy, area under the receiver operating characteristic curve (AUC), sensitivity and specificity.
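The lexicon-based step described above can be illustrated with a minimal sketch. The word lists below are hypothetical toy examples, not the lexicon used in the paper, and the function names are assumptions introduced only for illustration. The sketch counts positive and negative words per review and reports the proportion of positive sentiment words across a set of reviews.

```python
import re

# Hypothetical toy lexicon (assumption); published lexicons contain thousands of entries.
POSITIVE = {"better", "great", "effective", "helped", "improved"}
NEGATIVE = {"worse", "nausea", "pain", "terrible", "headache"}


def tokenize(text):
    """Lower-case the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def sentiment_counts(text):
    """Return (positive, negative) word counts for one review."""
    tokens = tokenize(text)
    pos = sum(token in POSITIVE for token in tokens)
    neg = sum(token in NEGATIVE for token in tokens)
    return pos, neg


def proportion_positive(reviews):
    """Proportion of sentiment-bearing words that are positive, or None if there are none."""
    pos_total = 0
    neg_total = 0
    for review in reviews:
        pos, neg = sentiment_counts(review)
        pos_total += pos
        neg_total += neg
    total = pos_total + neg_total
    return pos_total / total if total else None


if __name__ == "__main__":
    reviews = ["This helped and I feel better", "Terrible nausea"]
    print(proportion_positive(reviews))
```

A simple word count like this ignores negation ("no pain" still counts as negative), which is one known limitation of lexicon-based approaches.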

Results

Levothyroxine and Viagra were reviewed with a higher proportion of positive sentiments than Oseltamivir and Apixaban. One of the three LDA clusters clearly represented drugs used to treat mental health problems. A common theme suggested by this cluster was drugs taking weeks or months to work. Another cluster clearly represented drugs used as contraceptives. Supervised machine learning algorithms predicted positive or negative drug ratings with classification accuracies ranging from 0.664, 95% CI [0.608, 0.716] for the regularised regression to 0.720, 95% CI [0.664, 0.776] for the SVM.
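The abstract does not state how the confidence intervals around classification accuracy were computed. One common choice for a proportion is the Wilson score interval, sketched below as an assumption rather than as the authors' method. The function name and the example counts are hypothetical.

```python
import math


def wilson_interval(correct, total, z=1.959964):
    """Wilson score interval for a proportion (z=1.959964 gives about 95% coverage)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = correct / total
    denom = 1 + z * z / total
    centre = (p + z * z / (2 * total)) / denom
    half_width = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return centre - half_width, centre + half_width


if __name__ == "__main__":
    # Hypothetical counts: 70 correct predictions out of 100 test reviews.
    low, high = wilson_interval(70, 100)
    print(f"accuracy = 0.700, 95% CI [{low:.3f}, {high:.3f}]")
```

The interval narrows as the test set grows, which is why accuracy estimates from small test sets should be read together with their intervals.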

Conclusions

In this paper, we present a conceptual overview of common techniques used to analyse large volumes of text, and provide reproducible code that can be readily applied to other research studies using open-source software.

SUBMITTER: Harrison CJ

PROVIDER: S-EPMC8325804 | biostudies-literature | 2021 Jul

REPOSITORIES: biostudies-literature


Publications

Machine learning in medicine: a practical introduction to natural language processing.

Harrison CJ, Sidey-Gibbons CJ

BMC medical research methodology 20210731 1


PMID: 34332525

Similar Datasets

Project description:
Background: Following visible successes on a wide range of predictive tasks, machine learning techniques are attracting substantial interest from medical researchers and clinicians. We address the need for capacity development in this area by providing a conceptual introduction to machine learning alongside a practical guide to developing and evaluating predictive algorithms using freely-available open source software and public domain data.
Methods: We demonstrate the use of machine learning techniques by developing three predictive models for cancer diagnosis using descriptions of nuclei sampled from breast masses. These algorithms include regularized General Linear Model regression (GLMs), Support Vector Machines (SVMs) with a radial basis function kernel, and single-layer Artificial Neural Networks. The publicly-available dataset describing the breast mass samples (N=683) was randomly split into evaluation (n=456) and validation (n=227) samples. We trained algorithms on data from the evaluation sample before they were used to predict the diagnostic outcome in the validation dataset. We compared the predictions made on the validation datasets with the real-world diagnostic decisions to calculate the accuracy, sensitivity, and specificity of the three models. We explored the use of averaging and voting ensembles to improve predictive performance. We provide a step-by-step guide to developing algorithms using the open-source R statistical programming environment.
Results: The trained algorithms were able to classify cell nuclei with high accuracy (.94-.96), sensitivity (.97-.99), and specificity (.85-.94). Maximum accuracy (.96) and area under the curve (.97) were achieved using the SVM algorithm. Prediction performance increased marginally (accuracy = .97, sensitivity = .99, specificity = .95) when algorithms were arranged into a voting ensemble.
Conclusions: We use a straightforward example to demonstrate the theory and practice of machine learning for clinicians and medical researchers. The principles which we demonstrate here can be readily applied to other complex tasks including natural language processing and image recognition.
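The voting ensemble mentioned in this description can be sketched as a simple majority vote over the class labels predicted by each model. This is a minimal illustration under assumed inputs, not the authors' R implementation. The function name and the example predictions are hypothetical.

```python
from collections import Counter


def majority_vote(model_predictions):
    """Combine per-model label lists into one list by majority vote per sample.

    model_predictions: list of lists, one inner list of labels per model,
    all of the same length. With an odd number of models and two classes,
    ties cannot occur.
    """
    if not model_predictions:
        raise ValueError("at least one model is required")
    n_samples = len(model_predictions[0])
    if any(len(preds) != n_samples for preds in model_predictions):
        raise ValueError("all models must predict the same number of samples")
    combined = []
    for sample_votes in zip(*model_predictions):
        # most_common(1) returns the label with the highest vote count.
        label, _count = Counter(sample_votes).most_common(1)[0]
        combined.append(label)
    return combined


if __name__ == "__main__":
    # Hypothetical predictions from three models on four samples.
    glm = ["malignant", "benign", "benign", "malignant"]
    svm = ["malignant", "benign", "malignant", "benign"]
    ann = ["benign", "benign", "malignant", "malignant"]
    print(majority_vote([glm, svm, ann]))
```

An averaging ensemble would instead average the predicted probabilities from each model before applying a decision threshold.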

Project description:
Background: Machine learning systems are part of the field of artificial intelligence that automatically learn models from data to make better decisions. Natural language processing (NLP), by using corpora and learning approaches, provides good performance in statistical tasks, such as text classification or sentiment mining.
Objective: The primary aim of this systematic review was to summarize and characterize, in methodological and technical terms, studies that used machine learning and NLP techniques for mental health. The secondary aim was to consider the potential use of these methods in mental health clinical practice.
Methods: This systematic review follows the PRISMA (Preferred Reporting Items for Systematic Review and Meta-analysis) guidelines and is registered with PROSPERO (Prospective Register of Systematic Reviews; number CRD42019107376). The search was conducted using 4 medical databases (PubMed, Scopus, ScienceDirect, and PsycINFO) with the following keywords: machine learning, data mining, psychiatry, mental health, and mental disorder. The exclusion criteria were as follows: languages other than English, anonymization process, case studies, conference papers, and reviews. No limitations on publication dates were imposed.
Results: A total of 327 articles were identified, of which 269 (82.3%) were excluded and 58 (17.7%) were included in the review. The results were organized through a qualitative perspective. Although studies had heterogeneous topics and methods, some themes emerged. Population studies could be grouped into 3 categories: patients included in medical databases, patients who came to the emergency room, and social media users. The main objectives were to extract symptoms, classify severity of illness, compare therapy effectiveness, provide psychopathological clues, and challenge the current nosography. Medical records and social media were the 2 major data sources. With regard to the methods used, preprocessing used the standard methods of NLP and unique identifier extraction dedicated to medical texts. Efficient classifiers were preferred rather than transparent functioning classifiers. Python was the most frequently used platform.
Conclusions: Machine learning and NLP models have been highly topical issues in medicine in recent years and may be considered a new paradigm in medical research. However, these processes tend to confirm clinical hypotheses rather than developing entirely new information, and only one major category of the population (ie, social media users) is an imprecise cohort. Moreover, some language-specific features can improve the performance of NLP methods, and their extension to other languages should be more closely investigated. However, machine learning and NLP techniques provide useful information from unexplored data (ie, patients' daily habits that are usually inaccessible to care providers). Before considering it as an additional tool of mental health care, ethical issues remain and should be discussed in a timely manner. Machine learning and NLP methods may offer multiple perspectives in mental health research but should also be considered as tools to support clinical practice.

Project description: The lack of standardized structure names in radiotherapy (RT) data limits interoperability, data sharing, and the ability to perform big data analysis. To standardize radiotherapy structure names, we developed an integrated natural language processing (NLP) and machine learning (ML) based system that can map the physician-given structure names to American Association of Physicists in Medicine (AAPM) Task Group 263 (TG-263) standard names. The dataset consists of 794 prostate and 754 lung cancer patients across the 40 different radiation therapy centers managed by the Veterans Health Administration (VA). Additionally, data from the Radiation Oncology department at Virginia Commonwealth University (VCU) was collected to serve as a test set. Domain experts identified nine prostate and ten lung organs-at-risk (OAR) structures as anatomically significant and manually labeled them according to the TG-263 standards; the remaining structures were labeled as Non_OAR. We experimented with six different classification algorithms and three feature vector methods, and the final model was built with the fastText algorithm. Multiple validation techniques were used to assess the robustness of the proposed methodology. The macro-averaged F1 score was used as the main evaluation metric. The model achieved an F1 score of 0.97 on prostate structures and 0.99 on lung structures from the VA dataset. The model also performed well on the test (VCU) dataset, achieving an F1 score of 0.93 on prostate structures and 0.95 on lung structures. In this work, we demonstrate that NLP and ML based approaches can be used to standardize the physician-given RT structure names with high fidelity. This standardization can help with big data analytics in the radiation therapy domain using population-derived datasets, including standardization of the treatment planning process, clinical decision support systems, treatment quality improvement programs, and hypothesis-driven clinical research.
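The macro-averaged F1 score used as the main metric in this description is the unweighted mean of the per-class F1 scores, so rare classes count as much as common ones. A minimal sketch follows; the function name and the example labels are hypothetical, and classes are taken from the union of true and predicted labels.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores.

    Per-class F1 is computed as 2*TP / (2*TP + FP + FN); a class with no
    true positives, false positives, or false negatives scores 0.
    """
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    classes = sorted(set(y_true) | set(y_pred))
    if not classes:
        raise ValueError("no labels provided")
    scores = []
    for cls in classes:
        tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
        fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))
        fn = sum(t == cls and p != cls for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


if __name__ == "__main__":
    # Hypothetical structure-name labels for four contours.
    y_true = ["Bladder", "Bladder", "Non_OAR", "Non_OAR"]
    y_pred = ["Bladder", "Non_OAR", "Non_OAR", "Non_OAR"]
    print(round(macro_f1(y_true, y_pred), 4))
```

In the example, the per-class scores are 2/3 for "Bladder" and 0.8 for "Non_OAR", so the macro average is their simple mean.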

Project description:
Importance: Nonfatal gunshot injuries are the most common firearm injury, but where they frequently occur remains unclear owing to data limitations. Natural language processing can be applied to medical text narratives of gunshot injury records to classify injury location and inform prevention efforts.
Objective: To examine the performance of natural language processing (NLP) and machine learning models to predict nonfatal gunshot injury locations and generate new national estimates of the locations in which these injuries occur.
Design, setting, and participants: Cross-sectional study of data from the National Electronic Injury Surveillance System Firearm Injury Surveillance Study on nonfatal gunshot injuries that occurred in the US between January 1, 1993, and December 31, 2015. The unweighted sample included 59 025 gunshot injuries that were initially treated at emergency departments. Data were analyzed from June 1, 2019, to July 24, 2020.
Main outcomes and measures: The primary outcomes were classification of injury location and subsequent estimation of nonfatal gunshot injury location. The NLP was used to generate 6 sets of predictors, and 4 machine learning models were trained to classify the missing locations: multinomial support vector machines, lasso regression, XGBoost gradient descent, and feed-forward neural networks. For each of the 6 sets of NLP predictors, 70% of records with locations were randomly sampled to form the training set and the remaining 30% of records composed the test set. The best-performing model was validated by comparing the predicted locations with those from existing national estimates of nonfatal gunshot injuries stratified by location and intent.
Results: The unweighted sample included 59 025 nonfatal gunshot injuries; patients with these injuries were predominantly male (n = 52 630 [89.2%]), of Black race/ethnicity (n = 29 304 [49.6%]), and young (15-24 years; n = 27 037 [45.8%]). In total, 54 089 nonfatal gunshot injuries that were weighted to approximate national estimates were included in the analysis. Existing national estimates suggest that the most prevalent nonfatal gunshot injury location is the home (n = 14 764 [23.4%]), followed by the street or highway (n = 14 402 [22.9%]), and other public places (n = 7276 [11.6%]). After implementation of NLP classification, the most frequent gunshot injury location was the street or highway (n = 27 200 [46.1%]), followed by the home (n = 23 738 [37.7%]), and other public places (n = 10 439 [15.1%]).
Conclusions and relevance: The findings of this study suggest that NLP and machine learning models may be useful for classifying gunshot injury location and that most nonfatal gunshot injuries occur in the street or highway rather than in the home; these findings can inform future firearm injury prevention efforts.
