Dataset Information

Clinical Text Data in Machine Learning: Systematic Review.

ABSTRACT: BACKGROUND: Clinical narratives represent the main form of communication within health care, providing a personalized account of patient history and assessments, and offering rich information for clinical decision making. Natural language processing (NLP) has repeatedly demonstrated its feasibility to unlock evidence buried in clinical narratives. Machine learning can facilitate rapid development of NLP tools by leveraging large amounts of text data. OBJECTIVE: The main aim of this study was to provide systematic evidence on the properties of text data used to train machine learning approaches to clinical NLP. We also investigated the types of NLP tasks that have been supported by machine learning and how they can be applied in clinical practice. METHODS: Our methodology was based on the guidelines for performing systematic reviews. In August 2018, we used PubMed, a multifaceted interface, to perform a literature search against MEDLINE. We identified 110 relevant studies and extracted information about text data used to support machine learning, NLP tasks supported, and their clinical applications. The data properties considered included their size, provenance, collection methods, annotation, and any relevant statistics. RESULTS: The majority of datasets used to train machine learning models included only hundreds or thousands of documents. Only 10 studies used tens of thousands of documents, with a handful of studies utilizing more. Relatively small datasets were utilized for training even when much larger datasets were available. The main reason for such poor data utilization is the annotation bottleneck faced by supervised machine learning algorithms. Active learning was explored to iteratively sample a subset of data for manual annotation as a strategy for minimizing the annotation effort while maximizing the predictive performance of the model.
Supervised learning was successfully used where clinical codes integrated with free-text notes into electronic health records were utilized as class labels. Similarly, distant supervision was used to utilize an existing knowledge base to automatically annotate raw text. Where manual annotation was unavoidable, crowdsourcing was explored, but it remains unsuitable because of the sensitive nature of data considered. Besides the small volume, training data were typically sourced from a small number of institutions, thus offering no hard evidence about the transferability of machine learning models. The majority of studies focused on text classification. Most commonly, the classification results were used to support phenotyping, prognosis, care improvement, resource management, and surveillance. CONCLUSIONS: We identified the data annotation bottleneck as one of the key obstacles to machine learning approaches in clinical NLP. Active learning and distant supervision were explored as a way of saving the annotation efforts. Future research in this field would benefit from alternatives such as data augmentation and transfer learning, or unsupervised learning, which do not require data annotation.
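
ILLUSTRATION (not part of the published abstract): The active learning strategy mentioned in the abstract can be sketched roughly as follows. This is a minimal, hypothetical example of pool-based uncertainty sampling and is not taken from any of the reviewed studies. The function names, the toy logistic model, and the `oracle` callback (a stand-in for a human annotator) are all assumptions made for illustration.

```python
import math


def predict_proba(weights, bias, x):
    """Toy logistic model: probability that feature vector x is positive."""
    z = bias + sum(w * xi for w, xi in zip(weights, x))
    z = max(-30.0, min(30.0, z))  # clamp to keep math.exp well behaved
    return 1.0 / (1.0 + math.exp(-z))


def train(samples, labels, epochs=200, lr=0.5):
    """Fit the logistic model with plain stochastic gradient descent."""
    weights, bias = [0.0] * len(samples[0]), 0.0
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            err = predict_proba(weights, bias, x) - y
            weights = [w - lr * err * xi for w, xi in zip(weights, x)]
            bias -= lr * err
    return weights, bias


def most_uncertain(weights, bias, pool):
    """Index of the pool item whose predicted probability is closest to 0.5."""
    return min(
        range(len(pool)),
        key=lambda i: abs(predict_proba(weights, bias, pool[i]) - 0.5),
    )


def active_learning(pool, oracle, seed_idx, budget):
    """Query the oracle for at most `budget` items beyond the seed set.

    `oracle` is a hypothetical callback standing in for manual annotation.
    Returns the final model plus the labelled samples and their labels.
    """
    seed = set(seed_idx)
    labelled_x = [pool[i] for i in seed_idx]
    labelled_y = [oracle(x) for x in labelled_x]
    unlabelled = [x for i, x in enumerate(pool) if i not in seed]
    for _ in range(budget):
        if not unlabelled:
            break
        weights, bias = train(labelled_x, labelled_y)
        # Annotate only the item the current model is least sure about.
        x = unlabelled.pop(most_uncertain(weights, bias, unlabelled))
        labelled_x.append(x)
        labelled_y.append(oracle(x))
    return train(labelled_x, labelled_y), labelled_x, labelled_y


# Toy usage: 21 one-dimensional "documents"; the oracle labels x > 0.5 as 1.
toy_pool = [[i / 20] for i in range(21)]
toy_oracle = lambda x: 1 if x[0] > 0.5 else 0
model, xs, ys = active_learning(toy_pool, toy_oracle, seed_idx=[0, 20], budget=5)
```

In this sketch only 7 of the 21 items are ever labelled, which mirrors the idea of spending a limited annotation budget on the most informative examples. Real clinical text would need feature extraction and a stronger model than the toy one assumed here.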

SUBMITTER: Spasic I

PROVIDER: S-EPMC7157505 | biostudies-literature | 2020 Mar

REPOSITORIES: biostudies-literature


Publications

Clinical Text Data in Machine Learning: Systematic Review.

Spasic Irena, Nenadic Goran

JMIR Medical Informatics, 2020-03-31, issue 3


PMID: 32229465

Similar Datasets

Project description: Background: Text-based digital media platforms have revolutionized communication and information sharing, providing valuable access to knowledge and understanding in the fields of mental health and suicide prevention. Objective: This systematic review aimed to determine how machine learning and data analysis can be applied to text-based digital media data to understand mental health and aid suicide prevention. Methods: A systematic review of research papers from the following major electronic databases was conducted: Web of Science, MEDLINE, Embase (via MEDLINE), and PsycINFO (via MEDLINE). The database search was supplemented by a hand search using Google Scholar. Results: Overall, 19 studies were included, with five major themes as to how data analysis and machine learning techniques could be applied: (1) as predictors of personal mental health, (2) to understand how personal mental health and suicidal behavior are communicated, (3) to detect mental disorders and suicidal risk, (4) to identify help seeking for mental health difficulties, and (5) to determine the efficacy of interventions to support mental well-being. Conclusions: Our findings show that data analysis and machine learning can be used to gain valuable insights, such as the following: web-based conversations relating to depression vary among different ethnic groups, teenagers engage in a web-based conversation about suicide more often than adults, and people seeking support in web-based mental health communities feel better after receiving online support. Digital tools and mental health apps are being used successfully to manage mental health, particularly through the COVID-19 epidemic, during which analysis has revealed that there was increased anxiety and depression, and web-based communities played a part in reducing isolation during the pandemic.
Predictive analytics were also shown to have potential, and virtual reality shows promising results in the delivery of preventive or curative care. Future research efforts could center on optimizing algorithms to enhance the potential of text-based digital media analysis in mental health and suicide prevention. In addressing depression, a crucial step involves identifying the factors that contribute to happiness and using machine learning to forecast these sources of happiness. This could extend to understanding how various activities result in improved happiness across different socioeconomic groups. Using insights gathered from such data analysis and machine learning, there is an opportunity to craft digital interventions, such as chatbots, designed to provide support and address mental health challenges and suicide prevention.

Project description: Background: Timely identification of patients at a high risk of clinical deterioration is key to prioritizing care, allocating resources effectively, and preventing adverse outcomes. Vital signs-based, aggregate-weighted early warning systems are commonly used to predict the risk of outcomes related to cardiorespiratory instability and sepsis, which are strong predictors of poor outcomes and mortality. Machine learning models, which can incorporate trends and capture relationships among parameters that aggregate-weighted models cannot, have recently been showing promising results. Objective: This study aimed to identify, summarize, and evaluate the available research, current state of utility, and challenges with machine learning-based early warning systems using vital signs to predict the risk of physiological deterioration in acutely ill patients, across acute and ambulatory care settings. Methods: PubMed, CINAHL, Cochrane Library, Web of Science, Embase, and Google Scholar were searched for peer-reviewed, original studies with keywords related to "vital signs," "clinical deterioration," and "machine learning." Included studies used patient vital signs along with demographics and described a machine learning model for predicting an outcome in acute and ambulatory care settings. Data were extracted following PRISMA, TRIPOD, and Cochrane Collaboration guidelines. Results: We identified 24 peer-reviewed studies from 417 articles for inclusion; 23 studies were retrospective, while 1 was prospective in nature. Care settings included general wards, intensive care units, emergency departments, step-down units, medical assessment units, postanesthetic wards, and home care. Machine learning models including logistic regression, tree-based methods, kernel-based methods, and neural networks were most commonly used to predict the risk of deterioration.
The area under the curve for models ranged from 0.57 to 0.97. Conclusions: In studies that compared performance, reported results suggest that machine learning-based early warning systems can achieve greater accuracy than aggregate-weighted early warning systems but several areas for further research were identified. While these models have the potential to provide clinical decision support, there is a need for standardized outcome measures to allow for rigorous evaluation of performance across models. Further research needs to address the interpretability of model outputs by clinicians, clinical efficacy of these systems through prospective study design, and their potential impact in different clinical settings.
