Dataset Information

Machine learning for phenotyping opioid overdose events.

ABSTRACT:

Objective

To develop machine learning models for classifying the severity of opioid overdose events from clinical data.

Materials and methods

Opioid overdoses were identified by diagnosis codes from the Marshfield Clinic population and assigned a severity score via chart review to form a gold standard set of labels. Three primary feature sets were constructed from disparate data sources surrounding each event and used to train machine learning models for phenotyping.

Results

Random forest and penalized logistic regression models performed best, with cross-validated mean areas under the ROC curve (AUCs) across all severity classes of 0.893 and 0.882, respectively. Features derived from a common data model outperformed features collected from disparate data sources for the same cohort of patients (AUCs 0.893 versus 0.837, p value = 0.002). Adding features extracted from free text to the machine learning models also increased AUCs from 0.827 to 0.893 (p value < 0.0001). Keyword features extracted using natural language processing (NLP), such as 'Narcan' and 'Endotracheal Tube', are important for classifying overdose event severity.
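The "mean AUC for all severity classes" can be illustrated with a minimal sketch. It assumes a one-vs-rest scheme with an unweighted average over classes, which the abstract does not confirm. The class names, the scores, and the helper names `binary_auc` and `mean_ovr_auc` are hypothetical and not taken from the study.

```python
def binary_auc(labels, scores):
    """AUC as the probability that a random positive outranks a random negative.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def mean_ovr_auc(y_true, class_scores):
    """Unweighted mean of one-vs-rest AUCs (an assumed averaging scheme).

    class_scores maps each class name to one predicted score per example.
    """
    per_class = {}
    for cls, scores in class_scores.items():
        labels = [1 if y == cls else 0 for y in y_true]
        per_class[cls] = binary_auc(labels, scores)
    return sum(per_class.values()) / len(per_class), per_class


# Hypothetical toy data: three made-up severity classes, six events.
y_true = ["mild", "mild", "moderate", "severe", "moderate", "severe"]
class_scores = {
    "mild":     [0.8, 0.6, 0.3, 0.1, 0.7, 0.2],
    "moderate": [0.1, 0.3, 0.6, 0.2, 0.2, 0.3],
    "severe":   [0.1, 0.1, 0.1, 0.7, 0.1, 0.5],
}
macro_auc, per_class = mean_ovr_auc(y_true, class_scores)
print(per_class, round(macro_auc, 3))
```

In a cross-validated setting the same quantity would be computed on each held-out fold and then averaged across folds.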

Conclusion

Random forest models using features derived from a common data model and free text can be effective for classifying opioid overdose events.

SUBMITTER: Badger J

PROVIDER: S-EPMC6622451 | biostudies-literature | 2019 Jun

REPOSITORIES: biostudies-literature

Publications

Machine learning for phenotyping opioid overdose events.

Badger, Jonathan; LaRose, Eric; Mayer, John; Bashiri, Fereshteh; Page, David; Peissig, Peggy

Journal of Biomedical Informatics, 2019-04-25

PMID: 31028874

Similar Datasets

Project description:
Importance: Current approaches to identifying individuals at high risk for opioid overdose target many patients who are not truly at high risk.
Objective: To develop and validate a machine-learning algorithm to predict opioid overdose risk among Medicare beneficiaries with at least 1 opioid prescription.
Design, setting, and participants: A prognostic study was conducted between September 1, 2017, and December 31, 2018. Participants (n = 560 057) included fee-for-service Medicare beneficiaries without cancer who filled 1 or more opioid prescriptions from January 1, 2011, to December 31, 2015. Beneficiaries were randomly and equally divided into training, testing, and validation samples.
Exposures: Potential predictors (n = 268), including sociodemographics, health status, patterns of opioid use, and practitioner-level and regional-level factors, were measured in 3-month windows, starting 3 months before initiating opioids until loss of follow-up or the end of observation.
Main outcomes and measures: Opioid overdose episodes from inpatient and emergency department claims were identified. Multivariate logistic regression (MLR), least absolute shrinkage and selection operator-type regression (LASSO), random forest (RF), gradient boosting machine (GBM), and deep neural network (DNN) were applied to predict overdose risk in the subsequent 3 months after initiation of treatment with prescription opioids. Prediction performance was assessed using the C statistic and other metrics (eg, sensitivity, specificity, and number needed to evaluate [NNE] to identify one overdose). The Youden index was used to identify the optimized threshold of predicted score that balanced sensitivity and specificity.
Results: Beneficiaries in the training (n = 186 686), testing (n = 186 685), and validation (n = 186 686) samples had similar characteristics (mean [SD] age of 68.0 [14.5] years, and approximately 63% were female, 82% were white, 35% had disabilities, 41% were dual eligible, and 0.60% had at least 1 overdose episode). In the validation sample, the DNN (C statistic = 0.91; 95% CI, 0.88-0.93) and GBM (C statistic = 0.90; 95% CI, 0.87-0.94) algorithms outperformed the LASSO (C statistic = 0.84; 95% CI, 0.80-0.89), RF (C statistic = 0.80; 95% CI, 0.75-0.84), and MLR (C statistic = 0.75; 95% CI, 0.69-0.80) methods for predicting opioid overdose. At the optimized sensitivity and specificity, DNN had a sensitivity of 92.3%, specificity of 75.7%, NNE of 542, positive predictive value of 0.18%, and negative predictive value of 99.9%. The DNN classified patients into low-risk (76.2% [142 180] of the cohort), medium-risk (18.6% [34 579] of the cohort), and high-risk (5.2% [9747] of the cohort) subgroups, with only 1 in 10 000 in the low-risk subgroup having an overdose episode. More than 90% of overdose episodes occurred in the high-risk and medium-risk subgroups, although positive predictive values were low, given the rare overdose outcome.
Conclusions and relevance: Machine-learning algorithms appear to perform well for risk prediction and stratification of opioid overdose, especially in identifying low-risk subgroups that have minimal risk of overdose.
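The Youden index mentioned in this description is J = sensitivity + specificity - 1, and the chosen cut-off is the score that maximizes J. The following is a minimal sketch on made-up data; the function name `youden_threshold`, the labels, and the scores are hypothetical and say nothing about how the study implemented it.

```python
def youden_threshold(labels, scores):
    """Return (threshold, J, sensitivity, specificity) for the cut-off maximizing J.

    An example is predicted positive when its score is >= threshold.
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need both positive and negative examples")
    best = None
    for t in sorted(set(scores)):  # every observed score is a candidate cut-off
        tp = sum(1 for y, s in zip(labels, scores) if y == 1 and s >= t)
        tn = sum(1 for y, s in zip(labels, scores) if y == 0 and s < t)
        sens = tp / n_pos
        spec = tn / n_neg
        j = sens + spec - 1
        if best is None or j > best[1]:  # keep the first threshold reaching the max
            best = (t, j, sens, spec)
    return best


# Hypothetical toy data: 1 = overdose episode, 0 = no episode.
labels = [0, 0, 0, 0, 1, 0, 1, 1]
scores = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.9]
threshold, j, sens, spec = youden_threshold(labels, scores)
print(threshold, round(j, 3), sens, spec)
```

Because J weights sensitivity and specificity equally, a rare outcome can still yield a low positive predictive value at the chosen threshold, as the description reports.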

Project description: Advances in remote sensing combined with the emergence of sophisticated methods for large-scale data analytics from the field of data science provide new methods to model complex interactions in biological systems. Using a data-driven philosophy, insights from experts are used to corroborate the results generated through analytical models instead of leading the model design. Following such an approach, this study outlines the development and implementation of a whole-of-forest phenotyping system that incorporates spatial estimates of productivity across a large plantation forest. In large-scale plantation forestry, improving the productivity and consistency of future forests is an important but challenging goal due to the multiple interactions between biotic and abiotic factors, the long breeding cycle, and the high variability of growing conditions. Forest phenotypic expression is highly affected by the interaction of environmental conditions and forest management, but the understanding of these complex dynamics is incomplete. In this study, we collected an extensive set of 2.7 million observations composed of 62 variables describing climate, forest management, tree genetics, and fine-scale terrain information extracted from environmental surfaces, management records, and remotely sensed data. Using three machine learning methods, we compared models of forest productivity and evaluated the gain and Shapley values for interpreting the influence of categorical variables on the power of these methods to predict forest productivity at a landscape level. The most accurate model identified that the most important drivers of productivity were, in order of importance, genetics, environmental conditions, leaf area index, topology, and soil properties, thus describing the complex interactions of the forest. This approach demonstrates that new methods in remote sensing and data science enable powerful, landscape-level understanding of forest productivity. The phenotyping method developed here can be used to identify superior and inferior genotypes and estimate a productivity index for individual sites. This approach can improve tree breeding and deployment of the right genetics to the right site in order to increase the overall productivity across planted forests.

Project description:
Background: Integrating advanced machine-learning (ML) algorithms into clinical practice is challenging and requires interdisciplinary collaboration to develop transparent, interpretable, and ethically sound clinical decision support (CDS) tools. We aimed to design a ML-driven CDS tool to predict opioid overdose risk and gather feedback for its integration into the University of Florida Health (UFHealth) electronic health record (EHR) system.
Methods: We used user-centered design methods to integrate the ML algorithm into the EHR system. The backend and UI design sub-teams collaborated closely, both informed by user feedback sessions. We conducted seven user feedback sessions with five UF Health primary care physicians (PCPs) to explore aspects of CDS tools, including workflow, risk display, and risk mitigation strategies. After customizing the tool based on PCPs' feedback, we held two rounds of one-on-one usability testing sessions with 8 additional PCPs to gather feedback on prototype alerts. These sessions informed iterative UI design and backend processes, including alert frequency and reappearance circumstances.
Results: The backend process development identified needs and requirements from our team, information technology, UFHealth, and PCPs. Thirteen PCPs (male = 62%, White = 85%) participated across 7 user feedback sessions and 8 usability testing sessions. During the user feedback sessions, PCPs (n = 5) identified flaws such as the term "high risk" of overdose potentially leading to unintended consequences (e.g., immediate addiction services referrals), offered suggestions, and expressed trust in the tool. In the first usability testing session, PCPs (n = 4) emphasized the need for natural risk presentation (e.g., 1 in 200) and suggested displaying the alert multiple times yearly for at-risk patients. Another 4 PCPs in the second usability testing session valued the UFHealth-specific alert for managing new or unfamiliar patients, expressed concerns about PCPs' workload when prescribing to high-risk patients, and recommended incorporating the details page into training sessions to enhance usability.
Conclusions: The final backend process for our CDS alert aligns with PCP needs and UFHealth standards. Integrating feedback from PCPs in the early development phase of our ML-driven CDS tool helped identify barriers and facilitators in the CDS integration process. This collaborative approach yielded a refined prototype aimed at minimizing unintended consequences and enhancing usability.
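The "natural risk presentation (e.g., 1 in 200)" requested by the physicians amounts to expressing a predicted probability as a frequency. The sketch below shows one way this could be done. The helper `natural_frequency` is hypothetical and is not part of the tool described above.

```python
def natural_frequency(probability):
    """Format a probability as a '1 in N' natural frequency, rounding N to an integer."""
    if not 0 < probability <= 1:
        raise ValueError("probability must be in (0, 1]")
    n = round(1 / probability)
    return f"1 in {n:,}"


# Hypothetical predicted risks.
print(natural_frequency(0.005))   # 1 in 200
print(natural_frequency(0.0001))  # 1 in 10,000
```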

Project description:
Background: Little is known about whether machine-learning algorithms developed to predict opioid overdose using earlier years and from a single state will perform as well when applied to other populations. We aimed to develop a machine-learning algorithm to predict 3-month risk of opioid overdose using Pennsylvania Medicaid data and to externally validate it in two data sources (ie, later years of Pennsylvania Medicaid data and data from a different state).
Methods: This prognostic modelling study developed and validated a machine-learning algorithm to predict overdose in Medicaid beneficiaries with one or more opioid prescription in Pennsylvania and Arizona, USA. To predict risk of hospital or emergency department visits for overdose in the subsequent 3 months, we measured 284 potential predictors from pharmaceutical and health-care encounter claims data in 3-month periods, starting 3 months before the first opioid prescription and continuing until loss to follow-up or study end. We developed and internally validated a gradient-boosting machine algorithm to predict overdose using 2013-16 Pennsylvania Medicaid data (n=639 693). We externally validated the model using (1) 2017-18 Pennsylvania Medicaid data (n=318 585) and (2) 2015-17 Arizona Medicaid data (n=391 959). We reported several prediction performance metrics (eg, C-statistic, positive predictive value). Beneficiaries were stratified into risk-score subgroups to support clinical use.
Findings: A total of 8641 (1·35%) 2013-16 Pennsylvania Medicaid beneficiaries, 2705 (0·85%) 2017-18 Pennsylvania Medicaid beneficiaries, and 2410 (0·61%) 2015-17 Arizona beneficiaries had one or more overdose during the study period. C-statistics for the algorithm predicting 3-month overdoses developed from the 2013-16 Pennsylvania training dataset and validated on the 2013-16 Pennsylvania internal validation dataset, 2017-18 Pennsylvania external validation dataset, and 2015-17 Arizona external validation dataset were 0·841 (95% CI 0·835-0·847), 0·828 (0·822-0·834), and 0·817 (0·807-0·826), respectively. In external validation datasets, 71 361 (22·4%) of 318 585 2017-18 Pennsylvania beneficiaries were in high-risk subgroups (positive predictive value of 0·38-4·08%; capturing 73% of overdoses in the subsequent 3 months) and 40 041 (10%) of 391 959 2015-17 Arizona beneficiaries were in high-risk subgroups (positive predictive value of 0·19-1·97%; capturing 55% of overdoses). Lower risk subgroups in both validation datasets had few individuals (≤0·2%) with an overdose.
Interpretation: A machine-learning algorithm predicting opioid overdose derived from Pennsylvania Medicaid data performed well in external validation with more recent Pennsylvania data and with Arizona Medicaid data. The algorithm might be valuable for overdose risk prediction and stratification in Medicaid beneficiaries.
Funding: National Institutes of Health, National Institute on Drug Abuse, National Institute on Aging.

Project description:
Objective: Electronic health records (EHR) offer medical and pharmacogenomics research unprecedented opportunities to identify and classify patients at risk. EHRs are collections of highly inter-dependent records that include biological, anatomical, physiological, and behavioral observations. They comprise a patient's clinical phenome, where each patient has thousands of date-stamped records distributed across many relational tables. Development of EHR computer-based phenotyping algorithms requires time and medical insight from clinical experts, who most often can only review a small patient subset representative of the total EHR records, to identify phenotype features. In this research we evaluate whether relational machine learning (ML) using inductive logic programming (ILP) can contribute to addressing these issues as a viable approach for EHR-based phenotyping.
Methods: Two relational learning ILP approaches and three well-known WEKA (Waikato Environment for Knowledge Analysis) implementations of non-relational approaches (PART, J48, and JRIP) were used to develop models for nine phenotypes. International Classification of Diseases, Ninth Revision (ICD-9) coded EHR data were used to select training cohorts for the development of each phenotypic model. Accuracy, precision, recall, F-Measure, and Area Under the Receiver Operating Characteristic (AUROC) curve statistics were measured for each phenotypic model based on independent manually verified test cohorts. A two-sided binomial distribution test (sign test) compared the five ML approaches across phenotypes for statistical significance.
Results: We developed an approach to automatically label training examples using ICD-9 diagnosis codes for the ML approaches being evaluated. Nine phenotypic models for each ML approach were evaluated, resulting in better overall model performance in AUROC using ILP when compared to PART (p=0.039), J48 (p=0.003) and JRIP (p=0.003).
Discussion: ILP has the potential to improve phenotyping by independently delivering clinically expert interpretable rules for phenotype definitions, or intuitive phenotypes to assist experts.
Conclusion: Relational learning using ILP offers a viable approach to EHR-driven phenotyping.
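A two-sided sign test of the kind named in this description compares two methods by counting, across phenotypes, how often one beats the other and asking how surprising that split is under a fair coin. The sketch below assumes ties are dropped before counting. The function name and the win/loss counts are hypothetical and are not taken from the study.

```python
from math import comb


def sign_test_p_value(wins, losses):
    """Two-sided sign test p value under the null that wins and losses are equally likely.

    Ties are assumed to have been dropped before counting.
    """
    n = wins + losses
    if n == 0:
        raise ValueError("need at least one non-tied comparison")
    k = min(wins, losses)
    # Probability of a split at least as lopsided as the one observed, in one tail.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Hypothetical counts: method A beats method B on 8 of 9 phenotypes.
p = sign_test_p_value(wins=8, losses=1)
print(round(p, 4))
```

With only nine paired comparisons the test has few attainable p values, so small differences in the win count move the result substantially.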

Project description: During the last decade, there has been rapid adoption of ground and aerial platforms with multiple sensors for phenotyping various biotic and abiotic stresses throughout the developmental stages of the crop plant. High throughput phenotyping (HTP) involves the application of these tools to phenotype the plants and can vary from ground-based imaging to aerial phenotyping to remote sensing. The adoption of these HTP tools aims to reduce the phenotyping bottleneck in breeding programs and to help increase the pace of genetic gain. More specifically, several root phenotyping tools are discussed for studying the plant's hidden half, an area long neglected. However, the use of these HTP technologies produces large datasets that make inference from them difficult. Machine learning and deep learning provide an alternative opportunity for extracting useful information and drawing conclusions. These are interdisciplinary approaches for data analysis using probability, statistics, classification, regression, decision theory, data visualization, and neural networks to relate information extracted with the phenotypes obtained. These techniques use feature extraction, identification, classification, and prediction criteria to identify pertinent data for use in plant breeding and pathology activities. This review focuses on the recent findings where machine learning and deep learning approaches have been used for plant stress phenotyping with data being collected using various HTP platforms. We have provided a comprehensive overview of different machine learning and deep learning tools available with their potential advantages and pitfalls. Overall, this review provides an avenue for studying various HTP platforms with particular emphasis on using the machine learning and deep learning tools for drawing legitimate conclusions. Finally, we propose the conceptual challenges being faced and provide insights on future perspectives for managing those issues.

Project description:
BACKGROUND: Timely data is key to effective public health responses to epidemics. Drug overdose deaths are identified in surveillance systems through ICD-10 codes present on death certificates. ICD-10 coding takes time, but free-text information is available on death certificates prior to ICD-10 coding. The objective of this study was to develop a machine learning method to classify free-text death certificates as drug overdoses to provide faster drug overdose mortality surveillance.
METHODS: Using 2017-2018 Kentucky death certificate data, free-text fields were tokenized and features were created from these tokens using natural language processing (NLP). Word, bigram, and trigram features were created as well as features indicating the part-of-speech of each word. These features were then used to train machine learning classifiers on 2017 data. The resulting models were tested on 2018 Kentucky data and compared to a simple rule-based classification approach. Documented code for this method is available for reuse and extensions: https://github.com/pjward5656/dcnlp.
RESULTS: The top scoring machine learning model achieved 0.96 positive predictive value (PPV) and 0.98 sensitivity for an F-score of 0.97 in identification of fatal drug overdoses on test data. This machine learning model achieved significantly higher performance for sensitivity (p<0.001) than the rule-based approach. Additional feature engineering may improve the model's prediction. This model can be deployed on death certificates as soon as the free-text is available, eliminating the time needed to code the death certificates.
CONCLUSION: Machine learning using natural language processing is a relatively new approach in the context of surveillance of health conditions. This method presents an accessible application of machine learning that improves the timeliness of drug overdose mortality surveillance. As such, it can be employed to inform public health responses to the drug overdose epidemic in near-real time as opposed to several weeks following events.
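The word, bigram, and trigram features described in the methods can be illustrated with a minimal sketch. It is not the authors' code, which is at the repository linked above. The tokenizer, the function names, and the example text are assumptions, and part-of-speech features are omitted.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens (an assumed tokenizer)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def ngram_features(text, max_n=3):
    """Count word n-grams for n = 1..max_n (words, bigrams, trigrams by default)."""
    tokens = tokenize(text)
    features = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            features[" ".join(tokens[i:i + n])] += 1
    return features


# Hypothetical free-text cause-of-death field.
example = "Acute fentanyl intoxication; accidental overdose"
features = ngram_features(example)
print(sorted(features))
```

Each distinct n-gram becomes one column of a sparse feature matrix on which a classifier can then be trained.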
