Dataset Information

Development and Validation of a Machine Learning Model Using Administrative Health Data to Predict Onset of Type 2 Diabetes.

ABSTRACT:

Importance

Systems-level barriers to diabetes care could be improved with population health planning tools that accurately discriminate between high- and low-risk groups to guide investments and targeted interventions.

Objective

To develop and validate a population-level machine learning model for predicting type 2 diabetes 5 years before diabetes onset using administrative health data.

Design, setting, and participants

This decision analytical model study used linked administrative health data from the diverse, single-payer health system in Ontario, Canada, between January 1, 2006, and December 31, 2016. A gradient boosting decision tree model was trained on data from 1 657 395 patients, validated on 243 442 patients, and tested on 236 506 patients. Costs associated with each patient were estimated using a validated costing algorithm. Data were analyzed from January 1, 2006, to December 31, 2016.

Exposures

A random sample of 2 137 343 residents of Ontario without type 2 diabetes was obtained at study start time. More than 300 features from data sets capturing demographic information, laboratory measurements, drug benefits, health care system interactions, social determinants of health, and ambulatory care and hospitalization records were compiled over 2-year patient medical histories to generate quarterly predictions.

Main outcomes and measures

Discrimination was assessed using the area under the receiver operating characteristic curve statistic, and calibration was assessed visually using calibration plots. Feature contribution was assessed with Shapley values. Costs were estimated in 2020 US dollars.
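The two evaluation measures named above can be illustrated with a minimal sketch in plain Python. This is not the authors' evaluation code: the function names `roc_auc` and `calibration_bins`, the equal-size binning scheme, and the toy inputs are all assumptions made for illustration.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via the rank-sum (Mann-Whitney) identity.

    Equals the probability that a randomly chosen positive case receives a
    higher score than a randomly chosen negative case (ties count as 0.5).
    """
    pairs = sorted(zip(scores, labels))
    n = len(pairs)
    ranks = [0.0] * n
    i = 0
    while i < n:
        # Find the run of tied scores and give each member the average rank.
        j = i
        while j + 1 < n and pairs[j + 1][0] == pairs[i][0]:
            j += 1
        average_rank = (i + j) / 2 + 1  # ranks are 1-based
        for k in range(i, j + 1):
            ranks[k] = average_rank
        i = j + 1
    n_pos = sum(1 for _, y in pairs if y == 1)
    n_neg = n - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("AUC needs at least one positive and one negative case")
    rank_sum_pos = sum(r for r, (_, y) in zip(ranks, pairs) if y == 1)
    return (rank_sum_pos - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


def calibration_bins(labels, scores, n_bins):
    """Return (mean predicted risk, observed event rate) per equal-size bin.

    Plotting these pairs against the diagonal gives a calibration plot.
    The binning scheme here is a hypothetical choice, not taken from the study.
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    bins = []
    for b in range(n_bins):
        idx = order[b * len(order) // n_bins:(b + 1) * len(order) // n_bins]
        if not idx:
            continue
        mean_predicted = sum(scores[i] for i in idx) / len(idx)
        observed_rate = sum(labels[i] for i in idx) / len(idx)
        bins.append((mean_predicted, observed_rate))
    return bins


# Toy data, invented for illustration only.
toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]
print(roc_auc(toy_labels, toy_scores))
print(calibration_bins(toy_labels, toy_scores, n_bins=2))
```

A well-calibrated model yields bins whose observed event rate is close to the mean predicted risk.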

Results

This study trained a gradient boosting decision tree model on data from 1 657 395 patients (12 900 257 instances; 6 666 662 women [51.7%]). The developed model achieved a test area under the curve of 80.26 (range, 80.21-80.29), demonstrated good calibration, and was robust to sex, immigration status, area-level marginalization with regard to material deprivation and race/ethnicity, and low contact with the health care system. The top 5% of patients predicted as high risk by the model represented 26% of the total annual diabetes cost in Ontario.
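The cost concentration reported above (the top 5% of predicted-risk patients accounting for 26% of cost) is the kind of figure that could be computed as in the sketch below. It assumes one predicted risk and one cost estimate per patient. The function name `top_risk_cost_share` and the toy numbers are hypothetical, and the study's validated costing algorithm is not reproduced here.

```python
def top_risk_cost_share(risks, costs, fraction=0.05):
    """Share of total cost attributable to the highest-risk `fraction` of patients."""
    if len(risks) != len(costs) or not risks:
        raise ValueError("risks and costs must be non-empty and the same length")
    n = len(risks)
    k = max(1, round(n * fraction))  # number of patients in the top group
    # Rank patients from highest to lowest predicted risk.
    order = sorted(range(n), key=lambda i: risks[i], reverse=True)
    top_cost = sum(costs[i] for i in order[:k])
    return top_cost / sum(costs)


# Toy cohort of 20 patients, invented for illustration only:
# the single highest-risk patient costs 6 units, everyone else 1 unit.
toy_risks = [i / 20 for i in range(20)]
toy_costs = [1.0] * 19 + [6.0]
print(top_risk_cost_share(toy_risks, toy_costs))
```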

Conclusions and relevance

In this decision analytical model study, a machine learning model approach accurately predicted the incidence of diabetes in the population using routinely collected health administrative data. These results suggest that the model could be used to inform decision-making for population health planning and diabetes prevention.

SUBMITTER: Ravaut M

PROVIDER: S-EPMC8150694 | biostudies-literature | 2021 May

REPOSITORIES: biostudies-literature


Publications

Development and Validation of a Machine Learning Model Using Administrative Health Data to Predict Onset of Type 2 Diabetes.

Ravaut M, Harish V, Sadeghi H, Leung KK, Volkovs M, Kornas K, Watson T, Poutanen T, Rosella LC

JAMA Network Open, May 3, 2021, issue 5


PMID: 34032855

Similar Datasets

Project description:

Background: Echocardiography (echo) based machine learning (ML) models may be useful in identifying patients at high-risk of all-cause mortality.

Methods: We developed ML models (ResNet deep learning using echo videos and CatBoost gradient boosting using echo measurements) to predict 1-year, 3-year, and 5-year mortality. Models were trained on the Mackay dataset, Taiwan (6083 echos, 3626 patients) and validated in the Alberta HEART dataset, Canada (997 echos, 595 patients). We examined the performance of the models overall, and in subgroups (healthy controls, at risk of heart failure (HF), HF with reduced ejection fraction (HFrEF) and HF with preserved ejection fraction (HFpEF)). We compared the models' performance to the MAGGIC risk score, and examined the correlation between the models' predicted probability of death and baseline quality of life as measured by the Kansas City Cardiomyopathy Questionnaire (KCCQ).

Findings: Mortality rates at 1-, 3- and 5-years were 14.9%, 28.6%, and 42.5% in the Mackay cohort, and 3.0%, 10.3%, and 18.7%, in the Alberta HEART cohort. The ResNet and CatBoost models achieved area under the receiver-operating curve (AUROC) between 85% and 92% in internal validation. In external validation, the AUROCs for the ResNet (82%, 82%, and 78%) were significantly better than CatBoost (78%, 73%, and 75%), for 1-, 3- and 5-year mortality prediction respectively, with better or comparable performance to the MAGGIC score. ResNet models predicted higher probability of death in the HFpEF and HFrEF (30%-50%) subgroups than in controls and at risk patients (5%-20%). The predicted probabilities of death correlated with KCCQ scores (all p < 0.05).

Interpretation: Echo-based ML models to predict mortality had good internal and external validity, were generalizable, correlated with patients' quality of life, and are comparable to an established HF risk score. These models can be leveraged for automated risk stratification at point-of-care.

Funding: Funding for Alberta HEART was provided by an Alberta Innovates - Health Solutions Interdisciplinary Team Grant no. AHFMRITG 200801018. P.K. holds a Canadian Institutes of Health Research (CIHR) Sex and Gender Science Chair and a Heart & Stroke Foundation Chair in Cardiovascular Research. A.V. and V.S. received funding from the Mitacs Globalink Research Internship.

Project description:

Background: Health coaching is an emerging intervention that has been shown to improve clinical and patient-relevant outcomes for type 2 diabetes. Advances in artificial intelligence may provide an avenue for developing a more personalized, adaptive, and cost-effective approach to diabetes health coaching.

Objective: We aim to apply Q-learning, a widely used reinforcement learning algorithm, to a diabetes health-coaching data set to develop a model for recommending an optimal coaching intervention at each decision point that is tailored to a patient's accumulated history.

Methods: In this pilot study, we fit a two-stage reinforcement learning model on 177 patients from the intervention arm of a community-based randomized controlled trial conducted in Canada. The policy produced by the reinforcement learning model can recommend a coaching intervention at each decision point that is tailored to a patient's accumulated history and is expected to maximize the composite clinical outcome of hemoglobin A1c reduction and quality of life improvement (normalized to [0, 1], with a higher score being better). Our data, models, and source code are publicly available.

Results: Among the 177 patients, the coaching intervention recommended by our policy mirrored the observed diabetes health coach's interventions in 17.5% (n=31) of the patients in stage 1 and 14.1% (n=25) of the patients in stage 2. Where there was agreement in both stages, the average cumulative composite outcome (0.839, 95% CI 0.460-1.220) was better than those for whom the optimal policy agreed with the diabetes health coach in only one stage (0.791, 95% CI 0.747-0.836) or differed in both stages (0.755, 95% CI 0.728-0.781). Additionally, the average cumulative composite outcome predicted for the policy's recommendations was significantly better than that of the observed diabetes health coach's recommendations (t_{n-1}=10.040; P<.001).

Conclusions: Applying reinforcement learning to diabetes health coaching could allow for both the automation of health coaching and an improvement in health outcomes produced by this type of intervention.

Project description:

Background: Although prior research has identified multiple risk factors for diabetic ketoacidosis (DKA), clinicians continue to lack clinic-ready models to predict dangerous and costly episodes of DKA. We asked whether we could apply deep learning, specifically the use of a long short-term memory (LSTM) model, to accurately predict the 180-day risk of DKA-related hospitalization for youth with type 1 diabetes (T1D).

Objective: We aimed to describe the development of an LSTM model to predict the 180-day risk of DKA-related hospitalization for youth with T1D.

Methods: We used 17 consecutive calendar quarters of clinical data (January 10, 2016, to March 18, 2020) for 1745 youths aged 8 to 18 years with T1D from a pediatric diabetes clinic network in the Midwestern United States. The input data included demographics, discrete clinical observations (laboratory results, vital signs, anthropometric measures, diagnosis, and procedure codes), medications, visit counts by type of encounter, number of historic DKA episodes, number of days since last DKA admission, patient-reported outcomes (answers to clinic intake questions), and data features derived from diabetes- and nondiabetes-related clinical notes via natural language processing. We trained the model using input data from quarters 1 to 7 (n=1377), validated it using input from quarters 3 to 9 in a partial out-of-sample (OOS-P; n=1505) cohort, and further validated it in a full out-of-sample (OOS-F; n=354) cohort with input from quarters 10 to 15.

Results: DKA admissions occurred at a rate of 5% per 180-days in both out-of-sample cohorts. In the OOS-P and OOS-F cohorts, the median age was 13.7 (IQR 11.3-15.8) years and 13.1 (IQR 10.7-15.5) years; median glycated hemoglobin levels at enrollment were 8.6% (IQR 7.6%-9.8%) and 8.1% (IQR 6.9%-9.5%); recall was 33% (26/80) and 50% (9/18) for the top-ranked 5% of youth with T1D; and 14.15% (213/1505) and 12.7% (45/354) had prior DKA admissions (after the T1D diagnosis), respectively. For lists rank ordered by the probability of hospitalization, precision increased from 33% to 56% to 100% for positions 1 to 80, 1 to 25, and 1 to 10 in the OOS-P cohort and from 50% to 60% to 80% for positions 1 to 18, 1 to 10, and 1 to 5 in the OOS-F cohort, respectively.

Conclusions: The proposed LSTM model for predicting 180-day DKA-related hospitalization was valid in this sample. Future research should evaluate model validity in multiple populations and settings to account for health inequities that may be present in different segments of the population (eg, racially or socioeconomically diverse cohorts). Rank ordering youth by probability of DKA-related hospitalization will allow clinics to identify the most at-risk youth. The clinical implication of this is that clinics may then create and evaluate novel preventive interventions based on available resources.
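The rank-ordered precision and recall figures in this description follow a standard "at k" calculation, which might be sketched as below. The function name `precision_recall_at_k` and the toy inputs are assumptions for illustration, and this is not the cited study's code.

```python
def precision_recall_at_k(labels, scores, k):
    """Precision and recall among the k highest-scored cases.

    labels: 1 if the event (e.g. hospitalization) occurred, else 0.
    scores: predicted probabilities used to rank the cases.
    """
    total_positives = sum(labels)
    if k <= 0 or k > len(labels) or total_positives == 0:
        raise ValueError("k must be in 1..len(labels) and labels need a positive")
    # Rank cases from highest to lowest predicted probability.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    hits = sum(labels[i] for i in order[:k])
    return hits / k, hits / total_positives


# Toy data, invented for illustration only.
toy_labels = [1, 0, 1, 0, 0]
toy_scores = [0.9, 0.8, 0.7, 0.2, 0.1]
for k in (1, 2, 3):
    print(k, precision_recall_at_k(toy_labels, toy_scores, k))
```

Shortening the list (smaller k) usually trades recall for precision, which matches the pattern the description reports.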
