Dataset Information

Accurate Diabetes Risk Stratification Using Machine Learning: Role of Missing Value and Outliers.

ABSTRACT: Diabetes mellitus is a group of metabolic diseases in which blood sugar levels are too high. About 8.8% of the world was diabetic in 2017, and this is projected to reach nearly 10% by 2045. The major challenge is that machine learning-based classifiers perform poorly when applied for risk stratification to datasets that contain missing values or outliers. Our objective is therefore to develop an optimized and robust machine learning (ML) system under the hypothesis that replacing missing values and outliers with computed medians will improve risk stratification accuracy. The ML-based risk stratification system is designed, optimized and evaluated as follows: features are extracted and optimized using six feature selection techniques (random forest, logistic regression, mutual information, principal component analysis, analysis of variance, and Fisher discriminant ratio) and combined with ten different classifiers (linear discriminant analysis, quadratic discriminant analysis, naïve Bayes, Gaussian process classification, support vector machine, artificial neural network, Adaboost, logistic regression, decision tree, and random forest). The Pima Indian diabetic dataset (768 patients: 268 diabetic and 500 controls) was used. Our results demonstrate that replacing missing values with group medians and outliers with median values, and then combining random forest feature selection with random forest classification, yields an accuracy, sensitivity, specificity, positive predictive value, negative predictive value and area under the curve of 92.26%, 95.96%, 79.72%, 91.14%, 91.20%, and 0.93, respectively. This is an improvement of 10% over previously developed techniques published in the literature. The system was validated for its stability and reliability. The RF-based model showed the best performance when outliers were replaced by median values.
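The two replacement steps described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: "group median" is taken to mean the median within each outcome class, missing entries are represented as None, and outliers are flagged with the 1.5 x IQR rule, which the abstract does not specify. The function names and the toy numbers are hypothetical and are not drawn from the Pima data.

```python
from statistics import median, quantiles


def impute_group_median(values, labels, is_missing=lambda v: v is None):
    """Replace each missing entry with the median of the observed
    values that share its class label (assumed meaning of 'group median')."""
    medians = {}
    for group in set(labels):
        observed = [v for v, g in zip(values, labels)
                    if g == group and not is_missing(v)]
        medians[group] = median(observed)
    return [medians[g] if is_missing(v) else v
            for v, g in zip(values, labels)]


def replace_outliers_with_median(values, k=1.5):
    """Replace values outside [Q1 - k*IQR, Q3 + k*IQR] with the column median.
    The IQR criterion is an assumption; the abstract names no outlier rule."""
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    column_median = median(values)
    return [column_median if (v < low or v > high) else v for v in values]


# Toy feature column (hypothetical values) with two missing entries
# and one implausibly large reading.
glucose = [85, None, 90, 95, 150, None, 160, 155, 600]
outcome = [0, 0, 0, 0, 1, 1, 1, 1, 1]  # 0 = control, 1 = diabetic

filled = impute_group_median(glucose, outcome)
cleaned = replace_outliers_with_median(filled)
print(filled)
print(cleaned)
```

Class-wise imputation uses the outcome label, so in a real evaluation the medians would need to be computed on the training folds only; otherwise information from the test labels leaks into the features.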

SUBMITTER: Maniruzzaman M

PROVIDER: S-EPMC5893681 | biostudies-other | 2018 Apr

REPOSITORIES: biostudies-other

Publications

Accurate Diabetes Risk Stratification Using Machine Learning: Role of Missing Value and Outliers.

Maniruzzaman Md, Rahman Md Jahanur, Al-MehediHasan Md, Suri Harman S, Abedin Md Menhazul, El-Baz Ayman, Suri Jasjit S

Journal of Medical Systems, 2018-04-10, issue 5

PMID: 29637403

Similar Datasets

Project description: Background: Papillary thyroid cancer (PTC) is one of the most common endocrine malignancies, with different risk levels. However, preoperative risk assessment of PTC remains a challenge worldwide. Here, we first report a Preoperative Risk Assessment Classifier for PTC (PRAC-PTC) based on multidimensional features including clinical indicators, immune indices, genetic features, and proteomics. Materials and methods: The 558 patients collected from June 2013 to November 2020 were allocated to three groups: a discovery set (274 patients, 274 FFPE), a retrospective test set (166 patients, 166 FFPE) and a prospective test set (118 patients, 118 FNA). Proteomic profiling was conducted on formalin-fixed paraffin-embedded (FFPE) and fine-needle aspiration (FNA) tissues from the patients. Preoperative clinical information and blood immunological indices were collected. The BRAFV600E mutation was detected by the amplification refractory mutation system (ARMS). Results: We developed a machine learning model of 17 variables based on multidimensional features of 274 PTC patients from a retrospective cohort. The PRAC-PTC achieved an area under the curve (AUC) of 0.925 in the discovery set and was validated externally by blinded analyses in a retrospective cohort of 166 PTC patients (0.787 AUC) and a prospective cohort of 118 PTC patients (0.799 AUC) from two independent clinical centres. Meanwhile, the preoperative risk prediction of clinicians improved with the assistance of PRAC-PTC, with accuracies reaching 84.4% (95% CI 82.9-84.4) and 83.5% (95% CI 82.2-84.2) in the retrospective and prospective test sets, respectively. Conclusion: This study demonstrated that the PRAC-PTC, which integrates clinical data, gene mutation information, immune indices, high-throughput proteomics and machine learning technology in multi-centre retrospective and prospective clinical cohorts, can effectively stratify the preoperative risk of PTC and may decrease unnecessary surgery or overtreatment.

Project description: BACKGROUND AND PURPOSE: MR imaging-based modeling of tumor cell density can substantially improve targeted treatment of glioblastoma. Unfortunately, interpatient variability limits the predictive ability of many modeling approaches. We present a transfer learning method that generates individualized patient models, grounded in the wealth of population data, while also detecting and adjusting for interpatient variabilities based on each patient's own histologic data. MATERIALS AND METHODS: We recruited patients with primary glioblastoma undergoing image-guided biopsies and preoperative imaging, including contrast-enhanced MR imaging, dynamic susceptibility contrast MR imaging, and diffusion tensor imaging. We calculated relative cerebral blood volume from DSC-MR imaging and mean diffusivity and fractional anisotropy from DTI. Following image coregistration, we assessed tumor cell density for each biopsy and identified corresponding localized MR imaging measurements. We then explored a range of univariate and multivariate predictive models of tumor cell density based on MR imaging measurements in a generalized one-model-fits-all approach. We then implemented both univariate and multivariate individualized transfer learning predictive models, which harness the available population-level data but allow individual variability in their predictions. Finally, we compared Pearson correlation coefficients and mean absolute error between the individualized transfer learning and generalized one-model-fits-all models. RESULTS: Tumor cell density significantly correlated with relative CBV (r = 0.33, P < .001), and T1-weighted postcontrast (r = 0.36, P < .001) on univariate analysis after correcting for multiple comparisons. With single-variable modeling (using relative CBV), transfer learning increased predictive performance (r = 0.53, mean absolute error = 15.19%) compared with one-model-fits-all (r = 0.27, mean absolute error = 17.79%). With multivariate modeling, transfer learning further improved performance (r = 0.88, mean absolute error = 5.66%) compared with one-model-fits-all (r = 0.39, mean absolute error = 16.55%). CONCLUSIONS: Transfer learning significantly improves predictive modeling performance for quantifying tumor cell density in glioblastoma.
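The two comparison metrics named in this description, the Pearson correlation coefficient and the mean absolute error, can be computed as in the sketch below. It only illustrates the metrics: the observed and predicted values are invented toy numbers, and the names model_a and model_b are hypothetical stand-ins for any two competing models, not the study's transfer learning or one-model-fits-all models.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between observed and predicted values."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Toy data (hypothetical): observed values and predictions from two models.
observed = [10.0, 20.0, 30.0, 40.0]
model_a = [12.0, 18.0, 33.0, 39.0]
model_b = [25.0, 25.0, 25.0, 30.0]

for name, predicted in (("model_a", model_a), ("model_b", model_b)):
    r = pearson_r(observed, predicted)
    mae = mean_absolute_error(observed, predicted)
    print(f"{name}: r = {r:.2f}, MAE = {mae:.2f}")
```

A higher r together with a lower mean absolute error indicates the better-fitting model, which is the direction of the comparison reported above.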

Project description: Background: The impending scale-up of noncommunicable disease screening programs in low- and middle-income countries, coupled with limited health resources, requires that such programs be as accurate as possible at identifying patients at high risk. Objective: The aim of this study was to develop machine learning-based risk stratification algorithms for diabetes and hypertension that are tailored for the at-risk population served by community-based screening programs in low-resource settings. Methods: We trained and tested our models by using data from 2278 patients collected by community health workers through door-to-door and camp-based screenings in the urban slums of Hyderabad, India between July 14, 2015 and April 21, 2018. We determined the best models for predicting short-term (2-month) risk of diabetes and hypertension (a model for diabetes and a model for hypertension) and compared these models to previously developed risk scores from the United States and the United Kingdom by using prediction accuracy as characterized by the area under the receiver operating characteristic curve (AUC) and the number of false negatives. Results: We found that models based on random forest had the highest prediction accuracy for both diseases and were able to outperform the US and UK risk scores in terms of AUC by 35.5% for diabetes (improvement of 0.239 from 0.671 to 0.910) and 13.5% for hypertension (improvement of 0.094 from 0.698 to 0.792). For a fixed screening specificity of 0.9, the random forest model was able to reduce the expected number of false negatives by 620 patients per 1000 screenings for diabetes and 220 patients per 1000 screenings for hypertension. This improvement reduces the cost of incorrect risk stratification by US $1.99 (or 35%) per screening for diabetes and US $1.60 (or 21%) per screening for hypertension. Conclusions: In the next decade, health systems in many countries are planning to spend significant resources on noncommunicable disease screening programs, and our study demonstrates that machine learning models can be leveraged by these programs to effectively utilize limited resources by improving risk stratification.
