Dataset Information

Development of a novel machine learning model to predict presence of nonalcoholic steatohepatitis.

ABSTRACT:

Objective

To develop a computer model to predict patients with nonalcoholic steatohepatitis (NASH) using machine learning (ML).

Materials and methods

This retrospective study utilized two databases: a) the National Institute of Diabetes and Digestive and Kidney Diseases (NIDDK) nonalcoholic fatty liver disease (NAFLD) adult database (2004-2009), and b) the Optum® de-identified Electronic Health Record dataset (2007-2018), a real-world dataset representative of common electronic health records in the United States. We developed an ML model to predict NASH, using confirmed NASH and non-NASH based on liver histology results in the NIDDK dataset to train the model.

Results

Models were trained and tested on NIDDK NAFLD data (704 patients), and the best-performing models were evaluated on Optum data (~3,000,000 patients). An eXtreme Gradient Boosting (XGBoost) model using 14 features exhibited high performance in predicting NASH, as measured by area under the curve (AUC; 0.82), sensitivity (81%), and precision (81%). Slightly reduced performance was observed with an abbreviated feature set of 5 variables (0.79, 80%, and 80%, respectively). The full model demonstrated good performance (AUC 0.76) in predicting NASH in Optum data.
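
To illustrate the three metrics reported in these results, the following minimal Python sketch computes sensitivity, precision, and area under the ROC curve from binary labels and predicted scores. It is not the NASHmap code: the function names, the 0.5 decision threshold, and the toy data are assumptions made for this example only.

```python
from typing import Sequence, Tuple


def sensitivity_and_precision(
    y_true: Sequence[int], y_pred: Sequence[int]
) -> Tuple[float, float]:
    """Return (sensitivity, precision) for binary labels, where 1 = NASH."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    return sensitivity, precision


def roc_auc(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the probability that a random positive case outscores a
    random negative case, with ties counted as one half."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative case")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy data (hypothetical, not taken from the NIDDK or Optum datasets).
y_true = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.7, 0.4, 0.6, 0.3, 0.2]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed 0.5 threshold

sens, prec = sensitivity_and_precision(y_true, y_pred)
auc = roc_auc(y_true, scores)
print(f"sensitivity={sens:.2f} precision={prec:.2f} AUC={auc:.2f}")
```

The pairwise AUC computation is quadratic in the number of cases, which is fine for a small illustration; a rank-based implementation would be preferable for a dataset of the size of the Optum cohort.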

Discussion

The proposed model, named NASHmap, is the first ML model developed with confirmed NASH and non-NASH cases as determined through liver biopsy and validated on a large, real-world patient dataset. Both the 14- and 5-feature versions exhibit high performance.

Conclusion

The NASHmap model is a convenient and high-performing tool that could be used to identify patients likely to have NASH in clinical settings, allowing better patient management and optimal allocation of clinical resources.

SUBMITTER: Docherty M

PROVIDER: S-EPMC8200272 | biostudies-literature

REPOSITORIES: biostudies-literature

Similar Datasets

Project description: Background: Nonalcoholic steatohepatitis (NASH) results from complex liver conditions involving metabolic, inflammatory, and fibrogenic processes. Despite its burden, no therapy has yet been approved by the Food and Drug Administration. Purpose: Using machine learning (ML) algorithms, the study aims to identify reliable candidate genes that accurately predict treatment response in a NASH animal model, using biochemical and molecular markers retrieved with bioinformatics techniques. Methods: NASH-induced rat models were administered various microbiome-targeted therapies and herbal drugs for 12 weeks; these drugs reduced hepatic lipid accumulation, liver inflammation, and histopathological changes. The ML model was trained and tested on the histopathological NASH score (HPS), with an HPS of 0-4 considered improved NASH and 5-8 considered non-improved, as confirmed by histopathological examination of the rats' livers. The model incorporates 34 features: 20 molecular markers (mRNAs, microRNAs, and long non-coding RNAs) and 14 biochemical markers that are highly enriched in NASH pathogenesis. Six different ML models were evaluated for predicting NASH improvement, with Gradient Boosting demonstrating the highest accuracy (98%) in predicting NASH drug response. Findings: After a gradual reduction in features, the Random Forest classifier performed best, with an accuracy of 98.4%. The principal selected molecular features included YAP1, LATS1, NF2, SRD5A3-AS1, FOXA2, TEAD2, miR-650, MMP14, ITGB1, and miR-6881-5P, while the biochemical markers comprised triglycerides (TG), ALT, ALP, total bilirubin (T. Bilirubin), alpha-fetoprotein (AFP), and low-density lipoprotein cholesterol (LDL-C). Conclusion: This study introduced an ML model incorporating 16 noninvasive features, including molecular and biochemical signatures, which achieved high performance and accuracy in detecting NASH improvement. This model could potentially be used as a diagnostic tool and to identify target therapies.
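
The binary outcome described in this entry (an HPS of 0-4 counted as improved, 5-8 as non-improved) can be sketched as a small labeling function. This is only an illustration of the stated cut-off; the function name and the label strings are hypothetical and do not come from the study.

```python
def label_from_hps(hps: int) -> str:
    """Map a histopathological NASH score (HPS, 0-8) to a binary label.

    Uses the cut-off quoted in the abstract: 0-4 is improved, 5-8 is
    non-improved. The label strings are an assumption for this sketch.
    """
    if not 0 <= hps <= 8:
        raise ValueError(f"HPS must be between 0 and 8, got {hps}")
    return "improved" if hps <= 4 else "non-improved"


# Labels for every possible score, showing where the boundary falls.
labels = {score: label_from_hps(score) for score in range(9)}
print(labels)
```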

Project description: Importance: Machine-learning algorithms offer better predictive accuracy than traditional prognostic models but are too complex and opaque for clinical use. Objective: To compare different machine learning methods in predicting overall mortality in cirrhosis and to use machine learning to select easily scored clinical variables for a novel cirrhosis prognostic model. Design, setting, and participants: This prognostic study used a retrospective cohort of adult patients with cirrhosis or its complications seen in 130 hospitals and affiliated ambulatory clinics in the integrated, national Veterans Affairs health care system from October 1, 2011, to September 30, 2015. Patients were followed up through December 31, 2018. Data were analyzed from October 1, 2017, to May 31, 2020. Exposures: Potential predictors included demographic characteristics; liver disease etiology, severity, and complications; use of health care resources; comorbid conditions; and comprehensive laboratory and medication data. Patients were randomly selected for model development (66.7%) and validation (33.3%). Three different statistical and machine learning methods were evaluated: gradient descent boosting, logistic regression with least absolute shrinkage and selection operator (LASSO) regularization, and logistic regression with LASSO constrained to select no more than 10 predictors (partial path model). Predictor inclusion and model performance were evaluated in a 5-fold cross-validation. Last, the predictors identified in the most parsimonious (partial path) model were refit using maximum-likelihood estimation (Cirrhosis Mortality Model [CiMM]), and its predictive performance was compared with that of the widely used Model for End Stage Liver Disease with sodium (MELD-Na) score. Main outcomes and measures: All-cause mortality. Results: Of the 107 939 patients with cirrhosis (mean [SD] age, 62.7 [9.6] years; 96.6% male; 66.3% white, 18.4% African American), the annual mortality rate ranged from 8.8% to 15.3%. In total, 32.7% of patients died within 3 years, and 46.2% died within 5 years after the index date. Models predicting 1-year mortality had good discrimination for the gradient descent boosting (area under the receiver operating characteristics curve [AUC], 0.81; 95% CI, 0.80-0.82), logistic regression with LASSO regularization (AUC, 0.78; 95% CI, 0.77-0.79), and the partial path logistic model (AUC, 0.78; 95% CI, 0.76-0.78). All models showed good calibration. The final CiMM model with machine learning-derived clinical variables offered significantly better discrimination than the MELD-Na score, with AUCs of 0.78 (95% CI, 0.77-0.79) vs 0.67 (95% CI, 0.66-0.68) for 1-year mortality, respectively (DeLong z = 17.00; P < .001). Conclusions and relevance: In this study, simple machine learning techniques performed as well as the more advanced ensemble gradient boosting. Using the clinical variables identified from simple machine learning in a cirrhosis mortality model produced a new score more transparent than machine learning and more predictive than the MELD-Na score.

Project description: Background: Volume overload is a common complication encountered in hospitalized patients, and the mainstay of therapy is diuresis. Unfortunately, the diuretic response in some individuals is inadequate despite a typical dose of loop diuretics, a phenomenon called diuretic resistance. An accurate prediction model that predicts diuretic resistance using predosing variables could inform the right diuretic dose for a prospective patient. Methods: Two large, deidentified, publicly available, and independent intensive care unit (ICU) databases from the United States were used: the Medical Information Mart for Intensive Care III (MIMIC) and the Philips eICU databases. Loop diuretic resistance was defined as <1400 ml of urine per 40 mg of diuretic dose in 24 hours. Using 24-hour windows throughout admission, commonly accessible variables were obtained and incorporated into the model. Data imputation was performed using a highly accurate machine learning method. Using XGBoost, several models were created using train and test datasets from the eICU database. These were then combined into an ensemble model optimized for increased specificity and then externally validated on the MIMIC database. Results: The final ensemble model was composed of four separate models, each using 21 commonly available variables. The ensemble model outperformed individual models during validation. Higher serum creatinine, lower systolic blood pressure, lower serum chloride, higher age, and female sex were the most important predictors of diuretic resistance (in that order). The specificity of the model on external validation was 92%, yielding a positive likelihood ratio of 3.46 while maintaining overall discrimination (C-statistic 0.69). Conclusions: A diuretic resistance prediction model was created using machine learning and was externally validated in ICU populations. The model is easy to use, would provide actionable information at the bedside, and would be ready for implementation in existing electronic medical records. This study also provides a framework for the development of future machine learning models.
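
The resistance definition quoted in this entry (<1400 ml of urine per 40 mg of diuretic dose in 24 hours) and the positive likelihood ratio can both be written as short formulas. The Python sketch below is an illustration under stated assumptions: the function and parameter names are hypothetical, the dose is assumed to be expressed in the same units the study used for its 40 mg reference, and the positive likelihood ratio uses the standard definition, sensitivity / (1 - specificity).

```python
def urine_per_40mg(urine_ml_24h: float, dose_mg_24h: float) -> float:
    """Normalize 24-hour urine output to ml per 40 mg of loop diuretic."""
    if dose_mg_24h <= 0:
        raise ValueError("dose must be positive")
    return urine_ml_24h * 40.0 / dose_mg_24h


def is_diuretic_resistant(
    urine_ml_24h: float, dose_mg_24h: float, threshold_ml: float = 1400.0
) -> bool:
    """Apply the quoted definition: below 1400 ml per 40 mg in 24 hours."""
    return urine_per_40mg(urine_ml_24h, dose_mg_24h) < threshold_ml


def positive_likelihood_ratio(sensitivity: float, specificity: float) -> float:
    """Standard LR+ = sensitivity / (1 - specificity)."""
    if not 0.0 <= sensitivity <= 1.0 or not 0.0 <= specificity < 1.0:
        raise ValueError("sensitivity must be in [0, 1], specificity in [0, 1)")
    return sensitivity / (1.0 - specificity)


# Hypothetical patients, not drawn from the eICU or MIMIC databases.
patient_a = is_diuretic_resistant(urine_ml_24h=2000, dose_mg_24h=80)  # 1000 ml per 40 mg
patient_b = is_diuretic_resistant(urine_ml_24h=3000, dose_mg_24h=40)  # 3000 ml per 40 mg
print(patient_a, patient_b)
```

Because the likelihood ratio divides by 1 - specificity, a model tuned for high specificity can reach a useful LR+ even at modest sensitivity, which is consistent with the entry's description of an ensemble optimized for increased specificity.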

Project description: Background: Postoperative sepsis is one of the main causes of mortality after liver transplantation (LT). Our study aimed to develop and validate a predictive model for postoperative sepsis within 7 d in LT recipients using machine learning (ML) technology. Methods: Data from 786 patients who received LT from January 2015 to January 2020 were retrospectively extracted from the big data platform of the Third Affiliated Hospital of Sun Yat-sen University. Seven ML models were developed to predict postoperative sepsis. Model performance was evaluated by the area under the receiver-operating curve (AUC), sensitivity, specificity, accuracy, and F1-score. The model with the best performance was validated in an independent dataset involving 118 adult LT cases from February 2020 to April 2021. The postoperative sepsis-associated outcomes were also explored in the study. Results: After excluding 109 patients according to the exclusion criteria, 677 patients who underwent LT were included in the final analysis. Among them, 216 (31.9%) were diagnosed with sepsis after LT, which was associated with more perioperative complications, a longer postoperative hospital stay, and increased mortality after LT (all p < .05). Our results revealed that larger volumes of red blood cell infusion, ascitic removal, blood loss, and gastric drainage, smaller volumes of crystalloid infusion and urine, longer anesthesia time, and a higher preoperative TBIL level were the top 8 important variables contributing to the prediction of post-LT sepsis. The Random Forest Classifier (RF) model showed the best overall performance in predicting sepsis after LT among the seven ML models developed in the study, with an AUC of 0.731, an accuracy of 71.6%, a sensitivity of 62.1%, and a specificity of 76.1% in the internal validation set, and a comparable AUC of 0.755 in the external validation set. Conclusions: Our study used eight pre- and intra-operative variables to develop an RF-based predictive model of post-LT sepsis to assist clinical decision-making.