Project description:BackgroundCOPD is a leading cause of mortality.Research questionWe hypothesized that applying machine learning to clinical and quantitative CT imaging features would improve mortality prediction in COPD.Study design and methodsWe selected 30 clinical, spirometric, and imaging features as inputs for a random survival forest. We used top features in a Cox regression to create a machine learning mortality prediction (MLMP) in COPD model and also assessed the performance of other statistical and machine learning models. We trained the models in subjects with moderate to severe COPD from a subset of subjects in Genetic Epidemiology of COPD (COPDGene) and tested prediction performance in the remainder of individuals with moderate to severe COPD in COPDGene and Evaluation of COPD Longitudinally to Identify Predictive Surrogate Endpoints (ECLIPSE). We compared our model with the BMI, airflow obstruction, dyspnea, exercise capacity (BODE) index; BODE modifications; and the age, dyspnea, and airflow obstruction index.ResultsWe included 2,632 participants from COPDGene and 1,268 participants from ECLIPSE. The top predictors of mortality were 6-min walk distance, FEV1 % predicted, and age. The top imaging predictor was pulmonary artery-to-aorta ratio. The MLMP-COPD model resulted in a C index ≥ 0.7 in both COPDGene and ECLIPSE (6.4- and 7.2-year median follow-ups, respectively), significantly better than all tested mortality indexes (P < .05). The MLMP-COPD model had fewer predictors but similar performance to that of other models. The group with the highest BODE scores (7-10) had 64% mortality, whereas the highest mortality group defined by the MLMP-COPD model had 77% mortality (P = .012).InterpretationAn MLMP-COPD model outperformed four existing models for predicting all-cause mortality across two COPD cohorts. Performance of machine learning was similar to that of traditional statistical methods. The model is available online at: https://cdnm.shinyapps.io/cgmortalityapp/.
Project description:Gene expression profiles were generated from 199 primary breast cancer patients. Samples 1-176 were used in another study, GEO Series GSE22820, and form the training data set in this study. Sample numbers 200-222 form a validation set. This data is used to model a machine learning classifier for Estrogen Receptor Status. RNA was isolated from 199 primary breast cancer patients. A machine learning classifier was built to predict ER status using only three gene features.
Project description:Protein-protein interactions (PPI) control most of the biological processes in a living cell. In order to fully understand protein functions, a knowledge of protein-protein interactions is necessary. Prediction of PPI is challenging, especially when the three-dimensional structure of interacting partners is not known. Recently, a novel prediction method was proposed by exploiting physical interactions of constituent domains. We propose here a novel knowledge-based prediction method, namely PPI_SVM, which predicts interactions between two protein sequences by exploiting their domain information. We trained a two-class support vector machine on the benchmarking set of pairs of interacting proteins extracted from the Database of Interacting Proteins (DIP). The method considers all possible combinations of constituent domains between two protein sequences, unlike most of the existing approaches. Moreover, it deals with both single-domain proteins and multi domain proteins; therefore it can be applied to the whole proteome in high-throughput studies. Our machine learning classifier, following a brainstorming approach, achieves accuracy of 86%, with specificity of 95%, and sensitivity of 75%, which are better results than most previous methods that sacrifice recall values in order to boost the overall precision. Our method has on average better sensitivity combined with good selectivity on the benchmarking dataset. The PPI_SVM source code, train/test datasets and supplementary files are available freely in the public domain at: http://code.google.com/p/cmater-bioinfo/.
Project description:Background: Pediatric myocarditis is a rare disease. The etiologies are multiple. Mortality associated with the disease is 5-8%. Prognostic factors were identified with the use of national hospitalization databases. Applying these identified risk factors for mortality prediction has not been reported. Methods: We used the Kids' Inpatient Database for this project. We manually curated fourteen variables as predictors of mortality based on the current knowledge of the disease, and compared performance of mortality prediction between linear regression models and a machine learning (ML) model. For ML, the random forest algorithm was chosen because of the categorical nature of the variables. Based on variable importance scores, a reduced model was also developed for comparison. Results: We identified 4,144 patients from the database for randomization into the primary (for model development) and testing (for external validation) datasets. We found that the conventional logistic regression model had low sensitivity (~50%) despite high specificity (>95%) or overall accuracy. On the other hand, the ML model struck a good balance between sensitivity (89.9%) and specificity (85.8%). The reduced ML model with top five variables (mechanical ventilation, cardiac arrest, ECMO, acute kidney injury, ventricular fibrillation) were sufficient to approximate the prediction performance of the full model. Conclusions: The ML algorithm performs superiorly when compared to the linear regression model for mortality prediction in pediatric myocarditis in this retrospective dataset. Prospective studies are warranted to further validate the applicability of our model in clinical settings.
Project description:BACKGROUND:Prognostic modelling using standard methods is well-established, particularly for predicting risk of single diseases. Machine-learning may offer potential to explore outcomes of even greater complexity, such as premature death. This study aimed to develop novel prediction algorithms using machine-learning, in addition to standard survival modelling, to predict premature all-cause mortality. METHODS:A prospective population cohort of 502,628 participants aged 40-69 years were recruited to the UK Biobank from 2006-2010 and followed-up until 2016. Participants were assessed on a range of demographic, biometric, clinical and lifestyle factors. Mortality data by ICD-10 were obtained from linkage to Office of National Statistics. Models were developed using deep learning, random forest and Cox regression. Calibration was assessed by comparing observed to predicted risks; and discrimination by area under the 'receiver operating curve' (AUC). FINDINGS:14,418 deaths (2.9%) occurred over a total follow-up time of 3,508,454 person-years. A simple age and gender Cox model was the least predictive (AUC 0.689, 95% CI 0.681-0.699). A multivariate Cox regression model significantly improved discrimination by 6.2% (AUC 0.751, 95% CI 0.748-0.767). The application of machine-learning algorithms further improved discrimination by 3.2% using random forest (AUC 0.783, 95% CI 0.776-0.791) and 3.9% using deep learning (AUC 0.790, 95% CI 0.783-0.797). These ML algorithms improved discrimination by 9.4% and 10.1% respectively from a simple age and gender Cox regression model. Random forest and deep learning achieved similar levels of discrimination with no significant difference. Machine-learning algorithms were well-calibrated, while Cox regression models consistently over-predicted risk. CONCLUSIONS:Machine-learning significantly improved accuracy of prediction of premature all-cause mortality in this middle-aged population, compared to standard methods. This study illustrates the value of machine-learning for risk prediction within a traditional epidemiological study design, and how this approach might be reported to assist scientific verification.
Project description:Our aim was to create simple and largely scalable machine learning-based algorithms that could predict mortality in a real-time fashion during intensive care after traumatic brain injury. We performed an observational multicenter study including adult TBI patients that were monitored for intracranial pressure (ICP) for at least 24?h in three ICUs. We used machine learning-based logistic regression modeling to create two algorithms (based on ICP, mean arterial pressure [MAP], cerebral perfusion pressure [CPP] and Glasgow Coma Scale [GCS]) to predict 30-day mortality. We used a stratified cross-validation technique for internal validation. Of 472 included patients, 92 patients (19%) died within 30 days. Following cross-validation, the ICP-MAP-CPP algorithm's area under the receiver operating characteristic curve (AUC) increased from 0.67 (95% confidence interval [CI] 0.60-0.74) on day 1 to 0.81 (95% CI 0.75-0.87) on day 5. The ICP-MAP-CPP-GCS algorithm's AUC increased from 0.72 (95% CI 0.64-0.78) on day 1 to 0.84 (95% CI 0.78-0.90) on day 5. Algorithm misclassification was seen among patients undergoing decompressive craniectomy. In conclusion, we present a new concept of dynamic prognostication for patients with TBI treated in the ICU. Our simple algorithms, based on only three and four main variables, discriminated between survivors and non-survivors with accuracies up to 81% and 84%. These open-sourced simple algorithms can likely be further developed, also in low and middle-income countries.
Project description:In this study, we aimed to develop and validate a machine learning-based mortality prediction model for hospitalized heat-related illness patients. After 2393 hospitalized patients were extracted from a multicentered heat-related illness registry in Japan, subjects were divided into the training set for development (n = 1516, data from 2014, 2017-2019) and the test set (n = 877, data from 2020) for validation. Twenty-four variables including characteristics of patients, vital signs, and laboratory test data at hospital arrival were trained as predictor features for machine learning. The outcome was death during hospital stay. In validation, the developed machine learning models (logistic regression, support vector machine, random forest, XGBoost) demonstrated favorable performance for outcome prediction with significantly increased values of the area under the precision-recall curve (AUPR) of 0.415 [95% confidence interval (CI) 0.336-0.494], 0.395 [CI 0.318-0.472], 0.426 [CI 0.346-0.506], and 0.528 [CI 0.442-0.614], respectively, compared to that of the conventional acute physiology and chronic health evaluation (APACHE)-II score of 0.287 [CI 0.222-0.351] as a reference standard. The area under the receiver operating characteristic curve (AUROC) values were also high over 0.92 in all models, although there were no statistical differences compared to APACHE-II. This is the first demonstration of the potential of machine learning-based mortality prediction models for heat-related illnesses.
Project description:ObjectiveTo determine the effects of using unstructured clinical text in machine learning (ML) for prediction, early detection, and identification of sepsis.Materials and methodsPubMed, Scopus, ACM DL, dblp, and IEEE Xplore databases were searched. Articles utilizing clinical text for ML or natural language processing (NLP) to detect, identify, recognize, diagnose, or predict the onset, development, progress, or prognosis of systemic inflammatory response syndrome, sepsis, severe sepsis, or septic shock were included. Sepsis definition, dataset, types of data, ML models, NLP techniques, and evaluation metrics were extracted.ResultsThe clinical text used in models include narrative notes written by nurses, physicians, and specialists in varying situations. This is often combined with common structured data such as demographics, vital signs, laboratory data, and medications. Area under the receiver operating characteristic curve (AUC) comparison of ML methods showed that utilizing both text and structured data predicts sepsis earlier and more accurately than structured data alone. No meta-analysis was performed because of incomparable measurements among the 9 included studies.DiscussionStudies focused on sepsis identification or early detection before onset; no studies used patient histories beyond the current episode of care to predict sepsis. Sepsis definition affects reporting methods, outcomes, and results. Many methods rely on continuous vital sign measurements in intensive care, making them not easily transferable to general ward units.ConclusionsApproaches were heterogeneous, but studies showed that utilizing both unstructured text and structured data in ML can improve identification and early detection of sepsis.
Project description:BackgroundRecent decreases in neonatal mortality have been slower than expected for most countries. This study aims to predict the risk of neonatal mortality using only data routinely available from birth records in the largest city of the Americas.MethodsA probabilistic linkage of every birth record occurring in the municipality of São Paulo, Brazil, between 2012 e 2017 was performed with the death records from 2012 to 2018 (1,202,843 births and 447,687 deaths), and a total of 7282 neonatal deaths were identified (a neonatal mortality rate of 6.46 per 1000 live births). Births from 2012 and 2016 (N = 941,308; or 83.44% of the total) were used to train five different machine learning algorithms, while births occurring in 2017 (N = 186,854; or 16.56% of the total) were used to test their predictive performance on new unseen data.ResultsThe best performance was obtained by the extreme gradient boosting trees (XGBoost) algorithm, with a very high AUC of 0.97 and F1-score of 0.55. The 5% births with the highest predicted risk of neonatal death included more than 90% of the actual neonatal deaths. On the other hand, there were no deaths among the 5% births with the lowest predicted risk. There were no significant differences in predictive performance for vulnerable subgroups. The use of a smaller number of variables (WHO's five minimum perinatal indicators) decreased overall performance but the results still remained high (AUC of 0.91). With the addition of only three more variables, we achieved the same predictive performance (AUC of 0.97) as using all the 23 variables originally available from the Brazilian birth records.ConclusionMachine learning algorithms were able to identify with very high predictive performance the neonatal mortality risk of newborns using only routinely collected data.
Project description:Gene expression profiles were generated from 199 primary breast cancer patients. Samples 1-176 were used in another study, GEO Series GSE22820, and form the training data set in this study. Sample numbers 200-222 form a validation set. This data is used to model a machine learning classifier for Estrogen Receptor Status.