Dataset Information

The Optimal Machine Learning-Based Missing Data Imputation for the Cox Proportional Hazard Model.

ABSTRACT: Adequate imputation of missing data preserves statistical power and helps avoid erroneous conclusions. In the era of big data, machine learning is a useful tool for inferring missing values. The root mean square error (RMSE) and the proportion of falsely classified entries (PFC) are two standard statistics for evaluating imputation accuracy. However, the behavior of the Cox proportional hazards model under various types of imputation requires careful study, and its validity under different missing mechanisms is unknown. In this research, we propose supervised and unsupervised imputations and examine four machine learning-based imputation strategies. We conducted a simulation study under various scenarios, varying parameters such as sample size, missing rate, and missing mechanism. The results report the type-I error rates of the different imputation techniques for survival data. The simulation results show that the non-parametric "missForest" based on unsupervised imputation is the only robust method without inflated type-I errors under all missing mechanisms. In contrast, the other methods do not yield valid tests when the missing pattern is informative. Improperly conducted statistical analysis of data with missing values may lead to erroneous conclusions. This research provides a clear guideline for valid survival analysis using the Cox proportional hazards model with machine learning-based imputations.
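The two accuracy statistics named in the abstract can be illustrated with a minimal sketch. It assumes the common definitions: RMSE computed over continuous entries that were imputed, and PFC as the share of categorical entries imputed with the wrong category. The function names and toy values are hypothetical and do not come from the paper; some imputation packages report a normalized variant of RMSE instead.

```python
import math


def rmse(true_values, imputed_values):
    """Root mean square error between true and imputed continuous entries."""
    if len(true_values) != len(imputed_values) or not true_values:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(t - i) ** 2 for t, i in zip(true_values, imputed_values)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def pfc(true_labels, imputed_labels):
    """Proportion of falsely classified entries among imputed categorical entries."""
    if len(true_labels) != len(imputed_labels) or not true_labels:
        raise ValueError("inputs must be non-empty and of equal length")
    wrong = sum(1 for t, i in zip(true_labels, imputed_labels) if t != i)
    return wrong / len(true_labels)


# Hypothetical toy data: only the entries that were masked and then imputed.
true_continuous = [1.0, 2.0, 3.0, 4.0]
imputed_continuous = [1.0, 2.0, 3.0, 6.0]
true_categorical = ["a", "b", "a", "b"]
imputed_categorical = ["a", "b", "b", "b"]

rmse_value = rmse(true_continuous, imputed_continuous)
pfc_value = pfc(true_categorical, imputed_categorical)
print(rmse_value, pfc_value)
```

Lower values of either statistic indicate imputations closer to the true entries. As the abstract points out, good imputation accuracy alone does not establish that a downstream Cox model test is valid.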

SUBMITTER: Guo CY

PROVIDER: S-EPMC8289437 | biostudies-literature | 2021

REPOSITORIES: biostudies-literature


Publications

The Optimal Machine Learning-Based Missing Data Imputation for the Cox Proportional Hazard Model.

Guo Chao-Yu, Yang Ying-Chen, Chen Yi-Hau

Frontiers in Public Health, 2021-07-05


PMID: 34291028

Similar Datasets

Project description:

Purpose: Signet ring cell carcinoma (SRCC) is a rare type of lung cancer. The conventional survival nomogram used to predict lung cancer performs poorly for SRCC. Therefore, a novel nomogram specifically for studying SRCC is highly required.

Methods: Baseline characteristics of lung signet ring cell carcinoma were obtained from the Surveillance, Epidemiology, and End Results (SEER) database. Univariate and multivariate Cox regression and random forest analysis were performed on the training group data, respectively. Subsequently, we compared results from these two types of analyses. A nomogram model was developed to predict 1-year, 3-year, and 5-year overall survival (OS) for patients, and receiver operating characteristic (ROC) curves and calibration curves were used to assess the prediction accuracy. Decision curve analysis (DCA) was used to assess the clinical applicability of the proposed model. For treatment modalities, Kaplan-Meier curves were adopted to analyze condition-specific effects.

Results: We obtained 731 patients diagnosed with lung signet ring cell carcinoma (LSRCC) in the SEER database and randomized the patients into a training group (551) and a validation group (220) with a ratio of 7:3. Eight factors including age, primary site, T, N, and M stage, surgery, chemotherapy, and radiation were included in the nomogram analysis. Results suggested that treatment methods (like surgery, chemotherapy, and radiation) and T-stage factors had significant prognostic effects. The results of ROC curves, calibration curves, and DCA in the training and validation groups demonstrated that the nomogram we constructed could precisely predict survival and prognosis in LSRCC patients. Through deep verification, we found the constructed model had a high C-index, indicating that the model had a strong predictive power. Further, we found that all surgical interventions had good effects on OS and cancer-specific survival (CSS). The survival curves showed a relatively favorable prognosis for T0 patients overall, regardless of the treatment modality.

Conclusions: Our nomogram is demonstrated to be clinically beneficial for the prognosis of LSRCC patients. The surgical intervention was successful regardless of the tumor stage, and the Cox proportional hazard (CPH) model had better performance than the machine learning model in terms of effectiveness.
