Dataset Information

Advanced methods for missing values imputation based on similarity learning.

ABSTRACT: Real-world data analysis with data mining techniques often encounters observations that contain missing values, and handling them is a main challenge in mining datasets. Missing values should be imputed to improve the accuracy and performance of data mining methods. Some existing techniques impute missing values with the k-nearest neighbors algorithm (kNN), but determining an appropriate k value can be challenging. Other existing imputation techniques are based on hard clustering algorithms. When records are not well separated, as in the case of missing data, hard clustering often provides a poor description. In general, imputation based on similar records is more accurate than imputation based on all records in the dataset, so improving the similarity among records can improve imputation performance. This paper proposes two imputation methods for numerical missing data. The first, called KI, is a hybrid method that combines k-nearest neighbors and iterative imputation. The best set of nearest neighbors for each incomplete record is found through record similarity using kNN. To improve the similarity, a suitable k value is estimated automatically. Iterative imputation is then used to impute the missing values of the incomplete records using the global correlation structure among the selected records. The second, called FCKI, is an enhanced hybrid method that extends KI. It integrates fuzzy c-means, k-nearest neighbors, and iterative imputation. Fuzzy c-means is chosen because it allows a record to belong to multiple clusters at the same time, which can further improve similarity. FCKI searches a cluster, instead of the whole dataset, to find the best k nearest neighbors, and so applies two levels of similarity to achieve higher imputation accuracy. The performance of the proposed techniques is assessed on fifteen datasets of different sizes with varying missing ratios for three types of missing data: MCAR, MAR, and MNAR. These missing data types are generated in this work. The proposed techniques are compared with other missing data imputation methods using three measures: the root mean square error (RMSE), the normalized root mean square error (NRMSE), and the mean absolute error (MAE). The results show that the proposed methods achieve better imputation accuracy and require significantly less time than other missing data imputation methods.
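
The general idea of imputing from similar records and scoring the result with RMSE, NRMSE, and MAE can be illustrated with a minimal sketch. This is not the authors' KI or FCKI implementation: k is fixed rather than estimated automatically, a single least-squares fit on the neighbors stands in for iterative imputation, and there is no fuzzy c-means step. The function names (`make_mcar`, `knn_regression_impute`, `imputation_errors`) are hypothetical, and normalizing the RMSE by the standard deviation of the true values is an assumption, since the abstract does not state the normalization used.

```python
import numpy as np

def make_mcar(X, ratio, rng):
    """Hide entries completely at random (MCAR); returns the data and the mask."""
    mask = rng.random(X.shape) < ratio
    # Keep at least one observed value per row so neighbors can be searched.
    for i in np.where(mask.all(axis=1))[0]:
        mask[i, rng.integers(X.shape[1])] = False
    X_miss = X.copy()
    X_miss[mask] = np.nan
    return X_miss, mask

def knn_regression_impute(X_miss, k=10):
    """Impute each incomplete record from its k most similar complete records."""
    X_imp = X_miss.copy()
    complete = X_miss[~np.isnan(X_miss).any(axis=1)]
    for i in np.where(np.isnan(X_miss).any(axis=1))[0]:
        miss = np.isnan(X_miss[i])
        obs = ~miss
        # Similarity: Euclidean distance on the features observed in record i.
        dist = np.linalg.norm(complete[:, obs] - X_miss[i, obs], axis=1)
        nbrs = complete[np.argsort(dist)[:k]]
        # Least-squares fit (with intercept) of missing features on observed ones,
        # using only the selected neighbors (assumption: one pass, not iterative).
        A = np.column_stack([np.ones(len(nbrs)), nbrs[:, obs]])
        coef, *_ = np.linalg.lstsq(A, nbrs[:, miss], rcond=None)
        X_imp[i, miss] = np.concatenate([[1.0], X_miss[i, obs]]) @ coef
    return X_imp

def imputation_errors(X_true, X_imp, mask):
    """RMSE, NRMSE, and MAE over the entries that were hidden."""
    err = X_imp[mask] - X_true[mask]
    rmse = np.sqrt(np.mean(err ** 2))
    mae = np.mean(np.abs(err))
    nrmse = rmse / np.std(X_true[mask])  # assumed normalization
    return rmse, nrmse, mae

# Synthetic, strongly correlated data (illustrative only, not from the paper).
rng = np.random.default_rng(0)
z = rng.normal(size=(300, 1))
X_true = z @ np.array([[1.0, -2.0, 0.5, 3.0]]) + 0.1 * rng.normal(size=(300, 4))
X_miss, mask = make_mcar(X_true, ratio=0.1, rng=rng)

X_imp = knn_regression_impute(X_miss, k=10)
rmse, nrmse, mae = imputation_errors(X_true, X_imp, mask)

# Baseline that uses the whole dataset: column-mean imputation.
mean_fill = np.where(mask, np.nanmean(X_miss, axis=0), X_true)
rmse_mean = imputation_errors(X_true, mean_fill, mask)[0]
print(f"neighbor-based RMSE={rmse:.3f}, NRMSE={nrmse:.3f}, MAE={mae:.3f}")
print(f"column-mean RMSE={rmse_mean:.3f}")
```

On data with a strong correlation structure such as this synthetic example, the neighbor-based imputation would be expected to give a lower RMSE than the column-mean baseline, which is the intuition the abstract states about similar records versus the entire dataset.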

SUBMITTER: Fouad KM

PROVIDER: S-EPMC8323724 | biostudies-literature | 2021

REPOSITORIES: biostudies-literature

Publications

Advanced methods for missing values imputation based on similarity learning.

Fouad, Khaled M.; Ismail, Mahmoud M.; Azar, Ahmad Taher; Arafa, Mona M.

PeerJ Computer Science, 2021-07-21

PMID: 34395861

Similar Datasets

Project description: Background: Missing data on tumour stage information is a common problem in population-based cancer registries. Statistical analyses on the level of tumour stage may be biased if no adequate method for handling missing data is applied. In order to determine a useful way to treat missing data on tumour stage, we examined different imputation models for multiple imputation with chained equations for analysing the stage-specific numbers of cases of malignant melanoma and female breast cancer. Methods: This analysis was based on the malignant melanoma data set and the female breast cancer data set of the cancer registry Schleswig-Holstein, Germany. The cases with complete tumour stage information were extracted and their stage information partly removed according to a MAR missingness pattern, resulting in five simulated data sets for each cancer entity. The missing tumour stage values were then treated with multiple imputation with chained equations, using polytomous regression, predictive mean matching, random forests and proportional sampling as imputation models. The estimated tumour stages, stage-specific numbers of cases and survival curves after multiple imputation were compared to the observed ones. Results: The amount of missing values for malignant melanoma was too high to estimate a reasonable number of cases for each UICC stage. However, multiple imputation of missing stage values led to stage-specific numbers of cases of T-stage for malignant melanoma as well as T- and UICC-stage for breast cancer close to the observed numbers of cases. The observed tumour stages on the individual level, the stage-specific numbers of cases and the observed survival curves were best met with polytomous regression or predictive mean matching but not with random forest or proportional sampling as imputation models. Conclusions: This limited simulation study indicates that multiple imputation with chained equations is an appropriate technique for dealing with missing information on tumour stage in population-based cancer registries, if the amount of unstaged cases is on a reasonable level.

Project description: Background: Data collected by an actigraphy device worn on the wrist or waist can provide objective measurements for studies related to physical activity; however, some data may contain intervals where values are missing. In previous studies, statistical methods have been applied to impute missing values on the basis of statistical assumptions. Deep learning algorithms, however, can learn features from the data without any such assumptions and may outperform previous approaches in imputation tasks. Objective: The aim of this study was to impute missing values in data using a deep learning approach. Methods: To develop an imputation model for missing values in accelerometer-based actigraphy data, a denoising convolutional autoencoder was adopted. We trained and tested our deep learning-based imputation model with the National Health and Nutrition Examination Survey data set and validated it with the external Korea National Health and Nutrition Examination Survey and the Korean Chronic Cerebrovascular Disease Oriented Biobank data sets, which consist of daily records measuring activity counts. The partial root mean square error and partial mean absolute error of the imputed intervals (partial RMSE and partial MAE, respectively) were calculated using our deep learning-based imputation model (zero-inflated denoising convolutional autoencoder) as well as using other approaches (mean imputation, zero-inflated Poisson regression, and Bayesian regression). Results: The zero-inflated denoising convolutional autoencoder exhibited a partial RMSE of 839.3 counts and partial MAE of 431.1 counts, whereas mean imputation achieved a partial RMSE of 1053.2 counts and partial MAE of 545.4 counts, the zero-inflated Poisson regression model achieved a partial RMSE of 1255.6 counts and partial MAE of 508.6 counts, and Bayesian regression achieved a partial RMSE of 924.5 counts and partial MAE of 605.8 counts. Conclusions: Our deep learning-based imputation model performed better than the other methods when imputing missing values in actigraphy data.

Project description: Background: Missing preadmission serum creatinine (SCr) values are a common obstacle to assess acute kidney injury (AKI) diagnosis and outcomes. The Kidney Disease Improving Global Outcomes (KDIGO) guidelines suggest using a SCr computed from the Modification of Diet in Renal Disease (MDRD) with an estimated glomerular filtration rate of 75 ml/min/1.73 m². We aimed to identify the best surrogate method for baseline SCr to assess AKI diagnosis and outcomes. Methods: We compared the use of 1) first SCr at hospital admission, 2) minimal SCr over 2 weeks after intensive care unit admission, 3) MDRD computed SCr, and 4) Chronic Kidney Disease Epidemiology Collaboration (CKD-EPI) computed SCr to assess AKI diagnosis and outcomes. We then performed multilinear regression models to predict preadmission SCr and imputation strategies to assess AKI diagnosis. Results: Our one-year retrospective cohort study included 1001 critically ill adults; 498 of them had preadmission SCr values. In these patients, AKI incidence was 25.1% using preadmission SCr. First SCr had the best agreement for AKI diagnosis (22.5%; kappa = 0.90) and staging (kappa = 0.81). MDRD, CKD-EPI and minimal SCr overestimated AKI diagnosis (26.7%, 27.1% and 43.2%; kappa = 0.86, 0.86 and 0.60, respectively). However, MDRD and CKD-EPI computed SCr had a better sensitivity than first SCr for AKI (93% and 94% vs. 87%). Eighty-eight percent of patients experienced renal recovery at least 3 months after hospital discharge. All methods except the first SCr significantly underestimated the percentage of renal recovery. In a multivariate model, age, male gender, hypertension, heart failure, undergoing surgery and log first SCr best predicted preadmission SCr (adjusted R² = 0.56). Imputation methods with first SCr increased AKI incidence to 23.9% (kappa = 0.92) but not with MDRD computed SCr (26.7%; kappa = 0.89). Conclusion: In our cohort, first SCr performed better for AKI diagnosis and staging, as well as for renal recovery after hospital discharge, than MDRD, CKD-EPI or minimal SCr. However, MDRD SCr and CKD-EPI SCr improved AKI diagnosis sensitivity. Imputation methods minimally increased agreement for AKI diagnosis.
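
The "MDRD computed SCr" mentioned above comes from solving the MDRD equation for serum creatinine at an assumed eGFR of 75 ml/min/1.73 m². The following is a minimal sketch of that back-calculation. It assumes the four-variable, IDMS-traceable form of the MDRD equation (leading coefficient 175); the description does not say which version of the equation the study used, and the function names are hypothetical.

```python
def mdrd_egfr(scr_mg_dl, age_years, female, black):
    """Four-variable MDRD eGFR in ml/min/1.73 m^2 (assumed IDMS-traceable form)."""
    egfr = 175.0 * scr_mg_dl ** -1.154 * age_years ** -0.203
    if female:
        egfr *= 0.742
    if black:
        egfr *= 1.212
    return egfr

def mdrd_baseline_scr(age_years, female, black, target_egfr=75.0):
    """Invert the MDRD equation: the SCr (mg/dL) that yields target_egfr."""
    factor = 175.0 * age_years ** -0.203
    if female:
        factor *= 0.742
    if black:
        factor *= 1.212
    # target = factor * scr^-1.154  =>  scr = (target / factor)^(-1 / 1.154)
    return (target_egfr / factor) ** (-1.0 / 1.154)

# Surrogate baseline for a hypothetical 60-year-old patient.
scr_male = mdrd_baseline_scr(60, female=False, black=False)
scr_female = mdrd_baseline_scr(60, female=True, black=False)
print(f"male: {scr_male:.2f} mg/dL, female: {scr_female:.2f} mg/dL")
```

Because the surrogate depends only on age, sex, and race, every patient with the same demographics receives the same baseline, which is one reason such a value can disagree with a measured preadmission SCr.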

Project description: In various missing data problems, values are not entirely missing, but are coarsened. For coarsened observations, instead of observing the true value, a subset of values - strictly smaller than the full sample space of the variable - is observed to which the true value belongs. In our motivating example for patients with endometrial carcinoma, the degree of lymphovascular space invasion (LVSI) can be either absent, focally present, or substantially present. For a subset of individuals, however, LVSI is reported as being present, which includes both non-absent options. In the analysis of such a dataset, difficulties arise when coarsened observations are to be used in an imputation procedure. To our knowledge, no clear-cut method has been described in the literature on how to handle an observed subset of values, and treating them as entirely missing could lead to biased estimates. Therefore, in this paper, we evaluated the best strategy to deal with coarsened and missing data in multiple imputation. We tested a number of plausible ad hoc approaches, possibly already in use by statisticians. Additionally, we propose a principled approach to this problem, consisting of an adaptation of the SMC-FCS algorithm (SMC-FCS$_{\mathrm{CoCo}}$: Coarsening compatible), that ensures that imputed values adhere to the coarsening information. These methods were compared in a simulation study. This comparison shows that methods that prevent imputations of incompatible values, like the SMC-FCS$_{\mathrm{CoCo}}$ method, perform consistently better in terms of a lower bias and RMSE, and achieve better coverage than methods that ignore coarsening or handle it in a more naïve way. The analysis of the motivating example shows that the way the coarsening information is handled can matter substantially, leading to different conclusions across methods. Overall, our proposed SMC-FCS$_{\mathrm{CoCo}}$ method outperforms other methods in handling coarsened data, requires limited additional computation cost and is easily extendable to other scenarios.

