Dataset Information

Performance of binary prediction models in high-correlation low-dimensional settings: a comparison of methods.

ABSTRACT:

Background

Clinical prediction models are developed widely across medical disciplines. When predictors in such models are highly collinear, unexpected or spurious predictor-outcome associations may occur, thereby potentially reducing face-validity of the prediction model. Collinearity can be dealt with by exclusion of collinear predictors, but when there is no a priori motivation (besides collinearity) to include or exclude specific predictors, such an approach is arbitrary and possibly inappropriate.

Methods

We compare different methods to address collinearity, including shrinkage, dimensionality reduction, and constrained optimization. The effectiveness of these methods is illustrated via simulations.

Results

In the conducted simulations, no effect of collinearity was observed on predictive outcomes (AUC, R², Intercept, Slope) across methods. However, a negative effect of collinearity on the stability of predictor selection was found, affecting all compared methods, but in particular methods that perform strong predictor selection (e.g., Lasso). Methods for which the included set of predictors remained most stable under increased collinearity were Ridge, PCLR, LAELR, and Dropout.
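
One way to put a number on "stability of predictor selection" is to compare the sets of predictors selected across simulation repetitions. The sketch below uses mean pairwise Jaccard similarity; this metric is an assumption made for illustration and is not necessarily the measure used in the study. The predictor names and selections are hypothetical.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity of two predictor sets (1.0 when both are empty)."""
    a, b = set(a), set(b)
    union = a | b
    return 1.0 if not union else len(a & b) / len(union)


def selection_stability(selected_sets):
    """Mean pairwise Jaccard similarity across repetitions (1.0 = identical)."""
    pairs = list(combinations(selected_sets, 2))
    if not pairs:
        raise ValueError("need at least two selected sets")
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


# Hypothetical selections from three repetitions of a simulation.
stable = [{"x1", "x2", "x3"}, {"x1", "x2", "x3"}, {"x1", "x2", "x3"}]
unstable = [{"x1"}, {"x2"}, {"x1", "x3"}]

print(selection_stability(stable))    # identical sets in every repetition
print(selection_stability(unstable))  # sets change between repetitions
```

A method that keeps all predictors in every repetition scores 1.0 by construction, while a method whose selected set changes from one repetition to the next scores closer to 0.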

Conclusions

Based on the results, we would recommend refraining from data-driven predictor selection approaches in the presence of high collinearity, because of the increased instability of predictor selection, even in relatively high events-per-variable settings. The selection of certain predictors over others may disproportionally give the impression that included predictors have a stronger association with the outcome than excluded predictors.

SUBMITTER: Leeuwenberg AM

PROVIDER: S-EPMC8751246 | biostudies-literature | 2022 Jan

REPOSITORIES: biostudies-literature


Publications

Performance of binary prediction models in high-correlation low-dimensional settings: a comparison of methods.

Artuur M Leeuwenberg, Maarten van Smeden, Johannes A Langendijk, Arjen van der Schaaf, Murielle E Mauer, Karel G M Moons, Johannes B Reitsma, Ewoud Schuit

Diagnostic and Prognostic Research, 11 January 2022, issue 1


PMID: 35016734

Similar Datasets

Project description:

Importance: The lack of standards in methods to reduce bias for clinical algorithms presents various challenges in providing reliable predictions and in addressing health disparities.

Objective: To evaluate approaches for reducing bias in machine learning models using a real-world clinical scenario.

Design, setting, and participants: Health data for this cohort study were obtained from the IBM MarketScan Medicaid Database. Eligibility criteria were as follows: (1) female individuals aged 12 to 55 years with a live birth record identified by delivery-related codes from January 1, 2014, through December 31, 2018; (2) greater than 80% enrollment through pregnancy to 60 days post partum; and (3) evidence of coverage for depression screening and mental health services. Statistical analysis was performed in 2020.

Exposures: Binarized race (Black individuals and White individuals).

Main outcomes and measures: Machine learning models (logistic regression [LR], random forest, and extreme gradient boosting) were trained for 2 binary outcomes: postpartum depression (PPD) and postpartum mental health service utilization. Risk-adjusted generalized linear models were used for each outcome to assess potential disparity in the cohort associated with binarized race (Black or White). Methods for reducing bias, including reweighing, Prejudice Remover, and removing race from the models, were examined by analyzing changes in fairness metrics compared with the base models. Baseline characteristics of female individuals at the top-predicted risk decile were compared for systematic differences. The fairness metrics were disparate impact (DI, 1 indicates fairness) and equal opportunity difference (EOD, 0 indicates fairness).

Results: Among 573 634 female individuals initially examined for this study, 314 903 were White (54.9%), 217 899 were Black (38.0%), and the mean (SD) age was 26.1 (5.5) years. The risk-adjusted odds ratio comparing White participants with Black participants was 2.06 (95% CI, 2.02-2.10) for clinically recognized PPD and 1.37 (95% CI, 1.33-1.40) for postpartum mental health service utilization. Taking the LR model for PPD prediction as an example, reweighing reduced bias as measured by improved DI and EOD metrics from 0.31 and -0.19 to 0.79 and 0.02, respectively. Removing race from the models had inferior performance for reducing bias compared with the other methods (PPD: DI = 0.61; EOD = -0.05; mental health service utilization: DI = 0.63; EOD = -0.04).

Conclusions and relevance: Clinical prediction models trained on potentially biased data may produce unfair outcomes on the basis of the chosen metrics. This study's results suggest that the performance varied depending on the model, outcome label, and method for reducing bias. This approach toward evaluating algorithmic bias can be used as an example for the growing number of researchers who wish to examine and address bias in their data and models.
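
The two fairness metrics named above can be sketched with their commonly used definitions: DI as the ratio of positive-prediction rates between groups, and EOD as the difference in true-positive rates. The description does not state how the study computed them or which group it treated as privileged, so the function names, group labels, and toy data below are assumptions for illustration only.

```python
def rate(values):
    """Proportion of 1s in a non-empty list of 0/1 values."""
    return sum(values) / len(values)


def disparate_impact(y_pred, group, unprivileged, privileged):
    """Positive-prediction rate of the unprivileged group divided by that
    of the privileged group; 1 indicates parity."""
    rate_u = rate([p for p, g in zip(y_pred, group) if g == unprivileged])
    rate_p = rate([p for p, g in zip(y_pred, group) if g == privileged])
    return rate_u / rate_p


def equal_opportunity_difference(y_true, y_pred, group, unprivileged, privileged):
    """True-positive rate of the unprivileged group minus that of the
    privileged group; 0 indicates parity."""
    def tpr(name):
        preds = [p for t, p, g in zip(y_true, y_pred, group)
                 if g == name and t == 1]
        return rate(preds)
    return tpr(unprivileged) - tpr(privileged)


# Toy data (hypothetical): group "a" is treated as unprivileged here.
group = ["a"] * 4 + ["b"] * 4
y_true = [1, 1, 0, 0, 1, 1, 0, 0]
y_pred = [1, 0, 0, 0, 1, 1, 1, 0]

di = disparate_impact(y_pred, group, "a", "b")
eod = equal_opportunity_difference(y_true, y_pred, group, "a", "b")
print(di, eod)
```

The sketch assumes each group has at least one member, at least one true positive case, and that the privileged group receives at least one positive prediction; otherwise the rates are undefined and the code raises `ZeroDivisionError`.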

Project description: Low salinity waterflooding (LSWF) and its variants, also known as smart water or ion tuned water injection, have emerged as promising enhanced oil recovery (EOR) methods. LSWF is a complex process controlled by several mechanisms and parameters involving oil, brine, and rock composition. The major mechanisms and processes controlling LSWF are still being debated in the literature. Thus, the establishment of an approach that relates these parameters to the final recovery factor (RFf) is vital. The main objective of this research work was to use a number of artificial intelligence models to develop robust predictive models based on experimental data and the main parameters controlling LSWF, determined through sensitivity analysis and feature selection. The parameters include properties of oil, rock, injected brine, and connate water. Different operational parameters were considered to increase the model accuracy as well. After collecting the relevant data from 99 experimental studies reported in the literature, the database underwent a comprehensive and rigorous data preprocessing stage, which included removal of duplicates and low-variance features, missing value imputation, collinearity assessment, data characteristic assessment, outlier removal, feature selection, data splitting (the 80-20 rule was applied), and data scaling. Then, a number of methods such as linear regression (LR), multilayer perceptron (MLP), support vector machine (SVM), and committee machine intelligent system (CMIS) were used to link 1316 data samples assembled in this research work. Based on the obtained results, the CMIS model produced superior results compared to its counterparts, with root mean squared error (RMSE) values of 4.622 and 7.757 for the training and testing data, respectively. Based on the feature importance results, the presence of Ca2+ in the connate water, Na+ in the injected brine, core porosity, and total acid number of the crude oil are detected as the parameters with the highest impact on the RFf. The CMIS model proposed here can be applied with a high degree of confidence to predict the performance of LSWF in sandstone reservoirs. The database assembled for the purpose of this research work is so far the largest and most comprehensive of its kind, and it can be used to further delineate mechanisms behind LSWF and optimization of this EOR process in sandstone reservoirs.
