Dataset Information

A comparison of model selection methods for prediction in the presence of multiply imputed data.

ABSTRACT: Many approaches for variable selection with multiply imputed (MI) data in the development of a prognostic model have been proposed. However, no method prevails as uniformly best. We conducted a simulation study with a binary outcome and a logistic regression model to compare two classes of variable selection methods in the presence of MI data: (I) model selection on bootstrap data, using backward elimination based on AIC or lasso, with the final model fitted on the variables selected most frequently (e.g. in ≥50%) over all MI and bootstrap data sets; (II) model selection on the original MI data, using lasso. In class II, the final model is obtained by (i) averaging estimates of variables that were selected in any MI data set, (ii) averaging estimates of variables that were selected in 50% of the MI data sets, (iii) performing lasso on the stacked MI data, or (iv) as in (iii) but using individual weights determined by the fraction of missingness. In all lasso models, we used both the optimal penalty and the 1-se rule. We considered recalibrating models to correct for overshrinkage due to the suboptimal penalty by refitting the linear predictor or all individual variables. We applied the methods to a real dataset of 951 adult patients with tuberculous meningitis to predict mortality within nine months. Overall, lasso selection with the 1-se penalty showed the best performance in both approaches I and II. Stacking MI data is an attractive approach because it does not require choosing a selection threshold when combining results from separate MI data sets.
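The threshold-based combination in approach II, options (i) and (ii), can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `combine_mi_selection`, the variable names `x1` to `x3`, and the coefficient values are hypothetical, and the sketch assumes that a coefficient of zero means "not selected" and that estimates are averaged over all MI data sets (zeros included), a detail the abstract does not specify.

```python
from statistics import mean


def combine_mi_selection(coefs_per_imputation, threshold=0.5):
    """Combine lasso fits from separate MI data sets (hypothetical helper).

    coefs_per_imputation: list with one dict per imputed data set, mapping
    variable name -> lasso coefficient (0.0 or absent = not selected).
    threshold: minimum fraction of MI data sets in which a variable must be
    selected to enter the final model.
    """
    m = len(coefs_per_imputation)
    names = sorted({name for coefs in coefs_per_imputation for name in coefs})
    combined = {}
    for name in names:
        values = [coefs.get(name, 0.0) for coefs in coefs_per_imputation]
        selection_frequency = sum(v != 0.0 for v in values) / m
        if selection_frequency >= threshold:
            # Assumption: average over all m data sets, zeros included.
            combined[name] = mean(values)
    return combined


# Made-up coefficients from four imputed data sets.
fits = [
    {"x1": 0.5, "x2": 1.0, "x3": 0.0},
    {"x1": 0.25, "x2": 0.0, "x3": 0.0},
    {"x1": 0.75, "x2": 0.5},
    {"x1": 0.5, "x2": 0.0, "x3": -0.25},
]
at_least_half = combine_mi_selection(fits, threshold=0.5)        # option (ii)
in_any = combine_mi_selection(fits, threshold=1 / len(fits))     # option (i)
```

Setting `threshold` to `1 / m` reproduces the "selected in any MI data set" rule, while `0.5` gives the 50% rule. Stacking the MI data, as in options (iii) and (iv), avoids choosing this threshold altogether.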

SUBMITTER: Thao LTP

PROVIDER: S-EPMC6492211 | biostudies-literature | 2019 Mar

REPOSITORIES: biostudies-literature

Publications

A comparison of model selection methods for prediction in the presence of multiply imputed data.

Thao Le Thi Phuong (Thao LTP), Geskus Ronald (Geskus R)

Biometrical Journal (Biometrische Zeitschrift), 2018-10-23, issue 2

PMID: 30353591

Similar Datasets

Project description: Background: Assessing disparities in injury is crucial for injury prevention and for evaluating injury prevention strategies, but efforts have been hampered by missing data. This study aimed to show the utility and reliability of the injury surveillance system as a trustworthy resource for examining disparities by generating multiple imputed companion datasets. Methods: We employed data from the National Electronic Injury Surveillance System-All Injury Program (NEISS-AIP) for the period 2014–2018. A comprehensive simulation study was conducted to identify the appropriate strategy for addressing missing data limitations in NEISS-AIP. To evaluate the imputation performance more quantitatively, a new method based on Brier Skill Score (BSS) was developed to assess the accuracy of predictions by different approaches. We selected the multiple imputations by fully conditional specification (FCS MI) to generate the imputed companion data to NEISS-AIP 2014–2018. We further assessed health disparities systematically in nonfatal assault injuries treated in U.S. hospital emergency departments (EDs) by race and ethnicity, location of injury and sex. Results: We found for the first time that significantly higher age-adjusted nonfatal assault injury rates for ED visits per 100,000 population occurred among non-Hispanic Black persons (1306.8, 95% Confidence Interval [CI]: 660.1 – 1953.5), in public settings (286.3, 95% CI: 183.2 – 389.4) and for males (603.5, 95% CI: 409.4 – 797.5). We also observed similar trends in age-adjusted rates (AARs) by different subgroups for non-Hispanic Black persons, injuries occurring in public settings, and for males: AARs of nonfatal assault injury increased significantly from 2014 through 2017, then declined significantly in 2018. Conclusions: Nonfatal assault injury imposes significant health care costs and productivity losses for millions of people each year. This study is the first to specifically look at health disparities in nonfatal assault injuries using multiply imputed companion data. Understanding how disparities differ by various groups may lead to the development of more effective initiatives to prevent such injury. Supplementary Information: The online version contains supplementary material available at 10.1186/s12939-023-01940-4.
