Dataset Information

Selection of variables for multivariable models: Opportunities and limitations in quantifying model stability by resampling.

ABSTRACT: Statistical models are often fitted to obtain a concise description of the association of an outcome variable with some covariates. Even if background knowledge is available to guide preselection of covariates, stepwise variable selection is commonly applied to remove irrelevant ones. This practice may introduce additional variability, and selection is rarely certain. However, these issues are often ignored and model stability is not questioned. Several resampling-based measures have been proposed to describe model stability, including variable inclusion frequencies (VIFs), model selection frequencies, relative conditional bias (RCB), and root mean squared difference ratio (RMSDR). The latter two were recently proposed to assess bias and variance inflation induced by variable selection. Here, we study the consistency and accuracy of resampling estimates of these measures and the optimal choice of the resampling technique. In particular, we compare subsampling and bootstrapping for assessing stability of linear, logistic, and Cox models obtained by backward elimination in a simulation study. Moreover, we exemplify the estimation and interpretation of all suggested measures in a study on cardiovascular risk. The VIF and the model selection frequency are only consistently estimated in the subsampling approach. By contrast, the bootstrap is advantageous in terms of bias and precision for estimating the RCB as well as the RMSDR. However, unbiased estimation of the latter quantity requires independence of covariates, which is rarely encountered in practice. Our study stresses the importance of addressing model stability after variable selection and shows how to cope with it.
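As a rough illustration of the resampling idea in the abstract, the sketch below estimates variable inclusion frequencies for a linear model under both subsampling and bootstrapping. It is not the authors' implementation: the AIC-based stopping rule for backward elimination, the subsample fraction of 0.5, the simulated data, and all function names (`rss`, `aic`, `backward_elimination`, `inclusion_frequencies`) are assumptions made for this example only.

```python
import math
import random

def rss(X_cols, y):
    """Residual sum of squares of an OLS fit with intercept (normal equations)."""
    n = len(y)
    cols = [[1.0] * n] + X_cols
    p = len(cols)
    # Augmented normal-equation matrix [X'X | X'y]
    M = [[sum(a * b for a, b in zip(ci, cj)) for cj in cols]
         + [sum(a * b for a, b in zip(ci, y))] for ci in cols]
    for i in range(p):  # Gaussian elimination with partial pivoting
        piv = max(range(i, p), key=lambda r: abs(M[r][i]))
        M[i], M[piv] = M[piv], M[i]
        for r in range(i + 1, p):
            f = M[r][i] / M[i][i]
            for k in range(i, p + 1):
                M[r][k] -= f * M[i][k]
    beta = [0.0] * p
    for i in reversed(range(p)):  # back substitution
        beta[i] = (M[i][p] - sum(M[i][k] * beta[k] for k in range(i + 1, p))) / M[i][i]
    return sum((y[k] - sum(beta[j] * cols[j][k] for j in range(p))) ** 2
               for k in range(n))

def aic(X_cols, y):
    n = len(y)
    return n * math.log(rss(X_cols, y) / n) + 2 * (len(X_cols) + 1)

def backward_elimination(X_cols, y):
    """Drop one covariate at a time while doing so lowers the AIC (assumed rule)."""
    active = list(range(len(X_cols)))
    current = aic([X_cols[j] for j in active], y)
    while active:
        best, drop = min((aic([X_cols[j] for j in active if j != d], y), d)
                         for d in active)
        if best >= current:
            break
        active.remove(drop)
        current = best
    return set(active)

def inclusion_frequencies(X_cols, y, scheme, n_rep=100, m_frac=0.5, seed=1):
    """Share of resamples in which each covariate survives the selection."""
    rng = random.Random(seed)
    n = len(y)
    counts = [0] * len(X_cols)
    for _ in range(n_rep):
        if scheme == "bootstrap":   # n draws with replacement
            idx = [rng.randrange(n) for _ in range(n)]
        else:                       # subsample: m < n draws without replacement
            idx = rng.sample(range(n), int(m_frac * n))
        Xs = [[col[i] for i in idx] for col in X_cols]
        ys = [y[i] for i in idx]
        for j in backward_elimination(Xs, ys):
            counts[j] += 1
    return [c / n_rep for c in counts]

# Hypothetical data: covariates 0 and 1 have true effects, 2 and 3 are noise.
gen = random.Random(42)
n_obs = 150
X = [[gen.gauss(0, 1) for _ in range(n_obs)] for _ in range(4)]
y = [1.0 * X[0][i] + 0.5 * X[1][i] + gen.gauss(0, 1) for i in range(n_obs)]

freq_sub = inclusion_frequencies(X, y, "subsample")
freq_boot = inclusion_frequencies(X, y, "bootstrap")
print("subsampling:", freq_sub)
print("bootstrap:  ", freq_boot)
```

Comparing the two printed vectors shows how the choice of resampling scheme can shift the inclusion frequencies of the noise covariates, which is the kind of difference the study investigates. The sketch covers only the linear case and does not compute the RCB or RMSDR.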

SUBMITTER: Wallisch C

PROVIDER: S-EPMC7820988 | biostudies-literature | 2021 Jan

REPOSITORIES: biostudies-literature

Publications

Selection of variables for multivariable models: Opportunities and limitations in quantifying model stability by resampling.

Christine Wallisch, Daniela Dunkler, Geraldine Rauch, Riccardo de Bin, Georg Heinze

Statistics in Medicine, 2020-10-21, issue 2

PMID: 33089538

Similar Datasets

Project description:
Background: How to select variables and identify functional forms for continuous variables is a key concern when creating a multivariable model. Ad hoc 'traditional' approaches to variable selection have been in use for at least 50 years. Similarly, methods for determining functional forms for continuous variables were first suggested many years ago. More recently, many alternative approaches to address these two challenges have been proposed, but knowledge of their properties and meaningful comparisons between them are scarce. To define a state of the art and to provide evidence-supported guidance to researchers who have only a basic level of statistical knowledge, many outstanding issues in multivariable modelling remain. Our main aims are to identify and illustrate such gaps in the literature and present them at a moderate technical level to the wide community of practitioners, researchers and students of statistics.
Methods: We briefly discuss general issues in building descriptive regression models, strategies for variable selection, different ways of choosing functional forms for continuous variables and methods for combining the selection of variables and functions. We discuss two examples, taken from the medical literature, to illustrate problems in the practice of modelling.
Results: Our overview revealed that there is not yet enough evidence on which to base recommendations for the selection of variables and functional forms in multivariable analysis. Such evidence may come from comparisons between alternative methods. In particular, we highlight seven important topics that require further investigation and make suggestions for the direction of further research.
Conclusions: Selection of variables and of functional forms are important topics in multivariable analysis. To define a state of the art and to provide evidence-supported guidance to researchers who have only a basic level of statistical knowledge, further comparative research is required.

Project description: Within aquaculture industries, selection based on genomic information (genomic selection) has the profound potential to change genetic improvement programs and production systems. Genomic selection exploits the use of realized genomic relationships among individuals and information from genome-wide markers in close linkage disequilibrium with genes of biological and economic importance. We discuss the technical advances, practical requirements, and commercial applications that have made genomic selection feasible in a range of aquaculture industries, with a particular focus on molluscs (pearl oysters, Pinctada maxima) and marine shrimp (Litopenaeus vannamei and Penaeus monodon). The use of low-cost genome sequencing has enabled cost-effective genotyping on a large scale and is of particular value for species without a reference genome or access to commercial genotyping arrays. We highlight the pitfalls and offer the solutions to the genotyping by sequencing approach and the building of appropriate genetic resources to undertake genomic selection from first-hand experience. We describe the potential to capture large-scale commercial phenotypes based on image analysis and artificial intelligence through machine learning, as inputs for calculation of genomic breeding values. The application of genomic selection over traditional aquatic breeding programs offers significant advantages through being able to accurately predict complex polygenic traits including disease resistance; increasing rates of genetic gain; minimizing inbreeding; and negating potential limiting effects of genotype by environment interactions. Further practical advantages of genomic selection through the use of large-scale communal mating and rearing systems are highlighted, as well as presenting rate-limiting steps that impact on attaining maximum benefits from adopting genomic selection. Genomic selection is now at the tipping point where commercial applications can be readily adopted and offer significant short- and long-term solutions to sustainable and profitable aquaculture industries.
