Dataset Information

Batch effect confounding leads to strong bias in performance estimates obtained by cross-validation.

ABSTRACT:

Background

With the large amount of biological data that is currently publicly available, many investigators combine multiple data sets to increase the sample size and potentially also the power of their analyses. However, technical differences ("batch effects") as well as differences in sample composition between the data sets may significantly affect the ability to draw generalizable conclusions from such studies.

Focus

The current study focuses on the construction of classifiers, and the use of cross-validation to estimate their performance. In particular, we investigate the impact of batch effects and differences in sample composition between batches on the accuracy of the classification performance estimate obtained via cross-validation. The focus on estimation bias is a main difference compared to previous studies, which have mostly focused on the predictive performance and how it relates to the presence of batch effects.

Data

We work on simulated data sets. To have realistic intensity distributions, we use real gene expression data as the basis for our simulation. Random samples from this expression matrix are selected and assigned to group 1 (e.g., 'control') or group 2 (e.g., 'treated'). We introduce batch effects and select some features to be differentially expressed between the two groups. We consider several scenarios for our study, most importantly different levels of confounding between groups and batch effects.
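The simulation design described above can be sketched roughly as follows. This is a minimal illustration, not the authors' code: the function name `simulate_batches`, all of its parameters, and the additive form of the batch effect and group effect are assumptions made for the example, and a random matrix stands in for the real expression data.

```python
import numpy as np

def simulate_batches(expr, n_per_batch=50, confounding=0.8, n_de=20,
                     effect=1.0, batch_sd=0.5, seed=0):
    """Draw samples from `expr` (samples x features) and build two batches.

    `confounding` is the fraction of batch 0 assigned to group 0 and of
    batch 1 assigned to group 1: 0.5 is balanced, 1.0 is fully confounded.
    All parameter names and defaults are hypothetical.
    """
    rng = np.random.default_rng(seed)
    n = 2 * n_per_batch
    # Randomly select samples (rows) from the underlying expression matrix.
    idx = rng.choice(expr.shape[0], size=n, replace=False)
    X = expr[idx].astype(float)
    batch = np.repeat([0, 1], n_per_batch)

    # Assign group labels within each batch according to the confounding level.
    n_major = int(round(confounding * n_per_batch))
    group = np.empty(n, dtype=int)
    for b in (0, 1):
        labels = np.array([b] * n_major + [1 - b] * (n_per_batch - n_major))
        group[batch == b] = rng.permutation(labels)

    # Batch effect (assumed additive): a feature-wise shift applied to batch 1.
    shift = rng.normal(0.0, batch_sd, size=X.shape[1])
    X[batch == 1] += shift

    # Differential expression: shift a random subset of features in group 1.
    de = rng.choice(X.shape[1], size=n_de, replace=False)
    X[np.ix_(group == 1, de)] += effect
    return X, group, batch, de

if __name__ == "__main__":
    # Placeholder for a real expression matrix (300 samples, 100 features).
    expr = np.random.default_rng(1).normal(8.0, 2.0, size=(300, 100))
    X, group, batch, de = simulate_batches(expr, confounding=0.9)
    print(X.shape, np.bincount(group[batch == 0]), np.bincount(group[batch == 1]))
```

Varying `confounding` between 0.5 and 1.0 gives a simple way to move from a balanced design to one in which group and batch cannot be separated.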

Methods

We focus on well-known classifiers: logistic regression, Support Vector Machines (SVM), k-nearest neighbors (kNN) and Random Forests (RF). Feature selection is performed with the Wilcoxon test or the lasso. Parameter tuning and feature selection, as well as the estimation of the prediction performance of each classifier, are performed within a nested cross-validation scheme. The estimated classification performance is then compared to the performance obtained when the classifier is applied to independent data.
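A nested cross-validation scheme of this kind could look roughly like the sketch below. It is an assumption-laden illustration rather than the study's implementation: it uses only kNN with a rank-sum (Wilcoxon-type) feature filter, written in plain NumPy, and the helper names (`wilcoxon_rank`, `knn_predict`, `cv_accuracy`, `nested_cv`), the fold counts and the parameter grid are all hypothetical. The key point it shows is that feature selection and tuning see only the training part of each outer fold.

```python
import numpy as np

def wilcoxon_rank(X, y, n_feat):
    """Indices of the n_feat features with the most extreme rank-sum statistic."""
    ranks = X.argsort(axis=0).argsort(axis=0) + 1  # per-feature ranks, ties ignored
    w = ranks[y == 1].sum(axis=0)                  # rank sum of group 1
    expected = (y == 1).sum() * (len(y) + 1) / 2   # its mean under the null
    return np.argsort(-np.abs(w - expected))[:n_feat]

def knn_predict(X_train, y_train, X_test, k):
    """Majority vote among the k nearest training samples (use odd k)."""
    d = ((X_test[:, None, :] - X_train[None, :, :]) ** 2).sum(axis=2)
    nn = np.argsort(d, axis=1)[:, :k]
    return (y_train[nn].mean(axis=1) > 0.5).astype(int)

def folds(n, n_folds, rng):
    """Random partition of range(n) into n_folds index arrays."""
    return np.array_split(rng.permutation(n), n_folds)

def fit_predict(X, y, train, test, n_feat, k):
    # Feature selection uses the training samples only.
    feats = wilcoxon_rank(X[train], y[train], n_feat)
    return knn_predict(X[train][:, feats], y[train], X[test][:, feats], k)

def cv_accuracy(X, y, n_feat, k, n_folds, rng):
    """Plain cross-validated accuracy for one parameter setting."""
    correct = 0
    for test in folds(len(y), n_folds, rng):
        train = np.setdiff1d(np.arange(len(y)), test)
        correct += (fit_predict(X, y, train, test, n_feat, k) == y[test]).sum()
    return correct / len(y)

def nested_cv(X, y, grid, outer=5, inner=3, seed=0):
    """Outer loop estimates performance; inner loop picks (n_feat, k)."""
    rng = np.random.default_rng(seed)
    correct = 0
    for test in folds(len(y), outer, rng):
        train = np.setdiff1d(np.arange(len(y)), test)
        # Inner cross-validation on the outer-training samples only.
        n_feat, k = max(
            grid, key=lambda p: cv_accuracy(X[train], y[train], *p, inner, rng))
        correct += (fit_predict(X, y, train, test, n_feat, k) == y[test]).sum()
    return correct / len(y)

if __name__ == "__main__":
    rng = np.random.default_rng(42)
    X = rng.normal(size=(60, 30))
    y = np.repeat([0, 1], 30)
    X[y == 1, :5] += 3.0  # five informative features (toy data)
    print(nested_cv(X, y, grid=[(5, 3), (20, 3), (5, 5)]))
```

To mirror the comparison described above, the same selected features and tuned parameters would then be applied to a separately generated data set, and the accuracy there set against the nested cross-validation estimate.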

SUBMITTER: Soneson C

PROVIDER: S-EPMC4072626 | biostudies-literature | 2014

REPOSITORIES: biostudies-literature

Publications

Batch effect confounding leads to strong bias in performance estimates obtained by cross-validation.

Charlotte Soneson, Sarah Gerster, Mauro Delorenzi

PLoS ONE, 2014-06-26, issue 6

PMID: 24967636

Similar Datasets

Project description:

Background

Confounding is a common issue in epidemiological research. Commonly used confounder-adjustment methods include multivariable regression analysis and propensity score methods. Although it is common practice to assess the linearity assumption for the exposure-outcome effect, most researchers do not assess linearity of the relationship between the confounder and the exposure and between the confounder and the outcome before adjusting for the confounder in the analysis. Failing to take the true non-linear functional form of the confounder-exposure and confounder-outcome associations into account may result in an under- or overestimation of the true exposure effect. Therefore, this paper aims to demonstrate the importance of assessing the linearity assumption for confounder-exposure and confounder-outcome associations and the importance of correctly specifying these associations when the linearity assumption is violated.

Methods

A Monte Carlo simulation study was used to assess and compare the performance of confounder-adjustment methods when the functional form of the confounder-exposure and confounder-outcome associations were misspecified (i.e., linearity was wrongly assumed) and correctly specified (i.e., linearity was rightly assumed) under multiple sample sizes. An empirical data example was used to illustrate that the misspecification of confounder-exposure and confounder-outcome associations leads to bias.

Results

The simulation study illustrated that the exposure effect estimate will be biased when for propensity score (PS) methods the confounder-exposure association is misspecified. For methods in which the outcome is regressed on the confounder or the PS, the exposure effect estimate will be biased if the confounder-outcome association is misspecified. In the empirical data example, correct specification of the confounder-exposure and confounder-outcome associations resulted in smaller exposure effect estimates.

Conclusion

When attempting to remove bias by adjusting for confounding, misspecification of the confounder-exposure and confounder-outcome associations might actually introduce bias. It is therefore important that researchers not only assess the linearity of the exposure-outcome effect, but also of the confounder-exposure or confounder-outcome associations depending on the confounder-adjustment method used.
