Dataset Information

Stratification bias in low signal microarray studies.

ABSTRACT:

Background

When analysing microarray and other small sample size biological datasets, care is needed to avoid various biases. We analyse a form of bias, stratification bias, that can substantially affect analyses using sample-reuse validation techniques and lead to inaccurate results. This bias is due to imperfect stratification of samples in the training and test sets and the dependency between these stratification errors, i.e. the variations in class proportions in the training and test sets are negatively correlated.
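The negative correlation described above can be checked numerically. The following is a minimal sketch, not the authors' code: the helper names `split_proportions` and `pearson`, the dataset size (20 samples, 10 per class), and the test-set size of 4 are all illustrative assumptions. Because every sample lands in either the training or the test set, a test set that happens to be rich in one class leaves a training set that is poor in it.

```python
import random

def split_proportions(labels, test_size, rng):
    """Randomly split 0/1 labels; return (train, test) positive fractions."""
    shuffled = labels[:]
    rng.shuffle(shuffled)
    test, train = shuffled[:test_size], shuffled[test_size:]
    return sum(train) / len(train), sum(test) / len(test)

def pearson(xs, ys):
    """Plain Pearson correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

rng = random.Random(0)               # fixed seed, assumption for repeatability
labels = [1] * 10 + [0] * 10         # hypothetical balanced dataset
pairs = [split_proportions(labels, 4, rng) for _ in range(500)]
r = pearson([p[0] for p in pairs], [p[1] for p in pairs])
print(f"train/test class-proportion correlation: {r:.3f}")
```

With fixed training and test sizes the training proportion is an exact decreasing linear function of the test proportion, so the correlation in this unstratified setting sits at -1 up to rounding.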

Results

We show that when estimating the performance of classifiers on low signal datasets (i.e. those which are difficult to classify), which are typical of many prognostic microarray studies, commonly used performance measures can suffer from a substantial negative bias. For error rate this bias is only severe in quite restricted situations, but can be much larger and more frequent when using ranking measures such as the receiver operating characteristic (ROC) curve and area under the ROC curve (AUC). Substantial biases are shown in simulations and on the van 't Veer breast cancer dataset. The classification error rate can have large negative biases for balanced datasets, whereas the AUC shows substantial pessimistic biases even for imbalanced datasets. In simulation studies using 10-fold cross-validation, AUC values of less than 0.3 can be observed on random datasets rather than the expected 0.5. Further experiments on the van 't Veer breast cancer dataset show these biases exist in practice.
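An extreme case of this pessimistic AUC bias can be reproduced without any randomness. The sketch below is an illustration under stated assumptions, not the simulation from the paper: it uses a hypothetical "prior-only" scorer that ignores all features and outputs the positive-class fraction of its training set, evaluated with unbalanced leave-one-out cross-validation and pooled test scores. The function names `auc` and `loo_pooled_auc` are invented for this example.

```python
def auc(scores, labels):
    """AUC as the Mann-Whitney statistic; tied pairs count one half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

def loo_pooled_auc(labels):
    """Leave-one-out CV with a prior-only scorer, pooling all test scores."""
    n, n_pos = len(labels), sum(labels)
    # Holding out sample i leaves n_pos - labels[i] positives among n - 1.
    scores = [(n_pos - y) / (n - 1) for y in labels]
    return auc(scores, labels)

labels = [1] * 10 + [0] * 10         # hypothetical dataset with no signal
pooled = loo_pooled_auc(labels)
print(pooled)                        # 0.0, far below the chance level of 0.5
```

Each held-out positive is scored 9/19 and each held-out negative 10/19, so every positive ranks below every negative and the pooled AUC is 0 even though the scorer knows nothing about the samples.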

Conclusion

Stratification bias can substantially affect several performance measures. In computing the AUC, the strategy of pooling the test samples from the various folds of cross-validation can lead to large biases; computing it as the average of per-fold estimates avoids this bias and is thus the recommended approach. As a more general solution applicable to other performance measures, we show that stratified repeated holdout and modified versions of k-fold cross-validation (balanced, stratified cross-validation and balanced leave-one-out cross-validation) avoid the bias. Therefore, for model selection and evaluation of microarray and other small biological datasets, these methods should be used and unstratified versions avoided. In particular, the commonly used (unbalanced) leave-one-out cross-validation should not be used to estimate AUC for small datasets.
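The per-fold averaging recommended above can be sketched as follows. This is a simplified illustration, not the balanced, stratified cross-validation procedure from the paper: the fold assignment is a plain round-robin stratification, the scorer is the same hypothetical prior-only scorer (training-set positive fraction), and the names `stratified_folds` and `per_fold_mean_auc` are assumptions made for this example.

```python
import random

def auc(scores, labels):
    """AUC as the Mann-Whitney statistic; tied pairs count one half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

def stratified_folds(labels, k, rng):
    """Deal each class round-robin into k folds so proportions match."""
    folds = [[] for _ in range(k)]
    for cls in (0, 1):
        members = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(members)
        for position, i in enumerate(members):
            folds[position % k].append(i)
    return folds

def per_fold_mean_auc(labels, k, rng):
    """Compute AUC inside each test fold, then average across folds."""
    fold_aucs = []
    for test_idx in stratified_folds(labels, k, rng):
        held_out = set(test_idx)
        train = [y for i, y in enumerate(labels) if i not in held_out]
        score = sum(train) / len(train)      # prior-only score, no features
        test_labels = [labels[i] for i in test_idx]
        fold_aucs.append(auc([score] * len(test_labels), test_labels))
    return sum(fold_aucs) / len(fold_aucs)

labels = [1] * 10 + [0] * 10                 # hypothetical no-signal dataset
mean_auc = per_fold_mean_auc(labels, 5, random.Random(0))
print(mean_auc)                              # 0.5, the chance level
```

Within a single fold every test sample is scored by the same model, so differences in training-set class proportions between folds never enter a ranking comparison; here all within-fold scores tie and each fold reports 0.5.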

SUBMITTER: Parker BJ

PROVIDER: S-EPMC2211509 | biostudies-literature | 2007 Sep

REPOSITORIES: biostudies-literature

Publications

Stratification bias in low signal microarray studies.

Brian J Parker, Simon Günter, Justin Bedo

BMC Bioinformatics, 2007 Sep 2

PMID: 17764577

Similar Datasets

Project description:

Background: Ionizing radiation is an established carcinogen, but risks from low-dose exposures are controversial. Since the Biological Effects of Ionizing Radiation VII review of the epidemiological data in 2006, many subsequent publications have reported excess cancer risks from low-dose exposures. Our aim was to systematically review these studies to assess the magnitude of the risk and whether the positive findings could be explained by biases.

Methods: Eligible studies had mean cumulative doses of less than 100 mGy, individualized dose estimates, risk estimates, and confidence intervals (CI) for the dose-response and were published in 2006-2017. We summarized the evidence for bias (dose error, confounding, outcome ascertainment) and its likely direction for each study. We tested whether the median excess relative risk (ERR) per unit dose equals zero and assessed the impact of excluding positive studies with potential bias away from the null. We performed a meta-analysis to quantify the ERR and assess consistency across studies for all solid cancers and leukemia.

Results: Of the 26 eligible studies, 8 concerned environmental, 4 medical, and 14 occupational exposure. For solid cancers, 16 of 22 studies reported positive ERRs per unit dose, and we rejected the hypothesis that the median ERR equals zero (P = .03). After exclusion of 4 positive studies with potential positive bias, 12 of 18 studies reported positive ERRs per unit dose (P = .12). For leukemia, 17 of 20 studies were positive, and we rejected the hypothesis that the median ERR per unit dose equals zero (P = .001), also after exclusion of 5 positive studies with potential positive bias (P = .02). For adulthood exposure, the meta-ERR at 100 mGy was 0.029 (95% CI = 0.011 to 0.047) for solid cancers and 0.16 (95% CI = 0.07 to 0.25) for leukemia. For childhood exposure, the meta-ERR at 100 mGy for leukemia was 2.84 (95% CI = 0.37 to 5.32); there were only two eligible studies of all solid cancers.

Conclusions: Our systematic assessments in this monograph showed that these new epidemiological studies are characterized by several limitations, but only a few positive studies were potentially biased away from the null. After exclusion of these studies, the majority of studies still reported positive risk estimates. We therefore conclude that these new epidemiological studies directly support excess cancer risks from low-dose ionizing radiation. Furthermore, the magnitude of the cancer risks from these low-dose radiation exposures was statistically compatible with the radiation dose-related cancer risks of the atomic bomb survivors.
