Dataset Information

Cross-validation under separate sampling: strong bias and how to correct it.


ABSTRACT:

Motivation

It is commonly assumed in pattern recognition that cross-validation error estimation is 'almost unbiased' as long as the number of folds is not too small. While this is true for random sampling, it is not true with separate sampling, where the populations are independently sampled, which is a common situation in bioinformatics.

Results

We demonstrate, via analytical and numerical methods, that classical cross-validation can have strong bias under separate sampling, depending on the difference between the sampling ratios and the true population probabilities. We propose a new separate-sampling cross-validation error estimator, and prove that it satisfies an 'almost unbiased' theorem similar to that of random-sampling cross-validation. We present two case studies with previously published data, which show that the results can change drastically if the correct form of cross-validation is used.
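The dependence of the bias on the gap between the sampling ratio and the true population probability can be illustrated with a small weighting calculation. The sketch below is not the authors' estimator or their C++ code. It assumes the class-specific error rates are already known and fixed, and it only compares two ways of combining them: weighting by the sample ratio n0/(n0+n1), as pooling the test points implicitly does, versus weighting by an assumed known prior c0. All function names and numbers are hypothetical.

```python
def pooled_error(err0, err1, n0, n1):
    """Combine class-specific error rates using the sample ratio r = n0 / (n0 + n1)."""
    r = n0 / (n0 + n1)
    return r * err0 + (1 - r) * err1


def prior_weighted_error(err0, err1, c0):
    """Combine class-specific error rates using the population prior c0 of class 0."""
    return c0 * err0 + (1 - c0) * err1


# Hypothetical values chosen only for illustration (not taken from the paper).
err0, err1 = 0.05, 0.40   # assumed error rates on class 0 and class 1
n0, n1 = 50, 50           # separate sampling: class sizes fixed by the experimenter
c0 = 0.9                  # assumed true prior probability of class 0

pooled = pooled_error(err0, err1, n0, n1)
weighted = prior_weighted_error(err0, err1, c0)

# The gap equals (r - c0) * (err0 - err1), so it vanishes when r == c0.
gap = pooled - weighted
print(f"pooled={pooled:.3f} prior-weighted={weighted:.3f} gap={gap:.3f}")
```

With a balanced sample (r = 0.5) and a prior of 0.9, the two weightings disagree by (0.5 - 0.9) × (0.05 - 0.40) = 0.14. This captures only the weighting effect; in practice the class-specific error rates themselves also depend on the ratio used to train the classifier, which this sketch ignores.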

Availability and implementation

The source code in C++, along with the Supplementary Materials, is available at: http://gsp.tamu.edu/Publications/supplementary/zollanvari13/.

SUBMITTER: Braga-Neto UM 

PROVIDER: S-EPMC4296143 | biostudies-literature | 2014 Dec

REPOSITORIES: biostudies-literature

Publications

Cross-validation under separate sampling: strong bias and how to correct it.

Braga-Neto UM, Zollanvari A, Dougherty ER

Bioinformatics (Oxford, England), 2014-08-13, issue 23


Similar Datasets

| S-EPMC6636226 | biostudies-literature
| S-EPMC4072626 | biostudies-literature
| S-EPMC6156650 | biostudies-literature
| S-EPMC7304018 | biostudies-literature
| S-EPMC1890303 | biostudies-literature
2024-01-09 | GSE249022 | GEO
| S-EPMC9238114 | biostudies-literature
| S-EPMC3280953 | biostudies-literature
| S-EPMC3951147 | biostudies-literature
| S-EPMC1570370 | biostudies-literature