Dataset Information

Optimally splitting cases for training and testing high dimensional classifiers.

ABSTRACT:

Background

We consider the problem of designing a study to develop a predictive classifier from high dimensional data. A common study design is to split the sample into a training set and an independent test set, where the former is used to develop the classifier and the latter to evaluate its performance. In this paper we address the question of what proportion of the samples should be devoted to the training set. How does this proportion impact the mean squared error (MSE) of the prediction accuracy estimate?

Results

We develop a non-parametric algorithm for determining an optimal splitting proportion that can be applied with a specific dataset and classifier algorithm. We also perform a broad simulation study for the purpose of better understanding the factors that determine the best split proportions and to evaluate commonly used splitting strategies (1/2 training or 2/3 training) under a wide variety of conditions. These methods are based on a decomposition of the MSE into three intuitive component parts.
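The idea of comparing split proportions by the MSE of the test-set accuracy estimate can be illustrated with a small simulation. The following is only a minimal sketch under stated assumptions, not the authors' nonparametric algorithm: the synthetic two-class Gaussian data, the nearest-centroid classifier, the helper names (`make_data`, `train_centroids`, `stratified_split`, `split_mse`), and the use of a large independent sample to approximate the full-dataset classifier's accuracy are all hypothetical choices made for illustration.

```python
import random

def make_data(n, dim, shift, rng):
    """Assumed toy model: two balanced Gaussian classes; only the first
    5 features are differentially expressed (mean differs by `shift`)."""
    data = []
    for i in range(n):
        label = i % 2
        x = [rng.gauss(shift * label if j < 5 else 0.0, 1.0) for j in range(dim)]
        data.append((x, label))
    return data

def train_centroids(train):
    """Nearest-centroid classifier (a simple linear rule): per-class feature means."""
    dim = len(train[0][0])
    sums = {0: [0.0] * dim, 1: [0.0] * dim}
    counts = {0: 0, 1: 0}
    for x, y in train:
        counts[y] += 1
        for j, v in enumerate(x):
            sums[y][j] += v
    return {c: [s / counts[c] for s in sums[c]] for c in (0, 1)}

def accuracy(centroids, test):
    """Fraction of test cases assigned to the correct class."""
    def dist2(a, b):
        return sum((u - v) ** 2 for u, v in zip(a, b))
    hits = 0
    for x, y in test:
        pred = 1 if dist2(x, centroids[1]) < dist2(x, centroids[0]) else 0
        hits += int(pred == y)
    return hits / len(test)

def stratified_split(data, p, rng):
    """Randomly assign a proportion p of each class to training, the rest to test."""
    train, test = [], []
    for c in (0, 1):
        rows = [d for d in data if d[1] == c]
        rng.shuffle(rows)
        k = max(1, min(len(rows) - 1, round(p * len(rows))))
        train += rows[:k]
        test += rows[k:]
    return train, test

def split_mse(data, reference_acc, proportions, n_splits, rng):
    """Mean squared difference between the split-based accuracy estimate
    and the reference accuracy, for each candidate training proportion."""
    result = {}
    for p in proportions:
        squared_errors = []
        for _ in range(n_splits):
            train, test = stratified_split(data, p, rng)
            estimate = accuracy(train_centroids(train), test)
            squared_errors.append((estimate - reference_acc) ** 2)
        result[p] = sum(squared_errors) / len(squared_errors)
    return result

rng = random.Random(0)
data = make_data(n=60, dim=20, shift=1.0, rng=rng)
# Assumption: a large independent sample stands in for the unknown true
# accuracy of the classifier trained on the full dataset.
big = make_data(n=2000, dim=20, shift=1.0, rng=rng)
reference_acc = accuracy(train_centroids(data), big)

proportions = [0.3, 0.5, 2 / 3, 0.8, 0.9]
mse = split_mse(data, reference_acc, proportions, n_splits=50, rng=rng)
best = min(mse, key=mse.get)
print("estimated MSE by training proportion:", mse)
print("proportion with the smallest estimated MSE:", best)
```

With real data no independent sample is available, so the reference accuracy would itself have to be estimated by resampling; the sketch only shows how candidate proportions can be compared on a common MSE scale.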

Conclusions

By applying these approaches to a number of synthetic and real microarray datasets we show that for linear classifiers the optimal proportion depends on the overall number of samples available and the degree of differential expression between the classes. The optimal proportion was found to depend on the full dataset size (n) and classification accuracy, with higher accuracy and smaller n resulting in more samples assigned to the training set. The commonly used strategy of allocating two-thirds of cases for training was close to optimal for reasonably sized datasets (n ≥ 100) with strong signals (i.e., 85% or greater full dataset accuracy). In general, we recommend use of our nonparametric resampling approach for determining the optimal split. This approach can be applied to any dataset, using any predictor development method, to determine the best split.

SUBMITTER: Dobbin KK

PROVIDER: S-EPMC3090739 | biostudies-literature | 2011 Apr

REPOSITORIES: biostudies-literature

Publications

Optimally splitting cases for training and testing high dimensional classifiers.

Dobbin KK, Simon RM

BMC Medical Genomics, 2011 Apr 8

PMID: 21477282

Similar Datasets

Probabilistic classifiers with high-dimensional data.
| S-EPMC3138069 | biostudies-literature

Improved shrunken centroid classifiers for high-dimensional class-imbalanced data.
| S-EPMC3687811 | biostudies-other

The role of balanced training and testing data sets for binary classifiers in bioinformatics.
| S-EPMC3706434 | biostudies-literature

Testing optimally weighted combination of variants for hypertension.
| S-EPMC4143713 | biostudies-literature

HYPOTHESIS TESTING FOR HIGH-DIMENSIONAL SPARSE BINARY REGRESSION.
| S-EPMC4522432 | biostudies-literature

Testing Mediation Effects in High-Dimensional Epigenetic Studies.
| S-EPMC6883258 | biostudies-literature

ASYMPTOTICALLY INDEPENDENT U-STATISTICS IN HIGH-DIMENSIONAL TESTING.
| S-EPMC8634550 | biostudies-literature

Estimation and Inference for High Dimensional Generalized Linear Models: A Splitting and Smoothing Approach.
| S-EPMC8442657 | biostudies-literature

Sample size requirements for training high-dimensional risk predictors.
| S-EPMC3770001 | biostudies-literature

LINEAR HYPOTHESIS TESTING FOR HIGH DIMENSIONAL GENERALIZED LINEAR MODELS.
| S-EPMC6750760 | biostudies-literature