Dataset Information

Improved small-sample estimation of nonlinear cross-validated prediction metrics.

ABSTRACT: When predicting an outcome is the scientific goal, one must decide on a metric by which to evaluate the quality of predictions. We consider the problem of measuring the performance of a prediction algorithm with the same data that were used to train the algorithm. Typical approaches involve bootstrapping or cross-validation. However, we demonstrate that bootstrap-based approaches often fail and standard cross-validation estimators may perform poorly. We provide a general study of cross-validation-based estimators that highlights the source of this poor performance, and propose an alternative framework for estimation using techniques from the efficiency theory literature. We provide a theorem establishing the weak convergence of our estimators. The general theorem is applied in detail to two specific examples and we discuss possible extensions to other parameters of interest. For the two explicit examples that we consider, our estimators demonstrate remarkable finite-sample improvements over standard approaches.

SUBMITTER: Benkeser D

PROVIDER: S-EPMC7954141 | biostudies-literature | 2020

REPOSITORIES: biostudies-literature

ACCESS DATA

Publications

Improved small-sample estimation of nonlinear cross-validated prediction metrics.

Benkeser David D Petersen Maya M van der Laan Mark J MJ

Journal of the American Statistical Association 20191021 532

When predicting an outcome is the scientific goal, one must decide on a metric by which to evaluate the quality of predictions. We consider the problem of measuring the performance of a prediction algorithm with the same data that were used to train the algorithm. Typical approaches involve bootstrapping or cross-validation. However, we demonstrate that bootstrap-based approaches often fail and standard cross-validation estimators may perform poorly. We provide a general study of cross-validatio ...[more]

PMID: 33716360

Similar Datasets

Project description:Software is a complex entity, and its development needs careful planning and a high amount of time and cost. To assess quality of program, software measures are very helpful. Amongst the existing measures, coupling is an important design measure, which computes the degree of interdependence among the entities of a software system. Higher coupling leads to cognitive complexity and thus a higher probability occurrence of faults. Well in time prediction of fault-prone modules assists in saving time and cost of testing. This paper aims to capture important aspects of coupling and then assess the effectiveness of these aspects in determining fault-prone entities in the software system. We propose two coupling metrics, i.e., Vovel-in and Vovel-out, that capture the level of coupling and the volume of information flow. We empirically evaluate the effectiveness of the Vovel metrics in determining the fault-prone classes using five projects, i.e., Eclipse JDT, Equinox framework, Apache Lucene, Mylyn, and Eclipse PDE UI. Model building is done using univariate logistic regression and later Spearman correlation coefficient is computed with the existing coupling metrics to assess the coverage of unique information. Finally, the least correlated metrics are used for building multivariate logistic regression with and without the use of Vovel metrics, to assess the effectiveness of Vovel metrics. The results show the proposed metrics significantly improve the predicting of fault prone classes. Moreover, the proposed metrics cover a significant amount of unique information which is not covered by the existing well-known coupling metrics, i.e., CBO, RFC, Fan-in, and Fan-out. This paper, empirically evaluates the impact of coupling metrics, and more specifically the importance of level and volume of coupling in software fault prediction. The results advocate the prudent addition of proposed metrics due to their unique information coverage and significant predictive ability.

Project description:BackgroundSupervised learning for classification of cancer employs a set of design examples to learn how to discriminate between tumors. In practice it is crucial to confirm that the classifier is robust with good generalization performance to new examples, or at least that it performs better than random guessing. A suggested alternative is to obtain a confidence interval of the error rate using repeated design and test sets selected from available examples. However, it is known that even in the ideal situation of repeated designs and tests with completely novel samples in each cycle, a small test set size leads to a large bias in the estimate of the true variance between design sets. Therefore different methods for small sample performance estimation such as a recently proposed procedure called Repeated Random Sampling (RSS) is also expected to result in heavily biased estimates, which in turn translates into biased confidence intervals. Here we explore such biases and develop a refined algorithm called Repeated Independent Design and Test (RIDT).ResultsOur simulations reveal that repeated designs and tests based on resampling in a fixed bag of samples yield a biased variance estimate. We also demonstrate that it is possible to obtain an improved variance estimate by means of a procedure that explicitly models how this bias depends on the number of samples used for testing. For the special case of repeated designs and tests using new samples for each design and test, we present an exact analytical expression for how the expected value of the bias decreases with the size of the test set.ConclusionWe show that via modeling and subsequent reduction of the small sample bias, it is possible to obtain an improved estimate of the variance of classifier performance between design sets. However, the uncertainty of the variance estimate is large in the simulations performed indicating that the method in its present form cannot be directly applied to small data sets.

Project description:BackgroundCytokines act by binding to specific receptors in the plasma membrane of target cells. Knowledge of cytokine-receptor interaction (CRI) is very important for understanding the pathogenesis of various human diseases-notably autoimmune, inflammatory and infectious diseases-and identifying potential therapeutic targets. Recently, machine learning algorithms have been used to predict CRIs. "Gold Standard" negative datasets are still lacking and strong biases in negative datasets can significantly affect the training of learning algorithms and their evaluation. To mitigate the unrepresentativeness and bias inherent in the negative sample selection (non-interacting proteins), we propose a clustering-based approach for representative negative sample selection.ResultsWe used deep autoencoders to investigate the effect of different sampling approaches for non-interacting pairs on the training and the performance of machine learning classifiers. By using the anomaly detection capabilities of deep autoencoders we deduced the effects of different categories of negative samples on the training of learning algorithms. Random sampling for selecting non-interacting pairs results in either over- or under-representation of hard or easy to classify instances. When K-means based sampling of negative datasets is applied to mitigate the inadequacies of random sampling, random forest (RF) together with the combined feature set of atomic composition, physicochemical-2grams and two different representations of evolutionary information performs best. Average model performances based on leave-one-out cross validation (loocv) over ten different negative sample sets that each model was trained with, show that RF models significantly outperform the previous best CRI predictor in terms of accuracy (+ 5.1%), specificity (+ 13%), mcc (+ 0.1) and g-means value (+ 5.1). Evaluations using tenfold cv and training/testing splits confirm the competitive performance.ConclusionsA comparative analysis was performed to assess the effect of three different sampling methods (random, K-means and uniform sampling) on the training of learning algorithms using different evaluation methods. Models trained on K-means sampled datasets generally show a significantly improved performance compared to those trained on random selections-with RF seemingly benefiting most in our particular setting. Our findings on the sampling are highly relevant and apply to many applications of supervised learning approaches in bioinformatics.

Dataset Information

Improved small-sample estimation of nonlinear cross-validated prediction metrics.

Publications

Improved small-sample estimation of nonlinear cross-validated prediction metrics.

Similar Datasets

OmicsDI is part of the ELIXIR infrastructure

Tweets