Dataset Information

Performance of Regression Models as a Function of Experiment Noise.

ABSTRACT:

Background

A challenge in developing machine learning regression models is that it is difficult to know whether maximal performance has been reached on the test dataset, or whether further model improvement is possible. In biology, this problem is particularly pronounced because sample labels (response variables) are typically obtained through experiments and therefore carry experimental noise. Such label noise places a fundamental limit on the performance metrics attainable by regression models on the test dataset.

Results

We address this challenge by deriving an expected upper bound for the coefficient of determination (R²) for regression models when tested on the holdout dataset. This upper bound depends only on the noise associated with the response variable in a dataset as well as its variance. The upper bound estimate was validated via Monte Carlo simulations and then used as a tool to bootstrap performance of regression models trained on biological datasets, including protein sequence data, transcriptomic data, and genomic data.
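To illustrate the idea of a noise-limited R², the following is a minimal sketch and not the authors' code. It assumes additive, zero-mean label noise that is independent of the true signal, and it uses the textbook relation that a perfect predictor scored against noisy labels has expected R² of roughly 1 - (noise variance / observed label variance). The exact expression derived in the paper may differ; the function names `r2_upper_bound` and `simulate_perfect_model_r2` and all parameter values are hypothetical.

```python
import numpy as np


def r2_score(y_obs, y_pred):
    """Coefficient of determination of predictions against observed labels."""
    ss_res = np.sum((y_obs - y_pred) ** 2)
    ss_tot = np.sum((y_obs - np.mean(y_obs)) ** 2)
    return 1.0 - ss_res / ss_tot


def r2_upper_bound(noise_var, observed_var):
    """Assumed bound: 1 - Var(noise) / Var(observed labels)."""
    return 1.0 - noise_var / observed_var


def simulate_perfect_model_r2(n_samples=2000, n_trials=200,
                              signal_sd=1.0, noise_sd=0.5, seed=0):
    """Monte Carlo estimate of the mean R² a perfect model achieves on noisy labels."""
    rng = np.random.default_rng(seed)
    scores = []
    for _ in range(n_trials):
        y_true = rng.normal(0.0, signal_sd, n_samples)          # noise-free response
        y_obs = y_true + rng.normal(0.0, noise_sd, n_samples)   # measured (noisy) labels
        # The "model" predicts the noise-free value exactly, so the only
        # remaining error is the experimental noise in the labels.
        scores.append(r2_score(y_obs, y_true))
    return float(np.mean(scores))


if __name__ == "__main__":
    noise_var = 0.5 ** 2
    observed_var = 1.0 ** 2 + noise_var  # independent signal and noise variances add
    print("analytical bound:", r2_upper_bound(noise_var, observed_var))
    print("simulated mean R²:", simulate_perfect_model_r2())
```

Under these assumptions the simulated mean should land close to the analytical value, which is the kind of agreement a Monte Carlo validation checks for. In practice the noise variance would have to be estimated, for example from replicate measurements.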

Conclusions

The new method for estimating upper bounds for model performance on test data should aid researchers in developing ML regression models that reach their maximum potential. Although we study biological datasets in this work, the new upper bound estimates will hold true for regression models from any research field or application area where response variables have associated noise.

SUBMITTER: Li G

PROVIDER: S-EPMC8243133 | biostudies-literature | 2021

REPOSITORIES: biostudies-literature


Publications

Performance of Regression Models as a Function of Experiment Noise.

Li G, Zrimec J, Ji B, Geng J, Larsbrink J, Zelezniak A, Nielsen J, Engqvist MKM

Bioinformatics and Biology Insights, 2021-06-27


PMID: 34262264

Similar Datasets

Project description:

Objective

After deploying a clinical prediction model, subsequently collected data can be used to fine-tune its predictions and adapt to temporal shifts. Because model updating carries risks of over-updating/fitting, we study online methods with performance guarantees.

Materials and methods

We introduce 2 procedures for continual recalibration or revision of an underlying prediction model: Bayesian logistic regression (BLR) and a Markov variant that explicitly models distribution shifts (MarBLR). We perform empirical evaluation via simulations and a real-world study predicting Chronic Obstructive Pulmonary Disease (COPD) risk. We derive "Type I and II" regret bounds, which guarantee the procedures are noninferior to a static model and competitive with an oracle logistic reviser in terms of the average loss.

Results

Both procedures consistently outperformed the static model and other online logistic revision methods. In simulations, the average estimated calibration index (aECI) of the original model was 0.828 (95% CI, 0.818-0.938). Online recalibration using BLR and MarBLR improved the aECI towards the ideal value of zero, attaining 0.265 (95% CI, 0.230-0.300) and 0.241 (95% CI, 0.216-0.266), respectively. When performing more extensive logistic model revisions, BLR and MarBLR increased the average area under the receiver-operating characteristic curve (aAUC) from 0.767 (95% CI, 0.765-0.769) to 0.800 (95% CI, 0.798-0.802) and 0.799 (95% CI, 0.797-0.801), respectively, in stationary settings and protected against substantial model decay. In the COPD study, BLR and MarBLR dynamically combined the original model with a continually refitted gradient boosted tree to achieve aAUCs of 0.924 (95% CI, 0.913-0.935) and 0.925 (95% CI, 0.914-0.935), compared to the static model's aAUC of 0.904 (95% CI, 0.892-0.916).

Discussion

Despite its simplicity, BLR is highly competitive with MarBLR. MarBLR outperforms BLR when its prior better reflects the data.

Conclusions

BLR and MarBLR can improve the transportability of clinical prediction models and maintain their performance over time.
