Dataset Information

Accuracy in the prediction of disease epidemics when ensembling simple but highly correlated models.

ABSTRACT: Ensembling combines the predictions made by individual component base models with the goal of achieving a predictive accuracy that is better than that of any one of the constituent member models. Diversity among the base models in terms of predictions is a crucial criterion in ensembling. However, there are practical instances when the available base models produce highly correlated predictions, because they may have been developed within the same research group or may have been built from the same underlying algorithm. We investigated, via a case study on Fusarium head blight (FHB) on wheat in the U.S., whether ensembles of simple yet highly correlated models for predicting the risk of FHB epidemics, all generated from logistic regression, provided any benefit to predictive performance, despite relatively low levels of base model diversity. Three ensembling methods were explored: soft voting, weighted averaging of smaller subsets of the base models, and penalized regression as a stacking algorithm. Soft voting and weighted model averages were generally better at classification than the base models, though not universally so. The performances of stacked regressions were superior to those of the other two ensembling methods we analyzed in this study. Ensembling simple yet correlated models is computationally feasible and is therefore worth pursuing for models of epidemic risk.
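The three ensembling methods named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the base-model probabilities, labels, weights, penalty value, and all function names below are hypothetical, and a plain L2 (ridge) penalty fitted by gradient descent stands in for whichever penalized regression the study used for stacking.

```python
import math

def soft_vote(probs):
    """Soft voting: unweighted mean of the base-model probabilities."""
    return sum(probs) / len(probs)

def weighted_average(probs, weights):
    """Weighted model average; weights are normalized to sum to 1."""
    total = sum(weights)
    return sum(p * w for p, w in zip(probs, weights)) / total

def fit_stacker(P, y, penalty=0.1, lr=0.5, steps=2000):
    """Stacking: L2-penalized logistic regression on base-model outputs.

    P is a list of rows (one probability per base model), y holds 0/1 labels.
    The intercept is left unpenalized. Returns (intercept, coefficients).
    """
    n, k = len(P), len(P[0])
    b0, b = 0.0, [0.0] * k
    for _ in range(steps):
        g0, g = 0.0, [0.0] * k
        for row, label in zip(P, y):
            z = b0 + sum(bj * pj for bj, pj in zip(b, row))
            err = 1.0 / (1.0 + math.exp(-z)) - label
            g0 += err
            for j in range(k):
                g[j] += err * row[j]
        b0 -= lr * g0 / n
        b = [bj - lr * (gj / n + penalty * bj) for bj, gj in zip(b, g)]
    return b0, b

def stack_predict(model, probs):
    """Probability of an epidemic from the fitted stacker."""
    b0, b = model
    z = b0 + sum(bj * pj for bj, pj in zip(b, probs))
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical predictions from three highly correlated base models
# (one row per observation) and hypothetical epidemic labels.
P = [
    [0.80, 0.75, 0.85],
    [0.70, 0.72, 0.65],
    [0.30, 0.35, 0.25],
    [0.20, 0.22, 0.20],
    [0.60, 0.55, 0.70],
    [0.40, 0.45, 0.35],
]
y = [1, 1, 0, 0, 1, 0]

stacker = fit_stacker(P, y)
for row in P:
    print(
        round(soft_vote(row), 3),
        round(weighted_average(row, [0.5, 0.3, 0.2]), 3),
        round(stack_predict(stacker, row), 3),
    )
```

In practice a stacker would usually be fitted on out-of-fold predictions from the base models rather than on the data used to train them; the sketch skips that step for brevity.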

SUBMITTER: Shah DA

PROVIDER: S-EPMC7993824 | biostudies-literature

REPOSITORIES: biostudies-literature

Similar Datasets

Quantitative Toxicity Prediction via Meta Ensembling of Multitask Deep Learning Models.

Project description:Toxicity prediction using quantitative structure-activity relationship has achieved significant progress in recent years. However, most existing machine learning methods in toxicity prediction utilize only one type of feature representation and one type of neural network, which essentially restricts their performance. Moreover, methods that use more than one type of feature representation struggle with the aggregation of information captured within the features since they use predetermined aggregation formulas. In this paper, we propose a deep learning framework for quantitative toxicity prediction using five individual base deep learning models and their own base feature representations. We then propose to adopt a meta ensemble approach using another separate deep learning model to perform aggregation of the outputs of the individual base deep learning models. We train our deep learning models in a weighted multitask fashion combining four quantitative toxicity data sets of LD50, IGC50, LC50, and LC50-DM and minimizing the root-mean-square errors. Compared to the current state-of-the-art toxicity prediction method TopTox on LD50, IGC50, and LC50-DM, that is, three out of four data sets, our method, respectively, obtains 5.46, 16.67, and 6.34% better root-mean-square errors, 6.41, 11.80, and 12.16% better mean absolute errors, and 5.21, 7.36, and 2.54% better coefficients of determination. We named our method QuantitativeTox, and our implementation is available from the GitHub repository https://github.com/Abdulk084/QuantitativeTox.

| S-EPMC8154128 | biostudies-literature

Beware of R²: Simple, Unambiguous Assessment of the Prediction Accuracy of QSAR and QSPR Models.

Project description:The statistical metrics used to characterize the external predictivity of a model, i.e., how well it predicts the properties of an independent test set, have proliferated over the past decade. This paper clarifies some apparent confusion over the use of the coefficient of determination, R², as a measure of model fit and predictive power in QSAR and QSPR modeling. R² (or r²) has been used in various contexts in the literature in conjunction with training and test data for both ordinary linear regression and regression through the origin as well as with linear and nonlinear regression models. We analyze the widely adopted model fit criteria suggested by Golbraikh and Tropsha (J. Mol. Graphics Modell. 2002, 20, 269-276) in a strict statistical manner. Shortcomings in these criteria are identified, and a clearer and simpler alternative method to characterize model predictivity is provided. The intent is not to repeat the well-documented arguments for model validation using test data but rather to guide the application of R² as a model fit statistic. Examples are used to illustrate both correct and incorrect uses of R². Reporting the root-mean-square error or equivalent measures of dispersion, which are typically of more practical importance than R², is also encouraged, and important challenges in addressing the needs of different categories of users such as computational chemists, experimental scientists, and regulatory decision support specialists are outlined.

| S-EPMC4530125 | biostudies-literature
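The difference between the coefficient of determination, a squared correlation, and the root-mean-square error can be shown with a small sketch. It assumes the conventional definition R² = 1 - SS_res/SS_tot computed on test-set observations; the abstract above does not state which formula the paper recommends, and the data and function names are hypothetical.

```python
import math

def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot

def squared_pearson(observed, predicted):
    """Squared Pearson correlation between observed and predicted values."""
    mo = sum(observed) / len(observed)
    mp = sum(predicted) / len(predicted)
    cov = sum((o - mo) * (p - mp) for o, p in zip(observed, predicted))
    var_o = sum((o - mo) ** 2 for o in observed)
    var_p = sum((p - mp) ** 2 for p in predicted)
    return cov ** 2 / (var_o * var_p)

def rmse(observed, predicted):
    """Root-mean-square error, in the units of the modeled property."""
    n = len(observed)
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(observed, predicted)) / n)

# Hypothetical test set: every prediction is biased upward by 0.5.
observed = [1.0, 2.0, 3.0, 4.0]
predicted = [1.5, 2.5, 3.5, 4.5]

print(r_squared(observed, predicted))        # penalizes the constant bias
print(squared_pearson(observed, predicted))  # ignores the constant bias
print(rmse(observed, predicted))             # reports the bias in original units
```

A constant offset leaves the squared correlation at its maximum while lowering R² and showing up directly in the RMSE, which is one reason the two "R squared" quantities should not be used interchangeably.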

A simple consensus approach improves somatic mutation prediction accuracy.

Project description:Differentiating true somatic mutations from artifacts in massively parallel sequencing data is an immense challenge. To develop methods for optimal somatic mutation detection and to identify factors influencing somatic mutation prediction accuracy, we validated predictions from three somatic mutation detection algorithms, MuTect, JointSNVMix2 and SomaticSniper, by Sanger sequencing. Full consensus predictions had a validation rate of >98%, but some partial consensus predictions validated too. In cases of partial consensus, read depth and mapping quality data, along with additional prediction methods, aided in removing inaccurate predictions. Our consensus approach is fast, flexible and provides a high-confidence list of putative somatic mutations.

| S-EPMC3978449 | biostudies-literature

Ensembling Low Precision Models for Binary Biomedical Image Segmentation.

Project description:Segmentation of anatomical regions of interest such as vessels or small lesions in medical images is still a difficult problem that is often tackled with manual input by an expert. One of the major challenges for this task is that the appearance of foreground (positive) regions can be similar to background (negative) regions. As a result, many automatic segmentation algorithms tend to exhibit asymmetric errors, typically producing more false positives than false negatives. In this paper, we aim to leverage this asymmetry and train a diverse ensemble of models with very high recall, while sacrificing their precision. Our core idea is straightforward: the models in a diverse ensemble of low-precision, high-recall models are likely to make different false positive errors (classifying background as foreground in different parts of the image), but the true positives will tend to be consistent. Thus, in aggregate the false positive errors will cancel out, yielding high performance for the ensemble. Our strategy is general and can be applied with any segmentation model. In three different applications (carotid artery segmentation in a neck CT angiography, myocardium segmentation in a cardiovascular MRI and multiple sclerosis lesion segmentation in a brain MRI), we show how the proposed approach can significantly boost the performance of a baseline segmentation method.

| S-EPMC11228952 | biostudies-literature

Improving disaggregation models of malaria incidence by ensembling non-linear models of prevalence.

Project description:Maps of disease burden are a core tool needed for the control and elimination of malaria. Reliable routine surveillance data of malaria incidence, typically aggregated to administrative units, is becoming more widely available. Disaggregation regression is an important model framework for estimating high resolution risk maps from aggregated data. However, the aggregation of incidence over large, heterogeneous areas means that these data are underpowered for estimating complex, non-linear models. In contrast, prevalence point-surveys are directly linked to local environmental conditions but are not common in many areas of the world. Here, we train multiple non-linear, machine learning models on Plasmodium falciparum prevalence point-surveys. We then ensemble the predictions from these machine learning models with a disaggregation regression model that uses aggregated malaria incidences as response data. We find that using a disaggregation regression model to combine predictions from machine learning models improves model accuracy relative to a baseline model.

| S-EPMC9205339 | biostudies-literature

The predictive accuracy of coronary heart disease risk prediction models in rural Northwestern China.

Project description:Existing cardiovascular risk models may have limitations when applied to rural Chinese populations. This study validated and compared the Framingham Risk Score (FRS) and Prediction for Atherosclerotic Cardiovascular Disease Risk in China (PAR) models in predicting 10-year risk of coronary heart disease (CHD) in a rural cohort in Ningxia, China from 2008 to 2019. The FRS and PAR models were validated by estimating predicted events, C index, calibration χ² and plots. A total of 1381 adults without CHD at baseline were followed up for 9.75 years on average, and 168 CHD cases were observed. The FRS and PAR underestimated CHD events by 22% and 46% for the total population but overestimated them for males by 152% and 78%, respectively. The C index was slightly higher for PAR than FRS. Both models showed weak calibration with chi-square values above 20 (p < 0.001). Bland-Altman plots indicated that FRS predicted higher CHD risk than PAR and that the two models lacked consistency. Overall, FRS and PAR demonstrated limited performance in predicting 10-year CHD risk in this rural population. PAR had slightly better discrimination than FRS, but requires further improvement in calibration and individual risk estimation to suit the rural population in Northwest China.

| S-EPMC10728432 | biostudies-literature

Prediction of infectious disease epidemics via weighted density ensembles.

Project description:Accurate and reliable predictions of infectious disease dynamics can be valuable to public health organizations that plan interventions to decrease or prevent disease transmission. A great variety of models have been developed for this task, using different model structures, covariates, and targets for prediction. Experience has shown that the performance of these models varies; some tend to do better or worse in different seasons or at different points within a season. Ensemble methods combine multiple models to obtain a single prediction that leverages the strengths of each model. We considered a range of ensemble methods that each form a predictive density for a target of interest as a weighted sum of the predictive densities from component models. In the simplest case, equal weight is assigned to each component model; in the most complex case, the weights vary with the region, prediction target, week of the season when the predictions are made, a measure of component model uncertainty, and recent observations of disease incidence. We applied these methods to predict measures of influenza season timing and severity in the United States, both at the national and regional levels, using three component models. We trained the models on retrospective predictions from 14 seasons (1997/1998-2010/2011) and evaluated each model's prospective, out-of-sample performance in the five subsequent influenza seasons. In this test phase, the ensemble methods showed average performance that was similar to the best of the component models, but offered more consistent performance across seasons than the component models. Ensemble methods offer the potential to deliver more reliable predictions to public health decision makers.

| S-EPMC5834190 | biostudies-literature
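The weighted-sum construction described in the abstract above can be sketched as follows. This is an illustration under stated assumptions, not the paper's code: the predictive densities are represented as probabilities over a few discrete bins, the component distributions and weights are hypothetical, and the weights are fixed here rather than varying with region, target, or week as in the more complex cases the abstract mentions.

```python
def ensemble_density(component_densities, weights):
    """Combine binned predictive densities as a weighted sum.

    component_densities: one list of bin probabilities per component model.
    weights: non-negative weights that sum to 1, one per component model.
    """
    if len(component_densities) != len(weights):
        raise ValueError("need exactly one weight per component model")
    if any(w < 0 for w in weights) or abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must be non-negative and sum to 1")
    n_bins = len(component_densities[0])
    return [
        sum(w * dens[b] for w, dens in zip(weights, component_densities))
        for b in range(n_bins)
    ]

# Hypothetical predictive distributions from three component models
# over four bins of a prediction target (for example, peak-week ranges).
components = [
    [0.10, 0.20, 0.40, 0.30],
    [0.25, 0.25, 0.25, 0.25],
    [0.00, 0.10, 0.60, 0.30],
]

# Simplest case: equal weight for each component model.
equal = ensemble_density(components, [1 / 3, 1 / 3, 1 / 3])

# Unequal fixed weights, as might be estimated from past performance.
weighted = ensemble_density(components, [0.5, 0.3, 0.2])

print([round(p, 4) for p in equal])
print([round(p, 4) for p in weighted])
```

Because the weights are non-negative and sum to 1, the combined bin probabilities also sum to 1, so the ensemble output is itself a valid predictive distribution.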

Complex versus simple models: ion-channel cardiac toxicity prediction.

Project description:There is growing interest in applying detailed mathematical models of the heart for ion-channel related cardiac toxicity prediction. However, there is debate as to whether such complex models are required. Here, the predictive performance of two established large-scale biophysical cardiac models was compared with that of a simple linear model, Bnet. Three ion-channel data-sets were extracted from the literature. Each compound was designated a cardiac risk category using two different classification schemes based on information within CredibleMeds. The predictive performance of each model within each data-set for each classification scheme was assessed via a leave-one-out cross validation. Overall, the Bnet model performed as well as the leading cardiac models on two of the data-sets and outperformed both cardiac models on the latest. These results highlight the importance of benchmarking complex versus simple models but also encourage the development of simple models.

| S-EPMC5804316 | biostudies-literature

Accuracy of structure-derived properties in simple comparative models of protein structures.

Project description:The accuracy of comparative models of proteins is addressed here. A set of 12,732 single-template models of sequences of known high-resolution structures was built by an automated procedure. Accuracy of several structure-derived properties, such as surface area, residue accessibility, presence of pockets, electrostatic potential and others, was determined as a function of template:target sequence identity by comparing models with their corresponding experimental structures. As expected, the average accuracy of structure-derived properties always increases with higher template:target sequence identity, but the exact shape of this relationship can differ from one property to another. A comparison of structure-derived properties measured from NMR and X-ray structures of the same protein shows that for most properties, the NMR/X-ray difference is of the same order as the error in models based on approximately 40% template:target sequence identity. The exact sequence identity at which properties reach that accuracy varies between 25 and 50%, depending on the property being analyzed. A general characteristic of simple comparative models is that their surface has increased area as a consequence of being more rugged than that of experimental structures. This suggests that including solvent effects during model building or refinement could significantly improve the accuracy of surface properties in comparative models.

| S-EPMC546150 | biostudies-literature

The whole is greater than its parts: ensembling improves protein contact prediction.

Project description:The prediction of amino acid contacts from protein sequence is an important problem, as protein contacts are a vital step towards the prediction of folded protein structures. We propose that a powerful concept from deep learning, called ensembling, can increase the accuracy of protein contact predictions by combining the outputs of different neural network models. We show that ensembling the predictions made by different groups at the recent Critical Assessment of Protein Structure Prediction (CASP13) outperforms all individual groups. Further, we show that contacts derived from the distance predictions of three additional deep neural networks (AlphaFold, trRosetta, and ProSPr) can be substantially improved by ensembling all three networks. We also show that ensembling these recent deep neural networks with the best CASP13 group creates a superior contact prediction tool. Finally, we demonstrate that two ensembled networks can successfully differentiate between the folds of two highly homologous sequences. In order to build further on these findings, we propose the creation of a better protein contact benchmark set and additional open-source contact prediction methods.

| S-EPMC8044223 | biostudies-literature
