Dataset Information

Predicting protein function and other biomedical characteristics with heterogeneous ensembles.


ABSTRACT: Prediction problems in biomedical sciences, including protein function prediction (PFP), are generally quite difficult. This is due in part to incomplete knowledge of the cellular phenomenon of interest, the appropriateness and data quality of the variables and measurements used for prediction, as well as a lack of consensus regarding the ideal predictor for specific problems. In such scenarios, a powerful approach to improving prediction performance is to construct heterogeneous ensemble predictors that combine the output of diverse individual predictors that capture complementary aspects of the problems and/or datasets. In this paper, we demonstrate the potential of such heterogeneous ensembles, derived from stacking and ensemble selection methods, for addressing PFP and other similar biomedical prediction problems. Deeper analysis of these results shows that the superior predictive ability of these methods, especially stacking, can be attributed to their attention to the following aspects of the ensemble learning process: (i) better balance of diversity and performance, (ii) more effective calibration of outputs and (iii) more robust incorporation of additional base predictors. Finally, to make the effective application of heterogeneous ensembles to large complex datasets (big data) feasible, we present DataSink, a distributed ensemble learning framework, and demonstrate its sound scalability using the examined datasets. DataSink is publicly available from https://github.com/shwhalen/datasink.
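The ensemble selection idea mentioned in the abstract can be illustrated with a small sketch: starting from an empty ensemble, repeatedly add (with replacement) whichever base predictor most improves the validation score of the averaged output, and stop when nothing helps. This is only a simplified illustration and not the DataSink implementation. The function names (`accuracy`, `ensemble_selection`, `ensemble_predict`), the predictor names, the toy scores, the use of accuracy as the metric, and the 0.5 decision threshold are all assumptions made for the example.

```python
from typing import Dict, List, Sequence


def accuracy(probs: Sequence[float], labels: Sequence[int]) -> float:
    """Fraction of examples where thresholding at 0.5 matches the label."""
    hits = sum(int(p >= 0.5) == y for p, y in zip(probs, labels))
    return hits / len(labels)


def ensemble_selection(
    base_preds: Dict[str, List[float]],
    labels: List[int],
    max_rounds: int = 10,
) -> List[str]:
    """Greedy forward selection (with replacement) of base predictors.

    Each round adds the predictor whose inclusion gives the best accuracy
    of the averaged predictions; stops when no candidate strictly improves.
    """
    selected: List[str] = []
    running_sum = [0.0] * len(labels)  # sum of predictions of selected members
    best_score = float("-inf")
    for _ in range(max_rounds):
        round_name, round_score = None, best_score
        for name in sorted(base_preds):  # sorted for deterministic tie-breaking
            k = len(selected) + 1
            avg = [(s + p) / k for s, p in zip(running_sum, base_preds[name])]
            score = accuracy(avg, labels)
            if score > round_score:
                round_name, round_score = name, score
        if round_name is None:  # no candidate improved the ensemble
            break
        selected.append(round_name)
        running_sum = [s + p for s, p in zip(running_sum, base_preds[round_name])]
        best_score = round_score
    return selected


def ensemble_predict(selected: List[str], base_preds: Dict[str, List[float]]) -> List[float]:
    """Average the predictions of the selected members (repeats count twice)."""
    n = len(next(iter(base_preds.values())))
    return [sum(base_preds[m][i] for m in selected) / len(selected) for i in range(n)]


# Hypothetical validation-set outputs from three heterogeneous base predictors.
labels = [1, 0, 1, 0, 1, 0]
base_preds = {
    "tree": [0.9, 0.2, 0.4, 0.1, 0.8, 0.6],
    "svm": [0.6, 0.4, 0.9, 0.45, 0.3, 0.1],
    "knn": [0.3, 0.7, 0.6, 0.2, 0.9, 0.3],
}
selected = ensemble_selection(base_preds, labels)
ensemble_acc = accuracy(ensemble_predict(selected, base_preds), labels)
print(selected, ensemble_acc)
```

On this toy data no single predictor classifies every example correctly, while the averaged pair chosen by the greedy search does, which is the kind of complementarity between diverse predictors that the abstract describes. In practice the selection would be run on held-out predictions to limit overfitting.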

SUBMITTER: Whalen S 

PROVIDER: S-EPMC4718788 | biostudies-literature | 2016 Jan

REPOSITORIES: biostudies-literature

Publications

Predicting protein function and other biomedical characteristics with heterogeneous ensembles.

Sean Whalen, Om Prakash Pandey, Gaurav Pandey

Methods (San Diego, Calif.), 2015-09-02


Similar Datasets

S-EPMC6221071 | biostudies-literature
S-EPMC5500862 | biostudies-literature
S-EPMC3794900 | biostudies-literature
S-EPMC1562382 | biostudies-literature
S-EPMC6368809 | biostudies-other
S-EPMC3474852 | biostudies-literature
S-EPMC6605767 | biostudies-literature
S-EPMC7296347 | biostudies-literature
S-EPMC10350086 | biostudies-literature
S-EPMC5489166 | biostudies-literature