Dataset Information

Evaluating parameters for ligand-based modeling with random forest on sparse data sets.

ABSTRACT: Ligand-based predictive modeling is widely used to generate predictive models aiding decision making in e.g. drug discovery projects. With growing data sets and requirements on low modeling time comes the necessity to analyze data sets efficiently to support rapid and robust modeling. In this study we analyzed four data sets and studied the efficiency of machine learning methods on sparse data structures, utilizing Morgan fingerprints of different radii and hash sizes, and compared with molecular signatures descriptor of different height. We specifically evaluated the effect these parameters had on modeling time, predictive performance, and memory requirements using two implementations of random forest; Scikit-learn as well as FEST. We also compared with a support vector machine implementation. Our results showed that unhashed fingerprints yield significantly better accuracy than hashed fingerprints ([Formula: see text]), with no pronounced deterioration in modeling time and memory usage. Furthermore, the fast execution and low memory usage of the FEST algorithm suggest that it is a good alternative for large, high dimensional sparse data. Both support vector machines and random forest performed equally well but results indicate that the support vector machine was better at using the extra information from larger values of the Morgan fingerprint's radius.

SUBMITTER: Kensert A

PROVIDER: S-EPMC6755600 | biostudies-literature | 2018 Oct

REPOSITORIES: biostudies-literature

ACCESS DATA

Publications

Evaluating parameters for ligand-based modeling with random forest on sparse data sets.

Kensert Alexander A Alvarsson Jonathan J Norinder Ulf U Spjuth Ola O

Journal of cheminformatics 20181011 1

Ligand-based predictive modeling is widely used to generate predictive models aiding decision making in e.g. drug discovery projects. With growing data sets and requirements on low modeling time comes the necessity to analyze data sets efficiently to support rapid and robust modeling. In this study we analyzed four data sets and studied the efficiency of machine learning methods on sparse data structures, utilizing Morgan fingerprints of different radii and hash sizes, and compared with molecula ...[more]

PMID: 30306349

Similar Datasets

Project description:To perform parametric identification of mathematical models of biological events, experimental data are rare to be sufficient to estimate target behaviors produced by complex non-linear systems. We performed parameter fitting to a cell cycle model with experimental data as an in silico experiment. We calibrated model parameters with the generalized least squares method with randomized initial values and checked local and global sensitivity of the model. Sensitivity analyses showed that parameter optimization induced less sensitivity except for those related to the metabolism of the transcription factors c-Myc and E2F, which are required to overcome a restriction point (R-point). We performed bifurcation analyses with the optimized parameters and found the bimodality was lost. This result suggests that accumulation of c-Myc and E2F induced dysfunction of R-point. We performed a second parameter optimization based on the results of sensitivity analyses and incorporating additional derived from recent in vivo data. This optimization returned the bimodal characteristics of the model with a narrower range of hysteresis than the original. This result suggests that the optimized model can more easily go through R-point and come back to the gap phase after once having overcome it. Two parameter space analyses showed metabolism of c-Myc is transformed as it can allow cell bimodal behavior with weak stimuli of growth factors. This result is compatible with the character of the cell line used in our experiments. At the same time, Rb, an inhibitor of E2F, can allow cell bimodal behavior with only a limited range of stimuli when it is activated, but with a wider range of stimuli when it is inactive. These results provide two insights; biologically, the two transcription factors play an essential role in malignant cells to overcome R-point with weaker growth factor stimuli, and theoretically, sparse time-course data can be used to change a model to a biologically expected state.

Dataset Information

Evaluating parameters for ligand-based modeling with random forest on sparse data sets.

Publications

Evaluating parameters for ligand-based modeling with random forest on sparse data sets.

Similar Datasets

OmicsDI is part of the ELIXIR infrastructure

Tweets