Browse
Submit Data
Databases
API
Help

Dataset Information

23 Views

0 Connections

0 Citations

0 Reanalyses

0 Downloads

Omics score: 0

Comparison of Descriptor- and Fingerprint Sets in Machine Learning Models for ADME-Tox Targets.

ABSTRACT: The screening of compounds for ADME-Tox targets plays an important role in drug design. QSPR models can increase the speed of these specific tasks, although the performance of the models highly depends on several factors, such as the applied molecular descriptors. In this study, a detailed comparison of the most popular descriptor groups has been carried out for six main ADME-Tox classification targets: Ames mutagenicity, P-glycoprotein inhibition, hERG inhibition, hepatotoxicity, blood-brain-barrier permeability, and cytochrome P450 2C9 inhibition. The literature-based, medium-sized binary classification datasets (all above 1,000 molecules) were used for the model building by two common algorithms, XGBoost and the RPropMLP neural network. Five molecular representation sets were compared along with their joint applications: Morgan, Atompairs, and MACCS fingerprints, and the traditional 1D and 2D molecular descriptors, as well as 3D molecular descriptors, separately. The statistical evaluation of the model performances was based on 18 different performance parameters. Although all the developed models were close to the usual performance of QSPR models for each specific ADME-Tox target, the results clearly showed the superiority of the traditional 1D, 2D, and 3D descriptors in the case of the XGBoost algorithm. It is worth trying the classical tools in single model building because the use of 2D descriptors can produce even better models for almost every dataset than the combination of all the examined descriptor sets.

SUBMITTER: Orosz A

PROVIDER: S-EPMC9214226 | biostudies-literature |

REPOSITORIES: biostudies-literature

ACCESS DATA

Json Xml

Similar Datasets

Prediction of kinase inhibitors binding modes with machine learning and reduced descriptor sets.

Project description:Protein kinases are receiving wide research interest, from drug perspective, due to their important roles in human body. Available kinase-inhibitor data, including crystallized structures, revealed many details about the mechanism of inhibition and binding modes. The understanding and analysis of these binding modes are expected to support the discovery of kinase-targeting drugs. The huge amounts of data made it possible to utilize computational techniques, including machine learning, to help in the discovery of kinase-targeting drugs. Machine learning gave reasonable predictions when applied to differentiate between the binding modes of kinase inhibitors, promoting a wider application in that domain. In this study, we applied machine learning supported by feature selection techniques to classify kinase inhibitors according to their binding modes. We represented inhibitors as a large number of molecular descriptors, as features, and systematically reduced these features in a multi-step manner while trying to attain high classification accuracy. Our predictive models could satisfy both goals by achieving high accuracy while utilizing at most 5% of the modeling features. The models could differentiate between binding mode types with MCC values between 0.67 and 0.92, and balanced accuracy values between 0.78 and 0.97 for independent test sets.

| S-EPMC7804204 | biostudies-literature

Prediction of organic compound aqueous solubility using machine learning: a comparison study of descriptor-based and fingerprints-based models.

Project description:A reliable and practical determination of a chemical species' solubility in water continues to be examined using empirical observations and exhaustive experimental studies alone. Predictions of chemical solubility in water using data-driven algorithms can allow us to create a rationally designed, efficient, and cost-effective tool for next-generation materials and chemical formulations. We present results from two machine learning (ML) modeling studies to adequately predict various species' solubility using data for over 8400 compounds. Molecular-descriptors, the most used method in previous studies, and Morgan fingerprint, a circular-based hash of the molecules' structures, were applied to produce water solubility estimates. We trained all models on 80% of the total datasets using the Random Forest (RFs) technique as the regressor and tested the prediction performance using the remaining 20%, resulting in coefficient of determination (R2) test values of 0.88 and 0.81 and root-mean-square deviation (RMSE) test values 0.64 and 0.80 for the descriptors and circular fingerprint methods, respectively. We interpreted the produced ML models and reported the most effective features for aqueous solubility measures using the Shapley Additive exPlanations (SHAP) and thermodynamic analysis. Low error, ability to investigate the molecular-level interactions, and compatibility with thermodynamic quantities made the fingerprint method a distinct model compared to other available computational tools. However, it is worth emphasizing that physicochemical descriptor model outperformed the fingerprint model in achieving better predictive accuracy for the given test set.

| S-EPMC10583449 | biostudies-literature

Applications of the microphysiology systems database for experimental ADME-Tox and disease models.

Project description:To accelerate the development and application of Microphysiological Systems (MPS) in biomedical research and drug discovery/development, a centralized resource is required to provide the detailed design, application, and performance data that enables industry and research scientists to select, optimize, and/or develop new MPS solutions, as well as to harness data from MPS models. We have previously implemented an open source Microphysiology Systems Database (MPS-Db), with a simple icon driven interface, as a resource for MPS researchers and drug discovery/development scientists (https://mps.csb.pitt.edu). The MPS-Db captures and aggregates data from MPS, ranging from static microplate models to integrated, multi-organ microfluidic models, and associates those data with reference data from chemical, biochemical, pre-clinical, clinical and post-marketing sources to support the design, development, validation, application and interpretation of the models. The MPS-Db enables users to manage their multifactor, multichip studies, then upload, analyze, review, computationally model and share data. Here we discuss how the sharing of MPS study data in the MS-Db is under user control and can be kept private to the individual user, shared with a select group of collaborators, or be made accessible to the general scientific community. We also present a test case using our liver acinus MPS model (LAMPS) as an example and discuss the use of the MPS-Db in managing, designing, and analyzing MPS study data, assessing the reproducibility of MPS models, and evaluating the concordance of MPS model results with clinical findings. We introduce the Disease Portal module with links to resources for the design of MPS disease models and studies and discuss the integration of computational models for the prediction of PK/PD and disease pathways using data generated from MPS models.

| S-EPMC7497411 | biostudies-literature

Open Source Bayesian Models. 1. Application to ADME/Tox and Drug Discovery Datasets.

Project description:On the order of hundreds of absorption, distribution, metabolism, excretion, and toxicity (ADME/Tox) models have been described in the literature in the past decade which are more often than not inaccessible to anyone but their authors. Public accessibility is also an issue with computational models for bioactivity, and the ability to share such models still remains a major challenge limiting drug discovery. We describe the creation of a reference implementation of a Bayesian model-building software module, which we have released as an open source component that is now included in the Chemistry Development Kit (CDK) project, as well as implemented in the CDD Vault and in several mobile apps. We use this implementation to build an array of Bayesian models for ADME/Tox, in vitro and in vivo bioactivity, and other physicochemical properties. We show that these models possess cross-validation receiver operator curve values comparable to those generated previously in prior publications using alternative tools. We have now described how the implementation of Bayesian models with FCFP6 descriptors generated in the CDD Vault enables the rapid production of robust machine learning models from public data or the user's own datasets. The current study sets the stage for generating models in proprietary software (such as CDD) and exporting these models in a format that could be run in open source software using CDK components. This work also demonstrates that we can enable biocomputation across distributed private or public datasets to enhance drug discovery.

| S-EPMC4478615 | biostudies-literature

Convolution Comparison Pattern: An Efficient Local Image Descriptor for Fingerprint Liveness Detection.

Project description:We present a new type of local image descriptor which yields binary patterns from small image patches. For the application to fingerprint liveness detection, we achieve rotation invariant image patches by taking the fingerprint segmentation and orientation field into account. We compute the discrete cosine transform (DCT) for these rotation invariant patches and attain binary patterns by comparing pairs of two DCT coefficients. These patterns are summarized into one or more histograms per image. Each histogram comprises the relative frequencies of pattern occurrences. Multiple histograms are concatenated and the resulting feature vector is used for image classification. We name this novel type of descriptor convolution comparison pattern (CCP). Experimental results show the usefulness of the proposed CCP descriptor for fingerprint liveness detection. CCP outperforms other local image descriptors such as LBP, LPQ and WLD on the LivDet 2013 benchmark. The CCP descriptor is a general type of local image descriptor which we expect to prove useful in areas beyond fingerprint liveness detection such as biological and medical image processing, texture recognition, face recognition and iris recognition, liveness detection for face and iris images, and machine vision for surface inspection and material classification.

| S-EPMC4742063 | biostudies-literature

Identifying Differences in the Performance of Machine Learning Models for Off-Targets Trained on Publicly Available and Proprietary Data Sets.

Project description:Each year, publicly available databases are updated with new compounds from different research institutions. Positive experimental outcomes are more likely to be reported; therefore, they account for a considerable fraction of these entries. Established publicly available databases such as ChEMBL allow researchers to use information without constrictions and create predictive tools for a broad spectrum of applications in the field of toxicology. Therefore, we investigated the distribution of positive and nonpositive entries within ChEMBL for a set of off-targets and its impact on the performance of classification models when applied to pharmaceutical industry data sets. Results indicate that models trained on publicly available data tend to overpredict positives, and models based on industry data sets predict negatives more often than those built using publicly available data sets. This is strengthened even further by the visualization of the prediction space for a set of 10,000 compounds, which makes it possible to identify regions in the chemical space where predictions converge. Finally, we highlight the utilization of these models for consensus modeling for potential adverse events prediction.

| S-EPMC10445286 | biostudies-literature

Descriptor-augmented machine learning for enzyme-chemical interaction predictions

Project description: Not available

| S-EPMC10915406 | biostudies-literature

Comparison of Machine Learning Models for the Androgen Receptor.

Project description:The androgen receptor (AR) is a target of interest for endocrine disruption research, as altered signaling can affect normal reproductive and neurological development for generations. In an effort to prioritize compounds with alternative methodologies, the U.S. Environmental Protection Agency (EPA) used in vitro data from 11 assays to construct models of AR agonist and antagonist signaling pathways. While these EPA ToxCast AR models require in vitro data to assign a bioactivity score, Bayesian machine learning methods can be used for prospective prediction from molecule structure alone. This approach was applied to multiple types of data corresponding to the EPA's AR signaling pathway with proprietary software, Assay Central. The training performance of all machine learning models, including six other algorithms, was evaluated by internal 5-fold cross-validation statistics. Bayesian machine learning models were also evaluated with external predictions of reference chemicals to compare prediction accuracies to published results from the EPA. The machine learning model group selected for further studies of endocrine disruption consisted of continuous AC50 data from the February 2019 release of ToxCast/Tox21. These efforts demonstrate how machine learning can be used to predict AR-mediated bioactivity and can also be applied to other targets of endocrine disruption.

| S-EPMC8243727 | biostudies-literature

Fatigue Evaluation through Machine Learning and a Global Fatigue Descriptor.

Project description:Research in physiology and sports science has shown that fatigue, a complex psychophysiological phenomenon, has a relevant impact in performance and in the correct functioning of our motricity system, potentially being a cause of damage to the human organism. Fatigue can be seen as a subjective or objective phenomenon. Subjective fatigue corresponds to a mental and cognitive event, while fatigue referred as objective is a physical phenomenon. Despite the fact that subjective fatigue is often undervalued, only a physically and mentally healthy athlete is able to achieve top performance in a discipline. Therefore, we argue that physical training programs should address the preventive assessment of both subjective and objective fatigue mechanisms in order to minimize the risk of injuries. In this context, our paper presents a machine-learning system capable of extracting individual fatigue descriptors (IFDs) from electromyographic (EMG) and heart rate variability (HRV) measurements. Our novel approach, using two types of biosignals so that a global (mental and physical) fatigue assessment is taken into account, reflects the onset of fatigue by implementing a combination of a dimensionless (0-1) global fatigue descriptor (GFD) and a support vector machine (SVM) classifier. The system, based on 9 main combined features, achieves fatigue regime classification performances of 0.82 ± 0.24, ensuring a successful preventive assessment when dangerous fatigue levels are reached. Training data were acquired in a constant work rate test (executed by 14 subjects using a cycloergometry device), where the variable under study (fatigue) gradually increased until the volunteer reached an objective exhaustion state.

| S-EPMC6969995 | biostudies-literature

Fusing dual-event data sets for Mycobacterium tuberculosis machine learning models and their evaluation.

Project description:The search for new tuberculosis treatments continues as we need to find molecules that can act more quickly, be accommodated in multidrug regimens, and overcome ever increasing levels of drug resistance. Multiple large scale phenotypic high-throughput screens against Mycobacterium tuberculosis (Mtb) have generated dose response data, enabling the generation of machine learning models. These models also incorporated cytotoxicity data and were recently validated with a large external data set. A cheminformatics data-fusion approach followed by Bayesian machine learning, Support Vector Machine, or Recursive Partitioning model development (based on publicly available Mtb screening data) was used to compare individual data sets and subsequent combined models. A set of 1924 commercially available molecules with promising antitubercular activity (and lack of relative cytotoxicity to Vero cells) were used to evaluate the predictive nature of the models. We demonstrate that combining three data sets incorporating antitubercular and cytotoxicity data in Vero cells from our previous screens results in external validation receiver operator curve (ROC) of 0.83 (Bayesian or RP Forest). Models that do not have the highest 5-fold cross-validation ROC scores can outperform other models in a test set dependent manner. We demonstrate with predictions for a recently published set of Mtb leads from GlaxoSmithKline that no single machine learning model may be enough to identify compounds of interest. Data set fusion represents a further useful strategy for machine learning construction as illustrated with Mtb. Coverage of chemistry and Mtb target spaces may also be limiting factors for the whole-cell screening data generated to date.

| S-EPMC3910492 | biostudies-literature

OmicsDI is part of the ELIXIR infrastructure

OmicsDI is an Elixir interoperability service. Learn more ›

Tweets

OmicsDI Databases

PRIDE
PeptideAtlas
MassIVE
JPOST Repository
Physiome Model Repository

EGA
EVA
ENA
LINCS
PAXDB
Cell Collective

MetaboLights
Metabolomics Workbench
MetabolomeExpress
GNPS
BioModels
FAIRDOMHub

ArrayExpress
dbGaP
ExpressionAtlas
GEO
NODE

Information

Databases
Help
API
Contact us
Code on GitHub
Terms of use
Submit Data