Dataset Information

Leveraging heterogeneous data from GHS toxicity annotations, molecular and protein target descriptors and Tox21 assay readouts to predict and rationalise acute toxicity.

ABSTRACT: Despite the increasing knowledge in both the chemical and biological domains, the assimilation and exploration of heterogeneous datasets, encoding information about the chemical, bioactivity and phenotypic properties of compounds, remains a challenge due to the requirement for overlap between chemicals assayed across the spaces. Here, we have constructed a novel dataset, larger than we have used in prior work, comprising 579 acute oral toxic compounds and 1427 non-toxic compounds derived from regulatory GHS information, along with their corresponding molecular and protein target descriptors and qHTS in vitro assay readouts from the Tox21 project. We found no clear association between the results of a FAFDrugs4 toxicophore screen and the acute oral toxicity classifications for our compound set; and a screen using a subset of the ToxAlerts toxicophores was also of limited utility, with only slight enrichment toward the toxic set (odds ratio of 1.48). We then investigated to what degree toxic and non-toxic compounds could be separated in each of the spaces, to compare their potential contribution to further analyses. Using an LDA projection, we found the largest degree of separation using chemical descriptors (Cohen's d of 1.95) and the lowest degree of separation between toxicity classes using qHTS descriptors (Cohen's d of 0.67). To compare the predictivity of the feature spaces for the toxicity endpoint, we next trained Random Forest (RF) acute oral toxicity classifiers on either molecular, protein target or qHTS descriptors. RFs trained on molecular and protein target descriptors were most predictive, with ROC AUC values of 0.80-0.92 and 0.70-0.85, respectively, across three test sets. RFs trained on both chemical and protein target descriptors combined exhibited similar predictive performance to the single-domain models (ROC AUC of 0.80-0.91). Model interpretability was improved by the inclusion of protein target descriptors, which allow the identification of specific targets (e.g. Retinal dehydrogenase) with literature links to toxic modes of action (e.g. oxidative stress). The dataset compiled in this study has been made available for future application.
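The abstract reports three summary statistics: an odds ratio for toxicophore enrichment, Cohen's d for class separation, and ROC AUC for classifier performance. The sketch below shows one common way to compute each using only the Python standard library. It is illustrative only: the function names are hypothetical, the input numbers are made up and are not taken from the study, and the pooled-standard-deviation form of Cohen's d is an assumption, since the record does not state which variant the authors used.

```python
import math
from statistics import mean, variance


def odds_ratio(tox_hit, tox_miss, nontox_hit, nontox_miss):
    """Odds ratio from a 2x2 table of alert hits vs. toxicity class.

    Values above 1 mean the alert is enriched in the toxic set.
    """
    return (tox_hit * nontox_miss) / (tox_miss * nontox_hit)


def cohens_d(group_a, group_b):
    """Cohen's d using the pooled sample standard deviation (assumed variant)."""
    na, nb = len(group_a), len(group_b)
    pooled_var = (
        (na - 1) * variance(group_a) + (nb - 1) * variance(group_b)
    ) / (na + nb - 2)
    return (mean(group_a) - mean(group_b)) / math.sqrt(pooled_var)


def roc_auc(pos_scores, neg_scores):
    """ROC AUC as the probability that a random positive outranks a random
    negative, counting ties as half a win (Mann-Whitney formulation)."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


# Made-up toy inputs, NOT data from the study.
toy_or = odds_ratio(30, 70, 20, 80)
toy_d = cohens_d([2.0, 4.0, 6.0], [1.0, 3.0, 5.0])
toy_auc = roc_auc([0.9, 0.8, 0.4], [0.7, 0.3, 0.4])
print(round(toy_or, 3), round(toy_d, 3), round(toy_auc, 3))
```

In this reading, the reported odds ratio of 1.48 would correspond to a 2x2 table of alert hits against toxicity class, and the Cohen's d values to the one-dimensional LDA projections of the two classes. The pairwise AUC loop is quadratic in the number of compounds, which is acceptable at the scale of about two thousand compounds described here.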

SUBMITTER: Allen CHG

PROVIDER: S-EPMC6544914 | biostudies-literature | 2019 May

REPOSITORIES: biostudies-literature

Publications

Leveraging heterogeneous data from GHS toxicity annotations, molecular and protein target descriptors and Tox21 assay readouts to predict and rationalise acute toxicity.

Allen CHG, Mervin LH, Mahmoud SY, Bender A

Journal of Cheminformatics, 2019-05-31, issue 1

PMID: 31152262

Similar Datasets

Project description: Cytotoxicity is a commonly used in vitro endpoint for evaluating chemical toxicity. In support of the U.S. Tox21 screening program, the cytotoxicity of ~10K chemicals was interrogated at 0, 8, 16, 24, 32, & 40 hours of exposure in a concentration dependent fashion in two cell lines (HEK293, HepG2) using two multiplexed, real-time assay technologies. One technology measures the metabolic activity of cells (i.e., cell viability, glo) while the other evaluates cell membrane integrity (i.e., cell death, flor). Using glo technology, more actives and greater temporal variations were seen in HEK293 cells, while results for the flor technology were more similar across the two cell types. Chemicals were grouped into classes based on their cytotoxicity kinetics profiles and these classes were evaluated for their associations with activity in the Tox21 nuclear receptor and stress response pathway assays. Some pathways, such as the activation of H2AX, were associated with the fast-responding cytotoxicity classes, while others, such as activation of TP53, were associated with the slow-responding cytotoxicity classes. By clustering pathways based on their degree of association to the different cytotoxicity kinetics labels, we identified clusters of pathways where active chemicals presented similar kinetics of cytotoxicity. Such linkages could be due to shared underlying biological processes between pathways, for example, activation of H2AX and heat shock factor. Others involving nuclear receptor activity are likely due to shared chemical structures rather than pathway level interactions. Based on the linkage between androgen receptor antagonism and Nrf2 activity, we surmise that a subclass of androgen receptor antagonists cause cytotoxicity via oxidative stress that is associated with Nrf2 activation. In summary, the real-time cytotoxicity screen provides informative chemical cytotoxicity kinetics data related to their cytotoxicity mechanisms, and with our analysis, it is possible to formulate mechanism-based hypotheses on the cytotoxic properties of the tested chemicals.

Project description: Background: The number of corpora, collections of structured texts, has been increasing, as a result of the growing interest in the application of natural language processing methods to biological texts. Many named entity recognition (NER) systems have been developed based on these corpora. However, in the biomedical community, there is yet no general consensus regarding named entity annotation; thus, the resources are largely incompatible, and it is difficult to compare the performance of systems developed on resources that were divergently annotated. On the other hand, from a practical application perspective, it is desirable to utilize as many existing annotated resources as possible, because annotation is costly. Thus, it becomes a task of interest to integrate the heterogeneous annotations in these resources. Results: We explore the potential sources of incompatibility among gene and protein annotations that were made for three common corpora: GENIA, GENETAG and AIMed. To show the inconsistency in the corpora annotations, we first tackle the incompatibility problem caused by corpus integration, and we quantitatively measure the effect of this incompatibility on protein mention recognition. We find that the F-score performance declines tremendously when training with integrated data, instead of training with pure data; in some cases, the performance drops nearly 12%. This degradation may be caused by the newly added heterogeneous annotations, and cannot be fixed without an understanding of the heterogeneities that exist among the corpora. Motivated by the result of this preliminary experiment, we further qualitatively analyze a number of possible sources for these differences, and investigate the factors that would explain the inconsistencies, by performing a series of well-designed experiments. Our analyses indicate that incompatibilities in the gene/protein annotations exist mainly in the following four areas: the boundary annotation conventions, the scope of the entities of interest, the distribution of annotated entities, and the ratio of overlap between annotated entities. We further suggest that almost all of the incompatibilities can be prevented by properly considering the four aspects aforementioned. Conclusion: Our analysis covers the key similarities and dissimilarities that exist among the diverse gene/protein corpora. This paper serves to improve our understanding of the differences in the three studied corpora, which can then lead to a better understanding of the performance of protein recognizers that are based on the corpora.
