Dataset Information

Learning predictive models of drug side-effect relationships from distributed representations of literature-derived semantic predications.

ABSTRACT:

Objective

The aim of this work is to leverage relational information extracted from biomedical literature using a novel synthesis of unsupervised pretraining, representational composition, and supervised machine learning for drug safety monitoring.

Methods

Using ≈80 million concept-relationship-concept triples extracted from the literature using the SemRep Natural Language Processing system, distributed vector representations (embeddings) were generated for concepts as functions of their relationships utilizing two unsupervised representational approaches. Embeddings for drugs and side effects of interest from two widely used reference standards were then composed to generate embeddings of drug/side-effect pairs, which were used as input for supervised machine learning. This methodology was developed and evaluated using cross-validation strategies and compared to contemporary approaches. To qualitatively assess generalization, models trained on the Observational Medical Outcomes Partnership (OMOP) drug/side-effect reference set were evaluated against a list of ≈1100 drugs from an online database.

Results

The employed method improved performance over previous approaches. Cross-validation results advance the state of the art (AUC 0.96; F1 0.90 and AUC 0.95; F1 0.84 across the two sets), outperforming methods utilizing literature and/or spontaneous reporting system data. Examination of predictions for unseen drug/side-effect pairs indicates the ability of these methods to generalize, with over tenfold label support enrichment in the top 100 predictions versus the bottom 100 predictions.

Discussion and conclusion

Our methods can assist the pharmacovigilance process using information from the biomedical literature. Unsupervised pretraining generates a rich relationship-based representational foundation for machine learning techniques to classify drugs in the context of a putative side effect, given known examples.

SUBMITTER: Mower J

PROVIDER: S-EPMC6454491 | biostudies-literature | 2018 Oct

REPOSITORIES: biostudies-literature

ACCESS DATA

Publications

Learning predictive models of drug side-effect relationships from distributed representations of literature-derived semantic predications.

Mower Justin J Subramanian Devika D Cohen Trevor T

Journal of the American Medical Informatics Association : JAMIA 20181001 10

<h4>Objective</h4>The aim of this work is to leverage relational information extracted from biomedical literature using a novel synthesis of unsupervised pretraining, representational composition, and supervised machine learning for drug safety monitoring.<h4>Methods</h4>Using ≈80 million concept-relationship-concept triples extracted from the literature using the SemRep Natural Language Processing system, distributed vector representations (embeddings) were generated for concepts as functions o ...[more]

PMID: 30010902

Similar Datasets

Project description:IntroductionAs a result of the well documented limitations of data collected by spontaneous reporting systems (SRS), such as bias and under-reporting, a number of authors have evaluated the utility of other data sources for the purpose of pharmacovigilance, including the biomedical literature. Previous work has demonstrated the utility of literature-derived distributed representations (concept embeddings) with machine learning for the purpose of drug side-effect prediction. In terms of data sources, these methods are complementary, observing drug safety from two different perspectives (knowledge extracted from the literature and statistics from SRS data). However, the combined utility of these pharmacovigilance methods has yet to be evaluated.ObjectiveThis research investigates the utility of directly or indirectly combining an observational signal from SRS with literature-derived distributed representations into a single feature vector or in an ensemble approach for downstream machine learning (logistic regression).MethodsLeveraging a recently developed representation scheme, concept embeddings were generated from relational connections extracted from the literature and composed to represent drug and associated adverse reactions, as defined by two reference standards of positive (likely causal) and negative (no causal evidence) pairs. Embeddings were presented with and without common measures of observational signal from SRS sources to logistic regressors, and performance was evaluated with the receiver operating characteristic (ROC) area under the curve (AUC) metric.ResultsROC AUC performance with these composite models improves up to ≈ 20% over SRS-based disproportionality metrics alone and exceeds the best prior results reported in the literature when models leverage both sources of information.ConclusionsResults from this study support the hypothesis that knowledge extracted from the literature can enhance the performance of SRS-based methods (and vice versa). Across reference sets, using literature and SRS information together performed better than using either source alone, providing strong support for the complementary nature of these approaches to post-marketing drug surveillance.

Project description:Although there are many studies of drugs and their side effects, the underlying mechanisms of these side effects are not well understood. It is also difficult to understand the specific pathways between drugs and side effects. The present study seeks to construct putative paths between drugs and their side effects by applying text-mining techniques to free text of biomedical studies, and to develop ranking metrics that could identify the most-likely paths. We extracted three types of relationships-drug-protein, protein-protein, and protein⁻side effect-from biomedical texts by using text mining and predefined relation-extraction rules. Based on the extracted relationships, we constructed whole drug-protein⁻side effect paths. For each path, we calculated its ranking score by a new ranking function that combines corpus- and ontology-based semantic similarity as well as co-occurrence frequency. We extracted 13 plausible biomedical paths connecting drugs and their side effects from cancer-related abstracts in the PubMed database. The top 20 paths were examined, and the proposed ranking function outperformed the other methods tested, including co-occurrence, COALS, and UMLS by P@5-P@20. In addition, we confirmed that the paths are novel hypotheses that are worth investigating further. The risk of side effects has been an important issue for the US Food and Drug Administration (FDA). However, the causes and mechanisms of such side effects have not been fully elucidated. This study extends previous research on understanding drug side effects by using various techniques such as Named Entity Recognition (NER), Relation Extraction (RE), and semantic similarity. It is not easy to reveal the biomedical mechanisms of side effects due to a huge number of possible paths. However, we automatically generated predictable paths using the proposed approach, which could provide meaningful information to biomedical researchers to generate plausible hypotheses for the understanding of such mechanisms.

Dataset Information

Learning predictive models of drug side-effect relationships from distributed representations of literature-derived semantic predications.

Objective

Methods

Results

Discussion and conclusion

Publications

Learning predictive models of drug side-effect relationships from distributed representations of literature-derived semantic predications.

Similar Datasets

OmicsDI is part of the ELIXIR infrastructure

Tweets