Dataset Information

TwiMed: Twitter and PubMed Comparable Corpus of Drugs, Diseases, Symptoms, and Their Relations.

ABSTRACT:

Background

Work on pharmacovigilance systems using texts from PubMed and Twitter typically target at different elements and use different annotation guidelines resulting in a scenario where there is no comparable set of documents from both Twitter and PubMed annotated in the same manner.

Objective

This study aimed to provide a comparable corpus of texts from PubMed and Twitter that can be used to study drug reports from these two sources of information, allowing researchers in the area of pharmacovigilance using natural language processing (NLP) to perform experiments to better understand the similarities and differences between drug reports in Twitter and PubMed.

Methods

We produced a corpus comprising 1000 tweets and 1000 PubMed sentences selected using the same strategy and annotated at entity level by the same experts (pharmacists) using the same set of guidelines.

Results

The resulting corpus, annotated by two pharmacists, comprises semantically correct annotations for a set of drugs, diseases, and symptoms. This corpus contains the annotations for 3144 entities, 2749 relations, and 5003 attributes.

Conclusions

We present a corpus that is unique in its characteristics as this is the first corpus for pharmacovigilance curated from Twitter messages and PubMed sentences using the same data selection and annotation strategies. We believe this corpus will be of particular interest for researchers willing to compare results from pharmacovigilance systems (eg, classifiers and named entity recognition systems) when using data from Twitter and from PubMed. We hope that given the comprehensive set of drug names and the annotated entities and relations, this corpus becomes a standard resource to compare results from different pharmacovigilance studies in the area of NLP.

SUBMITTER: Alvaro N

PROVIDER: S-EPMC5438461 | biostudies-literature | 2017 May

REPOSITORIES: biostudies-literature

ACCESS DATA

Publications

TwiMed: Twitter and PubMed Comparable Corpus of Drugs, Diseases, Symptoms, and Their Relations.

Alvaro Nestor N Miyao Yusuke Y Collier Nigel N

JMIR public health and surveillance 20170503 2

<h4>Background</h4>Work on pharmacovigilance systems using texts from PubMed and Twitter typically target at different elements and use different annotation guidelines resulting in a scenario where there is no comparable set of documents from both Twitter and PubMed annotated in the same manner.<h4>Objective</h4>This study aimed to provide a comparable corpus of texts from PubMed and Twitter that can be used to study drug reports from these two sources of information, allowing researchers in the ...[more]

PMID: 28468748

Similar Datasets

Project description:BackgroundDrug-induced suicide has been debated as a crucial issue in both clinical and public health research. Published research articles contain valuable data on the drugs associated with suicidal adverse events. An automated process that extracts such information and rapidly detects drugs related to suicide risk is essential but has not been well established. Moreover, few data sets are available for training and validating classification models on drug-induced suicide.ObjectiveThis study aimed to build a corpus of drug-suicide relations containing annotated entities for drugs, suicidal adverse events, and their relations. To confirm the effectiveness of the drug-suicide relation corpus, we evaluated the performance of a relation classification model using the corpus in conjunction with various embeddings.MethodsWe collected the abstracts and titles of research articles associated with drugs and suicide from PubMed and manually annotated them along with their relations at the sentence level (adverse drug events, treatment, suicide means, or miscellaneous). To reduce the manual annotation effort, we preliminarily selected sentences with a pretrained zero-shot classifier or sentences containing only drug and suicide keywords. We trained a relation classification model using various Bidirectional Encoder Representations from Transformer embeddings with the proposed corpus. We then compared the performances of the model with different Bidirectional Encoder Representations from Transformer-based embeddings and selected the most suitable embedding for our corpus.ResultsOur corpus comprised 11,894 sentences extracted from the titles and abstracts of the PubMed research articles. Each sentence was annotated with drug and suicide entities and the relationship between these 2 entities (adverse drug events, treatment, means, and miscellaneous). All of the tested relation classification models that were fine-tuned on the corpus accurately detected sentences of suicidal adverse events regardless of their pretrained type and data set properties.ConclusionsTo our knowledge, this is the first and most extensive corpus of drug-suicide relations.

Project description:BackgroundMany new medicines have been derived from natural sources such as plants, which have a long history of being used for disease treatment. Thus, their benefits and side effects have been studied, and plant-related information including plant and disease relations have been accumulated in Medline articles. Because numerous articles are available in Medline and are written in natural language, text-mining is important. However, a corpus of plant and disease relations is not available yet. Thus, we aimed to construct such a corpus.Methods and resultsIn this study, we designed and annotated a plant-disease relations corpus, and proposed a computational model to predict plant-disease relations using the corpus. We categorized plant and disease relations into four types: treatments of diseases, causes of diseases, associations, and negative relations. To construct a corpus of plant-disease relations, we first created its annotation guidelines and randomly selected 200 Medline abstracts. From these abstracts, we identified 1,405 and 1,755 plant and disease mentions, annotated to 105 and 237 unique plant and disease identifiers, respectively. When we selected sentences containing at least one plant and one disease mention, we extracted 878 plant and 1,077 disease entities, which finally generated a corpus of plant-disease relations including 1,309 relations from 199 abstracts. To verify the effectiveness of the corpus, we proposed a convolutional neural network model with the shortest dependency path (SDP-CNN) and applied it to the constructed corpus. The micro F-score with ten-fold cross-validation was found to be 0.764. We also applied the proposed SDP-CNN model to all Medline abstracts. When we measured its performance for 483 randomly selected plant-disease co-occurring sentences, the model showed a precision of 0.707.ConclusionThe plant-disease relations corpus is unique and represents an important resource for biomedical text-mining. The corpus of plant and disease relations is available at http://gcancer.org/pdr/.

Dataset Information

TwiMed: Twitter and PubMed Comparable Corpus of Drugs, Diseases, Symptoms, and Their Relations.

Background

Objective

Methods

Results

Conclusions

Publications

TwiMed: Twitter and PubMed Comparable Corpus of Drugs, Diseases, Symptoms, and Their Relations.

Similar Datasets

OmicsDI is part of the ELIXIR infrastructure

Tweets